
Chapter 10 Actor-Critic Methods

RyanLee_ljx · about 10 minutes · RL


In Chapter 8, we introduced value function approximation, which replaces the tabular representation of state/action values with a function. Similarly, in Chapter 9 we used a function to represent the policy instead of a table, turning to policy-based methods. In this chapter we combine the two: both the value and the policy are represented by functions, incorporating both policy-based and value-based methods.

Here, an "actor" refers to a policy update step; it is called an actor because the actions are taken by following the policy. A "critic" refers to a value update step; it is called a critic because it criticizes the actor by evaluating the corresponding values. From another point of view, actor-critic methods are still policy gradient algorithms: they can be obtained by extending the policy gradient algorithm introduced in Chapter 9.

Q actor-critic (QAC)

Since actor-critic is still a policy gradient method, it still needs a metric to optimize. Revisit the idea of policy gradient introduced in the last chapter: we have a scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$. The gradient-ascent algorithm maximizing $J(\theta)$ is

$$
\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\ &= \theta_t + \alpha \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A) \right] \end{aligned} \tag{1}
$$

The stochastic gradient-ascent algorithm is

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \color{blue}{q_t(s_t, a_t)}. \tag{2}
$$

  • If $q_t(s_t, a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which was already introduced in Chapter 9 (but without value function approximation).

  • If $q_t(s_t, a_t)$ is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods.

QAC algorithm pseudocode (figure)

The critic corresponds to the value update step via the Sarsa algorithm, where the action values are represented by a parameterized function $q(s, a, w)$. The actor corresponds to the policy update step in (2).
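As a concrete illustration, here is a minimal QAC sketch on a randomly generated toy MDP. Everything in it is an assumption made for illustration: the tabular softmax policy, the tabular critic standing in for $q(s, a, w)$, the transition/reward tables `P` and `R`, and the step sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
alpha_w, alpha_theta = 0.1, 0.05

# Toy deterministic MDP (hypothetical): next-state and reward tables
P = rng.integers(0, n_states, size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # actor parameters (tabular logits)
w = np.zeros((n_states, n_actions))      # critic parameters for q(s, a, w)

def pi(s, theta):
    """Softmax policy pi(.|s, theta)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
a = rng.choice(n_actions, p=pi(s, theta))
for t in range(5000):
    s_next, r = int(P[s, a]), R[s, a]
    a_next = rng.choice(n_actions, p=pi(s_next, theta))
    # Critic (Sarsa): move q(s_t, a_t, w) toward r + gamma * q(s_{t+1}, a_{t+1}, w)
    td = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * td
    # Actor: theta <- theta + alpha * grad log pi(a|s) * q(s, a, w), as in (2)
    grad_log = -pi(s, theta)
    grad_log[a] += 1.0
    theta[s] += alpha_theta * grad_log * w[s, a]
    s, a = s_next, a_next
```

Note how the two steps interlock: the critic evaluates the current policy with Sarsa, while the actor follows the stochastic gradient (2) using the critic's estimate in place of $q_\pi$.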

Advantage actor-critic (A2C)

The core idea of this algorithm is to introduce a baseline to reduce estimation variance.

How to do this? We first need to learn about the following property of the baseline:

$$
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A) \right] = \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) (q_\pi(S, A) - b(S)) \right], \tag{3}
$$

where the additional baseline $b(S)$ is a scalar function of $S$. For the equation to hold, we only need to prove that

$$
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) b(S) \right] = 0.
$$

This equation is valid because

$$
\begin{aligned} \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) b(S) \right] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \nabla_\theta \ln \pi(a|s, \theta_t) b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta_t) b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta 1 = 0. \end{aligned}
$$
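For a softmax policy this cancellation can be checked in closed form: $\nabla_\theta \ln \pi(a|s) = e_a - \pi(\cdot|s)$ with respect to the logits of state $s$, so the expectation over $A \sim \pi$ vanishes exactly, and multiplying by any scalar $b(s)$ keeps it zero. A small numerical check (the four-action softmax policy and random logits are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=4)        # arbitrary softmax-policy parameters at one state
p = np.exp(logits - logits.max())
p /= p.sum()

# For a softmax policy, grad_theta ln pi(a) = e_a - p, so
# E_{A~pi}[grad ln pi(A)] = sum_a p[a] (e_a - p) = p - p = 0,
# and multiplying by any scalar b(s) keeps the expectation zero.
grads = np.eye(4) - p              # row a holds grad ln pi(a)
expectation = p @ grads            # expectation over A ~ pi
print(expectation)                 # all entries ~ 0
```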

Such an expression therefore has the same expectation as our previous metric. The next step is to choose a proper $b(s)$ that reduces the approximation variance when we use samples to approximate the true gradient. Let:

$$
X(S, A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t)[q_\pi(S, A) - b(S)]. \tag{4}
$$

Then, the true gradient is $\mathbb{E}[X(S, A)]$. Since we need to use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it would be favorable if the variance $\text{var}(X)$ is small. For example, if $\text{var}(X)$ is close to zero, then any sample $x$ can accurately approximate $\mathbb{E}[X]$. On the contrary, if $\text{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$. Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\text{var}(X)$ is not. Our goal is to design a good baseline to minimize $\text{var}(X)$. In the REINFORCE and QAC algorithms, we set $b = 0$, which is not guaranteed to be a good baseline.

In fact, the optimal baseline that minimizes $\text{var}(X)$ is:

$$
b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2 q_\pi(s, A) \right]}{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2 \right]}, \quad s \in \mathcal{S}. \tag{5}
$$

Although the baseline in (5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2$ is removed from (5), we obtain a suboptimal baseline that has a concise expression:

$$
b^\dagger(s) = \mathbb{E}_{A \sim \pi}[q_\pi(s, A)] = v_\pi(s), \quad s \in \mathcal{S}. \tag{6}
$$

This suboptimal baseline is indeed the state value!

When $b(s) = v_\pi(s)$, the gradient-ascent algorithm in (1) becomes

$$
\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)[q_\pi(S, A) - v_\pi(S)] \right] \\ &\doteq \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)\delta_\pi(S, A) \right]. \end{aligned} \tag{7}
$$

Here,

$$
\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)
$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) q_\pi(s, a)$ is the mean of the action values, so if $\delta_\pi(s, a) > 0$, the corresponding action has a value greater than the mean. The stochastic version of (7) is

$$
\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)[q_t(s_t, a_t) - v_t(s_t)] \\ &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\delta_t(s_t, a_t), \end{aligned} \tag{8}
$$

where $s_t, a_t$ are samples of $S, A$ at time $t$. Here, $q_t(s_t, a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t, a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively.

The algorithm in (8) updates the policy based on the relative value of $q_t$ with respect to $v_t$ rather than the absolute value of $q_t$. This is intuitively reasonable because, when we select an action at a state, we only care about which action has the greatest value relative to the others. This can be further interpreted via:

$$
\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \delta_t(s_t, a_t) \\ &= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)} \delta_t(s_t, a_t) \\ &= \theta_t + \alpha \underbrace{\left( \frac{\color{blue}{\delta_t(s_t, a_t)}}{\pi(a_t|s_t, \theta_t)} \right)}_{\text{step size}} \nabla_\theta \pi(a_t|s_t, \theta_t) \end{aligned}
$$

The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$ used in Chapter 9, which is more reasonable, and it can still balance exploration and exploitation well.

  • If $q_t(s_t, a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (8) is called REINFORCE with a baseline.

  • If $q_t(s_t, a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).

It should be noted that the advantage function in this implementation is approximated by the TD error:

$$
q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t).
$$

This approximation is reasonable as:

$$
q_\pi(s_t, a_t) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,|\, S_t = s_t, A_t = a_t \right].
$$

Thus:

$$
q_\pi(s_t, a_t) - v_\pi(s_t) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \,|\, S_t = s_t, A_t = a_t \right].
$$

Using the TD error in this way helps a lot. One merit is that we only need a single neural network to represent $v_\pi(s)$. Otherwise, if $\delta_t = q_t(s_t, a_t) - v_t(s_t)$, we would need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s, a)$, respectively. Another merit is that the TD error can be reused in both the actor and the critic.

When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy $\pi(\theta_t)$ is stochastic and hence exploratory, so it can be used directly to generate experience samples without relying on techniques such as $\epsilon$-greedy.
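A minimal sketch of the resulting TD actor-critic loop, with a single tabular value function standing in for the $v(s, w)$ network; the toy MDP, tabular softmax actor, and step sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
alpha_v, alpha_theta = 0.1, 0.05

# Toy MDP (hypothetical next-state and reward tables)
P = rng.integers(0, n_states, size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # actor parameters (tabular logits)
v = np.zeros(n_states)                   # the single critic v(s, w)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(5000):
    a = rng.choice(n_actions, p=pi(s))          # pi itself explores, no eps-greedy
    s_next, r = int(P[s, a]), R[s, a]
    delta = r + gamma * v[s_next] - v[s]        # TD error = advantage estimate
    v[s] += alpha_v * delta                     # critic: TD(0) update of v
    grad_log = -pi(s)
    grad_log[a] += 1.0                          # grad of ln pi(a|s) w.r.t. theta[s]
    theta[s] += alpha_theta * delta * grad_log  # actor: update (8) with delta
    s = s_next
```

The same `delta` drives both updates, which is exactly the reuse mentioned above.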

A2C algorithm pseudocode (figure)

Off-policy actor-critic

Importance sampling

Consider a random variable $X \in \mathcal{X}$. Suppose that $p_0(X)$ is a probability distribution and that our goal is to estimate $\mathbb{E}_{X \sim p_0}[X]$ from some i.i.d. samples $\{x_i\}_{i=1}^n$.

  • First, if the samples $\{x_i\}_{i=1}^n$ are generated by following $p_0$, then the average $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ can be used to approximate $\mathbb{E}_{X \sim p_0}[X]$, because $\bar{x}$ is an unbiased estimate of $\mathbb{E}_{X \sim p_0}[X]$ and the estimation variance converges to zero as $n \to \infty$ (see the law of large numbers in Box 5.1 for more information).

  • Second, consider a new scenario where the samples $\{x_i\}_{i=1}^n$ are not generated by $p_0$ but by another distribution $p_1$. Can we still use these samples to approximate $\mathbb{E}_{X \sim p_0}[X]$? The answer is yes. However, we can no longer use $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, since $\bar{x} \approx \mathbb{E}_{X \sim p_1}[X]$ rather than $\mathbb{E}_{X \sim p_0}[X]$.

In the second scenario, $\mathbb{E}_{X \sim p_0}[X]$ can be approximated with the importance sampling technique. In particular, $\mathbb{E}_{X \sim p_0}[X]$ satisfies:

$$
\mathbb{E}_{X \sim p_0}[X] = \sum_{x \in \mathcal{X}} p_0(x)x = \sum_{x \in \mathcal{X}} p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)} x}_{f(x)} = \mathbb{E}_{X \sim p_1}[f(X)]. \tag{9}
$$

Thus, estimating $\mathbb{E}_{X \sim p_0}[X]$ becomes the problem of estimating $\mathbb{E}_{X \sim p_1}[f(X)]$. Let:

$$
\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f(x_i).
$$

Since $\bar{f}$ can effectively approximate $\mathbb{E}_{X \sim p_1}[f(X)]$, it follows from (9) that

$$
\mathbb{E}_{X \sim p_0}[X] = \mathbb{E}_{X \sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^n f(x_i) = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}} x_i. \tag{10}
$$

Equation (10) suggests that $\mathbb{E}_{X \sim p_0}[X]$ can be approximated by a weighted average of the $x_i$, where $\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight. Why is it called importance sampling? Because we care about $p_0$: if a sample $x_i$ has high probability under $p_0$ but low probability under $p_1$, it would appear often under $p_0$ yet appears only rarely among the samples drawn from $p_1$. We should therefore cherish this hard-won sample, that is, treat it as very important, and give it a large weight.

The reason why we do not directly use $p_0$ is that it may be too complex to sample from, for example when it is represented by a neural network. By contrast, (10) merely requires the values $p_0(x_i)$ at some samples instead of the full expression, and thus is much easier to implement in practice.

You can see the illustrative example provided in the tutorial to better understand it.
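Equation (10) can also be tried out numerically. In the sketch below (the two-point distributions are an illustrative assumption), $\mathbb{E}_{X \sim p_0}[X] = -0.6$, but the samples are drawn from $p_1$: the naive average approaches $\mathbb{E}_{X \sim p_1}[X] = 0$, while the importance-weighted average recovers $-0.6$.

```python
import numpy as np

rng = np.random.default_rng(0)

x_vals = np.array([-1.0, 1.0])
p0 = np.array([0.8, 0.2])   # target distribution, E_p0[X] = -0.6
p1 = np.array([0.5, 0.5])   # behavior distribution used to draw samples

idx = rng.choice(2, size=100_000, p=p1)
samples = x_vals[idx]

naive = samples.mean()                 # approximates E_p1[X] = 0, not E_p0[X]
weights = p0[idx] / p1[idx]            # importance weights p0(x)/p1(x)
weighted = (weights * samples).mean()  # approximates E_p0[X] = -0.6

print(naive, weighted)
```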

The off-policy policy gradient theorem

As in the on-policy case, we need to derive the policy gradient for the off-policy case.

Suppose $\beta$ is the behavior policy that generates experience samples. Our aim is to use these samples to update a target policy $\pi$ that maximizes the metric:

$$
J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s) v_\pi(s) = \mathbb{E}_{S \sim d_\beta}[v_\pi(S)],
$$

where $d_\beta$ is the stationary distribution under policy $\beta$. Here we have our theorem:

Theorem 10.1 (Off-policy policy gradient theorem)

In the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \underbrace{\frac{\pi(A|S, \theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S, \theta) q_\pi(S, A) \right], \tag{11}
$$

where the state distribution $\rho$ is

$$
\rho(s) \doteq \sum_{s' \in \mathcal{S}} d_\beta(s') \text{Pr}_\pi(s|s'), \quad s \in \mathcal{S},
$$

where $\text{Pr}_\pi(s|s') = \sum_{k=0}^\infty \gamma^k [P_\pi^k]_{s's} = [(I - \gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.

Compared with the on-policy case, the off-policy version adds the importance weight, since the behavior policy is not the same as the target policy. For the proof, see the book.

The algorithm of off-policy actor-critic

Based on the theorem, we can similarly apply a baseline with the advantage function to get the algorithm:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \delta_t(s_t, a_t)
$$

and hence:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \left( \frac{\delta_t(s_t, a_t)}{\beta(a_t|s_t)} \right) \nabla_\theta \pi(a_t|s_t, \theta_t)
$$

The only differences are that the states follow another (behavior) policy and that an importance weight is added.
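A single update step of this rule might look as follows for a softmax target policy (all numbers, including the behavior policy and the advantage estimate $\delta_t$, are made up for illustration):

```python
import numpy as np

# One off-policy actor update step with hypothetical numbers
theta_s = np.array([0.2, -0.1, 0.05])   # target-policy logits at state s_t
beta_s = np.array([0.5, 0.3, 0.2])      # behavior policy beta(.|s_t)
a_t, delta_t, alpha = 1, 0.7, 0.1       # sampled action, advantage estimate, step size

p = np.exp(theta_s - theta_s.max())
p /= p.sum()                            # pi(.|s_t, theta_t) via softmax
rho = p[a_t] / beta_s[a_t]              # importance weight pi/beta
grad_log = -p.copy()
grad_log[a_t] += 1.0                    # grad of ln pi(a_t|s_t, theta_t)
theta_s = theta_s + alpha * rho * grad_log * delta_t
print(theta_s)
```

With a positive advantage, the logit of the sampled action increases, scaled by how much more (or less) likely $\pi$ is than $\beta$ to pick it.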

Off-policy actor-critic based on importance sampling (figure)

Deterministic actor-critic (DPG)

Up to now, the policies used in the policy gradient methods are all stochastic, since it is required that $\pi(a|s, \theta) > 0$ for every $(s, a)$ (a requirement of the $\ln$). We can add a softmax function at the last layer of the neural network to achieve this for finite action sets. However, when the action space is continuous, the network instead outputs a deterministic action.

The deterministic policy is specifically denoted as:

$$
a = \mu(s, \theta) \doteq \mu(s)
$$

$\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$. It can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$. We may write $\mu(s, \theta)$ in short as $\mu(s)$.

The policy gradient theorems introduced before are only valid for stochastic policies. If the policy must be deterministic, we have to derive a new policy gradient theorem.

Deterministic policy gradient theorem

The gradient of $J(\theta)$ is:

$$
\begin{aligned} \nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} \eta(s) \nabla_\theta \mu(s) (\nabla_a q_\mu(s, a))|_{a=\mu(s)} \\ &= \mathbb{E}_{S \sim \eta} \left[ \nabla_\theta \mu(S) (\nabla_a q_\mu(S, a))|_{a=\mu(S)} \right], \end{aligned} \tag{12}
$$

where $\eta$ is a distribution of the states. This theorem summarizes the results for deterministic policies.

Unlike the stochastic case, the gradient in the deterministic case shown in (12) does not involve the action random variable $A$. As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy.

Based on the gradient given in the theorem, we can apply the gradient-ascent algorithm to maximize $J(\theta)$:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \eta} \left[ \nabla_\theta \mu(S) (\nabla_a q_\mu(S, a))|_{a=\mu(S)} \right].
$$

The corresponding stochastic gradient-ascent algorithm is

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t) (\nabla_a q_\mu(s_t, a))|_{a=\mu(s_t)}.
$$

It should be noted that this algorithm is off-policy, since the behavior policy $\beta$ may be different from $\mu$. First, the actor is off-policy; we already explained the reason when presenting the theorem. Second, the critic is also off-policy. Special attention must be paid to why the critic is off-policy yet does not require the importance sampling technique. In particular, the experience sample required by the critic is $(s_t, a_t, r_{t+1}, s_{t+1}, \tilde{a}_{t+1})$, where $\tilde{a}_{t+1} = \mu(s_{t+1})$. The generation of this sample involves two policies: the policy that generates $a_t$ at $s_t$, and the policy that generates $\tilde{a}_{t+1}$ at $s_{t+1}$. The first is the behavior policy, since $a_t$ is used to interact with the environment. The second must be $\mu$, because it is the policy that the critic aims to evaluate; hence, $\mu$ is the target policy. Note that $\tilde{a}_{t+1}$ is not used to interact with the environment at the next time step, so $\mu$ is not the behavior policy. Therefore, the critic is off-policy.

How to select the function $q(s, a, w)$? The original research work [74] that proposed the deterministic policy gradient method adopted linear functions: $q(s, a, w) = \phi^T(s, a) w$, where $\phi(s, a)$ is the feature vector. It is currently popular to represent $q(s, a, w)$ using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method.
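As a toy illustration of the actor update, consider a one-dimensional state and action with a linear policy $\mu(s) = \theta s$ and a concave critic $q(s, a) = -(a - a^*(s))^2$ whose action gradient is known analytically. All of these choices are assumptions made for the sketch, not part of the original algorithm:

```python
import numpy as np

theta, alpha, s_t = 0.5, 0.1, 2.0
a_star = lambda s: 1.5 * s                   # action maximizing q at state s

mu = lambda s, th: th * s                    # deterministic policy mu(s, theta)
dq_da = lambda s, a: -2.0 * (a - a_star(s))  # grad_a q(s, a) for q = -(a - a*)^2
dmu_dtheta = lambda s: s                     # grad_theta mu(s)

# Repeated stochastic gradient-ascent steps at the sampled state s_t
for _ in range(200):
    a = mu(s_t, theta)
    theta += alpha * dmu_dtheta(s_t) * dq_da(s_t, a)

print(theta)  # converges toward 1.5, so that mu(s) matches a_star(s)
```

The chain rule structure $\nabla_\theta \mu \cdot \nabla_a q$ is visible directly: no action is ever sampled, only the gradient of the critic at $a = \mu(s_t)$ is used.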

DPG algorithm pseudocode (figure)