Chapter 10 Actor-Critic Methods
In Chapter 8, we introduced value function approximation, which replaces tabular representations of state/action values with functions. Similarly, in Chapter 9 we used a function to represent the policy instead of a table and turned to policy-based methods. In this chapter, we combine the two: representing both the value and the policy with functions, thereby incorporating both policy-based and value-based methods.
Here, an “actor” refers to a policy update step. The reason that it is called an actor is that the actions are taken by following the policy. A “critic” refers to a value update step. It is called a critic because it criticizes the actor by evaluating the corresponding values. From another point of view, actor-critic methods are still policy gradient algorithms: they can be obtained by extending the policy gradient algorithms introduced in Chapter 9.
Q actor-critic (QAC)
As actor-critic methods are still policy gradient methods, they still need a metric to be optimized. Revisit the idea of policy gradient introduced in the last chapter: a scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$. The gradient-ascent algorithm maximizing $J(\theta)$ is
$$\theta_{t+1} = \theta_t + \alpha \mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S, A)\right]. \tag{1}$$
The stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t, a_t). \tag{2}$$
If $q_t(s_t, a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which was introduced in Chapter 9 (but without value function approximation).
If $q_t(s_t, a_t)$ is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods.

The critic corresponds to the value update step via the Sarsa algorithm. The action values are represented by a parameterized function $q(s, a, w)$. The actor corresponds to the policy update step in (2).
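As a concrete illustration, here is a minimal sketch of QAC on a hypothetical two-state MDP, with a tabular softmax actor and a tabular Sarsa critic. The MDP (transitions `P`, rewards `R`), the step sizes, and the iteration count are all made-up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma = 2, 0.9
P = np.array([[0, 1], [0, 0]])           # hypothetical transitions: P[s, a] -> next state
R = np.array([[0.0, 1.0], [2.0, 0.0]])   # hypothetical rewards: R[s, a]

theta = np.zeros((2, n_actions))  # actor parameters (softmax logits per state)
q = np.zeros((2, n_actions))      # critic: tabular q(s, a, w)
alpha_theta, alpha_w = 0.02, 0.1

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
a = rng.choice(n_actions, p=pi(s))
for t in range(20000):
    r, s_next = R[s, a], P[s, a]
    a_next = rng.choice(n_actions, p=pi(s_next))
    # Critic (Sarsa): move q(s, a) toward the TD target r + gamma * q(s', a')
    q[s, a] += alpha_w * (r + gamma * q[s_next, a_next] - q[s, a])
    # Actor (stochastic gradient ascent, eq. (2)): grad log pi(a|s) * q(s, a)
    g = -pi(s)
    g[a] += 1.0                   # grad of log softmax at the taken action
    theta[s] += alpha_theta * g * q[s, a]
    s, a = s_next, a_next
```

Under this toy MDP, the learned policy should come to prefer action 1 in state 0 and action 0 in state 1, since that loop repeatedly collects the rewards 1 and 2.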
Advantage actor-critic (A2C)
The core idea of this algorithm is to introduce a baseline to reduce estimation variance.
How can we do this? We first need to learn about the following property of the baseline:
$$\mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S, A)\right] = \mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A|S, \theta_t)\left(q_\pi(S, A) - b(S)\right)\right], \tag{3}$$
where the additional baseline $b(S)$ is a scalar function of $S$. For this equation to hold true, we only need to prove
$$\mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A|S, \theta_t)\, b(S)\right] = 0. \tag{4}$$
This equation is valid because
$$\begin{aligned} \mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A|S, \theta_t)\, b(S)\right] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \nabla_\theta \ln \pi(a|s, \theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) = \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta 1 = 0. \end{aligned}$$
Then, such an expression has the same expectation as before. Our next step is to choose a proper baseline $b(s)$ to reduce the approximation variance when we use samples to approximate the true gradient. Let
$$X(S, A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t)\left[q_\pi(S, A) - b(S)\right].$$
Then, the true gradient is $\mathbb{E}[X]$. Since we need to use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it would be favorable if the variance $\text{var}(X)$ is small. For example, if $\text{var}(X)$ is close to zero, then any sample can accurately approximate $\mathbb{E}[X]$. On the contrary, if $\text{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$. Although $\mathbb{E}[X]$ is invariant to the baseline, the variance is not. Our goal is to design a good baseline to minimize $\text{var}(X)$. In the REINFORCE and QAC algorithms, we set $b = 0$, which is not guaranteed to be a good baseline.
In fact, the optimal baseline that minimizes $\text{var}(X)$ is
$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2 q_\pi(s, A) \right]}{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2 \right]}, \quad s \in \mathcal{S}. \tag{5}$$
Although the baseline in (5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2$ is removed from (5), we obtain a suboptimal baseline that has a concise expression:
$$b^\dagger(s) = \mathbb{E}_{A \sim \pi}\left[q_\pi(s, A)\right] = v_\pi(s), \quad s \in \mathcal{S}. \tag{6}$$
This suboptimal baseline is indeed the state value!
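A quick numerical check of why the state-value baseline helps: for a single-state example with made-up action values and policy logits, the sample gradients with and without the baseline share the same mean (the true gradient), but the baseline shrinks the variance. All numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.array([1.0, 5.0, 10.0])             # hypothetical action values q(s, a) for one state
theta = np.array([0.1, 0.2, 0.3])          # hypothetical softmax policy logits
p = np.exp(theta) / np.exp(theta).sum()    # pi(a|s)
v = p @ q                                  # baseline: state value v(s) = E_pi[q(s, A)]

n = 100_000
a = rng.choice(3, size=n, p=p)             # sampled actions A ~ pi
G = -np.tile(p, (n, 1))                    # rows: grad log pi = (one-hot of a) - p
G[np.arange(n), a] += 1.0

X_plain = G * q[a][:, None]                # samples of X with b = 0
X_base = G * (q[a] - v)[:, None]           # samples of X with b = v(s)

print(X_plain.mean(axis=0), X_base.mean(axis=0))             # nearly identical means
print(X_plain.var(axis=0).sum(), X_base.var(axis=0).sum())   # smaller with the baseline
```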
When $b(s) = v_\pi(s)$, the gradient-ascent algorithm in (1) becomes
$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)[q_\pi(S, A) - v_\pi(S)] \right] \\ &\doteq \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)\,\delta_\pi(S, A) \right]. \end{aligned} \tag{7}$$
Here,
$$\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)$ is the mean of the action values. If $\delta_\pi(s, a) > 0$, it means that the corresponding action has a greater value than the mean value. The stochastic version of (7) is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\left[q_t(s_t, a_t) - v_t(s_t)\right], \tag{8}$$
where $s_t$ and $a_t$ are samples of $S$ and $A$ at time $t$. Here, $q_t(s_t, a_t)$ and $v_t(s_t)$ are approximations of $q_\pi(s_t, a_t)$ and $v_\pi(s_t)$, respectively.
The algorithm in (8) updates the policy based on the relative value $q_t(s_t, a_t) - v_t(s_t)$ rather than the absolute value $q_t(s_t, a_t)$. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others. This can be further interpreted by rewriting (8) as
$$\theta_{t+1} = \theta_t + \alpha \underbrace{\left(\frac{\delta_t(s_t, a_t)}{\pi(a_t|s_t, \theta_t)}\right)}_{\text{step size}} \nabla_\theta \pi(a_t|s_t, \theta_t).$$
The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$ as in Chapter 9, which is more reasonable. It can still balance exploration and exploitation: the step is large when $\delta_t$ is large (exploiting actions with high advantage) and when $\pi(a_t|s_t, \theta_t)$ is small (exploring actions with low probability).
If $q_t$ and $v_t$ are estimated by Monte Carlo learning, the algorithm in (8) is called REINFORCE with a baseline.
If $q_t$ and $v_t$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).
It should be noted that the advantage function in this implementation is approximated by the TD error:
$$\delta_t = q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t).$$
This approximation is reasonable because
$$q_\pi(s_t, a_t) - v_\pi(s_t) = \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \,\middle|\, S_t = s_t, A_t = a_t\right].$$
Thus, the actor update becomes
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\left[r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)\right].$$
This expression of the advantage using the TD error helps a lot. One merit is that we only need a single network to represent $v_\pi(s)$. Otherwise, if we used $\delta_t = q_t(s_t, a_t) - v_t(s_t)$, we would need to maintain two networks to represent $q_\pi(s, a)$ and $v_\pi(s)$, respectively. Another merit is that the TD error can be reused in both the actor and the critic.
When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy $\pi(a|s, \theta)$ is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as $\varepsilon$-greedy.
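The pieces above fit together into a short TD actor-critic (A2C) loop. The sketch below uses a made-up two-state MDP, a tabular softmax actor, and a tabular state-value critic updated by TD(0); the environment, step sizes, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma = 2, 0.9
P = np.array([[0, 1], [0, 0]])           # hypothetical transitions: P[s, a] -> next state
R = np.array([[0.0, 1.0], [2.0, 0.0]])   # hypothetical rewards: R[s, a]

theta = np.zeros((2, n_actions))  # actor (softmax logits per state)
v = np.zeros(2)                   # critic v(s, w)
alpha_theta, alpha_w = 0.02, 0.1

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(20000):
    a = rng.choice(n_actions, p=pi(s))
    r, s_next = R[s, a], P[s, a]
    delta = r + gamma * v[s_next] - v[s]  # TD error, used as the advantage estimate
    v[s] += alpha_w * delta               # critic: TD(0) value update
    g = -pi(s)
    g[a] += 1.0                           # grad of log pi(a|s) for a softmax actor
    theta[s] += alpha_theta * g * delta   # actor: advantage-weighted update, eq. (8)
    s = s_next
```

Note that the same TD error `delta` drives both the critic and the actor, and the stochastic softmax policy itself provides the exploration.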

Off-policy actor-critic
Importance sampling
Consider a random variable $X \in \mathcal{X}$. Suppose that $p_0(X)$ is a probability distribution and that our goal is to estimate $\mathbb{E}_{X \sim p_0}[X]$. Suppose that we have some i.i.d. samples $\{x_i\}_{i=1}^{n}$.
First, if the samples are generated by following $p_0$, then the average value $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ can be used to approximate $\mathbb{E}_{X \sim p_0}[X]$, because $\bar{x}$ is an unbiased estimate of $\mathbb{E}_{X \sim p_0}[X]$ and the estimation variance converges to zero as $n \to \infty$ (see the law of large numbers in Box 5.1 for more information).
Second, consider a new scenario where the samples are not generated by $p_0$. Instead, they are generated by another distribution $p_1$. Can we still use these samples to approximate $\mathbb{E}_{X \sim p_0}[X]$? The answer is yes. However, we can no longer use $\bar{x}$ to approximate it, since $\bar{x} \to \mathbb{E}_{X \sim p_1}[X]$ rather than $\mathbb{E}_{X \sim p_0}[X]$.
In the second scenario, $\mathbb{E}_{X \sim p_0}[X]$ can be approximated based on the importance sampling technique. In particular, $\mathbb{E}_{X \sim p_0}[X]$ satisfies
$$\mathbb{E}_{X \sim p_0}[X] = \sum_{x} p_0(x)\, x = \sum_{x} p_1(x) \frac{p_0(x)}{p_1(x)}\, x = \mathbb{E}_{X \sim p_1}\left[f(X)\right], \quad \text{where } f(x) \doteq \frac{p_0(x)}{p_1(x)}\, x. \tag{9}$$
Thus, estimating $\mathbb{E}_{X \sim p_0}[X]$ becomes the problem of estimating $\mathbb{E}_{X \sim p_1}[f(X)]$. Let
$$\bar{f} \doteq \frac{1}{n} \sum_{i=1}^{n} f(x_i), \quad \text{where } x_i \sim p_1.$$
Since $\bar{f}$ can effectively approximate $\mathbb{E}_{X \sim p_1}[f(X)]$, it then follows from (9) that
$$\mathbb{E}_{X \sim p_0}[X] = \mathbb{E}_{X \sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^{n} \frac{p_0(x_i)}{p_1(x_i)}\, x_i. \tag{10}$$
Equation (10) suggests that $\mathbb{E}_{X \sim p_0}[X]$ can be approximated by a weighted average of the samples $\{x_i\}$. Here, $p_0(x_i)/p_1(x_i)$ is called the importance weight. Why is it called importance sampling? Because we want to estimate $\mathbb{E}_{X \sim p_0}[X]$: if a sample $x_i$ has high probability under $p_0$ but low probability under $p_1$, it would appear often under $p_0$ but appears rarely among the current samples. Therefore, we should cherish this hard-won sample; that is, this sample is very important, and we give it a large weight.
The reason why we do not directly compute $\mathbb{E}_{X \sim p_0}[X]$ from its definition is that the expression of $p_0$ may be too complex to use, for example, when it is represented by a neural network. By contrast, (10) merely requires the values of $p_0(x_i)$ for some samples instead of the complex expression of $p_0$ and thus is much easier to implement in practice.
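A tiny numerical illustration with made-up distributions: $X$ takes values $\pm 1$, the target $p_0$ is uniform (so $\mathbb{E}_{p_0}[X] = 0$), and the behavior $p_1$ is skewed. The plain average converges to the wrong quantity, while the importance-weighted average recovers the right one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([-1.0, 1.0])      # support of X
p0 = np.array([0.5, 0.5])      # target distribution: E_{p0}[X] = 0
p1 = np.array([0.8, 0.2])      # behavior distribution: E_{p1}[X] = -0.6

idx = rng.choice(2, size=100_000, p=p1)   # i.i.d. samples drawn from p1

plain_avg = x[idx].mean()                 # converges to E_{p1}[X], not E_{p0}[X]
weights = p0[idx] / p1[idx]               # importance weights p0(x_i) / p1(x_i)
is_avg = (weights * x[idx]).mean()        # importance-sampling estimate of E_{p0}[X]

print(plain_avg, is_avg)
```

Only the values $p_0(x_i)$ and $p_1(x_i)$ at the sampled points are needed, not any closed-form expression for the distributions.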
You can see the illustrative example provided in the tutorial to better understand it.
The off-policy policy gradient theorem
Like the previous on-policy case, we need to derive the policy gradient in the off-policy case.
Suppose $\beta$ is the behavior policy that generates experience samples. Our aim is to use these samples to update a target policy $\pi(a|s, \theta)$ that can maximize the metric
$$J(\theta) = \mathbb{E}_{S \sim d_\beta}\left[v_\pi(S)\right] = \sum_{s \in \mathcal{S}} d_\beta(s)\, v_\pi(s),$$
where $d_\beta$ is the stationary distribution under the behavior policy $\beta$. Here we have our theorem:
Theorem 10.1 (Off-policy policy gradient theorem)
In the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho,\, A \sim \beta}\left[\frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S, \theta)\, q_\pi(S, A)\right],$$
where the state distribution $\rho$ is
$$\rho(s) \doteq \sum_{s' \in \mathcal{S}} d_\beta(s')\, \text{Pr}_\pi(s|s'), \quad s \in \mathcal{S},$$
where $\text{Pr}_\pi(s|s')$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
Compared with the on-policy case, the off-policy version adds the importance weight $\frac{\pi(A|S, \theta)}{\beta(A|S)}$, since the behavior policy is not the same as the target policy. For the proof, see the book.
The algorithm of off-policy actor-critic
Based on the theorem, we can similarly apply a baseline and the advantage function to get the algorithm
$$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\left[q_t(s_t, a_t) - v_t(s_t)\right],$$
and hence, with the advantage replaced by the TD error,
$$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\left[r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)\right].$$
The only differences are that the samples are generated by following the behavior policy $\beta$ and that an importance weight is added.
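The off-policy variant changes only a few lines relative to on-policy A2C. In the sketch below (a made-up two-state MDP with illustrative step sizes), actions come from a fixed uniform behavior policy, and both the critic and the actor updates are scaled by the importance weight $\pi/\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma = 2, 0.9
P = np.array([[0, 1], [0, 0]])           # hypothetical transitions: P[s, a] -> next state
R = np.array([[0.0, 1.0], [2.0, 0.0]])   # hypothetical rewards: R[s, a]
beta = np.full((2, n_actions), 0.5)      # fixed uniform behavior policy beta(a|s)

theta = np.zeros((2, n_actions))  # target policy parameters (softmax logits)
v = np.zeros(2)                   # critic v(s, w)
alpha_theta, alpha_w = 0.02, 0.1

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(20000):
    a = rng.choice(n_actions, p=beta[s])       # act with the behavior policy beta
    r, s_next = R[s, a], P[s, a]
    rho = pi(s)[a] / beta[s, a]                # importance weight pi(a|s) / beta(a|s)
    delta = r + gamma * v[s_next] - v[s]       # TD error
    v[s] += alpha_w * rho * delta              # off-policy critic update
    g = -pi(s)
    g[a] += 1.0                                # grad of log pi(a|s)
    theta[s] += alpha_theta * rho * g * delta  # off-policy actor update
    s = s_next
```

Even though all data comes from the uniform behavior policy, the weighted updates move the target policy toward the rewarding actions.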

Deterministic actor-critic (DPG)
Up to now, the policies used in the policy gradient methods have all been stochastic, since it is required that $\pi(a|s, \theta) > 0$ for every $(s, a)$ (a requirement of $\ln \pi(a|s, \theta)$). We can add a softmax function at the last layer of the neural network to achieve this. However, when the action space is continuous, we cannot enumerate the actions in a softmax layer; instead, the network can directly output a deterministic action value.
The deterministic policy is specifically denoted as
$$a = \mu(s, \theta),$$
where $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$. $\mu$ can be represented by, for example, a neural network with the input $s$, the output $a$, and the parameter $\theta$. We may write $\mu(s, \theta)$ in short as $\mu(s)$.
The policy gradient theorems introduced before are valid only for stochastic policies. If the policy must be deterministic, we must derive a new policy gradient theorem.
Deterministic policy gradient theorem
The gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta}\left[\nabla_\theta \mu(S) \left(\nabla_a q_\mu(S, a)\right)\big|_{a = \mu(S)}\right], \tag{11}$$
where $\eta$ is a distribution of the states. This theorem summarizes the results for deterministic policies.
Unlike the stochastic case, the gradient in the deterministic case shown in (11) does not involve the action random variable . As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy.
Based on the gradient given in the theorem, we can apply the gradient-ascent algorithm to maximize $J(\theta)$:
$$\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \eta}\left[\nabla_\theta \mu(S) \left(\nabla_a q_\mu(S, a)\right)\big|_{a = \mu(S)}\right].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t) \left(\nabla_a q(s_t, a, w_t)\right)\big|_{a = \mu(s_t)}.$$
It should be noted that this algorithm is off-policy since the behavior policy $\beta$ may be different from $\mu$. First, the actor is off-policy; we already explained the reason when presenting the theorem. Second, the critic is also off-policy. Special attention must be paid to why the critic is off-policy yet does not require the importance sampling technique. In particular, the experience sample required by the critic involves two policies: the first is the policy for generating the action taken at the current state, and the second is the policy for generating the action evaluated at the next state. The first policy is the behavior policy $\beta$, since its action is used to interact with the environment. The second policy must be $\mu$ because it is the policy that the critic aims to evaluate; hence, $\mu$ is the target policy. It should be noted that the action generated by $\mu$ at the next state is not used to interact with the environment in the next time step, so $\mu$ is not the behavior policy. Therefore, the critic is off-policy.

How should the function $q(s, a, w)$ be selected? The original research work [74] that proposed the deterministic policy gradient method adopted linear functions, where $q(s, a, w)$ is linear in a feature vector of $(s, a)$. It is currently popular to represent $q(s, a, w)$ using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method.
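The deterministic gradient update can be seen in a tiny one-dimensional example. Everything here is hypothetical: we assume a known critic $q(s, a) = -(a - 2s)^2$ (maximized at $a = 2s$), a linear deterministic policy $\mu(s) = \theta s$, and states drawn from an arbitrary exploratory distribution; no action sampling is needed, matching the off-policy nature of the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical known critic: q(s, a) = -(a - 2s)^2, so dq/da = -2(a - 2s)
def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)

theta = 0.0     # deterministic policy mu(s) = theta * s; optimum is theta = 2
alpha = 0.1
for t in range(1000):
    s = rng.uniform(-1.0, 1.0)              # states from an exploratory behavior policy
    a = theta * s                            # a = mu(s): the target policy's action
    grad_mu = s                              # d mu(s) / d theta for the linear policy
    theta += alpha * grad_mu * dq_da(s, a)   # theta += alpha * grad_mu * (dq/da at a=mu(s))

print(theta)  # approaches 2.0
```

Note that the update never samples an action from a distribution; only the gradient of $q$ with respect to the action at $a = \mu(s)$ is used.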

