All the methods in the previous chapters are value-based. The difference between value-based and policy-based methods lies in how the policy is represented. Value-based methods generate policies implicitly and indirectly: the algorithm itself does not maintain a policy function. Instead, it solves for the values (state or action values) by model-free or model-based methods and then determines the action greedily (or epsilon-greedily) by maximizing the value function at each step (e.g., $\arg\max_a Q(s,a)$), thereby deriving the policy from the values. In contrast, policy-based methods directly represent the policy as a parameterized function $\pi(a|s,\theta)$, where $\theta$ is a parameter vector (instead of the previous tabular representation). The probability distribution of the policy is obtained by directly optimizing the parameter $\theta$.
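As a concrete illustration of a parameterized policy $\pi(a|s,\theta)$, here is a minimal sketch of a tabular softmax parameterization; the shapes, state/action indexing, and function names are illustrative assumptions, not from the text:

```python
import numpy as np

def softmax_policy(theta, s):
    """Return pi(.|s, theta) as a probability vector over actions.

    theta is a |S| x |A| matrix of action preferences; this
    tabular-softmax form is one simple choice of parameterized
    policy pi(a|s, theta).
    """
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()

# Example: 3 states, 2 actions, all preferences zero -> uniform policy.
theta = np.zeros((3, 2))
probs = softmax_policy(theta, 0)        # [0.5, 0.5]
```

Optimizing $\theta$ then changes the action probabilities directly, rather than indirectly through a value table.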
In this chapter, we introduce policy gradient methods.
When represented as a table, a policy is defined as optimal if it maximizes every state value. When represented by a function, a policy is defined as optimal if it maximizes a certain scalar metric. We can then optimize the metric by gradient ascent to find the optimal parameters.
We have two types of metrics for defining optimal policies.
The first metric is the average state value, or simply the average value. It is defined as:
$$\bar{v}_\pi = \sum_{s\in\mathcal{S}} d(s)\, v_\pi(s),$$
where $d(s)$ is the weight of state $s$. It satisfies $d(s)\ge 0$ for any $s\in\mathcal{S}$ and $\sum_{s\in\mathcal{S}} d(s) = 1$. Therefore, we can interpret $d(s)$ as a probability distribution over $s$.
Then, the metric can be written as:
$$\bar{v}_\pi = \mathbb{E}_{S\sim d}[v_\pi(S)].$$
For the distribution $d$, we have two cases. The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that the distribution is independent of the policy. One choice is to treat all states as equally important and select $d_0(s) = 1/|\mathcal{S}|$. Another choice applies when we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design $d_0(s_0) = 1$ and $d_0(s) = 0$ for $s \neq s_0$.
The second case is that $d$ depends on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, the stationary distribution under $\pi$ discussed in the last chapter.
As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently an optimal $\theta$) that maximizes $\bar{v}_\pi$.
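To make the weighted average concrete, the following sketch computes $\bar{v}_\pi$ for the two policy-independent choices of $d_0$ mentioned above; the state values are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical state values v_pi(s) for a 3-state problem.
v_pi = np.array([10.0, 4.0, 1.0])

# Case 1: d0 uniform -- all states equally important.
d0_uniform = np.ones(3) / 3
v_bar_uniform = d0_uniform @ v_pi        # (10 + 4 + 1) / 3 = 5.0

# Case 2: d0 is a point mass on the start state s0 = 0.
d0_point = np.array([1.0, 0.0, 0.0])
v_bar_point = d0_point @ v_pi            # equals v_pi(s0) = 10.0
```

Which choice of $d_0$ is appropriate depends on whether all states matter or only the start state does.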
The average state value also has an equivalent expression:
$$J(\theta) = \lim_{n\to\infty} \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R_{t+1}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right].$$
We can understand this equation easily by noting that $v_\pi$ represents the expected future discounted cumulative return. The math behind this is:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right] = \sum_{s\in\mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s\right] = \sum_{s\in\mathcal{S}} d(s)\, v_\pi(s) = \bar{v}_\pi.$$
The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as
$$\bar{r}_\pi \doteq \sum_{s\in\mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S\sim d_\pi}[r_\pi(S)],$$
where $d_\pi$ is the stationary distribution and
$$r_\pi(s) \doteq \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, r(s,a) = \mathbb{E}_{A\sim\pi(s,\theta)}[r(s,A)\,|\,s]$$
is the expectation of the immediate rewards. Here, $r(s,a) \doteq \mathbb{E}[R|s,a] = \sum_r r\, p(r|s,a)$, which represents the expected reward obtained when taking action $a$ at state $s$.
There are also two important equivalent expressions of $\bar{r}_\pi$:
Suppose that the agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. A common metric that readers may often see in the literature is:
$$J(\theta) = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right].$$
Regarding $\bar{r}_\pi = \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]$: just as $v_\pi$ represents the expected future discounted cumulative return, $\bar{r}_\pi$ represents the average expected reward per step after the system reaches its steady state. Intuitively, this can be understood as follows. In the undiscounted case (i.e., $\gamma = 1$) or over an infinitely long horizon, if we directly compute $\sum_{t=0}^{\infty} R_{t+1}$, the value is likely to diverge to infinity (if the reward per step is positive). To judge whether a policy is good or bad over such a long-term or infinite horizon, we cannot compare two infinities. Therefore, we divide the total reward by the number of time steps $n$ and take the limit as $n\to\infty$; this computes the time average. According to the ergodic theorem of Markov chains, when $t$ is long enough, the frequency of state visits converges to the stationary distribution $d_\pi$. Therefore, the time average in the limit equals the spatial average (i.e., the expected reward of each state weighted by its stationary probability, $\sum_s d_\pi(s) r_\pi(s)$). This is the physical meaning of $\bar{r}_\pi$.
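The "time average equals spatial average" claim can be checked numerically. The sketch below simulates a small Markov chain (the transition matrix and rewards are invented for illustration) and compares the empirical per-step reward with $\sum_s d_\pi(s) r_\pi(s)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-state Markov chain induced by some fixed policy (invented numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 5.0])      # expected one-step reward r_pi(s) in each state

# Stationary distribution d_pi: left eigenvector of P with eigenvalue 1.
w, V = np.linalg.eig(P.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = d / d.sum()               # here d = [2/3, 1/3]

spatial_avg = d @ r           # sum_s d_pi(s) r_pi(s) = 7/3

# Time average from a long simulated trajectory.
s, total, n = 0, 0.0, 200_000
for _ in range(n):
    total += r[s]
    s = rng.choice(2, p=P[s])
time_avg = total / n          # converges to spatial_avg by ergodicity
```

The two averages agree up to simulation noise, matching the ergodic-theorem argument above.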
Similarly, the average reward $\bar{r}_\pi$ can also be written as the inner product of two vectors. In particular, let
$$r_\pi = [\ldots, r_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\ldots, d_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}.$$
Then, it is clear that:
$$\bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$
This expression will be useful when we derive its gradient.
It should be noted that the two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent. In the discounted case where $\gamma\in(0,1)$, it holds that:
$$\bar{r}_\pi = (1-\gamma)\,\bar{v}_\pi.$$
For the proof, note that $\bar{v}_\pi = d_\pi^T v_\pi$ and $\bar{r}_\pi = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying both sides of the Bellman equation by $d_\pi^T$ yields
$$\bar{v}_\pi = \bar{r}_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi,$$
which gives $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$.
Notice that for the stationary distribution, we have $d_\pi^T P_\pi = d_\pi^T$.
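A quick numerical sanity check of $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$, using an invented two-state chain: solve the Bellman equation for $v_\pi$, compute the stationary distribution $d_\pi$, and compare the two metrics.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # P_pi for some fixed policy (invented)
r = np.array([1.0, 5.0])          # r_pi (invented)

# Solve the Bellman equation v_pi = r_pi + gamma * P_pi v_pi.
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Stationary distribution d_pi: left eigenvector of P with eigenvalue 1.
w, V = np.linalg.eig(P.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = d / d.sum()

v_bar = d @ v                     # average state value
r_bar = d @ r                     # average reward
# r_bar equals (1 - gamma) * v_bar, as the identity predicts
```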
Given the metrics introduced in the last section, we can use gradient-based methods to maximize them. To do that, we need to first calculate the gradients of these metrics. The most important theoretical result in this chapter is the following theorem.
Theorem 9.1 (Policy gradient theorem).
The gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a), \tag{1}$$
where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.
In particular, $J(\theta)$ can be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$. Depending on the scenario, (1) may hold as a strict equality or as an approximation. The distribution $\eta$ also varies across scenarios.
Moreover, (1) has a compact form expressed in terms of expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta,\, A\sim\pi(S,\theta)}\left[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\right], \tag{2}$$
where $\ln$ is the natural logarithm. The proof is given below. Using $\nabla_\theta \pi = \pi\, \nabla_\theta \ln\pi$ and the definition of expectation, (1) can be rewritten as:
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln\pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}_{S\sim\eta,\, A\sim\pi(S,\theta)}\left[\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)\right].$$
This expectation form suggests the gradient-ascent algorithm
$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\left[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)\right], \tag{5}$$
where $\alpha>0$ is a constant learning rate. Since the true gradient in (5) is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:
$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t), \tag{6}$$
where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$. If $q_t(s_t,a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms. The algorithm in (6) is important since many other policy gradient algorithms can be obtained by extending it.
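To make update (6) concrete, here is a sketch of a single stochastic update for a tabular softmax policy, for which $\nabla_\theta \ln\pi(a|s,\theta)$ has the closed form $\mathbf{1}_a - \pi(\cdot|s,\theta)$ over the preferences of state $s$; the setup and names are toy assumptions:

```python
import numpy as np

def pi(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def policy_gradient_step(theta, s, a, q_est, alpha=0.1):
    """One stochastic update: theta <- theta + alpha * grad ln pi(a|s) * q.

    For a tabular softmax policy, the gradient of ln pi(a|s, theta)
    with respect to theta[s] is onehot(a) - pi(.|s, theta);
    q_est approximates q_pi(s, a).
    """
    probs = pi(theta, s)
    grad_ln = -probs
    grad_ln[a] += 1.0
    theta = theta.copy()
    theta[s] += alpha * grad_ln * q_est
    return theta

theta0 = np.zeros((1, 2))                 # one state, two actions
theta1 = policy_gradient_step(theta0, s=0, a=0, q_est=2.0)
p0 = pi(theta0, 0)[0]                     # 0.5 before the update
p1 = pi(theta1, 0)[0]                     # larger: positive q raises pi(a|s)
```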
Since $\nabla_\theta \ln\pi(a_t|s_t,\theta_t) = \dfrac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}$, we can rewrite (6) as:
$$\theta_{t+1} = \theta_t + \alpha\, \beta_t\, \nabla_\theta \pi(a_t|s_t,\theta_t), \quad \text{where } \beta_t \doteq \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}. \tag{7}$$
From this expression, if $\beta_t \ge 0$, we can see that the algorithm is trying to increase $\pi(a_t|s_t,\theta_t)$ so as to maximize $J$, since (7) follows the typical gradient-ascent form $\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta(\text{objective})$ with $\pi(a_t|s_t,\theta_t)$ as the objective. We can conclude that:
If $\beta_t \ge 0$, the probability of choosing $(s_t,a_t)$ is enhanced; that is, $\pi(a_t|s_t,\theta_{t+1}) \ge \pi(a_t|s_t,\theta_t)$. The greater $\beta_t$ is, the stronger the enhancement.
If $\beta_t < 0$, the probability of choosing $(s_t,a_t)$ decreases; that is, $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$.
We can prove this as follows. When $\theta_{t+1} - \theta_t$ is sufficiently small, it follows from the Taylor expansion that
$$\pi(a_t|s_t,\theta_{t+1}) \approx \pi(a_t|s_t,\theta_t) + \left(\nabla_\theta \pi(a_t|s_t,\theta_t)\right)^T (\theta_{t+1} - \theta_t) = \pi(a_t|s_t,\theta_t) + \alpha\, \beta_t\, \left\|\nabla_\theta \pi(a_t|s_t,\theta_t)\right\|^2,$$
where the last equality substitutes $\theta_{t+1} - \theta_t = \alpha\beta_t\nabla_\theta\pi(a_t|s_t,\theta_t)$ from (7). It is then clear that $\pi(a_t|s_t,\theta_{t+1}) \ge \pi(a_t|s_t,\theta_t)$ when $\beta_t \ge 0$ and $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$ when $\beta_t < 0$.
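This first-order argument can be verified numerically: for a softmax policy with invented parameters, apply update (7) once and compare the actual probability change with the predicted $\alpha\beta_t\|\nabla_\theta\pi\|^2$ (all numbers below are illustrative assumptions):

```python
import numpy as np

def pi_prob(theta, a):
    e = np.exp(theta - theta.max())
    return (e / e.sum())[a]

def grad_pi(theta, a, eps=1e-6):
    # Numerical gradient of pi(a | theta) with respect to theta.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (pi_prob(tp, a) - pi_prob(tm, a)) / (2 * eps)
    return g

theta = np.array([0.2, -0.1, 0.0])        # one state, three actions
a = 1
alpha, q = 0.05, 3.0                      # q_t(s_t, a_t) > 0  =>  beta_t > 0
beta = q / pi_prob(theta, a)
g = grad_pi(theta, a)
theta_next = theta + alpha * beta * g     # update (7)

increase = pi_prob(theta_next, a) - pi_prob(theta, a)
predicted = alpha * beta * (g @ g)        # alpha * beta_t * ||grad pi||^2
```

The actual increase matches the first-order prediction up to a small higher-order error.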
The algorithm can strike a balance between exploration and exploitation to a certain extent. The logic is as follows:
First, recall from above that if $\beta_t \ge 0$, the probability of choosing $(s_t,a_t)$ is enhanced.
For exploitation: because $\beta_t$ is proportional to the action value $q_t(s_t,a_t)$, high-value actions generate a larger $\beta_t$ and are thus more strongly amplified (their probabilities are magnified more).
For exploration: because $\beta_t$ is inversely proportional to the probability $\pi(a_t|s_t,\theta_t)$ of taking the action, when the policy happens to take a low-probability but positive-value action, the small denominator yields a large $\beta_t$, producing a drastic gradient update that rapidly increases the probability of this previously neglected action.
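A tiny numeric illustration of this balance, with made-up numbers: two actions with the same estimated value but very different current probabilities receive very different update strengths $\beta_t$:

```python
# beta_t = q_t(s_t, a_t) / pi(a_t | s_t, theta_t): the same value estimate
# yields very different update strengths depending on how likely the action was.
q = 2.0
pi_common = 0.50    # an action the current policy already favors
pi_rare = 0.01      # a rarely tried action that turned out valuable

beta_common = q / pi_common   # 4.0   -> mild reinforcement (exploitation)
beta_rare = q / pi_rare       # 200.0 -> drastic reinforcement (exploration)
```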
Moreover, since (6) uses samples to approximate the true gradient in (5), it is important to understand how the samples should be obtained.
How to sample S?
$S$ in the true gradient $\mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)]$ should obey the distribution $\eta$, which is either the stationary distribution $d_\pi$ or the discounted total probability distribution $\rho_\pi$. Either $d_\pi$ or $\rho_\pi$ represents the long-term behavior exhibited under $\pi$.
How to sample A?
$A$ in $\mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)]$ should obey the distribution $\pi(A|S,\theta_t)$. The ideal way to sample $A$ is to select $a_t$ following $\pi(a|s_t,\theta_t)$. Therefore, the policy gradient algorithm is on-policy. Unfortunately, the ideal ways of sampling $S$ and $A$ are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation of (6) is given in the following algorithm: an episode is first generated by following $\pi(\theta)$, and then $\theta$ is updated multiple times using every experience sample in the episode.
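The episode-based implementation described above can be sketched as follows, on an invented one-state, two-action toy problem (REINFORCE with a tabular softmax policy; the environment, horizon, and hyperparameters are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a single state, two actions, episodic horizon T.
# Action 0 yields reward +1, action 1 yields 0 (assumed dynamics).
T, gamma, alpha = 10, 0.9, 0.1
theta = np.zeros(2)                     # softmax preferences for the one state

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(500):
    # 1) Generate an episode by following pi(theta).
    actions = [rng.choice(2, p=pi(theta)) for _ in range(T)]
    rewards = [1.0 if a == 0 else 0.0 for a in actions]
    # 2) For every step t, estimate q_t by the Monte Carlo return,
    #    then apply update (6) once per experience sample.
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g      # discounted return from step t
        probs = pi(theta)
        grad_ln = -probs
        grad_ln[actions[t]] += 1.0      # grad of ln pi for a softmax policy
        theta = theta + alpha * grad_ln * g

final_probs = pi(theta)                 # policy concentrates on action 0
```

Each episode yields $T$ updates of $\theta$, which is the sample-efficient use of experience described above.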