
Chapter 9 Policy Gradient Methods

RyanLee_ljx · about 8 min · RL


All the methods in previous chapters are value-based methods. The difference between value-based and policy-based methods lies in their approach. Value-based methods generate policies implicitly and indirectly: the algorithm does not directly maintain a policy function. Instead, it solves for the value (state or action value) by model-free or model-based methods, and at each step determines the action greedily (or ε-greedily) by maximizing the value function (e.g., $\arg\max_a Q(s, a)$), thereby deriving the policy from the value. In contrast, policy-based methods directly represent the policy as a parameterized function $\pi(a|s, \theta)$, where $\theta$ is a parameter vector (instead of the previous tabular representation). The probability distribution of the policy is obtained by directly optimizing the parameter $\theta$.

In this chapter, we will introduce the policy gradient methods.

Metrics

When represented as a table, a policy is defined as optimal if it maximizes every state value. When represented by a function, a policy is defined as optimal if it maximizes certain scalar metrics. We can optimize these metrics by gradient ascent to find the optimal parameters.

We have two types of metrics for defining optimal policies.

Average state value

The first metric is the average state value or simply called the average value. It is defined as:

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s) v_\pi(s),$$

where $d(s)$ is the weight of state $s$. It satisfies $d(s) \geq 0$ for any $s \in \mathcal{S}$ and $\sum_{s \in \mathcal{S}} d(s) = 1$. Therefore, we can interpret $d$ as a probability distribution over the states.

Then, the metric can be written as:

$$\bar{v}_\pi = \mathbb{E}_{S \sim d}[v_\pi(S)].$$

For the distribution $d$, we have two cases. The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that **the distribution is independent of the policy**. One choice is to treat all states as equally important and select $d_0(s) = 1/|\mathcal{S}|$. Another choice applies when we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$); in this case, we can design $d_0(s_0) = 1$ and $d_0(s \neq s_0) = 0$.

The second case is that $d$ depends on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, the stationary distribution under $\pi$ discussed in the last chapter.

As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently an optimal $\theta$) that maximizes $\bar{v}_\pi$.

There are also two equivalent expressions of the average state value:

$$J(\theta) = \lim_{n \to \infty} \mathbb{E} \left[ \sum_{t=0}^{n} \gamma^t R_{t+1} \right] = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right].$$

We can understand this equation easily by noting that $\bar{v}_\pi$ represents the "expected future cumulative discounted return". The math behind this is:

$$\begin{aligned} \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right] &= \sum_{s \in \mathcal{S}} d(s) \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s \right] \\ &= \sum_{s \in \mathcal{S}} d(s) v_\pi(s) \\ &= \bar{v}_\pi. \end{aligned}$$

The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors:

$$\begin{aligned} v_\pi &= [\dots, v_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \\ d &= [\dots, d(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}. \end{aligned}$$

Then, we have

$$\bar{v}_\pi = d^T v_\pi.$$

This expression will be useful when we analyze its gradient.
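As a quick numeric sketch of this inner product (a hypothetical 3-state example with made-up values for $d$ and $v_\pi$):

```python
import numpy as np

# Hypothetical 3-state example: state values and a weight distribution d.
v_pi = np.array([1.0, 2.0, 4.0])   # v_pi(s) for each state
d = np.array([0.2, 0.3, 0.5])      # d(s): non-negative, sums to 1

# Average state value as an inner product: v_bar = d^T v_pi
v_bar = d @ v_pi
print(v_bar)  # 0.2*1 + 0.3*2 + 0.5*4 = 2.8
```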

Average reward

The second metric is the average one-step reward or simply called the average reward. In particular, it is defined as

$$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}_{S \sim d_\pi}[r_\pi(S)],$$

where $d_\pi$ is the stationary distribution and

$$r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s, \theta) r(s, a) = \mathbb{E}_{A \sim \pi(s, \theta)}[r(s, A) \mid s]$$

is the expectation of the immediate rewards. Here, $r(s, a) \doteq \mathbb{E}[R \mid s, a] = \sum_{r} r\, p(r \mid s, a)$, which represents the expected reward of taking action $a$ at state $s$.

There are also two important equivalent expressions of $\bar{r}_\pi$:

Suppose that the agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. A common metric that readers may often see in the literature is:

$$J(\theta) = \lim_{n \to \infty} \frac{1}{n} \mathbb{E} \left[ \sum_{t=0}^{n-1} R_{t+1} \right].$$

Just as $\bar{v}_\pi$ represents the "expected future cumulative discounted return", $\bar{r}_\pi = \lim_{n \to \infty} \frac{1}{n} \mathbb{E}[\sum_{t=0}^{n-1} R_{t+1}]$ can be understood as the "average expected reward per step after the system reaches its steady state". Intuitively: in the undiscounted case ($\gamma = 1$) or over infinitely long horizons, directly computing $\sum_{t=0}^{\infty} R_{t+1}$ is likely to diverge to infinity (if the reward per step is positive). To judge whether a policy is good or bad over such horizons, we cannot compare two infinities. Therefore, we divide the total reward by the number of time steps $n$ and take the limit $n \to \infty$; this computes a time average. By the ergodic theorem for Markov chains, when $t$ is large enough the visit frequency of each state converges to the stationary distribution $d_\pi$. Hence the "time average" in the limit equals the "spatial average" $\sum_{s} d_\pi(s) r_\pi(s)$, i.e., the expected reward of each state weighted by its stationary probability. This is the physical meaning of $\bar{r}_\pi$.

Similarly, the average reward $\bar{r}_\pi$ can also be written as the inner product of two vectors. In particular, let

$$\begin{aligned} r_\pi &= [\dots, r_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \\ d_\pi &= [\dots, d_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}. \end{aligned}$$

Then, it is clear that:

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = d_\pi^T r_\pi.$$

This expression will be useful when we derive its gradient.
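A small numeric sketch (a hypothetical 2-state chain with made-up $P_\pi$ and $r_\pi$) shows how $d_\pi$ and $\bar{r}_\pi$ can be computed; the stationary distribution is the left eigenvector of $P_\pi$ for eigenvalue 1:

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy pi (made-up numbers).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])      # state transition matrix under pi
r_pi = np.array([1.0, 0.0])        # expected one-step reward per state

# Stationary distribution d_pi satisfies d_pi^T P_pi = d_pi^T, i.e. it is
# the left eigenvector of P_pi associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_pi = d_pi / d_pi.sum()           # normalize to a probability vector

# Average one-step reward: r_bar = d_pi^T r_pi
r_bar = d_pi @ r_pi
print(d_pi, r_bar)  # d_pi = [2/3, 1/3], r_bar = 2/3
```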

It should be noticed that $\bar{v}_\pi(\theta)$ and $\bar{r}_\pi(\theta)$ are equivalent metrics. In the discounted case where $\gamma \in (0, 1)$, it holds that:

$$\bar{r}_\pi = (1 - \gamma) \bar{v}_\pi.$$

For the proof, note that $\bar{v}_\pi(\theta) = d_\pi^T v_\pi$ and $\bar{r}_\pi(\theta) = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying both sides of the Bellman equation by $d_\pi^T$ yields

$$\bar{v}_\pi = \bar{r}_\pi + \gamma d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma d_\pi^T v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi.$$

Here we used the fact that, for the stationary distribution, $d_\pi^T P_\pi = d_\pi^T$. Rearranging $\bar{v}_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi$ gives $\bar{r}_\pi = (1 - \gamma)\bar{v}_\pi$.
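This identity can be checked numerically on a toy example (a hypothetical 2-state chain with made-up numbers): solve the Bellman equation for $v_\pi$, compute the stationary distribution, and compare $d_\pi^T r_\pi$ with $(1-\gamma)\, d_\pi^T v_\pi$:

```python
import numpy as np

# Numeric check of r_bar = (1 - gamma) * v_bar with d = d_pi
# on a hypothetical 2-state chain (made-up numbers).
gamma = 0.9
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])

# v_pi solves the Bellman equation v = r + gamma P v
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_pi = d_pi / d_pi.sum()

r_bar = d_pi @ r_pi
v_bar = d_pi @ v_pi
print(r_bar, (1 - gamma) * v_bar)  # the two values coincide
```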

Gradients of the metrics

Given the metrics introduced in the last section, we can use gradient-based methods to maximize them. To do that, we need to first calculate the gradients of these metrics. The most important theoretical result in this chapter is the following theorem.

Theorem 9.1 (Policy gradient theorem).

The gradient of $J(\theta)$ is

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta) q_\pi(s, a), \tag{1}$$

where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.

In particular, $J(\theta)$ could be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$. The equality in (1) may be a strict equality or an approximation, and the distribution $\eta$ also varies in different scenarios.

Moreover, (1) has a compact form expressed in terms of expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)} \left[ \nabla_\theta \ln \pi(A|S, \theta) q_\pi(S, A) \right], \tag{2}$$

where $\ln$ is the natural logarithm. The proof is given below. By the definition of expectation, (1) can be rewritten as:

$$\begin{aligned} \nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s, \theta) q_\pi(s, a) \\ &= \mathbb{E}_{S \sim \eta} \left[ \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|S, \theta) q_\pi(S, a) \right]. \end{aligned} \tag{3}$$

Furthermore, the gradient of $\ln \pi(a|s, \theta)$ is:

$$\nabla_\theta \ln \pi(a|s, \theta) = \frac{\nabla_\theta \pi(a|s, \theta)}{\pi(a|s, \theta)}.$$

It follows that:

$$\nabla_\theta \pi(a|s, \theta) = \pi(a|s, \theta) \nabla_\theta \ln \pi(a|s, \theta). \tag{4}$$

This is the log-derivative trick. Substituting (4) into (3) gives:

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E} \left[ \sum_{a \in \mathcal{A}} \pi(a|S, \theta) \nabla_\theta \ln \pi(a|S, \theta) q_\pi(S, a) \right] \\ &= \mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)} \left[ \nabla_\theta \ln \pi(A|S, \theta) q_\pi(S, A) \right]. \end{aligned}$$
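The log-derivative trick in (4) can be verified numerically for a softmax policy (a minimal, hypothetical parameterization with one preference $\theta_j$ per action):

```python
import numpy as np

# Softmax policy over 3 actions at a single state, with one preference
# theta_j per action (a hypothetical minimal parameterization).
theta = np.array([0.5, -0.2, 0.1])

def pi(theta):
    e = np.exp(theta - theta.max())  # shift for numerical stability
    return e / e.sum()

def grad_ln_pi(theta, a):
    # For softmax: d ln pi(a) / d theta_j = 1{j = a} - pi(j)
    g = -pi(theta)
    g[a] += 1.0
    return g

# Verify the trick: grad pi(a) = pi(a) * grad ln pi(a)
a, eps = 0, 1e-6
num_grad = np.zeros(3)
for j in range(3):
    t = theta.copy()
    t[j] += eps
    num_grad[j] = (pi(t)[a] - pi(theta)[a]) / eps  # finite difference
analytic = pi(theta)[a] * grad_ln_pi(theta, a)
print(np.allclose(num_grad, analytic, atol=1e-4))  # True
```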

Monte Carlo policy gradient (REINFORCE)

With the gradient presented above, the gradient-ascent algorithm for maximizing $J(\theta)$ is:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\ &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A) \right], \end{aligned} \tag{5}$$

where $\alpha > 0$ is a constant learning rate. Since the true gradient in (5) is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) q_t(s_t, a_t), \tag{6}$$

where $q_t(s_t, a_t)$ is an approximation of $q_\pi(s_t, a_t)$. If $q_t(s_t, a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE or Monte Carlo policy gradient, one of the earliest and simplest policy gradient algorithms. The algorithm in (6) is important since many other policy gradient algorithms can be obtained by extending it.

Since $\nabla_\theta \ln \pi(a_t|s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)}$, we can rewrite (6) as:

$$\theta_{t+1} = \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t|s_t, \theta_t)} \right)}_{\beta_t} \nabla_\theta \pi(a_t|s_t, \theta_t),$$

which can be further written concisely as:

$$\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t, \theta_t). \tag{7}$$

We have two interpretations of REINFORCE:

  1. From this expression, if $\beta_t \geq 0$, the algorithm tries to increase $\pi(a_t|s_t, \theta_t)$ in order to maximize $J$, since (7) has the typical gradient-ascent form $\theta_{t+1} = \theta_t + \alpha \nabla_\theta (\text{objective})$. We can conclude that:
  • If $\beta_t \geq 0$, the probability of choosing $(s_t, a_t)$ is enhanced, i.e., $\pi(a_t|s_t, \theta_{t+1}) \geq \pi(a_t|s_t, \theta_t)$. The greater $\beta_t$ is, the stronger the enhancement.

  • If $\beta_t < 0$, the probability of choosing $(s_t, a_t)$ decreases, i.e., $\pi(a_t|s_t, \theta_{t+1}) < \pi(a_t|s_t, \theta_t)$.

We can prove this as follows. When $\theta_{t+1} - \theta_t$ is sufficiently small, it follows from the Taylor expansion that

$$\begin{aligned} \pi(a_t|s_t, \theta_{t+1}) &\approx \pi(a_t|s_t, \theta_t) + (\nabla_\theta \pi(a_t|s_t, \theta_t))^T (\theta_{t+1} - \theta_t) \\ &= \pi(a_t|s_t, \theta_t) + \alpha \beta_t (\nabla_\theta \pi(a_t|s_t, \theta_t))^T (\nabla_\theta \pi(a_t|s_t, \theta_t)) \quad \text{(substituting (7))} \\ &= \pi(a_t|s_t, \theta_t) + \alpha \beta_t \| \nabla_\theta \pi(a_t|s_t, \theta_t) \|_2^2. \end{aligned}$$

It is clear that $\pi(a_t|s_t, \theta_{t+1}) \geq \pi(a_t|s_t, \theta_t)$ when $\beta_t \geq 0$, and $\pi(a_t|s_t, \theta_{t+1}) < \pi(a_t|s_t, \theta_t)$ when $\beta_t < 0$.
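This sign behavior can be illustrated numerically with a softmax policy (a hypothetical minimal parameterization): applying update (7) with a positive $\beta_t$ increases $\pi(a_t|s_t, \theta)$, while a negative $\beta_t$ decreases it:

```python
import numpy as np

# Softmax policy over 3 actions (hypothetical minimal parameterization).
theta = np.array([0.5, -0.2, 0.1])

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_pi(theta, a):
    # For softmax: d pi(a) / d theta_j = pi(a) * (1{j = a} - pi(j))
    p = pi(theta)
    g = -p[a] * p
    g[a] += p[a]
    return g

# One step of update (7) with beta_t = +2 and beta_t = -2
a, alpha = 1, 0.1
p_old = pi(theta)[a]
up = pi(theta + alpha * 2.0 * grad_pi(theta, a))[a]       # beta_t > 0
down = pi(theta + alpha * (-2.0) * grad_pi(theta, a))[a]  # beta_t < 0
print(up > p_old, down < p_old)  # True True
```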

  2. The algorithm can strike a balance between exploration and exploitation to a certain extent. The logic is as follows:

As mentioned in interpretation 1, recall that if $\beta_t \geq 0$, the probability of choosing $(s_t, a_t)$ is enhanced.

  • For exploitation: because $\beta_t$ is proportional to the action value $q_t(s_t, a_t)$, high-value actions generate a larger $\beta_t$ and are thus more strongly amplified (their probabilities are magnified more).

  • For exploration: because $\beta_t$ is inversely proportional to the probability $\pi(a_t|s_t, \theta_t)$ of taking that action, when the model happens to perform a low-probability but high-value action, the small denominator yields a large $\beta_t$, producing a strong gradient update that rapidly increases the probability of this previously neglected action.

Moreover, since (6) uses samples to approximate the true gradient in (5), it is important to understand how the samples should be obtained.

  1. How to sample $S$?

$S$ in the true gradient $\mathbb{E}[\nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A)]$ should obey the distribution $\eta$, which is either the stationary distribution $d_\pi$ or the discounted total probability distribution $\rho_\pi$. Either $d_\pi$ or $\rho_\pi$ represents the long-term behavior exhibited under $\pi$.

  2. How to sample $A$?

$A$ in $\mathbb{E}[\nabla_\theta \ln \pi(A|S, \theta_t) q_\pi(S, A)]$ should obey the distribution $\pi(A|S, \theta_t)$. The ideal way to sample $A$ is to select $a_t$ following $\pi(a|s_t, \theta_t)$; therefore, the policy gradient algorithm is on-policy. Unfortunately, these ideal ways of sampling $S$ and $A$ are not strictly followed in practice because of their low sample efficiency. A more sample-efficient implementation of (6) is given in the following algorithm: an episode is first generated by following $\pi(\theta)$, and then $\theta$ is updated multiple times using every experience sample in the episode.
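The update rule can be sketched as follows (a hypothetical toy task: a single state with two actions and deterministic rewards, so each episode has length one and the Monte Carlo return equals the immediate reward; a softmax policy is assumed). This is an illustration of update (6), not the book's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic task (hypothetical): one state, two actions.
# Action 0 yields reward 1, action 1 yields reward 0; episodes have
# length 1, so the Monte Carlo return q_t(s_t, a_t) is just the reward.
theta = np.zeros(2)   # softmax preferences, one per action
alpha = 0.1

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(500):
    # Generate an episode by following pi(theta) (on-policy sampling)
    p = pi(theta)
    a = rng.choice(2, p=p)
    q_t = 1.0 if a == 0 else 0.0   # Monte Carlo estimate of q_pi

    # REINFORCE update (6): theta += alpha * grad ln pi(a) * q_t
    grad_ln_pi = -p                # softmax: grad ln pi(a) = e_a - pi
    grad_ln_pi[a] += 1.0
    theta += alpha * grad_ln_pi * q_t

print(pi(theta))  # the probability of the better action 0 approaches 1
```

The better action's probability grows because only positive returns reinforce the sampled action; in this toy case action 1 earns zero return and is never reinforced.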

Figure: REINFORCE algorithm pseudocode.