All the methods in the previous chapters are value-based. The difference between value-based and policy-based methods lies in how the policy is represented. Value-based methods generate policies implicitly and indirectly: the algorithm itself does not maintain a policy function. Instead, it solves for the values (state or action values) by model-free or model-based methods and then determines the action greedily (or epsilon-greedily) by maximizing the value function at each step (e.g., $\arg\max_a Q(s,a)$), thereby deriving the policy from the values. In contrast, policy-based methods directly represent the policy as a parameterized function $\pi(a|s,\theta)$, where $\theta$ is a parameter vector (instead of the previous tabular representation). The probability distribution of the policy is obtained by directly optimizing the parameter $\theta$.
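As a concrete illustration of a parameterized policy $\pi(a|s,\theta)$, here is a minimal sketch of a tabular softmax parameterization; the shapes, state/action indexing, and function names are illustrative assumptions, not from the text:

```python
import numpy as np

def softmax_policy(theta, s):
    """Return pi(.|s, theta) as a probability vector over actions.

    theta is a |S| x |A| matrix of action preferences; this
    tabular-softmax form is one simple choice of parameterized
    policy pi(a|s, theta).
    """
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()

# Example: 3 states, 2 actions, all preferences zero -> uniform policy.
theta = np.zeros((3, 2))
probs = softmax_policy(theta, 0)        # [0.5, 0.5]
```

Optimizing $\theta$ then changes the action probabilities directly, rather than indirectly through a value table.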
In this chapter, we introduce policy gradient methods.
When represented as a table, a policy is defined as optimal if it maximizes every state value. When represented by a function, a policy is defined as optimal if it maximizes a certain scalar metric. We can then optimize the metric by gradient ascent to find the optimal parameters.
We have two types of metrics for defining optimal policies.
The first metric is the average state value, or simply the average value. It is defined as:
$$\bar{v}_\pi = \sum_{s\in\mathcal{S}} d(s)\, v_\pi(s),$$
where $d(s)$ is the weight of state $s$. It satisfies $d(s)\ge 0$ for any $s\in\mathcal{S}$ and $\sum_{s\in\mathcal{S}} d(s) = 1$. Therefore, we can interpret $d(s)$ as a probability distribution over $s$.
Then, the metric can be written as:
$$\bar{v}_\pi = \mathbb{E}_{S\sim d}[v_\pi(S)].$$
For the distribution $d$, we have two cases. The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that the distribution is independent of the policy. One choice is to treat all states as equally important and select $d_0(s) = 1/|\mathcal{S}|$. Another choice applies when we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design $d_0(s_0) = 1$ and $d_0(s) = 0$ for $s \neq s_0$.
The second case is that $d$ depends on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, the stationary distribution under $\pi$ discussed in the last chapter.
As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently an optimal $\theta$) that maximizes $\bar{v}_\pi$.
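To make the weighted average concrete, the following sketch computes $\bar{v}_\pi$ for the two policy-independent choices of $d_0$ mentioned above; the state values are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical state values v_pi(s) for a 3-state problem.
v_pi = np.array([10.0, 4.0, 1.0])

# Case 1: d0 uniform -- all states equally important.
d0_uniform = np.ones(3) / 3
v_bar_uniform = d0_uniform @ v_pi        # (10 + 4 + 1) / 3 = 5.0

# Case 2: d0 is a point mass on the start state s0 = 0.
d0_point = np.array([1.0, 0.0, 0.0])
v_bar_point = d0_point @ v_pi            # equals v_pi(s0) = 10.0
```

Which choice of $d_0$ is appropriate depends on whether all states matter or only the start state does.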
The average state value also has an equivalent expression:
$$J(\theta) = \lim_{n\to\infty} \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R_{t+1}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right].$$
We can understand this equation easily by noting that $v_\pi$ represents the expected future discounted cumulative return. The math behind this is:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right] = \sum_{s\in\mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s\right] = \sum_{s\in\mathcal{S}} d(s)\, v_\pi(s) = \bar{v}_\pi.$$
The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as
$$\bar{r}_\pi \doteq \sum_{s\in\mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S\sim d_\pi}[r_\pi(S)],$$
where $d_\pi$ is the stationary distribution and
$$r_\pi(s) \doteq \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, r(s,a) = \mathbb{E}_{A\sim\pi(s,\theta)}[r(s,A)\,|\,s]$$
is the expectation of the immediate rewards. Here, $r(s,a) \doteq \mathbb{E}[R|s,a] = \sum_r r\, p(r|s,a)$, which represents the expected reward obtained when taking action $a$ at state $s$.
There are also two important equivalent expressions of $\bar{r}_\pi$:
Suppose that the agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. A common metric that readers may often see in the literature is:
$$J(\theta) = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right].$$
Regarding $\bar{r}_\pi = \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]$: just as $v_\pi$ represents the expected future discounted cumulative return, $\bar{r}_\pi$ represents the average expected reward per step after the system reaches its steady state. Intuitively, this can be understood as follows. In the undiscounted case (i.e., $\gamma = 1$) or over an infinitely long horizon, if we directly compute $\sum_{t=0}^{\infty} R_{t+1}$, the value is likely to diverge to infinity (if the reward per step is positive). To judge whether a policy is good or bad over such a long-term or infinite horizon, we cannot compare two infinities. Therefore, we divide the total reward by the number of time steps $n$ and take the limit as $n\to\infty$; this computes the time average. According to the ergodic theorem of Markov chains, when $t$ is long enough, the frequency of state visits converges to the stationary distribution $d_\pi$. Therefore, the time average in the limit equals the spatial average (i.e., the expected reward of each state weighted by its stationary probability, $\sum_s d_\pi(s) r_\pi(s)$). This is the physical meaning of $\bar{r}_\pi$.
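The "time average equals spatial average" claim can be checked numerically. The sketch below simulates a small Markov chain (the transition matrix and rewards are invented for illustration) and compares the empirical per-step reward with $\sum_s d_\pi(s) r_\pi(s)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-state Markov chain induced by some fixed policy (invented numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 5.0])      # expected one-step reward r_pi(s) in each state

# Stationary distribution d_pi: left eigenvector of P with eigenvalue 1.
w, V = np.linalg.eig(P.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = d / d.sum()               # here d = [2/3, 1/3]

spatial_avg = d @ r           # sum_s d_pi(s) r_pi(s) = 7/3

# Time average from a long simulated trajectory.
s, total, n = 0, 0.0, 200_000
for _ in range(n):
    total += r[s]
    s = rng.choice(2, p=P[s])
time_avg = total / n          # converges to spatial_avg by ergodicity
```

The two averages agree up to simulation noise, matching the ergodic-theorem argument above.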
Similarly, the average reward $\bar{r}_\pi$ can also be written as the inner product of two vectors. In particular, let
$$r_\pi = [\ldots, r_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\ldots, d_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}.$$
Then, it is clear that:
$$\bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$
This expression will be useful when we derive its gradient.
It should be noted that the two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent. In the discounted case where $\gamma\in(0,1)$, it holds that:
$$\bar{r}_\pi = (1-\gamma)\,\bar{v}_\pi.$$
For the proof, note that $\bar{v}_\pi = d_\pi^T v_\pi$ and $\bar{r}_\pi = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying both sides of the Bellman equation by $d_\pi^T$ yields
$$\bar{v}_\pi = \bar{r}_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi,$$
which gives $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$.
Notice that for the stationary distribution, we have $d_\pi^T P_\pi = d_\pi^T$.
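A quick numerical sanity check of $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$, using an invented two-state chain: solve the Bellman equation for $v_\pi$, compute the stationary distribution $d_\pi$, and compare the two metrics.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # P_pi for some fixed policy (invented)
r = np.array([1.0, 5.0])          # r_pi (invented)

# Solve the Bellman equation v_pi = r_pi + gamma * P_pi v_pi.
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Stationary distribution d_pi: left eigenvector of P with eigenvalue 1.
w, V = np.linalg.eig(P.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = d / d.sum()

v_bar = d @ v                     # average state value
r_bar = d @ r                     # average reward
# r_bar equals (1 - gamma) * v_bar, as the identity predicts
```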
Given the metrics introduced in the last section, we can use gradient-based methods to maximize them. To do that, we need to first calculate the gradients of these metrics. The most important theoretical result in this chapter is the following theorem.
Theorem 9.1 (Policy gradient theorem).
The gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a), \tag{1}$$
where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.
In particular, $J(\theta)$ can be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$. Depending on the scenario, (1) may hold as a strict equality or as an approximation. The distribution $\eta$ also varies across scenarios.
Moreover, (1) has a compact form expressed in terms of expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta,\, A\sim\pi(S,\theta)}\left[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\right], \tag{2}$$
where $\ln$ is the natural logarithm. The proof is given below. Using $\nabla_\theta \pi = \pi\, \nabla_\theta \ln\pi$ and the definition of expectation, (1) can be rewritten as:
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln\pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}_{S\sim\eta,\, A\sim\pi(S,\theta)}\left[\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)\right].$$
This expectation form suggests the gradient-ascent algorithm
$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\left[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)\right], \tag{5}$$
where $\alpha>0$ is a constant learning rate. Since the true gradient in (5) is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:
$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t), \tag{6}$$
where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$. If $q_t(s_t,a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms. The algorithm in (6) is important since many other policy gradient algorithms can be obtained by extending it.
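To make update (6) concrete, here is a sketch of a single stochastic update for a tabular softmax policy, for which $\nabla_\theta \ln\pi(a|s,\theta)$ has the closed form $\mathbf{1}_a - \pi(\cdot|s,\theta)$ over the preferences of state $s$; the setup and names are toy assumptions:

```python
import numpy as np

def pi(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def policy_gradient_step(theta, s, a, q_est, alpha=0.1):
    """One stochastic update: theta <- theta + alpha * grad ln pi(a|s) * q.

    For a tabular softmax policy, the gradient of ln pi(a|s, theta)
    with respect to theta[s] is onehot(a) - pi(.|s, theta);
    q_est approximates q_pi(s, a).
    """
    probs = pi(theta, s)
    grad_ln = -probs
    grad_ln[a] += 1.0
    theta = theta.copy()
    theta[s] += alpha * grad_ln * q_est
    return theta

theta0 = np.zeros((1, 2))                 # one state, two actions
theta1 = policy_gradient_step(theta0, s=0, a=0, q_est=2.0)
p0 = pi(theta0, 0)[0]                     # 0.5 before the update
p1 = pi(theta1, 0)[0]                     # larger: positive q raises pi(a|s)
```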
Since $\nabla_\theta \ln\pi(a_t|s_t,\theta_t) = \dfrac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}$, we can rewrite (6) as:
$$\theta_{t+1} = \theta_t + \alpha\, \beta_t\, \nabla_\theta \pi(a_t|s_t,\theta_t), \quad \text{where } \beta_t \doteq \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}. \tag{7}$$
From this expression, if $\beta_t \ge 0$, we can see that the algorithm is trying to increase $\pi(a_t|s_t,\theta_t)$ so as to maximize $J$, since (7) follows the typical gradient-ascent form $\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta(\text{objective})$ with $\pi(a_t|s_t,\theta_t)$ as the objective. We can conclude that:
If $\beta_t \ge 0$, the probability of choosing $(s_t,a_t)$ is enhanced; that is, $\pi(a_t|s_t,\theta_{t+1}) \ge \pi(a_t|s_t,\theta_t)$. The greater $\beta_t$ is, the stronger the enhancement.
If $\beta_t < 0$, the probability of choosing $(s_t,a_t)$ decreases; that is, $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$.
We can prove this as follows. When $\theta_{t+1} - \theta_t$ is sufficiently small, it follows from the Taylor expansion that
$$\pi(a_t|s_t,\theta_{t+1}) \approx \pi(a_t|s_t,\theta_t) + \left(\nabla_\theta \pi(a_t|s_t,\theta_t)\right)^T (\theta_{t+1} - \theta_t) = \pi(a_t|s_t,\theta_t) + \alpha\, \beta_t\, \left\|\nabla_\theta \pi(a_t|s_t,\theta_t)\right\|^2,$$
where the last equality substitutes $\theta_{t+1} - \theta_t = \alpha\beta_t\nabla_\theta\pi(a_t|s_t,\theta_t)$ from (7). It is then clear that $\pi(a_t|s_t,\theta_{t+1}) \ge \pi(a_t|s_t,\theta_t)$ when $\beta_t \ge 0$ and $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$ when $\beta_t < 0$.
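This first-order argument can be verified numerically: for a softmax policy with invented parameters, apply update (7) once and compare the actual probability change with the predicted $\alpha\beta_t\|\nabla_\theta\pi\|^2$ (all numbers below are illustrative assumptions):

```python
import numpy as np

def pi_prob(theta, a):
    e = np.exp(theta - theta.max())
    return (e / e.sum())[a]

def grad_pi(theta, a, eps=1e-6):
    # Numerical gradient of pi(a | theta) with respect to theta.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (pi_prob(tp, a) - pi_prob(tm, a)) / (2 * eps)
    return g

theta = np.array([0.2, -0.1, 0.0])        # one state, three actions
a = 1
alpha, q = 0.05, 3.0                      # q_t(s_t, a_t) > 0  =>  beta_t > 0
beta = q / pi_prob(theta, a)
g = grad_pi(theta, a)
theta_next = theta + alpha * beta * g     # update (7)

increase = pi_prob(theta_next, a) - pi_prob(theta, a)
predicted = alpha * beta * (g @ g)        # alpha * beta_t * ||grad pi||^2
```

The actual increase matches the first-order prediction up to a small higher-order error.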
The algorithm can strike a balance between exploration and exploitation to a certain extent. The logic is as follows:
First, recall from above that if $\beta_t \ge 0$, the probability of choosing $(s_t,a_t)$ is enhanced.
For exploitation: because $\beta_t$ is proportional to the action value $q_t(s_t,a_t)$, high-value actions generate a larger $\beta_t$ and are thus more strongly amplified (their probabilities are magnified more).
For exploration: because $\beta_t$ is inversely proportional to the probability $\pi(a_t|s_t,\theta_t)$ of taking the action, when the policy happens to take a low-probability but positive-value action, the small denominator yields a large $\beta_t$, producing a drastic gradient update that rapidly increases the probability of this previously neglected action.
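A tiny numeric illustration of this balance, with made-up numbers: two actions with the same estimated value but very different current probabilities receive very different update strengths $\beta_t$:

```python
# beta_t = q_t(s_t, a_t) / pi(a_t | s_t, theta_t): the same value estimate
# yields very different update strengths depending on how likely the action was.
q = 2.0
pi_common = 0.50    # an action the current policy already favors
pi_rare = 0.01      # a rarely tried action that turned out valuable

beta_common = q / pi_common   # 4.0   -> mild reinforcement (exploitation)
beta_rare = q / pi_rare       # 200.0 -> drastic reinforcement (exploration)
```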
Moreover, since (6) uses samples to approximate the true gradient in (5), it is important to understand how the samples should be obtained.
How to sample S?
$S$ in the true gradient $\mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)]$ should obey the distribution $\eta$, which is either the stationary distribution $d_\pi$ or the discounted total probability distribution $\rho_\pi$. Either $d_\pi$ or $\rho_\pi$ represents the long-term behavior exhibited under $\pi$.
How to sample A?
$A$ in $\mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)]$ should obey the distribution $\pi(A|S,\theta_t)$. The ideal way to sample $A$ is to select $a_t$ following $\pi(a|s_t,\theta_t)$. Therefore, the policy gradient algorithm is on-policy. Unfortunately, the ideal ways of sampling $S$ and $A$ are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation of (6) is given in the following algorithm: an episode is first generated by following $\pi(\theta)$, and then $\theta$ is updated multiple times using every experience sample in the episode.
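The episode-based implementation described above can be sketched as follows, on an invented one-state, two-action toy problem (REINFORCE with a tabular softmax policy; the environment, horizon, and hyperparameters are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a single state, two actions, episodic horizon T.
# Action 0 yields reward +1, action 1 yields 0 (assumed dynamics).
T, gamma, alpha = 10, 0.9, 0.1
theta = np.zeros(2)                     # softmax preferences for the one state

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(500):
    # 1) Generate an episode by following pi(theta).
    actions = [rng.choice(2, p=pi(theta)) for _ in range(T)]
    rewards = [1.0 if a == 0 else 0.0 for a in actions]
    # 2) For every step t, estimate q_t by the Monte Carlo return,
    #    then apply update (6) once per experience sample.
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g      # discounted return from step t
        probs = pi(theta)
        grad_ln = -probs
        grad_ln[actions[t]] += 1.0      # grad of ln pi for a softmax policy
        theta = theta + alpha * grad_ln * g

final_probs = pi(theta)                 # policy concentrates on action 0
```

Each episode yields $T$ updates of $\theta$, which is the sample-efficient use of experience described above.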