Chapter 3 Optimal Policy and Bellman Optimality Equation
We know that the ultimate goal of RL is to find the optimal policy. In this chapter we show how to obtain the optimal policy through the Bellman Optimality Equation.
Optimal Policy
The state value can be used to evaluate whether a policy is good or not: if

$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal{S},$$

we say policy $\pi_1$ is 'better' than $\pi_2$.

If

$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal{S} \text{ and for any other policy } \pi,$$

we say policy $\pi^*$ is the optimal policy.
Here come the questions:

- Does the optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How do we obtain the optimal policy?
Bellman Optimality Equation (BOE) will give you the answers.
The fundamental approach to improving a policy is to increase the state value. Since the state value is a weighted sum of the action values over all possible actions, an intuitive way to increase it is simply to let the agent take high-value actions with higher probability.
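To see why this works, recall the relation between the state value and the action values (the same relation used for the Bellman equation in the previous chapter):

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a) \le \max_a q_\pi(s,a),$$

so the weighted average on the left is largest when all probability is placed on the action with the greatest action value.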
Bellman optimality equation (BOE)
Bellman optimality equation (elementwise form):

$$v(s) = \max_{\pi} \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}.$$
Notes:

- $p(r|s,a)$ and $p(s'|s,a)$ are known.
- $v(s)$ and $v(s')$ are unknown and to be calculated.
- The term in parentheses is the action value $q(s,a)$, so the equation can also be written as $v(s) = \max_\pi \sum_a \pi(a|s)\, q(s,a)$.
Bellman optimality equation (matrix-vector form):

$$v = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right).$$
The expression contains two unknowns, namely the policy $\pi$ and the state value $v$, so we need to find an approach to solve it. Before introducing the solving algorithm, we first cover some preliminaries through some interesting examples.
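For reference, $r_\pi$ and $P_\pi$ follow the standard definitions used for the Bellman equation (assuming the same notation as the previous chapter):

$$[r_\pi]_s = \sum_a \pi(a|s) \sum_r p(r|s,a)\, r, \qquad [P_\pi]_{s,s'} = \sum_a \pi(a|s)\, p(s'|s,a),$$

and the maximization $\max_\pi$ is taken over all admissible policies.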
Motivating examples
As mentioned above, the BOE is one equation with two unknowns. How do we solve problems like this? See the following example:
Tip

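As an illustration of this kind of problem, here is a toy equation of my own with the same structure (one equation, two unknowns):

$$x = \max_a \left( 2x - 1 - a^2 \right), \qquad x, a \in \mathbb{R}.$$

Fixing $x$ on the right-hand side, the maximum over $a$ is attained at $a = 0$, which leaves $x = 2x - 1$, i.e., $x = 1$. So the solution is $a = 0$, $x = 1$: fix one unknown, maximize over the other, then solve for the first.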
Okay, we know that the way is to fix one unknown and solve the equation. Suppose we fix $v(s')$ on the right-hand side of the BOE. Then the term in parentheses becomes the known action value $q(s,a)$, and what remains is to solve $\max_\pi \sum_a \pi(a|s)\, q(s,a)$, i.e., to find the maximum over the different probabilities assigned to each action value. A similar example goes:
Tip

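As a small worked stand-in (the numbers $q_1 = 3$, $q_2 = 2$, $q_3 = 1$ are hypothetical), consider maximizing a weighted average:

$$\max_{c_1, c_2, c_3}\; c_1 q_1 + c_2 q_2 + c_3 q_3, \qquad c_i \ge 0,\; c_1 + c_2 + c_3 = 1.$$

Since the objective is a convex combination of the $q_i$, it can never exceed $\max_i q_i$, and it attains that bound by putting all weight on the largest value: $c_1 = 1$, $c_2 = c_3 = 0$, giving $q_1 = 3$.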
Through these examples, we see how to obtain the optimal policy: in every state, adopt the action with the largest action value.
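Written out, this greedy choice is a deterministic policy:

$$\pi(a|s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \neq a^*(s), \end{cases} \qquad a^*(s) = \arg\max_a q(s,a).$$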

Solve the Bellman optimality equation
Preliminaries
- Fixed point: $x \in X$ is a fixed point of $f: X \to X$ if $f(x) = x$.
- Contraction mapping: $f$ is a contraction mapping if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for all $x_1, x_2$, where $\gamma \in (0,1)$.
Tip

So here we can introduce the important theorem:
Important

Contraction mapping theorem: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then

- a solution $x^*$ satisfying $x^* = f(x^*)$ exists;
- the solution is unique;
- the iteration $x_{k+1} = f(x_k)$ converges to $x^*$ from any initial guess $x_0$, and the convergence is exponentially fast.
Examples of contraction mappings include $f(x) = 0.5x$ and $f(x) = Ax$ with $\|A\| < 1$. Suppose $f(x) = 0.5x$; then $|f(x_1) - f(x_2)| = 0.5\,|x_1 - x_2|$, so $f$ is a contraction with $\gamma = 0.5$, and the iteration $x_{k+1} = 0.5\,x_k$ converges to the unique fixed point $x^* = 0$.
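The exponential convergence can be checked numerically. Below is a minimal sketch with an arbitrarily chosen contraction $f(x) = 0.5x + 1$ (fixed point $x^* = 2$) and an arbitrary starting point:

```python
# Fixed-point iteration for a contraction mapping f(x) = 0.5 * x + 1.
# f is a contraction with gamma = 0.5, so x_{k+1} = f(x_k) must converge
# to the unique fixed point x* = 2 (solve x = 0.5 * x + 1).

def f(x):
    return 0.5 * x + 1.0

x = 10.0  # arbitrary initial guess
for k in range(20):
    x = f(x)
    print(f"k={k:2d}  x={x:.6f}  |x - 2|={abs(x - 2.0):.2e}")
# The error |x_k - x*| shrinks by a factor of 0.5 each step (exponential convergence).
```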
Contraction property of BOE
The right-hand side of the Bellman optimality equation

$$v = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right)$$

can be regarded as a function of $v$, so the BOE can be written as

$$v = f(v), \qquad f(v) = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right),$$

and $f$ is a contraction mapping: $\|f(v_1) - f(v_2)\|_\infty \le \gamma \|v_1 - v_2\|_\infty$, where $\gamma \in (0,1)$ is the discount rate.
So we can utilize the contraction mapping theorem to solve the BOE.

This theorem guarantees the existence and uniqueness of a solution to the BOE, and it also gives an algorithm for solving it.
Iterative algorithm
Matrix-vector form:

$$v_{k+1} = f(v_k) = \max_{\pi} \left( r_\pi + \gamma P_\pi v_k \right), \qquad k = 0, 1, 2, \dots$$
Elementwise form:

$$v_{k+1}(s) = \max_a \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right) = \max_a q_k(s,a).$$
The procedure goes: for every state $s$ and every action $a$, compute $q_k(s,a)$ from the current estimate $v_k$; take the greedy policy $\pi_{k+1}(a|s) = 1$ for $a = \arg\max_a q_k(s,a)$ and $0$ otherwise; then update $v_{k+1}(s) = \max_a q_k(s,a)$. Repeat until $\|v_{k+1} - v_k\|$ falls below a tolerance.
Tip

Actions: three actions, representing go left, stay unchanged, and go right.
Reward: entering the target area: $+1$; trying to go out of the boundary: $-1$.
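A minimal value-iteration sketch for a 1D grid of this kind (the grid size, target cell, and discount rate below are my own assumptions, not taken from the original example):

```python
import numpy as np

# Assumed setup (for illustration only): 5 cells in a row,
# the target area is cell 2, and the discount rate gamma is 0.9.
n_states, target, gamma = 5, 2, 0.9
actions = [-1, 0, +1]  # go left, stay unchanged, go right

def step(s, a):
    """Deterministic model: next state and reward for taking action a in state s."""
    s_next = s + a
    if s_next < 0 or s_next >= n_states:                # trying to go out of the boundary
        return s, -1.0
    return s_next, (1.0 if s_next == target else 0.0)   # +1 for entering the target area

v = np.zeros(n_states)                                   # v_0: arbitrary initial guess
for _ in range(1000):
    q = np.zeros((n_states, len(actions)))
    for s in range(n_states):
        for i, a in enumerate(actions):
            s_next, r = step(s, a)
            q[s, i] = r + gamma * v[s_next]              # q_k(s, a)
    v_new = q.max(axis=1)                                # v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:                # converged (contraction mapping theorem)
        v = v_new
        break
    v = v_new

greedy = [actions[i] for i in q.argmax(axis=1)]          # greedy (optimal) action per state
print("optimal state values:", np.round(v, 3))
print("optimal actions     :", greedy)
```

Changing `gamma` in this sketch illustrates the discount-rate factor discussed below: a smaller value makes rewards far in the future matter less, i.e., a more short-sighted policy.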



Factors that determine the optimal policy

- Reward design: the values of the rewards $r$.
- System model: the probabilities $p(s'|s,a)$ and $p(r|s,a)$.
- Discount rate: $\gamma$, affecting whether the resulting policy is short-sighted (small $\gamma$) or far-sighted (large $\gamma$).
::: tip
We can also strengthen our understanding of action values in this chapter: given a deterministic policy, do all actions that the policy does not select at a certain state have zero action value?
From the iterative solution of the BOE, we can see that the policy is updated by choosing the action with the largest action value.
In this process, we simply set the probability of the other actions to 0, but their action values remain what they are. As a result,

$$\sum_a \pi(a|s)\, q_k(s,a)$$

reaches its maximum, $\max_a q_k(s,a)$, because all the probability is placed on the greedy action.
When the policy is updated, the chosen action at a state may change, which shows in turn that under the previous policy an unselected action could be a better option: its action value is not 0.
:::
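To put the answer in symbols (a short recap using the definitions above): for a deterministic policy $\pi$ that selects action $\pi(s)$ at state $s$,

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a) = q_\pi(s, \pi(s)),$$

while the action values $q_\pi(s,a)$ of the unselected actions $a \neq \pi(s)$ are still well defined and generally nonzero; they are exactly what the policy update compares against.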
