Chapter 3 Optimal Policy and Bellman Optimality Equation
We know that the ultimate goal of RL is to find the optimal policy. In this chapter we show how to obtain the optimal policy through the Bellman Optimality Equation.
Optimal Policy
The state value can be used to evaluate whether a policy is good or not: if

$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal{S},$$

we say policy $\pi_1$ is 'better' than $\pi_2$.

If

$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal{S} \text{ and for any other policy } \pi,$$

we say policy $\pi^*$ is the optimal policy.
Here come the questions:

- Does the optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How do we obtain the optimal policy?
Bellman Optimality Equation (BOE) will give you the answers.
The fundamental approach to improving a policy is to increase the state value. Since the state value is a weighted sum of the action values over all possible actions, an intuitive way to increase it is simply to let the agent take high-value actions with higher probability.
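To see why this works, recall the relation between the state value and the action values (the same relation used for the Bellman equation in the previous chapter):

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a) \le \max_a q_\pi(s,a),$$

so the weighted average on the left is largest when all probability is placed on the action with the greatest action value.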
Bellman optimality equation (BOE)
Bellman optimality equation (elementwise form):

$$v(s) = \max_{\pi} \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}.$$
Notes:

- $p(r|s,a)$ and $p(s'|s,a)$ are known.
- $v(s)$ and $v(s')$ are unknown and to be calculated.
- The term in parentheses is the action value $q(s,a)$, so the equation can also be written as $v(s) = \max_\pi \sum_a \pi(a|s)\, q(s,a)$.
Bellman optimality equation (matrix-vector form):

$$v = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right).$$
The expression contains two unknowns, namely the policy $\pi$ and the state value $v$, so we need to find an approach to solve it. Before introducing the solving algorithm, we first cover some preliminaries through some interesting examples.
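For reference, $r_\pi$ and $P_\pi$ follow the standard definitions used for the Bellman equation (assuming the same notation as the previous chapter):

$$[r_\pi]_s = \sum_a \pi(a|s) \sum_r p(r|s,a)\, r, \qquad [P_\pi]_{s,s'} = \sum_a \pi(a|s)\, p(s'|s,a),$$

and the maximization $\max_\pi$ is taken over all admissible policies.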
Motivating examples
As mentioned above, the BOE is one equation with two unknowns. How do we solve problems like this? See the following example:
Tip

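As an illustration of this kind of problem, here is a toy equation of my own with the same structure (one equation, two unknowns):

$$x = \max_a \left( 2x - 1 - a^2 \right), \qquad x, a \in \mathbb{R}.$$

Fixing $x$ on the right-hand side, the maximum over $a$ is attained at $a = 0$, which leaves $x = 2x - 1$, i.e., $x = 1$. So the solution is $a = 0$, $x = 1$: fix one unknown, maximize over the other, then solve for the first.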
Okay, we know that the way is to fix one unknown and solve the equation. Suppose we fix $v(s')$ on the right-hand side of the BOE. Then the term in parentheses becomes the known action value $q(s,a)$, and what remains is to solve $\max_\pi \sum_a \pi(a|s)\, q(s,a)$, i.e., to find the maximum over the different probabilities assigned to each action value. A similar example goes:
Tip

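As a small worked stand-in (the numbers $q_1 = 3$, $q_2 = 2$, $q_3 = 1$ are hypothetical), consider maximizing a weighted average:

$$\max_{c_1, c_2, c_3}\; c_1 q_1 + c_2 q_2 + c_3 q_3, \qquad c_i \ge 0,\; c_1 + c_2 + c_3 = 1.$$

Since the objective is a convex combination of the $q_i$, it can never exceed $\max_i q_i$, and it attains that bound by putting all weight on the largest value: $c_1 = 1$, $c_2 = c_3 = 0$, giving $q_1 = 3$.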
Through these examples, we see how to obtain the optimal policy: in every state, adopt the action with the largest action value.
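Written out, this greedy choice is a deterministic policy:

$$\pi(a|s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \neq a^*(s), \end{cases} \qquad a^*(s) = \arg\max_a q(s,a).$$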

Solve the Bellman optimality equation
Preliminaries
- Fixed point: $x \in X$ is a fixed point of $f: X \to X$ if $f(x) = x$.
- Contraction mapping: $f$ is a contraction mapping if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for all $x_1, x_2$, where $\gamma \in (0,1)$.
Tip

So here we can introduce the important theorem:
Important

Contraction mapping theorem: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then

- a solution $x^*$ satisfying $x^* = f(x^*)$ exists;
- the solution is unique;
- the iteration $x_{k+1} = f(x_k)$ converges to $x^*$ from any initial guess $x_0$, and the convergence is exponentially fast.
Examples of contraction mappings include $f(x) = 0.5x$ and $f(x) = Ax$ with $\|A\| < 1$. Suppose $f(x) = 0.5x$; then $|f(x_1) - f(x_2)| = 0.5\,|x_1 - x_2|$, so $f$ is a contraction with $\gamma = 0.5$, and the iteration $x_{k+1} = 0.5\,x_k$ converges to the unique fixed point $x^* = 0$.
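The exponential convergence can be checked numerically. Below is a minimal sketch with an arbitrarily chosen contraction $f(x) = 0.5x + 1$ (fixed point $x^* = 2$) and an arbitrary starting point:

```python
# Fixed-point iteration for a contraction mapping f(x) = 0.5 * x + 1.
# f is a contraction with gamma = 0.5, so x_{k+1} = f(x_k) must converge
# to the unique fixed point x* = 2 (solve x = 0.5 * x + 1).

def f(x):
    return 0.5 * x + 1.0

x = 10.0  # arbitrary initial guess
for k in range(20):
    x = f(x)
    print(f"k={k:2d}  x={x:.6f}  |x - 2|={abs(x - 2.0):.2e}")
# The error |x_k - x*| shrinks by a factor of 0.5 each step (exponential convergence).
```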
Contraction property of BOE
The right-hand side of the Bellman optimality equation

$$v = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right)$$

can be regarded as a function of $v$, so the BOE can be written as

$$v = f(v), \qquad f(v) = \max_{\pi} \left( r_\pi + \gamma P_\pi v \right),$$

and $f$ is a contraction mapping: $\|f(v_1) - f(v_2)\|_\infty \le \gamma \|v_1 - v_2\|_\infty$, where $\gamma \in (0,1)$ is the discount rate.
So we can utilize the contraction mapping theorem to solve the BOE.

This theorem guarantees the existence and uniqueness of a solution to the BOE, and it also gives an algorithm for solving it.
Iterative algorithm
Matrix-vector form:

$$v_{k+1} = f(v_k) = \max_{\pi} \left( r_\pi + \gamma P_\pi v_k \right), \qquad k = 0, 1, 2, \dots$$
Elementwise form:

$$v_{k+1}(s) = \max_a \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right) = \max_a q_k(s,a).$$
The procedure goes: for every state $s$ and every action $a$, compute $q_k(s,a)$ from the current estimate $v_k$; take the greedy policy $\pi_{k+1}(a|s) = 1$ for $a = \arg\max_a q_k(s,a)$ and $0$ otherwise; then update $v_{k+1}(s) = \max_a q_k(s,a)$. Repeat until $\|v_{k+1} - v_k\|$ falls below a tolerance.
Tip

Actions: three actions, representing go left, stay unchanged, and go right.
Reward: entering the target area: $+1$; trying to go out of the boundary: $-1$.
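A minimal value-iteration sketch for a 1D grid of this kind (the grid size, target cell, and discount rate below are my own assumptions, not taken from the original example):

```python
import numpy as np

# Assumed setup (for illustration only): 5 cells in a row,
# the target area is cell 2, and the discount rate gamma is 0.9.
n_states, target, gamma = 5, 2, 0.9
actions = [-1, 0, +1]  # go left, stay unchanged, go right

def step(s, a):
    """Deterministic model: next state and reward for taking action a in state s."""
    s_next = s + a
    if s_next < 0 or s_next >= n_states:                # trying to go out of the boundary
        return s, -1.0
    return s_next, (1.0 if s_next == target else 0.0)   # +1 for entering the target area

v = np.zeros(n_states)                                   # v_0: arbitrary initial guess
for _ in range(1000):
    q = np.zeros((n_states, len(actions)))
    for s in range(n_states):
        for i, a in enumerate(actions):
            s_next, r = step(s, a)
            q[s, i] = r + gamma * v[s_next]              # q_k(s, a)
    v_new = q.max(axis=1)                                # v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:                # converged (contraction mapping theorem)
        v = v_new
        break
    v = v_new

greedy = [actions[i] for i in q.argmax(axis=1)]          # greedy (optimal) action per state
print("optimal state values:", np.round(v, 3))
print("optimal actions     :", greedy)
```

Changing `gamma` in this sketch illustrates the discount-rate factor discussed below: a smaller value makes rewards far in the future matter less, i.e., a more short-sighted policy.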



Factors that determine the optimal policy

- Reward design: the values of the rewards $r$.
- System model: the probabilities $p(s'|s,a)$ and $p(r|s,a)$.
- Discount rate: $\gamma$, affecting whether the resulting policy is short-sighted (small $\gamma$) or far-sighted (large $\gamma$).
::: tip
We can also strengthen our understanding of action values in this chapter: given a deterministic policy, do all actions that the policy does not select at a certain state have zero action value?
From the iterative solution of the BOE, we can see that the policy is updated by choosing the action with the largest action value.
In this process, we simply set the probability of the other actions to 0, but their action values remain what they are. As a result,

$$\sum_a \pi(a|s)\, q_k(s,a)$$

reaches its maximum, $\max_a q_k(s,a)$, because all the probability is placed on the greedy action.
When the policy is updated, the chosen action at a state may change, which shows in turn that under the previous policy an unselected action could be a better option: its action value is not 0.
:::
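To put the answer in symbols (a short recap using the definitions above): for a deterministic policy $\pi$ that selects action $\pi(s)$ at state $s$,

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a) = q_\pi(s, \pi(s)),$$

while the action values $q_\pi(s,a)$ of the unselected actions $a \neq \pi(s)$ are still well defined and generally nonzero; they are exactly what the policy update compares against.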
