
Chapter 4 Value Iteration and Policy Iteration

RyanLee_ljx · about 3 min · RL


In the last chapter, we studied the Bellman Optimality Equation (BOE). In this chapter we introduce three model-based iterative algorithms for solving the BOE and deriving the optimal policy: value iteration, policy iteration, and truncated policy iteration.

Value Iteration

$$v = f(v) = \max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v)$$

In the last chapter, we know that the contraction mapping theorem suggests an iterative algorithm:

$$v_{k+1} = f(v_k) = \max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v_k), \quad k = 0, 1, 2, \dots$$

$v_0$ can be arbitrary. We can solve this equation with the value iteration algorithm.

Each iteration of the algorithm consists of two steps:

  1. Step 1 Policy Update (PU)

Given the current value $v_k(s)$ (with $v_0(s)$ arbitrary), calculate the action values $q(v_k(s), a)$. For each state, choose the action with the largest action value to obtain the updated policy.

Note that this value is not a state value. It is just an intermediate value (an estimate of the state value).

  2. Step 2 Value Update (VU)

Given the new policy, calculate the new value $v_{k+1}(s)$. Since the policy is deterministic, the new value equals the largest action value.

Figure: Value Iteration

The procedure can be summarized as:

$$v_k(s) \to q(v_k(s), a) \to \pi_{k+1}(a|s) = \arg\max_{a} q(v_k(s), a) \to v_{k+1}(s) = \max_{a} q(v_k(s), a)$$

Pseudocode: Value iteration algorithm
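To make the two steps concrete, below is a minimal Python sketch of value iteration on a tabular MDP. The model format is an assumption for illustration: `P[s, a, s']` holds the transition probabilities $p(s'|s,a)$ and `R[s, a]` holds the expected immediate reward; the function name and stopping tolerance are likewise illustrative, not part of the pseudocode above.

```python
import numpy as np

# A minimal value iteration sketch on a tabular MDP (illustrative model format):
#   P[s, a, s'] = p(s' | s, a)   -- transition probabilities
#   R[s, a]     = expected immediate reward for taking action a in state s
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                    # v_0 can be arbitrary
    while True:
        q = R + gamma * P @ v                 # q(v_k(s), a), shape (n_states, n_actions)
        policy = q.argmax(axis=1)             # policy update: greedy, deterministic
        v_new = q.max(axis=1)                 # value update: v_{k+1}(s) = max_a q(v_k(s), a)
        if np.max(np.abs(v_new - v)) < tol:   # stop once the contraction has converged
            return v_new, policy
        v = v_new
```

Each loop iteration performs one policy update followed by one value update; the contraction mapping theorem guarantees the stopping condition is eventually met.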

::: tips Example for Value Iteration

:::

Policy Iteration

  1. Step 1 Policy Evaluation (PE)

That is, calculate the true state value of the current policy $\pi_k$ by solving the Bellman Equation with an iterative algorithm:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
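In practice, this linear equation is solved with a fixed-point iteration; writing the inner evaluation sweeps with a superscript $(j)$, starting from an arbitrary $v_{\pi_k}^{(0)}$:

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots$$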

Note that this step is itself an iterative process, namely a nested iteration inside policy iteration. It is not value iteration: value iteration solves the BOE, whereas the iterative algorithm here only solves the BE of the current policy.

  2. Step 2 Policy Improvement (PI)

Now we have the state value of the current policy. We can obtain the action value through

$$q_{\pi}(s, a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi}(s')$$

Then the improved policy is

$$\pi_{k+1}(a|s) = \arg\max_{a} q_{\pi_k}(s, a)$$

Or in one equation

$$\pi_{k+1} = \arg\max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$$

This is the same as the elementwise formula above, since taking the action with the maximum action value in each state is exactly how the greedy (currently optimal) policy is obtained.

Note that here $P$ is $P_{\pi}$, not a fixed $P_{\pi_k}$, because $\pi$ is what we are solving for now: it is unknown. We derive it by maximizing the right-hand side over $\pi$, which amounts to finding the maximum action value in each state.

Figure: Policy Iteration Process
Figure: Policy Iteration
Pseudocode: Policy iteration algorithm
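For comparison with the value iteration sketch earlier, here is a minimal Python sketch of policy iteration under the same illustrative model format (`P[s, a, s']`, `R[s, a]`). The inner `while` loop is the nested policy evaluation iteration; the outer loop alternates evaluation and improvement until the policy stops changing.

```python
import numpy as np

# A minimal policy iteration sketch (same illustrative model format as above):
#   P[s, a, s'] = p(s' | s, a),  R[s, a] = expected immediate reward
def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: iteratively solve v = r_pi + gamma * P_pi v (nested iteration)
        r_pi = R[np.arange(n_states), policy]       # r_pi(s) under the current policy
        P_pi = P[np.arange(n_states), policy]       # P_pi(s, s') under the current policy
        v = np.zeros(n_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                break
            v = v_new
        # Policy improvement: pi_{k+1}(s) = argmax_a q_{pi_k}(s, a)
        q = R + gamma * P @ v_new
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # stop when the policy no longer changes
            return v_new, policy
        policy = new_policy
```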

Differences between Value Iteration and Policy Iteration

  • Value Iteration repeatedly updates state value estimates using the Bellman optimality operator (which takes a max over actions). VI maintains value estimates and implicitly (greedily) updates a policy from those values.

  • Policy Iteration alternates two distinct steps: policy evaluation (compute the true state value for the current policy π\pi) and policy improvement (make the policy greedy with respect to the state value). Policy Iteration explicitly maintains a policy and its evaluated value. It focuses on producing a sequence of increasingly better policies by repeatedly computing accurate values for each policy then improving.

So we can summarize as follows:

  • Value Iteration starts with an initial guess of the state value, while Policy Iteration starts with an initial policy.

  • Value Iteration stores estimates of the state value during iteration, while Policy Iteration stores a policy and the true state value under that policy.

  • Value Iteration updates state values by taking the maximum over actions at each step, while Policy Iteration alternates between policy evaluation and policy improvement.

  • Value Iteration performs one iterative process of updating state values until convergence, while Policy Iteration performs two processes: evaluation (of current policy) and improvement (of policy based on evaluated values).

  • Value Iteration implicitly updates the policy by choosing the best action greedily from the current value estimates, while Policy Iteration explicitly computes a policy based on evaluated values.

Figure: Comparison between VI and PI

Truncated policy iteration algorithm

Truncated policy iteration is a combination of the two. In other words, policy iteration and value iteration are the two extreme cases of truncated policy iteration.

Figure: Truncated policy iteration

From the illustration we can see that policy iteration runs a full iterative process to obtain the true state value, while value iteration performs just one iteration. Truncated policy iteration cuts in between, stopping at a reasonably good intermediate estimate of the state value.

We can summarize it as follows: Truncated Policy Iteration is a family of algorithms between Value Iteration and Policy Iteration: perform a limited number (one or more) of evaluation sweeps for the current policy, then do a policy improvement. Value Iteration and Policy Iteration are the extreme points of this family (in an intuitive sense): Policy Iteration does a full evaluation before each improvement, Value Iteration does a single sweep before each greedy improvement, and truncated (also called modified) policy iteration interpolates between them.

Pseudocode: Truncated policy iteration algorithm
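Again under the same illustrative model format, a minimal sketch of the truncated variant follows. The parameter `j_truncate` (an assumed name) is the number of evaluation sweeps per improvement step: `j_truncate = 1` behaves like value iteration, while a very large value approaches policy iteration. Stopping when the greedy policy no longer changes is just one simple choice of termination rule.

```python
import numpy as np

# A minimal truncated policy iteration sketch (same illustrative model format).
def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Truncated policy evaluation: only j_truncate sweeps of
        # v <- r_pi + gamma * P_pi v, instead of iterating to convergence.
        r_pi = R[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to the truncated value estimate.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy
```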

Regarding convergence, we have the following value improvement proposition:

Figure: Proposition (Value Improvement)
Figure: Illustration

So truncated policy iteration is actually a trade-off between Value Iteration and Policy Iteration.
