
Chapter 4 Value Iteration and Policy Iteration

RyanLee_ljx · about 3 min · RL


In the last chapter, we studied the Bellman Optimality Equation (BOE). In this chapter we introduce three model-based iterative algorithms for solving the BOE and deriving the optimal policy: value iteration, policy iteration, and truncated policy iteration.

Value Iteration

$$v = f(v) = \max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v)$$

In the last chapter, we know that the contraction mapping theorem suggests an iterative algorithm:

$$v_{k+1} = f(v_k) = \max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v_k), \quad k = 0, 1, 2, \dots$$

$v_0$ can be arbitrary. We can solve this equation with the value iteration algorithm.

Each iteration of the algorithm consists of two steps:

  1. Step 1 Policy Update (PU)

Given the current value $v_k(s)$ (with $v_0(s)$ arbitrary), calculate the action values $q(v_k(s), a)$. For each state, choose the action with the largest action value to obtain the updated policy.

Note that this value is not a state value. It is just an intermediate value (an estimate of the state value).

  2. Step 2 Value Update (VU)

Given the new policy, calculate the new value $v_{k+1}(s)$. Since the policy is deterministic, the new value equals the largest action value.

Figure: Value Iteration

The procedure can be summarized as:

$$v_k(s) \to q(v_k(s), a) \to \pi_{k+1}(a|s) = \arg\max_{a} q(v_k(s), a) \to v_{k+1}(s) = \max_{a} q(v_k(s), a)$$

Pseudocode: Value iteration algorithm
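To make the two steps concrete, below is a minimal Python sketch of value iteration on a tabular MDP. The model format is an assumption for illustration: `P[s, a, s']` holds the transition probabilities $p(s'|s,a)$ and `R[s, a]` holds the expected immediate reward; the function name and stopping tolerance are likewise illustrative, not part of the pseudocode above.

```python
import numpy as np

# A minimal value iteration sketch on a tabular MDP (illustrative model format):
#   P[s, a, s'] = p(s' | s, a)   -- transition probabilities
#   R[s, a]     = expected immediate reward for taking action a in state s
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                    # v_0 can be arbitrary
    while True:
        q = R + gamma * P @ v                 # q(v_k(s), a), shape (n_states, n_actions)
        policy = q.argmax(axis=1)             # policy update: greedy, deterministic
        v_new = q.max(axis=1)                 # value update: v_{k+1}(s) = max_a q(v_k(s), a)
        if np.max(np.abs(v_new - v)) < tol:   # stop once the contraction has converged
            return v_new, policy
        v = v_new
```

Each loop iteration performs one policy update followed by one value update; the contraction mapping theorem guarantees the stopping condition is eventually met.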

::: tips Example for Value Iteration

:::

Policy Iteration

  1. Step 1 Policy Evaluation (PE)

That is, calculate the true state value of the current policy $\pi_k$ by solving the Bellman Equation with an iterative algorithm:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
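In practice, this linear equation is solved with a fixed-point iteration; writing the inner evaluation sweeps with a superscript $(j)$, starting from an arbitrary $v_{\pi_k}^{(0)}$:

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots$$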

Note that this step is itself an iterative process, namely a nested iteration inside policy iteration. It is not value iteration: value iteration solves the BOE, whereas the iterative algorithm here only solves the BE of the current policy.

  2. Step 2 Policy Improvement (PI)

Now we have the state value of the current policy. We can obtain the action value through

$$q_{\pi}(s, a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi}(s')$$

Then the improved policy is

$$\pi_{k+1}(a|s) = \arg\max_{a} q_{\pi_k}(s, a)$$

Or in one equation

$$\pi_{k+1} = \arg\max_{\pi}\,(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$$

This is the same as the elementwise formula above, since taking the action with the maximum action value in each state is exactly how the greedy (currently optimal) policy is obtained.

Note that here $P$ is $P_{\pi}$, not a fixed $P_{\pi_k}$, because $\pi$ is what we are solving for now: it is unknown. We derive it by maximizing the right-hand side over $\pi$, which amounts to finding the maximum action value in each state.

Figure: Policy Iteration Process
Figure: Policy Iteration
Pseudocode: Policy iteration algorithm
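For comparison with the value iteration sketch earlier, here is a minimal Python sketch of policy iteration under the same illustrative model format (`P[s, a, s']`, `R[s, a]`). The inner `while` loop is the nested policy evaluation iteration; the outer loop alternates evaluation and improvement until the policy stops changing.

```python
import numpy as np

# A minimal policy iteration sketch (same illustrative model format as above):
#   P[s, a, s'] = p(s' | s, a),  R[s, a] = expected immediate reward
def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: iteratively solve v = r_pi + gamma * P_pi v (nested iteration)
        r_pi = R[np.arange(n_states), policy]       # r_pi(s) under the current policy
        P_pi = P[np.arange(n_states), policy]       # P_pi(s, s') under the current policy
        v = np.zeros(n_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                break
            v = v_new
        # Policy improvement: pi_{k+1}(s) = argmax_a q_{pi_k}(s, a)
        q = R + gamma * P @ v_new
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # stop when the policy no longer changes
            return v_new, policy
        policy = new_policy
```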

Differences between Value Iteration and Policy Iteration

  • Value Iteration repeatedly updates state value estimates using the Bellman optimality operator (which takes a max over actions). VI maintains value estimates and implicitly (greedily) updates a policy from those values.

  • Policy Iteration alternates two distinct steps: policy evaluation (compute the true state value for the current policy π\pi) and policy improvement (make the policy greedy with respect to the state value). Policy Iteration explicitly maintains a policy and its evaluated value. It focuses on producing a sequence of increasingly better policies by repeatedly computing accurate values for each policy then improving.

So we can summarize as follows:

  • Value Iteration starts with an initial guess of the state value, while Policy Iteration starts with an initial policy.

  • Value Iteration stores estimates of the state value during iteration, while Policy Iteration stores a policy and the true state value under that policy.

  • Value Iteration updates state values by taking the maximum over actions at each step, while Policy Iteration alternates between policy evaluation and policy improvement.

  • Value Iteration performs one iterative process of updating state values until convergence, while Policy Iteration performs two processes: evaluation (of current policy) and improvement (of policy based on evaluated values).

  • Value Iteration implicitly updates the policy by choosing the best action greedily from the current value estimates, while Policy Iteration explicitly computes a policy based on evaluated values.

Figure: Comparison between VI and PI

Truncated policy iteration algorithm

Truncated policy iteration is a combination of the two. In other words, policy iteration and value iteration are the two extreme cases of truncated policy iteration.

Figure: Truncated policy iteration

From the illustration we can see that policy iteration runs a full iterative process to obtain the true state value, while value iteration performs just one iteration. Truncated policy iteration cuts in between, stopping at a reasonably good intermediate estimate of the state value.

We can summarize it as follows: Truncated Policy Iteration is a family of algorithms between Value Iteration and Policy Iteration: perform a limited number (one or more) of evaluation sweeps for the current policy, then do a policy improvement. Value Iteration and Policy Iteration are the extreme points of this family (in an intuitive sense): Policy Iteration does a full evaluation before each improvement, Value Iteration does a single sweep before each greedy improvement, and truncated (also called modified) policy iteration interpolates between them.

Pseudocode: Truncated policy iteration algorithm
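Again under the same illustrative model format, a minimal sketch of the truncated variant follows. The parameter `j_truncate` (an assumed name) is the number of evaluation sweeps per improvement step: `j_truncate = 1` behaves like value iteration, while a very large value approaches policy iteration. Stopping when the greedy policy no longer changes is just one simple choice of termination rule.

```python
import numpy as np

# A minimal truncated policy iteration sketch (same illustrative model format).
def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Truncated policy evaluation: only j_truncate sweeps of
        # v <- r_pi + gamma * P_pi v, instead of iterating to convergence.
        r_pi = R[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to the truncated value estimate.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy
```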

Regarding convergence, we have the following value improvement proposition:

Figure: Proposition (Value Improvement)
Figure: Illustration

So truncated policy iteration is actually a trade-off between Value Iteration and Policy Iteration.
