Finding optimal policies in Gridworld

Non-deterministic outcomes

MDP Transition Model

Agent-Environment Interaction with Rewards

Partial Transition Graph (A) 4x3 Maze

  • Transition Model Graph:
    • Each node is a state.
    • Each edge is labeled with the probability of transitioning from one state to another under a given action.
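
As a concrete illustration (a sketch, not part of the original notes), such a graph can be stored as a nested mapping from state and action to a distribution over next states; the single entry below uses the 0.8/0.1/0.1 slip probabilities that appear later for state (3,3) under action RIGHT.

```python
# One possible encoding of the transition-model graph: P[state][action] maps each
# reachable next state to its transition probability. The entry shown is for
# state (3,3) under action "RIGHT" (0.8 intended move to (4,3), 0.1 slip up which
# bounces off the wall and stays in (3,3), 0.1 slip down to (3,2)).
P = {
    (3, 3): {
        "RIGHT": {(4, 3): 0.8, (3, 3): 0.1, (3, 2): 0.1},
    },
}

# Probability of reaching (4,3) from (3,3) when taking RIGHT:
print(P[(3, 3)]["RIGHT"][(4, 3)])   # 0.8
```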

Utility for MDP

Robot in (3,3) of Maze
  • Since the environment is stochastic, we need to take the transition probability matrix into account.
  • The utility of a state is its immediate reward plus the expected discounted utility of the next state, given the action taken.
  • Bellman’s equation: if we choose an action $a$, then

$$U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s')$$

Robot in (3,3) of Maze

  • Suppose the robot is in state $s_{33}$ and the action taken is “RIGHT”. Also assume $\gamma = 1$.
  • We want to compute the utility of this state: $U(s_{33}) = R(s_{33}) + \gamma \left( P(s_{43} \mid s_{33}, \rightarrow) U(s_{43}) + P(s_{33} \mid s_{33}, \rightarrow) U(s_{33}) + P(s_{32} \mid s_{33}, \rightarrow) U(s_{32}) \right)$
  • Substituting, we get:

$$U(s_{33}) = R(s_{33}) + \gamma \left( (0.8 \times U(s_{43})) + (0.1 \times U(s_{33})) + (0.1 \times U(s_{32})) \right)$$
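
To make the structure of this computation concrete, here is a small sketch with hypothetical placeholder utilities for the neighbouring states (only the living reward $-0.04$, the transition probabilities, and $\gamma = 1$ come from the maze; the utility values are made up).

```python
# Bellman backup for s33 under action RIGHT with *placeholder* utility values
# for the neighbouring states; only the structure of the computation matters here.
gamma = 1.0
R_s33 = -0.04                                   # living reward used in this maze
P_right = {(4, 3): 0.8, (3, 3): 0.1, (3, 2): 0.1}
U = {(4, 3): 1.0, (3, 3): 0.5, (3, 2): 0.2}     # hypothetical utility estimates
U_s33 = R_s33 + gamma * sum(p * U[s] for s, p in P_right.items())
print(round(U_s33, 3))   # 0.83 with these placeholder values
```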

Policy for MDP

  • If we choose the action $a$ that maximizes future rewards, the utility of $s$ is the maximum we can get over all possible choices of actions, and is represented as $U(s)$.

  • We can write this as $U(s) = R(s) + \gamma \max_a \left( \sum_{s'} P(s' \mid s, a)\, U(s') \right)$

  • The optimal policy (which recommends the action $a$ that maximizes $U$) is given by:

$$\pi^*(s) = \arg\max_a \left( \sum_{s'} P(s' \mid s, a)\, U(s') \right)$$

  • Can the above two equations be solved directly?

    • The set of $|S|$ equations for $U(s)$ cannot be solved directly because they are non-linear due to the presence of the ‘max’ function.

    • The set of $|S|$ equations for $\pi^*(s)$ cannot be solved directly because it depends on the unknown $U(s)$.

Optimal Policy for MDP

Optimal Policy and Utilities for Maze

Value Iteration

  • To solve the non-linear equations for U(s) we use an iterative approach.
  • Steps:
    • Initialize estimates for the utilities of states with arbitrary values: $U(s) \leftarrow 0 \;\;\forall s \in S$
    • Next, use the iteration step below, which is also called the Bellman update:

$$V_{t+1}(s) \leftarrow R(s) + \gamma \max_a \left[ \sum_{s'} P(s' \mid s, a)\, U_t(s') \right] \;\;\forall s \in S$$

This step is repeated until the utility estimates converge.
  • Let us apply this to the maze example, assuming $\gamma = 1$ (a code sketch follows the figure below).

Figure: Initialize value estimates to 0.
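
Below is a minimal Python sketch of this procedure for the 4x3 maze (not code from the notes). It encodes the assumptions implied by the computations that follow: states indexed as (column, row), a living reward of $-0.04$, terminal utilities $+1$ at (4,3) and $-1$ at (4,2), an obstacle at (2,2), a 0.8/0.1/0.1 slip model, and $\gamma = 1$; the helper names are choices made for this sketch.

```python
# Value iteration for the 4x3 maze, under the assumptions stated above.
GAMMA = 1.0
LIVING_REWARD = -0.04
COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}           # terminal utilities
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
STATES = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
          if (c, r) != OBSTACLE]

def move(state, action):
    """Intended move; bumping into a wall or the obstacle leaves the state unchanged."""
    c, r = state
    dc, dr = ACTIONS[action]
    nxt = (c + dc, r + dr)
    if nxt == OBSTACLE or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def transitions(state, action):
    """(probability, next_state) pairs: 0.8 intended move, 0.1 for each perpendicular slip."""
    side_a, side_b = PERPENDICULAR[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, side_a)),
            (0.1, move(state, side_b))]

def bellman_backup(state, U):
    """V_{t+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_t(s')."""
    if state in TERMINALS:
        return TERMINALS[state]
    best = max(sum(p * U[s2] for p, s2 in transitions(state, a)) for a in ACTIONS)
    return LIVING_REWARD + GAMMA * best

def value_iteration(tol=1e-6, max_iters=1000):
    """Repeat the Bellman update for every state until the estimates stop changing."""
    U = {s: 0.0 for s in STATES}
    for s, r in TERMINALS.items():
        U[s] = r
    for _ in range(max_iters):
        new_U = {s: bellman_backup(s, U) for s in STATES}
        if max(abs(new_U[s] - U[s]) for s in STATES) < tol:
            return new_U
        U = new_U
    return U

if __name__ == "__main__":
    U = value_iteration()
    print({s: round(u, 3) for s, u in U.items()})   # converged estimates; U[(3, 3)] ≈ 0.918
```

The hand calculations on the next slides correspond to the first two sweeps of this loop.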

Value Iteration

  • Next, we want to apply the Bellman update: $V_{t+1}(s) \leftarrow R(s) + \gamma \max_a \left[ \sum_{s'} P(s' \mid s, a)\, U_t(s') \right] \;\;\forall s \in S$
  • Since we are taking the max, we only need to consider states that have a next state with positive utility.
  • For the remaining states, the utility equals the immediate reward in the first iteration.

States to consider for value iteration

Value Iteration (t=0)

$$V_{t+1}(s_{33}) = R(s_{33}) + \gamma \max_a \left[ \sum_{s'} P(s' \mid s_{33}, a)\, U_t(s') \right]$$

$$V_{t+1}(s_{33}) = -0.04 + \max\left[ \sum_{s'} P(s' \mid s_{33}, \uparrow)\, U_t(s'),\; \sum_{s'} P(s' \mid s_{33}, \downarrow)\, U_t(s'),\; \sum_{s'} P(s' \mid s_{33}, \leftarrow)\, U_t(s'),\; \sum_{s'} P(s' \mid s_{33}, \rightarrow)\, U_t(s') \right]$$

$$V_{t+1}(s_{33}) = -0.04 + \sum_{s'} P(s' \mid s_{33}, \rightarrow)\, U_t(s')$$

$$V_{t+1}(s_{33}) = -0.04 + P(s_{43} \mid s_{33}, \rightarrow)\, U_t(s_{43}) + P(s_{33} \mid s_{33}, \rightarrow)\, U_t(s_{33}) + P(s_{32} \mid s_{33}, \rightarrow)\, U_t(s_{32})$$

$$V_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0 + 0.1 \times 0 = 0.76$$
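
As a quick check of this arithmetic, the sketch below performs the same single backup in code, reusing the hypothetical helpers (`STATES`, `transitions`, `LIVING_REWARD`, `GAMMA`) defined in the value-iteration sketch above; `backup` is a helper name introduced here.

```python
# One Bellman backup for s33 under action RIGHT at t = 0, reusing the helpers
# from the value-iteration sketch above (an assumption of this sketch).
def backup(state, action, U):
    return LIVING_REWARD + GAMMA * sum(p * U[s2] for p, s2 in transitions(state, action))

U0 = {s: 0.0 for s in STATES}          # initial value estimates
U0[(4, 3)], U0[(4, 2)] = 1.0, -1.0     # terminal states keep their rewards
print(round(backup((3, 3), "RIGHT", U0), 3))   # 0.76
```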

Value Iteration (t=1)

Figure: (A) Initial utility estimates for iteration 2. (B) States with a next state of positive utility.

$$V_{t+1}(s_{33}) = -0.04 + P(s_{43} \mid s_{33}, \rightarrow)\, U_t(s_{43}) + P(s_{33} \mid s_{33}, \rightarrow)\, U_t(s_{33}) + P(s_{32} \mid s_{33}, \rightarrow)\, U_t(s_{32})$$

$$V_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0.76 + 0.1 \times 0 = 0.836$$

$$V_{t+1}(s_{23}) = -0.04 + P(s_{33} \mid s_{23}, \rightarrow)\, U_t(s_{33}) + P(s_{23} \mid s_{23}, \rightarrow)\, U_t(s_{23}) = -0.04 + 0.8 \times 0.76 + 0.2 \times 0 = 0.568$$

$$V_{t+1}(s_{32}) = -0.04 + P(s_{33} \mid s_{32}, \uparrow)\, U_t(s_{33}) + P(s_{42} \mid s_{32}, \uparrow)\, U_t(s_{42}) + P(s_{32} \mid s_{32}, \uparrow)\, U_t(s_{32}) = -0.04 + 0.8 \times 0.76 + 0.1 \times (-1) + 0.1 \times 0 = 0.468$$
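
Continuing the same sketch, the backups for iteration $t = 1$ reproduce the three values above (`backup` and `U0` come from the snippet after the $t = 0$ computation):

```python
# Iteration t = 1: start from the estimates produced at t = 0 and redo the backups.
U1 = dict(U0)
U1[(3, 3)] = 0.76                               # value computed at t = 0
print(round(backup((3, 3), "RIGHT", U1), 3))    # 0.836
print(round(backup((2, 3), "RIGHT", U1), 3))    # 0.568
print(round(backup((3, 2), "UP", U1), 3))       # 0.468
```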

Value Iteration (t=2)

Figure: (A) Initial utility estimates for iteration 3. (B) States with a next state of positive utility.

  • Information propagates outward from the terminal states, and eventually all states have correct value estimates.
  • Notice that $s_{32}$ has a lower utility than $s_{23}$ because of the negative-reward state (red oval) next to $s_{32}$.

Optimal Policy and Utilities for Maze

Value Iteration - Convergence

  • The rate of convergence depends on the maximum reward value and, more importantly, on the discount factor $\gamma$.
  • The policy obtained from coarse estimates is close to the optimal policy long before $U$ has converged.
  • This means that after a reasonable number of iterations, we could use: $\pi(s) = \arg\max_a \left[ \sum_{s'} P(s' \mid s, a)\, V_{est}(s') \right]$
  • Note that this is a form of greedy policy.

Figure: Convergence of the utility estimates for the maze problem (Russell & Norvig, Chapter 17).

  • For the maze problem, convergence is reached within 5 to 10 iterations
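
As a sketch of this greedy read-out (reusing the hypothetical `STATES`, `ACTIONS`, `TERMINALS`, `transitions` and `value_iteration` helpers from the value-iteration sketch earlier), the policy can be extracted from estimates produced after only a few sweeps:

```python
# Greedy policy with respect to (possibly non-converged) value estimates.
def greedy_policy(U):
    return {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

coarse_U = value_iteration(max_iters=5)   # only a few Bellman sweeps
print(greedy_policy(coarse_U))            # already matches the optimal policy for most states
```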

Policy Iteration

  • Alternates between two steps:

    • Policy evaluation: given a policy, find the utility of states
    • Policy improvement: given the utility estimates so far, find the best policy
  • The steps are as follows:

    1. Compute the utility/value of the current policy, $U^\pi$
    2. Update $\pi$ to be the greedy policy w.r.t. $U^\pi$: $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\, U^\pi(s')$
    3. If the policy changed, return to step 1
  • The policy improves at each step and converges to the optimal policy $\pi^*$ (a sketch follows the figure below)

Figure: Policy iteration (source: Sutton & Barto, Chapter 4).
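
Below is a minimal sketch of the two alternating steps for the same maze (not code from the notes). It reuses the hypothetical helpers from the earlier sketches (`STATES`, `TERMINALS`, `transitions`, `LIVING_REWARD`, `GAMMA`, `greedy_policy`), and evaluates each fixed policy by repeated sweeps of its Bellman equation rather than by solving the linear system directly.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
def evaluate_policy(policy, sweeps=100):
    """Approximate U^pi for a fixed policy with iterative sweeps."""
    U = {s: 0.0 for s in STATES}
    for s, r in TERMINALS.items():
        U[s] = r
    for _ in range(sweeps):
        for s in STATES:
            if s in TERMINALS:
                continue
            U[s] = LIVING_REWARD + GAMMA * sum(
                p * U[s2] for p, s2 in transitions(s, policy[s]))
    return U

def policy_iteration(max_rounds=100):
    """Improve the policy until it stops changing; the fixed point is optimal."""
    policy = {s: "UP" for s in STATES if s not in TERMINALS}   # arbitrary starting policy
    U = evaluate_policy(policy)
    for _ in range(max_rounds):
        new_policy = greedy_policy(U)          # improvement step
        if new_policy == policy:               # unchanged => optimal policy found
            break
        policy = new_policy
        U = evaluate_policy(policy)            # evaluation step
    return policy, U

print(policy_iteration()[0])   # recommended action for each non-terminal state
```

Because the greedy improvement step can only keep or increase each state's value, the loop terminates with the optimal policy $\pi^*$.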
