Finding optimal policies in Gridworld
- The environment is stochastic yet fully observable.
- Due to uncertainty, an action causes a transition from one state to another with some probability. There is no dependence on previous states (the Markov property).
- We now have a sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards (a minimal code representation is sketched after this list), consisting of:
  - a set of states $s \in S$; the state at time $t$ is $s_t$
  - a set of actions $a \in A$; the action at time $t$ is $a_t$
  - a transition model $P(s' \mid s, a)$ describing the outcome of each action in each state
  - a reward function $R(s)$
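To make the four components concrete, here is a minimal sketch of how they could be represented in Python. The two-state problem, its action names, and all numbers are hypothetical, chosen only to illustrate the data structures.

```python
# Hypothetical two-state MDP used only to illustrate the four components above.

states = ["s0", "s1"]
actions = ["stay", "go"]

# Transition model P(s' | s, a) as a nested dict: P[s][a][s2] = probability.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.5, "s1": 0.5}},
}

# Reward function R(s): immediate reward received in each state.
R = {"s0": -0.04, "s1": 1.0}

# Sanity check: for every (state, action), outgoing probabilities sum to 1.
for s in states:
    for a, dist in P[s].items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```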
MDP Transition Model
- Transition Model Graph:
- Each node is a state.
- Each edge is labeled with the probability of the corresponding transition.
Utility for MDP
- Since the environment is stochastic, we need to take the transition probabilities into account.
- The utility of a state is the immediate reward of the state plus the expected discounted utility of the next state, given the action taken.
- Bellman’s Equations: if we choose an action $a$ in state $s$, then
  $U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s')$
- Suppose the robot is in some state $s$ and the action taken is “RIGHT”. Also assume the intended move succeeds with probability 0.8 and the robot slips to either perpendicular neighbour with probability 0.1 each.
- We want to compute the utility of this state. Substituting these probabilities into the Bellman equation gives (see the snippet below):
  $U(s) = R(s) + \gamma \, [\, 0.8\, U(s_{\text{right}}) + 0.1\, U(s_{\text{up}}) + 0.1\, U(s_{\text{down}}) \,]$
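The substitution can be reproduced in a few lines. The sketch below assumes the 0.8/0.1/0.1 transition split described above and plugs in placeholder neighbour utilities and reward; the numbers are illustrative, not the actual maze values.

```python
# Single-state Bellman computation for action RIGHT (placeholder numbers).

gamma = 1.0   # discount factor (assumed)
R_s = -0.04   # immediate reward of the current state (assumed)

# Utility estimates of the states reachable when trying to move RIGHT.
U_right, U_up, U_down = 0.5, 0.2, -0.1   # placeholder values

# U(s) = R(s) + gamma * [0.8*U(right) + 0.1*U(up) + 0.1*U(down)]
U_s = R_s + gamma * (0.8 * U_right + 0.1 * U_up + 0.1 * U_down)
print(U_s)   # 0.37 with these placeholder values
```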
Policy for MDP
If we choose the action $a$ that maximizes future rewards, $U(s)$ is the maximum we can get over all possible choices of actions. We can write this as
$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$
The optimal policy (which recommends the action $a$ that maximizes $U$) is given by:
$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$
Can the above equations be solved directly?
- The set of equations for $U(s)$ cannot be solved directly because they are non-linear due to the presence of the ‘max’ function.
- The set of equations for $\pi^*(s)$ cannot be solved directly because it depends on the unknown $U(s)$.
Optimal Policy for MDP
Value Iteration
- To solve the non-linear equations for $U(s)$ we use an iterative approach (a code sketch follows these steps).
- Steps:
  - Initialize estimates for the utilities of states with arbitrary values: $U_0(s) = 0$ for all $s$.
  - Next use the iteration step below, which is also called the Bellman update:
    $U_{t+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_t(s')$
  - This step is repeated until the utility estimates converge.
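The loop below is a compact sketch of value iteration under the dict-based MDP representation assumed in the earlier snippet (P[s][a][s2] transition probabilities, R[s] rewards); it is illustrative rather than the exact maze setup from the slides.

```python
def value_iteration(states, P, R, gamma=0.9, tol=1e-6):
    # U_0(s) = 0 for all s (terminal states should be given a self-loop
    # action in P so that the max over actions is well defined).
    U = {s: 0.0 for s in states}
    while True:
        U_next = {}
        for s in states:
            # Bellman update: U_{t+1}(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U_t(s')
            U_next[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[s][a].items()) for a in P[s]
            )
        # Stop when the largest change across states falls below the tolerance.
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next


# Tiny usage example with hypothetical numbers.
states = ["s0", "s1"]
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}},   # modelled with a single self-loop action
}
R = {"s0": -0.04, "s1": 1.0}
print(value_iteration(states, P, R, gamma=0.9))
```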
- Let us apply this to the maze example. Assume the settings of the Norvig Chapter 17 maze: $R(s) = -0.04$ for non-terminal states and $\gamma = 1$.
- Initialize the value estimates to zero: $U_0(s) = 0$ for all $s$.
Value Iteration
- Next we want to apply the Bellman update:
  $U_{t+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_t(s')$
- Since we are taking $\max_a$, we only need to consider states whose next states have a positive utility value.
- For the remaining states, the utility is equal to the immediate reward in the first iteration.
Value Iteration (t=0)
Value Iteration (t=1)
(A) Initial utility estimates for iteration 2. (B) States whose next state has a positive utility.
Value Iteration (t=2)
(A) Initial utility estimates for iteration 3. (B) States whose next state has a positive utility.
- Information propagates outward from terminal states and eventually all states have correct value estimates
- Notice that the state next to the red oval state (the one with negative reward) ends up with a lower utility than comparable states, because of that penalizing neighbour.
Value Iteration - Convergence
- The rate of convergence depends on the maximum reward value and, more importantly, on the discount factor $\gamma$.
- The policy that we get from coarse estimates is close to the optimal policy long before $U$ has converged.
- This means that after a reasonable number of iterations, we could use:
  $\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U_t(s')$
- Note that this is a form of greedy policy.
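The greedy policy extraction above is just an argmax over actions of the expected next-state utility. A minimal sketch, assuming the same dict-based transition model as the earlier snippets and a utility dict U produced by value iteration:

```python
def greedy_policy(P, U):
    # pi(s) = argmax_a sum_s' P(s'|s,a) * U(s')
    pi = {}
    for s, action_dists in P.items():
        pi[s] = max(
            action_dists,
            key=lambda a: sum(p * U[s2] for s2, p in action_dists[a].items()),
        )
    return pi
```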
Convergence of utility for the maze problem (Norvig chap 17)
- For the maze problem, convergence is reached within 5 to 10 iterations
Policy Iteration
Alternates between two steps:
- Policy evaluation: given a policy, find the utility of states
- Policy improvement: given the utility estimates so far, find the best policy
The steps are as follows:
1. Policy evaluation: compute the utility/value of the current policy $\pi$:
   $U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$
2. Policy improvement: update $\pi$ to be a greedy policy w.r.t. $U^{\pi}$:
   $\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U^{\pi}(s')$
3. If the policy changed, then return to step 1.
The policy improves at each step and converges to the optimal policy.
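A minimal policy-iteration sketch, again assuming the dict-based MDP representation from the earlier snippets; here policy evaluation is approximated by repeated sweeps rather than by solving the linear system exactly.

```python
def policy_iteration(states, P, R, gamma=0.9, eval_sweeps=100):
    # Start from an arbitrary policy: the first action available in each state.
    pi = {s: next(iter(P[s])) for s in states}
    while True:
        # 1. Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s'),
        #    approximated with a fixed number of sweeps.
        U = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            U = {
                s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
                for s in states
            }
        # 2. Policy improvement: make pi greedy with respect to U.
        new_pi = {
            s: max(P[s], key=lambda a, s=s: sum(p * U[s2] for s2, p in P[s][a].items()))
            for s in states
        }
        # 3. If the policy is unchanged it is (approximately) optimal; otherwise repeat.
        if new_pi == pi:
            return pi, U
        pi = new_pi
```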