(A) Initial utility estimates for iteration 3. (B) States whose next state has positive utility.
Information propagates outward from the terminal states, and eventually all states have correct value estimates.
Notice that \(s_{32}\) has a lower utility than \(s_{23}\) because of the red oval state with negative reward next to \(s_{32}\).
Optimal Policy and Utilities for Maze
Value Iteration - Convergence
The rate of convergence depends on the maximum reward value and, more importantly, on the discount factor \(\gamma\).
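To make the \(\gamma\) dependence concrete, the snippet below is a minimal sketch that evaluates the standard contraction bound on the number of sweeps needed to reach a given error \(\epsilon\): each sweep shrinks the error by a factor of \(\gamma\), starting from at most \(2 R_{max} / (1-\gamma)\). The values of \(R_{max}\) and \(\epsilon\) are arbitrary illustrative choices, and the bound only applies when \(\gamma < 1\).

```python
import math

def sweeps_needed(gamma, r_max=1.0, eps=1e-3):
    """Upper bound on value-iteration sweeps to reach max error <= eps.

    Uses the contraction argument: the error shrinks by a factor of gamma
    per sweep, starting from at most 2 * r_max / (1 - gamma).
    Only meaningful for 0 < gamma < 1.
    """
    return math.ceil(math.log(2 * r_max / (eps * (1 - gamma))) / math.log(1 / gamma))

for g in (0.5, 0.9, 0.99):
    print(f"gamma = {g}: at most {sweeps_needed(g)} sweeps")
```

The bound grows sharply as \(\gamma \to 1\), which is why heavily discounted problems converge in far fewer sweeps.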
The policy that we get from coarse estimates is close to the optimal policy long before \(V\) has converged.
This means that after a reasonable number of iterations, we could use: \[\pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V_{est}(s')\]
Note that this is a greedy policy with respect to the current estimate \(V_{est}\).
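A minimal sketch of this extraction step, assuming the transition model is available as a dictionary `P[(s, a)]` of `(probability, next_state)` pairs; the function name and the tiny two-state MDP below are made up for illustration and are not part of the maze example.

```python
def greedy_policy(V_est, states, actions, P):
    """Extract pi(s) = argmax_a sum_{s'} P(s'|s,a) * V_est(s')."""
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            q = sum(prob * V_est[s_next] for prob, s_next in P[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a
    return policy

# Tiny two-state example (made up for illustration)
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    ("s0", "stay"): [(1.0, "s0")],
    ("s0", "go"):   [(0.9, "s1"), (0.1, "s0")],
    ("s1", "stay"): [(1.0, "s1")],
    ("s1", "go"):   [(1.0, "s0")],
}
V_est = {"s0": 0.0, "s1": 1.0}
print(greedy_policy(V_est, states, actions, P))  # {'s0': 'go', 's1': 'stay'}
```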
Convergence of the utility estimates for the maze problem (Russell & Norvig, Chapter 17).
For this problem, convergence is reached within 5 to 10 iterations.
```python
import numpy as np
import pandas as pd

# Gridworld parameters (AIMA Chapter 17 example)
rows, cols = 3, 4
step_reward = -0.04
gamma = 1.0  # discount factor; the AIMA 4x3 example is undiscounted

# Terminal states and their rewards, as (row, col) with 0-indexed rows from the top
terminal_states = {
    (0, 3): 1.0,   # top-right
    (1, 3): -1.0,  # just below top-right
}
walls = {(1, 1)}  # wall at position (row=1, col=1)

# Initialize the value function; terminal states hold their terminal rewards
V = np.zeros((rows, cols))
for (i, j), r in terminal_states.items():
    V[i, j] = r

# Define actions as (row, col) displacements
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
action_list = list(actions.values())
```
```python
def next_state_reward(state, action):
    i, j = state
    di, dj = action
    ni, nj = i + di, j + dj
    # If the next state is off-grid or a wall, stay in place
    if (ni < 0 or ni >= rows or nj < 0 or nj >= cols) or ((ni, nj) in walls):
        return state, step_reward
    # Moving into any cell (including a terminal one) costs the step reward;
    # the terminal payoff itself is already stored in V
    return (ni, nj), step_reward


def transitions(state, action):
    # Stochastic model: intended direction with prob 0.8, perpendicular slips 0.1 each
    if action == actions["up"]:
        slips = [actions["left"], actions["right"]]
    elif action == actions["down"]:
        slips = [actions["right"], actions["left"]]
    elif action == actions["left"]:
        slips = [actions["down"], actions["up"]]
    else:  # right
        slips = [actions["up"], actions["down"]]

    results = []
    # Intended move
    s1, r1 = next_state_reward(state, action)
    results.append((0.8, s1, r1))
    # Perpendicular slips
    for slip in slips:
        s2, r2 = next_state_reward(state, slip)
        results.append((0.1, s2, r2))
    return results


# Value Iteration
theta = 1e-4
while True:
    delta = 0
    newV = V.copy()
    for i in range(rows):
        for j in range(cols):
            if (i, j) in terminal_states or (i, j) in walls:
                continue
            q_values = []
            for action in action_list:
                q = 0
                for prob, s_prime, reward in transitions((i, j), action):
                    q += prob * (reward + gamma * V[s_prime])
                q_values.append(q)
            newV[i, j] = max(q_values)
            delta = max(delta, abs(newV[i, j] - V[i, j]))
    V = newV
    if delta < theta:
        break

# Display the resulting value function
df = pd.DataFrame(
    V,
    index=[f"row{r}" for r in range(rows)],
    columns=[f"col{c}" for c in range(cols)],
)
print(df)
```
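As a follow-up sketch (not part of the original listing), the greedy policy from the equation above can be read off from the converged values. The cell below reuses `V`, `transitions`, `action_list`, `gamma`, `terminal_states`, and `walls` from the cells above and prints the policy as a grid of arrows.

```python
# Map action vectors back to arrows for display
arrow = {(-1, 0): "^", (1, 0): "v", (0, -1): "<", (0, 1): ">"}

policy = {}
for i in range(rows):
    for j in range(cols):
        if (i, j) in terminal_states or (i, j) in walls:
            continue
        # Greedy action: argmax_a sum_{s'} P(s'|s,a) * (r + gamma * V(s'))
        best_a = max(
            action_list,
            key=lambda a: sum(
                prob * (reward + gamma * V[s_prime])
                for prob, s_prime, reward in transitions((i, j), a)
            ),
        )
        policy[(i, j)] = best_a

for i in range(rows):
    row = []
    for j in range(cols):
        if (i, j) in terminal_states:
            row.append("+1" if terminal_states[(i, j)] > 0 else "-1")
        elif (i, j) in walls:
            row.append("##")
        else:
            row.append(arrow[policy[(i, j)]])
    print(" ".join(row))
```

The resulting arrows should match the optimal policy shown for the maze earlier in the section.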