Value Iteration in Gridworld Example

Note

Some of the figures use the term utility, which is synonymous with value.

  • To solve the non-linear equations for \(v^{*}(s)\) we use an iterative approach.
  • Steps:
    • Initialize the estimates of the state utilities with arbitrary values (here zero): \(v(s) \leftarrow 0 \;\; \forall s \in S\)
    • Next, repeatedly apply the iteration step below, which is also called the Bellman update:

\[v_{t+1}(s) \leftarrow R(s) + \gamma \underset{a}{\max} \left[ \sum_{s^{'}} P(s^{'}| s,a)\, v_t(s^{'}) \right] \quad \forall s \in S\]

This step is applied to every state and repeated until the value estimates converge; a minimal code sketch of one sweep follows the next bullet.
  • Let us apply this to the maze example. Assume that \(\gamma = 1\)
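
Before turning to the maze, here is a minimal code sketch of one such Bellman sweep. The containers are hypothetical stand-ins: states holds the non-terminal states, actions[s] the actions available in a state, P[(s, a)] a list of (probability, next_state) pairs, R[s] the immediate reward, and v the current value estimates with terminal values held fixed. A full gridworld implementation appears at the end of this section.

# One synchronous Bellman-update sweep (sketch; the containers are hypothetical)
def bellman_sweep(states, actions, P, R, v, gamma):
    v_new = dict(v)  # terminal values are kept fixed
    for s in states:  # non-terminal states only
        best = max(
            sum(p * v[s_next] for p, s_next in P[(s, a)]) for a in actions[s]
        )
        v_new[s] = R[s] + gamma * best
    return v_new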

Initialize value estimates to \(0\)

Value Iteration

  • Next we want to apply the Bellman update: \[v_{t+1}(s) \leftarrow R(s) + \gamma \max_{a} \left[\sum_{s^\prime} P(s^\prime | s,a)\,v_t(s^\prime) \right] \quad \forall s \in S\]
  • Since we are taking a \(\max\), in the first iteration we only need to consider states that have at least one successor with positive utility; a short code check of this follows the figure below.
  • For all remaining states, the value after the first iteration is just the immediate reward.

States to consider for value iteration
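
As a short check of this, the sketch below uses the same (column, row) state indexing as the text (e.g. \(s_{33}\)), with the wall at \(s_{22}\) and the terminal states at \(s_{43}\) and \(s_{42}\), and simply inspects the four neighbouring cells of each non-terminal state; at \(t = 0\) only \(s_{33}\) has a neighbour with positive value.

# Which states have a successor with positive value at t = 0? (sketch)
terminals = {(4, 3): 1.0, (4, 2): -1.0}   # (col, row) naming as in the text
wall = (2, 2)
v0 = {(c, r): 0.0 for c in range(1, 5) for r in range(1, 4) if (c, r) != wall}
v0.update(terminals)

def neighbours(s):
    c, r = s
    candidates = [(c + 1, r), (c - 1, r), (c, r + 1), (c, r - 1)]
    return [n for n in candidates if n in v0]

to_consider = [s for s in v0
               if s not in terminals and any(v0[n] > 0 for n in neighbours(s))]
print(to_consider)  # [(3, 3)] -- only s_33 is next to the +1 terminal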

Value Iteration (t=0)

\[ v_{t+1}(s_{33}) = R(s_{33}) + \gamma \max_a \left[\sum_{s^{'}} P(s^{'}| s_{33},a)\,v_t(s^{'}) \right] \]

\[ v_{t+1}(s_{33}) = -0.04 + \max_a \left[ \sum_{s'} P(s'| s_{33},\uparrow) v_t(s'), \sum_{s'} P(s'| s_{33},\downarrow)v_t(s'), \sum_{s'} P(s'| s_{33},\rightarrow) v_t(s'), \sum_{s'} P(s'| s_{33}, \leftarrow)v_t(s') \right]\]

\[v_{t+1}(s_{33}) = -0.04 + \sum_{s^{'}} P(s^{'}| s_{33},\rightarrow) v_t(s^\prime) \]

\[v_{t+1}(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)v_t(s_{43})+P(s_{33}|s_{33},\rightarrow)v_t(s_{33})+P(s_{32}|s_{33},\rightarrow)v_t(s_{32}) \]

\[v_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0 + 0.1 \times 0 = 0.76 \]
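
This first backup can be checked numerically; the 0.8/0.1/0.1 probabilities and the neighbouring values are the ones used above.

# Numeric check of the first backup for s_33 (moving right towards the +1 terminal)
step_reward, gamma = -0.04, 1.0
v1_s33 = step_reward + gamma * (0.8 * 1.0 + 0.1 * 0.0 + 0.1 * 0.0)
print(round(v1_s33, 3))  # 0.76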

Value Iteration (t=1)

(A) Initial utility estimates for iteration 2. (B) States with next state positive utility

\[v_{t+1}(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)v_t(s_{43})+P(s_{33}|s_{33},\rightarrow)v_t(s_{33}) +P(s_{32}|s_{33},\rightarrow)v_t(s_{32}) \]

\[v_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0.76 + 0.1 \times 0 = 0.836\]

\[v_{t+1}(s_{23}) = -0.04 + P(s_{33}|s_{23},\rightarrow)v_t(s_{33})+P(s_{23}|s_{23},\rightarrow)v_t(s_{23}) = -0.04 + 0.8 \times 0.76 + 0.2 \times 0 = 0.568\]

\[v_{t+1}(s_{32}) = -0.04 + P(s_{33}|s_{32},\uparrow)v_t(s_{33})+P(s_{42}|s_{32},\uparrow)v_t(s_{42}) +P(s_{32}|s_{32},\uparrow)v_t(s_{32})\] \[v_{t+1}(s_{32}) = -0.04 + 0.8 \times 0.76 + 0.1 \times -1 + 0.1 \times 0= 0.468\]
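
These three backups can also be verified numerically, using the estimates after the first iteration (\(v_t(s_{43}) = 1\), \(v_t(s_{33}) = 0.76\), all other values still \(0\)).

# Numeric check of the second-iteration backups
step_reward = -0.04
v_s33 = step_reward + 0.8 * 1.0 + 0.1 * 0.76 + 0.1 * 0.0     # right, towards +1
v_s23 = step_reward + 0.8 * 0.76 + 0.2 * 0.0                 # right, towards s_33
v_s32 = step_reward + 0.8 * 0.76 + 0.1 * (-1.0) + 0.1 * 0.0  # up, risking s_42
print(round(v_s33, 3), round(v_s23, 3), round(v_s32, 3))     # 0.836 0.568 0.468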

Value Iteration (t=2)

(A) Initial utility estimates for iteration 3. (B) States with next state positive utility

  • Information propagates outward from terminal states and eventually all states have correct value estimates
  • Notice that \(s_{32}\) has a lower utility compared to \(s_{23}\) due to the red oval state with negative reward next to \(s_{32}\)

Optimal Policy and Utilities for Maze

Value Iteration - Convergence

  • Rate of convergence depends on the maximum reward value and more importantly on the discount factor \(\gamma\).
  • The policy that we get from coarse estimates is close to the optimal policy long before \(V\) has converged.
  • This means that after a reasonable number of iterations, we could use: \[\pi(s) = \arg \max_a \left[ \sum_{s^{'}} P(s^{'}| s,a)V_{est}(s^{'}) \right]\]
  • Note that this is a greedy policy with respect to the current value estimates; a short policy-extraction example is given after the code at the end of this section.

Convergence of the utility estimates for the maze problem (Russell & Norvig, Chapter 17)

  • For this problem, convergence is reached within 5 to 10 iterations.

The Python code below implements value iteration for this gridworld and reproduces the utilities discussed above.
import numpy as np
import pandas as pd

# Gridworld parameters (AIMA Chapter 17 example)
rows, cols = 3, 4
step_reward = -0.04
gamma = 1.0  # undiscounted, as assumed in the maze example above

# Terminal states and their rewards (row, col) using 0-indexed rows from top
terminal_states = {
    (0, 3): 1.0,   # +1 terminal in the top-right corner
    (1, 3): -1.0,  # -1 terminal just below the top-right corner
}
walls = {(1, 1)}  # wall at position (row=1, col=1)

# Initialize value function
V = np.zeros((rows, cols))
for (i, j), r in terminal_states.items():
    V[i, j] = r

# Define actions
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
action_list = list(actions.values())
def next_state_reward(state, action):
    i, j = state
    di, dj = action
    ni, nj = i + di, j + dj
    # If next state is off-grid or a wall, stay in place
    if (ni < 0 or ni >= rows or nj < 0 or nj >= cols) or ((ni, nj) in walls):
        return state, step_reward
    # Every move out of a non-terminal state earns the living reward R(s) = -0.04;
    # the +1/-1 terminal rewards are stored in V and enter the backup through the
    # gamma * V[s'] term, so they are not added again here
    return (ni, nj), step_reward


def transitions(state, action):
    # Stochastic model: intended 0.8, perpendicular slips 0.1 each
    if action == actions["up"]:
        slips = [actions["left"], actions["right"]]
    elif action == actions["down"]:
        slips = [actions["right"], actions["left"]]
    elif action == actions["left"]:
        slips = [actions["down"], actions["up"]]
    else:  # right
        slips = [actions["up"], actions["down"]]
    # Gather transitions
    results = []
    # intended
    s1, r1 = next_state_reward(state, action)
    results.append((0.8, s1, r1))
    # slips
    for slip in slips:
        s2, r2 = next_state_reward(state, slip)
        results.append((0.1, s2, r2))
    return results


# Value Iteration
theta = 1e-4
while True:
    delta = 0
    newV = V.copy()
    for i in range(rows):
        for j in range(cols):
            if (i, j) in terminal_states or (i, j) in walls:
                continue
            q_values = []
            for action in action_list:
                q = 0
                for prob, s_prime, reward in transitions((i, j), action):
                    q += prob * (reward + gamma * V[s_prime])
                q_values.append(q)
            newV[i, j] = max(q_values)
            delta = max(delta, abs(newV[i, j] - V[i, j]))
    V = newV
    if delta < theta:
        break

# Display the resulting value function
df = pd.DataFrame(
    V, index=[f"row{r}" for r in range(rows)], columns=[f"col{c}" for c in range(cols)]
)

print(df)
The printed utilities should be close to the values reported for this maze in Russell & Norvig, Chapter 17 (shown here rounded to three decimals):

        col0   col1   col2   col3
row0   0.812  0.868  0.918  1.000
row1   0.762  0.000  0.660 -1.000
row2   0.705  0.655  0.611  0.388
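
As a usage example of the greedy-policy formula from the convergence discussion above, the sketch below extracts an action for every non-terminal state. It assumes the variables rows, cols, V, actions, transitions, terminal_states and walls from the code above are still in scope; the arrows mapping is introduced here only for display.

# Greedy policy extraction from the computed value estimates
arrows = {"up": "^", "down": "v", "left": "<", "right": ">"}
policy = [["" for _ in range(cols)] for _ in range(rows)]
for i in range(rows):
    for j in range(cols):
        if (i, j) in terminal_states:
            policy[i][j] = "+1" if terminal_states[(i, j)] > 0 else "-1"
            continue
        if (i, j) in walls:
            policy[i][j] = "##"
            continue
        # Pick the action with the highest expected value of the next state
        best_name = max(
            actions,
            key=lambda name: sum(
                prob * V[s_prime]
                for prob, s_prime, _ in transitions((i, j), actions[name])
            ),
        )
        policy[i][j] = arrows[best_name]

for row in policy:
    print(" ".join(f"{cell:>2}" for cell in row))

The printed arrows should match the optimal policy for this maze; in particular, the three non-terminal states in the top row all point right, towards the \(+1\) terminal.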