Policy Evaluation (Prediction)

Evaluating a policy \(\pi\) means computing the state-value function \(v_\pi(s)\) for all states, in other words, knowing the expected discounted return that each state offers us under that policy.

Iterative Policy Evaluation

This method is based on the fact that the Bellman expectation operator is a contraction. We can therefore apply the Bellman expectation backup repeatedly, in an iterative fashion, and converge to the state-value function of the policy.

We start at \(k=0\) by initializing the state-value function (a vector over all states) to \(v_0(s)=0\). In each iteration \(k\) we take the state-value function of the previous iteration, \(v_k(s)\), and apply the Bellman expectation backup as prescribed by the one-step lookahead tree below, which is the tree we saw for the Bellman expectation backup, now decorated with the iteration indices. This is called the synchronous backup formulation, since all elements of the value-function vector are updated at the same time.

Figure (policy-evaluation-tree): Tree representation of the state-value function with one-step lookahead across iterations.

The Bellman expectation backup is given by,

\[v_{k+1}(s) = \sum_{a \in \mathcal A} \pi(a|s) \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_k(s^\prime) \right)\]

and in vector form,

\[\mathbf{v}^{k+1} = \mathbf{\mathcal R}^\pi + \gamma \mathbf{\mathcal P}^\pi \mathbf{v}^k\]
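In code, one synchronous sweep of this vector backup is just a matrix-vector product. The sketch below is purely illustrative: the reward vector R_pi and the row-stochastic transition matrix P_pi are placeholder values, not a specific MDP from these notes.

import numpy as np

gamma = 0.9

# Placeholder model under a fixed policy pi (illustrative values only)
R_pi = np.array([1.0, 0.0, 2.0])          # expected immediate reward per state
P_pi = np.array([[0.5, 0.5, 0.0],         # row-stochastic transition matrix under pi
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])

v = np.zeros(3)                           # v_0(s) = 0 for all states
for k in range(100):                      # repeated synchronous backups
    v = R_pi + gamma * P_pi @ v           # v_{k+1} = R_pi + gamma * P_pi * v_k
print(v)                                  # approximates v_pi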

Note

A simple scalar contraction that illustrates the idea behind iterative policy evaluation is:

\[ x_{k+1} = c + \gamma x_k \]

where:

- \(c\) is a constant (analogous to the reward),
- \(\gamma \in [0, 1)\) is the contraction factor (analogous to the discount factor in MDPs).

This scalar recurrence mimics the vector version:

\[ v_{k+1} = r^\pi + \gamma P^\pi v_k \]

but in one dimension, where \(P^\pi\) is just the scalar 1. Solving for the fixed point \(x = c + \gamma x\):

\[ x (1 - \gamma) = c \Rightarrow x = \frac{c}{1 - \gamma} \]

Just like in policy evaluation, this converges because the operator \(T(x) = c + \gamma x\) is a contraction mapping when \(\gamma < 1\).

# Scalar contraction example
c = 2
gamma = 0.9
x = 0

print("Scalar contraction iterations:\n")
for i in range(10):
    x = c + gamma * x
    print(f"Iteration {i+1}: x = {x}")

This will converge to \(x = \frac{2}{1 - 0.9} = 20\).

To solve the trivial two-state MDP with the iterative approach, we start with the initial guess:

\(v_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\)

Update iteratively:

Iteration 1:

\[ v_1 = r^\pi + \gamma P^\pi v_0 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + 0.9 \cdot \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} \]

Iteration 2:

\[ v_2 = r^\pi + \gamma P^\pi v_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + 0.9 \cdot \begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ 1.8 \end{bmatrix} \]

Iteration 3:

\[ P^\pi v_2 = \begin{bmatrix} 1.8 \\ 2 \end{bmatrix} \Rightarrow v_3 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + 0.9 \cdot \begin{bmatrix} 1.8 \\ 2 \end{bmatrix} = \begin{bmatrix} 3.62 \\ 1.8 \end{bmatrix} \]

Iteration 4:

\[ P^\pi v_3 = \begin{bmatrix} 1.8 \\ 3.62 \end{bmatrix} \Rightarrow v_4 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + 0.9 \cdot \begin{bmatrix} 1.8 \\ 3.62 \end{bmatrix} = \begin{bmatrix} 3.62 \\ 3.258 \end{bmatrix} \]

Continuing this will converge to:

\[ v^\pi \approx \begin{bmatrix} 10.526 \\ 9.474 \end{bmatrix} \]

which matches the exact solution from the linear system.
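We can verify these hand computations numerically. The sketch below assumes the reward vector \(r^\pi = [2, 0]^\top\) and the transition matrix \(P^\pi = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\) implied by the iterations above, and compares the iterative result against the direct solution of the linear system.

import numpy as np

gamma = 0.9
r_pi = np.array([2.0, 0.0])
P_pi = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

# Iterative policy evaluation
v = np.zeros(2)
for _ in range(200):
    v = r_pi + gamma * P_pi @ v
print("Iterative:", v)                 # ~ [10.526  9.474]

# Direct solution of v = r_pi + gamma * P_pi v
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("Exact:    ", v_exact)           # [10.526  9.474]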

Policy Evaluation - MiniGrid Examples

The following examples are based on the MiniGrid Empty environment (MiniGrid-Empty-5x5-v0).

Iterative Method

import numpy as np
import gymnasium as gym
from minigrid.wrappers import FullyObsWrapper

def policy_eval_iterative(env, policy, discount_factor=0.99, epsilon=0.00001):
    """
    Evaluate a policy given an environment using iterative policy evaluation.
    
    Args:
        env: MiniGrid environment.
        policy: [S, A] shaped matrix representing the policy.
        discount_factor: Gamma discount factor.
        epsilon: Threshold for convergence.

    Returns:
        Vector representing the value function.
    """
    # Build the tabular model: P[s, a, s'] transition probabilities and R[s, a] rewards
    # (build_minigrid_model is defined elsewhere in these notes)
    P, R, state_dicts = build_minigrid_model(env)
    nS, nA = R.shape
    
    # Start with zeros for the value function
    V = np.zeros(nS)
    
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states
                for s_next in range(nS):
                    # If transition is possible
                    if P[s, a, s_next] > 0:
                        # Calculate the expected value
                        v += action_prob * P[s, a, s_next] * (R[s, a] + discount_factor * V[s_next])
            # How much did the value change?
            delta = max(delta, np.abs(v - V[s]))
            # Update in place: later states in this sweep see the new value
            V[s] = v
        # Stop evaluating once our value function change is below threshold
        if delta < epsilon:
            break
            
    return V

# Create the MiniGrid environment
env = gym.make("MiniGrid-Empty-5x5-v0")
env = FullyObsWrapper(env)

# Get model dimensions
P, R, state_dicts = build_minigrid_model(env)
nS, nA = R.shape

# Create a uniform random policy
random_policy = np.ones([nS, nA]) / nA

# Evaluate the policy
v = policy_eval_iterative(env, random_policy, discount_factor=0.99)

# Print results
print("State-value function using iterative policy evaluation:\n")
for idx, sdict in enumerate(state_dicts):
    pos = sdict['agent_pos']
    d   = sdict['agent_dir']
    print(f"  s={idx:2d}, pos={pos}, dir={d}:  V = {v[idx]:6.3f}")
State-value function using iterative policy evaluation:

  s= 0, pos=(1, 1), dir=0:  V =  0.923
  s= 1, pos=(1, 1), dir=3:  V =  0.862
  s= 2, pos=(1, 1), dir=1:  V =  0.923
  s= 3, pos=(2, 1), dir=0:  V =  1.050
  s= 4, pos=(1, 1), dir=2:  V =  0.862
  s= 5, pos=(1, 2), dir=1:  V =  1.048
  s= 6, pos=(2, 1), dir=3:  V =  0.961
  s= 7, pos=(2, 1), dir=1:  V =  1.060
  s= 8, pos=(3, 1), dir=0:  V =  1.204
  s= 9, pos=(1, 2), dir=0:  V =  1.060
  s=10, pos=(1, 2), dir=2:  V =  0.959
  s=11, pos=(1, 3), dir=1:  V =  1.199
  s=12, pos=(2, 1), dir=2:  V =  0.939
  s=13, pos=(2, 2), dir=1:  V =  1.267
  s=14, pos=(3, 1), dir=3:  V =  1.121
  s=15, pos=(3, 1), dir=1:  V =  1.372
  s=16, pos=(1, 2), dir=3:  V =  0.938
  s=17, pos=(2, 2), dir=0:  V =  1.269
  s=18, pos=(1, 3), dir=0:  V =  1.366
  s=19, pos=(1, 3), dir=2:  V =  1.117
  s=20, pos=(2, 2), dir=2:  V =  1.076
  s=21, pos=(2, 3), dir=1:  V =  1.547
  s=22, pos=(3, 1), dir=2:  V =  1.118
  s=23, pos=(3, 2), dir=1:  V =  1.892
  s=24, pos=(2, 2), dir=3:  V =  1.076
  s=25, pos=(3, 2), dir=0:  V =  1.554
  s=26, pos=(1, 3), dir=3:  V =  1.114
  s=27, pos=(2, 3), dir=0:  V =  1.881
  s=28, pos=(2, 3), dir=2:  V =  1.321
  s=29, pos=(3, 2), dir=2:  V =  1.398
  s=30, pos=(3, 3), dir=1:  V =  1.087
  s=31, pos=(3, 2), dir=3:  V =  1.327
  s=32, pos=(2, 3), dir=3:  V =  1.393
  s=33, pos=(3, 3), dir=0:  V =  1.088
  s=34, pos=(3, 3), dir=2:  V =  1.164
  s=35, pos=(3, 3), dir=3:  V =  1.165
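
For reference, the same backup can also be written in a fully vectorized, synchronous form. The sketch below reuses the P, R arrays returned by build_minigrid_model; policy_eval_vectorized is a name introduced here for illustration, and because it updates all states simultaneously rather than in place, its result agrees with the sweep above only up to the stopping tolerance.

import numpy as np

def policy_eval_vectorized(P, R, policy, discount_factor=0.99, epsilon=1e-5):
    """
    Synchronous, fully vectorized policy evaluation.
    P: (nS, nA, nS) transition probabilities, R: (nS, nA) rewards,
    policy: (nS, nA) action probabilities.
    """
    nS, nA = R.shape
    V = np.zeros(nS)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        Q = R + discount_factor * (P @ V)
        # V_new[s] = sum_a policy[s, a] Q[s, a]
        V_new = np.sum(policy * Q, axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new
        V = V_new

# Reusing the model and policy built above:
# v_vec = policy_eval_vectorized(P, R, random_policy)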

Policy Evaluation using Ray

The problem here is trivial, but in practice Ray is often used to parallelize such computations across many cores or machines. For more information, see the Ray documentation and Ray RLlib.

import torch
import ray
import numpy as np
import gymnasium as gym
from minigrid.wrappers import FullyObsWrapper

ray.init()  # start Ray (will auto-detect cores)

@ray.remote
def eval_state(s, V_old, policy_s, P, R, gamma):
    """
    Compute the new V[s] for a single state s under `policy_s`
    V_old: torch tensor (nS,)
    policy_s: torch tensor (nA,)
    P: Transition probability array of shape (nS, nA, nS)
    R: Reward array of shape (nS, nA)
    """
    v = 0.0
    for a, π in enumerate(policy_s):
        for s_next in range(len(V_old)):
            # If transition is possible
            if P[s, a, s_next] > 0:
                # Calculate expected value
                v += π * P[s, a, s_next] * (R[s, a] + gamma * V_old[s_next])
    return float(v)

def policy_eval_ray_minigrid(env, policy, gamma=0.99, epsilon=1e-5):
    """
    Policy evaluation using Ray for parallel computation in MiniGrid
    """
    # Build model to get transitions and rewards
    P, R, state_dicts = build_minigrid_model(env)
    nS, nA = R.shape
    
    # torch tensors for GPU/CPU flexibility
    V_old = torch.zeros(nS, dtype=torch.float32)
    policy_t = torch.tensor(policy, dtype=torch.float32)

    while True:
        # launch one task per state
        futures = [
            eval_state.remote(
                s,
                V_old,
                policy_t[s],
                P,
                R,
                gamma
            )
            for s in range(nS)
        ]
        # gather all new V's
        V_new_list = ray.get(futures)
        V_new = torch.tensor(V_new_list)

        # check convergence
        if torch.max(torch.abs(V_new - V_old)) < epsilon:
            break
        V_old = V_new

    return V_new

if __name__ == "__main__":
    # Create the MiniGrid environment
    env = gym.make("MiniGrid-Empty-5x5-v0")
    env = FullyObsWrapper(env)

    # Get model dimensions
    P, R, state_dicts = build_minigrid_model(env)
    nS, nA = R.shape
    
    # uniform random policy
    random_policy = np.ones((nS, nA)) / nA
    
    # evaluate policy using Ray parallel computation
    v = policy_eval_ray_minigrid(env, random_policy, gamma=0.99)
    
    # Print results in the same format as cell 1
    print("State-value function using Ray parallel evaluation:\n")
    for idx, sdict in enumerate(state_dicts):
        pos = sdict['agent_pos']
        d   = sdict['agent_dir']
        print(f"  s={idx:2d}, pos={pos}, dir={d}:  V = {v[idx]:6.3f}")
2025-04-28 15:34:13,046 INFO worker.py:1841 -- Started a local Ray instance.
State-value function using Ray parallel evaluation:

  s= 0, pos=(1, 1), dir=0:  V =  0.923
  s= 1, pos=(1, 1), dir=3:  V =  0.862
  s= 2, pos=(1, 1), dir=1:  V =  0.923
  s= 3, pos=(2, 1), dir=0:  V =  1.050
  s= 4, pos=(1, 1), dir=2:  V =  0.862
  s= 5, pos=(1, 2), dir=1:  V =  1.048
  s= 6, pos=(2, 1), dir=3:  V =  0.960
  s= 7, pos=(2, 1), dir=1:  V =  1.060
  s= 8, pos=(3, 1), dir=0:  V =  1.204
  s= 9, pos=(1, 2), dir=0:  V =  1.060
  s=10, pos=(1, 2), dir=2:  V =  0.959
  s=11, pos=(1, 3), dir=1:  V =  1.199
  s=12, pos=(2, 1), dir=2:  V =  0.939
  s=13, pos=(2, 2), dir=1:  V =  1.267
  s=14, pos=(3, 1), dir=3:  V =  1.121
  s=15, pos=(3, 1), dir=1:  V =  1.372
  s=16, pos=(1, 2), dir=3:  V =  0.938
  s=17, pos=(2, 2), dir=0:  V =  1.269
  s=18, pos=(1, 3), dir=0:  V =  1.366
  s=19, pos=(1, 3), dir=2:  V =  1.117
  s=20, pos=(2, 2), dir=2:  V =  1.076
  s=21, pos=(2, 3), dir=1:  V =  1.546
  s=22, pos=(3, 1), dir=2:  V =  1.118
  s=23, pos=(3, 2), dir=1:  V =  1.892
  s=24, pos=(2, 2), dir=3:  V =  1.076
  s=25, pos=(3, 2), dir=0:  V =  1.554
  s=26, pos=(1, 3), dir=3:  V =  1.114
  s=27, pos=(2, 3), dir=0:  V =  1.881
  s=28, pos=(2, 3), dir=2:  V =  1.321
  s=29, pos=(3, 2), dir=2:  V =  1.398
  s=30, pos=(3, 3), dir=1:  V =  1.087
  s=31, pos=(3, 2), dir=3:  V =  1.326
  s=32, pos=(2, 3), dir=3:  V =  1.393
  s=33, pos=(3, 3), dir=0:  V =  1.088
  s=34, pos=(3, 3), dir=2:  V =  1.164
  s=35, pos=(3, 3), dir=3:  V =  1.165