Policy Evaluation (Prediction)
Evaluating a policy \(\pi\) means producing the state-value function \(v_\pi(s)\) for all states, in other words knowing the expected discounted return that each state offers us under \(\pi\). To do so we apply the Bellman expectation backup equation repeatedly, in an iterative fashion.
We start at \(k=0\) by initializing the state-value function (a vector over states) to \(v_0(s)=0\). In each iteration \(k+1\) we take the value function of the previous iteration, \(v_k(s)\), and apply the Bellman expectation backup as prescribed by the one-step lookahead tree below, which is the tree we have already seen for the Bellman expectation backup, decorated with the iteration indices. This is called the synchronous backup formulation, since all elements of the value function vector are updated at the same time.
Tree representation of the state-value function with one-step lookahead across iterations.
The Bellman expectation backup is given by,
\[v_{k+1}(s) = \sum_{a \in \mathcal A} \pi(a|s) \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_k(s^\prime) \right)\]
and in vector form,
\[\mathbf{v}^{k+1} = \mathbf{\mathcal R}^\pi + \gamma \mathbf{\mathcal P}^\pi \mathbf{v}^k\]
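Before moving to the full Gridworld example, here is a minimal numpy sketch of the vector-form backup above for a hypothetical two-state MDP; the names R_pi and P_pi and all the numbers are made up for illustration. Note that the whole vector is updated at once, which is exactly the synchronous backup described earlier.

import numpy as np

# hypothetical 2-state MDP under a fixed policy (numbers are made up for illustration)
R_pi = np.array([1.0, 0.0])               # expected immediate reward per state under the policy
P_pi = np.array([[0.5, 0.5],              # state-transition matrix under the policy
                 [0.0, 1.0]])
gamma = 0.9

v = np.zeros(2)                           # v_0 = 0
for k in range(1000):
    v_new = R_pi + gamma * P_pi @ v       # synchronous backup: whole vector updated at once
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop when the update no longer changes the values
        v = v_new
        break
    v = v_new

print(v)  # approximate fixed point v_pi of the Bellman expectation equation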
The following source code is instructive and standalone. It runs policy evaluation on the Gridworld environment, one of the many environments built on top of the OpenAI Gym RL Python library.
# this code is from https://becomesentient.com/mdp-dynamic-programming/
import numpy as np
import gym.spaces
from gridworld import GridworldEnv

env = GridworldEnv()

def policy_eval(policy, env, discount_factor=1.0, epsilon=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        discount_factor: Gamma discount factor.
        epsilon: We stop evaluation once the value function change is less than epsilon for all states.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # start with an all-zero value function
    V_old = np.zeros(env.nS)

    while True:
        # new value function for this iteration
        V_new = np.zeros(env.nS)
        # stopping condition
        delta = 0

        # loop over the state space
        for s in range(env.nS):
            # accumulate the Bellman expectation equation for this state
            v_fn = 0
            # get the probability distribution over actions
            action_probs = policy[s]

            # loop over possible actions
            for a in range(env.nA):
                # loop over the transitions for this action
                for prob, next_state, reward, done in env.P[s][a]:
                    # apply the Bellman expectation equation
                    v_fn += action_probs[a] * prob * (reward + discount_factor * V_old[next_state])

            # track the biggest change over the state space
            delta = max(delta, abs(v_fn - V_old[s]))
            # update the state value
            V_new[s] = v_fn

        # the new value function becomes the current estimate
        V_old = V_new

        # close enough to the true value function
        if delta < epsilon:
            break

    return np.array(V_old)

random_policy = np.ones([env.nS, env.nA]) / env.nA
v = policy_eval(random_policy, env)

expected_v = np.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])
np.testing.assert_array_almost_equal(v, expected_v, decimal=2)
print(v)
print(expected_v)
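Since expected_v has 16 entries, the environment is presumably the standard 4x4 gridworld; if so, reshaping the result makes the converged values easier to read on the grid (the 4x4 shape is an assumption here, not something the listing above relies on).

# assuming the 4x4 grid layout implied by the 16-entry expected_v
print(v.reshape(4, 4))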