Bellman Expectation Backup
Computing the value functions given a policy
In this section we describe how to calculate the value functions by establishing a recursive relationship similar to the one we derived for the return. We replace the expectations with summations over quantities such as states and actions.
Let's start with the state-value function. Using the return recursion $G_t = R_{t+1} + \gamma G_{t+1}$, it can be written as,

$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t | S_t = s \right] = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} | S_t = s \right] = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s \right]$$
All above expectations are with respect to the policy $\pi$.
This is perhaps one of the most important recursions in control theory - it is known as the Bellman expectation equation, repeated below:

$$v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s \right]$$
The parts of the value function above are:
- The immediate reward $R_{t+1}$,
- The discounted value of the successor state $\gamma v_\pi(S_{t+1})$.
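
To make the recursion concrete, here is a small numerical sketch (not part of the original notes) for a deterministic three-state chain with a reward of 1 per step and $\gamma = 0.5$; the numbers and variable names are purely illustrative.

```python
# A minimal sketch (not from the original text): for a deterministic
# 3-state chain with rewards 1, 1, 1 and gamma = 0.5, the recursion
# v(s) = R_{t+1} + gamma * v(s') reproduces the discounted returns.

gamma = 0.5
rewards = [1.0, 1.0, 1.0]   # reward received when leaving each state
v_terminal = 0.0

# Backup from the terminal state: v(s) = r + gamma * v(next state).
v = v_terminal
values = []
for r in reversed(rewards):
    v = r + gamma * v
    values.append(v)
values.reverse()
print(values)  # [1.75, 1.5, 1.0] -- the discounted returns from each state
```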
Similarly to the state-value function, we can decompose the action-value function as,

$$q_\pi(s,a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a \right]$$
We now face the problem that we need to compute these two value functions, and we start by considering what is happening at each time step. At each time step, while in state $s$, the agent can take any of the actions $a \in \mathcal{A}$ available from that state, each selected with probability $\pi(a|s)$ given by the policy.
Translating what we have just described into equation form allows us to write the state-value equation as,

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) q_\pi(s,a)$$
This sum is easily understood if you move backwards from the action nodes of the tree to the state node. Each edge weighs the corresponding action-value function $q_\pi(s,a)$ with the probability $\pi(a|s)$.
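
As a minimal illustration of this policy-weighted sum, the following Python sketch (with hypothetical nested dictionaries `policy` and `q`) backs up the action values of a single state:

```python
# A minimal sketch (not from the original text): v_pi(s) as the
# policy-weighted sum of the action values q_pi(s, a).
# `policy[s][a]` and `q[s][a]` are hypothetical nested dictionaries.

def state_value(s, policy, q):
    """Backup from the action nodes to the state node: sum_a pi(a|s) * q(s, a)."""
    return sum(policy[s][a] * q[s][a] for a in policy[s])

# Example: one state with two actions under a stochastic policy.
policy = {"s0": {"left": 0.4, "right": 0.6}}
q = {"s0": {"left": 1.0, "right": 2.0}}
print(state_value("s0", policy, q))  # 0.4 * 1.0 + 0.6 * 2.0 = 1.6
```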
We can now reason fairly similarly about the action-value function, which can be written by taking the expectation,

$$q_\pi(s,a) = \mathbb{E} \left[ R_{t+1} | S_t = s, A_t = a \right] + \gamma \mathbb{E}_\pi \left[ v_\pi(S_{t+1}) | S_t = s, A_t = a \right]$$
The first expectation is the reward function $\mathcal{R}_s^a$.
Successor states that can be reached from state $s$ after the agent takes action $a$ are in general more than one, since the environment is stochastic.
If you recall, the agent in the Gridworld has an 80% probability of achieving its intention and making the environment change to the desired state, and a 20% probability of making the environment change to undesired states, justifying the multiplicity of successor states given an action in the figure above.
Which successor state we will transition to depends on the transition model $\mathcal{P}_{ss'}^a$.
Substituting the two expectations with the reward function and the transition model, respectively, gives the Bellman state-action value expectation equation shown below.
Tree that represents the action-value function after a one-step look ahead.
Bellman State-Action Value Expectation Equation

$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$
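
A minimal Python sketch of this one-step look ahead, assuming hypothetical dictionaries `R` (reward function), `P` (transition model) and `v` (current state values):

```python
# A minimal sketch (not from the original text) of the Bellman
# state-action value expectation equation: a one-step look ahead that
# combines the reward function R_s^a with the transition-model-weighted
# values of the successor states. All names here are hypothetical.

def action_value(s, a, R, P, v, gamma=0.9):
    """q_pi(s,a) = R[s][a] + gamma * sum_s' P[s][a][s'] * v[s']."""
    return R[s][a] + gamma * sum(p * v[s_next] for s_next, p in P[s][a].items())

# Example: action "up" succeeds with probability 0.8 and slips with 0.2,
# mirroring the Gridworld dynamics mentioned above.
R = {"s0": {"up": 0.0}}
P = {"s0": {"up": {"s1": 0.8, "s2": 0.2}}}
v = {"s1": 1.0, "s2": -1.0}
print(action_value("s0", "up", R, P, v))  # 0.9 * (0.8*1.0 + 0.2*(-1.0)) = 0.54
```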
Now that we have a computable action-value function $q_\pi(s,a)$, we can substitute it into the state-value function we wrote earlier.
Tree that represents the state-value function after a one-step look ahead.
With the substitution we can write the state-value function as,
Bellman State Value Expectation Equation

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right)$$

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$$
In the compact form (2nd line of the equation) we denote:

$$\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{R}_s^a, \qquad \mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{P}_{ss'}^a$$
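
Because the compact form is linear in $v_\pi$, for a small state space it can be solved exactly rather than iteratively. The NumPy sketch below (with a made-up three-state MDP and the policy-averaged quantities `R_pi`, `P_pi` defined above) illustrates this:

```python
import numpy as np

# A minimal sketch (not from the original text): because the compact form
# v_pi = R_pi + gamma * P_pi v_pi is linear in v_pi, for a small state space
# it can be solved exactly as v_pi = (I - gamma * P_pi)^{-1} R_pi.
# R_pi and P_pi are the policy-averaged reward vector and transition matrix
# defined above; the numbers below are made up.

gamma = 0.9
R_pi = np.array([1.0, 0.0, -1.0])                 # R^pi_s for 3 states
P_pi = np.array([[0.1, 0.8, 0.1],                 # P^pi_{ss'}
                 [0.0, 0.5, 0.5],
                 [0.2, 0.2, 0.6]])

v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v_pi)
```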
As we will see in a separate chapter, this equation is used to iteratively calculate the converged value function of each state given an MDP and a policy. The equation is referred to as the Bellman expectation backup - it takes its name from the tree-like structure shown previously, where the state-value functions of the leaf nodes are backed up to compute the value of the root state.
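
As a preview of that chapter, the following sketch (not the original notes' implementation; all names are hypothetical) applies the Bellman expectation backup repeatedly until the value function stops changing:

```python
# A minimal sketch (not from the original text) of iterative policy
# evaluation: the Bellman expectation backup applied repeatedly until the
# value function converges. P[s][a] maps successor states to probabilities,
# R[s][a] is the expected immediate reward; both are hypothetical inputs.

def policy_evaluation(states, policy, R, P, gamma=0.9, tol=1e-8):
    v = {s: 0.0 for s in states}                       # initialise v_pi(s) = 0
    while True:
        delta = 0.0
        for s in states:
            # Backup the leaf-node values through the one-step look-ahead tree:
            # v(s) = sum_a pi(a|s) * (R[s][a] + gamma * sum_s' P[s][a][s'] * v[s'])
            new_v = sum(policy[s][a] *
                        (R[s][a] + gamma * sum(p * v[s_next]
                                               for s_next, p in P[s][a].items()))
                        for a in policy[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:                                 # converged value function
            return v

# Example: a single non-terminal state "s0" that loops onto itself.
states = ["s0"]
policy = {"s0": {"stay": 1.0}}
R = {"s0": {"stay": 1.0}}
P = {"s0": {"stay": {"s0": 1.0}}}
print(policy_evaluation(states, policy, R, P))  # v(s0) ~ 1 / (1 - 0.9) = 10
```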