Example of \(Q(s,a)\) Prediction

Figure: the toy corridor environment

Suppose an agent is learning to play the toy environment shown above. This is essentially a corridor, and the agent has to learn to navigate to the end of the corridor to reach the good terminal state \(s_{T2}\), denoted with a star.
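
Since the figure itself is not reproduced here, the following is a minimal sketch of what such a corridor environment could look like in Python. The number of states, the starting position, the action names, and the reward of +1 for reaching \(s_{T2}\) are illustrative assumptions, not details taken from the figure.

```python
class CorridorEnv:
    """Minimal sketch of the corridor: states 0..num_states-1 on a line.

    Assumptions: state 0 is a bad terminal state, the last state is the good
    terminal state s_T2 (the star), the agent starts in the middle, and it
    receives +1 only for reaching s_T2.
    """
    LEFT, RIGHT = 0, 1

    def __init__(self, num_states=5):
        self.num_states = num_states
        self.goal = num_states - 1  # s_T2
        self.reset()

    def reset(self):
        self.state = self.num_states // 2  # start in the middle of the corridor
        return self.state

    def step(self, action):
        # deterministic transition: move one cell left or right
        self.state += 1 if action == self.RIGHT else -1
        done = self.state in (0, self.goal)
        reward = 1.0 if self.state == self.goal else 0.0
        return self.state, reward, done
```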

How can we learn the optimal Q function?

The diagram is split into five blocks from top to bottom. Each block corresponds to a single episode of experiences in the environment; the first block corresponds to the first episode, the second block the second episode, and so on. Each block contains a number of columns. They are interpreted from left to right as follows:

\[Q^*(s, a) = r + \gamma Q^*(s', a')\]

Figure: TD Q-function learning
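
To make the update concrete, here is a hedged sketch of tabular TD learning of the Q-function on the CorridorEnv sketch above. The TD target follows the equation \(r + \gamma Q^*(s', a')\), implemented here as \(r + \gamma \max_{a'} Q(s', a')\) under the assumption that \(a'\) is the greedy next action; the learning rate, exploration rate, and episode count are illustrative values, not taken from the diagram.

```python
import random

def td_q_learning(env, episodes=100, gamma=0.9, alpha=0.5, epsilon=0.1):
    # Tabular Q-function, initialized to zero: Q[s][a]
    Q = [[0.0, 0.0] for _ in range(env.num_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection, breaking ties randomly
            if random.random() < epsilon or Q[s][env.LEFT] == Q[s][env.RIGHT]:
                a = random.choice([env.LEFT, env.RIGHT])
            else:
                a = env.LEFT if Q[s][env.LEFT] > Q[s][env.RIGHT] else env.RIGHT
            s_next, r, done = env.step(a)
            # TD target: bootstrap from the next state unless it is terminal
            target = r if done else r + gamma * max(Q[s_next])
            # move the current estimate toward the TD target
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = td_q_learning(CorridorEnv())
for s, (q_left, q_right) in enumerate(Q):
    print(f"state {s}: Q(left)={q_left:.2f}  Q(right)={q_right:.2f}")
```

Under the assumed rewards, the learned Q-values propagate backwards from \(s_{T2}\) over successive episodes, which is the kind of episode-by-episode refinement the blocks above depict.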
