Policy Gradient Algorithms – REINFORCE

Policy Gradient and REINFORCE: Overview

  • Policy-based RL methods optimize the policy directly by maximizing an objective function.
  • REINFORCE updates action probabilities so that actions producing higher returns become more likely.
  • Over many episodes, the policy evolves toward behaviors yielding better performance.
  • REINFORCE is the foundational policy gradient algorithm.

Generalization in RL

  • Algorithms such as REINFORCE make no guarantee of generalization across environments.
  • Policies typically overfit to the environment they are trained in.
  • Recommended readings on RL generalization include Robert Kirk’s survey and Google Research work on policy similarity embeddings.

REINFORCE as a Policy Gradient Method

  • REINFORCE performs gradient ascent on the expected return:
    • Increase log-probability of good actions
    • Decrease log-probability of poor actions
  • This forms the basis of the policy gradient family of algorithms.

Components of the Algorithm

  • Parametrized policy \(\pi_\theta(a \mid s)\).
  • Objective function \(J(\pi_\theta)\) (expected discounted return).
  • Policy gradient update rule for \(\theta\).

Parametrized Policy

  • The policy is a neural network with parameters \(\theta\).
  • Different values of \(\theta\) generally correspond to different policies: \[\pi_{\theta_1}(a \mid s) \neq \pi_{\theta_2}(a \mid s) \quad \text{for } \theta_1 \neq \theta_2.\]
  • A single architecture can represent many possible policies.
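
As a concrete illustration, here is a minimal sketch of a parametrized discrete-action policy in PyTorch; the class name `PolicyNet`, the layer sizes, and the argument names are illustrative assumptions rather than anything fixed by these notes.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a state to a categorical distribution over actions, i.e. pi_theta(a | s)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # theta = the weights and biases of these layers
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one logit per action
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))
```

Sampling from the returned distribution yields an action together with its log-probability, which is exactly the quantity the policy gradient expression below operates on.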

Objective Function

  • RL objective expressed as expected return: \[J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[G_t].\]
  • Gradient-ascent update: \[\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\pi_\theta).\]
  • Alternatively, minimize the loss \(-\!J(\pi_\theta)\) using gradient descent.
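
Since deep-learning libraries implement gradient descent, maximizing \(J\) is done in practice by descending on \(-J\). A toy sketch of that equivalence, using a simple quadratic as a stand-in objective (the quadratic has nothing to do with the RL objective; it only illustrates the sign flip):

```python
import torch

# Maximize J(theta) = -(theta - 3)^2 by minimizing the loss -J(theta).
theta = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

for _ in range(200):
    J = -(theta - 3.0) ** 2   # objective we want to ascend
    loss = -J                 # equivalent loss we descend
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(theta.item())           # converges toward 3.0, the maximizer of J
```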

Policy Gradient Expression

\[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\; Q_{\pi_\theta}(s, a) \right]. \]

  • Contains a score term \(\nabla_\theta \log \pi_\theta(a \mid s)\) weighted by the action value \(Q_{\pi_\theta}(s, a)\).
  • Fundamental identity behind REINFORCE and actor–critic methods.

Log-Derivative (Score Function) Trick

  • We often need: \[\nabla_\theta \mathbb{E}_{x \sim p(\cdot \mid \theta)}[f(x)].\]
  • The log-derivative identity rewrites this as: \[\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}\!\left[f(x)\,\nabla_\theta \log p(x \mid \theta)\right].\]
  • Enables Monte-Carlo estimation without differentiating through sampling.

Log-Derivative Trick: Steps

  1. Start from the expectation: \[\mathbb{E}[f(x)] = \int f(x)\,p(x \mid \theta)\,dx.\]
  2. Differentiate: \[\nabla_\theta \int f(x)p(x\mid\theta)dx = \int f(x)\nabla_\theta p(x\mid\theta)\,dx.\]
  3. Rewrite using division by \(p(x\mid\theta)\): \[\int f(x)p(x\mid\theta) \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}dx.\]
  4. Recognize: \[\nabla_\theta \log p(x\mid\theta) = \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}.\]
  5. Final form: \[\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}\!\left[f(x)\,\nabla_\theta\log p(x\mid\theta)\right].\]
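
The identity is easy to check numerically. In the sketch below, \(x \sim \mathcal{N}(\theta, \sigma^2)\) and \(f(x) = x^2\), so the true gradient is \(\nabla_\theta \mathbb{E}[x^2] = \nabla_\theta(\theta^2 + \sigma^2) = 2\theta\); the choice of distribution and \(f\) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 1.5, 1.0, 200_000

# Sample x ~ N(theta, sigma^2) and evaluate f(x) = x^2.
x = rng.normal(theta, sigma, size=n)
f = x ** 2

# Score function of the Gaussian w.r.t. its mean:
# d/dtheta log p(x | theta) = (x - theta) / sigma^2
score = (x - theta) / sigma ** 2

grad_estimate = np.mean(f * score)  # Monte-Carlo estimate of E[f(x) * grad log p]
grad_true = 2 * theta               # analytic gradient of theta^2 + sigma^2

print(grad_estimate, grad_true)     # the estimate should be close to 3.0
```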

Monte Carlo Policy Gradient Form

  • Replace the action value \(Q_{\pi_\theta}(s_t, a_t)\) with the sampled return \(G_t(\tau)\) and sum over the steps of a trajectory: \[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} G_t(\tau)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]. \]
  • This is the REINFORCE update rule; in code it is usually implemented as the surrogate loss sketched below.
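
A minimal sketch of that surrogate loss, with placeholder tensors standing in for a sampled trajectory (`log_probs` would come from the policy network and `returns` from the return computation in the next section):

```python
import torch

# Placeholder trajectory data; the values are illustrative, not from a real environment.
log_probs = torch.tensor([-0.7, -1.2, -0.4], requires_grad=True)  # log pi_theta(a_t | s_t)
returns = torch.tensor([3.0, 2.0, 1.0])                           # G_t for each step

# The gradient of this loss is -sum_t G_t * grad log pi_theta(a_t | s_t),
# so one step of gradient descent on it ascends the policy gradient.
loss = -(returns * log_probs).sum()
loss.backward()

print(log_probs.grad)  # tensor([-3., -2., -1.]) for this toy setup
```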

Return Definition

\[ G_t(\tau) = \sum_{k=0}^{T-t-1} \gamma^k R_{t+1+k}. \]

  • Episodic Monte-Carlo return.
  • Computes discounted rewards along a trajectory.
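
All of the \(G_t\) for an episode can be computed in a single backward pass over the rewards, using \(G_t = R_{t+1} + \gamma\,G_{t+1}\); a small sketch (the reward list is made up):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every step, where rewards[t] corresponds to R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```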

Discount Factor as a Hyper-Parameter

  • \(\gamma \in [0,1]\) controls the trade-off between myopic and farsighted behavior (see the numeric illustration after this list).
  • Often chosen from \(\{0.9, 0.95, 0.99\}\) or tuned experimentally.
  • High \(\gamma\): emphasizes long-term rewards
  • Low \(\gamma\): emphasizes immediate rewards
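
A quick check of how strongly each common value of \(\gamma\) discounts a reward arriving 50 steps in the future (the horizon of 50 is arbitrary):

```python
for gamma in (0.9, 0.95, 0.99):
    # Weight gamma^k applied to a reward k = 50 steps away
    print(gamma, round(gamma ** 50, 4))
# 0.9 -> 0.0052, 0.95 -> 0.0769, 0.99 -> 0.605
```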

Monte Carlo Approximation

  • The expectation over trajectories is approximated by sampling:
    • Generate trajectory \(\tau \sim \pi_\theta\)
    • Compute returns \(G_t(\tau)\)
    • Accumulate gradient contributions
  • Model-free: does not require knowledge of transition probabilities or reward model.

REINFORCE Algorithm (Pseudocode)


#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true

\begin{algorithm}
\caption{Monte Carlo Policy Gradient (REINFORCE)}
\begin{algorithmic}

  \State Initialize learning rate $\alpha$
  \State Initialize policy parameters $\theta$

  \For{episode = 0 \To MAX\_EPISODE}
    \State Sample trajectory $\tau$ using $\pi_\theta$
    \State $g \gets 0$

    \For{$t = 0 \To T-1$}
      \State Compute return $G_t$
      \State $g \gets g + G_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$
    \EndFor

    \State $\theta \gets \theta + \alpha g$
  \EndFor

\end{algorithmic}
\end{algorithm}
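
For reference, here is a hedged translation of the pseudocode into Python with PyTorch and Gymnasium. The `CartPole-v1` environment, the two-layer network, and the hyper-parameter values are illustrative assumptions, not part of the algorithm itself.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")            # stand-in task (assumption)
policy = nn.Sequential(                  # pi_theta(a | s) as action logits
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # learning rate alpha
gamma = 0.99

for episode in range(300):               # MAX_EPISODE
    # Sample a trajectory tau using pi_theta
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Compute the returns G_t backwards along the trajectory
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Surrogate loss whose gradient is -sum_t G_t * grad log pi_theta(a_t | s_t)
    loss = -(returns * torch.stack(log_probs)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each iteration mirrors the pseudocode: sample one trajectory, compute \(G_t\) for every step, accumulate the gradient through the surrogate loss, and apply a single gradient step to \(\theta\).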