Policy Gradient Algorithms – REINFORCE

Policy Gradient and REINFORCE: Overview

  • Policy-based RL methods optimize the policy directly by maximizing an objective function.
  • REINFORCE updates action probabilities so that actions producing higher returns become more likely.
  • Over many episodes, the policy evolves toward behaviors yielding better performance.
  • REINFORCE is the foundational policy gradient algorithm.

Generalization in RL

  • Algorithms such as REINFORCE make no guarantee of generalization across environments.
  • Policies typically overfit to the environment they are trained in.
  • Recommended readings on RL generalization include Robert Kirk’s survey and Google Research work on policy similarity embeddings.

REINFORCE as a Policy Gradient Method

  • REINFORCE performs gradient ascent on the expected return:
    • Increase log-probability of good actions
    • Decrease log-probability of poor actions
  • This forms the basis of the policy gradient family of algorithms.

Components of the Algorithm

  • Parametrized policy \(\pi_\theta(a \mid s)\).
  • Objective function \(J(\pi_\theta)\) (expected discounted return).
  • Policy gradient update rule for \(\theta\).

Parametrized Policy

  • The policy is a neural network with parameters \(\theta\).
  • Different values of \(\theta\) generally correspond to different policies: \[\pi_{\theta_1}(a \mid s) \neq \pi_{\theta_2}(a \mid s) \quad \text{for } \theta_1 \neq \theta_2.\]
  • A single architecture can represent many possible policies.
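
As a concrete illustration, here is a minimal sketch of a parametrized discrete-action policy in PyTorch; the class name `PolicyNet`, the layer sizes, and the argument names are illustrative assumptions rather than anything fixed by these notes.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a state to a categorical distribution over actions, i.e. pi_theta(a | s)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # theta = the weights and biases of these layers
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one logit per action
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))
```

Sampling from the returned distribution yields an action together with its log-probability, which is exactly the quantity the policy gradient expression below operates on.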

Objective Function

  • RL objective expressed as expected return: \[J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[G_t].\]
  • Gradient-ascent update: \[\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\pi_\theta).\]
  • Alternatively, minimize the loss \(-\!J(\pi_\theta)\) using gradient descent.
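
Since deep-learning libraries implement gradient descent, maximizing \(J\) is done in practice by descending on \(-J\). A toy sketch of that equivalence, using a simple quadratic as a stand-in objective (the quadratic has nothing to do with the RL objective; it only illustrates the sign flip):

```python
import torch

# Maximize J(theta) = -(theta - 3)^2 by minimizing the loss -J(theta).
theta = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

for _ in range(200):
    J = -(theta - 3.0) ** 2   # objective we want to ascend
    loss = -J                 # equivalent loss we descend
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(theta.item())           # converges toward 3.0, the maximizer of J
```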

Policy Gradient Expression

\[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\; Q_{\pi_\theta}(s, a) \right]. \]

  • Contains a score term \(\nabla_\theta \log \pi_\theta(a \mid s)\) weighted by the action value \(Q_{\pi_\theta}(s, a)\).
  • Fundamental identity behind REINFORCE and actor–critic methods.

Log-Derivative (Score Function) Trick

  • We often need: \[\nabla_\theta \mathbb{E}_{x \sim p(\cdot \mid \theta)}[f(x)].\]
  • The log-derivative identity rewrites this as: \[\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}\!\left[f(x)\,\nabla_\theta \log p(x \mid \theta)\right].\]
  • Enables Monte-Carlo estimation without differentiating through sampling.

Log-Derivative Trick: Steps

  1. Start from the expectation: \[\mathbb{E}[f(x)] = \int f(x)\,p(x \mid \theta)\,dx.\]
  2. Differentiate: \[\nabla_\theta \int f(x)p(x\mid\theta)dx = \int f(x)\nabla_\theta p(x\mid\theta)\,dx.\]
  3. Rewrite using division by \(p(x\mid\theta)\): \[\int f(x)p(x\mid\theta) \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}dx.\]
  4. Recognize: \[\nabla_\theta \log p(x\mid\theta) = \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}.\]
  5. Final form: \[\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}\!\left[f(x)\,\nabla_\theta\log p(x\mid\theta)\right].\]
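
The identity is easy to check numerically. In the sketch below, \(x \sim \mathcal{N}(\theta, \sigma^2)\) and \(f(x) = x^2\), so the true gradient is \(\nabla_\theta \mathbb{E}[x^2] = \nabla_\theta(\theta^2 + \sigma^2) = 2\theta\); the choice of distribution and \(f\) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 1.5, 1.0, 200_000

# Sample x ~ N(theta, sigma^2) and evaluate f(x) = x^2.
x = rng.normal(theta, sigma, size=n)
f = x ** 2

# Score function of the Gaussian w.r.t. its mean:
# d/dtheta log p(x | theta) = (x - theta) / sigma^2
score = (x - theta) / sigma ** 2

grad_estimate = np.mean(f * score)  # Monte-Carlo estimate of E[f(x) * grad log p]
grad_true = 2 * theta               # analytic gradient of theta^2 + sigma^2

print(grad_estimate, grad_true)     # the estimate should be close to 3.0
```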

Monte Carlo Policy Gradient Form

  • Replace the action value \(Q_{\pi_\theta}(s_t, a_t)\) with the sampled return \(G_t(\tau)\) and sum over the steps of a trajectory: \[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} G_t(\tau)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]. \]
  • This is the REINFORCE update rule; in code it is usually implemented as the surrogate loss sketched below.
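
A minimal sketch of that surrogate loss, with placeholder tensors standing in for a sampled trajectory (`log_probs` would come from the policy network and `returns` from the return computation in the next section):

```python
import torch

# Placeholder trajectory data; the values are illustrative, not from a real environment.
log_probs = torch.tensor([-0.7, -1.2, -0.4], requires_grad=True)  # log pi_theta(a_t | s_t)
returns = torch.tensor([3.0, 2.0, 1.0])                           # G_t for each step

# The gradient of this loss is -sum_t G_t * grad log pi_theta(a_t | s_t),
# so one step of gradient descent on it ascends the policy gradient.
loss = -(returns * log_probs).sum()
loss.backward()

print(log_probs.grad)  # tensor([-3., -2., -1.]) for this toy setup
```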

Return Definition

\[ G_t(\tau) = \sum_{k=0}^{T-t-1} \gamma^k R_{t+1+k}. \]

  • Episodic Monte-Carlo return.
  • Computes discounted rewards along a trajectory.
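
All of the \(G_t\) for an episode can be computed in a single backward pass over the rewards, using \(G_t = R_{t+1} + \gamma\,G_{t+1}\); a small sketch (the reward list is made up):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every step, where rewards[t] corresponds to R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```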

Discount Factor as a Hyper-Parameter

  • \(\gamma \in [0,1]\) controls the trade-off between myopic and farsighted behavior (see the numeric illustration after this list).
  • Often chosen from \(\{0.9, 0.95, 0.99\}\) or tuned experimentally.
  • High \(\gamma\): emphasizes long-term rewards
  • Low \(\gamma\): emphasizes immediate rewards
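
A quick check of how strongly each common value of \(\gamma\) discounts a reward arriving 50 steps in the future (the horizon of 50 is arbitrary):

```python
for gamma in (0.9, 0.95, 0.99):
    # Weight gamma^k applied to a reward k = 50 steps away
    print(gamma, round(gamma ** 50, 4))
# 0.9 -> 0.0052, 0.95 -> 0.0769, 0.99 -> 0.605
```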

Monte Carlo Approximation

  • The expectation over trajectories is approximated by sampling:
    • Generate trajectory \(\tau \sim \pi_\theta\)
    • Compute returns \(G_t(\tau)\)
    • Accumulate gradient contributions
  • Model-free: does not require knowledge of transition probabilities or reward model.

REINFORCE Algorithm (Pseudocode)


#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true

\begin{algorithm}
\caption{Monte Carlo Policy Gradient (REINFORCE)}
\begin{algorithmic}

  \State Initialize learning rate $\alpha$
  \State Initialize policy parameters $\theta$

  \For{episode = 0 \To MAX\_EPISODE}
    \State Sample trajectory $\tau$ using $\pi_\theta$
    \State $g \gets 0$

    \For{$t = 0 \To T-1$}
      \State Compute return $G_t$
      \State $g \gets g + G_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$
    \EndFor

    \State $\theta \gets \theta + \alpha g$
  \EndFor

\end{algorithmic}
\end{algorithm}
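
For reference, here is a hedged translation of the pseudocode into Python with PyTorch and Gymnasium. The `CartPole-v1` environment, the two-layer network, and the hyper-parameter values are illustrative assumptions, not part of the algorithm itself.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")            # stand-in task (assumption)
policy = nn.Sequential(                  # pi_theta(a | s) as action logits
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # learning rate alpha
gamma = 0.99

for episode in range(300):               # MAX_EPISODE
    # Sample a trajectory tau using pi_theta
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Compute the returns G_t backwards along the trajectory
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Surrogate loss whose gradient is -sum_t G_t * grad log pi_theta(a_t | s_t)
    loss = -(returns * torch.stack(log_probs)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each iteration mirrors the pseudocode: sample one trajectory, compute \(G_t\) for every step, accumulate the gradient through the surrogate loss, and apply a single gradient step to \(\theta\).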