PPO and GRPO Agents in Breakout Environment

Overview

In this assignment you will extend an existing Proximal Policy Optimization (PPO) implementation for Atari Breakout to develop and evaluate a variant known as Group Relative Policy Optimization (GRPO). GRPO was recently proposed as a simplified policy-gradient method for large-scale language-model training, but its underlying mechanism is broadly applicable. Your task is to adapt it to this classic Atari benchmark, implement it, and conduct controlled experiments comparing PPO and GRPO.

You will work from the existing repository: clone it and copy its contents into your assignment-6 folder.

https://github.com/pantelis/breakout-ppo-agent

This codebase provides a working PPO agent with vectorized environments, logging, and training scripts. You will introduce the GRPO algorithm with minimal but conceptually meaningful modifications.

The assignment has four parts: (i) baseline reproduction, (ii) theoretical derivation, (iii) GRPO implementation, and (iv) experiments and analysis.

Task 1: Baseline PPO Reproduction

  1. Install and run the provided PPO agent.

  2. Familiarize yourself with the training loop, model definition, and advantage/value computation.

  3. Produce a short PPO baseline run and verify expected behavior.

  4. Set up the environment. Clone the repository, create the Python environment using the provided instructions, and verify that agent_vectorized.py runs on your machine. Confirm that training logs appear in TensorBoard.

  5. Run a baseline PPO experiment. Use the existing PPO implementation to run a shortened training schedule (for example, 1–2 hours or a fixed number of environment steps appropriate to your compute). Save the training curves and note the approximate return achieved. Deliver a figure of mean episodic return vs. environment steps for PPO.
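If you rely on the repository's TensorBoard logging, the sketch below shows one way to turn the logged scalars into the required figure. The run directory and the scalar tag name are assumptions; adjust both to whatever agent_vectorized.py actually writes.

```python
# Sketch: plot mean episodic return vs. environment steps from TensorBoard event files.
# The log directory and scalar tag below are hypothetical; match them to the repo's logging.
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

LOG_DIR = "runs/ppo_breakout_baseline"   # hypothetical run directory
RETURN_TAG = "charts/episodic_return"    # hypothetical scalar tag

acc = EventAccumulator(LOG_DIR)
acc.Reload()                             # load all event files in the directory

events = acc.Scalars(RETURN_TAG)         # ScalarEvent records with .step and .value
steps = [e.step for e in events]
returns = [e.value for e in events]

plt.figure(figsize=(6, 4))
plt.plot(steps, returns, label="PPO")
plt.xlabel("Environment steps")
plt.ylabel("Mean episodic return")
plt.title("PPO baseline on Breakout")
plt.legend()
plt.tight_layout()
plt.savefig("ppo_baseline_return.png", dpi=150)
```

The same approach extends directly to Task 3: load a second run directory for the GRPO run and overlay both curves on the same axes.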


Task 2a: GRPO Paper

Study the GRPO paper (introduced by Shao et al., 2024, in DeepSeekMath).

GRPO eliminates the learned value function and instead constructs group-relative advantages.

Suppose we collect a group of \(G\) trajectories under the same policy. For each trajectory \(i\), let:

\[ R_i = \sum_t r_{i,t}, \quad \mu = \frac{1}{G}\sum_{i=1}^G R_i, \quad \sigma = \sqrt{\frac{1}{G}\sum_{i=1}^G (R_i - \mu)^2 + \varepsilon}. \]

Define the group-normalized advantage:

\[ A_i = \frac{R_i - \mu}{\sigma}. \]

Substitute \(A_i\) into the clipped PPO objective and remove the value loss term to obtain the GRPO surrogate loss.
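For reference, the clipped PPO surrogate you are substituting into can be written per time step with probability ratio \(\rho_t(\theta)\) and clipping parameter \(\epsilon_{\text{clip}}\) (distinct from the small \(\varepsilon\) used above for numerical stability):

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(\rho_t(\theta)\,A_t,\ \operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}}\big)\,A_t\Big)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}. \]

In the GRPO surrogate, \(A_t\) is replaced by the group-normalized \(A_i\) of the trajectory that contains time step \(t\).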


Task 2b: Implementing GRPO in the PPO Breakout Agent

Your GRPO implementation must include the following changes. The repository structure centers on agent_vectorized.py; most of your modifications will occur there.

  1. Remove the value head from the policy network, or leave it in place but ensure it is unused during training. The training step should make no reference to predicted values or value loss.

  2. Modify the rollout buffer to store complete trajectories, including:

    • Observations,
    • Actions,
    • Rewards,
    • Episode boundaries.

    Introduce a grouping mechanism: each update should operate on groups of \(G\) completed episodes. You may use the vectorized environments directly (e.g., \(G = 8\)) or regroup trajectories explicitly.

  3. Compute group-relative advantages. For each group:

    • Compute \(R_i\), \(\mu\), and \(\sigma\),
    • Compute \(A_i = (R_i - \mu)/\sigma\),
    • Assign this \(A_i\) to all time steps of trajectory \(i\).
  4. Implement the GRPO loss (a minimal sketch follows this list):

    • Use the PPO clipped surrogate form but replace standard advantages with the group-relative \(A_i\).
    • Remove the critic loss term entirely.
    • Retain entropy regularization.
    • You do not need to include a KL penalty term.
  5. Add a command-line argument or configuration flag to choose between PPO and GRPO at runtime.
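The sketch below illustrates items 3 and 4 under assumed tensor shapes; every name is illustrative rather than part of the repository's API, and it is a starting point rather than a drop-in solution.

```python
# Sketch of group-relative advantages and the GRPO loss (no value/critic term).
# Assumptions: per-episode returns, per-timestep log-probs, and entropies are already
# collected as flat tensors; `episode_ids` maps each time step to its episode index.
import torch

def group_relative_advantages(episode_returns: torch.Tensor,
                              episode_ids: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """episode_returns: (G,) total return R_i per episode in the group.
    episode_ids: (T,) long tensor, index in [0, G) of the episode owning each time step.
    Returns a (T,) tensor with A_i broadcast to every step of trajectory i."""
    mu = episode_returns.mean()
    sigma = torch.sqrt(((episode_returns - mu) ** 2).mean() + eps)
    adv_per_episode = (episode_returns - mu) / sigma            # (G,)
    return adv_per_episode[episode_ids]                         # (T,)

def grpo_loss(new_logprobs: torch.Tensor,
              old_logprobs: torch.Tensor,
              advantages: torch.Tensor,
              entropy: torch.Tensor,
              clip_coef: float = 0.1,
              ent_coef: float = 0.01) -> torch.Tensor:
    """Clipped surrogate with group-relative advantages; the value loss is removed."""
    ratio = torch.exp(new_logprobs - old_logprobs)              # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    return policy_loss - ent_coef * entropy.mean()              # entropy bonus retained
```

For item 5, a configuration switch such as a hypothetical `--algo {ppo,grpo}` command-line argument is enough to select between the two loss computations at runtime.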


Task 3: Experimental Evaluation

Evaluate the empirical behavior of GRPO relative to PPO. Use a fixed group size (e.g., \(G = 8\)) and compare learning curves for PPO and GRPO under the same budget of environment steps. Produce a figure of mean episodic return vs. environment steps for both algorithms.

Note: The PPO implementation is sequential and may take a long time to train. You may need to reduce the number of environment steps or training epochs for practical experimentation. For reference, 40K steps with 8 environments took about 14 hours on a single (underutilized) 16 GB GPU with 8 CPU cores and 128 GB of RAM.