Why A2C Often Trains Faster Than PPO (Until It Doesn't)
Most RL tutorials pick PPO as the default on-policy algorithm without questioning it. The narrative goes: PPO is stable, sample-efficient, and industry-proven. But when you benchmark it against A2C on CartPole-v1, something weird happens — A2C hits the 500-reward threshold in half the timesteps.
This wasn't what I expected. PPO's clipped surrogate objective is supposed to make better use of each batch through multiple epochs. A2C does a single gradient step per batch and moves on. Yet in practice, A2C converged in ~25k timesteps while PPO needed 50k+ with default Stable Baselines3 hyperparameters.
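To make the contrast concrete, here is a minimal sketch of the clipped surrogate loss PPO optimizes over multiple epochs, versus the plain policy-gradient loss A2C takes a single step on. This is an illustration in NumPy with names of my own choosing, not the Stable Baselines3 implementation:

```python
import numpy as np

def a2c_loss(log_prob, advantage):
    """A2C: vanilla policy-gradient loss, one gradient step per batch."""
    return -(log_prob * advantage).mean()

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO: clipped surrogate, safe to optimize for several epochs.

    ratio = pi_new(a|s) / pi_old(a|s); clipping caps how far the
    new policy can profitably drift from the one that collected the data.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# When the ratio is 1 (first epoch, policies identical), both losses
# push in the same direction. Once the ratio exceeds 1 + eps, the clip
# zeroes the gradient for that sample, so extra epochs add less and less.
print(ppo_clip_loss(np.array([1.0]), np.array([1.0])))  # -1.0
print(ppo_clip_loss(np.array([2.0]), np.array([1.0])))  # -1.2 (clipped at 1.2)
```

The extra epochs only pay off when the clipping still leaves useful gradient; on an easy task like CartPole, A2C's one fast step per batch can simply outrun the overhead.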
The answer lies in what "sample efficiency" actually means for on-policy methods. Spoiler: it's not just about reusing data.
The Benchmark Setup: CartPole-v1 With Learning Curves