Why A2C Often Trains Faster Than PPO (Until It Doesn't)
Most RL tutorials pick PPO as the default on-policy algorithm without questioning it. The narrative goes: PPO is stable, sample-efficient, and industry-proven. But when you benchmark it against A2C on CartPole-v1, something weird happens — A2C hits the 500-reward threshold in half the timesteps.
This wasn't what I expected. PPO's clipped surrogate objective is supposed to make better use of each batch through multiple epochs. A2C does a single gradient step per batch and moves on. Yet in practice, A2C converged in ~25k timesteps while PPO needed 50k+ with default Stable Baselines3 hyperparameters.
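To make the contrast concrete, here is a minimal sketch of the clipped surrogate loss PPO optimizes over multiple epochs, versus the plain policy-gradient loss A2C takes a single step on. This is an illustration in NumPy with names of my own choosing, not the Stable Baselines3 implementation:

```python
import numpy as np

def a2c_loss(log_prob, advantage):
    """A2C: vanilla policy-gradient loss, one gradient step per batch."""
    return -(log_prob * advantage).mean()

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO: clipped surrogate, safe to optimize for several epochs.

    ratio = pi_new(a|s) / pi_old(a|s); clipping caps how far the
    new policy can profitably drift from the one that collected the data.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# When the ratio is 1 (first epoch, policies identical), both losses
# push in the same direction. Once the ratio exceeds 1 + eps, the clip
# zeroes the gradient for that sample, so extra epochs add less and less.
print(ppo_clip_loss(np.array([1.0]), np.array([1.0])))  # -1.0
print(ppo_clip_loss(np.array([2.0]), np.array([1.0])))  # -1.2 (clipped at 1.2)
```

The extra epochs only pay off when the clipping still leaves useful gradient; on an easy task like CartPole, A2C's one fast step per batch can simply outrun the overhead.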
The answer lies in what "sample efficiency" actually means for on-policy methods. Spoiler: it's not just about reusing data.
The Benchmark Setup: CartPole-v1 With Learning Curves