DEV Community

TildAlice

Posted on • Originally published at tildalice.io

PPO vs A2C: CartPole Training Speed & Sample Efficiency

Why A2C Often Trains Faster Than PPO (Until It Doesn't)

Most RL tutorials pick PPO as the default on-policy algorithm without questioning it. The narrative goes: PPO is stable, sample-efficient, and industry-proven. But when you benchmark it against A2C on CartPole-v1, something weird happens — A2C hits the 500-reward threshold in half the timesteps.

This wasn't what I expected. PPO's clipped surrogate objective is supposed to make better use of each batch through multiple epochs. A2C does a single gradient step per batch and moves on. Yet in practice, A2C converged in ~25k timesteps while PPO needed 50k+ with default Stable Baselines3 hyperparameters.
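
The clipping mechanism is easy to see in isolation. Here is a minimal NumPy sketch (not the article's code) comparing the two policy losses: A2C's plain advantage-weighted log-probability, and PPO's clipped surrogate, whose gradient vanishes once the probability ratio drifts past 1 ± ε:

```python
import numpy as np

def a2c_policy_loss(log_probs, advantages):
    """A2C: one plain policy-gradient step per batch, then the batch is discarded."""
    return -np.mean(log_probs * advantages)

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO: clip the probability ratio so repeated epochs on the same batch
    cannot push the policy far from the one that collected the data."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

adv = np.array([1.0])          # positive advantage: we want this action more often
old = np.log(np.array([0.5]))
new = np.log(np.array([0.9]))  # ratio = 1.8, well past 1 + 0.2
print(ppo_clipped_loss(new, old, adv))  # capped at (1 + eps) * adv: no further gradient
```

Once every ratio in a batch is clipped, additional PPO epochs are wasted compute. A2C spends that compute collecting fresh data instead, which on a fast, simple environment like CartPole can be the better trade.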

The answer lies in what "sample efficiency" actually means for on-policy methods. Spoiler: it's not just about reusing data.

Photo by cottonbro studio on Pexels

The Benchmark Setup: CartPole-v1 With Learning Curves
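
The full setup lives in the original post, but a comparable benchmark can be sketched with Stable Baselines3 defaults and the timestep budgets quoted above. This is a hypothetical reproduction, not the author's script, and `solved_at` is an illustrative helper:

```python
def solved_at(episode_returns, threshold=500.0, window=100):
    """Hypothetical helper: first episode index where the trailing-`window`
    mean return reaches `threshold`, or None if it never does."""
    for i in range(window, len(episode_returns) + 1):
        if sum(episode_returns[i - window:i]) / window >= threshold:
            return i
    return None

if __name__ == "__main__":
    # Requires: pip install "stable-baselines3[extra]" gymnasium
    import gymnasium as gym
    from stable_baselines3 import A2C, PPO
    from stable_baselines3.common.monitor import Monitor

    # Timestep budgets taken from the numbers quoted in the article.
    for algo, budget in ((A2C, 25_000), (PPO, 50_000)):
        env = Monitor(gym.make("CartPole-v1"))  # records per-episode returns
        model = algo("MlpPolicy", env, verbose=0)
        model.learn(total_timesteps=budget)
        print(algo.__name__, "solved at episode:",
              solved_at(env.get_episode_rewards()))
```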


Continue reading the full article on TildAlice
