The Counterintuitive Truth About Sample Efficiency
SAC should destroy PPO on sample efficiency. Off-policy algorithms reuse old experience; on-policy algorithms throw it away. The math seems simple: SAC's replay buffer lets it squeeze more learning from fewer environment interactions. So why did PPO beat SAC on 2 out of 5 Gymnasium tasks in my benchmarks?
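To make the "reuse old experience" claim concrete, here is a minimal replay buffer sketch in plain Python (not SAC's actual implementation, just an illustration of the data structure): each stored transition can be sampled into many different minibatches, so one environment step can feed many gradient updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)
    transitions. Off-policy algorithms like SAC draw random minibatches from
    here, reusing each transition across many updates; on-policy algorithms
    like PPO discard their rollouts after one round of updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling without replacement within a batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The same buffer can be sampled thousands of times, which is the mechanical reason off-policy methods *should* need fewer environment interactions.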
I ran both algorithms across HalfCheetah-v4, Ant-v4, Walker2d-v4, Hopper-v4, and Humanoid-v4 using Stable Baselines3 1.8.0 with Gymnasium 0.29.1 and MuJoCo 2.3.7. The results challenged my assumptions about when off-policy methods actually win.
PPO vs SAC: What the Theory Says
PPO (Schulman et al., 2017) is on-policy. After each policy update, all collected trajectories become stale. The clipped surrogate objective keeps updates conservative:
$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
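The objective above can be sketched in a few lines of NumPy. This is an illustrative stand-alone function, not Stable Baselines3's internal loss code; `ratio` is $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ and `advantage` is the estimate $\hat{A}_t$:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective L^CLIP (to be maximized).

    ratio:     per-sample probability ratios r_t(theta), shape (N,)
    advantage: per-sample advantage estimates A_hat_t, shape (N,)
    eps:       clip range epsilon (0.2 is the common default)
    """
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive
    # to move the policy far from the data-collecting policy.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Elementwise min gives a pessimistic (lower) bound on the objective.
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage and a ratio of 1.5, the clip caps the contribution at `1.2 * advantage`: the update gains nothing from pushing the ratio beyond $1+\epsilon$, which is exactly the conservatism the text describes.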