The Counterintuitive Truth About Sample Efficiency
SAC should destroy PPO on sample efficiency. Off-policy algorithms reuse old experience; on-policy algorithms throw it away. The math seems simple: SAC's replay buffer lets it squeeze more learning from fewer environment interactions. So why did PPO beat SAC on 2 out of 5 Gymnasium tasks in my benchmarks?
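To make the "reuse old experience" claim concrete, here is a minimal replay buffer sketch in plain Python (not SAC's actual implementation, just an illustration of the data structure): each stored transition can be sampled into many different minibatches, so one environment step can feed many gradient updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)
    transitions. Off-policy algorithms like SAC draw random minibatches from
    here, reusing each transition across many updates; on-policy algorithms
    like PPO discard their rollouts after one round of updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling without replacement within a batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The same buffer can be sampled thousands of times, which is the mechanical reason off-policy methods *should* need fewer environment interactions.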
I ran both algorithms across HalfCheetah-v4, Ant-v4, Walker2d-v4, Hopper-v4, and Humanoid-v4 using Stable Baselines3 1.8.0 with Gymnasium 0.29.1 and MuJoCo 2.3.7. The results challenged my assumptions about when off-policy methods actually win.
PPO vs SAC: What the Theory Says
PPO (Schulman et al., 2017) is on-policy. After each policy update, all collected trajectories become stale. The clipped surrogate objective keeps updates conservative:
$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
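The objective above can be sketched in a few lines of NumPy. This is an illustrative stand-alone function, not Stable Baselines3's internal loss code; `ratio` is $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ and `advantage` is the estimate $\hat{A}_t$:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective L^CLIP (to be maximized).

    ratio:     per-sample probability ratios r_t(theta), shape (N,)
    advantage: per-sample advantage estimates A_hat_t, shape (N,)
    eps:       clip range epsilon (0.2 is the common default)
    """
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive
    # to move the policy far from the data-collecting policy.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Elementwise min gives a pessimistic (lower) bound on the objective.
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage and a ratio of 1.5, the clip caps the contribution at `1.2 * advantage`: the update gains nothing from pushing the ratio beyond $1+\epsilon$, which is exactly the conservatism the text describes.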