Why PPO Dominates Sparse Rewards (But Fails at Sample Reuse)
PPO converges in 500K steps on robotic manipulation tasks where SAC stalls at random policy for 2M steps. That's not a typo.
The on-policy vs off-policy divide isn't about algorithmic elegance — it's about sample efficiency vs stability tradeoffs that silently dictate which algorithm survives real-world deployment. Most tutorials gloss over this: they'll tell you SAC reuses old data (good for sample efficiency!) and PPO doesn't (bad!), then vaguely conclude "it depends." But the reality is messier. I've seen SAC outperform PPO by 3x on dense-reward continuous control, then completely fail on the exact same environment with sparse rewards. The difference wasn't the algorithm — it was the reward structure.
Here's the core tension: on-policy methods like PPO force fresh data collection every update, burning compute but staying stable. Off-policy methods like SAC hoard experience in replay buffers, reusing samples thousands of times — incredible sample efficiency, until the policy drifts too far and the old data becomes toxic. Mechanical Keyboard Wrist Rest helped me survive the 14-hour hyperparameter sweep that revealed this pattern.
Let's settle this with code and benchmarks.
Continue reading the full article on TildAlice
Top comments (0)