DEV Community

TildAlice

Posted on • Originally published at tildalice.io

PPO Training Diverges After 1M Steps: Clipping & LR Fixes

The Fix That Saved 72 Hours of Wasted GPU Time

Your PPO agent hits 500 reward at 800k steps, then crashes to 150 by 1.2M. The policy collapses, value function explodes, and you're staring at a training curve that looks like a cliff dive.

This isn't a bug in your code. It's a PPO-specific failure mode that hits when your clipping range stays constant while your policy converges. I've seen this wreck three separate robotics projects — always past 1M steps, always after initial success. The fix is surgical: decay your clipping range and learning rate together, or watch your agent unlearn everything it knows.

Here's what actually happens when PPO diverges late in training, and the two hyperparameter schedules that prevent it.
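Before digging into the failure mode, here is a minimal sketch of the joint decay the article advocates: shrink the clipping range and the learning rate on the same schedule. The linear decay, the endpoints, and the function names are illustrative assumptions, not values from the article.

```python
# Illustrative sketch: decay PPO's clip range and learning rate together.
# Endpoints and the linear shape are assumptions for demonstration.

def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def ppo_schedules(step, total_steps=2_000_000):
    """Return (clip_range, learning_rate) for the current training step."""
    clip = linear_decay(0.20, 0.05, step, total_steps)  # epsilon shrinks as the policy converges
    lr = linear_decay(3e-4, 1e-5, step, total_steps)    # learning rate decays in lockstep
    return clip, lr

print(ppo_schedules(0))          # start of training: (0.2, 0.0003)
print(ppo_schedules(1_000_000))  # halfway: (0.125, 0.000155)
```

The point of decaying both together is that a fixed epsilon allows updates that are large relative to a nearly converged policy; tying both knobs to the same step counter keeps the effective update size shrinking.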

Photo by Ahmet Kurt on Pexels: a teacher and student exchanging a high five during a classroom session in Yalova, Türkiye.

Why PPO Collapses After Initial Convergence

PPO's core trick is limiting how much the policy can change per update. The clipped surrogate objective enforces that limit:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range.
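The objective above can be sketched in a few lines of NumPy. This is a minimal, standalone illustration of the clipping math, not the article's training code:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """L^CLIP: elementwise min of the unclipped and clipped terms, averaged."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the min makes the objective a pessimistic (lower) bound,
    # so the policy gains nothing from moving the ratio outside the clip band.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon = 1.2:
print(clipped_surrogate(np.array([1.5]), np.array([1.0])))  # 1.2
```

Note the asymmetry: with a positive advantage the clip caps the gain, but with a negative advantage the `min` keeps the worse (more negative) term, which is exactly the pessimism that makes PPO's updates conservative.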


