The Fix That Saved 72 Hours of Wasted GPU Time
Your PPO agent hits 500 reward at 800k steps, then crashes to 150 by 1.2M. The policy collapses, value function explodes, and you're staring at a training curve that looks like a cliff dive.
This isn't a bug in your code. It's a PPO-specific failure mode that hits when your clipping range stays constant while your policy converges. I've seen this wreck three separate robotics projects — always past 1M steps, always after initial success. The fix is surgical: decay your clipping range and learning rate together, or watch your agent unlearn everything it knows.
Here's what actually happens when PPO diverges late in training, and the two hyperparameter schedules that prevent it.
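Before digging into the failure mode, here is a minimal sketch of the joint decay the fix calls for. The helper name `linear_schedule` and the start/end values are illustrative assumptions, not values from any specific library; the pattern follows the common convention of mapping "progress remaining" (1.0 at the start of training, 0.0 at the end) to a hyperparameter value:

```python
def linear_schedule(initial_value: float, final_value: float = 0.0):
    """Return a schedule mapping progress remaining (1.0 -> 0.0) to a value."""
    def schedule(progress_remaining: float) -> float:
        # Linear interpolation from initial_value down to final_value.
        return final_value + progress_remaining * (initial_value - final_value)
    return schedule

# Hypothetical starting points; tune for your task.
clip_schedule = linear_schedule(0.2, 0.02)   # clipping range: 0.2 -> 0.02
lr_schedule = linear_schedule(3e-4, 3e-5)    # learning rate: 3e-4 -> 3e-5

# At 80% of training done (20% remaining), both have shrunk together:
print(clip_schedule(0.2))  # 0.056
print(lr_schedule(0.2))    # 8.4e-05
```

The key point is that both schedules share the same progress variable, so the trust region and the step size shrink in lockstep rather than one outrunning the other.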
Why PPO Collapses After Initial Convergence
PPO's core trick is limiting how much the policy can change per update. The clipped surrogate objective:
$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range.
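As a quick sanity check on the objective, here is a NumPy sketch of the clipped surrogate loss (negated, since optimizers minimize). The function name and inputs are illustrative; in a real PPO implementation the ratio would come from log-probabilities under the current and old policies:

```python
import numpy as np

def ppo_clip_loss(ratio, advantages, eps=0.2):
    """Negated clipped surrogate: the pessimistic min of the unclipped
    and clipped terms, averaged over the batch."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() takes the more conservative estimate per sample;
    # negate because we minimize the loss to maximize the objective.
    return -np.mean(np.minimum(unclipped, clipped))

# With eps=0.2: ratio 1.5 is clipped to 1.2, ratio 0.5 stays
# (min picks the unclipped 0.5 term for positive advantage).
print(ppo_clip_loss(np.array([1.5, 0.5]), np.array([1.0, 1.0])))  # -0.85
```

Note how a large ratio with positive advantage is capped at $1+\epsilon$: the gradient through that sample vanishes, which is exactly the mechanism that limits per-update policy change.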