
TildAlice

Posted on • Originally published at tildalice.io

PPO Entropy Decay Bug: Why Exploration Dies at 500K Steps

The Bug That Killed My Agent at Step 523,000

Your PPO agent trains beautifully for 500,000 steps, hits 80% win rate, then flatlines. The policy stops exploring, gets stuck repeating the same suboptimal actions, and never recovers. You check the value loss, policy loss, KL divergence—everything looks normal. But if you plot the entropy coefficient over time, you'll see it decayed to 0.0001 while your entropy bonus weight stayed at 0.01. The agent stopped exploring because the coefficient that controls exploration vanished.
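The quickest way to catch this is to log the live coefficient alongside the loss terms. Here's a minimal sketch using a Stable-Baselines3 callback; the metric name `train/ent_coef_live` is my own choice, and it assumes something elsewhere in your code is mutating `model.ent_coef` on a schedule:

```python
from stable_baselines3.common.callbacks import BaseCallback

class EntCoefLogger(BaseCallback):
    """Record the entropy coefficient PPO is actually using,
    so a decaying schedule shows up as a curve in TensorBoard."""

    def _on_step(self) -> bool:
        # PPO reads its entropy bonus weight from `self.ent_coef` at every
        # update, so whatever a scheduler has done to it is visible here.
        self.logger.record("train/ent_coef_live", float(self.model.ent_coef))
        return True

# Usage: model.learn(total_timesteps=1_000_000, callback=EntCoefLogger())
```

If that curve flatlines near zero while your policy entropy is still collapsing, you're looking at this bug, not a tuning problem.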

This isn't a hyperparameter tuning problem. It's a silent implementation bug in how most PPO codebases handle entropy decay.

I hit this training a MuJoCo Ant-v4 agent (Gymnasium 0.29.1, Stable-Baselines3 2.2.1). The agent learned to walk forward, then stopped trying new gaits entirely. Training curves showed the policy entropy $H(\pi)$ dropping from 2.1 nats to 0.03 nats between steps 400K and 600K, but the entropy coefficient scheduler had already bottomed out at step 520K. Once the coefficient hit its minimum, the entropy bonus term in the loss function became negligible:

$$L_{total} = L_{clip} + c_1 L_{value} - c_{ent} H(\pi)$$

When the schedule bottoms out at a multiplier of 0.0001 and your base weight is 0.01, the effective coefficient is $c_{ent} = 0.01 \times 0.0001 = 10^{-6}$. At that point, the policy gradient overwhelmingly favors exploitation. The agent locks into a local optimum and stops trying new actions.
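To see how fast a multiplicative schedule gets you there, here's a back-of-the-envelope sketch. The per-update decay factor and rollout size are assumptions I picked so the numbers land near the 520K-step collapse described above, not defaults from any library:

```python
# Illustrative sketch of the failure mode: an entropy-coefficient multiplier
# decayed once per PPO update, with no floor on the *effective* bonus.

BASE_ENT_COEF = 0.01      # entropy bonus weight in the loss
DECAY_PER_UPDATE = 0.964  # assumed schedule: ~3.6% decay per update
STEPS_PER_UPDATE = 2048   # assumed rollout size (n_steps * n_envs)

multiplier = 1.0
for update in range(1, 500):
    multiplier *= DECAY_PER_UPDATE
    step = update * STEPS_PER_UPDATE
    effective = BASE_ENT_COEF * multiplier
    if effective < 1e-6:
        print(f"entropy bonus effectively gone at step {step:,}: {effective:.2e}")
        break
```

With these numbers the bonus drops below $10^{-6}$ around step 516,000, roughly 250 updates in. Any decay schedule that doesn't floor the effective bonus ends the same way; only the step count changes.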


Continue reading the full article on TildAlice
