The Bug That Only Shows Up After 100K Steps
Your PPO agent trains perfectly for 100,000 steps. Then suddenly, reward curves nosedive. Episode lengths go haywire. You check your hyperparameters—fine. You check your reward function—fine. You restart training from scratch and it happens again at almost the exact same step count.
This isn't a learning problem. It's a bug in how Stable Baselines3 handles parallel environment resets when you're using VecEnv wrappers.
I hit this training a robotic manipulation policy in MuJoCo. The agent learned to pick up objects cleanly, then around 120K steps everything fell apart. Took me two days to realize the environments were desyncing—some were resetting when they shouldn't, others weren't resetting when they should. The neural network was seeing nonsensical state transitions and trying to learn from them.
Why Parallel Environments Desync
Continue reading the full article on TildAlice

Top comments (0)