When Chaos Wins: Adding Noise Improved My Snake AI's Stability

Greetings all! This post continues the series where I'm building Rainbow DQN one component at a time on Snake. The first post covered encoding, the second covered memory, and the third covered PER hurting performance. This one is about a truly WTF?! moment I stumbled into while evaluating the models.

When you evaluate a model that uses noisy networks, you turn the noise off. You're not training, so why would you keep the exploration noise active? You want the clean, deterministic policy: the model's best guess, with no randomness. That's what you do; it's basically an axiom in machine learning.

So I did just that. And the evaluation scores were significantly worse than training. Not slightly. Significantly.

What Noisy Networks Do (Quick Recap)

Standard DQN uses epsilon-greedy exploration: pick a random action X% of the time, decay that percentage over training. Simple, dumb, works.
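For contrast, the whole of epsilon-greedy fits in a few lines. This is just a sketch; `q_net`, `state`, and `num_actions` stand in for whatever your setup uses:

```python
import random

def epsilon_greedy_action(q_net, state, epsilon, num_actions):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return int(q_net(state).argmax())
```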

Noisy networks replace this with something smarter. Each linear layer in the network gets learnable noise parameters (sigma weights). During training, the network adds noise to its own weights, producing slightly different outputs each forward pass. The network learns how much noise to apply. Early in training, sigma values are high and the agent explores broadly. As training progresses and the agent gets more confident, sigma can shrink. For evaluation, you set sigma to zero. Clean output. Textbook.
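For anyone who hasn't implemented one, here's roughly what a factorised noisy linear layer looks like in PyTorch. This is a minimal sketch following the Fortunato et al. formulation, not my exact training code; the `noisy` flag in `forward` is what I mean below by turning the noise on or off.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised-Gaussian noisy layer (Fortunato et al., 2018). Sketch only."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and sigmas for weights and biases.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers, resampled via reset_noise() during training.
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.reset_parameters(sigma_init)
        self.reset_noise()

    def reset_parameters(self, sigma_init):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised noise: one vector per input and output dimension.
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x, noisy=True):
        if noisy:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # "deterministic" mode: mean weights only
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```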

The Evaluation Gap

Running evaluations across multiple training checkpoints, I noticed something was off. Not subtly off. The deterministic eval scores were wildly inconsistent.

Some checkpoints averaged 78. Others averaged 18. The training curve at these same points? Perfectly stable. The model was learning consistently the whole time, but deterministic evaluation was telling a completely different story depending on which checkpoint I happened to evaluate.

First instinct: it's a bug. Checked the eval pipeline, checked the checkpoint loading, checked the environment seeding. Everything was fine. The model genuinely performed this erratically when noise was turned off. So if it's not a bug... what is it?

The Bimodal Trap

The ep450K checkpoint was where it got properly weird. Deterministic eval produced a strongly bimodal distribution: roughly 25% of episodes scored near zero, while 75% scored above 80. The average landed at 59, but that number is completely meaningless when your distribution is two separate peaks with a canyon between them.

So what's going on? The deterministic policy has traps. Specific game states where the mean-weight Q-values for two or more actions are nearly identical. Without noise, the agent picks the same action every single time it hits that state. If that action happens to be the wrong one? Stuck. It loops, it crashes, it scores zero. 25% of episodes starting from certain initial states hit these traps every time.

Now. Same checkpoint, same evaluation seeds, noise turned back on:

The bimodal failure mode vanished. Gone. The p25 jumped from 2 to 59. The average climbed from 59 to 73. The standard deviation dropped from 42 to 26. The noise nudges the agent out of those deterministic traps. Not randomly, not chaotically, but because the learned noise provides just enough variation in the Q-values to stop the agent getting stuck in a degenerate action loop.

The noise isn't exploration overhead left over from training. It's a load-bearing part of the learned policy.

Deterministic vs. Chaotic

This wasn't a one-off. The pattern held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval at every single point. Lower variance, higher consistency, fewer catastrophic zero-score episodes. The sigma values aren't residual training artifacts waiting to be zeroed out. They're doing actual work.

Why Snake Makes This Worse

Snake has a property that makes deterministic policies especially vulnerable to traps: a single wrong turn can be immediately fatal.

Picture a snake at length 100+ navigating a tight corridor of its own body. The optimal action and the second-best action might differ by a tiny margin in Q-value. Deterministic policy picks the same one every time. If that action leads into a dead end three moves later, the agent dies. Every time. From that state. Noise provides enough Q-value perturbation to occasionally pick the second-best action, which might be the one that actually survives.
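A toy illustration of the mechanism, with made-up Q-values rather than numbers from the actual agent: two actions separated by a tiny margin, and a small perturbation flips the argmax a fraction of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.000, 0.998, 0.40, -0.20])  # actions 0 and 1 nearly tied

# Deterministic: the same action every single time this state comes up.
print(int(q_values.argmax()))  # always 0

# With small perturbations (standing in for the learned noise), the
# near-tied second action occasionally wins, breaking the loop.
flips = sum(
    int((q_values + rng.normal(0, 0.01, size=q_values.shape)).argmax()) != 0
    for _ in range(1000)
)
print(f"argmax flipped in {flips}/1000 perturbed passes")
```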

In environments with more breathing room (wide open Atari levels, games where one wrong move doesn't kill you), deterministic policies don't develop these traps as severely. The longer the snake gets, the more traps exist, and the more the noise matters.

What This Means In Practice

If you're using noisy networks and evaluating with mean weights, your evaluation scores may not just be noisy. They can be structurally misleading. The deterministic policy can have failure modes that simply don't exist in the trained stochastic policy.

Before assuming deterministic eval shows the "true" performance of your agent, run a stochastic eval comparison. If the scores diverge, your agent has learned to depend on its noise.
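Here's a sketch of the comparison I mean. It assumes a Gymnasium-style `env` and an agent whose network exposes the `noisy` flag and `reset_noise()` from the layer sketch above; `agent.q_values` is a placeholder for however you get Q-values out of your model.

```python
import numpy as np
import torch

@torch.no_grad()
def evaluate(agent, env, episodes=100, noisy=False):
    """Run greedy episodes; optionally keep the learned noise active."""
    scores = []
    for _ in range(episodes):
        state, _ = env.reset()
        done, score = False, 0.0
        while not done:
            if noisy:
                agent.reset_noise()  # fresh noise sample each step
            q = agent.q_values(state, noisy=noisy)
            state, reward, terminated, truncated, _ = env.step(int(q.argmax()))
            done = terminated or truncated
            score += reward
        scores.append(score)
    s = np.array(scores)
    return {"mean": s.mean(), "p25": np.percentile(s, 25), "std": s.std()}

# det = evaluate(agent, env, noisy=False)
# sto = evaluate(agent, env, noisy=True)
# If these diverge, the policy is leaning on its noise.
```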

Honest Caveats

Single architecture, single game. This was observed on C51 + dueling + noisy on Snake. Games with more forgiving state dynamics may not exhibit the same bimodal failure mode.

Noise can grow too large. At one late-stage checkpoint, sigma values had grown large enough that stochastic eval actually dropped below deterministic. There's a Goldilocks zone where noise is productive. Past that zone it becomes destructive. The finding is not "always evaluate with noise." The finding is "don't assume deterministic eval is automatically better."
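One cheap guardrail is logging sigma magnitudes over training, so you can see when the noise drifts out of that zone. A sketch, assuming the `NoisyLinear` class from earlier:

```python
def mean_sigma(model):
    """Average |sigma| across all noisy layers: a rough 'how noisy is the policy' gauge."""
    sigmas = [
        m.weight_sigma.abs().mean().item()
        for m in model.modules()
        if isinstance(m, NoisyLinear)
    ]
    return sum(sigmas) / len(sigmas) if sigmas else 0.0
```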

Training scores remain the most reliable metric. For the ablation study, the primary comparison is a training window average computed identically across all runs, which sidesteps the whole question entirely (a sketch of what I mean is below).
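The metric itself is nothing fancy: a fixed-size rolling mean over recent training episode scores, computed the same way for every run. The window size here is a placeholder, not the value from my experiments.

```python
from collections import deque

class WindowAverage:
    """Rolling mean over the last `window` training episode scores."""
    def __init__(self, window=1000):
        self.scores = deque(maxlen=window)

    def add(self, score):
        self.scores.append(score)

    def value(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```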

If you've observed similar eval divergence with noisy networks, or if you have environments where deterministic eval reliably matches training performance, I'd like to hear about it in the comments.


This work is part of ongoing research; the findings are planned for submission as a peer-reviewed paper.

References

Peer-Reviewed

Fortunato et al. (2018) - "Noisy Networks for Exploration" - ICLR 2018. arXiv: 1706.10295

Hessel et al. (2018) - "Rainbow: Combining Improvements in Deep Reinforcement Learning" - AAAI 2018. DOI: 10.1609/aaai.v32i1.11796

Bellemare et al. (2017) - "A Distributional Perspective on Reinforcement Learning" - ICML 2017. arXiv: 1707.06887
