Phase transitions in neural network training: what your loss curve isn't telling you

#machinelearning #deeplearning #ai #neuralnetworks

The loss curve is the standard view into a training run. Training loss goes down (good) or stops going down (bad), or validation loss turns back up (overfit). This mental model is useful but incomplete. There are at least two well-documented phenomena that happen inside training runs that the loss curve either hides or actively misrepresents.

Both are phase transitions. Both have practical implications for how you train and when you stop.

1. Double descent

The classical bias-variance tradeoff predicts a U-shaped test error curve: as model complexity increases, you first underfit, then hit a sweet spot, then overfit. Test error goes down, then back up.

For decades this was the mental model. Regularize. Don't overparameterize. Stay on the left side of the curve.

Then experiments on modern, heavily overparameterized models documented something inconvenient: the U-shape is only part of the picture. Test error tends to peak near the interpolation threshold, the point where the model has just enough capacity to fit all the training data exactly. If you keep increasing model size past that point, test error sometimes comes back down again. Not immediately. But eventually, larger models can generalize better than the models at the "ideal" complexity. Belkin and colleagues gave the effect its name in 2019.

This is the double descent curve. The interpolation threshold is a phase boundary. Models on the left side behave classically. Models on the right side behave differently — the overparameterized regime has its own generalization dynamics.
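You can reproduce the shape on a laptop. What follows is a minimal sketch, not a neural network: plain linear regression on Gaussian features, where "model size" is the number of features p the model is allowed to see, and the fit is the minimum-norm least-squares solution. The function name, the sizes, and the noise level are all assumptions picked for illustration.

```python
import numpy as np

def double_descent_curve(n_train=40, n_test=500, n_total=200,
                         widths=(10, 40, 200), noise=0.5,
                         trials=20, seed=0):
    """Median test MSE of min-norm least squares using the first p features.

    Toy setup (an assumption, not taken from any paper's code): the target
    depends on all n_total features, the model only sees the first p.
    The interpolation threshold is p == n_train.
    """
    rng = np.random.default_rng(seed)
    errors = {p: [] for p in widths}
    for _ in range(trials):
        beta = rng.normal(size=n_total) / np.sqrt(n_total)
        X_tr = rng.normal(size=(n_train, n_total))
        X_te = rng.normal(size=(n_test, n_total))
        y_tr = X_tr @ beta + noise * rng.normal(size=n_train)
        y_te = X_te @ beta + noise * rng.normal(size=n_test)
        for p in widths:
            # pinv gives ordinary least squares when p < n_train and the
            # minimum-norm interpolating solution when p >= n_train.
            w = np.linalg.pinv(X_tr[:, :p]) @ y_tr
            errors[p].append(np.mean((X_te[:, :p] @ w - y_te) ** 2))
    return {p: float(np.median(e)) for p, e in errors.items()}

if __name__ == "__main__":
    for p, err in double_descent_curve().items():
        print(f"p={p:4d}  median test MSE={err:.2f}")
```

With 40 training points, the model with exactly 40 features should show by far the worst test error, and both the smaller and the much larger model should sit well below it. Sweep more values in `widths` to see the full curve.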

The practical implication: "don't overparameterize" may be wrong advice for large models. The sweet spot you're optimizing for in small-model regimes might not exist in the same way at scale. This is consistent with what scaling laws report empirically: bigger models keep getting better, well past the point where the classical picture says they should fail.

2. Grokking

If double descent is about model size, grokking is about training time.

The short version: a model memorizes training data (training loss low, test loss high), and you'd normally stop there. But if you keep training, sometimes for many times the number of steps it took to fit the training set, generalization suddenly jumps. The model restructures internally from a brittle memorization solution to a clean algorithmic one.

The transition is sharp. It looks like a phase change because it probably is one.

Mechanistic interpretability work (Neel Nanda's analysis of modular arithmetic models is the clearest example) shows what's happening structurally: the memorization solution and the generalization solution coexist in the model during the transition period. The generalizing circuits grow slowly while regularization pressure erodes the memorizing circuits. When the generalizing solution becomes dominant, you see the jump.

The practical implication: training loss convergence is not the same as learning convergence. Early stopping triggered as soon as training loss converges, or as soon as validation loss plateaus, may terminate a run before it groks. Whether this generalizes beyond toy tasks to production-scale models is an open research question, since the signal is harder to isolate at scale, but the principle is worth keeping in mind.
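If you log both curves, you can at least measure the gap after the fact. Here is a small sketch that takes per-evaluation training and validation accuracy and reports when each first crossed a threshold. The helper name, the 0.99 threshold, and the synthetic curves are hypothetical, and the indices are positions in your log, not optimizer steps.

```python
from typing import Optional, Sequence, Tuple

def generalization_delay(
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (fit_index, generalize_index, delay).

    fit_index: first logged evaluation where train accuracy >= threshold.
    generalize_index: same for validation accuracy.
    delay: the difference, or None if either curve never gets there.
    """
    def first_reached(curve: Sequence[float]) -> Optional[int]:
        for index, value in enumerate(curve):
            if value >= threshold:
                return index
        return None

    fit = first_reached(train_acc)
    gen = first_reached(val_acc)
    delay = gen - fit if fit is not None and gen is not None else None
    return fit, gen, delay

# Synthetic grokking-shaped curves: train fits early, validation jumps late.
train_curve = [0.3, 0.7, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val_curve = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.6, 1.0]
print(generalization_delay(train_curve, val_curve))  # (2, 9, 7)
```

A large delay on a run that did generalize is a hint that a patience-based stopping rule tuned on that task would have cut it short. It says nothing about runs that never generalize, which return `None` for the delay.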

What connects them

Both phenomena share a structure: there's a phase boundary, and the interesting behavior happens after you cross it.

Classical ML intuition is built for models and training runs that stay on the near side of these boundaries. You minimize a convex loss, you regularize, you stop when validation loss bottoms out. That framework works. It just doesn't predict what happens in the regimes where modern large models actually live.

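The overparameterized regime is the norm now, not the exception. Many modern deep networks have far more parameters than they need to fit their training sets exactly. For GPT-style models trained on a single pass over very large corpora it is less clear where the interpolation threshold sits, but they are far outside the small-model regime the classical picture was built for. The training runs are long. The models are large. The old rules don't fully apply.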

This doesn't mean there are no rules. It means the rules are different, and we're still mapping them out.

The specimen angle

OVERFITS treats ML concepts as museum specimens — archived, labeled, pressed into fabric with academic plate aesthetics. Double descent and grokking both belong in the collection not just because they're interesting but because they represent the exact type of phenomenon that makes ML feel more like natural history than engineering: you observe it before you understand it, you classify it before you can explain it, and the explanation, when it comes, changes how you see everything else.

A model that remembered too much. → https://overfits.ai