Dropout is almost absurdly simple — randomly switch off neurons during training — yet it was one of the biggest anti-overfitting wins in deep learning. Here's why it works, visualized.
🎲 Watch neurons drop (toggle the rate): https://dev48v.infy.uk/dl/day20-dropout.html
What it does
On each training step, each hidden neuron is kept with probability (1−p) and zeroed out with probability p (say p=0.5). A different random subset drops every step. The demo grays out a fresh random set of neurons each pass and cuts their edges.
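To make the per-step sampling concrete, here is a minimal sketch in plain Python. The names `dropout_mask` and `apply_mask` are made up for illustration and are not taken from the demo page. It uses only the standard library, so it favors readability over speed.

```python
import random

def dropout_mask(n_units, p, rng):
    """Return one 0/1 flag per unit: 1 = keep (prob 1-p), 0 = drop (prob p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # rng.random() is uniform on [0, 1), so it is >= p with probability 1-p.
    return [1 if rng.random() >= p else 0 for _ in range(n_units)]

def apply_mask(activations, mask):
    """Zero out the dropped units; kept units pass through unchanged."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)  # seeded only so the run is repeatable
activations = [0.8, 1.5, 0.3, 2.1, 0.9, 1.2]  # hypothetical hidden-layer outputs

# A fresh mask is drawn on every training step.
for step in range(3):
    mask = dropout_mask(len(activations), p=0.5, rng=rng)
    print(step, mask, apply_mask(activations, mask))
```

This version only zeroes units. The rescaling that makes inference consistent is a separate step.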
Why that helps
Neurons can't rely on any specific other neuron being present, so they can't co-adapt into a fragile memorized solution — each must learn a feature that's useful on its own. It's like training a huge ensemble of subnetworks that share weights. Result: a smaller train/val gap (less overfitting) — which the two accuracy curves in the demo show.
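To get a feel for how big that ensemble is: a layer with n units has 2^n possible keep/drop patterns, and each pattern is one subnetwork. The sketch below enumerates them for a tiny layer. `all_masks` is a hypothetical helper, and it is only practical for very small n.

```python
from itertools import product

def all_masks(n_units):
    """Enumerate every keep/drop pattern for a layer with n_units units."""
    return list(product((0, 1), repeat=n_units))

masks = all_masks(4)
print(len(masks))   # 16, i.e. 2**4 subnetworks for just 4 units
print(masks[:3])    # the first few patterns

# For a realistic layer the count is astronomically large, so training
# only ever samples a tiny fraction of the possible subnetworks.
print(2 ** 512)
```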
Train vs inference (the gotcha)
You drop during training only. At inference, all neurons are on. To keep the expected activations consistent, inverted dropout scales the kept activations by 1/(1−p) during training, so inference needs no change.
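Why 1/(1−p)? A kept unit contributes a/(1−p) with probability (1−p), and a dropped unit contributes 0 with probability p, so the expected value is (1−p)·a/(1−p) + p·0 = a, the same as the unscaled activation at inference. Here is a small sketch of that train/eval switch. The function name `inverted_dropout` and its `training` flag are assumptions for illustration, not the demo's actual code.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Inverted dropout: drop and rescale in training, identity at inference."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: every unit is on and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - p
    # Keep with probability 1-p and scale by 1/(1-p); otherwise output 0.
    return [a / keep_prob if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # seeded only so the run is repeatable
x = [1.0] * 100_000     # constant activations make the mean easy to read

train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(sum(eval_out) / len(eval_out))    # unchanged input, so the mean is 1.0
```

Individual training outputs are either 0 or 2.0 here, but their average matches the inference value, which is the whole point of the scaling.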
Modern note
With batch norm (Day 19) and huge datasets, dropout is needed less in conv nets, but it's still standard in Transformers (attention + feed-forward). Like L2 (Day 17), it's a regularization technique.
🔨 Built from scratch (mask = rand > p → scale by 1/(1−p) → off at eval) on the page: https://dev48v.infy.uk/dl/day20-dropout.html
Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk