Batch Normalization is one of those tricks that made deep networks suddenly trainable. The idea is simple — keep each layer's inputs in a healthy range — and the payoff is huge: faster training, higher learning rates, more stability. Here it is, visualized.
📊 Toggle BN on/off and watch: https://dev48v.infy.uk/dl/day19-batch-norm.html
The problem: internal covariate shift
As activations flow through layers, their distribution drifts and spreads. Deep layers keep chasing a moving target; some saturate, gradients vanish or explode, and training crawls (or diverges).
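To make the drift concrete, here is a toy sketch in plain Python. It is not the demo's code: the width, depth, batch size, and the deliberately unscaled random weights are all assumptions chosen for illustration. It pushes a batch through a stack of linear layers with no normalization and records how the spread of the activations changes.

```python
import random
import statistics

random.seed(0)
width, depth, batch_size = 16, 5, 64  # hypothetical sizes for the toy network

# Toy input batch: batch_size samples, each with `width` features drawn from N(0, 1).
acts = [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]
stds = [statistics.pstdev([v for row in acts for v in row])]

for _ in range(depth):
    # Unscaled random weights (std 1): a deliberately poor initialization.
    w = [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
    # Plain linear layer, no normalization: acts = acts @ w
    acts = [[sum(row[k] * w[k][j] for k in range(width)) for j in range(width)]
            for row in acts]
    stds.append(statistics.pstdev([v for row in acts for v in row]))

for layer, s in enumerate(stds):
    print(f"layer {layer}: activation std = {s:.1f}")
```

The input starts at a std of about 1, and the spread compounds with every layer. Feed values that large into a tanh or sigmoid and most units would saturate.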
What BN does
After a layer, normalize each feature over the mini-batch: subtract the batch mean and divide by the batch std (with a small ε added to the variance for numerical stability), so activations stay at roughly mean 0, std 1. Then apply two learnable per-feature parameters, γ (scale) and β (shift), so the network can rescale or even undo the normalization if it needs to. The demo shows each layer's histogram snapping back to a tidy bell.
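The training-time forward pass can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's implementation: the function name `batch_norm_train`, the list-of-lists layout, and the `eps=1e-5` default are assumptions made for the example.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real network).
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)            # eps avoids division by zero
        for i in range(n):
            x_hat = (column[i] - mean) * inv_std         # ~mean 0, std 1
            out[i][j] = gamma[j] * x_hat + beta[j]       # learnable rescale
    return out

# Two features on very different scales.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 2) for v in row])
```

With γ = 1 and β = 0, both columns come out at roughly -1.22, 0, 1.22, even though the raw features differ by a factor of 200.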
Why you care
- Train faster and with higher learning rates (the demo's no-BN loss diverges at high LR; BN holds steady).
- Less sensitive to weight initialization; mild regularization effect.
The catches
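- Training and inference differ: at inference you use the running mean/variance accumulated during training, not the current batch's statistics.
- Batch-size dependent: with tiny batches, the batch mean and variance are noisy estimates, so the normalization becomes unreliable.
- Transformers typically use LayerNorm instead, which normalizes across the features of each token rather than across the batch.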
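The three catches can be seen in one small sketch. Everything here is illustrative rather than taken from the page: the class name `BatchNorm1D`, the `momentum=0.1` update rule, and the use of the biased batch variance for the running estimate are simplifying assumptions.

```python
import math

class BatchNorm1D:
    """Toy batch norm that tracks running statistics for inference."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = [1.0] * num_features
        self.beta = [0.0] * num_features
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.momentum = momentum
        self.eps = eps

    def forward(self, batch, training):
        n = len(batch)
        out = [list(row) for row in batch]
        for j in range(len(self.gamma)):
            if training:
                # Training: use this batch's statistics and update the running ones.
                col = [row[j] for row in batch]
                mean = sum(col) / n
                var = sum((x - mean) ** 2 for x in col) / n
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: use the accumulated running statistics.
                mean, var = self.running_mean[j], self.running_var[j]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i in range(n):
                out[i][j] = self.gamma[j] * (batch[i][j] - mean) * inv_std + self.beta[j]
        return out

def layer_norm(sample, eps=1e-5):
    """Normalize across the features of ONE sample (no batch statistics)."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / math.sqrt(var + eps) for x in sample]

bn = BatchNorm1D(num_features=2)
for _ in range(200):  # "train" on the same batch so the running stats settle
    bn.forward([[1.0, 10.0], [3.0, 30.0]], training=True)

single = bn.forward([[3.0, 10.0]], training=False)    # uses running stats
collapsed = bn.forward([[3.0, 10.0]], training=True)  # batch of one: x - mean == 0
print(single, collapsed, layer_norm([1.0, 2.0, 3.0]))
```

In inference mode the single sample is normalized against the running statistics and comes out near [1, -1]. In training mode the same batch of one collapses to all zeros, because a single sample always equals its own batch mean. That is the extreme case of the small-batch problem. `layer_norm` avoids it by never looking at other samples.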
🔨 Built from scratch (batch mean/var → normalize → γ,β → running stats) on the page: https://dev48v.infy.uk/dl/day19-batch-norm.html
Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk