Batch Normalization: Why It Made Deep Nets Trainable

#machinelearning #beginners #ai #deeplearning

Batch Normalization is one of those tricks that made deep networks suddenly trainable. The idea is simple — keep each layer's inputs in a healthy range — and the payoff is huge: faster training, higher learning rates, more stability. Here it is, visualized.

📊 Toggle BN on/off and watch: https://dev48v.infy.uk/dl/day19-batch-norm.html

The problem: internal covariate shift

As activations flow through the layers, their distribution drifts and spreads, and it keeps changing during training as the earlier layers' weights update. Deeper layers end up chasing a moving target: some units saturate, gradients vanish or explode, and training crawls (or diverges).
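To get a feel for the "drifts and spreads" part, here is a minimal sketch that pushes random data through a stack of plain ReLU layers and prints the activation std at each depth. The helper name `activation_stds`, the layer sizes, and the deliberately poor weight scale (std 1.0) are assumptions chosen to exaggerate the effect; this is not the demo's code, and it only illustrates the forward-pass spread, not the shift that happens while weights are being trained.

```python
import numpy as np

def activation_stds(n_layers=8, width=64, batch=256, weight_std=1.0, seed=0):
    """Push random data through a plain ReLU stack and record each layer's activation std."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    stds = []
    for _ in range(n_layers):
        # Assumption: weights drawn with std 1.0, a poor init for this width.
        w = rng.standard_normal((width, width)) * weight_std
        x = np.maximum(x @ w, 0.0)  # linear layer + ReLU, no normalization
        stds.append(float(x.std()))
    return stds

stds = activation_stds()
for i, s in enumerate(stds, start=1):
    print(f"layer {i}: std = {s:.3e}")
```

With this setup the spread grows by a large factor at every layer, which is the kind of runaway scale that normalization is meant to prevent.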

What BN does

After a layer, normalize each feature over the mini-batch: subtract the batch mean and divide by the batch std (with a small ε added to the variance for numerical stability), so activations stay at roughly mean 0, std 1. Then apply two learnable parameters, γ (scale) and β (shift), so the network can rescale if it needs to. The demo shows each layer's histogram snapping back to a tidy bell.
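The steps above can be sketched in a few lines of NumPy. This is a training-mode forward pass only, assuming inputs of shape (batch, features); the function name `batch_norm_train` and the default `eps=1e-5` are illustrative choices, not taken from the demo page.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (batch, features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature (biased) variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to roughly mean 0, std 1
    return gamma * x_hat + beta, mean, var    # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # badly centered, badly scaled input
gamma = np.ones(4)   # scale starts at 1
beta = np.zeros(4)   # shift starts at 0
y, batch_mean, batch_var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```

With γ = 1 and β = 0 the output is just the normalized activations; during training the network is free to move both away from those starting values.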

Why you care

Train faster and with higher learning rates (the demo's no-BN loss diverges at high LR; BN holds steady).
Less sensitive to weight initialization, plus a mild regularization effect from the noise in per-batch statistics.

The catches

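Training and inference behave differently: training normalizes with the current batch's statistics, while inference uses running averages of the mean and variance collected during training.
Batch-size dependent: with tiny batches the mean and variance estimates get noisy, and that hurts it.
Transformers prefer LayerNorm, which normalizes across the features of each token rather than across the batch.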
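The three catches are easier to see side by side in code. The sketch below is a hypothetical minimal class (`BatchNorm1d` here is my own name for it, not a library API) that tracks running statistics with an exponential moving average, plus a small `layer_norm` helper for contrast. The momentum of 0.1 and the batch sizes are assumptions picked for illustration.

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch norm with running statistics (illustrative, not a library class)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: use this batch's statistics and update the running averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the running averages, never the batch's own statistics.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token or sample) over its own features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
bn = BatchNorm1d(num_features=4)
for _ in range(200):  # simulate training batches so the running stats settle
    bn(rng.normal(loc=5.0, scale=3.0, size=(64, 4)), training=True)

single = rng.normal(loc=5.0, scale=3.0, size=(1, 4))
out_eval = bn(single, training=False)  # running stats: fine even for one sample
out_train = bn(single, training=True)  # batch of 1: mean equals x, var is 0, output collapses to beta
out_ln = layer_norm(single)            # per-row stats: no dependence on batch size
print(out_eval, out_train, out_ln, sep="\n")
```

The batch-of-one case is the extreme version of the tiny-batch problem: in training mode every feature is normalized to exactly zero, so all information in the input is lost. LayerNorm sidesteps this because its statistics come from a single row, which is one reason it suits variable-length token sequences.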

🔨 Built from scratch (batch mean/var → normalize → γ,β → running stats) on the page: https://dev48v.infy.uk/dl/day19-batch-norm.html

Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk