"No man ever steps in the same river twice, for it's not the same river and he's not the same man" - Heraclitus
When Going Deeper Made Things Worse
In my last post, we built CNNs that could see. Filters learned edges. Pooling built spatial tolerance. Stack enough layers and the network recognizes digits, faces, objects.
So the obvious next move: go deeper. More layers, more capacity, more power.
But there is a catch.
Researchers took a 20-layer network and added 36 more layers. The 56-layer network should have been better. More parameters, more room to learn. Instead, it was worse. Not just on test data, But on training data as well.
That's not overfitting. Overfitting means you're too good on training data. This was the opposite: a bigger network that couldn't even fit the data it was trained on.
Two things were broken. And fixing them required two elegant ideas.
The Noisy Room Problem
Imagine you're at a loud party, trying to follow a conversation. The room is packed, music is blasting, five other conversations are happening around you. Your brain doesn't give up, it does something remarkable. It filters out the noise, locks onto the voice you care about, and normalizes the signal so you can follow along.
You do this automatically, without thinking. But a neural network? It has no such mechanism.
Here's what actually happens inside a deep network during training. Each layer transforms its input and passes it to the next. A small shift in one layer's output gets amplified by the next layer, which gets amplified again, and again. After 20 layers, the signal has either exploded into enormous numbers that saturate neurons, or collapsed into near-zero values that carry no information.
The network is trying to learn in a room that keeps getting louder.
That's the internal covariate shift problem. The distribution of each layer's input keeps changing as weights update. Every layer is chasing a moving target.
Batch normalization is the fix to it.
Batch Normalization: Tuning Out the Noise
Before each layer processes its input, normalize it. Force it to have zero mean and unit variance. Then let the network re-scale with two learned parameters: γ (gamma) and β (beta).
# For each mini-batch:
compute mean and variance of the inputs
normalize: x_norm = (x - mean) / sqrt(variance)
# Then re-scale with learned parameters:
output = γ * x_norm + β
The network can undo the normalization if it needs to. γ and β are learned. But now every layer starts from a stable, predictable baseline. The moving target stops moving.
Going back to the party analogy: batch norm is your brain's noise-cancellation. It doesn't remove the signal, it strips out the irrelevant variation so the important information comes through clearly.
The effect on training is immediate:
Without batch norm:
Layer 5 output: mean=2.3, std=4.7
Layer 10 output: mean=18.4, std=31.2 ← signal exploding
Layer 20 output: mean=NaN ← training collapsed
With batch norm:
Layer 5 output: mean≈0, std≈1
Layer 10 output: mean≈0, std≈1
Layer 20 output: mean≈0, std≈1 ← stable all the way down
Three things happen when you add batch norm:
- Activations stay stable — no more explosions or collapses
- You can use much higher learning rates — the stable baseline means bigger steps are safe
- Weight initialization matters less — you no longer need to be as careful about starting values
One subtle thing worth knowing: batch norm uses statistics computed across the current mini-batch. At inference time, you might be predicting on a single example and no batch to compute statistics from. So during training, batch norm accumulates running averages of the mean and variance. At inference, it uses those instead.
The Vanishing Gradient: A Deeper Problem
Batch norm stabilizes the forward pass. But there's a second problem, and it lives in the backward pass.
Backpropagation multiplies derivatives together as it moves backward through the network. Each layer contributes a factor. If those factors are consistently less than 1, which they often are, the gradient shrinks with every layer it passes through.
By the time it reaches layer 1 of a 50 layer network, the gradient might be effectively zero. The early layers stop learning entirely.
This is why the 56-layer network performed worse than the 20-layer one. It wasn't a capacity problem. The early layers simply weren't getting any useful gradient signal. They were frozen.
Residual Connections: The Shortcut
"If I have seen further, it is by standing on the shoulders of giants."
— Isaac Newton
Instead of learning a full transformation, a residual block learns the difference from identity:
# Normal layer:
output = transform(x)
# Residual block:
output = transform(x) + x ← just add the input back
That + x is the skip connection. The input bypasses the learned transformation and gets added back at the output.
How it changes the chain rule.
In a normal layer, backprop applies the chain rule like this:
∂L/∂x = ∂L/∂output × F'(x)
The gradient gets multiplied by F'(x) at every layer. If that's 0.1, after 50 layers you're multiplying fifty 0.1s together, the gradient reaches layer 1 as essentially zero.
With a residual block, output = F(x) + x, so the chain rule becomes:
∂L/∂x = ∂L/∂output × (F'(x) + 1)
That + 1 comes from differentiating the skip connection x with respect to x, the derivative of a straight passthrough is always 1. Now instead of multiplying fifty 0.1s, you're multiplying fifty 1.1s. The gradient stays alive all the way back to layer 1.
Before ResNets, the practical limit for trainable networks was around 20 layers. After ResNets, researchers trained a 1,202-layer network. Not because they needed 1,202 layers, but to prove they could.
That distinction: capacity vs. trainability is one of the most important ideas in deep learning.
The Bigger Picture: How Everything Fits Together
At this point in the series, it's worth stepping back. A lot of concepts been introduced, and it can start to feel like an ever growing list of tricks. It's not. Each one solved a specific, concrete failure mode:
| Problem | What Goes Wrong | Solution |
|---|---|---|
| Dying neurons | Neurons output zero forever, stop learning | ReLU |
| Vanishing gradients | Gradients too small to reach early layers | ReLU + careful init |
| Exploding gradients | Gradients too large, training diverges | Gradient clipping, Adam |
| Slow convergence | Hard to find a good learning rate | Adam optimizer |
| Internal covariate shift | Each layer's inputs keep shifting distribution | Batch Norm |
| Degradation problem | Deeper networks perform worse than shallow ones | Skip connections (ResNet) |
These aren't redundant, they're complementary. ReLU keeps neurons alive. Adam navigates the loss landscape efficiently. Batch norm stabilizes the signal between layers. Skip connections ensure gradients reach the beginning. Each one patches a gap the others can't cover.
Together, they form the foundation that makes modern deep networks trainable. You'll see all of them again — in every architecture from here on out.
What's Next
We can now train deep networks. But depth alone doesn't solve every problem.
Images have spatial structure, CNNs exploit that. But what about sequences? Text, audio, time series, data where order matters and context can span hundreds of steps?
Post 8 introduces Recurrent Neural Networks: architectures with memory, where the output at each step depends on everything that came before. And you'll see immediately why the vanishing gradient problem, which we just solved for depth, comes back with a vengeance for long sequences.
References
- Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML.
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016.
Top comments (0)