Nilavukkarasan R


Batch Normalization and Residual Connections: Going Deeper Without Breaking

"No man ever steps in the same river twice, for it's not the same river and he's not the same man."
Heraclitus

When Deeper Made Things Worse

CNNs rethought how networks process images. Filters, weight sharing, spatial structure. The natural next step: go deeper. Early layers detect edges, middle layers combine them into shapes, deeper layers recognize objects. To go from recognizing handwritten digits to understanding complex scenes, faces, medical scans, you need that depth. More layers, more abstraction, more power.

Researchers took a 20-layer network and added 36 more layers. The 56-layer network should have been better. Instead, it was worse. Not just on test data. On training data too.

That's not overfitting. Overfitting means you're too good on training data. This was the opposite: a bigger network that couldn't even fit the data it was trained on.

Two things were broken. Fixing them required two ideas.

The Signal Drifts

Each layer transforms its input and passes it to the next. But as weights update during training, each layer's output distribution shifts. The next layer was calibrated for the old distribution. Now it's receiving something different.

A small shift in layer 3 gets amplified by layer 4, amplified again by layer 5. After 20 layers, the signal has either exploded into enormous numbers or collapsed to near zero.

Without batch norm:
  Layer 5 output:  mean=2.3,  std=4.7
  Layer 10 output: mean=18.4, std=31.2   ← exploding
  Layer 20 output: mean=NaN              ← numerically broken

Every layer is chasing a moving target. That's the problem batch normalization solves.
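To see the compounding in isolation, here's a toy NumPy sketch (mine, not from the post) that pushes one batch through a stack of plain linear + ReLU layers whose weights are scaled slightly too large. The exact numbers depend on that assumed scale; the point is how quickly a small per-layer mismatch compounds with depth.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))   # one mini-batch of 64 examples

for layer in range(1, 21):
    # weight scale chosen deliberately a bit too large;
    # a slightly too-small scale would instead shrink the signal toward zero
    W = rng.normal(scale=0.15, size=(256, 256))
    x = np.maximum(0, x @ W)     # plain linear layer + ReLU, no normalization
    if layer in (5, 10, 20):
        print(f"Layer {layer}: mean={x.mean():.2f}, std={x.std():.2f}")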

Batch Normalization

Before each layer processes its input, normalize it to zero mean and unit variance. Then let the network re-scale with two learned parameters (γ and β) so it can undo the normalization if needed.

x_norm = (x - mean) / sqrt(variance)
output = γ × x_norm + β

Now every layer starts from a stable baseline. Activations stop drifting and exploding, you can use higher learning rates, and weight initialization matters less.

With batch norm:
  Layer 5:  mean≈0, std≈1
  Layer 10: mean≈0, std≈1
  Layer 20: mean≈0, std≈1    ← stable all the way down

One detail: batch norm computes statistics from the current mini-batch during training. At inference, there's no batch, so it uses running averages accumulated during training.
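Here's a minimal from-scratch sketch of that training/inference split. The class name, the 0.1 momentum, and the small ε inside the square root are my own choices for the illustration, not from the post:

import numpy as np

class BatchNorm:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)     # learned re-scale (γ)
        self.beta = np.zeros(num_features)     # learned re-shift (β)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training=True):
        if training:
            # statistics come from the current mini-batch
            mean, var = x.mean(axis=0), x.var(axis=0)
            # accumulate running averages for later use at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # no batch at inference: use the accumulated running averages
            mean, var = self.running_mean, self.running_var
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

bn = BatchNorm(256)
x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(64, 256))
y = bn(x, training=True)
print(round(y.mean(), 3), round(y.std(), 3))   # ≈ 0.0 and ≈ 1.0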

The Gradient Vanishes

Batch norm fixes the forward pass. But there's a second problem in the backward pass.

Backpropagation multiplies derivatives together as it moves backward. Each layer contributes a factor. If those factors are consistently less than 1, the gradient shrinks at every layer. By the time it reaches layer 1 of a 50-layer network, the gradient is effectively zero.

This is why the 56-layer network performed worse than the 20-layer one. The early layers weren't getting any useful gradient signal. They were frozen. It's like studying so much for an exam that your brain goes blank. More preparation, worse performance. Not because you lack knowledge, but because the signal got lost somewhere along the way.

Residual Connections: The Shortcut

Instead of learning a full transformation, a residual block learns the difference from identity:

Normal layer:    output = F(x)
Residual block:  output = F(x) + x     ← add the input back

That + x is the skip connection. The input bypasses the transformation and gets added to the output.
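As code, the skip is literally one extra addition. A bare-bones sketch where F(x) is assumed to be two linear layers with a ReLU in between (a real ResNet block uses convolutions and batch norm, but the skip works the same way):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    fx = relu(x @ W1) @ W2   # F(x): the learned part of the transformation
    return fx + x            # skip connection: add the input back unchanged

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))                       # a batch of 32 vectors
W1, W2 = rng.normal(scale=0.05, size=(2, 64, 64))   # one block's weights
out = residual_block(x, W1, W2)                     # same shape as x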

Why this fixes vanishing gradients: in a normal layer, the gradient gets multiplied by F'(x) at every step. If F'(x) is 0.1, after 50 layers you're multiplying fifty 0.1s together. The gradient is gone.

With a residual block, the chain rule becomes:

Normal:    ∂L/∂x = ∂L/∂output × F'(x)
Residual:  ∂L/∂x = ∂L/∂output × (F'(x) + 1)

That + 1 comes from the skip connection. Instead of multiplying values less than 1 at every layer, the skip connection keeps each factor close to 1. The gradient stays alive all the way back to layer 1.
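A quick back-of-the-envelope check, with made-up small values standing in for each layer's local derivative F'(x):

import numpy as np

layers = 30
rng = np.random.default_rng(0)
f = rng.uniform(-0.1, 0.1, size=layers)   # assumed small local derivatives F'(x)

plain = np.prod(f)          # factors without skips: each one far below 1
with_skip = np.prod(f + 1)  # each factor becomes F'(x) + 1, i.e. close to 1

print(f"plain chain: {plain:.2e}")      # vanishingly small after 30 layers
print(f"with skips:  {with_skip:.2f}")  # stays on the order of 1

The plain product vanishes; the product of (F'(x) + 1) terms hovers around 1, which is why the gradient reaching layer 1 is still usable.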

Gradient flow: normal network vs ResNet

On the left, a 30-layer normal network. The gradient starts at 1.0 at the output and shrinks at every layer. By layer 1, it's 0.007. The early layers are frozen. On the right, the same depth with skip connections. The gradient stays close to 1.0 across all layers because the skip provides a direct path that doesn't decay.

Before ResNets, the practical limit was around 20 layers. After, researchers trained networks with over 1,000 layers.

How Everything Fits Together

Seven posts in, it can feel like an ever-growing list of techniques. It's not. Each solved a specific failure: hidden layers for non-linearity (02), backprop for learning (03), mini-batches for scale (04), dropout for overfitting (05), convolutions for spatial data (06), batch norm for signal drift (07), skip connections for vanishing gradients (07). Each patches a gap the others can't cover. Together, they make modern deep networks trainable.

What's Next

We can now train deep networks on images. But images are static. What about data where order matters? Text, audio, time series, where what came before changes the meaning of what comes after.

A fully connected network has no concept of sequence. A CNN has no concept of time. We need an architecture with memory.

That's where recurrent neural networks come in. And the vanishing gradient problem we just solved for depth comes back for length.


Series: From Perceptrons to Transformers | Code: GitHub
