Your gradient dies on the way to layer 1 (and how to save it)

#ai #machinelearning #deeplearning #beginners

Stack enough layers and something strange happens: the network trains, the last few layers learn fine, and the first layers barely move at all. Not slowly — barely at all. For years this quietly capped how deep a network anyone could actually train. The culprit is one line of arithmetic hiding inside backpropagation, and once you see it you can't unsee it. Here it is, running on a real chain of layers in your browser.

📉 Slide the depth, pick an activation, watch the gradient vanish or explode: https://dev48v.infy.uk/dl/day21-vanishing-gradients.html

Backprop is a product, not a sum

When a network learns, backpropagation figures out how the loss changes with respect to every weight, working backwards from the output to the input. The important structural fact is how the gradient travels: at every layer it gets multiplied by that layer's local factor, roughly the weight magnitude times the derivative of the activation, |w| · f'(x).

Multiplied. Not added. So the gradient that finally reaches the first layer is a long product of these per-layer factors — one for every layer in between. And products of many numbers are fragile in a way sums never are.
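
To make the "product, not sum" claim concrete, here is a minimal sketch of a chain of one-neuron sigmoid layers. The weights and input are made-up values, and the function names are just for illustration. It multiplies the signed per-layer factors w · f'(z) and checks the result against a finite-difference estimate. The magnitude of each factor is the |w| · f'(x) described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Run x through a chain of one-neuron layers: a <- sigmoid(w * a)."""
    a = x
    for w in weights:
        a = sigmoid(w * a)
    return a

def chain_rule_gradient(x, weights):
    """d(output)/d(input) as the product of per-layer factors w * f'(z)."""
    a = x
    grad = 1.0
    for w in weights:
        s = sigmoid(w * a)
        grad *= w * s * (1.0 - s)  # local factor contributed by this layer
        a = s
    return grad

weights = [0.8, -1.2, 1.5, 0.9]  # hypothetical values, chosen only for the demo
x = 0.3

analytic = chain_rule_gradient(x, weights)

# Central finite difference as an independent check on the product.
eps = 1e-6
numeric = (forward(x + eps, weights) - forward(x - eps, weights)) / (2 * eps)

print(analytic, numeric)  # the two estimates should agree closely
```

A real network multiplies Jacobian matrices rather than scalars, but the structure is the same: one factor per layer, all multiplied together.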

Below 1, it vanishes

Suppose each factor is a little under 1, say 0.9. Sounds harmless. But 0.9 to the 50th power is about 0.005, and by 100 layers it is down to roughly 0.00003. The shrinkage is exponential in depth, so it sneaks up fast: a factor that looks perfectly reasonable at one layer becomes catastrophic when you compound it dozens of times.
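
The compounding is easy to check yourself. This small sketch assumes every layer contributes the same factor, which is a simplification, and the helper name is hypothetical:

```python
def compounded(factor, depth):
    """Gradient scale after passing through `depth` layers with the same factor."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad

for depth in (1, 10, 50, 100):
    print(depth, compounded(0.9, depth))
```

Try swapping 0.9 for 0.99 or 0.5 to see how sensitive the result is to the per-layer factor.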

When the gradient reaching the earliest layers is essentially zero, those layers get almost no update signal and effectively stop learning. Only the layers near the output train at all. That's the vanishing gradient problem, and in the demo you can watch it directly: with sigmoid and depth 16, the bar for layer 1 is flush with the floor of the chart at around 1e-9.

Above 1, it explodes

The mirror image is just as deadly. If each factor is greater than 1, the product grows exponentially instead of shrinking. 1.5^20 is already over 3,000; 2^20 is over a million. An exploding gradient produces enormous weight updates that overshoot wildly and can drive your parameters to infinity or NaN within a few steps. This is especially common in recurrent networks, where the same weight matrix is applied at every timestep, so a long sequence is effectively a very deep chain multiplying the same factor over and over. Drag the weight-scale slider up in the demo and the bars turn amber as the gradient rockets into the thousands.
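
Here is a rough sketch of where the NaN comes from, using plain Python floats (64-bit). The helper name is made up for illustration. Repeated multiplication by a factor above 1 eventually overflows to infinity, and arithmetic that mixes infinities then produces NaN:

```python
import math

def steps_until_overflow(factor, limit=10_000):
    """Count multiplications by `factor` until a float gradient becomes inf."""
    grad = 1.0
    for step in range(1, limit + 1):
        grad *= factor
        if math.isinf(grad):
            return step
    return None  # never overflowed within `limit` steps

print(1.5 ** 20)                  # a bit over 3,000
print(2 ** 20)                    # 1048576
print(steps_until_overflow(2.0))  # steps before a 64-bit float gives up

inf = float("inf")
print(inf - inf)                  # nan: once infinities mix, the value is lost
```

In practice training usually breaks long before a literal overflow, because a single oversized update can already throw the weights far from any useful region.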

Sigmoid and tanh make it worse on purpose

The classic activations actively push the factors below 1. The sigmoid's derivative is s·(1−s), which maxes out at just 0.25 at the center (input 0) and is far smaller in the flat tails where big inputs land. So before you even consider the weights, a single sigmoid layer multiplies the gradient by at most a quarter. Stack a handful and the product is already tiny (0.25^5 is about 0.001); stack more and it is minuscule (0.25^19 is about 3.6e-12).

Tanh is a bit kinder — its derivative peaks at 1 — but it too saturates toward 0 for large inputs. Squashing activations in deep stacks all but guarantee vanishing. That's exactly why the demo defaults to sigmoid to show the effect.
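
A quick way to see both the peaks and the saturation is to evaluate the two derivatives at a few inputs. This is a minimal sketch with illustrative function names:

```python
import math

def sigmoid_prime(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_prime(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Peak at 0, then rapid decay as the input moves into the flat tails.
for x in (0.0, 2.0, 5.0):
    print(x, sigmoid_prime(x), tanh_prime(x))

# Best case for a chain of 19 sigmoids, ignoring the weights entirely.
print(0.25 ** 19)
```

At x = 0 the sigmoid derivative is exactly 0.25 and the tanh derivative is exactly 1. By x = 5 both have collapsed to a small fraction of those peaks.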

Why deep nets stalled

For a long stretch this single phenomenon was the ceiling. People stacked many sigmoid or tanh layers, the early layers refused to learn, and "deep" networks performed no better than shallow ones. It made depth look like a dead end. The workarounds were fiddly — greedy layer-by-layer pretraining, hand-tuned learning rates, staying shallow. None of them fixed the underlying multiplication problem.

The breakthrough wasn't one magic trick. It was a cluster of fixes that each do the same job: keep the per-layer factor near 1.

The fixes

ReLU. Its derivative is either 0 (negative inputs) or exactly 1 (positive inputs) — no shrinking 0.25 cap. Every active neuron passes the gradient through undamped, so a chain of active ReLUs multiplies by 1 at each step and the product doesn't decay. This is the single biggest reason ReLU replaced sigmoid as the default hidden activation.
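
As a sketch of the difference, compare the activation-derivative product for a chain of active ReLUs with the best case for sigmoids. The weights are left out on purpose and the pre-activation values are invented for the example:

```python
def relu_prime(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chain_product(derivatives):
    """Multiply the per-layer activation derivatives together."""
    grad = 1.0
    for d in derivatives:
        grad *= d
    return grad

depth = 20
pre_activations = [0.5 + 0.1 * k for k in range(depth)]  # all positive: every unit active

relu_chain = chain_product(relu_prime(z) for z in pre_activations)
sigmoid_best_case = chain_product(0.25 for _ in range(depth))

print(relu_chain)         # 1.0: nothing lost along the active path
print(sigmoid_best_case)  # 0.25 ** 20, even at the sigmoid's peak

# One inactive unit on the path blocks the gradient completely.
blocked = chain_product(relu_prime(z) for z in pre_activations[:-1] + [-0.3])
print(blocked)            # 0.0
```

The last line is the flip side of the 0-or-1 derivative: a path through an inactive unit carries no gradient at all, so the benefit only applies to units that are switched on.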

Weight initialization. Even with a good activation, the |w| part matters. Xavier (Glorot) init sets the weight variance to 2/(fan_in + fan_out), which is about 1/fan_in when neighboring layers have similar widths, and keeps the signal variance roughly constant across layers for tanh-like activations. He init uses 2/fan_in, where the extra factor of two compensates for ReLU zeroing about half its inputs, and is the standard partner for ReLU. Both pick the starting scale so the per-layer factor lands near 1 on average. In the demo, the He + ReLU preset drops every factor onto the green "stable = 1" line.
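
The two scales can be sketched in a few lines of plain Python. This version assumes Gaussian weights and uses the common Glorot form 2/(fan_in + fan_out), which reduces to 1/fan_in when the two sizes are equal. The function names and the 256-wide layer are illustrative choices, not taken from the demo:

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot init: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He init: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 256, 256

w = init_weights(fan_in, fan_out, he_std(fan_in), rng)
flat = [v for row in w for v in row]
sample_std = math.sqrt(sum(v * v for v in flat) / len(flat))

print(xavier_std(fan_in, fan_out))  # 0.0625 for a 256 -> 256 layer
print(he_std(fan_in), sample_std)   # target scale vs. what was actually drawn
```

Frameworks ship these as built-in initializers, so in practice you would rarely write this by hand. The point is only that the scale depends on the layer width.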

Gradient clipping. For the exploding case, especially in RNNs, you measure the gradient's norm and, if it exceeds a threshold, rescale the whole vector down. Same direction, capped length. Cheap and reliable.
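
A minimal sketch of norm clipping on a flat list of gradient values follows. The function name and threshold are hypothetical, and a real training loop would apply this across all parameters at once:

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)      # already short enough: leave it alone
    scale = max_norm / norm    # shrink factor, strictly below 1
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5 -> norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)

print(clipped)    # roughly [0.6, 0.8]
print(untouched)  # [0.3, 0.4]
```

Because every component is scaled by the same number, the update still points the same way. Only its length is capped.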

Batch norm and residual connections. Batch norm re-centers and rescales each layer's pre-activations, keeping them in the healthy region where derivatives aren't tiny. Residual connections add the input back, y = x + F(x), so the gradient gets a straight-through +1 path: dy/dx = 1 + dF/dx. Even if F's own gradient is small, the gradient still flows through the identity path around the block. That single trick is what let ResNets train hundreds of layers deep.
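
The effect of the +1 path shows up clearly in a scalar sketch. Assume every block's own gradient dF/dx is a small 0.01 (an invented value) and compare 50 plain blocks with 50 residual blocks:

```python
def plain_chain(block_grads):
    """Gradient through stacked blocks y = F(x): product of dF/dx."""
    grad = 1.0
    for g in block_grads:
        grad *= g
    return grad

def residual_chain(block_grads):
    """Gradient through stacked blocks y = x + F(x): product of (1 + dF/dx)."""
    grad = 1.0
    for g in block_grads:
        grad *= 1.0 + g
    return grad

block_grads = [0.01] * 50  # hypothetical: each block's own gradient is tiny

print(plain_chain(block_grads))     # vanishes
print(residual_chain(block_grads))  # stays on the order of 1
```

This is a one-dimensional simplification. In a real ResNet the terms are Jacobian matrices, but the identity term plays the same role.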

One idea to remember

Because backprop multiplies a factor at every layer, keeping that factor near 1 is the whole game. Below 1 it vanishes, above 1 it explodes. ReLU, He init, clipping, batch norm, residual connections — and the gates inside LSTMs for sequences — are all just different ways of pinning that factor to roughly 1 so the gradient survives the trip from output to input.

🔨 Built from a real forward pass and chain-rule product on the page — no framework: https://dev48v.infy.uk/dl/day21-vanishing-gradients.html

Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk
