Nilavukkarasan R

Backpropagation: How Neural Networks Learn From Mistakes

"The backpropagation algorithm was a key historical step in demonstrating that deep neural networks could be trained effectively."
Geoffrey Hinton

From Hand-Crafted to Learned

A network that recognizes handwritten digits has hundreds of thousands of weights. A language model has billions. You can't hand-pick billions of numbers. There has to be a way for the network to find its own weights.

That's what backpropagation does. It starts with random weights and adjusts them automatically, using the network's errors to figure out which direction to nudge each weight.

Try, Miss, Adjust

Think about learning to throw darts. Your first throw misses the bullseye by a foot. You don't start over with a completely random throw. You adjust. A little less force, slightly different angle. The error (how far you missed) tells you which way to correct.

Backpropagation does the same thing, for every weight in the network, simultaneously:

1. Forward pass:   feed input through the network, get a prediction
2. Compute error:  how far off was the prediction?
3. Backward pass:  trace the error back, figure out each weight's share of blame
4. Update weights: nudge each weight to reduce the error
5. Repeat
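The five steps can be sketched end-to-end for the simplest possible case: a single sigmoid neuron learning AND (linearly separable, so no hidden layer is needed yet). All names and hyperparameters here are illustrative, not the playground's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=2)  # start with random weights
b = 0.0
lr = 0.5

for epoch in range(2000):          # 5. repeat
    pred = sigmoid(X @ w + b)      # 1. forward pass
    err = pred - y                 # 2. error: prediction minus target
    grad_z = err / len(y)          # 3. backward pass (with cross-entropy loss,
    grad_w = X.T @ grad_z          #    the output gradient is just the error)
    grad_b = grad_z.sum()
    w -= lr * grad_w               # 4. update weights
    b -= lr * grad_b

print(np.round(pred))  # → [0. 0. 0. 1.]
```

The network starts out wrong and ends up right, with no hand-picked weights anywhere.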

Step 3 is where the name comes from. The error at the output is clear: prediction minus target. But how much did each hidden neuron contribute to that error?

Think of it like a relay race where the team finishes 10 seconds too slow. The coach doesn't just blame the last runner. She works backward: the last runner lost 3 seconds, the one before lost 5, the first lost 2. Each runner's share of blame is traced back through the chain.

Backpropagation does the same thing. It starts at the output error and works backward through each layer, computing how much each weight contributed. This is the chain rule from calculus applied layer by layer. Each weight gets a gradient.
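To make the chain rule concrete, here's a hedged two-weight example: one input, one hidden neuron, one output neuron (all values illustrative). The analytic gradient from the backward pass is checked against a numerical estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.0      # illustrative input and label
w1, w2 = 0.6, -0.3        # illustrative weights

# Forward pass, keeping intermediate values
h = sigmoid(w1 * x)       # hidden activation
y = sigmoid(w2 * h)       # output
loss = (y - target) ** 2

# Backward pass: chain rule, layer by layer
dloss_dy = 2 * (y - target)
dy_dz2 = y * (1 - y)                 # sigmoid derivative at the output
dloss_dw2 = dloss_dy * dy_dz2 * h    # gradient for the output weight
dloss_dh = dloss_dy * dy_dz2 * w2    # blame passed back to the hidden neuron
dh_dz1 = h * (1 - h)
dloss_dw1 = dloss_dh * dh_dz1 * x    # gradient for the hidden weight

# Sanity check: the analytic gradient matches a finite-difference estimate
def loss_at(w1_val):
    return (sigmoid(w2 * sigmoid(w1_val * x)) - target) ** 2

eps = 1e-6
numeric = (loss_at(w1 + eps) - loss_at(w1 - eps)) / (2 * eps)
print(abs(dloss_dw1 - numeric) < 1e-8)  # True
```

Note how `dloss_dh` is the relay-race step: the output's blame, passed back through `w2` to the hidden neuron, which then distributes it to its own weight.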

The Learning Rate

The gradient tells you which direction to move. The learning rate tells you how big a step to take.

new_weight = old_weight - learning_rate × gradient

Too high (1.0) and the network overshoots. The loss bounces around, never settling. Too low (0.01) and training crawls. Each update barely moves the weights. A learning rate around 0.3 to 0.5 usually gives steady progress.
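The effect is easy to see on a toy one-dimensional loss. This sketch (illustrative, not the playground's code) minimizes L(w) = w², whose gradient is 2w:

```python
def descend(lr, w=1.0, steps=20):
    """Apply new_weight = old_weight - learning_rate * gradient repeatedly."""
    for _ in range(steps):
        w = w - lr * (2 * w)   # gradient of w**2 is 2*w
    return w

print(descend(0.3))  # shrinks toward the minimum at 0
print(descend(1.0))  # overshoots: w flips between 1 and -1, never settling
```

At lr 0.3 each step multiplies w by 0.4, so it decays smoothly toward zero. At lr 1.0 each step maps w to -w: the loss never improves at all.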

In the playground, try training with learning rate 0.5 and seed 123. Watch the loss drop smoothly. Then try learning rate 1.0. Watch it bounce. The learning rate is the difference between a network that converges and one that thrashes.

The Loss Curve

In Post 2, I hand-crafted weights and got 100% accuracy instantly. No learning, no process.

With backpropagation, you start with random weights. The network gets everything wrong. The loss is high. Then, epoch by epoch, the loss drops. The predictions get closer. The decision boundary shifts from random noise to something that actually separates the classes.

That curve going down is learning happening in real time.

Open the playground and train a 2-4-1 network with learning rate 0.5 and seed 123. Watch the loss curve drop.
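If you'd rather watch the curve in code, here's a hedged end-to-end sketch: a 2-4-1 network trained on XOR with plain backpropagation in NumPy. The seed, learning rate, and epoch count are illustrative and won't exactly reproduce the playground's run:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR

rng = np.random.default_rng(123)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # 2 inputs -> 4 hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # 4 hidden -> 1 output
lr = 0.5

losses = []
for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    pred = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((pred - y) ** 2)))
    # backward pass: chain rule, layer by layer
    d_pred = 2 * (pred - y) / len(y)
    d_z2 = d_pred * pred * (1 - pred)
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T          # blame flowing back into the hidden layer
    d_z1 = d_h * h * (1 - h)
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0)
    # update weights
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(np.round(pred.ravel(), 2))
```

Plot `losses` and you get the curve this section describes: high at random initialization, dropping as the boundary takes shape.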

The Same Algorithm, Any Scale

Every modern neural network learned its weights through backpropagation. Image classifiers, language models, speech recognition. The algorithm that learned 9 weights for XOR is the same one that trained GPT-4's reported 1.76 trillion parameters. Forward pass, compute loss, backward pass, update weights. The scale changes. The principle doesn't.

Why the Starting Point Matters

Backpropagation starts with random weights. The random seed controls which random numbers you start with. Think of tuning an old analog radio. You turn the dial looking for a clear signal. Where you start turning from (the seed) decides which station you find first. Sometimes you land on a strong station. Sometimes you get stuck between two stations, hearing nothing but static, and no small turn of the dial fixes it.
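One thing worth verifying for yourself: the seed fully determines the starting weights, which is why a run is reproducible at all. A minimal sketch using NumPy's `default_rng` (the seeds and shapes are illustrative):

```python
import numpy as np

# Same seed, same starting weights: the run is reproducible
w_a = np.random.default_rng(5).normal(size=(2, 2))
w_b = np.random.default_rng(5).normal(size=(2, 2))
print((w_a == w_b).all())        # True

# Different seed, different starting point on the loss landscape
w_c = np.random.default_rng(123).normal(size=(2, 2))
print(not (w_a == w_c).all())    # True: the starting weights differ
```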

A small network (2-2-1) is like a radio with a narrow dial. The stations are packed tight, and a tiny turn jumps past the one you wanted. Very sensitive to where you start. A larger network (2-4-1) is a wider dial with more room between stations. Easier to land on a clear signal from almost any starting position.

In the playground, seed 5 with a 2-2-1 network gets stuck at 75%. Switch to 2-4-1 with the same seed and it converges to 100%. More neurons don't just add capacity. They add alternative routes to the solution.

Same seed, different architecture: 2-2-1 gets stuck, 2-4-1 converges

What's Next

We can now train networks automatically. But XOR has 4 training examples. Real datasets have thousands, millions. Computing the gradient using all examples at once is slow. And a single learning rate for every weight isn't ideal: some weights need bigger steps, others smaller.

Training a network is one thing. Training it efficiently, at scale, is a different problem. That's where optimizers come in.


References:
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors

Series: From Perceptrons to Transformers | Code: GitHub
