Nilavukkarasan R

Backpropagation: How Neural Networks Learn From Mistakes

"The backpropagation algorithm was a key historical step in demonstrating that deep neural networks could be trained effectively."
Geoffrey Hinton

From Hand-Crafted to Learned

A network that recognizes handwritten digits has hundreds of thousands of weights. A language model has billions. You can't hand-pick billions of numbers. There has to be a way for the network to find its own weights.

That's what backpropagation does. It starts with random weights and adjusts them automatically, using the network's errors to figure out which direction to nudge each weight.

Try, Miss, Adjust

Think about learning to throw darts. Your first throw misses the bullseye by a foot. You don't start over with a completely random throw. You adjust. A little less force, slightly different angle. The error (how far you missed) tells you which way to correct.

Backpropagation does the same thing, for every weight in the network, simultaneously:

1. Forward pass:   feed input through the network, get a prediction
2. Compute error:  how far off was the prediction?
3. Backward pass:  trace the error back, figure out each weight's share of blame
4. Update weights: nudge each weight to reduce the error
5. Repeat
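The five steps can be sketched end-to-end for the simplest possible case: a single sigmoid neuron learning AND (linearly separable, so no hidden layer is needed yet). All names and hyperparameters here are illustrative, not the playground's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=2)  # start with random weights
b = 0.0
lr = 0.5

for epoch in range(2000):          # 5. repeat
    pred = sigmoid(X @ w + b)      # 1. forward pass
    err = pred - y                 # 2. error: prediction minus target
    grad_z = err / len(y)          # 3. backward pass (with cross-entropy loss,
    grad_w = X.T @ grad_z          #    the output gradient is just the error)
    grad_b = grad_z.sum()
    w -= lr * grad_w               # 4. update weights
    b -= lr * grad_b

print(np.round(pred))  # → [0. 0. 0. 1.]
```

The network starts out wrong and ends up right, with no hand-picked weights anywhere.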

Step 3 is where the name comes from. The error at the output is clear: prediction minus target. But how much did each hidden neuron contribute to that error?

Think of it like a relay race where the team finishes 10 seconds too slow. The coach doesn't just blame the last runner. She works backward: the last runner lost 3 seconds, the one before lost 5, the first lost 2. Each runner's share of blame is traced back through the chain.

Backpropagation does the same thing. It starts at the output error and works backward through each layer, computing how much each weight contributed. This is the chain rule from calculus applied layer by layer. Each weight gets a gradient.
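To make the chain rule concrete, here's a hedged two-weight example: one input, one hidden neuron, one output neuron (all values illustrative). The analytic gradient from the backward pass is checked against a numerical estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.0      # illustrative input and label
w1, w2 = 0.6, -0.3        # illustrative weights

# Forward pass, keeping intermediate values
h = sigmoid(w1 * x)       # hidden activation
y = sigmoid(w2 * h)       # output
loss = (y - target) ** 2

# Backward pass: chain rule, layer by layer
dloss_dy = 2 * (y - target)
dy_dz2 = y * (1 - y)                 # sigmoid derivative at the output
dloss_dw2 = dloss_dy * dy_dz2 * h    # gradient for the output weight
dloss_dh = dloss_dy * dy_dz2 * w2    # blame passed back to the hidden neuron
dh_dz1 = h * (1 - h)
dloss_dw1 = dloss_dh * dh_dz1 * x    # gradient for the hidden weight

# Sanity check: the analytic gradient matches a finite-difference estimate
def loss_at(w1_val):
    return (sigmoid(w2 * sigmoid(w1_val * x)) - target) ** 2

eps = 1e-6
numeric = (loss_at(w1 + eps) - loss_at(w1 - eps)) / (2 * eps)
print(abs(dloss_dw1 - numeric) < 1e-8)  # True
```

Note how `dloss_dh` is the relay-race step: the output's blame, passed back through `w2` to the hidden neuron, which then distributes it to its own weight.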

The Learning Rate

The gradient tells you which direction to move. The learning rate tells you how big a step to take.

new_weight = old_weight - learning_rate × gradient

Too high (1.0) and the network overshoots. The loss bounces around, never settling. Too low (0.01) and training crawls. Each update barely moves the weights. A learning rate around 0.3 to 0.5 usually gives steady progress.
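The effect is easy to see on a toy one-dimensional loss. This sketch (illustrative, not the playground's code) minimizes L(w) = w², whose gradient is 2w:

```python
def descend(lr, w=1.0, steps=20):
    """Apply new_weight = old_weight - learning_rate * gradient repeatedly."""
    for _ in range(steps):
        w = w - lr * (2 * w)   # gradient of w**2 is 2*w
    return w

print(descend(0.3))  # shrinks toward the minimum at 0
print(descend(1.0))  # overshoots: w flips between 1 and -1, never settling
```

At lr 0.3 each step multiplies w by 0.4, so it decays smoothly toward zero. At lr 1.0 each step maps w to -w: the loss never improves at all.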

In the playground, try training with learning rate 0.5 and seed 123. Watch the loss drop smoothly. Then try learning rate 1.0. Watch it bounce. The learning rate is the difference between a network that converges and one that thrashes.

The Loss Curve

In Post 2, I hand-crafted weights and got 100% accuracy instantly. No learning, no process.

With backpropagation, you start with random weights. The network gets everything wrong. The loss is high. Then, epoch by epoch, the loss drops. The predictions get closer. The decision boundary shifts from random noise to something that actually separates the classes.

That curve going down is learning happening in real time.

Open the playground and train a 2-4-1 network with learning rate 0.5 and seed 123. Watch the loss curve drop.
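If you'd rather watch the curve in code, here's a hedged end-to-end sketch: a 2-4-1 network trained on XOR with plain backpropagation in NumPy. The seed, learning rate, and epoch count are illustrative and won't exactly reproduce the playground's run:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR

rng = np.random.default_rng(123)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # 2 inputs -> 4 hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # 4 hidden -> 1 output
lr = 0.5

losses = []
for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    pred = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((pred - y) ** 2)))
    # backward pass: chain rule, layer by layer
    d_pred = 2 * (pred - y) / len(y)
    d_z2 = d_pred * pred * (1 - pred)
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T          # blame flowing back into the hidden layer
    d_z1 = d_h * h * (1 - h)
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0)
    # update weights
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(np.round(pred.ravel(), 2))
```

Plot `losses` and you get the curve this section describes: high at random initialization, dropping as the boundary takes shape.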

The Same Algorithm, Any Scale

Every modern neural network learned its weights through backpropagation. Image classifiers, language models, speech recognition. The algorithm that learned 9 weights for XOR is the same one that trained GPT-4's reported 1.76 trillion parameters. Forward pass, compute loss, backward pass, update weights. The scale changes. The principle doesn't.

Why the Starting Point Matters

Backpropagation starts with random weights. The random seed controls which random numbers you start with. Think of tuning an old analog radio. You turn the dial looking for a clear signal. Where you start turning from (the seed) decides which station you find first. Sometimes you land on a strong station. Sometimes you get stuck between two stations, hearing nothing but static, and no small turn of the dial fixes it.
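One thing worth verifying for yourself: the seed fully determines the starting weights, which is why a run is reproducible at all. A minimal sketch using NumPy's `default_rng` (the seeds and shapes are illustrative):

```python
import numpy as np

# Same seed, same starting weights: the run is reproducible
w_a = np.random.default_rng(5).normal(size=(2, 2))
w_b = np.random.default_rng(5).normal(size=(2, 2))
print((w_a == w_b).all())        # True

# Different seed, different starting point on the loss landscape
w_c = np.random.default_rng(123).normal(size=(2, 2))
print(not (w_a == w_c).all())    # True: the starting weights differ
```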

A small network (2-2-1) is like a radio with a narrow dial. The stations are packed tight, and a tiny turn jumps past the one you wanted. Very sensitive to where you start. A larger network (2-4-1) is a wider dial with more room between stations. Easier to land on a clear signal from almost any starting position.

In the playground, seed 5 with a 2-2-1 network gets stuck at 75%. Switch to 2-4-1 with the same seed and it converges to 100%. More neurons don't just add capacity. They add alternative routes to the solution.

Same seed, different architecture: 2-2-1 gets stuck, 2-4-1 converges

What's Next

We can now train networks automatically. But XOR has 4 training examples. Real datasets have thousands, millions. Computing the gradient using all examples at once is slow. And a single learning rate for every weight isn't ideal: some weights need bigger steps, others smaller.

Training a network is one thing. Training it efficiently, at scale, is a different problem. That's where optimizers come in.


References:
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors

Series: From Perceptrons to Transformers | Code: GitHub
