Last week I read the 1958 Rosenblatt paper. The one that started everything. The Perceptron, the first learning machine, the idea that memory lives in connections and not addresses. And at the very end of that paper, almost as a footnote, Rosenblatt wrote that "some system, more advanced in principle than the perceptron, seems to be required."
This is that system.
Rumelhart, Hinton, and Williams. 1986. Four pages in Nature. And somewhere in those four pages, the answer to the question Rosenblatt had left open for twenty-eight years.
What Was Actually Broken
To understand why this paper matters, you need to understand what the Perceptron could not do. And the answer is not XOR, even though that is what everyone says. XOR is a symptom. The real problem was deeper.
In Rosenblatt's Perceptron, the feature detectors, the A-units sitting between the input and the output, were connected randomly and then frozen. Nobody trained them. The only thing that learned was the final layer, the response units deciding which class to pick. Which means the Perceptron could only learn to combine features that already existed in the raw input. It could not discover new features on its own.
Think about what this means. If you want the Perceptron to recognise faces, someone has to hand-engineer the features: edges, curves, symmetry. The network cannot figure out that symmetry is a useful thing to look for. It can only learn to weight the features you already gave it.
Rumelhart, Hinton, and Williams called these "hidden units." Units that sit between the input and the output, that are not told what to do, that have to figure out on their own what they should represent. Training hidden units is the problem. And the reason nobody had solved it is that you cannot directly measure how wrong a hidden unit is. You only know how wrong the output is. The error signal exists at the top. The hidden units are somewhere in the middle.
Backpropagation is the answer to one question: how do you take an error signal that exists only at the output, and use it to train units that you cannot directly observe?
The Architecture: A Factory With Floors
Before the math, the picture.
Imagine a factory with multiple floors. Raw materials come in at the ground floor. Each floor transforms what it receives and passes the result upward. The finished product comes out at the top floor. You have a quality inspector at the top who compares the finished product to what was ordered and measures how wrong it is.
Now the problem: the quality inspector knows the final product is wrong. But which floor made the mistake? And by exactly how much did each floor contribute to the wrongness?
This is the problem backpropagation solves. And it solves it in two passes.
The forward pass is the factory running normally. Input comes in at the bottom, each layer transforms it, output comes out at the top. Simple.
The backward pass is blame flowing in reverse. The top floor gets blamed proportionally to how wrong the output was. It then passes blame to the floor below it, saying: here is how much each of you contributed to my mistake. Each floor does the same, passing blame further down, all the way to the bottom. Every connection in the network receives a precise measure of how much it contributed to the final error. Then every weight adjusts itself to be slightly less wrong next time.
That is backpropagation. The forward pass computes what the network thinks. The backward pass computes who is responsible for being wrong.
The Forward Pass: Equations 1 and 2
Let us get precise. In the paper, Rumelhart et al. define the network with two equations that govern how information moves forward through the network.
The total input to unit j is a weighted sum of everything coming from below:
x(j) = Σ y(i) · w(ij)
Where y(i) is the output of unit i in the layer below, and w(ij) is the weight of the connection from i to j. Every unit below j sends its output upward, multiplied by the strength of its connection. Unit j adds them all up.
Then the output of unit j is a non-linear function of that total input:
y(j) = 1 / (1 + e^(-x(j)))
This is the sigmoid function. It takes any real number and squashes it into a value between 0 and 1. The reason you need this non-linearity is critical: without it, stacking multiple layers is mathematically equivalent to having just one layer. The non-linearity is what allows each layer to do something genuinely different from the layer below it.
The paper says any function with a bounded derivative will work here. In 1986, they used sigmoid. Today we use ReLU, which is simply max(0, x). Simpler to compute, faster to train, does not suffer from the vanishing gradient problem that sigmoid creates in deep networks. But the principle is identical: a non-linearity that lets each layer transform its input in a way that the next layer cannot simply undo.
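If you want to see equations 1 and 2 with actual numbers in them, here is a minimal sketch in plain Python. The function names and the example values are mine, not the paper's:

```python
import math

def sigmoid(x):
    # Equation 2: squash any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def unit_forward(inputs, weights):
    # Equation 1: total input is the weighted sum of outputs from the layer below
    total_input = sum(y_i * w_ij for y_i, w_ij in zip(inputs, weights))
    # Equation 2: pass the total through the non-linearity
    return sigmoid(total_input)

# One unit receiving three outputs from the layer below
print(unit_forward([1.0, 0.5, 0.0], [0.4, -0.2, 0.7]))  # sigmoid(0.3) ≈ 0.574
```

Every unit in the network, in every layer, is running exactly these two lines during the forward pass.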
The Error: Equation 3
After the forward pass, you have the network's output. You compare it to what the output should have been. The error is defined as:
E = (1/2) Σ (y(j) - d(j))²
Where y(j) is what the network produced for output unit j, and d(j) is what it should have produced. The (1/2) out front is a convenience: when you differentiate this, the 2 from the square and the 1/2 cancel cleanly.
This is squared error. Modern frameworks usually take the mean over units rather than the sum, but that only rescales the gradient by a constant. Today we often use cross-entropy loss for classification problems because it has better gradient properties near zero and one. But the backpropagation algorithm works identically regardless of what loss function you choose. The only requirement is that the loss is differentiable with respect to the output.
The goal is to minimise E. And the way to minimise E is gradient descent: find which direction in weight space makes E increase, and move in the opposite direction. To do this, you need the partial derivative of E with respect to every single weight in the network. This is what the backward pass computes.
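Equation 3 in code, applied to the worked example we are about to carry through the backward pass (the function name is mine):

```python
def error(outputs, targets):
    # Equation 3: half the sum of squared differences
    return 0.5 * sum((y - d) ** 2 for y, d in zip(outputs, targets))

# The network said 0.9, the answer was 0
print(error([0.9], [0.0]))  # 0.5 * 0.81 ≈ 0.405
```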
The Backward Pass: Equations 4 Through 7
This is the heart of the paper. And it is where most explanations lose people, because they introduce notation and concepts at the same time, leaving you holding two unfamiliar things at once. Let me do this differently.
I am going to carry one concrete example through every step. The network predicted 0.9. The correct answer was 0. The network is very wrong. We need to figure out which weights caused this, and by exactly how much. That is the only question the backward pass is answering.
Before anything else: what is the chain rule doing here?
The chain rule from calculus says: if A affects B, and B affects C, then you can figure out how A affects C by multiplying how A affects B by how B affects C.
In our network, a weight affects a unit's total input. That total input affects the unit's output through the sigmoid. That output flows upward and eventually affects the final error. The chain rule lets us connect the first link (weight) to the last link (error) by multiplying all the steps in between. This is all we are doing, four times in a row, layer by layer, going backwards.
Step 1: How wrong is the output, and in which direction?
Start at the top. The output unit produced 0.9. The correct answer was 0. The first thing we need is a precise measure of how the error changes when the output changes.
The answer is simply: prediction minus target.
∂E / ∂y = y - d
In our example: 0.9 minus 0 equals 0.9. Large and positive. This tells us: if the output goes up even slightly, the error gets worse. The network needs to push this output down.
Step 2: How much did the total input to that unit contribute to the error?
The output of a unit is the sigmoid of its total input. So before we can ask "which weights caused this," we need to go one step back: how sensitive is the error to the total input arriving at the output unit?
This is where the chain rule enters. The error depends on the output, and the output depends on the total input. So:
∂E / ∂x = (∂E / ∂y) · (∂y / ∂x)
The second term, how does the sigmoid output change when its input changes, has a beautiful closed form. The derivative of the sigmoid is:
∂y / ∂x = y · (1 - y)
In our example the output was 0.9, so this is 0.9 times 0.1 which equals 0.09. Putting it together:
∂E / ∂x = 0.9 × 0.09 = 0.081
This combined number is what Rumelhart et al. call the error signal for this unit. It captures both how wrong the output was and how sharply the sigmoid was responding at that point.
One thing worth pausing on. Notice what happens when the sigmoid output is very close to 0 or very close to 1. The term y · (1 - y) becomes very small. A 0.99 output gives 0.99 times 0.01 which is 0.0099. This means the error signal almost vanishes when units are saturated near the extremes of the sigmoid. Blame barely reaches the weights below. This is the vanishing gradient problem, and it is why deep networks trained with sigmoid struggled for decades until ReLU replaced it. ReLU does not saturate: its derivative is simply 1 for any positive input. The blame flows through cleanly.
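Steps 1 and 2 together, in a sketch that also makes the saturation problem visible (the function name is mine):

```python
def error_signal(y, d):
    # Step 1: dE/dy = y - d, then step 2: multiply by the sigmoid derivative y(1 - y)
    return (y - d) * y * (1.0 - y)

# The worked example: output 0.9, target 0
print(error_signal(0.9, 0.0))   # 0.9 * 0.09 ≈ 0.081

# A saturated unit: even more wrong, yet the signal almost vanishes
print(error_signal(0.99, 0.0))  # 0.99 * 0.0099 ≈ 0.0098
```

Note the second case: the output moved closer to completely wrong, but the error signal got eight times smaller. That is the vanishing gradient in two lines.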
Step 3: Which weights caused the error, and by how much?
Now we have the error signal at the output unit. We need to turn this into a gradient for each weight connecting into that unit.
The total input to a unit is just a weighted sum of the outputs from the layer below. So if we change one weight by a tiny amount, the total input changes by exactly the output of the unit that weight came from. Nothing more. This means:
∂E / ∂w = (∂E / ∂x) · y
In concrete terms: if the error signal at the output unit is 0.081, and the hidden unit connected to it had an output of 0.6, then the gradient for that weight is 0.081 times 0.6 which equals 0.049. This weight needs to decrease by an amount proportional to 0.049 to reduce the error.
The elegance here is striking. The gradient for any weight is just two numbers multiplied together: what the layer above says about the error, and what the layer below actually produced. That is it.
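Step 3 really is just a multiplication. A sketch with the numbers from the text (the function name is mine):

```python
def weight_gradient(error_signal_above, output_below):
    # Step 3: dE/dw is the error signal above times the activation below
    return error_signal_above * output_below

# Error signal 0.081 at the output unit, hidden unit below produced 0.6
print(weight_gradient(0.081, 0.6))  # ≈ 0.0486, the 0.049 from the text
```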
Step 4: Passing blame to the hidden layer.
You now have gradients for all the weights in the final layer. But you need to do the same thing for every hidden layer below. To do that, you need to know the error signal for each hidden unit, the same way you computed it for the output unit in step 1.
A hidden unit connects upward to multiple output units. Each of those connections contributed to the final error. So the total blame assigned to a hidden unit is the sum of blame from every unit it connects to above it, weighted by the strength of each connection:
∂E / ∂y(hidden) = Σ (∂E / ∂x(above)) · w
If a hidden unit connects to three output units with weights 0.5, 0.3, and 0.8, and those output units have error signals 0.081, 0.04, and 0.02, then the blame reaching the hidden unit is:
(0.081 × 0.5) + (0.04 × 0.3) + (0.02 × 0.8) = 0.0405 + 0.012 + 0.016 = 0.069
And this is the key to the whole algorithm. Once you have this blame signal at the hidden unit, you repeat steps 2 and 3 exactly as before: multiply by the sigmoid derivative to get the error signal, then multiply by the outputs from the layer below to get weight gradients.
The algorithm cascades backwards through the entire network, one layer at a time. Every weight in the network receives a precise gradient telling it exactly how much it contributed to the final error. This is why it is called backpropagation.
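Step 4, with the three-connection example from the text, plus one repeat of step 2 to show the cascade continuing downward. The 0.6 hidden-unit output is an assumed value for illustration:

```python
def hidden_blame(weights_up, error_signals_above):
    # Step 4: sum each upstream error signal, weighted by the connection strength
    return sum(w * s for w, s in zip(weights_up, error_signals_above))

# Three connections up, with the weights and error signals from the text
blame = hidden_blame([0.5, 0.3, 0.8], [0.081, 0.04, 0.02])
print(blame)  # 0.0405 + 0.012 + 0.016 ≈ 0.0685

# Repeat step 2: if this hidden unit's output was 0.6 (assumed),
# its own error signal is blame * y * (1 - y)
print(blame * 0.6 * 0.4)  # ≈ 0.0164, ready to flow to the layer below
```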
The Weight Update: Equations 8 and 9
Once you have the gradients, you update the weights. The simplest version is vanilla gradient descent:
Δw = -ε · (∂E / ∂w)
Where ε is the learning rate, a small number like 0.01 that controls how large each step is. The negative sign is because you want to move in the direction that reduces E, which is the opposite of the gradient direction.
But Rumelhart et al. immediately pointed out the problem with vanilla gradient descent: it is slow, and it oscillates. If the gradient keeps pointing in the same direction, you want to pick up speed. If it keeps reversing, you want to slow down.
Their solution adds momentum:
Δw(t) = -ε · (∂E / ∂w(t)) + α · Δw(t-1)
The second term carries forward a fraction α of the previous weight update. If the gradient has been pointing in the same direction for several steps, the weight updates accumulate and the learning accelerates. If the gradient reverses, the momentum term and the gradient term partially cancel, dampening the oscillation.
This is the ancestor of every modern optimizer. Adam, RMSprop, AdaGrad, they are all elaborate answers to the same question Rumelhart and Hinton were asking in 1986: how do you make gradient descent faster and more stable? Momentum was the first answer. It is still inside every optimizer you use today.
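Equation 9 as a sketch, showing the acceleration effect when the gradient keeps pointing the same way (the function name and values are mine):

```python
def momentum_step(gradient, prev_update, lr=0.01, alpha=0.9):
    # Equation 9: new update = gradient step plus a fraction of the last update
    return -lr * gradient + alpha * prev_update

# A gradient that points the same way for five steps: updates grow
update = 0.0
for _ in range(5):
    update = momentum_step(1.0, update)
print(update)  # ≈ -0.041, heading toward the limit -lr / (1 - alpha) = -0.1
```

Without momentum, each step would be a flat -0.01. With it, steady agreement among consecutive gradients compounds into steps up to ten times larger, while a reversing gradient partially cancels the accumulated update instead.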
What the Experiments Actually Proved
This is the part I want to spend time on, because most blogs about backprop skip it entirely and just talk about the algorithm. The experiments in this paper are where the real idea lives.
Experiment 1: Symmetry detection.
The task: given a binary input vector, is it symmetrical about its midpoint? This cannot be solved without hidden units, and the paper proves it. You cannot add up individual inputs and get symmetry. You need to compare positions across the midpoint, which requires an intermediate representation.
Rumelhart et al. trained a network with two hidden units on this task. After training, they inspected what the hidden units had learned. The weights were symmetric about the midpoint with opposite signs. For a symmetric input, both hidden units received zero net input and stayed off, causing the output unit (which had a positive bias) to fire, signalling symmetry. For any non-symmetric input, at least one hidden unit fired and suppressed the output.
The network was never told "look for symmetry across the midpoint." It discovered that structure because discovering it was the most efficient way to reduce the error. That is representation learning.
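The mechanism is easy to verify by hand. A sketch with illustrative anti-symmetric weights of my own choosing (the paper's trained network found weights with magnitudes in the ratio 1:2:4, symmetric about the midpoint with opposite signs):

```python
def net_input(pattern, weights):
    # Equation 1 again: weighted sum of the inputs
    return sum(p * w for p, w in zip(pattern, weights))

# Anti-symmetric about the midpoint; these particular values are
# illustrative, not taken from the paper's figure
hidden_weights = [-1.0, -2.0, -4.0, 4.0, 2.0, 1.0]

symmetric = [1, 0, 1, 1, 0, 1]   # mirror image about the midpoint
asymmetric = [1, 1, 0, 1, 0, 1]

print(net_input(symmetric, hidden_weights))   # 0.0 - contributions cancel, unit stays off
print(net_input(asymmetric, hidden_weights))  # non-zero - the unit fires
```

The 1:2:4 magnitudes matter: powers of two guarantee that no asymmetric pattern can accidentally sum to zero, for the same reason a binary representation is unique.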
Experiment 2: The family tree.
This one is remarkable. Two isomorphic family trees, one English and one Italian, 104 relationships expressed as triples: (person, relationship, person). The network was trained on 100 of the 104 triples and asked to complete the third term when given the first two.
The network had to compress information about 24 people and 12 relationships into 6 hidden units per group. After training, Rumelhart et al. examined what those 6 units had learned to encode. Unit 1 distinguished English from Italian people. Unit 2 encoded which generation a person belonged to. Unit 6 encoded which branch of the family they came from.
Nobody told the network that nationality, generation, and family branch were relevant features. The network discovered them because they were the most compact and useful way to represent the information it needed to produce correct outputs. And because it had learned these structural features, it generalised correctly to the 4 triples it had never seen during training.
This is the proof of concept for representation learning. It is the same thing that happens when a modern neural network learns that edges are useful in layer 1, textures in layer 2, and object parts in layer 3. Nobody tells it to look for those things. It discovers them because they are the most efficient path to reducing error.
What They Admitted the Paper Cannot Do
Here is the thing about this paper that I respect enormously. The last two paragraphs contain a list of everything they knew was wrong.
The error surface may contain local minima. Gradient descent is not guaranteed to find the global minimum. They say that in practice the network rarely gets badly stuck, and that adding more connections than strictly necessary tends to smooth out the landscape. This is still true. Overparameterised networks generalise better than theory predicts. Nobody fully understands why.
The learning procedure is not biologically plausible. Backpropagation requires exact weight symmetry between the forward and backward passes, precise storage of intermediate activations, and a global error signal propagated backwards. Brains do none of these things in any obvious way. Rumelhart and Hinton knew this in 1986 and said so plainly. They hoped studying backpropagation would eventually lead to something more biologically realistic. Forty years later, that search is still ongoing.
And the scaling problem. The paper does not dwell on it, but the footnote is there. The procedure works on the tasks they tried. It is not obvious how it scales. The answer turned out to be: it scales remarkably well with more data and more compute, in ways nobody in 1986 could have anticipated. But the doubt was honest and appropriate.
The Modern Version: What loss.backward() Is Actually Doing
When you write this in PyTorch:
```python
import torch
import torch.nn as nn

# A simple two-layer network, the same structure Rumelhart et al. used
model = nn.Sequential(
    nn.Linear(input_size, hidden_size),   # weights w_ij, equation 1
    nn.Sigmoid(),                         # equation 2 - they used sigmoid
    nn.Linear(hidden_size, output_size),
    nn.Sigmoid()
)

# Equation 3: mean squared error
criterion = nn.MSELoss()

# Equation 9: SGD with momentum, directly from the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
for inputs, targets in dataloader:
    # Forward pass: equations 1 and 2 running layer by layer
    outputs = model(inputs)

    # Equation 3: compute error
    loss = criterion(outputs, targets)

    # Backward pass: equations 4 through 7, computed automatically
    # This is the entire backward pass of the paper, done in one line
    optimizer.zero_grad()
    loss.backward()

    # Equation 9: update weights using accumulated gradients
    optimizer.step()
```
Every single line maps directly to something in the 1986 paper. nn.Linear is equation 1. nn.Sigmoid is equation 2. nn.MSELoss is equation 3. loss.backward() is equations 4 through 7, the entire backward pass, computed automatically using autograd. optimizer.step() with momentum is equation 9.
The difference is that today we would swap sigmoid for ReLU (faster, avoids vanishing gradients), MSELoss for CrossEntropyLoss for classification tasks (better gradient signal), and SGD with momentum for Adam (adaptive learning rates per parameter). But the underlying algorithm, forward pass, compute error, backward pass, update weights, is identical to what Rumelhart, Hinton, and Williams described in four pages in 1986.
One more thing worth noting. The paper says they accumulated gradients over all training cases before updating weights. Today we call that batch gradient descent. The alternative they mention, updating after every case, is stochastic gradient descent. Modern training uses mini-batch gradient descent, a middle ground they could not fully explore in 1986 because they did not have the compute. The insight was already there. The scale was not.
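The three schedules differ only in how often the weights update relative to the data. A schematic sketch (function names are mine, and each function just returns the accumulated gradient per update rather than applying it):

```python
def batch_gd(cases, grad_fn):
    # 1986-style: accumulate the gradient over every case, then one update
    return sum(grad_fn(x, d) for x, d in cases)

def sgd(cases, grad_fn):
    # Stochastic: one update per case
    return [grad_fn(x, d) for x, d in cases]

def minibatch_gd(cases, grad_fn, batch_size=2):
    # Modern middle ground: one update per small batch of cases
    updates = []
    for i in range(0, len(cases), batch_size):
        batch = cases[i:i + batch_size]
        updates.append(sum(grad_fn(x, d) for x, d in batch))
    return updates
```

Same gradients, different granularity: one update per sweep, one per case, or one per batch.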
What This Paper Actually Did
The Perceptron told us that connections can learn. But it could only learn to weight features you already knew about.
Backpropagation told us that hidden units can learn too. And when hidden units learn, they discover features nobody designed. Nationality. Generation. Symmetry. Edges. Textures. Syntax. Meaning. Every layer of every deep network you have ever used is discovering features that its designers did not explicitly specify, using a procedure that is recognisably the same four equations from this paper.
The Perceptron answered Rosenblatt's question about how memory influences behavior. Backpropagation answered the harder question: how does a system figure out what to remember in the first place?
Everything since, every architecture, every optimizer, every training trick, is an elaboration on the answer Rumelhart, Hinton, and Williams put into four pages in Nature in October 1986.
They did not know it would scale to a trillion parameters. They did not know it would learn to write and reason and generate images. They just knew it could learn that Colin's aunt is Sophia, without being told to look for aunts.
That was enough to change everything.
This is part of my series reading foundational AI papers from scratch. Next up: Gradient-Based Learning Applied to Document Recognition, LeCun et al., 1998. The paper that took what Rumelhart and Hinton built and asked: what if the architecture itself could encode structure?