The XOR Problem: Why Neural Nets Needed Hidden Layers

#ai #beginners #deeplearning #machinelearning

In 1969, a single math problem helped freeze funding for neural networks for roughly a decade. The problem was XOR, and the reason it mattered is one of the most important ideas in deep learning: why you need a hidden layer. I built a tiny network from scratch that you can watch fail at XOR, then flip on a hidden layer and watch it win.

⊕ Interactive demo (train it live): https://dev48v.infy.uk/dl/day5-xor.html

This is Day 5 of my DeepLearningFromZero series — neural nets built from scratch.

XOR: on when the inputs disagree

(0,0) → 0     (0,1) → 1
(1,0) → 1     (1,1) → 0

Plot those four points and the 1s sit on one diagonal, the 0s on the other. XOR is the simplest logic function that is not linearly separable: no single straight line puts both 1s on one side and both 0s on the other.
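A short algebraic check makes the claim concrete. Assume a hypothetical line with weights $w_1, w_2$ and bias $b$ (symbols introduced here only for illustration) that outputs 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$. The four XOR points would then require:

```latex
\begin{aligned}
(0,0) \to 0 &: \quad b \le 0 \\
(0,1) \to 1 &: \quad w_2 + b > 0 \\
(1,0) \to 1 &: \quad w_1 + b > 0 \\
(1,1) \to 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
```

Adding the two middle rows gives $w_1 + w_2 + 2b > 0$, while adding the first and last rows gives $w_1 + w_2 + 2b \le 0$. Both cannot hold at once, so no such line exists. Putting the 1s on the negative side instead only flips the signs of the weights and runs into the same contradiction.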

A single neuron draws one line — and fails

A perceptron computes w·x + b and thresholds it, which gives exactly one straight decision boundary. Rotate or shift it however you like, at least one of the four points always lands on the wrong side:

const out = sigmoid(w0 * a + w1 * b + bias);   // one line, so XOR stays out of reach

In the demo, run "no hidden layer" mode and the loss flatlines near 0.25, the value mean squared error gives for answering 0.5 to every input. No amount of extra training gets it past that.
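The plateau is easy to reproduce outside the demo. The following is a minimal sketch, not the demo's actual code: it trains one sigmoid neuron on XOR with full-batch gradient descent, assuming a squared-error loss. The names `XOR` and `trainSingleNeuron`, the starting weights, the learning rate, and the step count are all illustrative choices.

```javascript
// Sketch: one sigmoid neuron trained on XOR with full-batch gradient descent.
// Starting weights, learning rate, and step count are illustrative assumptions.
const sigmoid = z => 1 / (1 + Math.exp(-z));

const XOR = [
  { a: 0, b: 0, y: 0 },
  { a: 0, b: 1, y: 1 },
  { a: 1, b: 0, y: 1 },
  { a: 1, b: 1, y: 0 },
];

function trainSingleNeuron(steps = 5000, lr = 1.0) {
  let w0 = 0.3, w1 = -0.2, bias = 0.1; // arbitrary small starting point
  for (let s = 0; s < steps; s++) {
    let g0 = 0, g1 = 0, gb = 0;
    for (const { a, b, y } of XOR) {
      const out = sigmoid(w0 * a + w1 * b + bias);
      // Gradient of 0.5 * (out - y)^2 with respect to the weighted sum
      const dOut = (out - y) * out * (1 - out);
      g0 += dOut * a;
      g1 += dOut * b;
      gb += dOut;
    }
    // Average the gradients over the four points, then take one step
    w0 -= lr * g0 / XOR.length;
    w1 -= lr * g1 / XOR.length;
    bias -= lr * gb / XOR.length;
  }
  const outputs = XOR.map(({ a, b }) => sigmoid(w0 * a + w1 * b + bias));
  const loss =
    XOR.reduce((sum, { y }, i) => sum + (outputs[i] - y) ** 2, 0) / XOR.length;
  return { outputs, loss };
}

const single = trainSingleNeuron();
console.log(single.outputs.map(o => o.toFixed(3)), single.loss.toFixed(4));
```

From this starting point the weights shrink toward zero, every output settles near 0.5, and the mean squared error sits at about 0.25.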

This killed AI funding

Minsky & Papert's 1969 book Perceptrons proved a single-layer network can't do XOR. The result was so deflating it helped trigger the first "AI winter." The irony: the limitation wasn't neural nets — it was nets with no hidden layer.

A hidden layer = combine multiple lines

Add a layer of hidden neurons and each learns its own line. The output neuron then combines them ("above line A AND below line B") to carve out the diagonal region XOR needs:

const h = W1.map((row, i) => sigmoid(dot(row, x) + b1[i]));  // one line per hidden neuron
const out = sigmoid(dot(W2, h) + b2);                // combine them

Stacked nonlinear layers can represent shapes a single layer never could.
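To see the "combine two lines" idea without any training, here is a sketch with hand-picked weights. They are chosen for illustration, not learned, and are not taken from the demo: hidden neuron 0 behaves like OR, hidden neuron 1 like NAND, and the output neuron ANDs the two. `b1` is assumed to be an array holding one bias per hidden neuron, and `predict` is a hypothetical helper name.

```javascript
// Hand-picked weights (illustrative, not learned):
// hidden neuron 0 ~ OR, hidden neuron 1 ~ NAND, output ~ AND of the two.
const sigmoid = z => 1 / (1 + Math.exp(-z));
const dot = (u, v) => u.reduce((sum, ui, i) => sum + ui * v[i], 0);

const W1 = [[20, 20], [-20, -20]]; // one row (one line) per hidden neuron
const b1 = [-10, 30];              // one bias per hidden neuron
const W2 = [20, 20];
const b2 = -30;

function predict(x) {
  const h = W1.map((row, i) => sigmoid(dot(row, x) + b1[i])); // two lines
  return sigmoid(dot(W2, h) + b2);                            // combine them
}

const inputs = [[0, 0], [0, 1], [1, 0], [1, 1]];
const rounded = inputs.map(x => Math.round(predict(x)));
console.log(rounded); // [ 0, 1, 1, 0 ]
```

The large weights push each sigmoid close to 0 or 1, so the network behaves almost like hard logic gates. A trained network usually lands on different weights that draw an equivalent pair of lines.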

Backprop trains every layer at once

Hidden layers were of little practical use until there was a way to train them. Backpropagation is that way: compute the output error, then use the chain rule to send a share of the blame back to every weight in every layer:

const dOut = (out - y) * out * (1 - out);                   // error at the output, through the sigmoid
const dH = W2.map((w, i) => dOut * w * h[i] * (1 - h[i]));  // chain rule into hidden
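One way to convince yourself the two chain-rule lines are right is a gradient check: nudge a weight slightly, measure how the loss changes, and compare that with the analytic value. The sketch below assumes the loss is 0.5 * (out - y)^2, which is what the `(out - y)` factor implies. The weights, the sample, and the helper names `loss` and `bump` are made up for illustration.

```javascript
// Sketch: compare chain-rule gradients with a finite-difference estimate.
// All weights and the training sample are made-up values for illustration.
const sigmoid = z => 1 / (1 + Math.exp(-z));
const dot = (u, v) => u.reduce((sum, ui, i) => sum + ui * v[i], 0);

const x = [1, 0], y = 1;
const W1 = [[0.5, -0.3], [0.8, 0.2]];
const b1 = [0.1, -0.1];
const W2 = [0.7, -0.4];
const b2 = 0.05;

// Assumed loss: 0.5 * (out - y)^2 for a single sample
function loss(w1, w2) {
  const h = w1.map((row, i) => sigmoid(dot(row, x) + b1[i]));
  const out = sigmoid(dot(w2, h) + b2);
  return 0.5 * (out - y) ** 2;
}

// Analytic gradients, using the same two lines as in the text
const h = W1.map((row, i) => sigmoid(dot(row, x) + b1[i]));
const out = sigmoid(dot(W2, h) + b2);
const dOut = (out - y) * out * (1 - out);
const dH = W2.map((w, i) => dOut * w * h[i] * (1 - h[i]));
const analyticW2 = dOut * h[0];  // dLoss / dW2[0]
const analyticW1 = dH[0] * x[0]; // dLoss / dW1[0][0]

// Numeric gradients via central differences
const eps = 1e-5;
const numericW2 =
  (loss(W1, [W2[0] + eps, W2[1]]) - loss(W1, [W2[0] - eps, W2[1]])) / (2 * eps);
const bump = d => [[W1[0][0] + d, W1[0][1]], W1[1]];
const numericW1 = (loss(bump(eps), W2) - loss(bump(-eps), W2)) / (2 * eps);

const gapW2 = Math.abs(analyticW2 - numericW2);
const gapW1 = Math.abs(analyticW1 - numericW1);
console.log(gapW2, gapW1); // both should be tiny compared with the gradients
```

If the analytic and numeric values disagree by more than a rounding-level amount, the backward pass has a bug. This check is cheap enough to run on any small from-scratch network.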

Flip the demo to "hidden layer" and the loss dives toward 0. XOR, solved.
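For readers who want the whole loop in one place, here is a rough sketch that wires the forward pass and the two gradient lines into a training routine. It is not the demo's source. The hidden size of 4, the learning rate, the epoch count, the seed, and the helper names `makeRng`, `forward`, `meanLoss`, and `train` are all assumptions made for this example.

```javascript
// Sketch: a 2-input, 4-hidden, 1-output sigmoid network trained on XOR.
// Hidden size, learning rate, epochs, and seed are illustrative assumptions.
const sigmoid = z => 1 / (1 + Math.exp(-z));
const dot = (u, v) => u.reduce((sum, ui, i) => sum + ui * v[i], 0);

// Small seeded generator so repeated runs start from the same weights
function makeRng(seed) {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return (s / 4294967296) * 2 - 1; // uniform in [-1, 1)
  };
}

const DATA = [[[0, 0], 0], [[0, 1], 1], [[1, 0], 1], [[1, 1], 0]];

function forward(net, x) {
  const h = net.W1.map((row, i) => sigmoid(dot(row, x) + net.b1[i]));
  const out = sigmoid(dot(net.W2, h) + net.b2);
  return { h, out };
}

// Mean squared error over the four XOR points
function meanLoss(net) {
  return DATA.reduce((s, [x, y]) => s + (forward(net, x).out - y) ** 2, 0) / DATA.length;
}

function train(hidden = 4, epochs = 10000, lr = 0.5, seed = 1) {
  const rnd = makeRng(seed);
  const net = {
    W1: Array.from({ length: hidden }, () => [rnd(), rnd()]),
    b1: Array.from({ length: hidden }, () => rnd()),
    W2: Array.from({ length: hidden }, () => rnd()),
    b2: rnd(),
  };
  const lossBefore = meanLoss(net);
  for (let e = 0; e < epochs; e++) {
    for (const [x, y] of DATA) {
      const { h, out } = forward(net, x);
      // Backward pass: output error first, then chain rule into the hidden layer
      const dOut = (out - y) * out * (1 - out);
      const dH = net.W2.map((w, i) => dOut * w * h[i] * (1 - h[i]));
      // Gradient step on every weight and bias
      net.W2 = net.W2.map((w, i) => w - lr * dOut * h[i]);
      net.b2 -= lr * dOut;
      net.W1 = net.W1.map((row, i) => row.map((w, j) => w - lr * dH[i] * x[j]));
      net.b1 = net.b1.map((b, i) => b - lr * dH[i]);
    }
  }
  return { net, lossBefore, lossAfter: meanLoss(net) };
}

const result = train();
console.log(result.lossBefore.toFixed(4), "->", result.lossAfter.toFixed(4));
```

How far the loss falls depends on the starting weights and the learning rate. Small XOR networks can sit on a plateau for a while or settle in a poor spot, so if a run stalls, trying another seed or more hidden neurons is a reasonable first step.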

The lesson of all deep learning

XOR is the toy version of the whole field. Stacking nonlinear layers lets a network represent functions a single layer can't reach. Every deep net since, from vision models to language models like GPT, builds on this same trick, scaled up to millions or billions of parameters.

Train it yourself: toggle the hidden layer, drag the learning rate, and watch the decision boundary bend.