In the last post, we talked about 2012 — the year deep learning stopped being an academic curiosity and started winning. But before we can appreciate why that moment mattered, we need to understand what these networks actually do, mechanically, when you feed them a number. Not the hand-wavy "it's like a brain" version. The actual math, the actual shapes, the actual reason a stack of matrix multiplications can approximate almost any function you throw at it.
So let's go back to the simplest possible neural network — one that isn't even really a network yet — and build up from there.
The perceptron: one line, one decision
Strip away everything, and a perceptron does exactly one thing: it draws a straight line (or a plane, or a hyperplane, depending on dimension) and asks which side are you on?
Given an input vector x and a weight vector w, the perceptron computes:
ŷ = sign(w · x)
That's the whole model. Multiply, sum, check the sign. If w · x is positive, predict one class; if negative, predict the other.
The geometry here is worth sitting with, because it explains both the perceptron's power and its ceiling. The equation w · x = 0 defines a hyperplane, and w is the vector perpendicular to that hyperplane. (With no bias term, as written here, that hyperplane always passes through the origin.) For any point not sitting exactly on the boundary, w · x is proportional to that point's signed distance from it, w · x / ‖w‖ to be exact: positive on one side, negative on the other. The perceptron isn't doing anything conceptually deep; it's just measuring which side of a line you fell on and reporting the sign.
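Here is a minimal sketch of that geometry in plain Python. The weights and test points are made up for illustration, and sending a score of exactly 0 to the negative class is an arbitrary tie-breaking assumption, not something the model dictates.

```python
import math

def dot(w, x):
    """Plain dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_predict(w, x):
    """Return +1 or -1 depending on which side of w . x = 0 the point falls."""
    # Assumption: a point exactly on the boundary is sent to the -1 class.
    return 1 if dot(w, x) > 0 else -1

def signed_distance(w, x):
    """Signed distance from x to the hyperplane: (w . x) / ||w||."""
    return dot(w, x) / math.sqrt(dot(w, w))

w = [3.0, 4.0]  # illustrative weights, chosen so that ||w|| = 5
for point in ([2.0, 1.0], [-1.0, -2.0]):
    print(point, perceptron_predict(w, point), signed_distance(w, point))
```

Scaling w by any positive constant changes the raw score w · x but leaves both the predicted class and the signed distance untouched, which is why only the direction of w matters for the decision.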
This is also exactly why a single perceptron cannot solve XOR. Plot the four XOR points and you'll see the two classes sitting diagonally opposite each other: there is no single straight line that separates them. The perceptron's decision boundary is always a hyperplane, no matter how you tune the weights, so a problem like XOR is simply out of reach for it. This limitation, formalized in Minsky and Papert's 1969 book Perceptrons, is widely credited with cooling interest in, and funding for, neural network research through much of the 1970s. It was not the only cause of that decade's "AI winter," but a proof that the simplest version of the idea had a hard ceiling made it much harder to argue for more.
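You can poke at this yourself with a brute-force search. The sketch below is an illustration, not a proof: it scans a coarse grid of candidate lines and reports the first one that classifies every point correctly. It assumes a bias term b (which the formula above leaves out) so the line does not have to pass through the origin, and it uses AND as a linearly separable contrast case.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [-1, -1, -1, 1]
XOR_LABELS = [-1, 1, 1, -1]

def separates(w1, w2, b, labels):
    """True if sign(w1*x1 + w2*x2 + b) matches every label."""
    for (x1, x2), y in zip(POINTS, labels):
        score = w1 * x1 + w2 * x2 + b
        if score * y <= 0:  # wrong side, or exactly on the boundary
            return False
    return True

def find_separator(labels, steps=21):
    """Scan a grid of (w1, w2, b) values in [-2, 2]; return the first separator found."""
    grid = [-2 + 4 * i / (steps - 1) for i in range(steps)]
    for w1, w2, b in product(grid, repeat=3):
        if separates(w1, w2, b, labels):
            return (w1, w2, b)
    return None

print("AND:", find_separator(AND_LABELS))
print("XOR:", find_separator(XOR_LABELS))
```

The search finds a line for AND and comes back empty for XOR. A finer grid or a wider range would not help: the four XOR constraints contradict each other, which is the content of the formal result.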
From one decision to a network of decisions
The fix turns out to be almost embarrassingly simple: stack more of them, and swap the sign function for something smoother.
A multi-layer perceptron (MLP) takes the same building block — weighted sum, then a function applied to it — and arranges many of them into layers. An input layer supplies the raw data, one or more hidden layers transform it, and an output layer produces the final prediction. Each hidden neuron receives a weighted sum of everything in the previous layer, applies a non-linearity, and passes the result forward.
That non-linearity is not optional decoration — it's the entire reason depth matters. Here's the argument: if every layer were purely linear, then a three-layer network would compute
ŷ = W₃(W₂(W₁x)) = (W₃W₂W₁)x
and W₃W₂W₁ is just... another matrix. Multiply as many linear layers together as you like, and you still only get a single linear transformation. Depth would be a complete waste of compute. The non-linear activation function sandwiched between each linear step is what stops this collapse from happening and gives depth an actual reason to exist.
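The collapse is easy to check numerically. In this sketch the matrices and the input are arbitrary illustrative values; the point is only that the layer-by-layer result and the single-matrix result agree, and that inserting a ReLU between layers breaks the equivalence.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, vi) for vi in v]

# Illustrative weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
W3 = [[2.0, -3.0]]
x = [1.0, -2.0]

# Three linear layers applied one after another
layer_by_layer = matvec(W3, matvec(W2, matvec(W1, x)))

# The same computation as one pre-multiplied matrix
collapsed = matvec(matmul(W3, matmul(W2, W1)), x)

# With a non-linearity between layers, the shortcut no longer applies
with_relu = matvec(W3, relu(matvec(W2, relu(matvec(W1, x)))))

print(layer_by_layer, collapsed, with_relu)
```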
Picking a non-linearity: sigmoid, tanh, ReLU
Not all activation functions are created equal, and the field's history is basically a story of people discovering why their current favorite has a problem, and fixing it.
Sigmoid was the first popular choice — an S-shaped curve squashing everything into (0, 1), with a conveniently simple derivative: f(x)(1 − f(x)). The catch is that this derivative maxes out at just 0.25, and collapses toward zero the moment you move more than a few units away from 0. The function saturates.
Tanh is sigmoid's cousin, squashing into (−1, 1) instead, with the nice property of being zero-centered — which tends to produce better-behaved gradients downstream. But it saturates too, for the same structural reason.
ReLU (rectified linear unit) is almost insultingly simple: max(0, x). No exponentials, dirt cheap to compute, and its derivative is either exactly 1 (for positive inputs) or exactly 0 (for negative ones). No saturation on the positive side, ever. This is the default choice for hidden layers in most modern architectures — not because it's clever, but because it doesn't get in its own way.
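To see the saturation story in numbers, here is a small sketch of the three functions and their derivatives. One assumption to flag: ReLU has no derivative at exactly 0, and this code follows the common convention of treating it as 0 there.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # f(x)(1 - f(x)), peaks at 0.25 when x = 0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1 when x = 0

def relu(x):
    return max(0.0, x)

def d_relu(x):
    # Convention (an assumption): use 0 as the derivative at x == 0.
    return 1.0 if x > 0 else 0.0

for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  sigmoid'={d_sigmoid(x):.4f}  "
          f"tanh'={d_tanh(x):.4f}  relu'={d_relu(x):.1f}")
```

At x = 5 both sigmoid and tanh have derivatives close to zero, while ReLU's is still exactly 1.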
Watching data actually move through the network
Let's make this concrete instead of abstract. Take a tiny network: 3 inputs, one hidden layer with 2 ReLU neurons, one sigmoid output neuron.
x = [1, 2, 3]
W1 = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.05]]   (2 × 3: one row per hidden neuron)
b1 = [0.1, -0.1]
W2 = [0.5, -0.3]   (1 × 2)
b2 = 0.2
Hidden pre-activation, z1 = W1·x + b1:
z1[0] = 0.1(1) + 0.2(2) - 0.1(3) + 0.1 = 0.3
z1[1] = 0.3(1) - 0.2(2) + 0.05(3) - 0.1 = -0.05
Apply ReLU — the negative value gets clipped to zero, no exceptions:
a1 = [0.3, 0]
Output pre-activation, z2 = W2·a1 + b2:
z2 = 0.5(0.3) + (-0.3)(0) + 0.2 = 0.35
Apply sigmoid:
ŷ = 1 / (1 + e^-0.35) ≈ 0.587
That's it — that's the entire forward pass. Linear combination, non-linearity, linear combination, non-linearity, prediction. Everything a feedforward network does, at any scale, is this pattern repeated more times with bigger matrices.
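The same forward pass in plain Python, using exactly the numbers from the walkthrough above. The variable names mirror the text; nothing here is specific to any library.

```python
import math

x = [1.0, 2.0, 3.0]
W1 = [[0.1, 0.2, -0.1],
      [0.3, -0.2, 0.05]]   # one row per hidden neuron
b1 = [0.1, -0.1]
W2 = [0.5, -0.3]           # single output neuron
b2 = 0.2

# Hidden layer: weighted sum plus bias, then ReLU
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [max(0.0, z) for z in z1]

# Output layer: weighted sum plus bias, then sigmoid
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = 1.0 / (1.0 + math.exp(-z2))

print("z1 =", z1)
print("a1 =", a1)
print("z2 =", z2)
print("y_hat =", round(y_hat, 3))
```

Floating-point rounding means z1 and z2 may print with tiny trailing digits, but they match the hand calculation to many decimal places.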
Which raises the obvious question: why bother writing it as matrix multiplication instead of just... doing this arithmetic neuron by neuron? Two reasons. First, it maps directly onto the kind of vectorized computation GPUs are built for: instead of looping over neurons, you do one matrix multiply. Second, and more importantly for what's coming next, matrix notation gives you clean derivatives. Once you write a layer as ŷ = Wx, matrix calculus says, loosely, that ∂ŷ/∂W = xᵀ and ∂ŷ/∂x = Wᵀ. More precisely, if δ is the gradient of the loss with respect to ŷ, then the gradient with respect to the weights is δxᵀ and the gradient with respect to the input is Wᵀδ. Those two identities turn out to be the entire mathematical engine behind training the network.
Dimension bookkeeping, while we're here, is simple: a layer mapping an n-dimensional input to an m-dimensional output has a weight matrix of shape (m × n). The output dimension is always just the number of neurons in that layer — nothing more mysterious than that.
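Both the gradient identities and the shape rule can be sanity-checked with a few lines of arithmetic. This sketch reads the identities in their backpropagation form: given an upstream gradient delta (the gradient of the loss with respect to ŷ), the weight gradient is the outer product delta·xᵀ and the input gradient is Wᵀ·delta. The loss used here, L = delta · (Wx), is a hypothetical stand-in chosen only so that its gradient with respect to ŷ is exactly delta; all the numbers are made up.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(W, x, delta):
    """Hypothetical scalar loss L = delta . (W x), so dL/dy_hat = delta."""
    return sum(d * yi for d, yi in zip(delta, matvec(W, x)))

W = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]     # shape (m x n) = (2 x 3): 3 inputs, 2 neurons
x = [2.0, 1.0, -1.0]
delta = [0.5, -1.5]        # upstream gradient, one entry per output

# Analytic gradients from the two identities
grad_W = [[d * xj for xj in x] for d in delta]                  # delta x^T
grad_x = [sum(W[i][j] * delta[i] for i in range(len(W)))
          for j in range(len(x))]                               # W^T delta

def numeric_dL_dW(i, j, eps=1e-6):
    """Finite-difference estimate of dL/dW[i][j]."""
    bumped = [row[:] for row in W]
    bumped[i][j] += eps
    return (loss(bumped, x, delta) - loss(W, x, delta)) / eps

def numeric_dL_dx(j, eps=1e-6):
    """Finite-difference estimate of dL/dx[j]."""
    bumped = x[:]
    bumped[j] += eps
    return (loss(W, bumped, delta) - loss(W, x, delta)) / eps

print("grad_W:", grad_W, "shape:", (len(grad_W), len(grad_W[0])))
print("grad_x:", grad_x)
print("numeric dL/dW[1][2]:", numeric_dL_dW(1, 2))
print("numeric dL/dx[1]:", numeric_dL_dx(1))
```

Note that grad_W comes out with the same (m × n) shape as W itself, which is exactly what you need to subtract it from the weights during an update.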
Teaching the network: backpropagation
Forward pass gets you a prediction. It says nothing about how to improve it. That's backpropagation's job, and it answers one specific question: how much did each weight contribute to the final error, and which direction should I nudge it?
The mechanism is two passes. Forward, to compute the prediction and the loss. Backward, to walk from that loss back through the network, layer by layer, figuring out each weight's share of the blame — using the chain rule the entire way.
The chain rule is deceptively simple: if g = f(h(x)), then dg/dx = df/dh · dh/dx. A neural network is nothing but a deeply nested function — layer inside layer inside layer — so the chain rule is the only reason it's even possible to compute how a change to some deeply buried weight ripples all the way forward to affect the final loss.
A concrete example makes this less abstract. Suppose:
a = 2x₁ → a = 2 (with x₁ = 1)
c = a + b → c = 11 (with b = 9)
e = c² → e = 121
g = e + 3 → g = 124
To get ∂g/∂x₁, you don't need to re-derive anything from scratch — you multiply the local derivatives along the path:
∂g/∂x₁ = (∂g/∂e) · (∂e/∂c) · (∂c/∂a) · (∂a/∂x₁)
= 1 · 2c · 1 · 2
= 1 · 22 · 1 · 2 = 44
Each piece is trivial on its own. The chain rule is what lets you chain them into an answer for a variable buried four steps deep — and it's also why backpropagation can reuse intermediate results instead of recomputing everything from scratch for every single weight, which is what makes it efficient enough to actually train networks with millions of parameters.
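The same four-step example as code, assuming x₁ = 1 (which is what a = 2 implies) and b = 9. The forward pass stores every intermediate value, the backward pass multiplies the local derivatives, and a finite-difference estimate serves as an independent check.

```python
def forward(x1, b=9.0):
    """Run the four-step computation and return every intermediate value."""
    a = 2 * x1
    c = a + b
    e = c ** 2
    g = e + 3
    return a, c, e, g

x1 = 1.0
a, c, e, g = forward(x1)

# Local derivatives, each one trivial on its own
dg_de = 1.0       # g = e + 3
de_dc = 2 * c     # e = c^2, reuses c from the forward pass
dc_da = 1.0       # c = a + b
da_dx1 = 2.0      # a = 2 * x1

# Chain rule: multiply along the path from g back to x1
dg_dx1 = dg_de * de_dc * dc_da * da_dx1

# Independent check with a central finite difference
eps = 1e-6
numeric = (forward(x1 + eps)[3] - forward(x1 - eps)[3]) / (2 * eps)

print("forward values:", a, c, e, g)
print("chain rule:", dg_dx1)
print("finite difference:", numeric)
```

The reuse of c in de_dc is the efficiency point in miniature: the backward pass needs values the forward pass already computed, so it stores them instead of recomputing.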
One more thing worth being precise about: backpropagation is not the training algorithm. It's just the mechanism for computing the gradient. Gradient descent (or a variant of it) is the separate step that actually uses that gradient to update the weights.
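To make that separation concrete, here is a sketch that runs one backward pass and one update on the tiny 3-2-1 network from the forward-pass walkthrough. Several things are assumptions added for illustration and do not come from the walkthrough itself: the target y = 1, the binary cross-entropy loss, and the learning rate eta = 0.1. The backprop function only computes gradients; sgd_step is the separate piece that changes the weights.

```python
import math

x = [1.0, 2.0, 3.0]
W1 = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.05]]
b1 = [0.1, -0.1]
W2 = [0.5, -0.3]
b2 = 0.2
y = 1.0     # assumed target label
eta = 0.1   # assumed learning rate

def forward(W1, b1, W2, b2):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [max(0.0, z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    y_hat = 1.0 / (1.0 + math.exp(-z2))
    # Assumed loss: binary cross-entropy
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return z1, a1, y_hat, loss

def backprop(W1, b1, W2, b2):
    """Compute gradients only; no weights are changed here."""
    z1, a1, y_hat, _ = forward(W1, b1, W2, b2)
    dz2 = y_hat - y                       # sigmoid + cross-entropy simplifies to this
    dW2 = [dz2 * a for a in a1]
    db2 = dz2
    da1 = [dz2 * w for w in W2]           # push the error back through W2
    dz1 = [da * (1.0 if z > 0 else 0.0)   # ReLU passes gradient only where active
           for da, z in zip(da1, z1)]
    dW1 = [[d * xi for xi in x] for d in dz1]
    db1 = dz1
    return dW1, db1, dW2, db2

def sgd_step(W1, b1, W2, b2, grads, eta):
    """Gradient descent: the separate step that actually moves the weights."""
    dW1, db1, dW2, db2 = grads
    W1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, dW1)]
    b1 = [b - eta * g for b, g in zip(b1, db1)]
    W2 = [w - eta * g for w, g in zip(W2, dW2)]
    b2 = b2 - eta * db2
    return W1, b1, W2, b2

loss_before = forward(W1, b1, W2, b2)[3]
grads = backprop(W1, b1, W2, b2)
new_params = sgd_step(W1, b1, W2, b2, grads, eta)
loss_after = forward(*new_params)[3]
print("loss before:", loss_before)
print("loss after one step:", loss_after)
```

Notice that the second hidden neuron, the one ReLU clipped to zero, receives a gradient of exactly zero for all of its weights. Swapping sgd_step for a different optimizer would not require touching backprop at all.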
The sign function's fatal flaw
This is also the moment where the perceptron's original activation function, sign, permanently disqualifies itself. Its derivative is 0 everywhere except at x = 0, where it's undefined. Plug a derivative of 0 into a chain-rule product, and the entire product becomes 0. No gradient signal reaches any weight upstream. Gradient descent has nothing to work with. This is precisely why the field moved to activations with a usable gradient: first the smooth sigmoid, later ReLU, which has a kink at 0 but a well-defined, non-zero derivative over the whole positive half of its range.
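A quick numerical illustration of that dead product. The neighbouring local derivatives (0.9 and 1.3) are arbitrary made-up values; what matters is that a single zero factor from sign wipes out the whole chain.

```python
def sign(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Away from 0, sign is flat, so its slope is 0 everywhere we look
slopes = [numeric_derivative(sign, x) for x in (-2.0, -0.5, 0.7, 3.0)]

# One zero factor is enough to kill the chain-rule product
local_derivatives = [0.9, numeric_derivative(sign, 0.7), 1.3]
product = 1.0
for d in local_derivatives:
    product *= d

print("slopes of sign:", slopes)
print("chain-rule product:", product)
```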
The problem hiding inside the chain rule: vanishing and exploding gradients
Here's the uncomfortable part. Backpropagation computes a weight's gradient as a product of many local derivatives — one per layer standing between that weight and the loss. If those local derivatives are consistently less than 1, the product shrinks exponentially with depth. If they're consistently greater than 1, it explodes exponentially. Depth, the thing that makes networks powerful, is also the thing that makes this problem worse the deeper you go.
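This is exactly why sigmoid struggles in deep networks. Its derivative tops out at 0.25 and collapses toward zero as inputs move away from the origin. Stack even a modest number of sigmoid layers, and you're multiplying several numbers that are each at most 0.25: ten layers in, the activation derivatives alone contribute a factor of at most 0.25¹⁰, roughly one millionth. Unless the weights happen to be large enough to compensate, the gradient reaching early layers vanishes almost immediately.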
ReLU's derivative, by contrast, is exactly 1 for any active (positive) unit. Multiplying by 1 doesn't shrink anything. This single property — not saturating on the positive side — is a large part of why ReLU became the default and why genuinely deep networks became trainable in the first place.
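The exponential effect is easy to tabulate. This sketch deliberately simplifies: it treats the gradient scale as one constant local derivative raised to the depth and ignores the weight matrices, which also enter the real product. The factor 1.5 is an arbitrary stand-in for "consistently greater than 1".

```python
def gradient_scale(local_derivative, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_derivative ** depth

for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  "
          f"sigmoid best case (0.25): {gradient_scale(0.25, depth):.3e}  "
          f"active ReLU (1.0): {gradient_scale(1.0, depth):.1f}  "
          f"factor 1.5: {gradient_scale(1.5, depth):.3e}")
```

Even in sigmoid's best case, where every unit sits exactly at the peak of its derivative, ten layers scale the gradient down to about one millionth. The ReLU column stays at 1 for units that are active, and the 1.5 column shows the mirror-image explosion.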
In practice, vanishing gradients look like a loss curve that drops for a few iterations and then goes nearly flat, even though nothing has technically broken — no NaNs, no divergence, just early layers that have effectively stopped learning. Exploding gradients look like the opposite: wild swings in the loss, sometimes an increase in loss despite gradient descent trying to minimize it, and eventually numerical garbage. Both are tangled up with the learning rate η — too high, and you're closer to exploding; too low, and you get something that looks a lot like vanishing, just from a different cause. Getting η right is the difference between a loss curve that decreases smoothly and one that either stalls or spirals.
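The learning-rate half of that story shows up even on a one-parameter problem. This sketch runs gradient descent on a toy loss f(w) = w², which is an assumption chosen for illustration and has nothing to do with any particular network. The three values of eta are likewise arbitrary picks for "reasonable", "too high", and "too low".

```python
def run_gd(eta, w0=1.0, steps=20):
    """Gradient descent on f(w) = w^2; returns the loss before each step and at the end."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        grad = 2 * w        # derivative of w^2
        w -= eta * grad     # the update rule: w <- w - eta * gradient
    losses.append(w * w)
    return losses

for eta in (0.1, 1.1, 0.0001):
    losses = run_gd(eta)
    print(f"eta={eta}: start={losses[0]:.4f}  end={losses[-1]:.4e}")
```

With eta = 0.1 the loss shrinks steadily. With eta = 1.1 every step overshoots the minimum by more than it started with, so the loss grows without bound. With eta = 0.0001 the loss barely moves in 20 steps, which on a plot is hard to tell apart from a vanishing gradient.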
Why this matters before anything else
Everything modern — batch normalization, residual connections, better initialization schemes, adaptive optimizers — exists because someone ran into one of these exact problems: a linear collapse, a dead gradient, a vanishing signal, an unstable step size. None of it is solving a new problem. It's all solving this problem, the one sitting quietly inside a network as simple as three inputs and two hidden neurons.
Which is really the whole point of starting here. The perceptron's straight line and the modern deep network's messy, high-dimensional decision surface are running on the same underlying machinery: weighted sums, non-linearities, and a chain rule quietly multiplying its way backward through every layer. Understand that machinery at this scale, and the more sophisticated architectures stop looking like magic — they start looking like reasonable answers to problems you already know exist.
Next up: loss functions and optimization — what actually happens after backpropagation hands you a gradient.