<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harshil Rami</title>
    <description>The latest articles on DEV Community by Harshil Rami (@harshil_rami_8533a7388ef7).</description>
    <link>https://dev.to/harshil_rami_8533a7388ef7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892935%2Fa5861571-b59d-4821-9c9a-87be902476c2.png</url>
      <title>DEV Community: Harshil Rami</title>
      <link>https://dev.to/harshil_rami_8533a7388ef7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harshil_rami_8533a7388ef7"/>
    <language>en</language>
    <item>
      <title>The Paper That Taught Neural Networks to Learn Backwards</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Sat, 09 May 2026 10:36:44 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/the-paper-that-taught-neural-networks-to-learn-backwards-4kmn</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/the-paper-that-taught-neural-networks-to-learn-backwards-4kmn</guid>
      <description>&lt;p&gt;Last week I read the 1958 Rosenblatt paper. The one that started everything. The Perceptron, the first learning machine, the idea that memory lives in connections and not addresses. And at the very end of that paper, almost as a footnote, Rosenblatt wrote that "some system, more advanced in principle than the perceptron, seems to be required."&lt;/p&gt;

&lt;p&gt;This is that system.&lt;/p&gt;

&lt;p&gt;Rumelhart, Hinton, and Williams. 1986. Four pages in Nature. And somewhere in those four pages, the answer to the question Rosenblatt had left open for twenty-eight years.&lt;/p&gt;




&lt;h2&gt;What Was Actually Broken&lt;/h2&gt;

&lt;p&gt;To understand why this paper matters, you need to understand what the Perceptron could not do. And the answer is not XOR, even though that is what everyone says. XOR is a symptom. The real problem was deeper.&lt;/p&gt;

&lt;p&gt;In Rosenblatt's Perceptron, the feature detectors, the A-units sitting between the input and the output, were connected randomly and then frozen. Nobody trained them. The only thing that learned was the final layer, the response units deciding which class to pick. Which means the Perceptron could only learn to combine features that already existed in the raw input. It could not discover new features on its own.&lt;/p&gt;

&lt;p&gt;Think about what this means. If you want the Perceptron to recognise faces, someone has to hand-engineer the features: edges, curves, symmetry. The network cannot figure out that symmetry is a useful thing to look for. It can only learn to weight the features you already gave it.&lt;/p&gt;

&lt;p&gt;Rumelhart, Hinton, and Williams called these "hidden units." Units that sit between the input and the output, that are not told what to do, that have to figure out on their own what they should represent. Training hidden units is the problem. And the reason nobody had solved it is that you cannot directly measure how wrong a hidden unit is. You only know how wrong the output is. The error signal exists at the top. The hidden units are somewhere in the middle.&lt;/p&gt;

&lt;p&gt;Backpropagation is the answer to one question: how do you take an error signal that exists only at the output, and use it to train units that you cannot directly observe?&lt;/p&gt;




&lt;h2&gt;The Architecture: A Factory With Floors&lt;/h2&gt;

&lt;p&gt;Before the math, the picture.&lt;/p&gt;

&lt;p&gt;Imagine a factory with multiple floors. Raw materials come in at the ground floor. Each floor transforms what it receives and passes the result upward. The finished product comes out at the top floor. You have a quality inspector at the top who compares the finished product to what was ordered and measures how wrong it is.&lt;/p&gt;

&lt;p&gt;Now the problem: the quality inspector knows the final product is wrong. But which floor made the mistake? And by exactly how much did each floor contribute to the wrongness?&lt;/p&gt;

&lt;p&gt;This is the problem backpropagation solves. And it solves it in two passes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;forward pass&lt;/strong&gt; is the factory running normally. Input comes in at the bottom, each layer transforms it, output comes out at the top. Simple.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;backward pass&lt;/strong&gt; is blame flowing in reverse. The top floor gets blamed proportionally to how wrong the output was. It then passes blame to the floor below it, saying: here is how much each of you contributed to my mistake. Each floor does the same, passing blame further down, all the way to the bottom. Every connection in the network receives a precise measure of how much it contributed to the final error. Then every weight adjusts itself to be slightly less wrong next time.&lt;/p&gt;

&lt;p&gt;That is backpropagation. The forward pass computes what the network thinks. The backward pass computes who is responsible for being wrong.&lt;/p&gt;




&lt;h2&gt;The Forward Pass: Equations 1 and 2&lt;/h2&gt;

&lt;p&gt;Let us get precise. In the paper, Rumelhart et al. define the network with two equations that govern how information flows forward, layer by layer.&lt;/p&gt;

&lt;p&gt;The total input to unit j is a weighted sum of everything coming from below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;x(j) = Σ y(i) · w(ij)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where y(i) is the output of unit i in the layer below, and w(ij) is the weight of the connection from i to j. Every unit below j sends its output upward, multiplied by the strength of its connection. Unit j adds them all up.&lt;/p&gt;

&lt;p&gt;Then the output of unit j is a non-linear function of that total input:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;y(j) = 1 / (1 + e^(-x(j)))&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the sigmoid function. It takes any real number and squashes it into a value between 0 and 1. The reason you need this non-linearity is critical: without it, stacking multiple layers is mathematically equivalent to having just one layer. The non-linearity is what allows each layer to do something genuinely different from the layer below it.&lt;/p&gt;

&lt;p&gt;The paper says any bounded differentiable function will work here. In 1986, they used sigmoid. Today we use ReLU, which is simply max(0, x). Simpler to compute, faster to train, does not suffer from the vanishing gradient problem that sigmoid creates in deep networks. But the principle is identical: a non-linearity that lets each layer transform its input in a way that the next layer cannot simply undo.&lt;/p&gt;
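&lt;p&gt;As a quick sanity check, here is a minimal sketch of both activations (plain NumPy, my own toy values, not from the paper):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(x):
    # Equation 2: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # The modern alternative: max(0, x)
    return np.maximum(0.0, x)

print(sigmoid(0.0))           # 0.5, the midpoint of the squash
print(relu(-3.0), relu(3.0))  # 0.0 3.0: negatives clipped, positives pass through
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;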




&lt;h2&gt;The Error: Equation 3&lt;/h2&gt;

&lt;p&gt;After the forward pass, you have the network's output. You compare it to what the output should have been. The error is defined as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E = (1/2) Σ (y(j) - d(j))²&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where y(j) is what the network produced for output unit j, and d(j) is what it should have produced. The (1/2) out front is a convenience: when you differentiate this, the 2 from the square and the 1/2 cancel cleanly.&lt;/p&gt;

&lt;p&gt;This is squared error, summed over the output units (mean squared error is the same quantity averaged). Today we often use cross-entropy loss for classification problems because it has better gradient properties near zero and one. But the backpropagation algorithm works identically regardless of which loss function you choose. The only requirement is that the loss is differentiable with respect to the output.&lt;/p&gt;

&lt;p&gt;The goal is to minimise E. And the way to minimise E is gradient descent: find which direction in weight space makes E increase, and move in the opposite direction. To do this, you need the partial derivative of E with respect to every single weight in the network. This is what the backward pass computes.&lt;/p&gt;
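&lt;p&gt;Equation 3 is two lines of code. A sketch with made-up outputs and targets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Equation 3 on a tiny output vector (invented numbers)
y = [0.9, 0.2]   # what the network produced
d = [0.0, 0.0]   # what it should have produced

E = 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))
print(round(E, 3))   # 0.5 * (0.81 + 0.04) = 0.425
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;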




&lt;h2&gt;The Backward Pass: Equations 4 Through 7&lt;/h2&gt;

&lt;p&gt;This is the heart of the paper. And it is where most explanations lose people, because they introduce notation and concepts at the same time, leaving you holding two unfamiliar things at once. Let me do this differently.&lt;/p&gt;

&lt;p&gt;I am going to carry one concrete example through every step. The network predicted 0.9. The correct answer was 0. The network is very wrong. We need to figure out which weights caused this, and by exactly how much. That is the only question the backward pass is answering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before anything else: what is the chain rule doing here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The chain rule from calculus says: if A affects B, and B affects C, then you can figure out how A affects C by multiplying how A affects B by how B affects C.&lt;/p&gt;

&lt;p&gt;In our network, a weight affects a unit's total input. That total input affects the unit's output through the sigmoid. That output flows upward and eventually affects the final error. The chain rule lets us connect the first link (weight) to the last link (error) by multiplying all the steps in between. This is all we are doing, four times in a row, layer by layer, going backwards.&lt;/p&gt;
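&lt;p&gt;You can verify that the chain rule really does this multiplication correctly. A small numerical check (the weight and activation values here are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One link in the chain: y = sigmoid(w * a), where a is the activation below
w, a = 0.7, 0.6
x = w * a
y = sigmoid(x)

# Chain rule: dy/dw = (dy/dx) * (dx/dw) = y(1 - y) * a
analytic = y * (1.0 - y) * a

# Independent check: nudge w a tiny amount and measure the change directly
h = 1e-6
numeric = (sigmoid((w + h) * a) - sigmoid((w - h) * a)) / (2.0 * h)

print(abs(analytic - numeric))   # tiny: the chain rule and the nudge agree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;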

&lt;p&gt;&lt;strong&gt;Step 1: How wrong is the output, and in which direction?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start at the top. The output unit produced 0.9. The correct answer was 0. The first thing we need is a precise measure of how the error changes when the output changes.&lt;/p&gt;

&lt;p&gt;The answer is simply: prediction minus target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂E / ∂y = y - d&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our example: 0.9 minus 0 equals 0.9. Large and positive. This tells us: if the output goes up even slightly, the error gets worse. The network needs to push this output down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: How much did the total input to that unit contribute to the error?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The output of a unit is the sigmoid of its total input. So before we can ask "which weights caused this," we need to go one step back: how sensitive is the error to the total input arriving at the output unit?&lt;/p&gt;

&lt;p&gt;This is where the chain rule enters. The error depends on the output, and the output depends on the total input. So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂E / ∂x = (∂E / ∂y) · (∂y / ∂x)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second term, how does the sigmoid output change when its input changes, has a beautiful closed form. The derivative of the sigmoid is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂y / ∂x = y · (1 - y)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our example the output was 0.9, so this is 0.9 times 0.1 which equals 0.09. Putting it together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂E / ∂x = 0.9 × 0.09 = 0.081&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This combined number is what Rumelhart et al. call the error signal for this unit. It captures both how wrong the output was and how sharply the sigmoid was responding at that point.&lt;/p&gt;

&lt;p&gt;One thing worth pausing on. Notice what happens when the sigmoid output is very close to 0 or very close to 1. The term y · (1 - y) becomes very small. A 0.99 output gives 0.99 times 0.01 which is 0.0099. This means the error signal almost vanishes when units are saturated near the extremes of the sigmoid. Blame barely reaches the weights below. This is the vanishing gradient problem, and it is why deep networks trained with sigmoid struggled for decades until ReLU replaced it. ReLU does not saturate: its derivative is simply 1 for any positive input. The blame flows through cleanly.&lt;/p&gt;
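&lt;p&gt;The numbers from steps 1 and 2, as a few lines of Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The worked example: the network said 0.9, the answer was 0
y, d = 0.9, 0.0
dE_dy = y - d             # step 1: 0.9, push the output down
dy_dx = y * (1.0 - y)     # sigmoid derivative: 0.9 * 0.1 = 0.09
delta = dE_dy * dy_dx     # the error signal: 0.081
print(round(delta, 3))    # 0.081

# Saturation: a unit stuck near 1 passes almost no blame downward
y_sat = 0.99
print(round(y_sat * (1.0 - y_sat), 4))   # 0.0099
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;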

&lt;p&gt;&lt;strong&gt;Step 3: Which weights caused the error, and by how much?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we have the error signal at the output unit. We need to turn this into a gradient for each weight connecting into that unit.&lt;/p&gt;

&lt;p&gt;The total input to a unit is just a weighted sum of the outputs from the layer below. So if we change one weight by a tiny amount, the total input changes by exactly the output of the unit that weight came from. Nothing more. This means:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂E / ∂w = (∂E / ∂x) · y&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In concrete terms: if the error signal at the output unit is 0.081, and the hidden unit connected to it had an output of 0.6, then the gradient for that weight is 0.081 times 0.6, which is 0.0486, roughly 0.049. This weight needs to decrease by an amount proportional to that gradient to reduce the error.&lt;/p&gt;

&lt;p&gt;The elegance here is striking. The gradient for any weight is just two numbers multiplied together: what the layer above says about the error, and what the layer below actually produced. That is it.&lt;/p&gt;
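&lt;p&gt;Continuing the same worked numbers, step 3 really is a single multiplication:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;delta = 0.081     # error signal at the output unit, from step 2
y_below = 0.6     # what the hidden unit below actually produced
grad_w = delta * y_below
print(round(grad_w, 3))   # 0.049: this weight gets nudged downward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;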

&lt;p&gt;&lt;strong&gt;Step 4: Passing blame to the hidden layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You now have gradients for all the weights in the final layer. But you need to do the same thing for every hidden layer below. To do that, you need to know the error signal for each hidden unit, the same way you computed it for the output unit in step 1.&lt;/p&gt;

&lt;p&gt;A hidden unit connects upward to multiple output units. Each of those connections contributed to the final error. So the total blame assigned to a hidden unit is the sum of blame from every unit it connects to above it, weighted by the strength of each connection:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;∂E / ∂y(hidden) = Σ (∂E / ∂x(above)) · w&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a hidden unit connects to three output units with weights 0.5, 0.3, and 0.8, and those output units have error signals 0.081, 0.04, and 0.02, then the blame reaching the hidden unit is:&lt;/p&gt;

&lt;p&gt;(0.081 × 0.5) + (0.04 × 0.3) + (0.02 × 0.8) = 0.0405 + 0.012 + 0.016 = 0.0685, roughly 0.069&lt;/p&gt;

&lt;p&gt;And this is the key to the whole algorithm. Once you have this blame signal at the hidden unit, you repeat steps 2 and 3 exactly as before: multiply by the sigmoid derivative to get the error signal, then multiply by the outputs from the layer below to get weight gradients.&lt;/p&gt;

&lt;p&gt;The algorithm cascades backwards through the entire network, one layer at a time. Every weight in the network receives a precise gradient telling it exactly how much it contributed to the final error. This is why it is called backpropagation.&lt;/p&gt;
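&lt;p&gt;Here is the whole cascade in one place: a from-scratch sketch of a toy two-layer network, with the four steps labelled. The sizes, data, and learning rate are my own invented values, not the paper's setup, and this uses vanilla gradient descent with no momentum:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy network: 4 inputs -&gt; 2 hidden units -&gt; 1 output
W1 = rng.normal(0.0, 1.0, (4, 2))
W2 = rng.normal(0.0, 1.0, (2, 1))

X = np.array([[0.0, 1.0, 1.0, 0.0]])   # one training case
d = np.array([[0.0]])                   # the correct answer

for _ in range(5000):
    # Forward pass: equations 1 and 2, layer by layer
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)

    # Backward pass: the four steps from the text
    delta_out = (y - d) * y * (1.0 - y)   # steps 1 and 2: error signal at the output
    grad_W2 = h.T @ delta_out             # step 3: signal above times activation below
    blame_h = delta_out @ W2.T            # step 4: blame flows down, weighted by w
    delta_h = blame_h * h * (1.0 - h)     # step 2 again, now at the hidden layer
    grad_W1 = X.T @ delta_h               # step 3 again

    # Equation 8: move against the gradient
    W1 -= 1.0 * grad_W1
    W2 -= 1.0 * grad_W2

print(float(y[0, 0]))   # close to 0: the network has learned to say "not this"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;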




&lt;h2&gt;The Weight Update: Equations 8 and 9&lt;/h2&gt;

&lt;p&gt;Once you have the gradients, you update the weights. The simplest version is vanilla gradient descent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Δw = -ε · (∂E / ∂w)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where ε is the learning rate, a small number like 0.01 that controls how large each step is. The negative sign is because you want to move in the direction that reduces E, which is the opposite of the gradient direction.&lt;/p&gt;

&lt;p&gt;But Rumelhart et al. immediately pointed out the problem with vanilla gradient descent: it is slow, and it oscillates. If the gradient keeps pointing in the same direction, you want to pick up speed. If it keeps reversing, you want to slow down.&lt;/p&gt;

&lt;p&gt;Their solution adds momentum:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Δw(t) = -ε · (∂E / ∂w(t)) + α · Δw(t-1)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second term carries forward a fraction α of the previous weight update. If the gradient has been pointing in the same direction for several steps, the weight updates accumulate and the learning accelerates. If the gradient reverses, the momentum term and the gradient term partially cancel, dampening the oscillation.&lt;/p&gt;

&lt;p&gt;This is the ancestor of every modern optimizer. Adam, RMSprop, AdaGrad, they are all elaborate answers to the same question Rumelhart and Hinton were asking in 1986: how do you make gradient descent faster and more stable? Momentum was the first answer. It is still inside every optimizer you use today.&lt;/p&gt;
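&lt;p&gt;The "pick up speed" behaviour is easy to see in isolation. A sketch with a constant downhill gradient (values are invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;eps, alpha, g = 0.01, 0.9, 1.0   # learning rate, momentum, a gradient that never changes

v = 0.0
for _ in range(100):
    v = -eps * g + alpha * v      # equation 9, with the gradient term held fixed

# The step size settles at eps * g / (1 - alpha): ten times a single vanilla step
print(round(-v, 4))   # 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;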




&lt;h2&gt;What the Experiments Actually Proved&lt;/h2&gt;

&lt;p&gt;This is the part I want to spend time on, because most blogs about backprop skip it entirely and just talk about the algorithm. The experiments in this paper are where the real idea lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 1: Symmetry detection.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The task: given a binary input vector, is it symmetrical about its midpoint? This cannot be solved without hidden units, and the paper proves it. You cannot add up individual inputs and get symmetry. You need to compare positions across the midpoint, which requires an intermediate representation.&lt;/p&gt;

&lt;p&gt;Rumelhart et al. trained a network with two hidden units on this task. After training, they inspected what the hidden units had learned. The weights were symmetric about the midpoint with opposite signs. For a symmetric input, both hidden units received zero net input and stayed off, causing the output unit (which had a positive bias) to fire, signalling symmetry. For any non-symmetric input, at least one hidden unit fired and suppressed the output.&lt;/p&gt;

&lt;p&gt;The network was never told "look for symmetry across the midpoint." It discovered that structure because discovering it was the most efficient way to reduce the error. That is representation learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 2: The family tree.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one is remarkable. Two isomorphic family trees, one English and one Italian, 104 relationships expressed as triples: (person, relationship, person). The network was trained on 100 of the 104 triples and asked to complete the third term when given the first two.&lt;/p&gt;

&lt;p&gt;The network had to compress information about 24 people and 12 relationships into 6 hidden units per group. After training, Rumelhart et al. examined what those 6 units had learned to encode. Unit 1 distinguished English from Italian people. Unit 2 encoded which generation a person belonged to. Unit 6 encoded which branch of the family they came from.&lt;/p&gt;

&lt;p&gt;Nobody told the network that nationality, generation, and family branch were relevant features. The network discovered them because they were the most compact and useful way to represent the information it needed to produce correct outputs. And because it had learned these structural features, it generalised correctly to the 4 triples it had never seen during training.&lt;/p&gt;

&lt;p&gt;This is the proof of concept for representation learning. It is the same thing that happens when a modern neural network learns that edges are useful in layer 1, textures in layer 2, and object parts in layer 3. Nobody tells it to look for those things. It discovers them because they are the most efficient path to reducing error.&lt;/p&gt;




&lt;h2&gt;What They Admitted the Paper Cannot Do&lt;/h2&gt;

&lt;p&gt;Here is the thing about this paper that I respect enormously. The last two paragraphs contain a list of everything they knew was wrong.&lt;/p&gt;

&lt;p&gt;The error surface may contain local minima. Gradient descent is not guaranteed to find the global minimum. They say that in practice the network rarely gets badly stuck, and that adding more connections than strictly necessary tends to smooth out the landscape. This is still true. Overparameterised networks generalise better than theory predicts. Nobody fully understands why.&lt;/p&gt;

&lt;p&gt;The learning procedure is not biologically plausible. Backpropagation requires exact weight symmetry between the forward and backward passes, precise storage of intermediate activations, and a global error signal propagated backwards. Brains do none of these things in any obvious way. Rumelhart and Hinton knew this in 1986 and said so plainly. They hoped studying backpropagation would eventually lead to something more biologically realistic. Forty years later, that search is still ongoing.&lt;/p&gt;

&lt;p&gt;And the scaling problem. The paper does not dwell on it, but the footnote is there. The procedure works on the tasks they tried. It is not obvious how it scales. The answer turned out to be: it scales remarkably well with more data and more compute, in ways nobody in 1986 could have anticipated. But the doubt was honest and appropriate.&lt;/p&gt;




&lt;h2&gt;The Modern Version: What loss.backward() Is Actually Doing&lt;/h2&gt;

&lt;p&gt;When you write this in PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="c1"&gt;# A simple two-layer network, the same structure Rumelhart et al. used
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# weights w_ij, equation 1
&lt;/span&gt;    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sigmoid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;                          &lt;span class="c1"&gt;# equation 2 - they used sigmoid
&lt;/span&gt;    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sigmoid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Equation 3: mean squared error
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MSELoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Equation 9: SGD with momentum, directly from the paper
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Training loop
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Forward pass: equations 1 and 2 running layer by layer
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Equation 3: compute error
&lt;/span&gt;    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Backward pass: equations 4 through 7, computed automatically
&lt;/span&gt;    &lt;span class="c1"&gt;# This is the entire backward pass of the paper, done in one line
&lt;/span&gt;    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Equation 9: update weights using accumulated gradients
&lt;/span&gt;    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every single line maps directly to something in the 1986 paper. &lt;code&gt;nn.Linear&lt;/code&gt; is equation 1. &lt;code&gt;nn.Sigmoid&lt;/code&gt; is equation 2. &lt;code&gt;nn.MSELoss&lt;/code&gt; is equation 3. &lt;code&gt;loss.backward()&lt;/code&gt; is equations 4 through 7, the entire backward pass, computed automatically using autograd. &lt;code&gt;optimizer.step()&lt;/code&gt; with momentum is equation 9.&lt;/p&gt;

&lt;p&gt;The difference is that today we would swap sigmoid for ReLU (faster, avoids vanishing gradients), MSELoss for CrossEntropyLoss for classification tasks (better gradient signal), and SGD with momentum for Adam (adaptive learning rates per parameter). But the underlying algorithm, forward pass, compute error, backward pass, update weights, is identical to what Rumelhart, Hinton, and Williams described in four pages in 1986.&lt;/p&gt;

&lt;p&gt;One more thing worth noting. The paper says they accumulated gradients over all training cases before updating weights. Today we call that batch gradient descent. The alternative they mention, updating after every case, is stochastic gradient descent. Modern training uses mini-batch gradient descent, a middle ground they could not fully explore in 1986 because they did not have the compute. The insight was already there. The scale was not.&lt;/p&gt;




&lt;h2&gt;What This Paper Actually Did&lt;/h2&gt;

&lt;p&gt;The Perceptron told us that connections can learn. But it could only learn to weight features you already knew about.&lt;/p&gt;

&lt;p&gt;Backpropagation told us that hidden units can learn too. And when hidden units learn, they discover features nobody designed. Nationality. Generation. Symmetry. Edges. Textures. Syntax. Meaning. Every layer of every deep network you have ever used is discovering features that its designers did not explicitly specify, using a procedure that is recognisably the same four equations from this paper.&lt;/p&gt;

&lt;p&gt;The Perceptron answered Rosenblatt's question about how memory influences behavior. Backpropagation answered the harder question: how does a system figure out what to remember in the first place?&lt;/p&gt;

&lt;p&gt;Everything since, every architecture, every optimizer, every training trick, is an elaboration on the answer Rumelhart, Hinton, and Williams put into four pages in Nature in October 1986.&lt;/p&gt;

&lt;p&gt;They did not know it would scale to a trillion parameters. They did not know it would learn to write and reason and generate images. They just knew it could learn that Colin's aunt is Sophia, without being told to look for aunts.&lt;/p&gt;

&lt;p&gt;That was enough to change everything.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of my series reading foundational AI papers from scratch. Next up: Gradient-Based Learning Applied to Document Recognition, LeCun et al., 1998. The paper that took what Rumelhart and Hinton built and asked: what if the architecture itself could encode structure?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>backpropagation</category>
    </item>
    <item>
      <title>Blog 3: Adaptive Learning Rate Methods (Part 1)</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Thu, 23 Apr 2026 18:21:51 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/blog-3-adaptive-learning-rate-methods-part-1-2jc2</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/blog-3-adaptive-learning-rate-methods-part-1-2jc2</guid>
      <description>&lt;h3&gt;&lt;em&gt;When one learning rate isn't enough — per-parameter scaling and the decay problem&lt;/em&gt;&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Momentum gave the optimizer a memory across time.&lt;br&gt;
Adaptive methods give it a memory across parameters.&lt;br&gt;
Both are necessary. Neither is sufficient alone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The problem with a single learning rate&lt;/h2&gt;

&lt;p&gt;Every optimizer we've covered so far shares one architectural assumption: a single scalar &lt;em&gt;η&lt;/em&gt; governs every parameter in the model.&lt;/p&gt;

&lt;p&gt;This seems reasonable until you think about what parameters actually experience during training.&lt;/p&gt;

&lt;p&gt;Consider a language model's embedding table. It contains one vector per token in the vocabulary — perhaps 50,000 vectors, each of dimension 512. In any given mini-batch of 64 sequences, you might see 2,000 unique tokens. The remaining 48,000 tokens receive &lt;strong&gt;zero gradient&lt;/strong&gt; for that entire step. When they do appear, their gradient signals are sparse, noisy, and infrequent.&lt;/p&gt;

&lt;p&gt;Now consider the final projection layer — a dense 512×50,000 matrix. Every forward pass touches every row. Gradients are dense, consistent, and arrive every single step.&lt;/p&gt;

&lt;p&gt;Both layers are updated with the same &lt;em&gt;η&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is the problem. Parameters that receive rare, informative signal should move aggressively when that signal arrives — their updates are precious. Parameters that receive dense, consistent signal should move conservatively — there's no rush, and overshooting is costly.&lt;/p&gt;

&lt;p&gt;A global learning rate can't serve both regimes. Set it high and the dense layers oscillate. Set it low and the sparse layers barely move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AdaGrad's answer&lt;/strong&gt;: let each parameter maintain its own effective learning rate, derived automatically from its gradient history.&lt;/p&gt;

&lt;h2&gt;AdaGrad: accumulate, then scale&lt;/h2&gt;

&lt;p&gt;AdaGrad (Adaptive Gradient Algorithm — Duchi, Hazan &amp;amp; Singer, 2011) introduces a per-parameter accumulator &lt;em&gt;Gₜ&lt;/em&gt; that tracks the sum of squared gradients seen so far.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gₜ = Gₜ₋₁ + (∇L(θₜ))²        # element-wise square, accumulated
θₜ₊₁ = θₜ − (η / √(Gₜ + ε)) · ∇L(θₜ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;ε&lt;/em&gt; (epsilon) is a small constant (typically 1e-8) added for numerical stability — it prevents division by zero when a parameter has received no gradient.&lt;/p&gt;

&lt;p&gt;The update is &lt;strong&gt;element-wise&lt;/strong&gt;: each parameter &lt;em&gt;θᵢ&lt;/em&gt; has its own &lt;em&gt;Gᵢ&lt;/em&gt;, and divides its gradient by &lt;em&gt;√Gᵢ&lt;/em&gt;. No parameter borrows from another's accumulator.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this achieves
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sparse parameters&lt;/strong&gt; — infrequent updates, small accumulated &lt;em&gt;G&lt;/em&gt;. Dividing by &lt;em&gt;√G&lt;/em&gt; yields a large effective learning rate. When signal finally arrives, AdaGrad takes a proportionally large step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense parameters&lt;/strong&gt; — frequent updates, large accumulated &lt;em&gt;G&lt;/em&gt;. Dividing by &lt;em&gt;√G&lt;/em&gt; yields a small effective learning rate. Updates are conservative; the optimizer doesn't overfit to any single gradient.&lt;/p&gt;

&lt;p&gt;AdaGrad is essentially performing an automatic, online normalization of learning rates. You no longer need to hand-tune separate learning rates for different parameter groups.&lt;/p&gt;
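&lt;p&gt;To see the per-parameter scaling concretely, here is a minimal sketch (illustrative Python, not AdaGrad's reference implementation) with one dense parameter that fires every step and one sparse parameter that fires every tenth step:&lt;/p&gt;

```python
import math

# Two parameters sharing one global lr, each with an AdaGrad-style accumulator.
# "dense" receives a unit gradient every step; "sparse" only every 10th step.
lr, eps = 0.1, 1e-8
G = {"dense": 0.0, "sparse": 0.0}

for t in range(1, 101):
    grads = {"dense": 1.0, "sparse": 1.0 if t % 10 == 0 else 0.0}
    for name, g in grads.items():
        G[name] += g ** 2

# Effective learning rate lr / (sqrt(G) + eps) after 100 steps.
eff = {name: lr / (math.sqrt(G[name]) + eps) for name in G}
print(eff)  # dense ≈ 0.01, sparse ≈ 0.0316
```

&lt;p&gt;After 100 steps the rarely-updated parameter keeps an effective learning rate more than 3× larger — exactly the behavior the embedding-table example needs, with no hand-tuned parameter groups.&lt;/p&gt;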

&lt;h3&gt;
  
  
  Where AdaGrad shines
&lt;/h3&gt;

&lt;p&gt;This mechanism is particularly powerful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sparse features&lt;/strong&gt; in NLP (word embeddings, bag-of-words models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation systems&lt;/strong&gt; with millions of item/user embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convex optimization&lt;/strong&gt; problems where the accumulated curvature information is always relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In fact, for convex problems, AdaGrad has provably optimal regret bounds in the online learning setting. This theoretical grounding is part of why it was so influential.&lt;/p&gt;

&lt;h3&gt;
  
  
  The learning rate as a function of time
&lt;/h3&gt;

&lt;p&gt;It helps to think of AdaGrad as replacing the fixed learning rate &lt;em&gt;η&lt;/em&gt; with an effective per-parameter rate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ηᵢ_eff(t) = η / √(Σₛ₌₁ᵗ gᵢ,ₛ²)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a &lt;strong&gt;monotonically decreasing&lt;/strong&gt; function of time. Every nonzero gradient, however small, increases &lt;em&gt;Gᵢ&lt;/em&gt;, which decreases &lt;em&gt;ηᵢ_eff&lt;/em&gt;. The effective learning rate can only go down. It never recovers.&lt;/p&gt;

&lt;p&gt;That property is exactly AdaGrad's fatal flaw.&lt;/p&gt;

&lt;h2&gt;
  
  
  AdaGrad's fatal flaw: the dying learning rate
&lt;/h2&gt;

&lt;p&gt;In practice, AdaGrad's accumulator grows without bound. After enough training steps, &lt;em&gt;Gₜ&lt;/em&gt; becomes so large that &lt;em&gt;η/√Gₜ&lt;/em&gt; shrinks toward zero for every parameter — including the ones that still need to learn.&lt;/p&gt;

&lt;p&gt;This is not a tuning problem. It is structural. The accumulator is a sum, not an average, and sums only increase.&lt;/p&gt;
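&lt;p&gt;The collapse is easy to watch directly (an illustrative sketch, assuming a constant unit gradient, so &lt;em&gt;Gₜ = t&lt;/em&gt; and the step is &lt;em&gt;η/√t&lt;/em&gt;):&lt;/p&gt;

```python
import math

# AdaGrad under a constant unit gradient: G_t = t, so the step is lr / sqrt(t).
lr = 0.1
G = 0.0
steps = []
for t in range(10000):
    g = 1.0
    G += g ** 2
    steps.append(lr / math.sqrt(G))

print(steps[0], steps[99], steps[9999])  # 0.1 → 0.01 → 0.001
```

&lt;p&gt;By step 10,000 the step size has shrunk 100×, even though the gradient never changed — the model may still be far from converged.&lt;/p&gt;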

&lt;p&gt;The consequences are severe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training effectively &lt;strong&gt;stops&lt;/strong&gt; after a certain number of steps, even if the model hasn't converged.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;non-stationary&lt;/strong&gt; loss surfaces (which all deep learning surfaces are — the loss landscape shifts as other parameters update), old gradient information from early training becomes misleading. Parameters that moved quickly early on get permanently penalized for it, even if the relevant gradients now point in a completely different direction.&lt;/li&gt;
&lt;li&gt;The "optimal for convex problems" guarantee doesn't transfer to deep learning, where the curvature landscape changes throughout training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AdaGrad is excellent for shallow, convex problems with sparse features. For deep networks trained over many epochs, it usually fails in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is exactly the problem RMSProp was designed to fix.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RMSProp: forget the distant past
&lt;/h2&gt;

&lt;p&gt;RMSProp (Root Mean Square Propagation — Hinton, 2012, unpublished but widely cited from his Coursera lectures) makes one targeted change to AdaGrad: &lt;strong&gt;replace the cumulative sum with an exponentially weighted moving average.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Eₜ = ρ · Eₜ₋₁ + (1 − ρ) · (∇L(θₜ))²     # running average of squared gradients
θₜ₊₁ = θₜ − (η / √(Eₜ + ε)) · ∇L(θₜ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;ρ&lt;/em&gt; (rho) is the decay coefficient — typically 0.9 or 0.99. It controls how much weight is given to recent gradients vs. historical ones.&lt;/p&gt;

&lt;p&gt;This is the exponential moving average (EMA) pattern — the same mechanism used in momentum. The difference here is that it's applied to &lt;strong&gt;squared gradients&lt;/strong&gt; rather than the gradients themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this works
&lt;/h3&gt;

&lt;p&gt;With exponential decay, the effective window of "remembered" gradient history is roughly &lt;em&gt;1/(1−ρ)&lt;/em&gt; steps. At &lt;em&gt;ρ = 0.9&lt;/em&gt;, that's ~10 recent steps. At &lt;em&gt;ρ = 0.99&lt;/em&gt;, ~100 steps.&lt;/p&gt;

&lt;p&gt;Old gradients from many hundreds of steps ago contribute essentially nothing to &lt;em&gt;Eₜ&lt;/em&gt;. The accumulator doesn't grow forever — it's a sliding window. When the loss landscape shifts (as it always does in deep learning), the old curvature information fades out, and new curvature information takes over.&lt;/p&gt;
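&lt;p&gt;The window claim can be checked numerically (a quick sketch, not from the original derivation): the weight &lt;em&gt;Eₜ&lt;/em&gt; assigns to a squared gradient from &lt;em&gt;s&lt;/em&gt; steps ago is &lt;em&gt;(1−ρ)·ρˢ&lt;/em&gt;.&lt;/p&gt;

```python
# Contribution of a squared gradient from s steps ago to E_t: (1 - rho) * rho**s.
rho = 0.9
weights = [(1 - rho) * rho ** s for s in range(1000)]

# Mass carried by the most recent 1/(1 - rho) = 10 steps.
recent = sum(weights[:10])
print(round(recent, 3))  # 0.651
```

&lt;p&gt;Roughly 65% of the total weight sits on the last 10 steps, while a gradient 100 steps old carries a weight of about 2.7 × 10⁻⁶ — effectively forgotten.&lt;/p&gt;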

&lt;p&gt;This is the key property AdaGrad lacked: &lt;strong&gt;adaptability over time, not just over parameters.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The geometry being captured
&lt;/h3&gt;

&lt;p&gt;The denominator &lt;em&gt;√Eₜ&lt;/em&gt; is an estimate of the &lt;strong&gt;root mean square&lt;/strong&gt; of recent gradients for each parameter — hence the name. It approximates the local gradient scale without integrating all history.&lt;/p&gt;

&lt;p&gt;Parameters that have recently received large gradients get scaled down. Parameters that have recently been quiet get scaled up. The key word is &lt;em&gt;recently&lt;/em&gt; — unlike AdaGrad, this estimate is always locally relevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  AdaGrad vs. RMSProp: direct comparison
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AdaGrad:  Gₜ = Gₜ₋₁ + g²          → cumulative sum, unbounded growth
RMSProp:  Eₜ = ρ·Eₜ₋₁ + (1−ρ)·g²  → exponential decay, bounded estimate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structural difference is one line. The practical difference is enormous.&lt;/p&gt;
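&lt;p&gt;You can watch the one-line difference play out by feeding both accumulators the same constant gradient stream (illustrative Python; &lt;em&gt;g = 0.5&lt;/em&gt; is an arbitrary assumed value):&lt;/p&gt;

```python
# Same squared-gradient stream into both accumulators.
g, rho = 0.5, 0.9
G = E = 0.0
for _ in range(1000):
    G = G + g ** 2                    # AdaGrad: cumulative sum
    E = rho * E + (1 - rho) * g ** 2  # RMSProp: exponential moving average

print(G)  # 250.0 — grows linearly, forever
print(E)  # ≈ 0.25 — converged to g², the local gradient scale
```

&lt;p&gt;AdaGrad's denominator keeps inflating even though nothing about the problem has changed; RMSProp's settles at the right scale and stays there.&lt;/p&gt;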

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;AdaGrad&lt;/th&gt;
&lt;th&gt;RMSProp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accumulator&lt;/td&gt;
&lt;td&gt;Cumulative sum&lt;/td&gt;
&lt;td&gt;Exponential moving average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning rate over time&lt;/td&gt;
&lt;td&gt;Monotonically decreasing&lt;/td&gt;
&lt;td&gt;Stationary (fluctuates around a value)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory of past gradients&lt;/td&gt;
&lt;td&gt;All of history, equally weighted&lt;/td&gt;
&lt;td&gt;Recent history, exponentially weighted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suitable for&lt;/td&gt;
&lt;td&gt;Convex, sparse, shallow models&lt;/td&gt;
&lt;td&gt;Non-convex, deep networks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long training runs&lt;/td&gt;
&lt;td&gt;Fails (LR collapses)&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-stationary landscapes&lt;/td&gt;
&lt;td&gt;Fails (stale curvature)&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Key hyperparameter&lt;/td&gt;
&lt;td&gt;&lt;em&gt;η&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;η&lt;/em&gt;, &lt;em&gt;ρ&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RMSProp became the default adaptive optimizer for deep learning before Adam existed, and many practitioners still reach for it when they want something lighter than Adam.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AdaGrad
&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;

&lt;span class="c1"&gt;# RMSProp
&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AdaGrad
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adagrad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RMSProp
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RMSprop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical hyperparameter guidance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AdaGrad: &lt;em&gt;η = 0.01&lt;/em&gt; is a reasonable start. There's little else to tune — the accumulator handles the rest.&lt;/li&gt;
&lt;li&gt;RMSProp: &lt;em&gt;ρ = 0.9&lt;/em&gt; is standard. Use &lt;em&gt;ρ = 0.99&lt;/em&gt; for more stable but slower-adapting effective LR. &lt;em&gt;η&lt;/em&gt; typically needs to be smaller than you'd use with SGD — start at &lt;em&gt;1e-3&lt;/em&gt; or &lt;em&gt;1e-4&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Both methods are sensitive to &lt;em&gt;ε&lt;/em&gt;. In low-precision training (float16, bfloat16), the default &lt;em&gt;1e-8&lt;/em&gt; can cause numerical issues. Try &lt;em&gt;1e-6&lt;/em&gt; or &lt;em&gt;1e-5&lt;/em&gt; if you see NaN losses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's still missing
&lt;/h2&gt;

&lt;p&gt;RMSProp solves the dying learning rate. It gives us per-parameter adaptivity that stays relevant throughout training. It's a genuinely good optimizer.&lt;/p&gt;

&lt;p&gt;But look at the update rule again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θₜ₊₁ = θₜ − (η / √(Eₜ + ε)) · ∇L(θₜ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's no velocity term. No momentum. The update is still reactive — it responds to the &lt;em&gt;current&lt;/em&gt; gradient, scaled by recent gradient history, but it doesn't accumulate direction the way momentum does.&lt;/p&gt;

&lt;p&gt;We now have two powerful, independent ideas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Momentum&lt;/strong&gt; — smooth out gradient noise by accumulating velocity over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive scaling&lt;/strong&gt; — normalize updates by per-parameter gradient magnitude&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither knows about the other. Both help convergence. The obvious question is: what happens if you combine them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's Blog 4.&lt;/strong&gt; Adam takes exactly this step — it maintains both a first-moment estimate (momentum over gradients) and a second-moment estimate (RMSProp-style scaling), applies bias corrections to both, and produces one of the most robust general-purpose optimizers ever designed. Adamax, Nadam, and AMSGrad follow as targeted improvements on specific failure modes Adam introduces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things to hold onto
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;AdaGrad&lt;/th&gt;
&lt;th&gt;RMSProp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core idea&lt;/td&gt;
&lt;td&gt;Accumulate squared gradients&lt;/td&gt;
&lt;td&gt;EMA of squared gradients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What it fixes&lt;/td&gt;
&lt;td&gt;Uniform LR across parameters&lt;/td&gt;
&lt;td&gt;Uniform LR across parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What it breaks&lt;/td&gt;
&lt;td&gt;Long-run training (LR → 0)&lt;/td&gt;
&lt;td&gt;Nothing fatal — but no momentum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best use case&lt;/td&gt;
&lt;td&gt;Convex, sparse, shallow&lt;/td&gt;
&lt;td&gt;Deep networks, non-stationary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Blog 4&lt;/strong&gt;, the two threads of this series converge. Adam combines the first moment (gradient direction, momentum-style) with the second moment (gradient magnitude, RMSProp-style) into a single update, then applies bias corrections to prevent cold-start distortion. Adamax extends the second moment to the L∞ norm. Nadam swaps standard momentum for Nesterov lookahead. AMSGrad addresses Adam's theoretical non-convergence issue.&lt;/p&gt;

&lt;p&gt;Each one is a targeted answer to a specific flaw in Adam's design. By the end of Blog 4, you'll have the full picture of why AdamW — not Adam — is the default for modern LLM training.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Blog 3 of an 8-part series on optimization algorithms for deep learning.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>gradientdescent</category>
    </item>
    <item>
      <title>Blog 2: Momentum-Based Optimizers</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:03:58 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/blog-2-momentum-based-optimizers-2h98</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/blog-2-momentum-based-optimizers-2h98</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;Giving the optimizer a memory — and teaching it to look before it leaps&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;SGD knows where it is. Momentum knows where it's been. Nesterov knows where it's going.&lt;br&gt;
That single sentence is the entire story of this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The ravine problem, visualized
&lt;/h2&gt;

&lt;p&gt;Let's be concrete about what SGD's zig-zagging actually looks like.&lt;/p&gt;

&lt;p&gt;Suppose your loss surface is an elongated valley — steep walls on the left and right, a gentle slope running toward the minimum far ahead. This is the classic &lt;strong&gt;ravine geometry&lt;/strong&gt;, and it's not an academic toy. It shows up naturally when your features have very different scales, when layers have different learning dynamics, or when you're in the early phases of training a deep network.&lt;/p&gt;

&lt;p&gt;SGD's update on this surface behaves as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Along the &lt;strong&gt;steep axis&lt;/strong&gt; (across the ravine): the gradient is large. SGD takes a big step, overshoots, corrects back, overshoots again. The updates oscillate violently.&lt;/li&gt;
&lt;li&gt;Along the &lt;strong&gt;shallow axis&lt;/strong&gt; (down the ravine, toward the minimum): the gradient is small. SGD takes tiny, tentative steps. Progress is glacial.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a path that looks like a snake moving sideways more than forward. You can shrink the learning rate to tame the oscillations on the steep axis, but that makes the shallow axis even slower. There's no single learning rate that handles both directions well.&lt;/p&gt;

&lt;p&gt;This is the fundamental limitation SGD leaves on the table, and it's what momentum is designed to fix.&lt;/p&gt;
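&lt;p&gt;You can reproduce the zig-zag with a ten-line sketch (my own toy, not from the post) on the quadratic ravine &lt;em&gt;L(x, y) = ½(x² + 50y²)&lt;/em&gt;:&lt;/p&gt;

```python
# Steep axis: y (curvature 50). Shallow axis: x (curvature 1). Minimum at origin.
def grad(x, y):
    return x, 50.0 * y

lr = 0.035          # 50 * lr > 1, so the y step overshoots every time
x, y = 10.0, 1.0
ys = []
for _ in range(50):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
    ys.append(y)

print(ys[:3])  # sign flips every step: oscillation across the ravine
print(x)       # still ~1.7 after 50 steps: glacial progress along it
```

&lt;p&gt;Lowering &lt;em&gt;lr&lt;/em&gt; below 0.02 kills the oscillation but makes &lt;em&gt;x&lt;/em&gt;'s crawl even slower — the no-win trade-off described above.&lt;/p&gt;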

&lt;h2&gt;
  
  
  Momentum: giving the optimizer velocity
&lt;/h2&gt;

&lt;p&gt;The core idea behind momentum is borrowed directly from physics. Instead of updating parameters based on the current gradient alone, we maintain a &lt;strong&gt;velocity vector&lt;/strong&gt; &lt;em&gt;v&lt;/em&gt; that accumulates a running average of past gradients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vₜ = β · vₜ₋₁ + η · ∇L(θₜ)
θₜ₊₁ = θₜ − vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;β&lt;/em&gt; (beta) is the momentum coefficient — typically 0.9. Some formulations absorb the learning rate differently; the semantics are equivalent.&lt;/p&gt;

&lt;p&gt;Let's unpack what this actually does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the oscillating (steep) axis:&lt;/strong&gt;&lt;br&gt;
Gradients alternate sign — positive, negative, positive, negative. The velocity accumulates these with the decay factor &lt;em&gt;β&lt;/em&gt;. Because they cancel each other out over time, the velocity along this axis stays small. Oscillations are damped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the consistent (shallow) axis:&lt;/strong&gt;&lt;br&gt;
Gradients consistently point in the same direction — always slightly downhill. The velocity accumulates these constructively. Each step adds to the previous. The effective step size grows, and the optimizer accelerates.&lt;/p&gt;

&lt;p&gt;This is the momentum effect: &lt;strong&gt;dampening in oscillating directions, acceleration in consistent ones.&lt;/strong&gt; The optimizer builds up speed where the surface is consistently sloped and brakes naturally where the surface is ambiguous.&lt;/p&gt;
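&lt;p&gt;A scalar sketch (assumed values, not from the post) makes both effects visible — feed the velocity an alternating gradient and a small constant one:&lt;/p&gt;

```python
beta, lr = 0.9, 0.1
v_alt = v_const = 0.0
for t in range(100):
    g_alt = 1.0 if t % 2 == 0 else -1.0  # steep axis: sign flips every step
    g_const = 0.1                        # shallow axis: small but consistent
    v_alt = beta * v_alt + lr * g_alt
    v_const = beta * v_const + lr * g_const

print(abs(v_alt))   # ≈ 0.053: cancellation keeps the velocity small
print(v_const)      # ≈ 0.100: 10x the single step lr * g_const = 0.01
```

&lt;p&gt;The alternating stream settles near &lt;em&gt;η/(1+β)&lt;/em&gt;; the consistent one grows toward &lt;em&gt;η·g/(1−β)&lt;/em&gt; — damping and acceleration from the same update rule.&lt;/p&gt;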
&lt;h3&gt;
  
  
  Effective learning rate under momentum
&lt;/h3&gt;

&lt;p&gt;With &lt;em&gt;β&lt;/em&gt; = 0.9, a gradient that persists in the same direction for many steps produces a velocity roughly &lt;em&gt;1/(1−β) = 10×&lt;/em&gt; the size of a single SGD step &lt;em&gt;η·g&lt;/em&gt;. This is why momentum often requires a slightly lower learning rate than vanilla SGD — the effective step size is larger.&lt;/p&gt;

&lt;p&gt;More precisely, if the gradient is constant at &lt;em&gt;g&lt;/em&gt;, the velocity converges to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v* = η · g / (1 − β)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So momentum scales up the effective learning rate by &lt;em&gt;1/(1−β)&lt;/em&gt;. Set &lt;em&gt;β = 0.9&lt;/em&gt; → 10× amplification. Set &lt;em&gt;β = 0.99&lt;/em&gt; → 100×. This amplification is the source of both momentum's power and its instability if misconfigured.&lt;/p&gt;
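&lt;p&gt;The amplification factor is easy to verify (a quick sketch, not from the post): run the velocity recursion to steady state under a constant gradient and compare against the closed form.&lt;/p&gt;

```python
# v converges to eta * g / (1 - beta); express the result as the amplification
# over a single SGD step eta * g.
eta, g = 0.01, 1.0
ratios = {}
for beta in (0.9, 0.99):
    v = 0.0
    for _ in range(2000):
        v = beta * v + eta * g
    ratios[beta] = v / (eta * g)

print(ratios)  # ≈ {0.9: 10.0, 0.99: 100.0}
```

&lt;p&gt;Note how long &lt;em&gt;β = 0.99&lt;/em&gt; takes to build up its full 100× speed — high momentum is powerful but slow to spin up and slow to cancel.&lt;/p&gt;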

&lt;h3&gt;
  
  
  The ball analogy
&lt;/h3&gt;

&lt;p&gt;Momentum is often described as a ball rolling down a hill. The ball doesn't instantly respond to every local slope — it carries inertia. A small bump doesn't stop it; it takes a sustained uphill slope to decelerate it meaningfully.&lt;/p&gt;

&lt;p&gt;This analogy is accurate and useful, and it also points at a real failure mode: a ball with enough inertia can overshoot and roll up the other side of a valley. If &lt;em&gt;β&lt;/em&gt; is too large, the optimizer can oscillate around minima rather than settling into them, or sail through a narrow good basin entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nesterov Accelerated Gradient: look before you leap
&lt;/h2&gt;

&lt;p&gt;Momentum is good. Nesterov Accelerated Gradient (NAG), proposed by Yurii Nesterov in 1983, makes one surgical improvement that turns out to matter significantly in practice.&lt;/p&gt;

&lt;p&gt;The problem with standard momentum: &lt;strong&gt;the gradient is evaluated at the current position, before applying the velocity.&lt;/strong&gt; By the time you apply the update, you're no longer at that position — you've already moved. You're using stale directional information.&lt;/p&gt;

&lt;p&gt;NAG fixes this with a simple conceptual shift: &lt;strong&gt;evaluate the gradient at the position you're about to arrive at&lt;/strong&gt;, then correct from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ_lookahead = θₜ − β · vₜ₋₁          # project forward
vₜ = β · vₜ₋₁ + η · ∇L(θ_lookahead)  # gradient at projected position
θₜ₊₁ = θₜ − vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The momentum step projects you forward to where you'll be &lt;em&gt;before&lt;/em&gt; the gradient correction. Then you evaluate the gradient there. This means the correction accounts for the momentum-driven position, not the pre-momentum position.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this helps
&lt;/h3&gt;

&lt;p&gt;Think of it this way. Standard momentum is like running toward a wall and only noticing the wall &lt;em&gt;after&lt;/em&gt; you've taken your full step. Nesterov is like looking ahead as you run and starting to slow down &lt;em&gt;before&lt;/em&gt; you hit the wall.&lt;/p&gt;

&lt;p&gt;In regions where the momentum is carrying you toward a steep uphill, NAG detects that uphill slope earlier and applies a corrective force sooner. The update is more anticipatory than reactive.&lt;/p&gt;

&lt;p&gt;In practice, NAG converges faster than standard momentum on convex problems — Nesterov's original theoretical analysis showed an &lt;em&gt;O(1/k²)&lt;/em&gt; convergence rate versus plain gradient descent's &lt;em&gt;O(1/k)&lt;/em&gt;, a meaningful gap. For non-convex deep learning loss surfaces, the improvement is empirical rather than provably guaranteed, but it's consistently observed.&lt;/p&gt;

&lt;h3&gt;
  
  
  NAG in the equivalent update form
&lt;/h3&gt;

&lt;p&gt;The two-equation NAG formulation above has an equivalent single-equation form that's more commonly implemented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vₜ = β · vₜ₋₁ + ∇L(θₜ − β · vₜ₋₁)
θₜ₊₁ = θₜ − η · vₜ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are equivalent; the second form makes it clearer that the only change from standard momentum is &lt;em&gt;where the gradient is evaluated&lt;/em&gt;.&lt;/p&gt;
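&lt;p&gt;The equivalence can be checked numerically on a toy quadratic (an illustrative sketch, assuming the raw-velocity form's lookahead is scaled by &lt;em&gt;η&lt;/em&gt;, so the two velocities relate by &lt;em&gt;v_A = η · v_B&lt;/em&gt;):&lt;/p&gt;

```python
# Form A: velocity in parameter units (carries eta). Form B: velocity in
# raw-gradient units, with eta applied in the theta update. On
# L(x) = 0.5 * x**2, both trace the same trajectory.
def grad(x):
    return x

eta, beta = 0.1, 0.9
xa = xb = 5.0
va = vb = 0.0
for _ in range(100):
    va = beta * va + eta * grad(xa - beta * va)  # form A
    xa = xa - va
    vb = beta * vb + grad(xb - eta * beta * vb)  # form B
    xb = xb - eta * vb

print(abs(xa - xb))  # ~0: identical trajectories
```

&lt;p&gt;The two parameter sequences agree to floating-point precision; only the bookkeeping of where &lt;em&gt;η&lt;/em&gt; is applied differs.&lt;/p&gt;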

&lt;h2&gt;
  
  
  Side-by-side comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Gradient Descent&lt;/th&gt;
&lt;th&gt;SGD&lt;/th&gt;
&lt;th&gt;Momentum&lt;/th&gt;
&lt;th&gt;NAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gradient source&lt;/td&gt;
&lt;td&gt;Full dataset&lt;/td&gt;
&lt;td&gt;Single sample&lt;/td&gt;
&lt;td&gt;Mini-batch&lt;/td&gt;
&lt;td&gt;Mini-batch (at projected pos.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Velocity &lt;em&gt;vₜ&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Velocity &lt;em&gt;vₜ&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oscillation handling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Dampens via averaging&lt;/td&gt;
&lt;td&gt;Dampens + anticipates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convergence rate (convex)&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/√k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;O(1/k²)&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical &lt;em&gt;β&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.9–0.99&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The convergence rate column is worth reading carefully. SGD's &lt;em&gt;O(1/√k)&lt;/em&gt; is actually &lt;em&gt;worse&lt;/em&gt; than GD's &lt;em&gt;O(1/k)&lt;/em&gt; — the variance of stochastic gradients costs you a square root. Momentum restores GD-level rates. Nesterov goes further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation and practical notes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Standard Momentum
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;span class="c1"&gt;# Nesterov Accelerated Gradient
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataloader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Evaluate gradient at lookahead position
&lt;/span&gt;    &lt;span class="n"&gt;theta_lookahead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
    &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta_lookahead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
    &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PyTorch, both are available via &lt;code&gt;torch.optim.SGD&lt;/code&gt; with &lt;code&gt;momentum&lt;/code&gt; and &lt;code&gt;nesterov&lt;/code&gt; flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Momentum
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# NAG
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical hyperparameter guidance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;β = 0.9&lt;/em&gt; is the standard default. It works well across a wide range of architectures.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;β = 0.99&lt;/em&gt; gives stronger smoothing but risks slower response to genuine direction changes and can overshoot narrow minima.&lt;/li&gt;
&lt;li&gt;When switching from plain SGD to momentum, divide the learning rate by roughly &lt;em&gt;1/(1−β)&lt;/em&gt;, since the accumulated velocity amplifies the effective step by that factor: if &lt;em&gt;η = 0.1&lt;/em&gt; worked for SGD, try &lt;em&gt;η = 0.01&lt;/em&gt; with &lt;em&gt;β = 0.9&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;NAG is usually preferable to plain momentum at the same computational cost. Default to it.&lt;/li&gt;
&lt;/ul&gt;
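&lt;p&gt;That &lt;em&gt;1/(1−β)&lt;/em&gt; factor is easy to verify numerically. A minimal sketch (the constants are illustrative, not taken from any library):&lt;/p&gt;

```python
# Toy check of the effective-step claim: with a constant gradient g,
# v_t = beta * v_{t-1} + lr * g converges to the fixed point lr * g / (1 - beta).
lr, beta, g = 0.01, 0.9, 1.0

v = 0.0
for _ in range(200):                   # plenty of steps for v to settle
    v = beta * v + lr * g

effective_step = lr * g / (1 - beta)   # = 0.1, a 10x amplification of lr
print(v, effective_step)
```

&lt;p&gt;With a constant gradient the velocity settles at &lt;em&gt;η·g/(1−β)&lt;/em&gt;, which is exactly why the learning rate needs to come down when momentum is switched on.&lt;/p&gt;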

&lt;h2&gt;
  
  
  Where momentum still falls short
&lt;/h2&gt;

&lt;p&gt;Momentum is a major step forward. But it inherits one fundamental limitation from SGD: &lt;strong&gt;a single global learning rate for all parameters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your model's parameters live in very different regimes. The embedding layer of a language model sees only a handful of tokens in each batch — its effective gradient is sparse and noisy. The final linear layer sees a dense, consistent gradient every step. Both are updated with the same &lt;em&gt;η&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is deeply suboptimal. Sparse parameters should move faster when they do receive signal. Dense parameters can afford more conservative updates to avoid oscillation.&lt;/p&gt;
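&lt;p&gt;The imbalance is easy to see with two made-up parameters sharing one global learning rate, one receiving signal every step and one receiving it once every ten steps:&lt;/p&gt;

```python
# Two parameters, one global learning rate. "dense" receives a gradient of 1.0
# every step; "sparse" receives the same gradient only once every 10 steps.
eta = 0.1
dense, sparse = 0.0, 0.0
for step in range(1000):
    dense -= eta * 1.0            # dense signal: every step
    if step % 10 == 0:
        sparse -= eta * 1.0       # sparse signal: 1 step in 10

print(dense, sparse)              # sparse has moved 10x less under the same eta
```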

&lt;p&gt;Momentum has no mechanism to learn this. It smooths over time but not over parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's exactly the problem Blog 3 solves.&lt;/strong&gt; AdaGrad will introduce per-parameter learning rates, scaling each update by the history of that parameter's gradient magnitude. RMSProp will fix AdaGrad's long-term decay problem. And the combination of per-parameter scaling with momentum will eventually give us Adam.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things to hold onto
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Velocity accumulation&lt;/td&gt;
&lt;td&gt;Past gradients persist via &lt;em&gt;β&lt;/em&gt; decay&lt;/td&gt;
&lt;td&gt;Accelerates in consistent directions, dampens oscillations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lookahead gradient (NAG)&lt;/td&gt;
&lt;td&gt;Gradient evaluated at projected position&lt;/td&gt;
&lt;td&gt;Earlier correction, better convergence rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective LR scaling&lt;/td&gt;
&lt;td&gt;Velocity → &lt;em&gt;η/(1−β)&lt;/em&gt; effective step&lt;/td&gt;
&lt;td&gt;Must tune LR down when adding momentum&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Blog 3&lt;/strong&gt;, we shift from &lt;em&gt;when&lt;/em&gt; the optimizer has seen a gradient to &lt;em&gt;which parameters&lt;/em&gt; have seen large gradients. AdaGrad introduces a per-parameter accumulator — parameters that receive frequent, large gradients get smaller effective learning rates; sparse parameters get larger ones. RMSProp then fixes AdaGrad's fatal flaw: the accumulator grows without bound, eventually shrinking all learning rates to zero.&lt;/p&gt;

&lt;p&gt;If momentum gave the optimizer a memory across time, adaptive methods give it a memory across parameters. Both are necessary. Neither is sufficient alone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Blog 2 of an 8-part series on optimization algorithms for deep learning.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>gradientdescent</category>
    </item>
    <item>
      <title>Blog 1: Foundations of Gradient Descent</title>
      <dc:creator>Harshil Rami</dc:creator>
      <pubDate>Wed, 22 Apr 2026 17:54:43 +0000</pubDate>
      <link>https://dev.to/harshil_rami_8533a7388ef7/blog-1-foundations-of-gradient-descent-p6n</link>
      <guid>https://dev.to/harshil_rami_8533a7388ef7/blog-1-foundations-of-gradient-descent-p6n</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;How neural networks learn — and why the obvious approach breaks immediately&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Every optimizer you'll ever use — Adam, AdamW, Lion, LAMB — is an answer to a problem that gradient descent creates. To understand why those answers exist, you need to feel the problem first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The loss surface is a landscape you can't see
&lt;/h2&gt;

&lt;p&gt;Imagine you're blindfolded, standing somewhere on a hilly terrain. Your only tool is a stick: you can poke the ground around you and measure the slope. Your goal is to reach the lowest valley.&lt;/p&gt;

&lt;p&gt;That's optimization.&lt;/p&gt;

&lt;p&gt;The "terrain" is your loss surface — a high-dimensional function &lt;em&gt;L(θ)&lt;/em&gt; mapping your model's parameters &lt;em&gt;θ&lt;/em&gt; to a scalar loss. You can't see the whole surface. You can only evaluate the gradient at your current position and take a step.&lt;/p&gt;

&lt;p&gt;The question every optimizer tries to answer: &lt;strong&gt;which direction, and how far?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent: the right idea, the wrong scale
&lt;/h2&gt;

&lt;p&gt;Gradient Descent (GD) is the foundational answer. Given a loss function &lt;em&gt;L(θ)&lt;/em&gt;, we compute the gradient over the entire dataset and update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · ∇L(θ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;η&lt;/em&gt; (eta) is the learning rate — a scalar controlling step size.&lt;/p&gt;

&lt;p&gt;The update rule is clean. The gradient points in the direction of steepest ascent, so we move opposite to it. Mathematically, this is the direction of maximum local decrease in loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intuition is correct. The implementation is catastrophically expensive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To compute &lt;code&gt;∇L(θ)&lt;/code&gt; exactly, you need to pass your entire dataset through the model. For ImageNet-scale data or modern LLM corpora, this means billions of examples per update. You'd compute one parameter update per epoch. On a 100M parameter model. That's not slow — it's dead on arrival.&lt;/p&gt;

&lt;p&gt;GD also has a subtle failure mode people underappreciate: when your dataset has redundant structure (and it almost always does), successive gradients are nearly identical. You're paying full-dataset cost for almost zero additional information after the first few passes.&lt;/p&gt;
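&lt;p&gt;To make the rule concrete, here is a minimal full-batch GD sketch on a toy least-squares problem (the data and variable names are illustrative):&lt;/p&gt;

```python
import numpy as np

# Full-batch GD on a toy least-squares problem:
# minimize L(theta) = (1/2N) * ||X @ theta - y||^2.
# Every single update requires a pass over the entire dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta = np.zeros(2)
eta = 0.1
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(X)   # exact gradient: all 1000 rows
    theta = theta - eta * grad              # one update per full pass

print(theta)                                # converges to [2, -3]
```

&lt;p&gt;At 1000 rows this is instant. Swap in a billion rows and the cost of every single update becomes the whole problem.&lt;/p&gt;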

&lt;h2&gt;
  
  
  Stochastic Gradient Descent: embrace the noise
&lt;/h2&gt;

&lt;p&gt;The fix seems almost too simple: instead of computing the gradient over all &lt;em&gt;N&lt;/em&gt; samples, pick &lt;strong&gt;one sample at random&lt;/strong&gt; and update on that alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · ∇Lᵢ(θ)    for a randomly sampled i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Stochastic Gradient Descent (SGD). The gradient estimate is now noisy — it's a single-sample approximation of the true gradient. But that noise turns out to be a feature, not a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why noisy updates help:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escaping shallow local minima.&lt;/strong&gt; A noisy gradient doesn't always point exactly downhill. This stochasticity gives the optimizer a kind of thermal energy — it can jitter out of shallow basins that would trap a deterministic update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better generalization (empirically).&lt;/strong&gt; The noise in SGD acts as implicit regularization. Models trained with SGD often generalize better than those trained with exact gradient methods, particularly in overparameterized regimes. There's a growing body of theory around this — the "flat minima" hypothesis suggests noisy SGD preferentially finds wider, flatter basins that transfer better to test data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed per effective update.&lt;/strong&gt; One SGD step is O(1) in data cost. You can make &lt;em&gt;N&lt;/em&gt; updates in the time GD makes one, seeing every sample along the way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The cost of noise:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SGD's gradient estimate has high variance. Updates zig-zag erratically, especially in directions where the loss surface has high curvature along one axis and low curvature along another (the classic "ravine" geometry). The path to the minimum looks like a drunk person's walk rather than a confident descent.&lt;/p&gt;

&lt;p&gt;You can reduce the learning rate to smooth this out, but then you lose the speed advantage. You're always trading variance against convergence rate.&lt;/p&gt;
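&lt;p&gt;A sketch of single-sample SGD on a toy least-squares problem (illustrative data and constants) shows both sides of the trade: it converges, but along a jittery path that never fully settles:&lt;/p&gt;

```python
import numpy as np

# Single-sample SGD on a toy least-squares problem. Each update touches ONE
# row of X, so steps are cheap but the gradient estimate is noisy.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)   # a little label noise

theta = np.zeros(2)
eta = 0.05
for _ in range(5000):
    i = rng.integers(len(X))                 # one sample, drawn at random
    grad = (X[i] @ theta - y[i]) * X[i]      # single-sample gradient
    theta = theta - eta * grad

print(theta)   # near [2, -3], but jittering inside a noise floor
```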

&lt;h2&gt;
  
  
  Mini-Batch Gradient Descent: the practical compromise
&lt;/h2&gt;

&lt;p&gt;The resolution in practice is obvious in retrospect: &lt;strong&gt;use a small batch of &lt;em&gt;B&lt;/em&gt; samples&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;θ ← θ − η · (1/B) · Σᵢ∈Bₜ ∇Lᵢ(θ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;Bₜ&lt;/em&gt; is a randomly sampled mini-batch of size &lt;em&gt;B&lt;/em&gt; at step &lt;em&gt;t&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is Mini-Batch Gradient Descent (MBGD) — and when practitioners say "SGD" today, this is almost always what they mean. Typical batch sizes range from 32 to 512, though the right choice depends on your model, hardware, and regularization goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What mini-batching buys you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance reduction.&lt;/strong&gt; Averaging over &lt;em&gt;B&lt;/em&gt; samples reduces gradient variance by a factor of &lt;em&gt;B&lt;/em&gt; compared to single-sample SGD, without &lt;em&gt;B&lt;/em&gt;× the compute cost (thanks to parallelism on GPU/TPU).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware efficiency.&lt;/strong&gt; GPUs are throughput machines — they saturate at batch sizes that fully utilize memory bandwidth. A single-sample forward pass wastes most of your compute budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enough noise to generalize.&lt;/strong&gt; Mini-batch gradients are still noisy enough to provide the regularization benefits of SGD, unlike full-batch gradients.&lt;/li&gt;
&lt;/ul&gt;
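&lt;p&gt;The variance-reduction claim can be checked empirically with made-up per-sample gradients (pure illustration, not tied to any model):&lt;/p&gt;

```python
import numpy as np

# Empirical check: averaging B i.i.d. per-sample gradients cuts the variance
# of the estimate by a factor of about B. (Illustrative scalars, no model.)
rng = np.random.default_rng(0)
per_sample = rng.normal(loc=1.0, scale=2.0, size=100_000)

B = 64
batched = per_sample[: (len(per_sample) // B) * B].reshape(-1, B).mean(axis=1)

ratio = per_sample.var() / batched.var()
print(ratio)   # close to B = 64
```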

&lt;p&gt;&lt;strong&gt;The batch size isn't free:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Larger batches reduce noise, which sounds good, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They can converge to &lt;strong&gt;sharper minima&lt;/strong&gt; with worse generalization (the "large-batch training problem" — Keskar et al., 2017).&lt;/li&gt;
&lt;li&gt;They require &lt;strong&gt;proportionally larger learning rates&lt;/strong&gt; to maintain the same effective update magnitude, but scaling LR linearly with batch size breaks down at large B.&lt;/li&gt;
&lt;li&gt;Beyond a critical batch size, you're paying compute cost without improving convergence speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point becomes the central tension in Blog 6, when we look at LARS and LAMB — optimizers specifically designed to handle very large batches in distributed LLM training.&lt;/p&gt;

&lt;h2&gt;
  
  
  The update rule in full
&lt;/h2&gt;

&lt;p&gt;Here's where we stand after mini-batch SGD. The complete training loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;get_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. It's what trains ResNets, early language models, and still forms the backbone of large-scale training in certain regimes (SGD with momentum remains competitive with Adam on image classification tasks).&lt;/p&gt;

&lt;p&gt;But watch what happens on a ravine-shaped loss surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The gradient along the short axis (high curvature) is large → big oscillating steps&lt;/li&gt;
&lt;li&gt;The gradient along the long axis (low curvature, toward the minimum) is small → slow progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The optimizer zig-zags across the ravine instead of marching down it. You need a very small learning rate to prevent divergence on the steep axis, which makes the shallow axis painfully slow.&lt;/p&gt;
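&lt;p&gt;You can reproduce the zig-zag on a two-dimensional quadratic with hand-picked curvatures (illustrative constants only):&lt;/p&gt;

```python
import numpy as np

# Plain GD on an ill-conditioned quadratic L(x, y) = 0.5 * (100 * x**2 + y**2):
# curvature 100 along x (the ravine walls), curvature 1 along y (the floor).
theta = np.array([1.0, 1.0])
eta = 0.019            # must stay under 2/100 = 0.02 or the x-axis diverges
x_path = []
for _ in range(50):
    grad = np.array([100.0 * theta[0], theta[1]])
    theta = theta - eta * grad
    x_path.append(theta[0])

# x flips sign every step (factor 1 - eta*100 = -0.9): the zig-zag.
# y contracts by only 1 - eta = 0.981 per step: the slow crawl.
print(x_path[:4])
print(theta[1])
```

&lt;p&gt;The learning rate is pinned down by the steep axis, so after 50 steps the shallow axis has barely moved. That's the ravine problem in two lines of arithmetic.&lt;/p&gt;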

&lt;p&gt;&lt;strong&gt;This is exactly the problem Blog 2 solves.&lt;/strong&gt; Momentum will give the optimizer memory — a velocity vector that accumulates in persistent directions and dampens oscillations. Nesterov will take that one step further, looking ahead before committing to the update.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things to hold onto
&lt;/h2&gt;

&lt;p&gt;Before moving on to momentum, here are the three tensions that the rest of this series works to resolve:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What causes it&lt;/th&gt;
&lt;th&gt;Solved by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Slow, expensive updates&lt;/td&gt;
&lt;td&gt;Full-dataset gradient&lt;/td&gt;
&lt;td&gt;SGD / Mini-batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-variance, zig-zagging path&lt;/td&gt;
&lt;td&gt;Single/small-batch noise&lt;/td&gt;
&lt;td&gt;Momentum (Blog 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uniform learning rate for all params&lt;/td&gt;
&lt;td&gt;LR is a global scalar&lt;/td&gt;
&lt;td&gt;AdaGrad, RMSProp (Blog 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every optimizer from here on is a targeted intervention on one of these failure modes — or a combination of several at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key equations, plain-English summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Update rule&lt;/th&gt;
&lt;th&gt;One sentence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · ∇L(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact gradient, entire dataset, one step per epoch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · ∇Lᵢ(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Noisy gradient, one sample, fast but erratic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MBGD&lt;/td&gt;
&lt;td&gt;&lt;code&gt;θ ← θ − η · (1/B)·Σ∇Lᵢ(θ)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Averaged gradient, batch of B, the practical default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Blog 2&lt;/strong&gt;, we add velocity. Momentum accumulates past gradients into a running average, smoothing the zig-zagging path and accelerating convergence in consistent directions. Nesterov takes the lookahead step — evaluating the gradient at a projected future position rather than the current one.&lt;/p&gt;

&lt;p&gt;If SGD is someone walking blindfolded downhill, momentum is that same person carrying a ball that's already rolling. It takes more to change direction. That turns out to be exactly what you want.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Blog 1 of an 8-part series on optimization algorithms for deep learning. Each post covers one family of optimizers, following a problem → limitation → next solution arc.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>gradientdescent</category>
    </item>
  </channel>
</rss>
