Harshil Rami
Blog 2: Momentum-Based Optimizers

Giving the optimizer a memory — and teaching it to look before it leaps

SGD knows where it is. Momentum knows where it's been. Nesterov knows where it's going.
That single sentence is the entire story of this post.

The ravine problem, visualized

Let's be concrete about what SGD's zig-zagging actually looks like.

Suppose your loss surface is an elongated valley — steep walls on the left and right, a gentle slope running toward the minimum far ahead. This is the classic ravine geometry, and it's not an academic toy. It shows up naturally when your features have very different scales, when layers have different learning dynamics, or when you're in the early phases of training a deep network.

SGD's update on this surface behaves as follows:

  • Along the steep axis (across the ravine): the gradient is large. SGD takes a big step, overshoots, corrects back, overshoots again. The updates oscillate violently.
  • Along the shallow axis (down the ravine, toward the minimum): the gradient is small. SGD takes tiny, tentative steps. Progress is glacial.

The result is a path that looks like a snake moving sideways more than forward. You can shrink the learning rate to tame the oscillations on the steep axis, but that makes the shallow axis even slower. There's no single learning rate that handles both directions well.
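To see the zig-zag concretely, here's a minimal plain-Python sketch on a made-up quadratic ravine, f(x, y) = 50x² + 0.5y² — steep across the ravine (x), shallow along it (y). This is a toy stand-in for a loss surface, not anything from a real model:

```python
# Toy ravine: f(x, y) = 50 * x**2 + 0.5 * y**2, so grad = (100 * x, y).
# Steep across the ravine (x), shallow along it (y).

def grad(x, y):
    return 100 * x, y

x, y = 1.0, 10.0   # start on the ravine wall, far from the minimum at (0, 0)
lr = 0.019         # just under the stability limit 2/100 for the steep axis

xs = []
for _ in range(20):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
    xs.append(x)

# x flips sign on every step: violent oscillation across the ravine.
print([round(v, 3) for v in xs[:6]])
# y has barely moved toward 0: glacial progress along the ravine.
print(round(y, 3))
```

The single learning rate is trapped: any larger and x diverges, any smaller and y crawls even more slowly.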

This is the fundamental limitation of vanilla SGD, and it's exactly what momentum is designed to fix.

Momentum: giving the optimizer velocity

The core idea behind momentum is borrowed directly from physics. Instead of updating parameters based on the current gradient alone, we maintain a velocity vector v that accumulates a running average of past gradients.

vₜ = β · vₜ₋₁ + η · ∇L(θₜ)
θₜ₊₁ = θₜ − vₜ

Where β (beta) is the momentum coefficient — typically 0.9. Some formulations absorb the learning rate differently; the semantics are equivalent.

Let's unpack what this actually does.

On the oscillating (steep) axis:
Gradients alternate sign — positive, negative, positive, negative. The velocity accumulates these with the decay factor β. Because they cancel each other out over time, the velocity along this axis stays small. Oscillations are damped.

On the consistent (shallow) axis:
Gradients consistently point in the same direction — always slightly downhill. The velocity accumulates these constructively. Each step adds to the previous. The effective step size grows, and the optimizer accelerates.

This is the momentum effect: dampening in oscillating directions, acceleration in consistent ones. The optimizer builds up speed where the surface is consistently sloped and brakes naturally where the surface is ambiguous.
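Both effects show up in a few lines of plain Python on a toy quadratic ravine, f(x, y) = 50x² + 0.5y² (a hypothetical loss chosen only for its elongated geometry), comparing plain SGD against the momentum update above:

```python
# Toy ravine f(x, y) = 50 * x**2 + 0.5 * y**2, grad = (100 * x, y).
# Compare plain SGD with momentum over 100 steps from the same start.

def run(use_momentum, lr=0.005, beta=0.9, steps=100):
    x, y = 1.0, 10.0
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = 100 * x, y
        if use_momentum:
            vx = beta * vx + lr * gx   # oscillating gx terms cancel in vx
            vy = beta * vy + lr * gy   # consistent gy terms accumulate in vy
            x, y = x - vx, y - vy
        else:
            x, y = x - lr * gx, y - lr * gy
    return x, y

print(run(False))  # SGD: y is still far from 0
print(run(True))   # momentum: y is much closer to 0
```

Same learning rate, same step budget — momentum covers far more ground along the shallow axis without blowing up on the steep one.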

Effective learning rate under momentum

With β = 0.9, a gradient that persists in the same direction for many steps builds up a velocity roughly 1/(1−β) = 10× the single-step update η · g. This is why momentum often requires a lower learning rate than vanilla SGD — the effective step size is larger.

More precisely, if the gradient is constant at g, the velocity converges to:

v* = η · g / (1 − β)

So momentum scales up the effective learning rate by 1/(1−β). Set β = 0.9 → 10× amplification. Set β = 0.99 → 100×. This amplification is the source of both momentum's power and its instability if misconfigured.
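A quick numerical check of this fixed point (plain Python, with arbitrary constants picked for illustration):

```python
# Iterate v = beta * v + lr * g with a constant gradient g and compare
# against the predicted fixed point v* = lr * g / (1 - beta).
beta, lr, g = 0.9, 0.1, 1.0

v = 0.0
for _ in range(200):
    v = beta * v + lr * g

predicted = lr * g / (1 - beta)
print(v, predicted)  # both converge to 1.0
```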

The ball analogy

Momentum is often described as a ball rolling down a hill. The ball doesn't instantly respond to every local slope — it carries inertia. A small bump doesn't stop it; it takes a sustained uphill slope to decelerate it meaningfully.

The analogy is accurate and useful, and it also predicts a real failure mode: the ball can overshoot and roll up the other side of a valley. If β is too large, the optimizer oscillates around minima rather than settling into them, or sails through a narrow good basin entirely.

Nesterov Accelerated Gradient: look before you leap

Momentum is good. Nesterov Accelerated Gradient (NAG), proposed by Yurii Nesterov in 1983, makes one surgical improvement that turns out to matter significantly in practice.

The problem with standard momentum: the gradient is evaluated at the current position, before applying the velocity. By the time you apply the update, you're no longer at that position — you've already moved. You're using stale directional information.

NAG fixes this with a simple conceptual shift: evaluate the gradient at the position you're about to arrive at, then correct from there.

θ_lookahead = θₜ − β · vₜ₋₁          # project forward
vₜ = β · vₜ₋₁ + η · ∇L(θ_lookahead)  # gradient at projected position
θₜ₊₁ = θₜ − vₜ

The momentum step projects you forward to where you'll be before the gradient correction. Then you evaluate the gradient there. This means the correction accounts for the momentum-driven position, not the pre-momentum position.

Why this helps

Think of it this way. Standard momentum is like running toward a wall and only noticing the wall after you've taken your full step. Nesterov is like looking ahead as you run and starting to slow down before you hit the wall.

In regions where the momentum is carrying you toward a steep uphill, NAG detects that uphill slope earlier and applies a corrective force sooner. The update is more anticipatory than reactive.
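The anticipatory braking is easy to observe on a 1-D toy problem, f(θ) = ½θ² (a hypothetical test function, not from the post): with a high β both methods overshoot the minimum, but NAG's lookahead gradient applies the brake earlier.

```python
# 1-D toy: f(t) = 0.5 * t**2, grad(t) = t, minimum at t = 0.
# High beta makes both methods overshoot; track the deepest overshoot.
beta, lr = 0.95, 0.2

def run(nesterov):
    t, v = 5.0, 0.0
    worst = 0.0  # most negative t reached (deepest overshoot past 0)
    for _ in range(100):
        look = t - beta * v if nesterov else t
        v = beta * v + lr * look   # gradient of f at `look` is just `look`
        t = t - v
        worst = min(worst, t)
    return worst

print(run(False), run(True))  # NAG's overshoot is noticeably shallower
```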

In practice, NAG converges faster than standard momentum on convex problems — Nesterov's original analysis established an O(1/k²) convergence rate versus gradient descent's O(1/k), a meaningful gap. For non-convex deep learning loss surfaces, the improvement is empirical rather than provably guaranteed, but it's consistently observed.

NAG in the equivalent update form

The two-equation NAG formulation above has an equivalent single-equation form that's more commonly implemented:

vₜ = β · vₜ₋₁ + ∇L(θₜ − η · β · vₜ₋₁)
θₜ₊₁ = θₜ − η · vₜ

Both are equivalent; the second form makes it clearer that the only change from standard momentum is where the gradient is evaluated.
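The equivalence is straightforward to check numerically on a 1-D quadratic (a made-up test function). Note that the second form's velocity carries gradient units — the two velocity variables differ by a factor of η, which is why the learning rate appears inside its lookahead:

```python
# NAG, form 1: learning rate folded into the velocity.
# NAG, form 2: velocity in gradient units, learning rate applied last.
# f(t) = 0.5 * t**2, so grad(t) = t.
beta, lr = 0.9, 0.1

t1, v1 = 5.0, 0.0   # form 1 state
t2, v2 = 5.0, 0.0   # form 2 state (v2 == v1 / lr throughout)

for _ in range(50):
    v1 = beta * v1 + lr * (t1 - beta * v1)
    t1 = t1 - v1
    v2 = beta * v2 + (t2 - lr * beta * v2)
    t2 = t2 - lr * v2
    assert abs(t1 - t2) < 1e-9  # the parameter trajectories coincide

print(round(t1, 4))  # well on its way to the minimum at 0
```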

Side-by-side comparison

| | Gradient Descent | SGD | Momentum | NAG |
| --- | --- | --- | --- | --- |
| Gradient source | Full dataset | Single sample | Mini-batch | Mini-batch (at projected pos.) |
| Memory | None | None | Velocity vₜ | Velocity vₜ |
| Oscillation handling | None | None | Dampens via averaging | Dampens + anticipates |
| Convergence rate (convex) | O(1/k) | O(1/√k) | O(1/k) | O(1/k²) |
| Typical β | — | — | 0.9 | 0.9–0.99 |

The convergence rate column is worth reading carefully. SGD's O(1/√k) is actually worse than GD's O(1/k) — the variance of stochastic gradients costs you a square root. Momentum restores GD-level rates. Nesterov goes further.

Implementation and practical notes

# Standard momentum (pseudocode: compute_gradient returns dL/dtheta
# for the given parameters on this mini-batch)
v = 0
for batch in dataloader:
    grad = compute_gradient(loss_fn, theta, batch)
    v = beta * v + learning_rate * grad
    theta = theta - v

# Nesterov Accelerated Gradient
v = 0
for batch in dataloader:
    # Evaluate the gradient at the lookahead position
    theta_lookahead = theta - beta * v
    grad = compute_gradient(loss_fn, theta_lookahead, batch)
    v = beta * v + learning_rate * grad
    theta = theta - v

In PyTorch, both are available via torch.optim.SGD with momentum and nesterov flags:

# Momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# NAG
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

Practical hyperparameter guidance:

  • β = 0.9 is the standard default. It works well across a wide range of architectures.
  • β = 0.99 gives stronger smoothing but risks slower response to genuine direction changes and can overshoot narrow minima.
  • When switching from SGD to momentum, reduce the learning rate by roughly a factor of 1/(1−β) — so if η = 0.1 worked for SGD, try η = 0.01 with β = 0.9.
  • NAG is almost always preferable to plain momentum for the same computational cost. Default to it.
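The 1/(1−β) rescaling rule can be sanity-checked with a constant gradient — an artificial setting, but it isolates the steady state. SGD at η = 0.1 and momentum at η = 0.01, β = 0.9 end up moving the parameter by the same amount per step:

```python
# Constant gradient g = 1.0: compare per-step displacement.
g = 1.0

sgd_step = 0.1 * g          # SGD with lr = 0.1 moves 0.1 per step

lr, beta, v = 0.01, 0.9, 0.0
for _ in range(500):
    v = beta * v + lr * g   # momentum velocity saturates at lr*g/(1-beta)

print(sgd_step, v)  # both ~0.1: same steady-state step size
```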

Where momentum still falls short

Momentum is a major step forward. But it inherits one fundamental limitation from SGD: a single global learning rate for all parameters.

Your model's parameters live in very different regimes. The embedding layer of a language model sees only a handful of tokens in each batch — its effective gradient is sparse and noisy. The final linear layer sees a dense, consistent gradient every step. Both are updated with the same η.

This is deeply suboptimal. Sparse parameters should move faster when they do receive signal. Dense parameters can afford more conservative updates to avoid oscillation.

Momentum has no mechanism to learn this. It smooths over time but not over parameters.

That's exactly the problem Blog 3 solves. AdaGrad will introduce per-parameter learning rates, scaling each update by the history of that parameter's gradient magnitude. RMSProp will fix AdaGrad's long-term decay problem. And the combination of per-parameter scaling with momentum will eventually give us Adam.

Three things to hold onto

| Concept | What it means | Why it matters |
| --- | --- | --- |
| Velocity accumulation | Past gradients persist via β decay | Accelerates in consistent directions, dampens oscillations |
| Lookahead gradient (NAG) | Gradient evaluated at projected position | Earlier correction, better convergence rate |
| Effective LR scaling | Velocity → η/(1−β) effective step | Must tune LR down when adding momentum |

What's next

In Blog 3, we shift from when the optimizer has seen a gradient to which parameters have seen large gradients. AdaGrad introduces a per-parameter accumulator — parameters that receive frequent, large gradients get smaller effective learning rates; sparse parameters get larger ones. RMSProp then fixes AdaGrad's fatal flaw: the accumulator grows without bound, eventually shrinking all learning rates to zero.

If momentum gave the optimizer a memory across time, adaptive methods give it a memory across parameters. Both are necessary. Neither is sufficient alone.

This is Blog 2 of an 8-part series on optimization algorithms for deep learning.
