When one learning rate isn't enough — per-parameter scaling and the decay problem
Momentum gave the optimizer a memory across time.
Adaptive methods give it a memory across parameters.
Both are necessary. Neither is sufficient alone.
The problem with a single learning rate
Every optimizer we've covered so far shares one architectural assumption: a single scalar η governs every parameter in the model.
This seems reasonable until you think about what parameters actually experience during training.
Consider a language model's embedding table. It contains one vector per token in the vocabulary — perhaps 50,000 vectors, each of dimension 512. In any given mini-batch of 64 sequences, you might see 2,000 unique tokens. The remaining 48,000 tokens receive zero gradient for that entire step. When they do appear, their gradient signals are sparse, noisy, and infrequent.
Now consider the final projection layer — a dense 512×50,000 matrix. Every forward pass touches every row. Gradients are dense, consistent, and arrive every single step.
Both layers are updated with the same η.
This is the problem. Parameters that receive rare, informative signal should move aggressively when that signal arrives — their updates are precious. Parameters that receive dense, consistent signal should move conservatively — there's no rush, and overshooting is costly.
A global learning rate can't serve both regimes. Set it high and the dense layers oscillate. Set it low and the sparse layers barely move.
AdaGrad's answer: let each parameter maintain its own effective learning rate, derived automatically from its gradient history.
AdaGrad: accumulate, then scale
AdaGrad (Adaptive Gradient Algorithm — Duchi, Hazan & Singer, 2011) introduces a per-parameter accumulator Gₜ that tracks the sum of squared gradients seen so far.
Gₜ = Gₜ₋₁ + (∇L(θₜ))² # element-wise square, accumulated
θₜ₊₁ = θₜ − (η / √(Gₜ + ε)) · ∇L(θₜ)
Where ε (epsilon) is a small constant (typically 1e-8) added for numerical stability — it prevents division by zero when a parameter has received no gradient.
The update is element-wise: each parameter θᵢ has its own Gᵢ, and divides its gradient by √Gᵢ. No parameter borrows from another's accumulator.
What this achieves
Sparse parameters — infrequent updates, small accumulated G. Dividing by √G yields a large effective learning rate. When signal finally arrives, AdaGrad takes a proportionally large step.
Dense parameters — frequent updates, large accumulated G. Dividing by √G yields a small effective learning rate. Updates are conservative; the optimizer doesn't overfit to any single gradient.
AdaGrad is essentially performing an automatic, online normalization of learning rates. You no longer need to hand-tune separate learning rates for different parameter groups.
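The effect is easy to see with two scalar toy parameters. This is a minimal sketch, not library code (the `adagrad_step` helper is hypothetical): a "dense" parameter updated 100 times ends up with a far smaller effective learning rate than a "sparse" one receiving its first gradient.

```python
import math

def adagrad_step(theta, G, grad, lr=0.1, eps=1e-8):
    # One AdaGrad update on a single scalar parameter.
    G = G + grad ** 2
    theta = theta - lr / math.sqrt(G + eps) * grad
    return theta, G

# A "dense" parameter sees a unit gradient every step;
# a "sparse" parameter sees one unit gradient in total.
theta_d, G_d = 0.0, 0.0
for _ in range(100):
    theta_d, G_d = adagrad_step(theta_d, G_d, grad=1.0)

theta_s, G_s = adagrad_step(0.0, 0.0, grad=1.0)

eff_dense = 0.1 / math.sqrt(G_d)    # ~0.01 after 100 dense updates
eff_sparse = 0.1 / math.sqrt(G_s)   # ~0.1 on its first, rare update
```

Same global η for both, yet the sparse parameter moves ten times further per unit of gradient, exactly the behavior the embedding-table example calls for.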
Where AdaGrad shines
This mechanism is particularly powerful for:
- Sparse features in NLP (word embeddings, bag-of-words models)
- Recommendation systems with millions of item/user embeddings
- Convex optimization problems where the accumulated curvature information is always relevant
In fact, for convex problems, AdaGrad has provably optimal regret bounds in the online learning setting. This theoretical grounding is part of why it was so influential.
The learning rate as a function of time
It helps to think of AdaGrad as replacing the fixed learning rate η with an effective per-parameter rate:
ηᵢ_eff(t) = η / √(Σₛ₌₁ᵗ gᵢ,ₛ²)
This is a monotonically decreasing function of time. Every gradient update, regardless of size, increases Gᵢ, which decreases ηᵢ_eff. The learning rate can only go down. It never recovers.
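You can watch the decay directly. A small sketch under an assumed worst case, a parameter that receives a constant unit gradient every step:

```python
import math

# Effective AdaGrad rate eta / sqrt(G) for a parameter receiving
# a constant unit gradient at every step (eta = 0.1).
lr, G = 0.1, 0.0
eff = []
for t in range(10000):
    G += 1.0 ** 2            # the sum only ever grows
    eff.append(lr / math.sqrt(G))

# 0.1 after 1 step, 0.01 after 100, 0.001 after 10,000:
# the rate shrinks like 1/sqrt(t) and never recovers.
```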
That property is exactly AdaGrad's fatal flaw.
AdaGrad's fatal flaw: the dying learning rate
In practice, AdaGrad's accumulator grows without bound. After enough training steps, Gₜ becomes so large that η/√Gₜ shrinks toward zero for every parameter — including the ones that still need to learn.
This is not a tuning problem. It is structural. The accumulator is a sum, not an average, and sums only increase.
The consequences are severe:
- Training effectively stops after a certain number of steps, even if the model hasn't converged.
- On non-stationary loss surfaces (which all deep learning surfaces are — the loss landscape shifts as other parameters update), old gradient information from early training becomes misleading. Parameters that moved quickly early on get permanently penalized for it, even if the relevant gradients now point in a completely different direction.
- The "optimal for convex problems" guarantee doesn't transfer to deep learning, where the curvature landscape changes throughout training.
AdaGrad is excellent for shallow, convex problems with sparse features. For deep networks trained over many epochs, it usually fails in practice.
This is exactly the problem RMSProp was designed to fix.
RMSProp: forget the distant past
RMSProp (Root Mean Square Propagation — Hinton, 2012, unpublished but widely cited from his Coursera lectures) makes one targeted change to AdaGrad: replace the cumulative sum with an exponentially weighted moving average.
Eₜ = ρ · Eₜ₋₁ + (1 − ρ) · (∇L(θₜ))² # running average of squared gradients
θₜ₊₁ = θₜ − (η / √(Eₜ + ε)) · ∇L(θₜ)
Where ρ (rho) is the decay coefficient — typically 0.9 or 0.99. It controls how much weight is given to recent gradients vs. historical ones.
This is the exponential moving average (EMA) pattern — the same mechanism used in momentum. The difference here is that it's applied to squared gradients rather than the gradients themselves.
Why this works
With exponential decay, the effective window of "remembered" gradient history is roughly 1/(1−ρ) steps. At ρ = 0.9, that's ~10 recent steps. At ρ = 0.99, ~100 steps.
Old gradients from many hundreds of steps ago contribute essentially nothing to Eₜ. The accumulator doesn't grow forever — it's a sliding window. When the loss landscape shifts (as it always does in deep learning), the old curvature information fades out, and new curvature information takes over.
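That window claim is worth checking numerically. Unrolling the EMA, the squared gradient from k steps ago carries weight (1 − ρ)·ρᵏ:

```python
# Weight an EMA with decay rho assigns to the squared gradient
# from k steps in the past: (1 - rho) * rho**k.
rho = 0.9
w = [(1 - rho) * rho ** k for k in range(1000)]

recent_mass = sum(w[:10])    # last 1/(1-rho) = 10 steps: ~65% of total weight
stale_mass = sum(w[100:])    # anything 100+ steps old: under 0.003%
```

At ρ = 0.9, roughly two-thirds of the estimate comes from the last 10 steps, and gradients older than 100 steps are effectively gone.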
This is the key property AdaGrad lacked: adaptability over time, not just over parameters.
The geometry being captured
The denominator √Eₜ is an estimate of the root mean square of recent gradients for each parameter — hence the name. It approximates the local gradient scale without integrating all history.
Parameters that have recently received large gradients get scaled down. Parameters that have recently been quiet get scaled up. The key word is recently — unlike AdaGrad, this estimate is always locally relevant.
AdaGrad vs. RMSProp: direct comparison
AdaGrad: Gₜ = Gₜ₋₁ + g² → cumulative sum, unbounded growth
RMSProp: Eₜ = ρ·Eₜ₋₁ + (1−ρ)·g² → exponential decay, bounded estimate
The structural difference is one line. The practical difference is enormous.
| Property | AdaGrad | RMSProp |
|---|---|---|
| Accumulator | Cumulative sum | Exponential moving average |
| Learning rate over time | Monotonically decreasing | Stationary (fluctuates around a value) |
| Memory of past gradients | All of history, equally weighted | Recent history, exponentially weighted |
| Suitable for | Convex, sparse, shallow models | Non-convex, deep networks |
| Long training runs | Fails (LR collapses) | Works |
| Non-stationary landscapes | Fails (stale curvature) | Works |
| Key hyperparameter | η | η, ρ |
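The last two rows of the table can be verified in a few lines. In this toy sketch (assumed constants, no model), both accumulators see 500 large gradients followed by 500 small ones, a crude stand-in for a loss landscape that shifts partway through training:

```python
import math

lr, eps, rho = 0.01, 1e-8, 0.9
G = E = 0.0

# 500 steps of large gradients, then 500 of small ones.
for g in [10.0] * 500 + [0.1] * 500:
    G = G + g ** 2                      # AdaGrad: the sum keeps everything
    E = rho * E + (1 - rho) * g ** 2    # RMSProp: old gradients fade out

adagrad_eff = lr / math.sqrt(G + eps)   # ~4.5e-5: crushed by stale history
rmsprop_eff = lr / math.sqrt(E + eps)   # ~0.1: adapted to the new regime
```

AdaGrad's effective rate stays pinned near zero long after the large gradients stopped; RMSProp's recovers within a few dozen steps of the shift.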
RMSProp became the default adaptive optimizer for deep learning before Adam existed, and many practitioners still reach for it when they want something lighter than Adam.
Implementation
```python
# AdaGrad
G = 0
for batch in dataloader:
    grad = compute_gradient(loss_fn, model, batch)
    G = G + grad ** 2                               # accumulate squared gradients
    theta = theta - (lr / (G + eps) ** 0.5) * grad  # eps inside the sqrt, matching the formula
```

```python
# RMSProp
E = 0
for batch in dataloader:
    grad = compute_gradient(loss_fn, model, batch)
    E = rho * E + (1 - rho) * grad ** 2             # EMA of squared gradients
    theta = theta - (lr / (E + eps) ** 0.5) * grad
```
In PyTorch:
```python
# AdaGrad
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-8)

# RMSProp (PyTorch calls the decay coefficient rho "alpha")
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9, eps=1e-8)
```
Practical hyperparameter guidance:
- AdaGrad: η = 0.01 is a reasonable start. There's little else to tune — the accumulator handles the rest.
- RMSProp: ρ = 0.9 is standard. Use ρ = 0.99 for more stable but slower-adapting effective LR. η typically needs to be smaller than you'd use with SGD — start at 1e-3 or 1e-4.
- Both methods are sensitive to ε. In low-precision training (float16, bfloat16), the default 1e-8 can cause numerical issues. Try 1e-6 or 1e-5 if you see NaN losses.
What's still missing
RMSProp solves the dying learning rate. It gives us per-parameter adaptivity that stays relevant throughout training. It's a genuinely good optimizer.
But look at the update rule again:
θₜ₊₁ = θₜ − (η / √(Eₜ + ε)) · ∇L(θₜ)
There's no velocity term. No momentum. The update is still reactive — it responds to the current gradient, scaled by recent gradient history, but it doesn't accumulate direction the way momentum does.
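A hedged sketch makes the contrast concrete. Feed both EMAs pure gradient noise (alternating +1, −1, zero mean): RMSProp's scaled step stays at full size on every step, while a momentum buffer with the same decay averages the noise nearly to zero.

```python
import math

lr, rho, beta, eps = 0.01, 0.9, 0.9, 1e-8
E = v = 0.0

# Alternating +1, -1 gradients: zero-mean noise, no real signal.
for t in range(1000):
    g = 1.0 if t % 2 == 0 else -1.0
    E = rho * E + (1 - rho) * g ** 2   # RMSProp: second moment, sign-blind
    v = beta * v + (1 - beta) * g      # momentum: first moment, sign-aware

rmsprop_step = lr / math.sqrt(E + eps)   # ~0.01: a full-size step, every step
momentum_step = abs(lr * v)              # ~0.0005: the noise nearly cancels
```

Squaring destroys sign information, so RMSProp cannot tell persistent signal from oscillating noise; that is the gap momentum fills.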
We now have two powerful, independent ideas:
- Momentum — smooth out gradient noise by accumulating velocity over time
- Adaptive scaling — normalize updates by per-parameter gradient magnitude
Neither knows about the other. Both help convergence. The obvious question is: what happens if you combine them?
That's Blog 4. Adam takes exactly this step — it maintains both a first-moment estimate (momentum over gradients) and a second-moment estimate (RMSProp-style scaling), applies bias corrections to both, and produces one of the most robust general-purpose optimizers ever designed. Adamax, Nadam, and AMSGrad follow as targeted improvements on specific failure modes Adam introduces.
Three things to hold onto
| Concept | AdaGrad | RMSProp |
|---|---|---|
| Core idea | Accumulate squared gradients | EMA of squared gradients |
| What it fixes | Uniform LR across parameters | Uniform LR across parameters |
| What it breaks | Long-run training (LR → 0) | Nothing fatal — but no momentum |
| Best use case | Convex, sparse, shallow | Deep networks, non-stationary |
What's next
In Blog 4, the two threads of this series converge. Adam combines the first moment (gradient direction, momentum-style) with the second moment (gradient magnitude, RMSProp-style) into a single update, then applies bias corrections to prevent cold-start distortion. Adamax extends the second moment to the L∞ norm. Nadam swaps standard momentum for Nesterov lookahead. AMSGrad addresses Adam's theoretical non-convergence issue.
Each one is a targeted answer to a specific flaw in Adam's design. By the end of Blog 4, you'll have the full picture of why AdamW — not Adam — is the default for modern LLM training.
This is Blog 3 of an 8-part series on optimization algorithms for deep learning.