DEV Community

shangkyu shin

Posted on • Originally published at zeromathai.com

Adaptive Optimization and Learning Rate Scheduling — Why Adam Works (and Why It’s Not Enough)

Most deep learning tutorials tell you to “just use Adam.”
That works — until it doesn’t.
This post breaks down gradient noise, adaptive optimization, and why learning rate scheduling still matters for stable training.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/adaptive-optimization-en/


The Real Problem: Gradient Noise

In theory:

θ ← θ − ε ĝ   (θ: parameters, ε: learning rate, ĝ: gradient estimate)

In practice:

  • gradients are computed from mini-batches
  • updates are noisy
  • optimization becomes unstable

Deep learning training is fundamentally stochastic.
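A minimal numpy sketch of why ĝ is noisy (the toy regression problem and all variable names here are my own, not from the original post): the same parameter value produces a different gradient estimate for every mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: loss L(θ) = mean((xθ − y)²)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)

def grad(theta, idx):
    # Gradient of the squared error on the examples selected by idx
    xb, yb = x[idx], y[idx]
    return np.mean(2.0 * xb * (xb * theta - yb))

theta = 0.0
full = grad(theta, np.arange(1000))                                  # "true" gradient
mini = [grad(theta, rng.choice(1000, size=32)) for _ in range(100)]  # noisy ĝ

# The batch-32 estimates scatter around the full-batch gradient.
print(float(full), float(np.std(mini)))
```

The mini-batch estimates are unbiased but have nonzero variance, and every update step consumes one of these noisy samples.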


Momentum Solves Direction

Momentum smooths gradients:

vₜ = αvₜ₋₁ + (1 − α)ĝₜ

It acts like inertia:

  • reduces oscillation
  • stabilizes direction
  • speeds up convergence

Without momentum:

  • zig-zag updates
  • slow progress
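As a sketch of the formula above on a deliberately ill-conditioned quadratic (the bowl shape and step sizes are my own choices for illustration): plain gradient descent flips sign along the steep axis, while the momentum average damps that oscillation.

```python
import numpy as np

def momentum_step(theta, v, g, lr=0.07, alpha=0.9):
    # vt = α·v_{t−1} + (1 − α)·ĝt, exactly the EMA form above
    v = alpha * v + (1 - alpha) * g
    return theta - lr * v, v

# Elongated bowl f(x, y) = 0.5 * (x² + 25 y²): the y-axis is 25× steeper,
# so plain SGD zig-zags along y while momentum smooths the direction.
grad = lambda p: np.array([p[0], 25.0 * p[1]])

p_sgd = np.array([1.0, 1.0])
p_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    p_sgd = p_sgd - 0.07 * grad(p_sgd)           # sign of the y-step keeps flipping
    p_mom, v = momentum_step(p_mom, v, grad(p_mom))
```

Both trajectories reach the minimum here, but the momentum iterate moves along a much smoother path because successive gradients are averaged before being applied.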

Adaptive Learning Rates Solve Scale

Different parameters need different step sizes.

AdaGrad

  • shrinks learning rate over time
  • works for sparse features
  • but its squared-gradient sum only grows, so the learning rate decays too aggressively

RMSProp

  • uses moving averages
  • keeps updates responsive
  • fixes AdaGrad’s decay problem
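The difference between the two is one line: AdaGrad accumulates a sum of squared gradients, RMSProp an exponential moving average. A small sketch (hyperparameters are illustrative defaults, not from the post) makes the decay problem concrete:

```python
import numpy as np

def adagrad_update(theta, acc, g, lr=0.1, eps=1e-8):
    acc = acc + g**2                      # sum of squared gradients: only grows
    return theta - lr * g / (np.sqrt(acc) + eps), acc

def rmsprop_update(theta, acc, g, lr=0.1, rho=0.9, eps=1e-8):
    acc = rho * acc + (1 - rho) * g**2    # moving average: forgets old gradients
    return theta - lr * g / (np.sqrt(acc) + eps), acc

# Feed a constant gradient of 1.0 for 100 steps, then compare effective step sizes.
ta = tr = 0.0
acc_a = acc_r = 0.0
for _ in range(100):
    ta, acc_a = adagrad_update(ta, acc_a, 1.0)
    tr, acc_r = rmsprop_update(tr, acc_r, 1.0)

step_a = 0.1 / np.sqrt(acc_a)   # ~ lr / sqrt(t): keeps shrinking forever
step_r = 0.1 / np.sqrt(acc_r)   # levels off near lr: stays responsive
```

After 100 identical gradients, AdaGrad's effective step has shrunk to a tenth of RMSProp's, even though nothing about the loss landscape changed.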

Adam Combines Both

Adam = Momentum + RMSProp (plus a bias correction for the zero-initialized moment estimates)

That’s why it’s the default:

  • stable
  • fast
  • easy to use
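Putting the two pieces together, a single Adam step looks like this (a toy sketch with lr = 0.1 for a fast demo; the usual library default is 1e-3):

```python
import numpy as np

def adam_update(theta, m, v, g, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # momentum: EMA of gradients (direction)
    v = b2 * v + (1 - b2) * g**2       # RMSProp: EMA of squared gradients (scale)
    m_hat = m / (1 - b1**t)            # bias correction: both EMAs start at zero,
    v_hat = v / (1 - b2**t)            # so early estimates are scaled back up
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x² (gradient 2x) starting from x = 3.
x, m, v = 3.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_update(x, m, v, 2.0 * x, t)
```

Because the update divides by √v̂, step sizes are roughly lr regardless of gradient magnitude, which is a big part of why Adam feels so forgiving to tune.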

But Adam Isn’t the Full Story

In many real-world cases:

  • Adam converges faster
  • SGD generalizes better

A common strategy:

→ start with Adam

→ switch to SGD later
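A hypothetical pure-Python sketch of that two-phase recipe (in a real framework you would instead swap something like torch.optim.Adam for torch.optim.SGD partway through; the step counts and hyperparameters here are my own):

```python
import numpy as np

def train(grad_fn, theta, steps=300, switch_at=150, lr=0.05):
    m = v = mom = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        if t <= switch_at:
            # Phase 1: Adam for fast early progress
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g**2
            m_hat = m / (1 - 0.9**t)
            v_hat = v / (1 - 0.999**t)
            theta -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        else:
            # Phase 2: SGD with momentum for the final refinement
            mom = 0.9 * mom + 0.1 * g
            theta -= lr * mom
    return theta

x_final = train(lambda x: 2.0 * x, 3.0)   # minimize f(x) = x²
```

In practice the switch point and the SGD learning rate are extra hyperparameters you have to tune, which is the main cost of this strategy.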


Learning Rate Scheduling Solves Time

Even with Adam, learning rate still matters.

Because training changes over time:

  • early → explore with larger steps
  • late → refine with smaller steps

What Actually Works in Practice

  • cosine decay
  • warm-up (especially for large models)
  • step decay for simple setups
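The first two combine naturally into one schedule. A minimal sketch (the function name and defaults are mine; libraries ship equivalents such as PyTorch's CosineAnnealingLR):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=100):
    # Linear warm-up from ~0 to base_lr, then cosine decay back to 0.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warm-up keeps the very first updates small while Adam's moment estimates are still unreliable; the cosine tail then shrinks steps smoothly as training moves from exploring to refining.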

Big Picture

Optimization =

  • Momentum → direction
  • Adaptive LR → scale
  • Scheduling → time

Adaptive methods fix parameter-level issues.

Schedulers fix time-level issues.

You need both.


Practical Defaults

If you start a new project:

  • Adam + cosine decay
  • warm-up for large models

If performance matters:

  • try switching to SGD at the end

One Insight That Changes Everything

In large-scale deep learning:

learning rate schedule often matters more than optimizer choice


Question

Do you stick with Adam the whole time,

or switch to SGD for better generalization?
