Most deep learning tutorials tell you to “just use Adam.”
That works — until it doesn’t.
This post breaks down gradient noise, adaptive optimization, and why learning rate scheduling still matters for stable training.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/adaptive-optimization-en/
The Real Problem: Gradient Noise
In theory, one update rule is all you need:
θ ← θ − ε ĝ
(ε is the learning rate, ĝ the gradient estimate.)
In practice:
- gradients are computed from mini-batches
- updates are noisy
- optimization becomes unstable
Deep learning training is fundamentally stochastic.
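To make that concrete, here is a minimal sketch on a synthetic least-squares problem (the data and batch size are made up for illustration): the full-batch gradient is exact, while a mini-batch gradient of the same loss is a noisy estimate of it.

```python
import random

# Toy least-squares problem: loss(theta) = mean((theta * x_i - y_i)^2).
# Synthetic data, purely illustrative.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]

def grad(theta, idx):
    # d/dtheta of the mean squared error over the given indices
    return sum(2 * (theta * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

theta = 0.0
full = grad(theta, range(len(xs)))                     # exact gradient
mini = grad(theta, random.sample(range(len(xs)), 32))  # noisy estimate
print(full, mini)  # similar direction, different magnitude: that gap is the noise
```

Every SGD step follows one of these noisy estimates, which is exactly why the raw update rule oscillates.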
Momentum Solves Direction
Momentum smooths gradients:
vₜ = αvₜ₋₁ + (1 − α)ĝₜ
It acts like inertia:
- reduces oscillation
- stabilizes direction
- speeds up convergence
Without momentum:
- zig-zag updates
- slow progress
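The smoothing effect of vₜ = αvₜ₋₁ + (1 − α)ĝₜ can be sketched in a few lines; the oscillating gradient stream below is invented to mimic zig-zag behavior, and α = 0.9 is a common default, not a prescription from the post.

```python
# Momentum as an exponential moving average of gradients.
def momentum_step(v, g, alpha=0.9):
    return alpha * v + (1 - alpha) * g

noisy_grads = [1.0, -0.8, 1.2, -0.6, 1.1, -0.7]  # sign flips every step
v = 0.0
smoothed = []
for g in noisy_grads:
    v = momentum_step(v, g)
    smoothed.append(v)

# The raw gradients swing between roughly +1 and -1, while the
# smoothed velocity stays small and changes direction far less.
print(smoothed)
```

The velocity averages out the sign flips, which is why momentum damps oscillation instead of amplifying it.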
Adaptive Learning Rates Solve Scale
Different parameters need different step sizes.
AdaGrad
- shrinks learning rate over time
- works for sparse features
- but decays too aggressively
RMSProp
- uses moving averages
- keeps updates responsive
- fixes AdaGrad’s decay problem
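The difference between the two accumulators can be shown directly; this is a sketch, with ρ = 0.9 as a common default and a constant gradient stream chosen to expose the decay behavior.

```python
# AdaGrad: sum of squared gradients -> the denominator only grows,
# so the effective step size shrinks forever.
def adagrad_scale(grads, eps=1e-8):
    acc, scales = 0.0, []
    for g in grads:
        acc += g * g
        scales.append(1.0 / (acc**0.5 + eps))
    return scales

# RMSProp: leaky moving average -> the denominator levels off,
# so the step size stays responsive.
def rmsprop_scale(grads, rho=0.9, eps=1e-8):
    acc, scales = 0.0, []
    for g in grads:
        acc = rho * acc + (1 - rho) * g * g
        scales.append(1.0 / (acc**0.5 + eps))
    return scales

grads = [1.0] * 50
ada = adagrad_scale(grads)
rms = rmsprop_scale(grads)
print(ada[-1], rms[-1])  # AdaGrad keeps shrinking; RMSProp stabilizes near 1
```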
Adam Combines Both
Adam = Momentum (first-moment average) + RMSProp (second-moment average), plus bias correction for the zero-initialized averages.
That’s why it’s the default:
- stable
- fast
- easy to use
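A minimal scalar Adam step makes the combination explicit. This is the standard bias-corrected form with the usual published defaults (lr = 1e-3, β₁ = 0.9, β₂ = 0.999), which the post does not specify; the quadratic loss is a toy example.

```python
def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # momentum: first-moment average
    v = b2 * v + (1 - b2) * g * g      # RMSProp: second-moment average
    m_hat = m / (1 - b1**t)            # bias correction for zero init
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (v_hat**0.5 + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    g = 2 * theta                      # gradient of loss(theta) = theta^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # steadily moving toward the minimum at 0
```

Note how the per-step displacement is roughly lr regardless of the gradient's magnitude: the second-moment denominator normalizes the scale, which is what makes Adam easy to tune.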
But Adam Isn’t the Full Story
In many real-world cases:
- Adam converges faster
- SGD generalizes better
A common strategy:
→ start with Adam
→ switch to SGD later
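Sketched on a toy quadratic (the switch point of 50 epochs and both learning rates are assumptions one would tune per problem, and the "Adam" here is the same scalar form as above):

```python
def sgd_step(theta, g, state, lr=0.05):
    return theta - lr * g, state       # state passes through unchanged

def adam_like_step(theta, g, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    theta -= lr * (m / (1 - b1**t)) / ((v / (1 - b2**t))**0.5 + eps)
    return theta, (m, v, t)

theta, state = 5.0, (0.0, 0.0, 0)
for epoch in range(100):
    g = 2 * theta                      # gradient of loss(theta) = theta^2
    if epoch < 50:
        theta, state = adam_like_step(theta, g, state)
    else:                              # hand off to plain SGD to finish
        theta, state = sgd_step(theta, g, state)
print(theta)
```

In a real framework the same idea is usually implemented by constructing a new optimizer over the same parameters at the switch epoch.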
Learning Rate Scheduling Solves Time
Even with Adam, learning rate still matters.
Because training changes over time:
- early → explore
- late → refine
What Actually Works in Practice
- cosine decay
- warm-up (especially for large models)
- step decay for simple setups
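The schedules above are each a few lines of arithmetic; here is a sketch where total_steps, warmup_steps, and the decay constants are illustrative values, not prescriptions.

```python
import math

def cosine_with_warmup(step, total_steps=1000, warmup_steps=100, base_lr=1e-3):
    if step < warmup_steps:
        return base_lr * step / warmup_steps               # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

def step_decay(step, base_lr=1e-3, drop=0.1, every=300):
    return base_lr * drop ** (step // every)               # drop 10x every 300 steps

# Warm-up ramps from 0 to base_lr, then cosine decays smoothly to 0.
print(cosine_with_warmup(0), cosine_with_warmup(100), cosine_with_warmup(1000))
# Step decay drops in discrete jumps.
print(step_decay(0), step_decay(300), step_decay(600))
```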
Big Picture
Optimization =
- Momentum → direction
- Adaptive LR → scale
- Scheduling → time
Adaptive methods fix parameter-level issues.
Schedulers fix time-level issues.
You need both.
Practical Defaults
If you start a new project:
- Adam + cosine decay
- warm-up for large models
If performance matters:
- try switching to SGD at the end
One Insight That Changes Everything
In large-scale deep learning:
learning rate schedule often matters more than optimizer choice
Question
Do you stick with Adam the whole time,
or switch to SGD for better generalization?