Backprop tells you which way is downhill. The OPTIMIZER decides how to actually step — and that choice hugely changes training speed and stability. Here's SGD vs Momentum vs Adam, racing down the same valley.
⚙️ Watch them race: https://dev48v.infy.uk/dl/day7-optimizers.html
Real loss surfaces are often "ill-conditioned": steep in one direction and nearly flat in another (a long narrow valley). Each optimizer copes differently:
SGD — plain step
w = w - lr * grad;
One learning rate for every direction → it zigzags wall-to-wall in the valley and crawls along the floor.
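To put numbers on the zigzag, here is a minimal sketch. The bowl f(x, y) = 0.5 * (x² + 50y²), the start point, the learning rate, and the names `bowlGrad` and `runSgd` are all assumptions picked for illustration; none of them come from the demo.

```javascript
// Hypothetical ill-conditioned bowl: f(x, y) = 0.5 * (x*x + 50*y*y)
// x is the flat valley floor, y is the steep wall direction.
const bowlGrad = ([x, y]) => [x, 50 * y];

// Plain SGD: the same learning rate is applied to every coordinate.
function runSgd(start, lr, steps) {
  let w = [...start];
  const path = [w];
  for (let t = 0; t < steps; t++) {
    const g = bowlGrad(w);
    w = w.map((wi, i) => wi - lr * g[i]);
    path.push(w);
  }
  return path;
}

const sgdPath = runSgd([10, 1], 0.035, 50);
console.log(sgdPath.slice(0, 4)); // first few points of the trajectory
console.log(sgdPath[50]);         // where it ends up after 50 steps
```

With these numbers y flips sign on every step (the zigzag), while x shrinks by only 3.5% per step (the crawl).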
Momentum — a heavy ball
v = 0.9*v - lr*grad; w = w + v;
Velocity accumulates, damping the zigzag and building speed along the consistent downhill direction. It rolls through where SGD stutters.
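A rough way to compare the two is to run both on the same toy valley for the same number of steps. This is only a sketch: the surface, the start point, `lr = 0.035`, the 200-step budget, and the helper names (`valleyGrad`, `stepSgd`, `stepMomentum`) are assumptions for illustration, not the demo's code.

```javascript
// Hypothetical valley: f(x, y) = 0.5 * (x*x + 50*y*y)
const valleyGrad = ([x, y]) => [x, 50 * y];
const valleyLoss = ([x, y]) => 0.5 * (x * x + 50 * y * y);

// Plain SGD: w = w - lr * grad
function stepSgd(state, lr) {
  const g = valleyGrad(state.w);
  return { w: state.w.map((wi, i) => wi - lr * g[i]) };
}

// Momentum: v = beta*v - lr*grad; w = w + v
function stepMomentum(state, lr, beta = 0.9) {
  const g = valleyGrad(state.w);
  const v = state.v.map((vi, i) => beta * vi - lr * g[i]);
  return { w: state.w.map((wi, i) => wi + v[i]), v };
}

let plain = { w: [10, 1] };
let heavy = { w: [10, 1], v: [0, 0] };
for (let t = 0; t < 200; t++) {
  plain = stepSgd(plain, 0.035);
  heavy = stepMomentum(heavy, 0.035);
}

const sgdLoss = valleyLoss(plain.w);
const momentumLoss = valleyLoss(heavy.w);
console.log({ sgdLoss, momentumLoss });
```

On this toy valley the momentum run ends with a much lower loss after the same 200 steps. How large the gap is depends on the assumed curvature and learning rate.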
Adam — adaptive, per-parameter
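m = b1*m + (1-b1)*g; // momentum: running mean of gradients
v = b2*v + (1-b2)*g*g; // per-direction scale: running mean of squared gradients
mHat = m / (1 - b1**t); // bias correction, t = step count starting at 1
vHat = v / (1 - b2**t);
w = w - lr * mHat / (Math.sqrt(vHat) + eps);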
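Each weight's step is divided by the running size of its own gradient, so steps grow where gradients are small (flat directions) and shrink where they are large (steep directions): in effect a custom learning rate per weight. In the valley it heads almost straight for the minimum.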
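Here is a small sketch of those update rules with the bias-correction step written out. The toy surface, the start point, and the names `adamGrad`, `adamLoss` and `adamStep` are assumptions for illustration. The defaults `b1 = 0.9`, `b2 = 0.999`, `eps = 1e-8` are the commonly used Adam values, and `lr = 0.1` is just a convenient choice for this example.

```javascript
// Hypothetical valley: f(x, y) = 0.5 * (x*x + 50*y*y)
const adamGrad = ([x, y]) => [x, 50 * y];
const adamLoss = ([x, y]) => 0.5 * (x * x + 50 * y * y);

// One Adam update. state = { w, m, v, t }; t counts completed steps.
function adamStep(state, lr = 0.1, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  const t = state.t + 1;
  const g = adamGrad(state.w);
  const m = state.m.map((mi, i) => b1 * mi + (1 - b1) * g[i]);        // running mean of g
  const v = state.v.map((vi, i) => b2 * vi + (1 - b2) * g[i] * g[i]); // running mean of g*g
  const w = state.w.map((wi, i) => {
    const mHat = m[i] / (1 - b1 ** t); // undo the bias from starting m at 0
    const vHat = v[i] / (1 - b2 ** t); // undo the bias from starting v at 0
    return wi - lr * mHat / (Math.sqrt(vHat) + eps);
  });
  return { w, m, v, t };
}

let adam = { w: [10, 1], m: [0, 0], v: [0, 0], t: 0 };
const adamStart = adamLoss(adam.w);
adam = adamStep(adam);
const afterFirstStep = [...adam.w];
for (let i = 1; i < 500; i++) adam = adamStep(adam);
const adamEnd = adamLoss(adam.w);
console.log({ afterFirstStep, adamStart, adamEnd });
```

The first step moves both x and y by about `lr`, even though their gradients differ (10 vs 50): the division by `Math.sqrt(vHat)` cancels the gradient's scale, which is the per-weight learning rate in action.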
Which one?
Adam is the safe default for most deep learning — fast and forgiving about the learning rate. SGD + momentum often generalizes a touch better and is standard in vision. The optimizer is a hyperparameter like any other.
Race them and watch SGD zigzag while Adam goes straight.