Optimizers Explained: SGD vs Momentum vs Adam, Visualized

#machinelearning #deeplearning #ai #beginners

Backprop tells you which way is downhill. The OPTIMIZER decides how to actually step — and that choice hugely changes training speed and stability. Here's SGD vs Momentum vs Adam, racing down the same valley.

⚙️ Watch them race: https://dev48v.infy.uk/dl/day7-optimizers.html

Real loss surfaces are often "ill-conditioned": steep in one direction and nearly flat in another, like a long narrow valley. Each optimizer copes differently:
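To make the valley concrete, here is a minimal sketch of an ill-conditioned surface. The quadratic form and the curvatures `A` and `B` are illustrative assumptions, not values taken from the linked demo.

```javascript
// Hypothetical valley: loss(w) = 0.5 * (A * w[0]^2 + B * w[1]^2).
// A and B are made-up curvatures chosen only to illustrate the shape.
const A = 1;   // gentle curvature along the valley floor
const B = 50;  // steep curvature across the valley walls

function loss(w) {
  return 0.5 * (A * w[0] * w[0] + B * w[1] * w[1]);
}

// Gradient of the loss above, one entry per weight.
function grad(w) {
  return [A * w[0], B * w[1]];
}

// Ratio of steepest to flattest curvature: a rough measure of how narrow the valley is.
const conditionNumber = Math.max(A, B) / Math.min(A, B);

// At the same distance from the minimum, the wall direction pulls 50x harder.
const g = grad([1, 1]);
console.log(conditionNumber, g);
```

The larger that ratio, the harder it is for one learning rate to suit both directions.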

SGD — plain step

w = w - lr * grad;

One learning rate for every direction. It has to stay small enough for the steep walls, so SGD zigzags wall-to-wall in the valley and crawls along the floor.
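The zigzag can be reproduced in a few lines. This is a sketch on a hypothetical quadratic valley; the curvatures, the learning rate, and the starting point are all assumptions chosen to make the effect visible.

```javascript
// Hypothetical valley: loss = 0.5 * (A * x^2 + B * y^2), minimum at [0, 0].
const A = 1;      // flat direction (valley floor)
const B = 50;     // steep direction (valley walls)
const lr = 0.03;  // assumed learning rate, small enough to stay stable on the walls

function grad(w) {
  return [A * w[0], B * w[1]];
}

// One plain SGD step: same learning rate for every weight.
function sgdStep(w, lr) {
  const g = grad(w);
  return [w[0] - lr * g[0], w[1] - lr * g[1]];
}

let w = [10, 1];
const path = [w];
for (let i = 0; i < 50; i++) {
  w = sgdStep(w, lr);
  path.push(w);
}

// y flips sign every step (the zigzag); x shrinks by only 3% per step (the crawl).
console.log(path.slice(0, 4), path[50]);
```

Raising `lr` would speed up the crawl along the floor, but once `lr * B` passes 2 the wall direction diverges, which is why plain SGD is stuck with the small step.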

Momentum — a heavy ball

v = 0.9*v - lr*grad;   // velocity: decayed history plus the new step
w = w + v;             // move by the velocity

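Velocity accumulates: gradients that flip sign from step to step largely cancel, damping the zigzag, while gradients that keep pointing the same way add up, building speed along the consistent downhill direction. It rolls through where SGD stutters.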
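The cancel-versus-accumulate effect can be checked in isolation. The sketch below feeds the velocity update two made-up gradient sequences, one that always points the same way and one that flips sign every step; the helper name `finalVelocity` and all constants are assumptions for illustration.

```javascript
// Runs only the velocity part of the momentum update over a list of gradients.
function finalVelocity(grads, lr, beta) {
  let v = 0;
  for (const g of grads) {
    v = beta * v - lr * g;
  }
  return v;
}

const lr = 0.03;    // assumed learning rate
const beta = 0.9;   // momentum coefficient, as in the update above
const steps = 100;

// Valley floor: the gradient keeps the same sign.
const consistent = Array.from({ length: steps }, () => 1);
// Valley walls: the gradient flips sign every step.
const flipping = Array.from({ length: steps }, (_, i) => (i % 2 === 0 ? 1 : -1));

const vFloor = finalVelocity(consistent, lr, beta);
const vWalls = finalVelocity(flipping, lr, beta);

// Velocity size relative to a single plain SGD step of size lr.
console.log(Math.abs(vFloor) / lr, Math.abs(vWalls) / lr);
```

With `beta = 0.9`, the consistent sequence settles near `1 / (1 - beta)`, about 10 plain steps' worth of velocity, while the flipping sequence settles near `1 / (1 + beta)`, roughly half a step.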

Adam — adaptive, per-parameter

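m = b1*m + (1-b1)*grad;             // momentum: running mean of gradients
v = b2*v + (1-b2)*grad*grad;        // per-direction scale: running mean of squared gradients
mHat = m / (1 - Math.pow(b1, t));   // bias correction, t = step count starting at 1
vHat = v / (1 - Math.pow(b2, t));
w = w - lr * mHat / (Math.sqrt(vHat) + eps);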

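Each weight's step is divided by the running size of its own gradient, so steps are relatively big where the surface is flat and small where it's steep: a custom learning rate per weight. In the valley it heads almost straight to the minimum.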
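Here is a minimal runnable sketch of one Adam step on a hypothetical quadratic valley. The `adamStep` helper, the state layout, the curvatures, and `lr = 0.1` are assumptions for illustration; `b1 = 0.9`, `b2 = 0.999`, and `eps = 1e-8` are the commonly used defaults.

```javascript
// One Adam update applied in place to state.w, given the gradient g.
// state holds the weights w, the two running averages m and v, and the step count t.
function adamStep(state, g, lr = 0.1, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  state.t += 1;
  for (let i = 0; i < g.length; i++) {
    state.m[i] = b1 * state.m[i] + (1 - b1) * g[i];         // running mean of gradients
    state.v[i] = b2 * state.v[i] + (1 - b2) * g[i] * g[i];  // running mean of squared gradients
    // Bias correction: m and v start at zero, so early averages are scaled back up.
    const mHat = state.m[i] / (1 - Math.pow(b1, state.t));
    const vHat = state.v[i] / (1 - Math.pow(b2, state.t));
    state.w[i] -= lr * mHat / (Math.sqrt(vHat) + eps);
  }
}

// Hypothetical valley: loss = 0.5 * (A * x^2 + B * y^2).
const A = 1;
const B = 50;
const state = { w: [10, 1], m: [0, 0], v: [0, 0], t: 0 };

// The raw gradients differ by 5x here: [10, 50].
const g = [A * state.w[0], B * state.w[1]];
adamStep(state, g);

// Both weights move by about lr, regardless of gradient size.
console.log(state.w);
```

On this first step the gradient magnitude cancels out of `mHat / sqrt(vHat)`, so each weight moves by roughly `lr` in the downhill direction. That normalization is what stops the steep direction from dominating.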

Which one?

Adam is a safe default for most deep learning: fast and forgiving about the learning rate. SGD + momentum often generalizes a touch better and is still a common choice in vision. Treat the optimizer as a hyperparameter like any other and try more than one.

Race them and watch SGD zigzag while Adam goes straight.