"SGD walks. Momentum runs. Adam runs intelligently."
In my last post, I showed you how backpropagation could learn the weights for XOR automatically. No more hand-crafting. No more trial and error. Just set a learning rate, run the algorithm, and watch the loss curve drop.
It felt like magic. Almost too easy.
But here's what I glossed over: XOR has just 4 training examples. With 4 examples, you compute the gradient using all of them at once. Every weight update sees the complete picture.
But XOR is a toy problem. Let me tell you about a real dataset.
MNIST: The "Hello World" of Deep Learning
MNIST is a collection of 70,000 handwritten digit images—60,000 for training, 10,000 for testing. Each image is 28×28 grayscale pixels.
The task: look at an image and predict which digit (0-9) it represents.
Trivial for humans. Genuinely hard for 1990s computers. It became the standard benchmark for machine learning.
Each image has 784 pixels (28×28). To classify them, we need 784 inputs, a hidden layer (say, 128 neurons), and 10 outputs. That's roughly 100,000 weights to learn.
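That estimate is easy to verify. Here's a quick sketch counting every weight and bias in the 784 → 128 → 10 network described above (including the bias terms, which the rough figure glosses over):

```python
# Parameter count for a 784 -> 128 -> 10 fully connected network.
input_size, hidden_size, output_size = 784, 128, 10

# Each layer has a weight matrix plus one bias per output neuron.
hidden_params = input_size * hidden_size + hidden_size   # 100,480
output_params = hidden_size * output_size + output_size  # 1,290

total = hidden_params + output_params
print(total)  # 101770
```

So "roughly 100,000" is really 101,770 learnable parameters, with the first layer holding almost 99% of them.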
Here's the problem: computing the gradient using all 60,000 examples—like we did with XOR—requires 60,000 forward and backward passes per weight update. On my laptop, that's 30 seconds per epoch.
Training for 100 epochs? 50 minutes total.
Not terrible. But remember, this is just MNIST.
Now consider:
- GPT-2 trained on roughly 40 GB of web text
- GPT-3 trained on 300 billion tokens
- GPT-4? Reportedly trillions of tokens and hundreds of billions of weights
If full-batch gradient descent takes 50 minutes for MNIST, modern LLMs would take thousands of years.
This is the "Scale Wall." Nobody trains this way anymore.
The Leap from Toy to Real
The jump from XOR to MNIST isn't just more data—it's a fundamental shift in thinking. At scale, learning can't be perfect. It has to be incremental, approximate, adaptive. Just like human learning.
This is where optimizers enter.
Mini-Batches: Learning from Subsets
The insight that changed everything: you don't need all 60,000 examples to know which direction to move your weights.
Think about cooking. You don't taste every grain of rice to know if you need more salt. You taste a spoonful—a small sample tells you enough.
That's mini-batch stochastic gradient descent.
Instead of computing the gradient over all 60,000 examples at once, you update on small random subsets. The algorithm is simple:
```
for each epoch:
    shuffle training data
    divide into mini-batches
    for each mini-batch:
        forward pass (compute predictions)
        compute loss (average over batch)
        backward pass (compute gradients)
        update weights
```
One complete pass through all the data is an epoch. With 60,000 examples and batch size 64, that's 938 mini-batches per epoch (937 full batches plus one partial).
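The pseudocode above translates almost line for line into NumPy. Here's a minimal sketch on synthetic data, using a toy linear model rather than the full MNIST network, just to show the shuffling and batching mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 600 examples, 4 features, linear targets plus a little noise.
X = rng.normal(size=(600, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=600)

w = np.zeros(4)            # weights to learn
lr, batch_size = 0.1, 64

for epoch in range(20):
    # Shuffle training data so mini-batches differ every epoch.
    perm = rng.permutation(len(X))
    X_shuf, y_shuf = X[perm], y[perm]

    # Divide into mini-batches and update on each one.
    for start in range(0, len(X), batch_size):
        xb = X_shuf[start:start + batch_size]
        yb = y_shuf[start:start + batch_size]

        pred = xb @ w                       # forward pass
        err = pred - yb
        loss = (err ** 2).mean()            # loss, averaged over the batch
        grad = 2 * xb.T @ err / len(xb)     # backward pass (MSE gradient)
        w -= lr * grad                      # update weights

print(np.round(w, 2))  # close to [ 1. -2.  0.5  3. ]
```

Each epoch here makes ten noisy updates instead of one exact one, and the weights still land on the true values.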
The magic? Each mini-batch gradient is noisy—not the exact gradient from all 60,000 examples. But it points roughly in the right direction. And it's fast.
On my laptop:
- Full-batch: 30 seconds per update (one update per epoch)
- Mini-batch (size 64): 0.03 seconds per update

That's 1000× faster per update. An epoch takes roughly the same wall-clock time either way, but it now delivers 938 weight updates instead of one, so the network converges in far fewer epochs. Total training time drops from about 50 minutes to around 5.
The Three Key Concepts
Mini-batch: A small subset of training data for one gradient update. Common sizes: 16, 32, 64, 128. Smaller batches are noisier but faster. Larger batches are more accurate but slower.
Epoch: One complete pass through your training dataset. With 60,000 examples and batch size 64: one epoch = 938 updates.
Stochastic: Means "random." We shuffle data before each epoch, so mini-batches differ every time. This randomness helps—it prevents the network from memorising example order.
Optimizers as Human Learners
| | SGD: Stubborn Learner | Momentum: Persistent Learner | Adam: Adaptive Learner |
|---|---|---|---|
| Algorithm | `w = w - lr * grad` | `v = β*v - lr*grad`<br>`w = w + v` | `m = β₁*m + (1-β₁)*grad`<br>`v = β₂*v + (1-β₂)*grad²`<br>`w = w - lr * m/√v` |
| Analogy | A student who learns at a constant pace, treating every mistake equally | A student who builds confidence from past successes, using momentum to push through | An adaptive learner tracking both direction and intensity of learning |
| Variables | `w` → understanding<br>`lr` → speed<br>`grad` → mistake | `v` → confidence<br>`β` → memory (0.9)<br>`lr` → speed<br>`grad` → mistake | `m` → direction memory<br>`v` → intensity memory<br>`β₁, β₂` → decay (0.9, 0.999)<br>`m/√v` → balanced update |
| Good at | ✓ Simple problems<br>✓ Guaranteed convergence<br>✓ Memory efficient | ✓ Escaping local minima<br>✓ Faster convergence<br>✓ Smooths noise | ✓ Most deep learning<br>✓ Momentum + adaptive<br>✓ Robust default |
| Bad at | ✗ Gets stuck easily<br>✗ Slow near flat areas<br>✗ Same step for all | ✗ Can overshoot<br>✗ Same lr for all<br>✗ Needs β tuning | ✗ Poor local minima<br>✗ More memory<br>✗ More hyperparameters |
Key Insight: Just like human learners, no single optimizer is perfect. SGD is reliable but stubborn. Momentum adds persistence but can be overconfident. AdaGrad adapts intelligently but burns out. Adam balances everything but sometimes settles for "good enough."
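To make the table concrete, here are the three update rules as plain NumPy functions, raced on a one-dimensional bowl. This is a simplified sketch: the SGD and Momentum rules follow the table's formulas exactly, while Adam adds the bias-correction terms from the original paper (the table omits them for brevity), and all hyperparameter values are the usual defaults, not anything tuned:

```python
import numpy as np

def sgd_update(w, grad, lr=0.1):
    # Plain SGD: step against the gradient at a constant rate.
    return w - lr * grad

def momentum_update(w, grad, v, lr=0.1, beta=0.9):
    # Momentum: v accumulates a decaying history of past steps.
    v = beta * v - lr * grad
    return w + v, v

def adam_update(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: track direction (m) and intensity (v), with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Race them on loss(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w_sgd = w_mom = w_adam = 0.0
v_mom, m_adam, v_adam = 0.0, 0.0, 0.0
for t in range(1, 201):
    w_sgd = sgd_update(w_sgd, 2 * (w_sgd - 3))
    w_mom, v_mom = momentum_update(w_mom, 2 * (w_mom - 3), v_mom)
    w_adam, m_adam, v_adam = adam_update(w_adam, 2 * (w_adam - 3),
                                         m_adam, v_adam, t)

print(round(w_sgd, 3), round(w_mom, 3), round(w_adam, 3))  # all near 3.0
```

On this toy bowl all three find the minimum; the differences the table describes only really show up on noisy, high-dimensional losses.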
Note: There are many more optimizers out there—AdaGrad (adapts per-parameter but burns out), RMSProp (fixes AdaGrad's decay problem), AdaDelta, NAdam, L-BFGS (second-order method for smaller datasets), and newer variants like AdamW, RAdam, and Lookahead. Each has its own strengths and trade-offs. I'll cover these in future posts when the context is right.
Time to Experiment
Let's see this in action! I've built a playground that trains a network on MNIST with three optimizers: Batch SGD, Momentum, and Adam. Experiment with each and watch how they differ in training time and convergence speed.
Sample screenshot from the playground:
Running the Playground
```bash
# Clone the repository
git clone https://github.com/rnilav/perceptrons-to-transformers.git
cd perceptrons-to-transformers/04-optimization

# Install dependencies
pip install -r requirements.txt

# Run the playground
streamlit run optimization_playground.py
```
What Clicked for Me
Scale changes everything. Full-batch works for 4 examples, collapses at 60,000. Mini-batching is survival, not cleverness.
Adaptive learning makes sense. Not all weights should move equally. Adam adjusts per-parameter instead of treating everything the same.
And the progression is elegant:
Perceptron → multi-layer networks → backpropagation → scalable optimizers.
Each breakthrough made the next one possible.
What's Next
We can now train on real data. Backprop computes gradients, Adam updates weights. 99% accuracy on MNIST in seconds.
Everything works. Until it doesn't.
Train longer: training accuracy climbs to 99.8%, but test accuracy stalls and drops. The model isn't learning—it's memorising.
Next: why overfitting happens and how dropout and weight decay force networks to generalise instead of memorise.
Training a network is easy. Making it work in the real world? That's the challenge.
References
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Tags: #MachineLearning #AI #DeepLearning #Optimization #SGD #Adam #Momentum #MNIST #NeuralNetworks
