The One-Line Summary: Batch uses all data (slow but accurate). Stochastic uses one sample (fast but noisy). Mini-batch uses a small group (best of both worlds). That's why everyone uses mini-batch.
Three Hikers, One Mountain
Three friends decide to hike down a foggy mountain. They can't see the valley below. They can only ask other hikers which way seems downhill.
But they each have a very different strategy.
Hiker 1: The Perfectionist
Before taking a single step, the Perfectionist finds every single hiker on the mountain. All 10,000 of them.
She asks each one: "Which way is downhill?"
She collects 10,000 opinions. She averages them. She calculates the perfect direction.
Then — and only then — she takes one step.
Then she does it all over again. Find 10,000 hikers. Ask all of them. Average. Step.
She's incredibly accurate. But by the time she reaches the bottom, her friends have been at the pub for three hours.
Hiker 2: The Impulsive
The Impulsive hiker has no patience. He grabs one random person and asks: "Which way is down?"
The stranger points vaguely left. The Impulsive takes a step.
He grabs another stranger. "Which way?" They point right. He steps right.
Another stranger. Another direction. Step. Stranger. Step. Stranger. Step.
His path is chaotic. He zigzags wildly. Sometimes he even goes uphill by accident.
But he's fast. Insanely fast. He takes thousands of steps while the Perfectionist is still conducting her survey.
And despite the chaos, he eventually stumbles into the valley.
Hiker 3: The Pragmatist
The Pragmatist thinks both her friends are crazy.
She finds a small group of 32 hikers. She asks them all. She averages their opinions.
She takes a step.
She finds another group of 32. Asks. Averages. Steps.
Not as accurate as asking everyone. Not as fast as asking one person. But way more balanced.
Her path is smooth-ish. Her speed is reasonable. She reaches the valley efficiently.
These three hikers are:
- Batch Gradient Descent (The Perfectionist)
- Stochastic Gradient Descent (The Impulsive)
- Mini-Batch Gradient Descent (The Pragmatist)
Let's dive deeper into each one.
Batch Gradient Descent: The Perfectionist
How It Works
Batch gradient descent computes the gradient using the entire dataset before making a single update.
```python
for epoch in range(num_epochs):
    # Use ALL training data
    gradient = compute_gradient(entire_dataset)  # 10,000 samples

    # One update per epoch
    parameters = parameters - learning_rate * gradient
```
Every epoch, you:
- Run the entire dataset through the model
- Calculate one gradient (averaged over all samples)
- Make one parameter update
- Repeat
The Path It Takes
```
Loss
│
│\
│ \
│  \
│   \
│    \
│     \
│      \  ★
│       \___/
│
└────────────────── Parameter
```
Path: Smooth, direct, no zigzags
Steps: Few (one per epoch)
Each step: Carefully calculated
Beautiful, right? The smoothest possible path to the minimum.
The Pros
| Advantage | Why |
|---|---|
| Stable convergence | Gradient is accurate (averaged over all data) |
| No noise | Every step is in the "true" best direction |
| Guaranteed to converge | For convex problems, will find minimum |
| Predictable | Easy to debug, loss always decreases |
The Cons
| Disadvantage | Why |
|---|---|
| Painfully slow | Must process ALL data for ONE step |
| Memory hungry | Must load entire dataset into memory |
| Can't escape local minima | Too smooth, no randomness to jump out |
| No online learning | Can't update as new data arrives |
When Batch GD Dies
Imagine your dataset has 1 billion samples. Each sample is an image.
To take ONE step:
- Load 1 billion images into memory (impossible)
- Compute gradient for each one
- Average them
- Update parameters once
Then repeat this for thousands of epochs.
You'll die of old age before your model trains.
Stochastic Gradient Descent (SGD): The Impulsive
How It Works
SGD computes the gradient using one random sample at a time.
```python
for epoch in range(num_epochs):
    shuffle(dataset)  # Randomize order
    for sample in dataset:  # One sample at a time
        # Use ONE training example
        gradient = compute_gradient(sample)

        # Update immediately
        parameters = parameters - learning_rate * gradient
```
Every sample, you:
- Compute gradient for that ONE sample
- Update parameters immediately
- Move to next sample
If you have 10,000 samples, you make 10,000 updates per epoch instead of just one.
The Path It Takes
```
Loss
│
│\   ↗
│ ↘ ↗
│  ↗ ↙
│   ↘ ↗
│    ↙↗
│     ↘↙
│      ★
│
└────────────────── Parameter
```
Path: Chaotic, zigzagging, noisy
Steps: Many (one per sample)
Each step: Quick and dirty
It's drunk. It's chaotic. But it's fast.
The Pros
| Advantage | Why |
|---|---|
| Blazing fast | Updates after every single sample |
| Memory efficient | Only one sample in memory at a time |
| Can escape local minima | Noise helps jump out of bad valleys |
| Online learning | Can learn from streaming data |
| Regularization effect | Noise prevents overfitting |
The Cons
| Disadvantage | Why |
|---|---|
| Noisy convergence | Gradient from one sample is unreliable |
| Zigzags wildly | Steps often go the wrong direction |
| Never truly settles | Keeps bouncing around the minimum |
| Hard to parallelize | Sequential by nature |
The Beautiful Chaos
Here's the weird thing: the noise is actually useful.
That random bouncing around? It can knock you out of bad local minima that batch GD would get stuck in.
```
Batch GD:               SGD:
│\    /\                │\    /\
│ \  /  \               │ \  /  \
│  ▼     \              │  ↘      ← Noise kicks it out!
│ stuck   \             │    ↘
│          \_★          │      ★
```
The Perfectionist would settle into the first dip. The Impulsive bounces right through it.
Mini-Batch Gradient Descent: The Pragmatist
How It Works
Mini-batch GD computes the gradient using a small batch of samples (typically 32, 64, 128, or 256).
```python
batch_size = 32

for epoch in range(num_epochs):
    shuffle(dataset)
    for batch in create_batches(dataset, batch_size):
        # Use a small batch (32 samples)
        gradient = compute_gradient(batch)

        # Update after each batch
        parameters = parameters - learning_rate * gradient
```
If you have 10,000 samples and batch size 32:
- 10,000 / 32 = 312 full batches, plus one partial batch of 16 samples
- 313 updates per epoch (not 1, not 10,000)
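The `create_batches` helper in the pseudocode above isn't defined anywhere. A minimal sketch, assuming the dataset is an indexable sequence (the helper name and signature are illustrative, not a library API):

```python
import random

def create_batches(dataset, batch_size):
    """Yield shuffled batches from an indexable dataset.

    The final batch may be smaller than batch_size when the dataset
    size is not an exact multiple of it.
    """
    indices = list(range(len(dataset)))
    random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

batches = list(create_batches(list(range(100)), 32))
print(len(batches))  # 4 batches: three of 32 samples, one of 4
```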
The Path It Takes
```
Loss
│
│\
│ \
│  ↘
│   ↘
│    ↘
│     ↘  ★
│      \_/
│
└────────────────── Parameter
```
Path: Slightly noisy but mostly smooth
Steps: Moderate (one per batch)
Each step: Reasonably accurate
Not as smooth as batch. Not as chaotic as SGD. Just right.
The Pros
| Advantage | Why |
|---|---|
| Fast | More updates than batch GD |
| Stable | More accurate than single-sample SGD |
| Memory efficient | Only load batch_size samples at a time |
| GPU friendly | Batches can be processed in parallel! |
| Some noise | Can still escape shallow local minima |
| Best of both worlds | Balances speed and accuracy |
The Cons
| Disadvantage | Why |
|---|---|
| One hyperparameter | Need to choose batch size |
| Not perfect | Still some noise (but usually good) |
| Slightly more complex | Need to implement batching |
Why GPUs Love Mini-Batches
Here's a secret: GPUs are designed for batch processing.
A GPU can multiply a 32x1000 matrix just as fast as a 1x1000 matrix. The parallelism is essentially free.
```
CPU: Process sample 1... done. Sample 2... done. Sample 3...
GPU: Process samples 1-32... done. All at once.
```
This is why mini-batch dominates deep learning. It's not just a compromise — it's actually faster than pure SGD on modern hardware.
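We can't benchmark a GPU in a blog snippet, but even NumPy on a CPU shows the same effect: one batched matrix multiply beats a Python loop over individual samples. A sketch with arbitrary illustrative sizes:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000))     # a "layer" of weights
batch = rng.standard_normal((32, 1000))   # 32 samples

# One sample at a time: 32 separate vector-matrix products per pass
t0 = time.perf_counter()
for _ in range(100):
    looped = np.stack([x @ W for x in batch])
loop_time = time.perf_counter() - t0

# All 32 samples at once: a single matrix-matrix product per pass
t0 = time.perf_counter()
for _ in range(100):
    batched = batch @ W
batch_time = time.perf_counter() - t0

assert np.allclose(looped, batched)   # identical results...
print(f"looped : {loop_time:.4f}s")
print(f"batched: {batch_time:.4f}s")  # ...but batched is typically much faster
```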
Side-by-Side Comparison
Let's see all three approaches with the same dataset.
```python
import numpy as np

# Dataset: 1000 samples
np.random.seed(42)
X = np.random.randn(1000, 1)
y = 3 * X + 2 + np.random.randn(1000, 1) * 0.5

# Initialize
w, b = 0.0, 0.0
learning_rate = 0.01

def compute_gradient(X_batch, y_batch, w, b):
    predictions = X_batch * w + b
    error = predictions - y_batch
    dw = 2 * np.mean(error * X_batch)
    db = 2 * np.mean(error)
    return dw, db

def compute_loss(X, y, w, b):
    return np.mean((X * w + b - y) ** 2)
```
Batch Gradient Descent
```python
w, b = 0.0, 0.0
print("=== BATCH GRADIENT DESCENT ===")
print("Updates per epoch: 1")
print()

for epoch in range(100):
    # Use ALL data
    dw, db = compute_gradient(X, y, w, b)
    w = w - learning_rate * dw
    b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
```
=== BATCH GRADIENT DESCENT ===
Updates per epoch: 1

Epoch 0: w=0.1192, b=0.0814, loss=9.4271
Epoch 20: w=2.6841, b=1.8732, loss=0.3124
Epoch 40: w=2.9642, b=1.9812, loss=0.2518
Epoch 60: w=2.9961, b=1.9974, loss=0.2487
Epoch 80: w=2.9994, b=1.9997, loss=0.2485
```
Smooth. Stable. But only 100 updates total.
Stochastic Gradient Descent
```python
w, b = 0.0, 0.0
print("=== STOCHASTIC GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X)}")
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))
    for i in indices:
        # Use ONE sample
        xi = X[i:i+1]
        yi = y[i:i+1]
        dw, db = compute_gradient(xi, yi, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
```
=== STOCHASTIC GRADIENT DESCENT ===
Updates per epoch: 1000

Epoch 0: w=2.9847, b=1.9923, loss=0.2491
Epoch 20: w=3.0142, b=2.0087, loss=0.2486
Epoch 40: w=2.9891, b=1.9812, loss=0.2489
Epoch 60: w=3.0023, b=2.0156, loss=0.2487
Epoch 80: w=2.9967, b=1.9943, loss=0.2485
```
Notice: Converged in the FIRST epoch! 1000 updates vs batch's 1.
But also notice: The values keep bouncing around. It never truly settles.
Mini-Batch Gradient Descent
```python
w, b = 0.0, 0.0
batch_size = 32

print("=== MINI-BATCH GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X) // batch_size}")  # 31 full batches (+1 partial)
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        # Use a BATCH of samples
        batch_idx = indices[start:start+batch_size]
        X_batch = X[batch_idx]
        y_batch = y[batch_idx]
        dw, db = compute_gradient(X_batch, y_batch, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
=== MINI-BATCH GRADIENT DESCENT ===
Updates per epoch: 31
Epoch 0: w=2.8934, b=1.9234, loss=0.2612
Epoch 20: w=2.9987, b=1.9991, loss=0.2485
Epoch 40: w=3.0001, b=2.0003, loss=0.2485
Epoch 60: w=2.9998, b=1.9999, loss=0.2485
Epoch 80: w=3.0000, b=2.0001, loss=0.2485
Fast convergence (first few epochs) AND stable final values. The sweet spot.
The Noise Comparison
Let's visualize the different noise levels:
```
Gradient Accuracy Over Time

Batch GD (all 10,000 samples):
True direction: →
Estimated:      → → → → → → → →
(Always correct)

SGD (1 sample):
True direction: →
Estimated:      ↗ ↙ → ↖ ↘ ← ↗ ↓
(Wildly inconsistent)

Mini-Batch (32 samples):
True direction: →
Estimated:      → ↗ → ↘ → → ↗ →
(Mostly correct, slight noise)
```
The gradient from 32 samples is much more reliable than from 1, but much faster to compute than 10,000.
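You can measure this directly. The sketch below (same kind of linear-regression setup as the examples above, with illustrative data) estimates the same gradient from batches of different sizes and compares the spread of the estimates:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal(10_000)
y = 3 * X + 2 + rng.standard_normal(10_000) * 0.5

def grad_w(idx, w=0.0, b=0.0):
    """Gradient of the MSE loss w.r.t. w, estimated from the samples in idx."""
    error = X[idx] * w + b - y[idx]
    return 2 * np.mean(error * X[idx])

# Spread of the gradient estimate for different batch sizes
for batch_size in (1, 32, 1024):
    estimates = [grad_w(rng.integers(0, len(X), batch_size))
                 for _ in range(1000)]
    print(f"batch size {batch_size:>5}: std of gradient estimate "
          f"= {np.std(estimates):.3f}")
# The std shrinks roughly like 1/sqrt(batch_size): the 32-sample
# estimate is far more reliable than the single-sample one.
```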
Choosing the Batch Size
Batch size is a hyperparameter. Here's how to think about it:
Small Batch (8-32)

Pros:
- More noise (acts as regularization)
- Can escape local minima
- Less memory needed

Cons:
- Less stable
- Slower on GPU (underutilized)
- More updates = more overhead

Medium Batch (64-256)

Pros:
- Good balance of speed and stability
- GPU efficient
- Stable enough

Cons:
- Jack of all trades, master of none

Large Batch (512-4096)

Pros:
- Very stable gradients
- Maximizes GPU utilization
- Fewer updates needed

Cons:
- May converge to sharp minima (worse generalization)
- Needs lots of memory
- Less regularization
- May need to adjust the learning rate
The Rule of Thumb
| Situation | Recommended Batch Size |
|---|---|
| Just starting out | 32 or 64 |
| Limited GPU memory | Largest that fits |
| Want more regularization | Smaller (16-32) |
| Very large dataset | Larger (256-512) |
| Default/don't know | 32 |
The Learning Rate Connection
Here's something important: batch size and learning rate are linked.
Larger batch → More accurate gradient → Can use larger learning rate
Smaller batch → Noisier gradient → Need smaller learning rate
Two heuristics are common. The conservative square-root rule: if you double the batch size, multiply the learning rate by roughly sqrt(2).

```python
# Square-root scaling heuristic
# Batch size 32,  LR = 0.001
# Batch size 64,  LR = 0.0014
# Batch size 128, LR = 0.002
# Batch size 256, LR = 0.0028
```

The more aggressive linear scaling rule (popularized by large-batch ImageNet training) scales the rate proportionally:

new_lr = base_lr × (new_batch_size / base_batch_size)

Either way, treat the scaled value as a starting point, not a guarantee, and validate it empirically.
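Both scaling heuristics are easy to express in code. A sketch (the function name `scale_lr` is illustrative, not a library API):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when changing batch size.

    'linear' follows the linear scaling rule (lr grows proportionally);
    'sqrt' is the more conservative square-root heuristic.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    elif rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

print(scale_lr(0.001, 32, 128, rule="linear"))  # 0.004
print(scale_lr(0.001, 32, 128, rule="sqrt"))    # 0.002
```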
The Memory Math
Let's do some real numbers.
Scenario: Training a CNN on images
- Image size: 224 × 224 × 3 = 150,528 floats
- Float size: 4 bytes
- Per image: ~600 KB
| Batch Size | Memory for Inputs | Plus activations, gradients... |
|---|---|---|
| 1 | 600 KB | ~50 MB |
| 32 | 19 MB | ~1.6 GB |
| 128 | 77 MB | ~6.4 GB |
| 512 | 307 MB | ~25 GB |
This is why batch size is often limited by GPU memory, not by choice.
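The arithmetic behind the table is easy to reproduce. A sketch for the raw input tensors only (the activations and gradients that dominate real training depend on the specific architecture, hence the rough estimates in the last column):

```python
def input_memory_mb(batch_size, height=224, width=224, channels=3,
                    bytes_per_float=4):
    """Memory for one batch of raw input images, in (decimal) megabytes."""
    per_image = height * width * channels * bytes_per_float  # 602,112 bytes
    return batch_size * per_image / 1e6

for bs in (1, 32, 128, 512):
    print(f"batch {bs:>4}: {input_memory_mb(bs):7.1f} MB of inputs")
```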
```python
# Common scenario
batch_size = 32   # What you want
# ...CUDA out of memory error!
batch_size = 16   # What you can actually fit
```
Advanced: Gradient Accumulation
What if you want the stability of large batches but only have memory for small ones?
Gradient accumulation: Run multiple small batches, accumulate gradients, then update.
```python
accumulation_steps = 4  # Effective batch size = 32 * 4 = 128

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets) / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate in .grad across iterations

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update once every 4 micro-batches
        optimizer.zero_grad()
```
Now you get the gradient quality of batch_size=128, using only memory for batch_size=32.
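For intuition, here is the same idea framework-free, as a NumPy sketch with illustrative data. The key fact it demonstrates: equally weighted micro-batch gradients add up to exactly the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 1))
y = 3 * X + 2

w, b, lr = 0.0, 0.0, 0.01
micro_batch, accumulation_steps = 32, 4  # effective batch = 128

acc_dw = acc_db = 0.0
for step in range(accumulation_steps):
    sl = slice(step * micro_batch, (step + 1) * micro_batch)
    error = X[sl] * w + b - y[sl]
    # Scale each micro-batch gradient so the accumulated sum equals
    # the average over the full effective batch
    acc_dw += 2 * np.mean(error * X[sl]) / accumulation_steps
    acc_db += 2 * np.mean(error) / accumulation_steps

# Sanity check: accumulated gradient == full-batch gradient at w=0, b=0
full_dw = 2 * np.mean(-y * X)
assert np.isclose(acc_dw, full_dw)

# One update, with the gradient quality of a 128-sample batch
w -= lr * acc_dw
b -= lr * acc_db
```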
Real-World Usage
What do practitioners actually use?
```python
# PyTorch
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(epochs):
    for inputs, targets in train_loader:  # Mini-batch!
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

# TensorFlow/Keras
model.fit(X, y, batch_size=32, epochs=100)  # Mini-batch!

# fast.ai -- batch size is set on the DataLoaders, not in fit_one_cycle
dls = ImageDataLoaders.from_folder(path, bs=64)  # Mini-batch!
learn = vision_learner(dls, resnet34)
learn.fit_one_cycle(10, lr_max=3e-3)
```
Everyone uses mini-batch. When people say "SGD" in deep learning, they almost always mean "mini-batch SGD."
The Convergence Visualization
Here's what training looks like with each method:
```
Training Loss Over Time

Batch GD:
Loss │ \
     │  \
     │   \
     │    \────────────
     └──────────────────── Time
     (Smooth but slow progress)

SGD:
Loss │\  /\    /\
     │ \/  v  \/\
     │   \/\/\  /\
     │        \/  \──
     └──────────────────── Time
     (Fast but noisy, never settles)

Mini-Batch:
Loss │\
     │ \\
     │  \~\
     │    ~\~~~~~~~~~~
     └──────────────────── Time
     (Fast AND eventually settles)
```
Quick Decision Guide
What's your situation?
"I have a tiny dataset (< 1000 samples)"
→ Batch GD is fine, or small mini-batch
"I have a normal dataset and GPU"
→ Mini-batch (32-128)
"I have huge data and limited memory"
→ Small mini-batch + gradient accumulation
"I need to train on streaming data"
→ SGD (true single-sample)
"I want maximum regularization"
→ Smaller mini-batch + data augmentation
"I don't know / just want it to work"
→ Mini-batch, batch_size=32
The Summary Table
| Aspect | Batch GD | Stochastic GD | Mini-Batch GD |
|---|---|---|---|
| Samples per update | All (N) | One (1) | Some (32-512) |
| Updates per epoch | 1 | N | N / batch_size |
| Gradient accuracy | Perfect | Very noisy | Pretty good |
| Memory usage | High | Minimal | Moderate |
| Convergence | Smooth | Chaotic | Smooth-ish |
| Speed | Slow | Fast* | Fast |
| GPU utilization | Good | Poor | Excellent |
| Can escape local minima | No | Yes | Sometimes |
| Used in practice | Rarely | Rarely | Always |
*SGD is fast in updates but slow on GPU due to poor parallelization.
Key Takeaways
Batch GD = All data, one update. Accurate but slow and memory-hungry.
Stochastic GD = One sample, one update. Fast but noisy and chaotic.
Mini-Batch GD = Small batch, one update. Best of both worlds.
Everyone uses mini-batch. It's the default. It's what "SGD" usually means.
Batch size matters. Affects speed, memory, stability, and generalization.
GPU loves batches. Parallel processing makes mini-batch faster than SGD.
Learning rate scales with batch size. Bigger batch → can use bigger learning rate.
The Hiker Analogy Summary
| Hiker | Strategy | Result |
|---|---|---|
| Perfectionist | Ask everyone (10,000 people) | Perfect direction but takes forever |
| Impulsive | Ask one random person | Fast but zigzags everywhere |
| Pragmatist | Ask a small group (32 people) | Fast AND mostly accurate |
The Pragmatist wins. That's why we use mini-batch.
What's Next?
Now that you understand the three gradient descent variants, you're ready for:
- Learning Rate Schedules — Changing step size over time
- Optimizers — Adam, RMSprop, and beyond
- Batch Normalization — Stabilizing training with batches
- Distributed Training — When one GPU isn't enough
Follow me for the next article in this series!
Let's Connect!
If this cleared up the batch/mini-batch/SGD confusion, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which batch size do you typically use? I'm curious!
The next time someone says "we're using SGD," you'll know to ask: "Batch, mini-batch, or true stochastic?" And you'll know why the answer is almost always mini-batch.
Share this with someone who's confused about why their training code uses DataLoader with a batch_size parameter. Now they'll understand!
Happy learning!