The One-Line Summary: Batch uses all data (slow but accurate). Stochastic uses one sample (fast but noisy). Mini-batch uses a small group (best of both worlds). That's why everyone uses mini-batch.
Three Hikers, One Mountain
Three friends decide to hike down a foggy mountain. They can't see the valley below. They can only ask other hikers which way seems downhill.
But they each have a very different strategy.
Hiker 1: The Perfectionist
Before taking a single step, the Perfectionist finds every single hiker on the mountain. All 10,000 of them.
She asks each one: "Which way is downhill?"
She collects 10,000 opinions. She averages them. She calculates the perfect direction.
Then — and only then — she takes one step.
Then she does it all over again. Find 10,000 hikers. Ask all of them. Average. Step.
She's incredibly accurate. But by the time she reaches the bottom, her friends have been at the pub for three hours.
Hiker 2: The Impulsive
The Impulsive hiker has no patience. He grabs one random person and asks: "Which way is down?"
The stranger points vaguely left. The Impulsive takes a step.
He grabs another stranger. "Which way?" They point right. He steps right.
Another stranger. Another direction. Step. Stranger. Step. Stranger. Step.
His path is chaotic. He zigzags wildly. Sometimes he even goes uphill by accident.
But he's fast. Insanely fast. He takes thousands of steps while the Perfectionist is still conducting her survey.
And despite the chaos, he eventually stumbles into the valley.
Hiker 3: The Pragmatist
The Pragmatist thinks both her friends are crazy.
She finds a small group of 32 hikers. She asks them all. She averages their opinions.
She takes a step.
She finds another group of 32. Asks. Averages. Steps.
Not as accurate as asking everyone. Not as fast as asking one person. But way more balanced.
Her path is smooth-ish. Her speed is reasonable. She reaches the valley efficiently.
These three hikers are:
- Batch Gradient Descent (The Perfectionist)
- Stochastic Gradient Descent (The Impulsive)
- Mini-Batch Gradient Descent (The Pragmatist)
Let's dive deeper into each one.
Batch Gradient Descent: The Perfectionist
How It Works
Batch gradient descent computes the gradient using the entire dataset before making a single update.
```python
for epoch in range(num_epochs):
    # Use ALL training data
    gradient = compute_gradient(entire_dataset)  # 10,000 samples

    # One update per epoch
    parameters = parameters - learning_rate * gradient
```
Every epoch, you:
- Run the entire dataset through the model
- Calculate one gradient (averaged over all samples)
- Make one parameter update
- Repeat
The Path It Takes
```
Loss
│
│\
│ \
│  \
│   \
│    \
│     \
│      \  ★
│       \___/
│
└────────────────── Parameter
```
Path: Smooth, direct, no zigzags
Steps: Few (one per epoch)
Each step: Carefully calculated
Beautiful, right? The smoothest possible path to the minimum.
The Pros
| Advantage | Why |
|---|---|
| Stable convergence | Gradient is accurate (averaged over all data) |
| No noise | Every step is in the "true" best direction |
| Guaranteed to converge | For convex problems, will find minimum |
| Predictable | Easy to debug, loss always decreases |
The Cons
| Disadvantage | Why |
|---|---|
| Painfully slow | Must process ALL data for ONE step |
| Memory hungry | Must load entire dataset into memory |
| Can't escape local minima | Too smooth, no randomness to jump out |
| No online learning | Can't update as new data arrives |
When Batch GD Dies
Imagine your dataset has 1 billion samples. Each sample is an image.
To take ONE step:
- Load 1 billion images into memory (impossible)
- Compute gradient for each one
- Average them
- Update parameters once
Then repeat this for thousands of epochs.
You'll die of old age before your model trains.
Stochastic Gradient Descent (SGD): The Impulsive
How It Works
SGD computes the gradient using one random sample at a time.
```python
for epoch in range(num_epochs):
    shuffle(dataset)  # Randomize order
    for sample in dataset:  # One sample at a time
        # Use ONE training example
        gradient = compute_gradient(sample)

        # Update immediately
        parameters = parameters - learning_rate * gradient
```
Every sample, you:
- Compute gradient for that ONE sample
- Update parameters immediately
- Move to next sample
If you have 10,000 samples, you make 10,000 updates per epoch instead of just one.
The Path It Takes
```
Loss
│
│\   ↗
│ ↘ ↗
│  ↗ ↙
│   ↘ ↗
│    ↙↗
│     ↘↙
│      ★
│
└────────────────── Parameter
```
Path: Chaotic, zigzagging, noisy
Steps: Many (one per sample)
Each step: Quick and dirty
It's drunk. It's chaotic. But it's fast.
The Pros
| Advantage | Why |
|---|---|
| Blazing fast | Updates after every single sample |
| Memory efficient | Only one sample in memory at a time |
| Can escape local minima | Noise helps jump out of bad valleys |
| Online learning | Can learn from streaming data |
| Regularization effect | Noise prevents overfitting |
The Cons
| Disadvantage | Why |
|---|---|
| Noisy convergence | Gradient from one sample is unreliable |
| Zigzags wildly | Steps often go the wrong direction |
| Never truly settles | Keeps bouncing around the minimum |
| Hard to parallelize | Sequential by nature |
The Beautiful Chaos
Here's the weird thing: the noise is actually useful.
That random bouncing around? It can knock you out of bad local minima that batch GD would get stuck in.
```
Batch GD:               SGD:
│\    /\                │\    /\
│ \  /  \               │ \  /  \
│  ▼     \              │  ↘      ← Noise kicks it out!
│ stuck   \             │    ↘
│          \_★          │      ★
```
The Perfectionist would settle into the first dip. The Impulsive bounces right through it.
Mini-Batch Gradient Descent: The Pragmatist
How It Works
Mini-batch GD computes the gradient using a small batch of samples (typically 32, 64, 128, or 256).
```python
batch_size = 32

for epoch in range(num_epochs):
    shuffle(dataset)
    for batch in create_batches(dataset, batch_size):
        # Use a small batch (32 samples)
        gradient = compute_gradient(batch)

        # Update after each batch
        parameters = parameters - learning_rate * gradient
```
If you have 10,000 samples and batch size 32:
- 10,000 / 32 = 312 full batches, plus one partial batch of 16 samples
- 313 updates per epoch (not 1, not 10,000)
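The `create_batches` helper in the pseudocode above isn't defined anywhere. A minimal sketch, assuming the dataset is an indexable sequence (the helper name and signature are illustrative, not a library API):

```python
import random

def create_batches(dataset, batch_size):
    """Yield shuffled batches from an indexable dataset.

    The final batch may be smaller than batch_size when the dataset
    size is not an exact multiple of it.
    """
    indices = list(range(len(dataset)))
    random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

batches = list(create_batches(list(range(100)), 32))
print(len(batches))  # 4 batches: three of 32 samples, one of 4
```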
The Path It Takes
```
Loss
│
│\
│ \
│  ↘
│   ↘
│    ↘
│     ↘  ★
│      \_/
│
└────────────────── Parameter
```
Path: Slightly noisy but mostly smooth
Steps: Moderate (one per batch)
Each step: Reasonably accurate
Not as smooth as batch. Not as chaotic as SGD. Just right.
The Pros
| Advantage | Why |
|---|---|
| Fast | More updates than batch GD |
| Stable | More accurate than single-sample SGD |
| Memory efficient | Only load batch_size samples at a time |
| GPU friendly | Batches can be processed in parallel! |
| Some noise | Can still escape shallow local minima |
| Best of both worlds | Balances speed and accuracy |
The Cons
| Disadvantage | Why |
|---|---|
| One hyperparameter | Need to choose batch size |
| Not perfect | Still some noise (but usually good) |
| Slightly more complex | Need to implement batching |
Why GPUs Love Mini-Batches
Here's a secret: GPUs are designed for batch processing.
A GPU can multiply a 32x1000 matrix just as fast as a 1x1000 matrix. The parallelism is essentially free.
```
CPU: Process sample 1... done. Sample 2... done. Sample 3...
GPU: Process samples 1-32... done. All at once.
```
This is why mini-batch dominates deep learning. It's not just a compromise — it's actually faster than pure SGD on modern hardware.
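We can't benchmark a GPU in a blog snippet, but even NumPy on a CPU shows the same effect: one batched matrix multiply beats a Python loop over individual samples. A sketch with arbitrary illustrative sizes:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000))     # a "layer" of weights
batch = rng.standard_normal((32, 1000))   # 32 samples

# One sample at a time: 32 separate vector-matrix products per pass
t0 = time.perf_counter()
for _ in range(100):
    looped = np.stack([x @ W for x in batch])
loop_time = time.perf_counter() - t0

# All 32 samples at once: a single matrix-matrix product per pass
t0 = time.perf_counter()
for _ in range(100):
    batched = batch @ W
batch_time = time.perf_counter() - t0

assert np.allclose(looped, batched)   # identical results...
print(f"looped : {loop_time:.4f}s")
print(f"batched: {batch_time:.4f}s")  # ...but batched is typically much faster
```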
Side-by-Side Comparison
Let's see all three approaches with the same dataset.
```python
import numpy as np

# Dataset: 1000 samples
np.random.seed(42)
X = np.random.randn(1000, 1)
y = 3 * X + 2 + np.random.randn(1000, 1) * 0.5

# Initialize
w, b = 0.0, 0.0
learning_rate = 0.01

def compute_gradient(X_batch, y_batch, w, b):
    predictions = X_batch * w + b
    error = predictions - y_batch
    dw = 2 * np.mean(error * X_batch)
    db = 2 * np.mean(error)
    return dw, db

def compute_loss(X, y, w, b):
    return np.mean((X * w + b - y) ** 2)
```
Batch Gradient Descent
```python
w, b = 0.0, 0.0
print("=== BATCH GRADIENT DESCENT ===")
print("Updates per epoch: 1")
print()

for epoch in range(100):
    # Use ALL data
    dw, db = compute_gradient(X, y, w, b)
    w = w - learning_rate * dw
    b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
```
=== BATCH GRADIENT DESCENT ===
Updates per epoch: 1

Epoch 0: w=0.1192, b=0.0814, loss=9.4271
Epoch 20: w=2.6841, b=1.8732, loss=0.3124
Epoch 40: w=2.9642, b=1.9812, loss=0.2518
Epoch 60: w=2.9961, b=1.9974, loss=0.2487
Epoch 80: w=2.9994, b=1.9997, loss=0.2485
```
Smooth. Stable. But only 100 updates total.
Stochastic Gradient Descent
```python
w, b = 0.0, 0.0
print("=== STOCHASTIC GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X)}")
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))
    for i in indices:
        # Use ONE sample
        xi = X[i:i+1]
        yi = y[i:i+1]
        dw, db = compute_gradient(xi, yi, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
```
=== STOCHASTIC GRADIENT DESCENT ===
Updates per epoch: 1000

Epoch 0: w=2.9847, b=1.9923, loss=0.2491
Epoch 20: w=3.0142, b=2.0087, loss=0.2486
Epoch 40: w=2.9891, b=1.9812, loss=0.2489
Epoch 60: w=3.0023, b=2.0156, loss=0.2487
Epoch 80: w=2.9967, b=1.9943, loss=0.2485
```
Notice: Converged in the FIRST epoch! 1000 updates vs batch's 1.
But also notice: The values keep bouncing around. It never truly settles.
Mini-Batch Gradient Descent
```python
w, b = 0.0, 0.0
batch_size = 32

print("=== MINI-BATCH GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X) // batch_size}")  # 31 full batches (+1 partial)
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        # Use a BATCH of samples
        batch_idx = indices[start:start+batch_size]
        X_batch = X[batch_idx]
        y_batch = y[batch_idx]
        dw, db = compute_gradient(X_batch, y_batch, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```
Output:
=== MINI-BATCH GRADIENT DESCENT ===
Updates per epoch: 31
Epoch 0: w=2.8934, b=1.9234, loss=0.2612
Epoch 20: w=2.9987, b=1.9991, loss=0.2485
Epoch 40: w=3.0001, b=2.0003, loss=0.2485
Epoch 60: w=2.9998, b=1.9999, loss=0.2485
Epoch 80: w=3.0000, b=2.0001, loss=0.2485
Fast convergence (first few epochs) AND stable final values. The sweet spot.
The Noise Comparison
Let's visualize the different noise levels:
```
Gradient Accuracy Over Time

Batch GD (all 10,000 samples):
True direction: →
Estimated:      → → → → → → → →
(Always correct)

SGD (1 sample):
True direction: →
Estimated:      ↗ ↙ → ↖ ↘ ← ↗ ↓
(Wildly inconsistent)

Mini-Batch (32 samples):
True direction: →
Estimated:      → ↗ → ↘ → → ↗ →
(Mostly correct, slight noise)
```
The gradient from 32 samples is much more reliable than from 1, but much faster to compute than 10,000.
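You can measure this directly. The sketch below (same kind of linear-regression setup as the examples above, with illustrative data) estimates the same gradient from batches of different sizes and compares the spread of the estimates:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal(10_000)
y = 3 * X + 2 + rng.standard_normal(10_000) * 0.5

def grad_w(idx, w=0.0, b=0.0):
    """Gradient of the MSE loss w.r.t. w, estimated from the samples in idx."""
    error = X[idx] * w + b - y[idx]
    return 2 * np.mean(error * X[idx])

# Spread of the gradient estimate for different batch sizes
for batch_size in (1, 32, 1024):
    estimates = [grad_w(rng.integers(0, len(X), batch_size))
                 for _ in range(1000)]
    print(f"batch size {batch_size:>5}: std of gradient estimate "
          f"= {np.std(estimates):.3f}")
# The std shrinks roughly like 1/sqrt(batch_size): the 32-sample
# estimate is far more reliable than the single-sample one.
```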
Choosing the Batch Size
Batch size is a hyperparameter. Here's how to think about it:
Small Batch (8-32)

Pros:
- More noise (acts as regularization)
- Can escape local minima
- Less memory needed

Cons:
- Less stable
- Slower on GPU (underutilized)
- More updates = more overhead

Medium Batch (64-256)

Pros:
- Good balance of speed and stability
- GPU efficient
- Stable enough

Cons:
- Jack of all trades, master of none

Large Batch (512-4096)

Pros:
- Very stable gradients
- Maximizes GPU utilization
- Fewer updates needed

Cons:
- May converge to sharp minima (worse generalization)
- Needs lots of memory
- Less regularization
- May need to adjust the learning rate
The Rule of Thumb
| Situation | Recommended Batch Size |
|---|---|
| Just starting out | 32 or 64 |
| Limited GPU memory | Largest that fits |
| Want more regularization | Smaller (16-32) |
| Very large dataset | Larger (256-512) |
| Default/don't know | 32 |
The Learning Rate Connection
Here's something important: batch size and learning rate are linked.
Larger batch → More accurate gradient → Can use larger learning rate
Smaller batch → Noisier gradient → Need smaller learning rate
Two heuristics are common. The conservative square-root rule: if you double the batch size, multiply the learning rate by roughly sqrt(2).

```python
# Square-root scaling heuristic
# Batch size 32,  LR = 0.001
# Batch size 64,  LR = 0.0014
# Batch size 128, LR = 0.002
# Batch size 256, LR = 0.0028
```

The more aggressive linear scaling rule (popularized by large-batch ImageNet training) scales the rate proportionally:

new_lr = base_lr × (new_batch_size / base_batch_size)

Either way, treat the scaled value as a starting point, not a guarantee, and validate it empirically.
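Both scaling heuristics are easy to express in code. A sketch (the function name `scale_lr` is illustrative, not a library API):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when changing batch size.

    'linear' follows the linear scaling rule (lr grows proportionally);
    'sqrt' is the more conservative square-root heuristic.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    elif rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

print(scale_lr(0.001, 32, 128, rule="linear"))  # 0.004
print(scale_lr(0.001, 32, 128, rule="sqrt"))    # 0.002
```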
The Memory Math
Let's do some real numbers.
Scenario: Training a CNN on images
- Image size: 224 × 224 × 3 = 150,528 floats
- Float size: 4 bytes
- Per image: ~600 KB
| Batch Size | Memory for Inputs | Plus activations, gradients... |
|---|---|---|
| 1 | 600 KB | ~50 MB |
| 32 | 19 MB | ~1.6 GB |
| 128 | 77 MB | ~6.4 GB |
| 512 | 307 MB | ~25 GB |
This is why batch size is often limited by GPU memory, not by choice.
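The arithmetic behind the table is easy to reproduce. A sketch for the raw input tensors only (the activations and gradients that dominate real training depend on the specific architecture, hence the rough estimates in the last column):

```python
def input_memory_mb(batch_size, height=224, width=224, channels=3,
                    bytes_per_float=4):
    """Memory for one batch of raw input images, in (decimal) megabytes."""
    per_image = height * width * channels * bytes_per_float  # 602,112 bytes
    return batch_size * per_image / 1e6

for bs in (1, 32, 128, 512):
    print(f"batch {bs:>4}: {input_memory_mb(bs):7.1f} MB of inputs")
```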
```python
# Common scenario
batch_size = 32   # What you want
# ...CUDA out of memory error!
batch_size = 16   # What you can actually fit
```
Advanced: Gradient Accumulation
What if you want the stability of large batches but only have memory for small ones?
Gradient accumulation: Run multiple small batches, accumulate gradients, then update.
```python
accumulation_steps = 4  # Effective batch size = 32 * 4 = 128

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets) / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate in .grad across iterations

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update once every 4 micro-batches
        optimizer.zero_grad()
```
Now you get the gradient quality of batch_size=128, using only memory for batch_size=32.
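For intuition, here is the same idea framework-free, as a NumPy sketch with illustrative data. The key fact it demonstrates: equally weighted micro-batch gradients add up to exactly the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 1))
y = 3 * X + 2

w, b, lr = 0.0, 0.0, 0.01
micro_batch, accumulation_steps = 32, 4  # effective batch = 128

acc_dw = acc_db = 0.0
for step in range(accumulation_steps):
    sl = slice(step * micro_batch, (step + 1) * micro_batch)
    error = X[sl] * w + b - y[sl]
    # Scale each micro-batch gradient so the accumulated sum equals
    # the average over the full effective batch
    acc_dw += 2 * np.mean(error * X[sl]) / accumulation_steps
    acc_db += 2 * np.mean(error) / accumulation_steps

# Sanity check: accumulated gradient == full-batch gradient at w=0, b=0
full_dw = 2 * np.mean(-y * X)
assert np.isclose(acc_dw, full_dw)

# One update, with the gradient quality of a 128-sample batch
w -= lr * acc_dw
b -= lr * acc_db
```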
Real-World Usage
What do practitioners actually use?
```python
# PyTorch
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(epochs):
    for inputs, targets in train_loader:  # Mini-batch!
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

# TensorFlow/Keras
model.fit(X, y, batch_size=32, epochs=100)  # Mini-batch!

# fast.ai -- batch size is set on the DataLoaders, not in fit_one_cycle
dls = ImageDataLoaders.from_folder(path, bs=64)  # Mini-batch!
learn = vision_learner(dls, resnet34)
learn.fit_one_cycle(10, lr_max=3e-3)
```
Everyone uses mini-batch. When people say "SGD" in deep learning, they almost always mean "mini-batch SGD."
The Convergence Visualization
Here's what training looks like with each method:
```
Training Loss Over Time

Batch GD:
Loss │ \
     │  \
     │   \
     │    \────────────
     └──────────────────── Time
     (Smooth but slow progress)

SGD:
Loss │\  /\    /\
     │ \/  v  \/\
     │   \/\/\  /\
     │        \/  \──
     └──────────────────── Time
     (Fast but noisy, never settles)

Mini-Batch:
Loss │\
     │ \\
     │  \~\
     │    ~\~~~~~~~~~~
     └──────────────────── Time
     (Fast AND eventually settles)
```
Quick Decision Guide
What's your situation?
"I have a tiny dataset (< 1000 samples)"
→ Batch GD is fine, or small mini-batch
"I have a normal dataset and GPU"
→ Mini-batch (32-128)
"I have huge data and limited memory"
→ Small mini-batch + gradient accumulation
"I need to train on streaming data"
→ SGD (true single-sample)
"I want maximum regularization"
→ Smaller mini-batch + data augmentation
"I don't know / just want it to work"
→ Mini-batch, batch_size=32
The Summary Table
| Aspect | Batch GD | Stochastic GD | Mini-Batch GD |
|---|---|---|---|
| Samples per update | All (N) | One (1) | Some (32-512) |
| Updates per epoch | 1 | N | N / batch_size |
| Gradient accuracy | Perfect | Very noisy | Pretty good |
| Memory usage | High | Minimal | Moderate |
| Convergence | Smooth | Chaotic | Smooth-ish |
| Speed | Slow | Fast* | Fast |
| GPU utilization | Good | Poor | Excellent |
| Can escape local minima | No | Yes | Sometimes |
| Used in practice | Rarely | Rarely | Always |
*SGD is fast in updates but slow on GPU due to poor parallelization.
Key Takeaways
Batch GD = All data, one update. Accurate but slow and memory-hungry.
Stochastic GD = One sample, one update. Fast but noisy and chaotic.
Mini-Batch GD = Small batch, one update. Best of both worlds.
Everyone uses mini-batch. It's the default. It's what "SGD" usually means.
Batch size matters. Affects speed, memory, stability, and generalization.
GPU loves batches. Parallel processing makes mini-batch faster than SGD.
Learning rate scales with batch size. Bigger batch → can use bigger learning rate.
The Hiker Analogy Summary
| Hiker | Strategy | Result |
|---|---|---|
| Perfectionist | Ask everyone (10,000 people) | Perfect direction but takes forever |
| Impulsive | Ask one random person | Fast but zigzags everywhere |
| Pragmatist | Ask a small group (32 people) | Fast AND mostly accurate |
The Pragmatist wins. That's why we use mini-batch.
What's Next?
Now that you understand the three gradient descent variants, you're ready for:
- Learning Rate Schedules — Changing step size over time
- Optimizers — Adam, RMSprop, and beyond
- Batch Normalization — Stabilizing training with batches
- Distributed Training — When one GPU isn't enough
Follow me for the next article in this series!
Let's Connect!
If this cleared up the batch/mini-batch/SGD confusion, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which batch size do you typically use? I'm curious!
The next time someone says "we're using SGD," you'll know to ask: "Batch, mini-batch, or true stochastic?" And you'll know why the answer is almost always mini-batch.
Share this with someone who's confused about why their training code uses DataLoader with a batch_size parameter. Now they'll understand!
Happy learning!