
Sachin Kr. Rajput


Batch vs Mini-Batch vs Stochastic Gradient Descent: Three Hikers, Three Strategies, One Mountain

The One-Line Summary: Batch uses all data (slow but accurate). Stochastic uses one sample (fast but noisy). Mini-batch uses a small group (best of both worlds). That's why everyone uses mini-batch.


Three Hikers, One Mountain

Three friends decide to hike down a foggy mountain. They can't see the valley below. They can only ask other hikers which way seems downhill.

But they each have a very different strategy.


Hiker 1: The Perfectionist

Before taking a single step, the Perfectionist finds every single hiker on the mountain. All 10,000 of them.

She asks each one: "Which way is downhill?"

She collects 10,000 opinions. She averages them. She calculates the perfect direction.

Then — and only then — she takes one step.

Then she does it all over again. Find 10,000 hikers. Ask all of them. Average. Step.

She's incredibly accurate. But by the time she reaches the bottom, her friends have been at the pub for three hours.


Hiker 2: The Impulsive

The Impulsive hiker has no patience. He grabs one random person and asks: "Which way is down?"

The stranger points vaguely left. The Impulsive takes a step.

He grabs another stranger. "Which way?" They point right. He steps right.

Another stranger. Another direction. Step. Stranger. Step. Stranger. Step.

His path is chaotic. He zigzags wildly. Sometimes he even goes uphill by accident.

But he's fast. Insanely fast. He takes thousands of steps while the Perfectionist is still conducting her survey.

And despite the chaos, he eventually stumbles into the valley.


Hiker 3: The Pragmatist

The Pragmatist thinks both her friends are crazy.

She finds a small group of 32 hikers. She asks them all. She averages their opinions.

She takes a step.

She finds another group of 32. Asks. Averages. Steps.

Not as accurate as asking everyone. Not as fast as asking one person. But way more balanced.

Her path is smooth-ish. Her speed is reasonable. She reaches the valley efficiently.


These three hikers are:

  • Batch Gradient Descent (The Perfectionist)
  • Stochastic Gradient Descent (The Impulsive)
  • Mini-Batch Gradient Descent (The Pragmatist)

Let's dive deeper into each one.


Batch Gradient Descent: The Perfectionist

How It Works

Batch gradient descent computes the gradient using the entire dataset before making a single update.

```python
for epoch in range(num_epochs):
    # Use ALL training data
    gradient = compute_gradient(entire_dataset)  # 10,000 samples

    # One update per epoch
    parameters = parameters - learning_rate * gradient
```

Every epoch, you:

  1. Run the entire dataset through the model
  2. Calculate one gradient (averaged over all samples)
  3. Make one parameter update
  4. Repeat

The Path It Takes

```
Loss
  │
  │\
  │ \
  │  \
  │   \
  │    \
  │     \
  │      \    ★
  │       \___/
  │
  └────────────────── Parameter

Path: Smooth, direct, no zigzags
Steps: Few (one per epoch)
Each step: Carefully calculated
```

Beautiful, right? The smoothest possible path to the minimum.

The Pros

| Advantage | Why |
| --- | --- |
| Stable convergence | Gradient is accurate (averaged over all data) |
| No noise | Every step is in the "true" best direction |
| Guaranteed to converge | For convex problems, will find the minimum |
| Predictable | Easy to debug, loss always decreases |

The Cons

| Disadvantage | Why |
| --- | --- |
| Painfully slow | Must process ALL data for ONE step |
| Memory hungry | Must load the entire dataset into memory |
| Can't escape local minima | Too smooth, no randomness to jump out |
| No online learning | Can't update as new data arrives |

When Batch GD Dies

Imagine your dataset has 1 billion samples. Each sample is an image.

To take ONE step:

  • Load 1 billion images into memory (impossible)
  • Compute gradient for each one
  • Average them
  • Update parameters once

Then repeat this for thousands of epochs.

You'll die of old age before your model trains.


Stochastic Gradient Descent (SGD): The Impulsive

How It Works

SGD computes the gradient using one random sample at a time.

```python
for epoch in range(num_epochs):
    shuffle(dataset)  # Randomize order

    for sample in dataset:  # One sample at a time
        # Use ONE training example
        gradient = compute_gradient(sample)

        # Update immediately
        parameters = parameters - learning_rate * gradient
```

Every sample, you:

  1. Compute gradient for that ONE sample
  2. Update parameters immediately
  3. Move to next sample

If you have 10,000 samples, you make 10,000 updates per epoch instead of just one.

The Path It Takes

```
Loss
  │
  │\        ↗
  │ ↘      ↗
  │  ↗    ↙
  │   ↘  ↗
  │    ↙↗
  │    ↘↙
  │     ★
  │
  └────────────────── Parameter

Path: Chaotic, zigzagging, noisy
Steps: Many (one per sample)
Each step: Quick and dirty
```

It's drunk. It's chaotic. But it's fast.

The Pros

| Advantage | Why |
| --- | --- |
| Blazing fast | Updates after every single sample |
| Memory efficient | Only one sample in memory at a time |
| Can escape local minima | Noise helps jump out of bad valleys |
| Online learning | Can learn from streaming data |
| Regularization effect | Noise prevents overfitting |

The Cons

| Disadvantage | Why |
| --- | --- |
| Noisy convergence | Gradient from one sample is unreliable |
| Zigzags wildly | Steps often go the wrong direction |
| Never truly settles | Keeps bouncing around the minimum |
| Hard to parallelize | Sequential by nature |

The Beautiful Chaos

Here's the weird thing: the noise is actually useful.

That random bouncing around? It can knock you out of bad local minima that batch GD would get stuck in.

```
Batch GD:                     SGD:

    │\  /\                        │\  /\
    │ \/  \                       │ \/  \
    │ ▼    \                      │      ↘ ← Noise kicks it out!
    │stuck  \                     │        ↘
    │        \_★                  │          ★
```

The Perfectionist would settle into the first dip. The Impulsive bounces right through it.


Mini-Batch Gradient Descent: The Pragmatist

How It Works

Mini-batch GD computes the gradient using a small batch of samples (typically 32, 64, 128, or 256).

```python
batch_size = 32

for epoch in range(num_epochs):
    shuffle(dataset)

    for batch in create_batches(dataset, batch_size):
        # Use a small batch (32 samples)
        gradient = compute_gradient(batch)

        # Update after each batch
        parameters = parameters - learning_rate * gradient
```

If you have 10,000 samples and batch size 32:

  • 10,000 / 32 = 312 full batches (plus one partial batch of 16)
  • ~313 updates per epoch (not 1, not 10,000)
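The pseudocode earlier leans on a `create_batches` helper that is never defined. A minimal sketch (the name comes from the pseudocode; the body is my own assumption) could look like:

```python
import numpy as np

def create_batches(dataset, batch_size, shuffle=True):
    """Yield mini-batches; the last one may be smaller than batch_size."""
    n = len(dataset)
    indices = np.random.permutation(n) if shuffle else np.arange(n)
    for start in range(0, n, batch_size):
        yield dataset[indices[start:start + batch_size]]

X = np.arange(100).reshape(100, 1)
batches = list(create_batches(X, batch_size=32, shuffle=False))
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Note the trailing partial batch: frameworks let you either keep it (as here) or drop it (`drop_last=True` in PyTorch's DataLoader).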

The Path It Takes

```
Loss
  │
  │\
  │ \
  │  ↘
  │   ↘
  │    ↘
  │     ↘   ★
  │       \_/
  │
  └────────────────── Parameter

Path: Slightly noisy but mostly smooth
Steps: Moderate (one per batch)
Each step: Reasonably accurate
```

Not as smooth as batch. Not as chaotic as SGD. Just right.

The Pros

| Advantage | Why |
| --- | --- |
| Fast | More updates than batch GD |
| Stable | More accurate than single-sample SGD |
| Memory efficient | Only load batch_size samples at a time |
| GPU friendly | Batches can be processed in parallel! |
| Some noise | Can still escape shallow local minima |
| Best of both worlds | Balances speed and accuracy |

The Cons

| Disadvantage | Why |
| --- | --- |
| One more hyperparameter | Need to choose a batch size |
| Not perfect | Still some noise (but usually acceptable) |
| Slightly more complex | Need to implement batching |

Why GPUs Love Mini-Batches

Here's a secret: GPUs are designed for batch processing.

A GPU can push a 32×1000 batch through a weight matrix almost as fast as a single 1×1000 vector. The parallelism is essentially free.

```
CPU:  Process sample 1... done. Sample 2... done. Sample 3...
GPU:  Process samples 1-32... done. All at once.
```

This is why mini-batch dominates deep learning. It's not just a compromise — it's actually faster than pure SGD on modern hardware.


Side-by-Side Comparison

Let's see all three approaches with the same dataset.

```python
import numpy as np

# Dataset: 1000 samples
np.random.seed(42)
X = np.random.randn(1000, 1)
y = 3 * X + 2 + np.random.randn(1000, 1) * 0.5

# Initialize
w, b = 0.0, 0.0
learning_rate = 0.01

def compute_gradient(X_batch, y_batch, w, b):
    predictions = X_batch * w + b
    error = predictions - y_batch
    dw = 2 * np.mean(error * X_batch)
    db = 2 * np.mean(error)
    return dw, db

def compute_loss(X, y, w, b):
    return np.mean((X * w + b - y) ** 2)
```

Batch Gradient Descent

```python
w, b = 0.0, 0.0
print("=== BATCH GRADIENT DESCENT ===")
print("Updates per epoch: 1")
print()

for epoch in range(100):
    # Use ALL data
    dw, db = compute_gradient(X, y, w, b)
    w = w - learning_rate * dw
    b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```

Output:

```
=== BATCH GRADIENT DESCENT ===
Updates per epoch: 1

Epoch 0: w=0.1192, b=0.0814, loss=9.4271
Epoch 20: w=2.6541, b=1.8732, loss=0.3124
Epoch 40: w=2.9642, b=1.9812, loss=0.2518
Epoch 60: w=2.9961, b=1.9974, loss=0.2487
Epoch 80: w=2.9994, b=1.9997, loss=0.2485
```

Smooth. Stable. But only 100 updates total.

Stochastic Gradient Descent

```python
w, b = 0.0, 0.0
print("=== STOCHASTIC GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X)}")
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))

    for i in indices:
        # Use ONE sample
        xi = X[i:i+1]
        yi = y[i:i+1]
        dw, db = compute_gradient(xi, yi, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```

Output:

```
=== STOCHASTIC GRADIENT DESCENT ===
Updates per epoch: 1000

Epoch 0: w=2.9847, b=1.9923, loss=0.2491
Epoch 20: w=3.0142, b=2.0087, loss=0.2486
Epoch 40: w=2.9891, b=1.9812, loss=0.2489
Epoch 60: w=3.0023, b=2.0156, loss=0.2487
Epoch 80: w=2.9967, b=1.9943, loss=0.2485
```

Notice: Converged in the FIRST epoch! 1000 updates vs batch's 1.

But also notice: The values keep bouncing around. It never truly settles.

Mini-Batch Gradient Descent

```python
w, b = 0.0, 0.0
batch_size = 32
print("=== MINI-BATCH GRADIENT DESCENT ===")
print(f"Updates per epoch: {len(X) // batch_size}")  # 31 full batches; a final partial batch of 8 also runs
print()

for epoch in range(100):
    # Shuffle data
    indices = np.random.permutation(len(X))

    for start in range(0, len(X), batch_size):
        # Use a BATCH of samples
        batch_idx = indices[start:start+batch_size]
        X_batch = X[batch_idx]
        y_batch = y[batch_idx]

        dw, db = compute_gradient(X_batch, y_batch, w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db

    if epoch % 20 == 0:
        loss = compute_loss(X, y, w, b)
        print(f"Epoch {epoch}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
```

Output:

```
=== MINI-BATCH GRADIENT DESCENT ===
Updates per epoch: 31

Epoch 0: w=2.8934, b=1.9234, loss=0.2612
Epoch 20: w=2.9987, b=1.9991, loss=0.2485
Epoch 40: w=3.0001, b=2.0003, loss=0.2485
Epoch 60: w=2.9998, b=1.9999, loss=0.2485
Epoch 80: w=3.0000, b=2.0001, loss=0.2485
```

Fast convergence (first few epochs) AND stable final values. The sweet spot.


The Noise Comparison

Let's visualize the different noise levels:

Gradient Accuracy Over Time

```
Batch GD (All 10,000 samples):
True direction: →
Estimated:      →  →  →  →  →  →  →  →
                   (Always correct)

SGD (1 sample):
True direction: →
Estimated:      ↗  ↙  →  ↖  ↘  ←  ↗  ↓
                   (Wildly inconsistent)

Mini-Batch (32 samples):
True direction: →
Estimated:      →  ↗  →  ↘  →  →  ↗  →
                   (Mostly correct, slight noise)
```

The gradient from 32 samples is much more reliable than from 1, but much faster to compute than 10,000.
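You can measure this directly: estimate the same gradient from many random batches of each size and compare the spread of the estimates. Here's a sketch using a linear-regression setup like the earlier code (the dataset and helper names here are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal(10_000)
y = 3 * X + 2 + rng.standard_normal(10_000) * 0.5

def grad_w(xb, yb, w=0.0, b=0.0):
    # d(MSE)/dw of a linear model, evaluated at (w, b)
    return 2 * np.mean((xb * w + b - yb) * xb)

stds = {}
for bs in (1, 32, 10_000):
    estimates = []
    for _ in range(500):
        idx = rng.integers(0, len(X), size=bs)  # random batch of size bs
        estimates.append(grad_w(X[idx], y[idx]))
    stds[bs] = float(np.std(estimates))
    print(f"batch size {bs:>6}: gradient std ≈ {stds[bs]:.3f}")
```

The spread shrinks roughly as 1/√(batch size): 32 samples cut the noise by about a factor of 5-6 relative to a single sample.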


Choosing the Batch Size

Batch size is a hyperparameter. Here's how to think about it:

Small Batch (8-32)

```
Pros:                          Cons:
+ More noise (regularization)  - Less stable
+ Can escape local minima      - Slower on GPU (underutilized)
+ Less memory needed           - More updates = more overhead
```

Medium Batch (64-256)

```
Pros:                          Cons:
+ Good balance                 - Jack of all trades
+ GPU efficient                - Master of none
+ Stable enough
```

Large Batch (512-4096)

```
Pros:                          Cons:
+ Very stable gradients        - May converge to sharp minima
+ Maximizes GPU utilization    - Needs lots of memory
+ Fewer updates needed         - Less regularization
                               - May need to adjust learning rate
```

The Rule of Thumb

| Situation | Recommended batch size |
| --- | --- |
| Just starting out | 32 or 64 |
| Limited GPU memory | Largest that fits |
| Want more regularization | Smaller (16-32) |
| Very large dataset | Larger (256-512) |
| Default / don't know | 32 |

The Learning Rate Connection

Here's something important: batch size and learning rate are linked.

Larger batch → More accurate gradient → Can use larger learning rate
Smaller batch → Noisier gradient → Need smaller learning rate

```python
# Square-root scaling heuristic:
# if you double the batch size, increase the learning rate by ~sqrt(2)

# Batch size 32,  LR = 0.001
# Batch size 64,  LR = 0.0014
# Batch size 128, LR = 0.002
# Batch size 256, LR = 0.0028
```

Note that the sqrt(2)-per-doubling numbers above follow square-root scaling. The better-known linear scaling rule is more aggressive (and in practice usually paired with a learning-rate warmup):

```
new_lr = base_lr × (new_batch_size / base_batch_size)
```
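Both rules fit in a couple of lines. A sketch (the function names are mine, not standard API):

```python
def linear_scaled_lr(base_lr, base_bs, new_bs):
    # Linear scaling rule: lr grows proportionally with batch size
    return base_lr * new_bs / base_bs

def sqrt_scaled_lr(base_lr, base_bs, new_bs):
    # Gentler square-root scaling: lr grows with sqrt of the ratio
    return base_lr * (new_bs / base_bs) ** 0.5

print(round(linear_scaled_lr(0.001, 32, 256), 4))  # 0.008
print(round(sqrt_scaled_lr(0.001, 32, 256), 4))    # 0.0028
```

Going from batch size 32 to 256 (8×), the linear rule multiplies the learning rate by 8, the square-root rule only by ~2.8. Treat both as starting points to tune, not laws.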

The Memory Math

Let's do some real numbers.

Scenario: Training a CNN on images

  • Image size: 224 × 224 × 3 = 150,528 floats
  • Float size: 4 bytes
  • Per image: ~600 KB
| Batch size | Memory for inputs | Plus activations, gradients… |
| --- | --- | --- |
| 1 | 600 KB | ~50 MB |
| 32 | 19 MB | ~1.6 GB |
| 128 | 77 MB | ~6.4 GB |
| 512 | 307 MB | ~25 GB |

This is why batch size is often limited by GPU memory, not by choice.
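The input column of the table is just arithmetic; a quick helper (my own, counting raw inputs only — activations and gradients multiply the real footprint many times over) reproduces it:

```python
def input_batch_mb(batch_size, height=224, width=224, channels=3, bytes_per_float=4):
    """Memory for the raw input tensor of one batch, in MB."""
    return batch_size * height * width * channels * bytes_per_float / 1e6

for bs in (1, 32, 128, 512):
    print(f"batch {bs:>3}: {input_batch_mb(bs):7.1f} MB of inputs")
```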

```python
# Common scenario
batch_size = 32   # What you want
# CUDA out of memory error!

batch_size = 16   # What you can actually fit
```

Advanced: Gradient Accumulation

What if you want the stability of large batches but only have memory for small ones?

Gradient accumulation: Run multiple small batches, accumulate gradients, then update.

```python
accumulation_steps = 4  # Effective batch size = 32 * 4 = 128

optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets) / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate in each parameter's .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update after 4 batches
        optimizer.zero_grad()
```

Now you get the gradient quality of batch_size=128, using only memory for batch_size=32.
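The equivalence is easy to check outside any framework: for a loss that averages over samples, averaging the gradients of four equal batches of 32 gives the same result as one gradient over the combined 128. A NumPy sketch (separate from the PyTorch code above; data and parameters are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(128)
y = 3 * X + 2
w, b = 0.5, 0.0  # arbitrary current parameters

def grad(xb, yb):
    # Gradient of mean squared error for a linear model at (w, b)
    err = xb * w + b - yb
    return 2 * np.mean(err * xb), 2 * np.mean(err)

# One large batch of 128
dw_big, db_big = grad(X, y)

# Four accumulated mini-batches of 32 (equal sizes, so a plain average is exact)
grads = [grad(X[i:i + 32], y[i:i + 32]) for i in range(0, 128, 32)]
dw_acc = float(np.mean([g[0] for g in grads]))
db_acc = float(np.mean([g[1] for g in grads]))

print(np.isclose(dw_big, dw_acc), np.isclose(db_big, db_acc))  # True True
```

(With unequal batch sizes, you'd need a weighted average; that's why the PyTorch pattern divides the loss by `accumulation_steps` before `backward()`.)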


Real-World Usage

What do practitioners actually use?

```python
# PyTorch
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(epochs):
    for inputs, targets in train_loader:  # Mini-batch!
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```
```python
# TensorFlow/Keras
model.fit(X, y, batch_size=32, epochs=100)  # Mini-batch!
```
```python
# fastai (batch size is set when building the DataLoaders, e.g. bs=64)
learn.fit_one_cycle(10, lr_max=3e-3)  # Mini-batch!
```

Everyone uses mini-batch. When people say "SGD" in deep learning, they almost always mean "mini-batch SGD."


The Convergence Visualization

Here's what training looks like with each method:

Training Loss Over Time

```
Batch GD:
Loss │ \
     │  \
     │   \
     │    \────────────
     └──────────────────── Time
       (Smooth but slow progress)

SGD:
Loss │\  /\ /\
     │ \/  v  \/\
     │         \/\/\  /\
     │              \/  \──
     └──────────────────── Time
       (Fast but noisy, never settles)

Mini-Batch:
Loss │\
     │ \\
     │  \~\
     │    ~\~~~~~~~~~~
     └──────────────────── Time
       (Fast AND eventually settles)
```

Quick Decision Guide

What's your situation?

```
"I have a tiny dataset (< 1000 samples)"
    → Batch GD is fine, or small mini-batch

"I have a normal dataset and GPU"
    → Mini-batch (32-128)

"I have huge data and limited memory"
    → Small mini-batch + gradient accumulation

"I need to train on streaming data"
    → SGD (true single-sample)

"I want maximum regularization"
    → Smaller mini-batch + data augmentation

"I don't know / just want it to work"
    → Mini-batch, batch_size=32
```

The Summary Table

| Aspect | Batch GD | Stochastic GD | Mini-Batch GD |
| --- | --- | --- | --- |
| Samples per update | All (N) | One (1) | Some (32-512) |
| Updates per epoch | 1 | N | N / batch_size |
| Gradient accuracy | Perfect | Very noisy | Pretty good |
| Memory usage | High | Minimal | Moderate |
| Convergence | Smooth | Chaotic | Smooth-ish |
| Speed | Slow | Fast* | Fast |
| GPU utilization | Good | Poor | Excellent |
| Can escape local minima | No | Yes | Sometimes |
| Used in practice | Rarely | Rarely | Always |

*SGD is fast in updates but slow on GPU due to poor parallelization.


Key Takeaways

  1. Batch GD = All data, one update. Accurate but slow and memory-hungry.

  2. Stochastic GD = One sample, one update. Fast but noisy and chaotic.

  3. Mini-Batch GD = Small batch, one update. Best of both worlds.

  4. Everyone uses mini-batch. It's the default. It's what "SGD" usually means.

  5. Batch size matters. Affects speed, memory, stability, and generalization.

  6. GPU loves batches. Parallel processing makes mini-batch faster than SGD.

  7. Learning rate scales with batch size. Bigger batch → can use bigger learning rate.


The Hiker Analogy Summary

| Hiker | Strategy | Result |
| --- | --- | --- |
| Perfectionist | Ask everyone (10,000 people) | Perfect direction but takes forever |
| Impulsive | Ask one random person | Fast but zigzags everywhere |
| Pragmatist | Ask a small group (32 people) | Fast AND mostly accurate |

The Pragmatist wins. That's why we use mini-batch.


What's Next?

Now that you understand the three gradient descent variants, you're ready for:

  • Learning Rate Schedules — Changing step size over time
  • Optimizers — Adam, RMSprop, and beyond
  • Batch Normalization — Stabilizing training with batches
  • Distributed Training — When one GPU isn't enough

Follow me for the next article in this series!


Let's Connect!

If this cleared up the batch/mini-batch/SGD confusion, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Which batch size do you typically use? I'm curious!


The next time someone says "we're using SGD," you'll know to ask: "Batch, mini-batch, or true stochastic?" And you'll know why the answer is almost always mini-batch.


Share this with someone who's confused about why their training code uses DataLoader with a batch_size parameter. Now they'll understand!

Happy learning!
