
Sachin Kr. Rajput


Gradient Descent: How to Find the Lowest Point in a Valley While Completely Blindfolded

The One-Line Summary: Gradient descent finds the best model by repeatedly asking "Which way is downhill?" and taking a step in that direction. That's it. That's the whole algorithm.


Lost in the Mountains

Picture this.

You're hiking in the Swiss Alps. Suddenly, a thick fog rolls in. You can't see more than two feet ahead. You're completely disoriented.

You need to get down to the village in the valley. But you can't see it. You can't see the path. You can't see anything.

How do you get down?

Here's what you do:

  1. Feel the ground beneath your feet
  2. Figure out which direction slopes downward
  3. Take a small step in that direction
  4. Repeat

You don't need to see the valley. You don't need a map. You just need to answer one question over and over:

"Which way is down?"

Eventually, step by step, you'll reach the bottom.


Congratulations. You just understood gradient descent.

This is literally how neural networks learn. They're lost in a fog of billions of possible parameter values. They can't see the "best" answer. But they can feel which direction makes things better.

And that's enough.


The Valley is the Loss Landscape

Remember loss functions from the previous article?

The loss function measures how wrong your model is. Now imagine plotting that loss:

  • X-axis: Model parameter (like a weight)
  • Y-axis: Loss value (how wrong you are)

What do you get? A curve. Often shaped like a valley.

Loss
  │
  │\                      /
  │ \                    /
  │  \                  /
  │   \                /
  │    \              /
  │     \            /
  │      \    ★    /
  │       \______/
  │         ↑
  │    Minimum (goal)
  └─────────────────────── Parameter Value

Your goal: Find the bottom of the valley (lowest loss).

The problem: You can't see the whole landscape. You only know the loss at your current position.

The solution: Feel which way is downhill. Step that way. Repeat.


The Gradient: Your Downhill Detector

Here's the key insight.

The gradient tells you which way is uphill.

So the negative gradient tells you which way is downhill.

What Is a Gradient?

In simple terms: the gradient is the slope at your current position.

  • Positive slope → Ground is rising to the right
  • Negative slope → Ground is falling to the right
  • Zero slope → You're at a flat point (maybe the bottom!)
        Slope = Positive              Slope = Negative
        (Going uphill →)              (Going downhill →)

              /                              \
             /                                \
            /                                  \
           / • You are here          You are • \
          /                            here     \

The Math (Don't Panic)

The gradient is just a derivative. It answers: "If I move the parameter a tiny bit, how much does the loss change?"

Gradient = ∂Loss / ∂parameter

Translation: "How sensitive is the loss to changes in this parameter?"
  • Large gradient → Small changes cause big loss changes (steep slope)
  • Small gradient → Small changes cause small loss changes (gentle slope)
  • Zero gradient → You're at a minimum (or maximum, or saddle point)
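
You don't have to take calculus's word for it; you can check the gradient numerically. Here's a minimal sketch (the loss curve and the step size h are made up for illustration) that nudges the parameter a tiny bit and watches the loss:

def loss(w):
    # A made-up loss curve for illustration: a parabola with its minimum at w = 3
    return (w - 3) ** 2

def numerical_gradient(f, w, h=1e-5):
    # Finite-difference approximation of ∂Loss/∂parameter
    return (f(w + h) - f(w - h)) / (2 * h)

print(numerical_gradient(loss, 5.0))  # ~ 4.0: positive slope, uphill is to the right
print(numerical_gradient(loss, 1.0))  # ~ -4.0: negative slope, uphill is to the left
print(numerical_gradient(loss, 3.0))  # ~ 0.0: flat ground, we're at the minimum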

The Algorithm: Stupidly Simple

Here's gradient descent in four lines:

1. Start somewhere (random parameters)
2. Calculate the gradient (which way is uphill?)
3. Take a step in the opposite direction (go downhill)
4. Repeat until you reach the bottom

That's it. That's the algorithm that powers everything from spam filters to ChatGPT.

Let me make it even more concrete.

The Update Rule

new_parameter = old_parameter - learning_rate × gradient

Let's break this down:

  • old_parameter: Where you are now
  • gradient: The slope (which way is uphill)
  • learning_rate: How big of a step to take
  • new_parameter: Where you end up

The minus sign is crucial — it makes you go opposite to the gradient, which means downhill.
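
To make that minus sign concrete, here's a single hand-worked update (every number invented for illustration):

old_parameter = 2.0
gradient = 8.0        # positive: uphill is to the right
learning_rate = 0.1

new_parameter = old_parameter - learning_rate * gradient
print(new_parameter)  # 1.2: we stepped LEFT, opposite the gradient, i.e. downhill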


A Walk Through the Valley

Let me show you gradient descent happening step by step.

Starting Position

Loss
  │
  │\
  │ \                    /
  │  \                  /
  │   • ← You start here (random)
  │    \              /
  │     \            /
  │      \    ★    /
  │       \______/
  │
  └─────────────────────── Parameter

You're on the left slope. The gradient here is negative (downhill is to the right).

Step 1

Gradient says: "Uphill is to the LEFT"
You go: RIGHT (opposite direction)

Loss
  │
  │\
  │ \                    /
  │  \                  /
  │   \               /
  │    • ← After step 1
  │     \            /
  │      \    ★    /
  │       \______/
  │
  └─────────────────────── Parameter

Loss decreased! You moved closer to the bottom.

Step 2

Gradient still says: "Uphill is to the LEFT"
You go: RIGHT again

Loss
  │
  │\
  │ \                    /
  │  \                  /
  │   \               /
  │    \             /
  │     \   •      /  ← After step 2
  │      \    ★   /
  │       \______/
  │
  └─────────────────────── Parameter

Even closer!

Step 3, 4, 5...

You keep going until the gradient is nearly zero (flat ground = bottom of valley).

Loss
  │
  │\
  │ \                    /
  │  \                  /
  │   \               /
  │    \             /
  │     \           /
  │      \    •   /  ← Almost there!
  │       \__★___/
  │
  └─────────────────────── Parameter

You found the minimum! The model is now as good as it can be (for this landscape).


The Learning Rate: Step Size Matters

Here's where things get interesting.

The learning rate controls how big each step is. And choosing it correctly is crucial.

Too Small: The Snail

Learning rate = 0.0001

Loss
  │
  │\
  │ \                    /
  │  • → • → • → • → •  /   ← Tiny steps
  │   \               /
  │    \             /
  │     \           /
  │      \    ★   /
  │       \______/
  │
  └─────────────────────── Parameter

Result: Takes forever to reach the bottom. Might never get there.

Problem: Training is painfully slow. You'll grow old waiting.

Too Large: The Drunk Hiker

Learning rate = 10

Loss
  │
  │\          •         /
  │ \        ↗ ↖       /
  │  •      ↗   ↖     /
  │   \    ↗     ↖   /
  │    \  ↗       ↖ /
  │     ↗           •
  │      \    ★   /
  │       \______/
  │
  └─────────────────────── Parameter

Result: Overshoots! Bounces back and forth. Never settles.

Problem: You step right over the minimum. Then overcorrect. Then overcorrect again. Chaos.

Just Right: The Goldilocks Zone

Learning rate = 0.01

Loss
  │
  │\
  │ \                    /
  │  •                  /
  │   ↘               /
  │    •             /
  │     ↘           /
  │      •    ★   /
  │       \__•___/  ← Converged nicely
  │
  └─────────────────────── Parameter

Result: Steady progress. Reaches the bottom efficiently.

Finding the right learning rate is part art, part science. Too small = slow. Too big = unstable. Just right = magic.
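
Want to watch all three personalities for yourself? A quick sketch on an x² valley (the learning rates are picked purely to show the effect):

def gradient(x):
    return 2 * x  # slope of f(x) = x**2

for lr in (0.001, 0.1, 1.01):  # snail, Goldilocks, drunk hiker
    x = 10.0
    for _ in range(30):
        x = x - lr * gradient(x)
    print(f"lr = {lr}: x after 30 steps = {x:.4f}")

# lr = 0.001 barely moves (x is still ~9.4)
# lr = 0.1 converges nicely (x is ~0.01)
# lr = 1.01 overshoots, and each bounce gets BIGGER (x is ~18 and growing)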


Let's See It In Code

Time to make this real.

The Simplest Example

Let's minimize a simple function: f(x) = x²

The minimum is obviously at x = 0. But let's pretend we don't know that.

import numpy as np

# Our function: f(x) = x²
def loss(x):
    return x ** 2

# Gradient of f(x) = x² is 2x
def gradient(x):
    return 2 * x

# Gradient Descent
x = 10.0  # Start somewhere random
learning_rate = 0.1
history = [x]

for step in range(50):
    grad = gradient(x)
    x = x - learning_rate * grad  # The magic update rule!
    history.append(x)

    if step % 10 == 0:
        print(f"Step {step}: x = {x:.6f}, loss = {loss(x):.6f}")

print(f"\nFinal: x = {x:.6f} (should be close to 0)")

Output:

Step 0: x = 8.000000, loss = 64.000000
Step 10: x = 0.858993, loss = 0.737870
Step 20: x = 0.092234, loss = 0.008507
Step 30: x = 0.009904, loss = 0.000098
Step 40: x = 0.001063, loss = 0.000001

Final: x = 0.000143 (should be close to 0)

Starting from x = 10, gradient descent found its way to x ≈ 0 (the minimum) automatically!


Visualizing the Journey

import matplotlib.pyplot as plt

# Plot the loss landscape
x_range = np.linspace(-10, 10, 100)
y_range = x_range ** 2

plt.figure(figsize=(10, 6))
plt.plot(x_range, y_range, 'b-', label='Loss function (x²)')
plt.plot(history, [h**2 for h in history], 'ro-', label='Gradient descent path')
plt.xlabel('Parameter (x)')
plt.ylabel('Loss')
plt.title('Gradient Descent Finding the Minimum')
plt.legend()
plt.show()

You'd see a red path bouncing down the parabola to the bottom. Beautiful!


Gradient Descent for Linear Regression

Now let's do a real ML example:

import numpy as np

# Generate fake data: y = 3x + 2 + noise
np.random.seed(42)
X = np.random.randn(100)
y_true = 3 * X + 2 + np.random.randn(100) * 0.5

# Our model: y = wx + b
# We need to learn w and b

w = 0.0  # Initialize weight
b = 0.0  # Initialize bias
learning_rate = 0.1
n = len(X)

print("Starting: w = 0, b = 0")
print("Target:   w ≈ 3, b ≈ 2\n")

for epoch in range(100):
    # Forward pass: predictions
    y_pred = w * X + b

    # Calculate loss (MSE)
    loss = np.mean((y_pred - y_true) ** 2)

    # Calculate gradients
    dw = (2/n) * np.sum((y_pred - y_true) * X)  # ∂Loss/∂w
    db = (2/n) * np.sum(y_pred - y_true)         # ∂Loss/∂b

    # Update parameters (gradient descent!)
    w = w - learning_rate * dw
    b = b - learning_rate * db

    if epoch % 20 == 0:
        print(f"Epoch {epoch}: w = {w:.4f}, b = {b:.4f}, loss = {loss:.4f}")

print(f"\nFinal: w = {w:.4f}, b = {b:.4f}")
print(f"True:  w = 3.0000, b = 2.0000")

Output:

Starting: w = 0, b = 0
Target:   w ≈ 3, b ≈ 2

Epoch 0: w = 0.5765, b = 0.3909, loss = 11.7583
Epoch 20: w = 2.8638, b = 1.9476, loss = 0.2521
Epoch 40: w = 2.9742, b = 1.9893, loss = 0.2319
Epoch 60: w = 2.9948, b = 1.9972, loss = 0.2304
Epoch 80: w = 2.9987, b = 1.9993, loss = 0.2302

Final: w = 2.9996, b = 1.9998
True:  w = 3.0000, b = 2.0000

Gradient descent learned the correct values just by following the slope downhill!


Variants of Gradient Descent

The basic algorithm has some problems. Smart people invented solutions.

Batch Gradient Descent

What: Calculate gradient using ALL training data.

Pro: Stable, accurate gradient.
Con: Slow for large datasets.

# Uses entire dataset
gradient = compute_gradient(entire_dataset)
parameters = parameters - learning_rate * gradient

Stochastic Gradient Descent (SGD)

What: Calculate gradient using ONE random sample.

Pro: Fast! Can escape local minima.
Con: Noisy, zigzags a lot.

# Uses single sample
for sample in dataset:
    gradient = compute_gradient(sample)
    parameters = parameters - learning_rate * gradient
Loss
  │
  │  Batch GD          SGD
  │  (Smooth)          (Noisy)
  │
  │    ↘                ↘ ↗
  │     ↘              ↙ ↘
  │      ↘           ↘ ↗ ↘
  │       ↘        ↙ ↘ ↘
  │        ★         ★
  └─────────────────────────

Mini-Batch Gradient Descent

What: Calculate gradient using a SMALL BATCH (e.g., 32 samples).

Pro: Best of both worlds! Fast AND stable.
Con: Need to choose batch size.

# Uses mini-batches (most common in practice!)
for batch in create_batches(dataset, batch_size=32):
    gradient = compute_gradient(batch)
    parameters = parameters - learning_rate * gradient

This is what everyone uses in deep learning. When people say "SGD" in practice, they usually mean mini-batch.
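
One caveat: create_batches in the snippet above is a stand-in, not a library function. A minimal sketch of what such a helper might look like with NumPy (shuffle the indices, then slice):

import numpy as np

def create_batches(X, y, batch_size=32):
    # New random order every epoch, so batches differ between passes
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

# Usage: for X_batch, y_batch in create_batches(X, y_true): ...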


The Problem of Local Minima

Here's a scary truth.

Not all valleys lead to the global minimum.

Loss
  │
  │\        /\
  │ \      /  \        /
  │  \    /    \      /
  │   \  /      \    /
  │    \/        \  /
  │    ↑          \/
  │  Local        ↑
  │  Minimum    Global
  │             Minimum
  └─────────────────────── Parameter

If you start on the left, you might get stuck in the local minimum, never finding the true best answer.

Solutions

1. Random Restarts: Try multiple starting points.

2. Momentum: Build up speed to roll through small valleys.

3. Learning Rate Schedules: Start with big steps (explore), then small steps (settle).

4. Advanced Optimizers: Adam, RMSprop, etc. (covered in next section).

In practice, for neural networks with millions of parameters, local minima are less of a problem than you'd think. The landscape is so high-dimensional that most "valleys" are actually saddle points with escape routes.
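
You can watch the starting-point lottery play out with a made-up double-well function, f(x) = x⁴ - 4x² + x. Its global minimum sits near x ≈ -1.47, with a shallower local minimum near x ≈ 1.35. A sketch:

def gradient(x):
    # slope of f(x) = x**4 - 4*x**2 + x
    return 4 * x**3 - 8 * x + 1

for start in (-2.0, 2.0):
    x = start
    for _ in range(500):
        x = x - 0.01 * gradient(x)
    print(f"start = {start:+.1f}  ->  ended at x = {x:.3f}")

# start = -2.0 rolls into the global minimum (x ≈ -1.47)
# start = +2.0 gets trapped in the local minimum (x ≈ 1.35)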


Momentum: The Rolling Ball

Imagine a ball rolling down a valley.

Without momentum: The ball stops instantly when you stop pushing.

With momentum: The ball keeps rolling, building up speed.

velocity = 0
momentum = 0.9

for epoch in range(epochs):
    gradient = compute_gradient(parameters)

    # Update velocity (builds up over time)
    velocity = momentum * velocity + learning_rate * gradient

    # Update parameters
    parameters = parameters - velocity

Why it helps:

  • Faster convergence (builds up speed on consistent slopes)
  • Can escape shallow local minima (momentum carries you through)
  • Reduces zigzagging (smooths out noisy gradients)
Without Momentum         With Momentum
     ↘ ↗                      ↘
    ↙ ↘                        ↘
   ↘ ↗ ↘                        ↘
    ↙ ↘                          ↘
      ★                           ★
  (Zigzag)                   (Smooth)
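
Here's that pseudocode made concrete on the x² valley, a sketch with illustrative hyperparameters comparing plain gradient descent and momentum on the same step budget:

def gradient(x):
    return 2 * x  # slope of f(x) = x**2

learning_rate, beta = 0.01, 0.9

# Plain gradient descent
x = 10.0
for _ in range(200):
    x = x - learning_rate * gradient(x)
print(f"Plain GD after 200 steps: x = {x:.5f}")

# Gradient descent with momentum (same learning rate, same step count)
x, velocity = 10.0, 0.0
for _ in range(200):
    velocity = beta * velocity + learning_rate * gradient(x)
    x = x - velocity
print(f"With momentum:            x = {x:.5f}")

# Momentum lands orders of magnitude closer to 0 on the same budget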

Modern Optimizers

Gradient descent has evolved. Here are the popular variants:

Optimizer        Key Idea                             When to Use
SGD              Basic + mini-batches                 Simple problems, when you want control
SGD + Momentum   Adds velocity                        When SGD is too slow
AdaGrad          Adapts learning rate per parameter   Sparse data (NLP)
RMSprop          Fixes AdaGrad's decay problem        RNNs, non-stationary problems
Adam             RMSprop + Momentum                   Default choice! Works great almost everywhere
AdamW            Adam + weight decay fix              Transformers, modern deep learning

The Go-To Choice: Adam

Adam is like gradient descent with superpowers:

  • Adapts learning rate for each parameter
  • Uses momentum
  • Handles sparse gradients
# In Keras/TensorFlow
model.compile(optimizer='adam', loss='mse')

# In PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

When in doubt, use Adam. It's not always the best, but it's almost never bad.
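
Under the hood, Adam's update is only a few lines. Here's a from-scratch sketch of the standard update rule (the betas and epsilon are the usual defaults):

import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the step count, starting at 1
    m = beta1 * m + (1 - beta1) * grads           # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grads ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return params, m, v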


Visualizing in Higher Dimensions

We've been drawing 2D pictures. But real neural networks have millions of parameters.

Imagine gradient descent, but instead of finding the lowest point on a line, you're finding the lowest point in a space with 175 billion dimensions (like GPT-3).

Your brain can't visualize that. Neither can mine.

But the math doesn't care. The gradient still points uphill. You still go the opposite way. You still reach a minimum.

2D: Walk down a valley
3D: Walk down a bowl-shaped surface
1,000,000D: Walk down a... thing. 
            It works the same way. Trust the math.
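
And the code barely changes. The exact same update rule from the 1D example works untouched when the parameter is a vector with a million entries; here's a sketch:

import numpy as np

# f(w) = sum(w**2) is a million-dimensional bowl; its gradient is 2*w
w = np.random.randn(1_000_000)
learning_rate = 0.1

for _ in range(50):
    w = w - learning_rate * (2 * w)  # same update rule, just vectorized

print(np.abs(w).max())  # every one of the million parameters is now near 0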

Common Mistakes

Mistake 1: Learning Rate Too High

Symptom: Loss explodes to infinity or NaN.

# WRONG
optimizer = Adam(lr=1.0)  # Way too high for most problems

# RIGHT
optimizer = Adam(lr=0.001)  # Safe default

Mistake 2: Learning Rate Too Low

Symptom: Loss decreases incredibly slowly. Training takes forever.

# WRONG
optimizer = Adam(lr=0.0000001)  # Glacial progress

# RIGHT
optimizer = Adam(lr=0.001)  # Or use learning rate finder

Mistake 3: Not Shuffling Data

Symptom: Model learns patterns in the order of data, not the actual patterns.

# WRONG
model.fit(X, y, shuffle=False)

# RIGHT
model.fit(X, y, shuffle=True)  # Default in most frameworks

Mistake 4: Forgetting to Zero Gradients (PyTorch)

# WRONG - gradients accumulate across batches!
for inputs, targets in dataloader:
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# RIGHT
for inputs, targets in dataloader:
    optimizer.zero_grad()  # Reset gradients from the previous step!
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

The Complete Picture

Let me connect everything:

┌─────────────────────────────────────────────────────┐
│              THE LEARNING LOOP                      │
├─────────────────────────────────────────────────────┤
│                                                     │
│   1. FORWARD PASS                                   │
│      Input → Model → Prediction                     │
│                                                     │
│   2. LOSS CALCULATION                               │
│      Loss = f(Prediction, Actual)                   │
│      "How wrong are we?"                            │
│                                                     │
│   3. BACKWARD PASS (Backpropagation)                │
│      Calculate gradients for all parameters         │
│      "Which way is uphill for each weight?"         │
│                                                     │
│   4. PARAMETER UPDATE (Gradient Descent)            │
│      parameters = parameters - lr × gradient        │
│      "Take a step downhill"                         │
│                                                     │
│   5. REPEAT until loss is low enough                │
│                                                     │
└─────────────────────────────────────────────────────┘

Loss function tells you how wrong you are.
Gradient tells you which way is uphill.
Gradient descent takes you downhill.
Learning rate controls your step size.
Repeat until you reach the bottom.

That's machine learning. That's how ChatGPT learned to talk. That's how image classifiers learned to see.

A blindfolded hiker, feeling the slope, taking small steps downhill, eventually reaching the valley.


Key Takeaways

  1. Gradient descent = Repeatedly step opposite to the gradient (downhill)
  2. Gradient = The slope, tells you which way is uphill
  3. Learning rate = Step size (too small = slow, too big = chaos)
  4. Mini-batch SGD = Standard practice, uses small batches
  5. Momentum = Build up speed, avoid zigzagging
  6. Adam = Go-to optimizer, works almost everywhere
  7. Local minima = A risk, but less scary in high dimensions

The Analogy Summary

Concept            Analogy
Loss landscape     A foggy mountain valley
Gradient           The slope you feel underfoot
Gradient descent   Walking downhill step by step
Learning rate      How big your steps are
Local minimum      A small dip that traps you
Momentum           A rolling ball that builds speed
Global minimum     The true bottom of the valley

What's Next?

Now that you understand gradient descent, you're ready for:

  • Backpropagation — How gradients are calculated in neural networks
  • Learning Rate Schedules — Changing step size during training
  • Advanced Optimizers — Adam, AdamW, and beyond
  • Batch Normalization — Making the landscape easier to navigate

Follow me for the next article in this series!


Let's Connect!

If this made gradient descent click, drop a heart!

Questions? Ask in the comments — I respond to every one.

Still confused about something? Let me know. I'll try to explain it differently.


Every neural network that ever learned anything — from MNIST digit classifiers to GPT-4 — did so by feeling the slope and taking small steps downhill. That's gradient descent. Simple, powerful, everywhere.


Share this with someone who finds ML optimization intimidating. Sometimes all you need is a good hiking analogy.

Happy learning!
