Vineet Chauhan

Gradient Descent: The Engine That Made Deep Learning Possible

How one simple idea changed the way machines learn

When I first started learning deep learning, I thought the magic was inside the model architecture.

CNNs looked powerful.

RNNs looked intelligent.

Transformers looked almost impossible to understand.

But slowly I realized something important.

Architecture is only one part of deep learning.

The real question is:

How does the model actually learn?

A neural network may contain millions or even billions of parameters.

But at the beginning, all those parameters are almost useless.

They are usually random numbers.

The model does not know anything.

It does not understand images.

It does not understand language.

It does not understand patterns.

So the real magic is not that a neural network has many parameters.

The real magic is that it can adjust those parameters automatically.

That automatic adjustment is made possible by Gradient Descent.


The Basic Problem

Suppose we are training a simple model.

y_pred = w * x + b

Here:

w = weight
b = bias

At first, the model makes wrong predictions.

So we calculate the error, here as a squared difference.

loss = (y_actual - y_pred) ** 2

The goal is simple:

Reduce the loss

But the question is:

How should w and b change?

Should the weight increase?

Should the weight decrease?

By how much?

This is where Gradient Descent comes in.
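
One rough way to answer those questions is to nudge w a little and watch what the loss does. The sketch below does that for a single made-up data point (x = 2, y_actual = 10, with b held at 0). The helper name loss_at and the nudge size eps are assumptions chosen for illustration, not part of the original example.

```python
# Probe how the loss reacts to a small change in w (finite differences).
# The data point and the starting weight are illustrative assumptions.
x, y_actual = 2.0, 10.0
w, b = 1.0, 0.0

def loss_at(w, b):
    """Squared error of the linear model for the single data point."""
    y_pred = w * x + b
    return (y_actual - y_pred) ** 2

eps = 1e-4  # size of the nudge
slope = (loss_at(w + eps, b) - loss_at(w - eps, b)) / (2 * eps)

# A negative slope means that increasing w lowers the loss.
direction = "increase" if slope < 0 else "decrease"
print(f"slope of loss w.r.t. w: {slope:.2f}, so {direction} w")
```

The sign of the slope gives the direction, and its size hints at how strongly the loss reacts. Gradient Descent uses the exact slope instead of this numerical estimate.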


What Gradient Descent Really Means

Gradient Descent simply means:

Move the parameters in the direction where loss decreases.

Imagine standing on a mountain in fog.

You cannot see the full path.

You only know the slope under your feet.

So you move downward step by step.

That is Gradient Descent.

In machine learning:

Mountain height = Loss
Position = Parameters
Downward direction = Negative gradient

The formula is:

new_weight = old_weight - learning_rate × gradient

or:

w = w - lr * dw
b = b - lr * db

This small formula is one of the biggest reasons deep learning works.
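
To see the update rule walk downhill, here is a minimal sketch on a one-dimensional "mountain". The bowl-shaped loss (w - 3)², the starting point, and the learning rate are all assumptions picked for illustration.

```python
# Gradient descent on a one-dimensional "mountain": loss(w) = (w - 3)^2.
# The bowl shape, start point, and learning rate are illustrative assumptions.

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w: the slope under your feet.
    return 2.0 * (w - 3.0)

w = -4.0   # starting position on the mountain
lr = 0.1   # step size

for step in range(50):
    # new_weight = old_weight - learning_rate * gradient
    w = w - lr * gradient(w)

print(f"w after 50 steps: {w:.4f}, loss: {loss(w):.8f}")
```

The loop never looks at the whole curve. It only uses the local slope, yet w ends up very close to 3, the bottom of the bowl.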


A Very Small Numerical Example

Suppose our model is:

y_pred = w * x

Let:

x = 2
y_actual = 10
w = 1

Prediction:

y_pred = 1 × 2 = 2

Loss:

loss = (10 - 2)² = 64

The prediction is too small.

So the weight should increase.

Gradient Descent tells us the direction of the update and, together with the learning rate, how large it should be.

For squared error:

loss = (y - wx)²

Gradient with respect to weight:

dL/dw = -2x(y - wx)
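
This gradient follows from the chain rule. A short derivation, where the symbol e for the residual is introduced here only for illustration:

```latex
\begin{aligned}
L &= (y - wx)^2, \qquad e = y - wx \\
\frac{dL}{dw} &= \frac{dL}{de} \cdot \frac{de}{dw} = 2e \cdot (-x) = -2x\,(y - wx)
\end{aligned}
```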

Now substitute values:

dL/dw = -2 × 2 × (10 - 2)
      = -4 × 8
      = -32

Let the learning rate be:

lr = 0.1

Update:

new_w = old_w - lr × gradient
new_w = 1 - 0.1 × (-32)
new_w = 1 + 3.2
new_w = 4.2

Now the prediction becomes:

y_pred = 4.2 × 2 = 8.4

Earlier, the prediction was 2.

Now it is 8.4.

Much closer to 10.

That is learning.

Not memorization.

Not magic.

Just repeated improvement.
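
The arithmetic above can be replayed in a few lines. This sketch uses the same numbers (x = 2, y_actual = 10, w = 1, lr = 0.1) and simply keeps going for a few more steps. The history list is an addition for printing, not part of the original example.

```python
# Replay the worked example: y_pred = w * x with x = 2, y_actual = 10, w = 1.
x, y_actual = 2.0, 10.0
w = 1.0
lr = 0.1

history = []
for step in range(5):
    y_pred = w * x
    loss = (y_actual - y_pred) ** 2
    grad = -2 * x * (y_actual - y_pred)   # dL/dw = -2x(y - wx)
    history.append((step, w, y_pred, loss, grad))
    w = w - lr * grad                     # gradient descent update

for step, w_i, y_pred, loss, grad in history:
    print(f"step {step}: w={w_i:.4f}  y_pred={y_pred:.4f}  loss={loss:.4f}  grad={grad:.4f}")
```

The first row reproduces the hand calculation (loss 64, gradient -32), and the second row starts from w = 4.2. After that, w keeps creeping toward 5, the value where 5 × 2 = 10.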


Code: Gradient Descent From Scratch

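import numpy as np
import matplotlib.pyplot as plt  # used for the loss curve in the next section

# Toy dataset that follows y = 2x exactly
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Start with no knowledge: weight and bias at zero
w = 0.0
b = 0.0

lr = 0.01       # learning rate (step size)
epochs = 1000   # number of passes over the data

losses = []     # loss per epoch, kept for plotting

for epoch in range(epochs):
    # Forward pass: current predictions
    y_pred = w * x + b

    # Mean squared error over all data points
    loss = np.mean((y - y_pred) ** 2)
    losses.append(loss)

    # Gradients of the mean squared error with respect to w and b
    dw = (-2 / len(x)) * np.sum(x * (y - y_pred))
    db = (-2 / len(x)) * np.sum(y - y_pred)

    # Gradient descent update: step against the gradient
    w = w - lr * dw
    b = b - lr * db

print("Final weight:", w)
print("Final bias:", b)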

Expected output:

Final weight: close to 2
Final bias: close to 0

The model discovers:

y = 2x

by updating weights again and again.
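
For a model this small there is also a closed-form least-squares answer, which makes a handy sanity check. A minimal sketch, assuming NumPy's polyfit for the straight-line fit (this check is an addition, not part of the original code):

```python
import numpy as np

# Same toy data as the from-scratch loop.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

print(f"Closed-form weight: {slope:.4f}")
print(f"Closed-form bias: {abs(intercept):.4f}")  # abs() avoids printing -0.0000
```

If the weight and bias from Gradient Descent land near these values, the loop is doing its job.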


Plotting The Loss Curve

plt.plot(losses)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Loss Decreasing During Gradient Descent")
plt.show()

This graph is the quickest health check for training.

If the loss goes down steadily, the model is learning.

If the loss goes up or jumps around, the learning rate may be too high.

If the loss stays flat, the learning rate may be too low or the model may not be expressive enough.


What Happens If Learning Rate Is Too High?

Suppose learning rate is very large.

Then the model takes huge jumps.

Instead of reaching the minimum, it may jump over it again and again.

Good learning rate:

Loss → ↓ ↓ ↓ ↓ ↓

Too high learning rate:

Loss → ↑ ↓ ↑ ↓ ↑

Code experiment:

lr = 1.0

With the data from the from-scratch code, the loss grows at every epoch instead of shrinking.

This is called divergence.
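
To try this experiment without overflowing, the sketch below wraps the from-scratch loop in a helper and runs it for only 10 epochs. The helper name train and the short epoch count are assumptions made here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

def train(lr, epochs):
    """Run plain gradient descent on y_pred = w * x + b; return the loss per epoch."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        y_pred = w * x + b
        losses.append(float(np.mean((y - y_pred) ** 2)))
        dw = (-2 / len(x)) * np.sum(x * (y - y_pred))
        db = (-2 / len(x)) * np.sum(y - y_pred)
        w -= lr * dw
        b -= lr * db
    return losses

# Only 10 epochs: with lr = 1.0 the numbers grow so fast that a long run overflows.
good = train(lr=0.01, epochs=10)
bad = train(lr=1.0, epochs=10)

print("lr=0.01:", [f"{v:.3g}" for v in good])
print("lr=1.0: ", [f"{v:.3g}" for v in bad])
```

With lr = 0.01 every value is smaller than the one before it. With lr = 1.0 each step overshoots the minimum by more than the previous one, so the loss climbs instead of falling.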


What Happens If Learning Rate Is Too Low?

If the learning rate is too small:

lr = 0.000001

The model learns extremely slowly.

The loss does decrease, but by so little per epoch that almost nothing seems to happen for many epochs.

This is why learning rate is one of the most important hyperparameters in deep learning.
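
One way to feel the difference is to count how many epochs each learning rate needs before the loss drops below some target. A rough sketch on the same toy data; the helper name epochs_to_reach, the target of 1.0, and the cap of 5000 epochs are assumptions for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

def epochs_to_reach(lr, target_loss=1.0, max_epochs=5000):
    """Return the first epoch whose loss is below target_loss, or None if never reached."""
    w, b = 0.0, 0.0
    for epoch in range(max_epochs):
        y_pred = w * x + b
        loss = np.mean((y - y_pred) ** 2)
        if loss < target_loss:
            return epoch
        # Both gradients use the predictions computed before this update.
        dw = (-2 / len(x)) * np.sum(x * (y - y_pred))
        db = (-2 / len(x)) * np.sum(y - y_pred)
        w -= lr * dw
        b -= lr * db
    return None

for lr in (0.01, 0.0001, 0.000001):
    print(f"lr={lr}: epochs needed = {epochs_to_reach(lr)}")
```

The smaller the learning rate, the more epochs it takes, and the smallest one does not reach the target within the cap at all.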


Why Gradient Descent Changed Deep Learning

Before deep learning became powerful, one big problem was:

How do we train huge models?

A deep neural network may have:

Millions of weights
Millions of biases
Multiple layers
Complex activations
Huge datasets

Manually choosing weights is impossible.

Trying all combinations is impossible.

Gradient Descent made training possible because it gave a systematic way to improve every parameter.

Even if a model has 10 million parameters, the idea remains:

Find gradient
Move opposite to gradient
Reduce loss
Repeat

That is why Gradient Descent became the engine of deep learning.
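
The four steps look the same whether there are 2 parameters or many. Here is a minimal sketch with 100 weights on synthetic data. The problem sizes, the random seed, and the learning rate are assumptions chosen so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear problem with 100 weights (sizes are illustrative assumptions).
n_samples, n_features = 500, 100
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w

w = np.zeros(n_features)   # start with no knowledge
lr = 0.1
losses = []

for step in range(500):                           # 4. repeat
    residual = y - X @ w
    losses.append(np.mean(residual ** 2))         # 3. the loss shrinks over time
    grad = (-2 / n_samples) * (X.T @ residual)    # 1. find gradient, one entry per weight
    w = w - lr * grad                             # 2. move opposite to gradient

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

All 100 weights are updated in a single line. Nothing in the loop changes when the parameter count grows; only the arrays get bigger.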


What Happens If There Is No Gradient Descent?

Without Gradient Descent, deep learning as we know it would barely work.

We would have neural networks, but we would not know how to train them efficiently.

Without Gradient Descent:

No automatic weight improvement
No large-scale neural network training
No modern computer vision
No powerful language models
No practical deep learning revolution

We could still use some alternatives like:

  • Random search
  • Genetic algorithms
  • Manual tuning
  • Closed-form solutions for very small models

But they do not scale like Gradient Descent.

Imagine a neural network with 100 million parameters.

Randomly trying weights would be hopeless.

Gradient Descent gives direction.

That direction changed everything.
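
A toy comparison makes the point. The sketch below gives random search and Gradient Descent the same number of tries on a simple bowl-shaped loss with 100 parameters. The loss function, the seed, and the budget are assumptions for illustration, and a loss evaluation is not the same cost as a gradient step, so treat it as a rough picture only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 100                       # stand-in for "many parameters" (assumption)
target = rng.normal(size=n_params)   # the unknown best weights

def loss(w):
    return np.sum((w - target) ** 2)

budget = 1000  # number of tries allowed for each method

# Random search: guess whole weight vectors blindly and keep the best loss seen.
best_random = min(loss(rng.normal(size=n_params)) for _ in range(budget))

# Gradient descent: follow the slope from a single starting guess.
w = np.zeros(n_params)
lr = 0.1
for _ in range(budget):
    grad = 2 * (w - target)   # gradient of the sum of squared differences
    w = w - lr * grad

print(f"random search best loss: {best_random:.4f}")
print(f"gradient descent loss:   {loss(w):.2e}")
```

Random guessing has to get all 100 numbers right at once. Gradient Descent improves all of them a little on every step.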


Gradient Descent In Neural Networks

In a neural network, every layer has weights.

Input → Hidden Layer → Output

Each layer makes a small transformation.

The final prediction produces loss.

Then backpropagation calculates:

How much each weight contributed to the error

Gradient Descent then updates all weights.

optimizer.zero_grad()
loss.backward()
optimizer.step()

This PyTorch code looks small.

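But conceptually:

optimizer.zero_grad()

clears the gradients left over from the previous step, because PyTorch accumulates them by default.

loss.backward()

calculates gradients using backpropagation.

optimizer.step()

applies the update rule. With a plain SGD optimizer, that rule is Gradient Descent.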

That is the heart of deep learning training.


PyTorch Example

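import torch
import torch.nn as nn

# Toy dataset that follows y = 2x
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# One input feature, one output: y_pred = w * x + b
model = nn.Linear(1, 1)

# Mean squared error, the same loss as in the from-scratch version
loss_fn = nn.MSELoss()

# Plain gradient descent over the model's weight and bias
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    y_pred = model(X)             # forward pass

    loss = loss_fn(y_pred, y)

    optimizer.zero_grad()         # clear gradients from the previous step
    loss.backward()               # backpropagation: compute gradients
    optimizer.step()              # update the weight and bias

# The weight should end up near 2 and the bias near 0
print(list(model.parameters()))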

This model also learns, approximately:

y = 2x

The difference is that PyTorch calculates gradients automatically.

That automatic gradient calculation is called autograd.
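
To get a feel for what automatic gradient calculation means, here is a toy sketch of the idea in plain Python. It is not how PyTorch is implemented; the class name Value and its fields are hypothetical and exist only for this illustration. Each number remembers how it was computed, and backward() applies the chain rule in reverse.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Collect nodes so that every node comes after the nodes it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)

        # Walk from the loss back to the inputs, applying the chain rule.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# loss = (y - w * x)^2 with the numbers from the small numerical example
w, x, y = Value(1.0), Value(2.0), Value(10.0)
error = y - w * x
loss = error * error
loss.backward()

print("loss:", loss.data)   # loss: 64.0
print("dL/dw:", w.grad)     # dL/dw: -32.0
```

The gradient matches the -32 from the hand calculation, without anyone writing the derivative formula by hand.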


Gradient Descent vs Backpropagation

Many beginners confuse these two.

They are related but not the same.

Backpropagation answers:

What are the gradients?

Gradient Descent answers:

How should we update the weights using those gradients?

So:

Backpropagation = gradient calculation

Gradient Descent = parameter update

Together, they train deep neural networks.
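
The split can be made explicit in code. The sketch below trains a tiny network with one hidden unit, keeping gradient calculation and parameter update in two separate functions. The network shape, the function names, the starting weights, and the single training example are all assumptions for illustration.

```python
import math

def forward(params, x):
    """Tiny network: one hidden unit with tanh, then a linear output."""
    h = math.tanh(params["w1"] * x)
    return params["w2"] * h, h

def backpropagation(params, x, y_actual):
    """Answers: what are the gradients? (chain rule, applied by hand)"""
    y_pred, h = forward(params, x)
    d_pred = -2.0 * (y_actual - y_pred)    # dL/dy_pred for squared error
    d_w2 = d_pred * h                      # because y_pred = w2 * h
    d_h = d_pred * params["w2"]
    d_w1 = d_h * (1.0 - h ** 2) * x        # tanh'(z) = 1 - tanh(z)^2
    return {"w1": d_w1, "w2": d_w2}

def gradient_descent_step(params, grads, lr):
    """Answers: how should we update the weights using those gradients?"""
    return {name: params[name] - lr * grads[name] for name in params}

def loss(params, x, y_actual):
    y_pred, _ = forward(params, x)
    return (y_actual - y_pred) ** 2

params = {"w1": 0.5, "w2": 0.5}   # arbitrary starting weights
x, y_actual = 1.0, 2.0            # a single made-up training example

first_loss = loss(params, x, y_actual)
for _ in range(300):
    grads = backpropagation(params, x, y_actual)
    params = gradient_descent_step(params, grads, lr=0.05)

final_loss = loss(params, x, y_actual)
print(f"loss before: {first_loss:.4f}, loss after: {final_loss:.8f}")
```

Swapping gradient_descent_step for a different update rule would not require touching backpropagation at all, which is exactly why the two ideas are worth keeping apart.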


Why Deep Learning Needed Gradient Descent More Than Traditional ML

Traditional ML models often have fewer parameters.

Some algorithms do not use gradients at all.

For example:

Decision Trees split data using rules.

KNN stores examples.

Naive Bayes uses probability formulas.

But deep learning is different.

Deep learning is mostly parameter learning.

Millions of parameters must be adjusted.

That is why Gradient Descent became more important in deep learning than almost anywhere else.


Final Thought

Gradient Descent changed deep learning because it converted learning into optimization.

Instead of manually programming intelligence, we define:

Model
Loss Function
Optimizer
Data

Then the model improves itself step by step.

That is the real breakthrough.

Deep learning is not just about big neural networks.

It is about trainable neural networks.

And Gradient Descent is what makes them trainable.

Without it, deep learning would be like a powerful car with no steering.

With it, random weights slowly become useful knowledge.
