Suppose you're building a neural network, maybe even a deep and complex one. You’ve set up the layers, initialized weights, and defined activation functions. But here's a question:
Can this neural network make accurate predictions without tuning?
The short answer: No.
Building a neural network is just the start; the real magic lies in fine-tuning its parameters (i.e., weights and biases) so it actually learns from the data.
Why Do We Need to Optimize Weights and Biases?
To make better predictions, we want the network to minimize the loss and maximize accuracy.
Fine-tuning means updating weights and biases over multiple epochs using feedback from the output (loss) to improve the next prediction. This continues until we reach a point of minimal loss.
But how exactly do we update the weights and biases?
Early (Inefficient) Ideas for Updating Weights
1. Random Weights & Biases
Try random values, compute the loss, and repeat the process until the lowest loss is achieved.
❌ This is inefficient and slow.
2. Guided Random Tweaks
Start with random weights → calculate the loss → try new weights close to the previous ones → keep going if the loss decreases → stop if it doesn't.
✅ Better than the first, but still not optimal.
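To see why this still struggles, here's a minimal sketch of the guided-random-tweak idea. Everything in it is illustrative: the data is random, the "network" is a single linear neuron, and the loss is mean squared error.

```python
import numpy as np

# Guided random tweaks (not how real training works): nudge the weights
# randomly and keep the nudge only if the loss drops.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 made-up samples, 4 features
y_true = rng.normal(size=(100, 1))

def predict(X, w, b):
    return X @ w + b                     # a single linear neuron, for simplicity

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

w = rng.normal(size=(4, 1))
b = np.zeros((1,))
best_loss = mse(predict(X, w, b), y_true)

for _ in range(1000):
    # tweak the current weights slightly and keep the change only if the loss improves
    w_try = w + rng.normal(scale=0.01, size=w.shape)
    b_try = b + rng.normal(scale=0.01, size=b.shape)
    loss = mse(predict(X, w_try, b_try), y_true)
    if loss < best_loss:
        w, b, best_loss = w_try, b_try, loss

print("loss after random tweaking:", best_loss)
```

Even on this tiny problem, most random nudges get rejected. We need a smarter way to pick the direction and size of each update.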
Now let’s go one level higher...
Enter: Backpropagation + Gradient Descent
Let’s say we want to go downhill on a loss curve to reach the lowest point (minimum loss). To do this efficiently, we need to know:
The direction to move in → determined by the slope (derivative)
How much to move → controlled by the learning rate
This is where calculus enters the picture.
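To make that concrete, here's a toy one-dimensional example: a made-up loss curve L(w) = (w − 3)², where the derivative tells us which way is downhill and the learning rate decides how far we step.

```python
# A toy 1D example: minimize L(w) = (w - 3)**2 by walking downhill.
# The slope (derivative) gives the direction; the learning rate sets the step size.

def loss(w):
    return (w - 3) ** 2

def dloss_dw(w):
    return 2 * (w - 3)          # derivative of (w - 3)**2

w = 10.0                        # start somewhere random on the curve
learning_rate = 0.1

for step in range(25):
    w = w - learning_rate * dloss_dw(w)   # move against the slope

print(w, loss(w))               # w slides toward 3, the minimum of the curve
```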
Gradient Descent — The Update Rule
To minimize the loss L, we update the weights and biases using:
w = w − η · ∂L/∂w
b = b − η · ∂L/∂b
Where:
∂L/∂w = the derivative of the loss with respect to the weight
∂L/∂b = the derivative of the loss with respect to the bias
η = the learning rate (how big a step we take)
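In code, the update rule is a single line per parameter. This is only a shape-level sketch: the gradients below are placeholders, because computing the real ones is exactly what backpropagation (coming up next) is for.

```python
import numpy as np

eta = 0.05                               # example learning rate

w = np.random.randn(3, 4)                # hidden-layer weights (3 neurons, 4 inputs)
b = np.zeros(3)                          # hidden-layer biases

# Placeholder gradients, just to show the update; in practice these come from backpropagation.
dL_dw = np.full_like(w, 0.1)
dL_db = np.full_like(b, 0.1)

w = w - eta * dL_dw                      # w = w − η · ∂L/∂w
b = b - eta * dL_db                      # b = b − η · ∂L/∂b
```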
But how do we get these derivatives?
That’s where Backpropagation comes in.
Backpropagation: Going Back to Learn Better
Let’s take an example:
A neural network with 3 neurons in the hidden layer
A single output neuron in the final layer
Loss function: Mean Squared Error →
L = (y_pred − y_true)²
To update the weights, we apply the chain rule from calculus to compute the gradients.
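Before writing the gradients down, it helps to pin down the forward pass they refer to. The sketch below assumes 4 inputs, a sigmoid activation in the hidden layer, and a linear output neuron with weights v and bias c; those specific choices (and the sample values) are illustrative, since only the layer sizes and the MSE loss are fixed above.

```python
import numpy as np

# Forward pass for the example network: 4 inputs -> 3 hidden neurons -> 1 output.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3, 0.8])     # one input sample (made-up values)
y_true = 1.0

W = np.random.randn(3, 4)               # hidden weights: row i holds w_i1..w_i4
b = np.zeros(3)                         # hidden biases b1, b2, b3
v = np.random.randn(3)                  # output-layer weights (assumed name)
c = 0.0                                 # output-layer bias (assumed name)

z = W @ x + b                           # z1, z2, z3
a = sigmoid(z)                          # a1, a2, a3
y_pred = v @ a + c                      # network output y
L = (y_pred - y_true) ** 2              # Mean Squared Error for one sample
```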
Gradients for Hidden Layer Weights
Neuron 1
∂L/∂w11 = ∂L/∂y · ∂y/∂a1 · ∂a1/∂z1 · ∂z1/∂w11
∂L/∂w12 = ∂L/∂y · ∂y/∂a1 · ∂a1/∂z1 · ∂z1/∂w12
∂L/∂w13 = ∂L/∂y · ∂y/∂a1 · ∂a1/∂z1 · ∂z1/∂w13
∂L/∂w14 = ∂L/∂y · ∂y/∂a1 · ∂a1/∂z1 · ∂z1/∂w14
Neuron 2
∂L/∂w21 = ∂L/∂y · ∂y/∂a2 · ∂a2/∂z2 · ∂z2/∂w21
∂L/∂w22 = ∂L/∂y · ∂y/∂a2 · ∂a2/∂z2 · ∂z2/∂w22
∂L/∂w23 = ∂L/∂y · ∂y/∂a2 · ∂a2/∂z2 · ∂z2/∂w23
∂L/∂w24 = ∂L/∂y · ∂y/∂a2 · ∂a2/∂z2 · ∂z2/∂w24
Neuron 3
∂L/∂w31 = ∂L/∂y · ∂y/∂a3 · ∂a3/∂z3 · ∂z3/∂w31
∂L/∂w32 = ∂L/∂y · ∂y/∂a3 · ∂a3/∂z3 · ∂z3/∂w32
∂L/∂w33 = ∂L/∂y · ∂y/∂a3 · ∂a3/∂z3 · ∂z3/∂w33
∂L/∂w34 = ∂L/∂y · ∂y/∂a3 · ∂a3/∂z3 · ∂z3/∂w34
Note: For example, w21 flows through neuron 2, so we apply the chain rule through neuron 2's pre-activation (z2) and activation (a2).
Gradients for Biases
Bias of Neuron 1
∂L/∂b1 = ∂L/∂y · ∂y/∂a1 · ∂a1/∂z1 · ∂z1/∂b1
Bias of Neuron 2
∂L/∂b2 = ∂L/∂y · ∂y/∂a2 · ∂a2/∂z2 · ∂z2/∂b2
Bias of Neuron 3
∂L/∂b3 = ∂L/∂y · ∂y/∂a3 · ∂a3/∂z3 · ∂z3/∂b3
🎯 We calculate these gradients by moving backward through the network — and that’s why the algorithm is called Backpropagation.
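Continuing the forward-pass sketch from above (same variables, same assumed sigmoid activation and linear output), the chain-rule products translate directly into a few lines of code:

```python
# Backward pass: each gradient is the chain-rule product, written in the same
# order as the formulas (loss -> output -> activation -> pre-activation -> weight).

dL_dy = 2 * (y_pred - y_true)           # ∂L/∂y for MSE
dy_da = v                               # ∂y/∂a_i: the output-layer weight on neuron i
da_dz = a * (1 - a)                     # ∂a_i/∂z_i for the assumed sigmoid
dz_dW = x                               # ∂z_i/∂w_ij: the input that weight multiplies

# Hidden-layer weight gradients: one row per neuron, one column per input weight.
dL_dW = (dL_dy * dy_da * da_dz)[:, None] * dz_dW[None, :]

# Bias gradients: ∂z_i/∂b_i = 1, so the last factor drops out.
dL_db = dL_dy * dy_da * da_dz
```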
Applying Gradient Descent
Once we compute the gradients, we use the gradient descent update rule:
w = w − η · ∂L/∂w
b = b − η · ∂L/∂b
Here η is the learning rate (for example, 0.05), and you can adjust it.
After applying the update:
✅ Loss decreases
✅ Accuracy improves
Repeat this process across epochs to gradually optimize the model.
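Putting it all together, here's a minimal training loop that reuses the variables from the sketches above (sigmoid, x, y_true, W, b, v, c) and repeats forward pass → backpropagation → gradient-descent update for a number of epochs. It trains on a single made-up sample, purely to show the mechanics:

```python
eta = 0.05                               # learning rate (example value)

for epoch in range(100):
    # forward pass
    z = W @ x + b
    a = sigmoid(z)
    y_pred = v @ a + c
    L = (y_pred - y_true) ** 2

    # backward pass (chain rule, as above)
    dL_dy = 2 * (y_pred - y_true)
    grad_hidden = dL_dy * v * a * (1 - a)
    dL_dW = grad_hidden[:, None] * x[None, :]
    dL_db = grad_hidden
    dL_dv = dL_dy * a                    # output-layer weight gradients
    dL_dc = dL_dy                        # output-layer bias gradient

    # gradient descent updates
    W -= eta * dL_dW
    b -= eta * dL_db
    v -= eta * dL_dv
    c -= eta * dL_dc

    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss = {L:.4f}")
```

Watching the printed loss shrink toward zero is exactly the "walk downhill" picture from earlier.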
Before vs After Optimization
📍 Before Gradient Descent: You're sitting somewhere randomly on the loss curve.
📍 After One Update: You’ve moved closer to the local minimum.
This is the power of gradient descent — it helps your model learn how to learn.
Why This Matters
The heart of deep learning is optimization.
And the heart of optimization is:
Backpropagation + Gradient Descent
Without these, neural networks would just be complex calculators spitting out random values.
Thanks to backpropagation, networks can learn from their mistakes, and thanks to gradient descent, they can improve continuously.