In the previous article, we explored activation functions and visualized them using Python.
Now, let’s see what gradients are.
Neural networks use activation functions to transform the inputs flowing through them.
But if a neural network gives a wrong output, how does it know what to fix?
This is where gradients come in.
What is a gradient?
Imagine you are walking on a hill. If the ground is steep, you can feel which direction goes up or down.
If the ground is almost flat, it is hard to tell where to go.
A gradient can simply be thought of as a number that tells us how steep a curve is at a point, in other words, the slope of the function there.
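For intuition, here is a minimal sketch that estimates the steepness of the simple curve y = x² at two points using a tiny finite difference (the helper name numerical_gradient is just for illustration):

def numerical_gradient(f, x, h=1e-5):
    # estimate the slope of f at x with a small finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2                  # a simple "hill": y = x^2
print(numerical_gradient(f, 3.0))     # ~6.0 -> steep, far from the bottom
print(numerical_gradient(f, 0.0))     # ~0.0 -> flat, right at the bottom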
How does this apply in the case of neural networks? Let’s see.
Why gradients matter in neural networks
In neural networks:
- Gradients tell us how much, and in which direction, each parameter should change
- The bigger the gradient, the bigger the update
- If the gradient is 0, learning stops for that parameter
When training a neural network:
- We make a prediction
- We calculate how wrong it is (the loss)
- We update weights to reduce the loss
- This update depends entirely on gradients, as the sketch below shows
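As a rough illustration (not full backpropagation), here is a minimal sketch of repeated gradient-based updates for a single weight, assuming a made-up loss of (w - 3)² whose gradient is 2 * (w - 3):

w = 0.0                               # current weight
learning_rate = 0.1

for step in range(5):
    grad = 2 * (w - 3)                # gradient of the toy loss (w - 3)^2
    w = w - learning_rate * grad      # bigger gradient -> bigger update
    print(step, round(w, 4))
# w moves toward 3, where the gradient (and the loss) shrinks to zero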
Gradients of activation functions
Each activation function has:
- A curve
- A gradient curve
Let’s check this in Python, starting with ReLU.
Gradient of ReLU
We can define the ReLU gradient as:
import numpy as np

def relu_grad(x):
    return np.where(x > 0, 1, 0)
This means:
- Gradient = 0 when input ≤ 0
- Gradient = 1 when input > 0
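As a quick check, you can evaluate it on a few sample values:

print(relu_grad(np.array([-2.0, 0.0, 3.0])))   # -> [0 0 1]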
Let’s plot it:
import matplotlib.pyplot as plt

# input range for all the plots below (assumed here; defined in the previous article)
x = np.linspace(-10, 10, 500)

plt.figure()
plt.plot(x, relu_grad(x), label="ReLU Gradient")
plt.title("Gradient of ReLU")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
This is the ReLU gradient. From the plot, you can observe the following:
- The entire negative side has zero gradient
- The positive side has a constant gradient of 1
From this, we can further understand that:
- ReLU learns very fast when active
- ReLU neurons can die if they always receive negative inputs, as the small sketch below shows
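To see why, here is a minimal sketch: if a neuron only ever receives negative inputs, its gradient is always zero, so its weights never get updated and it stops learning:

inputs = np.array([-4.0, -1.5, -0.2, -3.0])   # a neuron that only ever sees negative inputs
print(relu_grad(inputs))                      # -> [0 0 0 0]
# every weight update is scaled by the gradient, so a zero gradient means no update at all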
Gradient of Softplus
We can define the Softplus gradient as:

def softplus_grad(x):
    # derivative of softplus, ln(1 + e^x), which works out to the sigmoid function
    return 1 / (1 + np.exp(-x))
Let’s plot it:
plt.figure()
plt.plot(x, softplus_grad(x), label="Softplus Gradient")
plt.title("Gradient of Softplus")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
You can observe that the gradient of Softplus is exactly the sigmoid activation function.
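You can confirm this numerically by comparing a finite-difference derivative of Softplus itself with softplus_grad (Softplus is assumed here to be ln(1 + e^x), as in the previous article):

def softplus(x):
    # softplus itself, as defined in the previous article
    return np.log(1 + np.exp(x))

h = 1e-5
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)    # numerical derivative of softplus
print(np.allclose(numeric, softplus_grad(x)))              # prints True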
You can also observe that:
- There is a smooth transition instead of a hard corner
- The gradient is never exactly zero, so learning always continues
- This avoids dying neurons and adds stability, but it is more expensive to compute than ReLU
Gradient of Sigmoid
The sigmoid gradient looks like this:
def sigmoid(x):  # sigmoid itself, as defined in the previous article
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)
Let’s plot it:
plt.figure()
plt.plot(x, sigmoid_grad(x), label="Sigmoid Gradient")
plt.title("Gradient of Sigmoid")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
From the above, you can observe the following:
- The gradient is very small at both extremes
- It is strong only around the middle
- It is almost zero for large positive or negative values (see the quick check below)
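A quick numeric check makes this concrete:

for v in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(v, sigmoid_grad(v))
# 0.25 at the center, but only about 0.000045 at x = 10 and x = -10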
This leads to a famous problem called the vanishing gradient problem. We will explore this more in the next article.
You can try the examples out via the Colab notebook.
If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.
👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools