In the previous article, we explored activation functions and visualized them using Python.
Now, let’s see what gradients are.
Neural networks use activation functions to transform the inputs flowing through them.
But if a neural network gives a wrong output, how does it know what to fix?
This is where gradients come in.
What is a gradient?
Imagine you are walking on a hill. If the ground is steep, you can feel which direction goes up or down.
If the ground is almost flat, it is hard to tell where to go.
A gradient can simply be thought of as a number that tells us how steep a curve is at a point, in other words, the slope of the function there.
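For intuition, here is a minimal sketch that estimates the steepness of the simple curve y = x² at two points using a tiny finite difference (the helper name numerical_gradient is just for illustration):

def numerical_gradient(f, x, h=1e-5):
    # estimate the slope of f at x with a small finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2                  # a simple "hill": y = x^2
print(numerical_gradient(f, 3.0))     # ~6.0 -> steep, far from the bottom
print(numerical_gradient(f, 0.0))     # ~0.0 -> flat, right at the bottom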
How does this apply in the case of neural networks? Let’s see.
Why gradients matter in neural networks
In neural networks:
- Gradients tell us how much, and in which direction, each parameter should change
- The bigger the gradient, the bigger the update
- If the gradient is 0, learning stops for that parameter
When training a neural network:
- We make a prediction
- We calculate how wrong it is (the loss)
- We update weights to reduce the loss
- This update depends entirely on gradients, as the sketch below shows
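As a rough illustration (not full backpropagation), here is a minimal sketch of repeated gradient-based updates for a single weight, assuming a made-up loss of (w - 3)² whose gradient is 2 * (w - 3):

w = 0.0                               # current weight
learning_rate = 0.1

for step in range(5):
    grad = 2 * (w - 3)                # gradient of the toy loss (w - 3)^2
    w = w - learning_rate * grad      # bigger gradient -> bigger update
    print(step, round(w, 4))
# w moves toward 3, where the gradient (and the loss) shrinks to zero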
Gradients of activation functions
Each activation function has:
- A curve
- A gradient curve
Let’s check this in Python, starting with ReLU.
Gradient of ReLU
We can define the ReLU gradient as:
import numpy as np

def relu_grad(x):
    return np.where(x > 0, 1, 0)
This means:
- Gradient = 0 when input ≤ 0
- Gradient = 1 when input > 0
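As a quick check, you can evaluate it on a few sample values:

print(relu_grad(np.array([-2.0, 0.0, 3.0])))   # -> [0 0 1]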
Let’s plot it:
import matplotlib.pyplot as plt

# input range for all the plots below (assumed here; defined in the previous article)
x = np.linspace(-10, 10, 500)

plt.figure()
plt.plot(x, relu_grad(x), label="ReLU Gradient")
plt.title("Gradient of ReLU")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
This is the ReLU gradient. From the plot, you can observe the following:
- The entire negative side has zero gradient
- The positive side has a constant gradient of 1
From this, we can further understand that:
- ReLU learns very fast when active
- ReLU neurons can die if they always receive negative inputs, as the small sketch below shows
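To see why, here is a minimal sketch: if a neuron only ever receives negative inputs, its gradient is always zero, so its weights never get updated and it stops learning:

inputs = np.array([-4.0, -1.5, -0.2, -3.0])   # a neuron that only ever sees negative inputs
print(relu_grad(inputs))                      # -> [0 0 0 0]
# every weight update is scaled by the gradient, so a zero gradient means no update at all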
Gradient of Softplus
We can define the Softplus gradient as:

def softplus_grad(x):
    # derivative of softplus, ln(1 + e^x), which works out to the sigmoid function
    return 1 / (1 + np.exp(-x))
Let’s plot it:
plt.figure()
plt.plot(x, softplus_grad(x), label="Softplus Gradient")
plt.title("Gradient of Softplus")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
You can observe that the gradient of Softplus is exactly the sigmoid activation function.
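You can confirm this numerically by comparing a finite-difference derivative of Softplus itself with softplus_grad (Softplus is assumed here to be ln(1 + e^x), as in the previous article):

def softplus(x):
    # softplus itself, as defined in the previous article
    return np.log(1 + np.exp(x))

h = 1e-5
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)    # numerical derivative of softplus
print(np.allclose(numeric, softplus_grad(x)))              # prints True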
You can also observe that:
- There is a smooth transition instead of a hard corner
- The gradient is never exactly zero, so learning always continues
- This avoids dying neurons and adds stability, but it is more expensive to compute than ReLU
Gradient of Sigmoid
The sigmoid gradient looks like this:
def sigmoid(x):  # sigmoid itself, as defined in the previous article
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)
Let’s plot it:
plt.figure()
plt.plot(x, sigmoid_grad(x), label="Sigmoid Gradient")
plt.title("Gradient of Sigmoid")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
From the above, you can observe the following:
- The gradient is very small at both extremes
- It is strong only around the middle
- It is almost zero for large positive or negative values (see the quick check below)
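A quick numeric check makes this concrete:

for v in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(v, sigmoid_grad(v))
# 0.25 at the center, but only about 0.000045 at x = 10 and x = -10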
This leads to a famous problem called the vanishing gradient problem. We will explore this more in the next article.
You can try the examples out via the Colab notebook.
If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.
👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools