In the previous article, we explored the gradients of different activation functions.
Now let's look at a scenario where gradients themselves become a problem: the vanishing gradient problem.
What is the Vanishing Gradient Problem?
In neural networks, the gradient tells the network how much to change each weight to reduce the error.
However, if the gradient is too small, the network learns extremely slowly, or sometimes stops learning completely.
This is the vanishing gradient problem.
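To make that concrete, here is a minimal sketch of a single gradient-descent weight update (the learning rate and gradient values below are made up purely for illustration):
# One gradient-descent step: the weight change is proportional to the gradient
learning_rate = 0.1
healthy_gradient = 0.5     # a reasonably sized gradient
vanished_gradient = 1e-7   # a gradient that has almost vanished
print("Update with healthy gradient: ", learning_rate * healthy_gradient)    # 0.05
print("Update with vanished gradient:", learning_rate * vanished_gradient)   # ~1e-08, the weight barely moves
With a vanished gradient, each step nudges the weight by a negligible amount, which is exactly what "the network stops learning" means in practice.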
Why Does it Happen?
Vanishing gradients usually happen in deep networks, especially with activation functions like sigmoid.
- Sigmoid squashes inputs to a small range: 0 to 1.
When you backpropagate through many layers, the gradient for an early layer is calculated as a chain-rule product:
Gradient of layer 1 = gradient at the output × product of the gradients of all layers after it
Since each of these factors is less than 1 (the sigmoid gradient never exceeds 0.25), multiplying many of them together produces a very tiny number, so the gradient reaching the early layers is almost zero.
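A quick back-of-the-envelope check makes this concrete: even in the best possible case, where every layer contributes the sigmoid's maximum gradient of 0.25, the product across just 10 layers is already tiny.
# Best-case chain-rule product for sigmoid across 10 layers
max_sigmoid_grad = 0.25  # maximum of s * (1 - s), reached at input 0
print(max_sigmoid_grad ** 10)  # ~9.5e-07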
Visualizing Vanishing Gradients
import numpy as np
import matplotlib.pyplot as plt
# Sigmoid activation
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Sigmoid gradient
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)
x = np.linspace(-10, 10, 1000)
grad = sigmoid_grad(x)
plt.figure(figsize=(8, 4))
plt.plot(x, grad, label="Sigmoid Gradient")
plt.title("Sigmoid Gradient (Vanishing at Extremes)")
plt.xlabel("Input")
plt.ylabel("Gradient")
plt.grid(True)
plt.legend()
plt.show()
Observation: Gradients are very small for large positive or negative inputs. This is what causes vanishing gradients in deep networks.
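To put numbers on the plot, you can evaluate the gradient at a few sample inputs (this reuses sigmoid_grad from the snippet above; the chosen inputs are just illustrative):
# Gradient peaks at 0.25 for input 0 and collapses as inputs move away from 0
for value in [0, 2, 5, 10]:
    print(f"Input {value:>2}: gradient = {sigmoid_grad(value):.6f}")
This prints roughly 0.25, 0.10, 0.0066, and 0.000045, matching the shape of the curve above.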
Simple Deep Network Simulation
Let's simulate a deep network where the gradient vanishes:
# Simulate deep network gradient
layers = 10
x = 5 # extreme input
grad = 1.0
for i in range(layers):
    grad *= sigmoid_grad(x)  # multiply by each layer's gradient
    print(f"Layer {i+1}, Gradient: {grad:.8f}")
Output
Layer 1, Gradient: 0.00664806
Layer 2, Gradient: 0.00004420
Layer 3, Gradient: 0.00000029
Layer 4, Gradient: 0.00000000
Layer 5, Gradient: 0.00000000
Layer 6, Gradient: 0.00000000
Layer 7, Gradient: 0.00000000
Layer 8, Gradient: 0.00000000
Layer 9, Gradient: 0.00000000
Layer 10, Gradient: 0.00000000
Observation: The gradient quickly becomes extremely small. Early layers cannot learn effectively.
Mitigation with ReLU
ReLU does not squash positive values; its gradient is exactly 1 for any positive input, so repeated multiplication across layers doesn't shrink the gradient.
Let's compare sigmoid and ReLU in a similar deep network:
# ReLU activation and gradient
def relu(x):
    return np.maximum(0, x)
def relu_grad(x):
    return np.where(x > 0, 1, 0)
layers = 10
x = 5 # extreme input
# Running gradient products for sigmoid and ReLU
grad_sigmoid = 1.0
grad_relu = 1.0
for i in range(layers):
    grad_sigmoid *= sigmoid_grad(x)
    grad_relu *= relu_grad(x)
    print(f"Layer {i+1}, Sigmoid Grad: {grad_sigmoid:.8f}, ReLU Grad: {grad_relu}")
Output
Layer 1, Sigmoid Grad: 0.00664806, ReLU Grad: 1.0
Layer 2, Sigmoid Grad: 0.00004420, ReLU Grad: 1.0
Layer 3, Sigmoid Grad: 0.00000029, ReLU Grad: 1.0
Layer 4, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 5, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 6, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 7, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 8, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 9, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Layer 10, Sigmoid Grad: 0.00000000, ReLU Grad: 1.0
Observation:
- Sigmoid gradient vanishes quickly.
- ReLU gradient remains 1 as long as inputs are positive, allowing faster and more stable learning.
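As a quick check of that caveat, evaluating relu_grad (defined above) at a positive and a negative input shows where ReLU's gradient holds up and where it drops to zero (the inputs are arbitrary):
# ReLU gradient: 1 for positive inputs, 0 for negative inputs
for value in [5, -5]:
    print(f"Input {value:>2}: ReLU gradient = {relu_grad(value)}")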
Wrapping up
Hope you now have a clear idea of the vanishing gradient problem. Along with activation functions and gradients, this concept forms one of the core building blocks for understanding neural networks. In the coming articles, we’ll build on these foundations and explore what comes next.
You can try the examples out via the Colab notebook.
If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.
👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools

