Your model made a prediction.
The prediction was wrong.
Not just wrong. You have a number that tells you exactly how wrong. That number is the loss. High loss means bad prediction. Zero loss means perfect.
Now what?
You need to adjust the model's weights to reduce the loss. But there are millions of weights. You cannot try every possible combination. You need a smarter approach.
The smarter approach is this: figure out how the loss changes when you nudge each weight slightly. If nudging a weight upward increases the loss, push it downward instead. If nudging it downward increases the loss, push it upward. Follow the direction that reduces loss.
The mathematical tool that measures "how does the output change when I nudge the input" is the derivative.
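Before reaching for derivatives at all, the "nudge and check" idea can be sketched directly in a few lines. This is a toy illustration only; the loss function and step size here are made up for the demo, and real training replaces this brute-force probing with the derivative.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an arbitrary choice for the demo).
    return (w - 3) ** 2

w = 8.0
step = 0.1

for _ in range(100):
    # Try a nudge in each direction and keep whichever lowers the loss.
    if loss(w + step) < loss(w):
        w += step
    elif loss(w - step) < loss(w):
        w -= step
    else:
        break  # neither direction helps: we are at (or near) the bottom

print(f"Found minimum near w={w:.1f}")
```

The derivative gives you the same directional information without having to evaluate the loss twice per weight, which is what makes it usable at the scale of millions of weights.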
The Core Idea Without Calculus
Forget the formal definition for a moment.
A derivative at a point is just the slope of the curve at that point.
Slope you already know. Rise over run. How much does y change when x changes. On a straight line the slope is constant. On a curve it changes at every point.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x = np.linspace(-3, 3, 100)
y = x ** 2
plt.figure(figsize=(8, 5))
plt.plot(x, y, 'b-', linewidth=2, label='y = x²')
plt.axhline(y=0, color='gray', linewidth=0.5)
plt.axvline(x=0, color='gray', linewidth=0.5)
plt.title('The slope changes at every point on this curve')
plt.grid(True, alpha=0.3)
plt.legend()
plt.savefig('curve.png', dpi=100, bbox_inches='tight')
plt.close()
print("Plot saved")
On the curve y = x², the slope is zero at the bottom, steep and positive on the right, steep and negative on the left. The slope at every point is its derivative at that point.
Measuring Slope Numerically
Before the formula, see how the derivative actually works by measuring it directly.
Pick a point. Move a tiny bit to the right. Measure how much y changed. Divide by how much x changed. That ratio is approximately the derivative.
def f(x):
    return x ** 2
x = 3.0
h = 0.0001 # tiny step
slope = (f(x + h) - f(x)) / h
print(f"Numerical derivative at x=3: {slope:.4f}")
Output:
Numerical derivative at x=3: 6.0001
The derivative of x² at x=3 is 6. Our numerical estimate got 6.0001. The tiny error comes from h being small but not truly infinitesimal.
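Incidentally, a slightly fancier estimate, the central difference, nudges in both directions and cancels most of that error. Same function, same point; only the formula changes.

```python
def f(x):
    return x ** 2

x = 3.0
h = 0.0001

# Forward difference: nudge right only. Error is roughly proportional to h.
forward = (f(x + h) - f(x)) / h

# Central difference: nudge both ways. The symmetric errors cancel,
# leaving an error roughly proportional to h squared.
central = (f(x + h) - f(x - h)) / (2 * h)

print(f"forward: {forward:.6f}")
print(f"central: {central:.6f}")
```

The central difference is a common trick for checking hand-derived gradients numerically.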
Try different values of x.
for x in [-3, -2, -1, 0, 1, 2, 3]:
    slope = (f(x + h) - f(x)) / h
    print(f"x={x:2d} slope={slope:.2f}")
Output:
x=-3 slope=-6.00
x=-2 slope=-4.00
x=-1 slope=-2.00
x= 0 slope= 0.00
x= 1 slope= 2.00
x= 2 slope= 4.00
x= 3 slope= 6.00
At x=0 the slope is 0. That is the bottom of the bowl. At x=3 the slope is 6, positive, the curve is rising. At x=-3 the slope is -6, negative, the curve is falling.
See the pattern? The derivative of x² is 2x. At any point x, the slope is twice that x value.
Why This Pattern Exists
You do not need to derive formulas by hand. Libraries handle that. But knowing the common patterns helps you read AI papers and code.
| Function | Derivative | What it means |
|---|---|---|
| x² | 2x | slope grows as x grows |
| x³ | 3x² | slope always positive except at zero |
| 5x | 5 | constant slope, it's a line |
| constant | 0 | flat line, no slope |
| e^x | e^x | slope equals value, a unique property |
The pattern for xⁿ is always n * x^(n-1). Bring the exponent down, reduce the power by one.
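You can sanity-check that pattern numerically for any exponent. A quick sketch (the exponents and test point here are arbitrary choices):

```python
def numerical_slope(f, x, h=1e-6):
    # Forward-difference estimate of the derivative of f at x.
    return (f(x + h) - f(x)) / h

# Power rule: the derivative of x^n is n * x^(n-1).
x = 1.5
for n in [2, 3, 4]:
    measured = numerical_slope(lambda x: x ** n, x)
    predicted = n * x ** (n - 1)
    print(f"n={n}: measured={measured:.4f}, predicted={predicted:.4f}")
```

The measured and predicted values agree to several decimal places for every n you try.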
def derivative_x_squared(x):
    return 2 * x

for x in [-3, -2, -1, 0, 1, 2, 3]:
    print(f"x={x:2d} exact slope={derivative_x_squared(x):.2f}")
Output:
x=-3 exact slope=-6.00
x=-2 exact slope=-4.00
x=-1 exact slope=-2.00
x= 0 exact slope= 0.00
x= 1 exact slope= 2.00
x= 2 exact slope= 4.00
x= 3 exact slope= 6.00
Matches the numerical estimates almost perfectly.
The Derivative Tells You Which Way to Move
Here is where it connects to AI.
Your loss function is a curve. Your model weight is x. You want to find the x that minimizes the loss, the bottom of the bowl.
The derivative tells you the slope at your current position. If slope is positive, moving x right makes things worse. Move left instead. If slope is negative, moving x left makes things worse. Move right instead.
Always move opposite to the slope.
def loss(w):
    return (w - 3) ** 2

def loss_derivative(w):
    return 2 * (w - 3)

w = 8.0
learning_rate = 0.1
print("Starting weight optimization:")
print(f"Start: w={w:.2f}, loss={loss(w):.4f}")
for step in range(10):
    gradient = loss_derivative(w)
    w = w - learning_rate * gradient
    print(f"Step {step+1}: w={w:.4f}, loss={loss(w):.6f}")
Output:
Starting weight optimization:
Start: w=8.00, loss=25.0000
Step 1: w=7.0000, loss=16.000000
Step 2: w=6.2000, loss=10.240000
Step 3: w=5.5600, loss=6.553600
Step 4: w=5.0480, loss=4.194304
Step 5: w=4.6384, loss=2.684355
Step 6: w=4.3107, loss=1.717987
Step 7: w=4.0486, loss=1.099511
Step 8: w=3.8389, loss=0.703687
Step 9: w=3.6711, loss=0.450360
Step 10: w=3.5369, loss=0.288230
The weight started at 8. The true minimum is at 3 (where loss = 0). Each step moves it closer. Loss drops from 25 down to 0.28 in ten steps.
This is gradient descent. One weight. Ten steps. The principle scales to millions of weights.
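One caveat worth seeing once: the learning rate decides whether this works at all. Too small and progress crawls; too large and each step overshoots the minimum, so the loss grows instead of shrinking. A quick sketch using the same toy loss (the three rates here are arbitrary illustrative choices):

```python
def loss(w):
    return (w - 3) ** 2

def loss_derivative(w):
    return 2 * (w - 3)

for lr in [0.01, 0.1, 1.1]:
    w = 8.0
    for _ in range(10):
        w = w - lr * loss_derivative(w)
    print(f"lr={lr}: after 10 steps, w={w:.4f}, loss={loss(w):.4f}")
```

With lr=0.01 the weight barely moves, with lr=0.1 it converges nicely, and with lr=1.1 it diverges. Picking a good learning rate is one of the recurring practical problems in training.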
Partial Derivatives: Multiple Weights
Real models have many weights, not one. When you have multiple variables, you compute partial derivatives: the derivative of the loss with respect to each weight separately, treating all the others as constants.
def loss_two_weights(w1, w2):
    return (w1 - 2) ** 2 + (w2 - 5) ** 2

def partial_w1(w1, w2):
    return 2 * (w1 - 2)

def partial_w2(w1, w2):
    return 2 * (w2 - 5)

w1, w2 = 0.0, 0.0
lr = 0.1
print("Optimizing two weights:")
for step in range(15):
    g1 = partial_w1(w1, w2)
    g2 = partial_w2(w1, w2)
    w1 = w1 - lr * g1
    w2 = w2 - lr * g2
    current_loss = loss_two_weights(w1, w2)
    if step % 5 == 0:
        print(f"Step {step:2d}: w1={w1:.4f}, w2={w2:.4f}, loss={current_loss:.6f}")
print(f"\nFinal: w1={w1:.4f} (target 2.0), w2={w2:.4f} (target 5.0)")
Output:
Optimizing two weights:
Step  0: w1=0.4000, w2=1.0000, loss=18.560000
Step  5: w1=1.4757, w2=3.6893, loss=1.992865
Step 10: w1=1.8282, w2=4.5705, loss=0.213982

Final: w1=1.9296 (target 2.0), w2=4.8241 (target 5.0)
Two weights converging toward their optimal values simultaneously. In a neural network with ten million weights, the same thing happens. Every weight gets its own partial derivative. Every weight gets nudged in the right direction.
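If you want to convince yourself the hand-written partials are right, you can check each one numerically the same way as before: nudge one weight, hold the other fixed. A small sketch reusing the two-weight loss from above (the test point is an arbitrary choice):

```python
def loss_two_weights(w1, w2):
    return (w1 - 2) ** 2 + (w2 - 5) ** 2

h = 1e-6
w1, w2 = 0.7, 1.3  # arbitrary test point

# Partial w.r.t. w1: nudge w1 only, treat w2 as a constant.
num_g1 = (loss_two_weights(w1 + h, w2) - loss_two_weights(w1, w2)) / h
# Partial w.r.t. w2: nudge w2 only, treat w1 as a constant.
num_g2 = (loss_two_weights(w1, w2 + h) - loss_two_weights(w1, w2)) / h

print(f"numerical: ({num_g1:.4f}, {num_g2:.4f})")
print(f"analytic:  ({2 * (w1 - 2):.4f}, {2 * (w2 - 5):.4f})")
```

This "gradient check" trick works for any number of weights and is a standard way to debug hand-derived gradients.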
PyTorch Does This Automatically
In real deep learning you never compute derivatives by hand. PyTorch has automatic differentiation built in. It tracks every operation you do on tensors and computes the derivatives automatically.
import torch
w = torch.tensor(8.0, requires_grad=True)
loss = (w - 3) ** 2
loss.backward()
print(f"Weight: {w.item():.2f}")
print(f"Loss: {loss.item():.2f}")
print(f"Gradient (derivative): {w.grad.item():.2f}")
Output:
Weight: 8.00
Loss: 25.00
Gradient (derivative): 10.00
requires_grad=True tells PyTorch to track this tensor for differentiation. .backward() computes all derivatives automatically. w.grad contains the derivative of the loss with respect to w.
At w=8, the loss is (8-3)² = 25, and the derivative is 2*(8-3) = 10. Correct.
You will use loss.backward() in every single training loop for the rest of this series.
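To connect it back to the manual loop from earlier, here is the same one-weight gradient descent written the PyTorch way. This is a sketch of the pattern, not a full training loop; real code would usually hand the update off to an optimizer such as torch.optim.SGD instead of writing it by hand.

```python
import torch

w = torch.tensor(8.0, requires_grad=True)
lr = 0.1

for step in range(10):
    loss = (w - 3) ** 2
    loss.backward()            # compute d(loss)/dw, stored in w.grad
    with torch.no_grad():      # do the update outside of autograd tracking
        w -= lr * w.grad
    w.grad.zero_()             # gradients accumulate; reset before next step

print(f"w after 10 steps: {w.item():.4f}")  # approaches the minimum at 3
```

Note the two details that trip people up: the update happens inside torch.no_grad() so PyTorch does not try to differentiate through it, and w.grad must be zeroed each step because backward() accumulates gradients rather than overwriting them.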
Try This
Create derivatives_practice.py.
Part one: write a function for f(x) = x³ - 6x² + 9x + 1. Compute the numerical derivative at x = 0, 1, 2, 3, 4, 5 using a tiny step h. Print the slope at each point. Where is the slope closest to zero? That is a local minimum or maximum.
Part two: implement one-weight gradient descent for this loss function.
def loss(w):
    return w**4 - 4*w**2 + w
Start at w = 2.0. Use learning rate 0.01. Run 100 steps. Print the weight and loss every 20 steps. Does it converge to a minimum?
Part three: use PyTorch to verify your answer. Create a tensor w = torch.tensor(2.0, requires_grad=True). Compute the same loss. Call .backward(). Print w.grad. Does it match your numerical derivative from part two?
What's Next
You understand derivatives now. One weight, one step. But real training involves millions of weights updated simultaneously, thousands of times.
The algorithm that does all of that is gradient descent. It is built entirely on what you just learned, extended to work at scale. That is next.
Top comments (2)
The numerical derivative section unlocked something for me that the formula-first explanations never did. Showing that you can just measure the slope directly—pick a point, nudge a tiny bit, divide the change—makes the whole concept feel less like math you have to trust and more like something you could have invented yourself if you needed it. The formula is just the shortcut, not the thing.
What I'm chewing on is how this framing changes the mental model of what gradient descent is actually doing. It's not optimizing in some mystical sense. It's just a hill-climbing algorithm that happens to know which direction is downhill because the derivative tells it the slope. The intelligence isn't in the optimizer. The intelligence is in the shape of the loss surface, and the derivative is just the compass. A dumb process following a reliable signal.
The jump from one weight to "millions of weights, same thing" is where most people's intuition breaks, but the partial derivative example with two weights actually bridges that gap cleanly. Each weight gets its own compass. They all move at the same time. The fact that this scales is still kind of remarkable, but it stops feeling like magic.
Makes me wonder how many people bounce off deep learning not because the concepts are hard but because the explanations start with the chain rule and vector calculus instead of starting with "here's a curve, here's a point, let's poke it and see which way is down." How different would the field look if the pedagogical order were reversed everywhere, not just in posts like this one?
Glad you felt that, that was the goal❤️