DEV Community

Akhilesh

Posted on
Derivatives: Understanding Change

Your model made a prediction.

The prediction was wrong.

Not just wrong. You have a number that tells you exactly how wrong. That number is the loss. High loss means bad prediction. Zero loss means perfect.

Now what?

You need to adjust the model's weights to reduce the loss. But there are millions of weights. You cannot try every possible combination. You need a smarter approach.

The smarter approach is this: figure out how the loss changes when you nudge each weight slightly. If nudging a weight upward increases the loss, push it downward instead. If nudging it downward increases the loss, push it upward. Follow the direction that reduces loss.

The mathematical tool that measures "how does the output change when I nudge the input" is the derivative.


The Core Idea Without Calculus

Forget the formal definition for a moment.

A derivative at a point is just the slope of the curve at that point.

Slope you already know. Rise over run. How much does y change when x changes. On a straight line the slope is constant. On a curve it changes at every point.
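A quick sanity check of rise over run (my addition, not part of the original post): on the straight line y = 5x + 2 the ratio is the same no matter which two points you pick, while on y = x² it depends on where you measure.

```python
def line(x):
    return 5 * x + 2  # a straight line: slope should be 5 everywhere

def curve(x):
    return x ** 2     # a curve: slope depends on where you measure

def slope_between(f, x1, x2):
    # rise over run between two points on f
    return (f(x2) - f(x1)) / (x2 - x1)

print(slope_between(line, 0, 1))    # 5.0
print(slope_between(line, 10, 20))  # 5.0, same everywhere
print(slope_between(curve, 0, 1))   # 1.0
print(slope_between(curve, 2, 3))   # 5.0, different on a curve
```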

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = x ** 2

plt.figure(figsize=(8, 5))
plt.plot(x, y, 'b-', linewidth=2, label='y = x²')
plt.axhline(y=0, color='gray', linewidth=0.5)
plt.axvline(x=0, color='gray', linewidth=0.5)
plt.title('The slope changes at every point on this curve')
plt.grid(True, alpha=0.3)
plt.legend()
plt.savefig('curve.png', dpi=100, bbox_inches='tight')
plt.close()
print("Plot saved")

On the curve y = x², the slope is zero at the bottom, steep and positive on the right, steep and negative on the left. The slope at every point is its derivative at that point.


Measuring Slope Numerically

Before the formula, see how the derivative actually works by measuring it directly.

Pick a point. Move a tiny bit to the right. Measure how much y changed. Divide by how much x changed. That ratio is approximately the derivative.

def f(x):
    return x ** 2

x = 3.0
h = 0.0001     # tiny step

slope = (f(x + h) - f(x)) / h
print(f"Numerical derivative at x=3: {slope:.4f}")

Output:

Numerical derivative at x=3: 6.0001

The derivative of x² at x=3 is exactly 6. The numerical estimate is close but not exact. The tiny error comes from h being small but not infinitesimal.
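Not in the original post, but worth seeing: the error shrinks as h shrinks, and the central difference (f(x+h) - f(x-h)) / 2h is more accurate still, because the errors on either side of the point cancel.

```python
def f(x):
    return x ** 2

x = 3.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    forward = (f(x + h) - f(x)) / h            # one-sided estimate
    central = (f(x + h) - f(x - h)) / (2 * h)  # two-sided estimate
    print(f"h={h:<7} forward={forward:.6f}  central={central:.6f}")
```

The forward difference overshoots by roughly h; the central difference lands on 6 almost exactly at every step size.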

Try different values of x.

for x in [-3, -2, -1, 0, 1, 2, 3]:
    slope = (f(x + h) - f(x)) / h
    print(f"x={x:2d}  slope={slope:.2f}")

Output:

x=-3  slope=-6.00
x=-2  slope=-4.00
x=-1  slope=-2.00
x= 0  slope= 0.00
x= 1  slope= 2.00
x= 2  slope= 4.00
x= 3  slope= 6.00

At x=0 the slope is 0. That is the bottom of the bowl. At x=3 the slope is 6, positive, the curve is rising. At x=-3 the slope is -6, negative, the curve is falling.

See the pattern? The derivative of x² is 2x. At any point x, the slope is twice that x value.


Why This Pattern Exists

You do not need to derive formulas by hand. Libraries handle that. But knowing the common patterns helps you read AI papers and code.

| Function | Derivative | What it means |
|----------|------------|---------------|
| x² | 2x | slope grows as x grows |
| x³ | 3x² | always positive slope except at zero |
| 5x | 5 | constant slope, it's a line |
| constant | 0 | flat line, no slope |
| e^x | e^x | slope equals value, unique property |

The pattern for xⁿ is always n * x^(n-1). Bring the exponent down, reduce the power by one.
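You can check the power rule numerically the same way we checked x². A quick sketch (my addition): measure the slope and compare it to n * x^(n-1) for a few exponents.

```python
def numerical_slope(f, x, h=1e-6):
    # one-sided numerical derivative, same trick as before
    return (f(x + h) - f(x)) / h

x = 2.0
for n in [1, 2, 3, 4]:
    measured = numerical_slope(lambda v: v ** n, x)
    exact = n * x ** (n - 1)   # the power rule
    print(f"x^{n}: measured={measured:.4f}, power rule={exact:.4f}")
```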

def derivative_x_squared(x):
    return 2 * x

for x in [-3, -2, -1, 0, 1, 2, 3]:
    print(f"x={x:2d}  exact slope={derivative_x_squared(x):.2f}")

Output:

x=-3  exact slope=-6.00
x=-2  exact slope=-4.00
x=-1  exact slope=-2.00
x= 0  exact slope= 0.00
x= 1  exact slope= 2.00
x= 2  exact slope= 4.00
x= 3  exact slope= 6.00

Matches the numerical estimates almost perfectly.


The Derivative Tells You Which Way to Move

Here is where it connects to AI.

Your loss function is a curve. Your model weight is x. You want to find the x that minimizes the loss, the bottom of the bowl.

The derivative tells you the slope at your current position. If slope is positive, moving x right makes things worse. Move left instead. If slope is negative, moving x left makes things worse. Move right instead.

Always move opposite to the slope.

def loss(w):
    return (w - 3) ** 2

def loss_derivative(w):
    return 2 * (w - 3)

w = 8.0
learning_rate = 0.1

print("Starting weight optimization:")
print(f"Start: w={w:.2f}, loss={loss(w):.4f}")

for step in range(10):
    gradient = loss_derivative(w)
    w = w - learning_rate * gradient
    print(f"Step {step+1}: w={w:.4f}, loss={loss(w):.6f}")

Output:

Starting weight optimization:
Start: w=8.00, loss=25.0000
Step 1:  w=7.0000, loss=16.000000
Step 2:  w=6.2000, loss=10.240000
Step 3:  w=5.5600, loss=6.553600
Step 4:  w=5.0480, loss=4.194304
Step 5:  w=4.6384, loss=2.684355
Step 6:  w=4.3107, loss=1.717987
Step 7:  w=4.0486, loss=1.099511
Step 8:  w=3.8389, loss=0.703687
Step 9:  w=3.6711, loss=0.450360
Step 10: w=3.5369, loss=0.288230

The weight started at 8. The true minimum is at 3 (where loss = 0). Each step moves it closer. Loss drops from 25 down to 0.28 in ten steps.

This is gradient descent. One weight. Ten steps. The principle scales to millions of weights.
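One knob glossed over so far is the learning rate. A quick sketch (my addition, same toy loss as above) of what happens when it is too small, about right, and too large:

```python
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

for lr in [0.01, 0.1, 1.1]:   # too small, about right, too large
    w = 8.0
    for _ in range(10):
        w = w - lr * grad(w)  # same update rule as before
    print(f"lr={lr}: w after 10 steps = {w:.4f}, loss = {loss(w):.4f}")
```

Too small and the weight barely moves. Too large and each step overshoots the minimum by more than the last, so the loss grows instead of shrinking.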


Partial Derivatives: Multiple Weights

Real models have many weights, not one. With multiple variables you compute partial derivatives: the derivative of the loss with respect to each weight separately, treating all the others as constants.

def loss_two_weights(w1, w2):
    return (w1 - 2) ** 2 + (w2 - 5) ** 2

def partial_w1(w1, w2):
    return 2 * (w1 - 2)

def partial_w2(w1, w2):
    return 2 * (w2 - 5)

w1, w2 = 0.0, 0.0
lr = 0.1

print("Optimizing two weights:")
for step in range(15):
    g1 = partial_w1(w1, w2)
    g2 = partial_w2(w1, w2)
    w1 = w1 - lr * g1
    w2 = w2 - lr * g2
    current_loss = loss_two_weights(w1, w2)
    if step % 5 == 0:
        print(f"Step {step:2d}: w1={w1:.4f}, w2={w2:.4f}, loss={current_loss:.6f}")

print(f"\nFinal: w1={w1:.4f} (target 2.0), w2={w2:.4f} (target 5.0)")

Output:

Optimizing two weights:
Step  0: w1=0.4000, w2=1.0000, loss=18.560000
Step  5: w1=1.4757, w2=3.6893, loss=1.992865
Step 10: w1=1.8282, w2=4.5705, loss=0.213982

Final: w1=1.9296 (target 2.0), w2=4.8241 (target 5.0)

Two weights converging toward their optimal values simultaneously. In a neural network with ten million weights, the same thing happens. Every weight gets its own partial derivative. Every weight gets nudged in the right direction.
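That per-weight update is usually written as one vectorized step. A minimal NumPy sketch (the four target values here are my own illustration, not from the post): the subtraction computes every partial derivative at once, and the update nudges every weight at once.

```python
import numpy as np

targets = np.array([2.0, 5.0, -1.0, 0.5])  # assumed optimal values
w = np.zeros(4)                            # all weights start at 0
lr = 0.1

def loss(w):
    return np.sum((w - targets) ** 2)

def grad(w):
    return 2 * (w - targets)   # one partial derivative per weight

for step in range(50):
    w = w - lr * grad(w)       # every weight nudged simultaneously

print("weights:", np.round(w, 4))
print("loss:", loss(w))
```

Same code whether the vector holds four weights or ten million.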


PyTorch Does This Automatically

In real deep learning you never compute derivatives by hand. PyTorch has automatic differentiation built in. It tracks every operation you do on tensors and computes the derivatives automatically.

import torch

w = torch.tensor(8.0, requires_grad=True)

loss = (w - 3) ** 2

loss.backward()

print(f"Weight: {w.item():.2f}")
print(f"Loss: {loss.item():.2f}")
print(f"Gradient (derivative): {w.grad.item():.2f}")

Output:

Weight: 8.00
Loss: 25.00
Gradient (derivative): 10.00

requires_grad=True tells PyTorch to track this tensor for differentiation. .backward() computes all derivatives automatically. w.grad contains the derivative of the loss with respect to w.

At w=8, the loss is (8-3)² = 25, and the derivative is 2*(8-3) = 10. Correct.

You will use loss.backward() in every single training loop for the rest of this series.
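Here is what that looks like inside a loop, with autograd replacing the hand-written derivative. A sketch of the standard pattern, assuming the same toy loss as above; note that PyTorch accumulates gradients, so they must be zeroed between steps.

```python
import torch

w = torch.tensor(8.0, requires_grad=True)
lr = 0.1

for step in range(10):
    loss = (w - 3) ** 2
    loss.backward()           # fills w.grad with d(loss)/dw
    with torch.no_grad():     # update outside the autograd graph
        w -= lr * w.grad
    w.grad.zero_()            # gradients accumulate unless cleared

print(f"w = {w.item():.4f}")  # approaches 3, matching the hand-written version
```

In real training loops an optimizer object does the update and the zeroing for you, but this is what it is doing underneath.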


Try This

Create derivatives_practice.py.

Part one: write a function for f(x) = x³ - 6x² + 9x + 1. Compute the numerical derivative at x = 0, 1, 2, 3, 4, 5 using a tiny step h. Print the slope at each point. Where is the slope closest to zero? That is a local minimum or maximum.

Part two: implement one-weight gradient descent for this loss function.

def loss(w):
    return w**4 - 4*w**2 + w

Start at w = 2.0. Use learning rate 0.01. Run 100 steps. Print the weight and loss every 20 steps. Does it converge to a minimum?

Part three: use PyTorch to verify your answer. Create a tensor w = torch.tensor(2.0, requires_grad=True). Compute the same loss. Call .backward(). Print w.grad. Does it match your numerical derivative from part two?


What's Next

You understand derivatives now. One weight, one step. But real training involves millions of weights updated simultaneously, thousands of times.

The algorithm that does all of that is gradient descent. It is built entirely on what you just learned, extended to work at scale. That is next.

Top comments (2)

PEACEBINFLOW

The numerical derivative section unlocked something for me that the formula-first explanations never did. Showing that you can just measure the slope directly—pick a point, nudge a tiny bit, divide the change—makes the whole concept feel less like math you have to trust and more like something you could have invented yourself if you needed it. The formula is just the shortcut, not the thing.

What I'm chewing on is how this framing changes the mental model of what gradient descent is actually doing. It's not optimizing in some mystical sense. It's just a hill-climbing algorithm that happens to know which direction is downhill because the derivative tells it the slope. The intelligence isn't in the optimizer. The intelligence is in the shape of the loss surface, and the derivative is just the compass. A dumb process following a reliable signal.

The jump from one weight to "millions of weights, same thing" is where most people's intuition breaks, but the partial derivative example with two weights actually bridges that gap cleanly. Each weight gets its own compass. They all move at the same time. The fact that this scales is still kind of remarkable, but it stops feeling like magic.

Makes me wonder how many people bounce off deep learning not because the concepts are hard but because the explanations start with the chain rule and vector calculus instead of starting with "here's a curve, here's a point, let's poke it and see which way is down." How different would the field look if the pedagogical order were reversed everywhere, not just in posts like this one?

Akhilesh

Glad you felt that, that was the goal❤️