Gradient descent, explained by rolling downhill

#ai #optimization #machinelearning #python

Behind nearly every trained neural network, whether it is built from attention layers, scaled up into a giant language model, or shrunk to run on-device, is the same core training algorithm: gradient descent. It sounds like calculus you'd rather forget, but the idea is a ball rolling downhill, and the code is about three lines. Build it once and "training a model" stops being mysterious.

The one idea: follow the slope downhill

Suppose you have a function that measures how wrong your model is: the loss. Training means finding the parameter values that make the loss as small as possible, the bottom of the valley. You can't see the whole landscape, but at any point you can feel the slope under your feet. So you take a small step in the downhill direction, and repeat. Roll downhill, step by step, until you reach the bottom.

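The slope is the gradient. In one dimension it is just the derivative; with several parameters it is a vector of slopes, one per parameter, and it points in the direction the function increases fastest. So to go down, you step in the opposite direction of the gradient. That's the whole algorithm.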
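If you'd rather feel the slope than take it on faith, here is a minimal sketch that estimates it numerically with a central difference. The function f(x) = x² and the helper name numerical_slope are assumptions chosen for illustration; they are not part of the algorithm itself.

```python
def numerical_slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x). The step h is an assumed small value."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2   # illustrative bowl with its lowest point at x = 0

for x in (-2.0, 0.0, 2.0):
    slope = numerical_slope(f, x)
    if abs(slope) < 1e-9:
        downhill = "nowhere, this is the bottom"
    elif slope > 0:
        downhill = "to the left (decrease x)"
    else:
        downhill = "to the right (increase x)"
    print(f"x = {x:+.1f}: slope ~ {slope:+.3f}, downhill is {downhill}")
```

A positive slope means the function rises to the right, so downhill is to the left, and the other way round for a negative slope. That is the "step opposite the gradient" rule in one dimension.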

Minimize a parabola

Take f(x) = (x - 3)². Its lowest point is obviously at x = 3. Its slope is f'(x) = 2(x - 3). Gradient descent finds the minimum without knowing the answer in advance:

def gradient_descent(grad, x, lr=0.1, steps=50):
    for _ in range(steps):
        x = x - lr * grad(x)     # step opposite the slope
    return x

print(gradient_descent(lambda x: 2 * (x - 3), x=0.0))   # -> ~3.0

Start at 0, and each step nudges x toward 3, the bottom of the bowl. That single line, x = x - lr * grad(x), is gradient descent. Everything else in machine learning is computing grad for more complicated functions.
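To actually watch the ball roll, here is a sketch of the same loop that also records every intermediate x. The name gradient_descent_trace and the choice of which steps to print are assumptions made for this illustration.

```python
def gradient_descent_trace(grad, x, lr=0.1, steps=50):
    """Same update rule as gradient_descent, but keeps every x along the way."""
    history = [x]
    for _ in range(steps):
        x = x - lr * grad(x)     # step opposite the slope
        history.append(x)
    return history

history = gradient_descent_trace(lambda x: 2 * (x - 3), x=0.0)

# Print a handful of steps to see x closing in on 3.
for step in (0, 1, 2, 5, 10, 50):
    print(f"step {step:2d}: x = {history[step]:.5f}")
```

With lr=0.1 on this parabola, each step multiplies the remaining distance to 3 by 1 - 2 * 0.1 = 0.8, so the early steps are big and the later ones get smaller and smaller as the slope flattens out.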

Two details that matter:

The learning rate lr is your step size. Too small and it crawls; too large and it overshoots the bottom and can bounce out to infinity. Tuning it is most of the practical art.
You step against the gradient. The minus sign is the whole point. Flip it and you'd climb the hill, maximizing the loss, which is exactly backwards.
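Both details are easy to try out. The sketch below reruns the parabola with a few learning rates and then with the sign flipped. The specific lr values and the helper gradient_ascent are assumptions picked to make each behavior visible.

```python
def gradient_descent(grad, x, lr=0.1, steps=50):
    for _ in range(steps):
        x = x - lr * grad(x)     # step opposite the slope
    return x

def gradient_ascent(grad, x, lr=0.1, steps=50):
    for _ in range(steps):
        x = x + lr * grad(x)     # flipped sign: climbs instead of descending
    return x

def grad(x):
    return 2 * (x - 3)           # slope of f(x) = (x - 3)^2

for lr in (0.001, 0.1, 1.0, 1.1):
    print(f"lr = {lr}: x = {gradient_descent(grad, x=0.0, lr=lr)}")

print(f"sign flipped: x = {gradient_ascent(grad, x=0.0)}")
```

For this particular parabola, each step multiplies the distance from 3 by (1 - 2 * lr). With lr=0.001 that factor is so close to 1 that 50 steps barely move x. With lr=1.0 the factor is exactly -1, so x bounces between 0 and 6 forever. Past that, each bounce lands farther away than the last. The threshold depends on how steep the function is, so other losses will have other limits.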

Fit a line to data

Now the real thing. Given points, find the line y = mx + b that best fits them. "Best" means smallest mean squared error. The loss is mean((mx + b − y)²), and its gradients with respect to m and b follow from the chain rule; the code computes them as dm and db:
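For reference, here is that loss and its two gradients written out. The symbols L, n, x_i and y_i are notation assumed here for the loss, the number of points, and the individual data points.

```latex
L(m, b) = \frac{1}{n} \sum_{i=1}^{n} (m x_i + b - y_i)^2

\frac{\partial L}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} (m x_i + b - y_i)\, x_i
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (m x_i + b - y_i)
```

The factor of 2 comes from differentiating the square, and the extra x_i in the first expression is the derivative of (m x_i + b) with respect to m. The function that follows computes these same two sums on every step: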

def fit_line(xs, ys, lr=0.01, steps=2000):
    m, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errs = [(m * x + b) - y for x, y in zip(xs, ys)]   # prediction minus truth
        dm = (2 / n) * sum(e * x for e, x in zip(errs, xs))  # slope wrt m
        db = (2 / n) * sum(errs)                              # slope wrt b
        m -= lr * dm
        b -= lr * db
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # perfectly y = 2x
print(fit_line(xs, ys))        # -> (~2.0, ~0.0)

No formula for the answer, no library. We started with a flat line (m=0, b=0), measured how wrong it was, and rolled downhill in m and b until it matched. That is linear regression, trained, and it is the same loop that trains a neural network, just with millions of parameters instead of two.
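To see the "measured how wrong it was" part happen, here is a sketch of the same loop that also logs the loss as it goes. The helper mse, the name fit_line_with_loss, and the log_every parameter are assumptions added for illustration.

```python
def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b on the data."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line_with_loss(xs, ys, lr=0.01, steps=2000, log_every=500):
    m, b, n = 0.0, 0.0, len(xs)
    losses = [mse(m, b, xs, ys)]                 # loss of the flat starting line
    for step in range(1, steps + 1):
        errs = [(m * x + b) - y for x, y in zip(xs, ys)]
        dm = (2 / n) * sum(e * x for e, x in zip(errs, xs))
        db = (2 / n) * sum(errs)
        m -= lr * dm
        b -= lr * db
        if step % log_every == 0:
            losses.append(mse(m, b, xs, ys))     # record progress now and then
    return m, b, losses

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
m, b, losses = fit_line_with_loss(xs, ys)
for i, loss in enumerate(losses):
    print(f"after {i * 500:4d} steps: loss = {loss:.8f}")
```

The first logged value is the loss of the flat line, and every later one should be smaller. A loss that goes up instead is usually the first sign that the learning rate is too large.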

From two parameters to a billion

A neural network is the exact same idea at scale:

The model has millions or billions of parameters instead of m and b.
The loss measures how wrong its predictions are.
Backpropagation is just an efficient way to compute the gradient of the loss with respect to every parameter at once (we build it from scratch in "Backpropagation explained by coding it").
Then the same update runs: param -= lr * grad, for every parameter, over and over.
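The split in that list can be sketched in code: one function that produces a gradient for every parameter, and one update that doesn't care how many parameters there are. The names update, line_grads and train, and the choice to store parameters in a dict, are assumptions for illustration. The gradients here are still the hand-derived ones for a line; in a real network backpropagation would fill that role.

```python
def update(params, grads, lr):
    """Apply param -= lr * grad to every parameter, whatever it is called."""
    return {name: value - lr * grads[name] for name, value in params.items()}

def line_grads(params, xs, ys):
    """Hand-derived MSE gradients for y = m*x + b (a stand-in for backpropagation)."""
    n = len(xs)
    errs = [params["m"] * x + params["b"] - y for x, y in zip(xs, ys)]
    return {
        "m": (2 / n) * sum(e * x for e, x in zip(errs, xs)),
        "b": (2 / n) * sum(errs),
    }

def train(params, grad_fn, lr=0.01, steps=2000):
    for _ in range(steps):
        params = update(params, grad_fn(params), lr)
    return params

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
params = train({"m": 0.0, "b": 0.0}, lambda p: line_grads(p, xs, ys))
print(params)   # m close to 2, b close to 0
```

Nothing in update or train knows there are only two parameters. Swap in a bigger dict and a different grad_fn and the loop stays the same.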

That's it. "Training GPT" and "fitting a line" are the same algorithm at heart; what differs is the number of parameters, the way you compute the gradient, and refinements to the update step.

Why this is worth building

Gradient descent is the one algorithm that connects a textbook parabola to the largest models on Earth. Once you've watched x = x - lr * grad(x) walk to the bottom of a bowl, the whole field reorganizes around it: a model is a function with parameters, a loss says how wrong it is, and training is rolling those parameters downhill. The learning rate, overfitting, why training "diverges": all of it is easier to reason about from this picture.

If you want to build outward from here (backprop, real networks, the optimizers that improve on plain descent), that path is the AI track, where you train the thing from the gradient up instead of calling .fit().