
shangkyu shin

Posted on • Originally published at zeromathai.com

How Neural Networks Actually Learn: Backpropagation, Gradients, and the Training Loop (Developer Guide)

Learn how neural networks train using forward propagation, loss functions, and backpropagation. This developer-focused guide explains gradients, the chain rule, and autograd with practical intuition.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/training-signals-back-fundamentals-en/


The Real Mechanism

Neural networks don’t “learn” in a human sense.

They optimize.

Every step is:

forward → loss → backward → update
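
The four phases above can be sketched on a one-parameter toy model. Everything here (the model `y = w * x`, the squared-error loss, the values) is illustrative, not from the article:

```python
# One training step for a toy model y = w * x with squared-error loss.
w = 0.5                 # single parameter
x, y_true = 2.0, 3.0    # one training example
lr = 0.1                # learning rate (η)

# forward
y_pred = w * x

# loss
loss = (y_pred - y_true) ** 2

# backward: dL/dw = 2 * (y_pred - y_true) * x
grad_w = 2 * (y_pred - y_true) * x

# update
w = w - lr * grad_w

print(w)  # 1.3 — moved toward the optimum w = y_true / x = 1.5
```

Repeating this loop over many examples is all "learning" is.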


Training vs Inference

Training:

  • compute loss
  • run backward()
  • update weights

Inference:

  • forward only

No backward pass = no learning


Two Signals

  • Forward → prediction
  • Backward → gradients

Think:

  • forward = what happened
  • backward = how to fix it

Loss Function (Why Errors Matter)

Binary Cross-Entropy example (true label y = 1, prediction ŷ):

ŷ = 0.8 → loss ≈ 0.223

ŷ = 0.1 → loss ≈ 2.302

Key idea:

Worse predictions produce larger losses and stronger gradients, so the update pushes hardest where the model is most wrong.
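
A quick check of those numbers, plus the gradient that backs up the key idea (the `bce` helper is just for illustration):

```python
import math

def bce(y, y_hat):
    # binary cross-entropy for a single example
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(bce(1, 0.8), 3))  # 0.223
print(round(bce(1, 0.1), 3))  # 2.303

# gradient dL/dŷ = -1/ŷ when y = 1: the worse prediction gets the bigger push
print(-1 / 0.8, -1 / 0.1)  # -1.25 -10.0
```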


Gradients = Direction

gradient = ∂loss / ∂parameter

This tells us how to update weights.


Chain Rule (Core)

y = f(g(h(x)))

With intermediates u = h(x) and v = g(u):

dL/dx = dL/dy · dy/dv · dv/du · du/dx
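
A numeric sanity check of the chain rule, with illustrative choices h(x) = x + 1, g(u) = 2u, f(v) = v²:

```python
def h(x): return x + 1
def g(u): return 2 * u
def f(v): return v ** 2

x = 3.0
v = g(h(x))          # store intermediate from the forward pass

# chain rule: df/dv = 2v, dg/du = 2, dh/dx = 1
dy_dx = (2 * v) * 2 * 1

# finite-difference check of the same derivative
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)

print(dy_dx, round(numeric, 4))  # both 32.0
```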


The Only Rule You Need

gradient = upstream × local derivative


Example: Multiplication

z = x * y

dL/dx = dL/dz * y

dL/dy = dL/dz * x
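
Plugging in illustrative numbers (the upstream value is arbitrary):

```python
x, y = 3.0, 4.0
upstream = 0.5          # dL/dz arriving from later in the network

# each input's gradient = upstream × the OTHER input (the local derivative)
dL_dx = upstream * y    # 2.0
dL_dy = upstream * x    # 1.5
```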


Example: Square

out = x²

grad = upstream * 2x
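
Same rule, checked numerically (values are illustrative):

```python
x = 3.0
upstream = 1.0

grad = upstream * 2 * x   # 6.0

# finite-difference check of d(x²)/dx at x = 3
eps = 1e-6
numeric = ((x + eps) ** 2 - (x - eps) ** 2) / (2 * eps)
```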


Why Autograd Exists

Manual chain rule doesn’t scale.

Frameworks:

  • store forward values
  • build computation graph
  • apply backward automatically

What Happens in Code

```python
y_pred = model(x)            # forward pass
loss = criterion(y_pred, y)  # compute loss
loss.backward()              # backward pass: fills .grad on each parameter
optimizer.step()             # update parameters using those gradients
optimizer.zero_grad()        # reset gradients before the next iteration
```


Important Implementation Details

Why zero_grad()?

Gradients accumulate by default.

Without reset:

grad_total = grad_step1 + grad_step2 + ...


Why backward() first?

Because gradients must exist before updating:

loss.backward() → gradients computed

optimizer.step() → parameters updated


Why reverse traversal?

Because each node's gradient depends on the gradient flowing back from the nodes after it (the upstream gradient).

So computation flows:

output → input


Computational Graph Intuition

  • forward = build graph
  • backward = traverse graph in reverse

Intermediate results are reused.


Gradient Descent

θ = θ − η ∇L

η = learning rate


Final Takeaway

Neural networks learn by:

propagating error backward and updating parameters

That’s the entire system.


What helped you understand backprop the most—math, visualization, or code? Let’s discuss 👇
