
shangkyu shin

Posted on • Originally published at zeromathai.com

How Neural Networks Actually Learn: Backpropagation, Gradients, and the Training Loop (Developer Guide)

Learn how neural networks train using forward propagation, loss functions, and backpropagation. This developer-focused guide explains gradients, the chain rule, and autograd with practical intuition.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/training-signals-back-fundamentals-en/


The Real Mechanism

Neural networks don’t “learn” in a human sense.

They optimize.

Every step is:

forward → loss → backward → update
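
The four phases above can be sketched on a one-parameter toy model. Everything here (the model `y = w * x`, the squared-error loss, the values) is illustrative, not from the article:

```python
# One training step for a toy model y = w * x with squared-error loss.
w = 0.5                 # single parameter
x, y_true = 2.0, 3.0    # one training example
lr = 0.1                # learning rate (η)

# forward
y_pred = w * x

# loss
loss = (y_pred - y_true) ** 2

# backward: dL/dw = 2 * (y_pred - y_true) * x
grad_w = 2 * (y_pred - y_true) * x

# update
w = w - lr * grad_w

print(w)  # 1.3 — moved toward the optimum w = y_true / x = 1.5
```

Repeating this loop over many examples is all "learning" is.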


Training vs Inference

Training:

  • compute loss
  • run backward()
  • update weights

Inference:

  • forward only

No backward pass = no learning


Two Signals

  • Forward → prediction
  • Backward → gradients

Think:

  • forward = what happened
  • backward = how to fix it

Loss Function (Why Errors Matter)

Binary Cross-Entropy example (true label y = 1, prediction ŷ):

ŷ = 0.8 → loss ≈ 0.223

ŷ = 0.1 → loss ≈ 2.302

Key idea:

Worse predictions produce larger losses and stronger gradients, so the update pushes hardest where the model is most wrong.
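
A quick check of those numbers, plus the gradient that backs up the key idea (the `bce` helper is just for illustration):

```python
import math

def bce(y, y_hat):
    # binary cross-entropy for a single example
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(bce(1, 0.8), 3))  # 0.223
print(round(bce(1, 0.1), 3))  # 2.303

# gradient dL/dŷ = -1/ŷ when y = 1: the worse prediction gets the bigger push
print(-1 / 0.8, -1 / 0.1)  # -1.25 -10.0
```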


Gradients = Direction

gradient = ∂loss / ∂parameter

This tells us how to update weights.


Chain Rule (Core)

y = f(g(h(x)))

With intermediates u = h(x) and v = g(u):

dL/dx = dL/dy · dy/dv · dv/du · du/dx
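
A numeric sanity check of the chain rule, with illustrative choices h(x) = x + 1, g(u) = 2u, f(v) = v²:

```python
def h(x): return x + 1
def g(u): return 2 * u
def f(v): return v ** 2

x = 3.0
v = g(h(x))          # store intermediate from the forward pass

# chain rule: df/dv = 2v, dg/du = 2, dh/dx = 1
dy_dx = (2 * v) * 2 * 1

# finite-difference check of the same derivative
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)

print(dy_dx, round(numeric, 4))  # both 32.0
```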


The Only Rule You Need

gradient = upstream × local derivative


Example: Multiplication

z = x * y

dL/dx = dL/dz * y

dL/dy = dL/dz * x
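
Plugging in illustrative numbers (the upstream value is arbitrary):

```python
x, y = 3.0, 4.0
upstream = 0.5          # dL/dz arriving from later in the network

# each input's gradient = upstream × the OTHER input (the local derivative)
dL_dx = upstream * y    # 2.0
dL_dy = upstream * x    # 1.5
```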


Example: Square

out = x²

grad = upstream * 2x
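
Same rule, checked numerically (values are illustrative):

```python
x = 3.0
upstream = 1.0

grad = upstream * 2 * x   # 6.0

# finite-difference check of d(x²)/dx at x = 3
eps = 1e-6
numeric = ((x + eps) ** 2 - (x - eps) ** 2) / (2 * eps)
```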


Why Autograd Exists

Manual chain rule doesn’t scale.

Frameworks:

  • store forward values
  • build computation graph
  • apply backward automatically

What Happens in Code

```python
y_pred = model(x)            # forward pass
loss = criterion(y_pred, y)  # compute loss
loss.backward()              # backward pass: fills .grad on each parameter
optimizer.step()             # update parameters using those gradients
optimizer.zero_grad()        # reset gradients before the next iteration
```


Important Implementation Details

Why zero_grad()?

Gradients accumulate by default.

Without reset:

grad_total = grad_step1 + grad_step2 + ...


Why backward() first?

Because gradients must exist before updating:

loss.backward() → gradients computed

optimizer.step() → parameters updated


Why reverse traversal?

Because each node's gradient depends on the gradient flowing back from the nodes after it (the upstream gradient).

So computation flows:

output → input


Computational Graph Intuition

  • forward = build graph
  • backward = traverse graph in reverse

Intermediate results are reused.


Gradient Descent

θ = θ − η ∇L

η = learning rate


Final Takeaway

Neural networks learn by:

propagating error backward and updating parameters

That’s the entire system.


What helped you understand backprop the most—math, visualization, or code? Let’s discuss 👇
