Learn how neural networks train using forward propagation, loss functions, and backpropagation. This developer-focused guide explains gradients, chain rule, and autograd with practical intuition.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/training-signals-back-fundamentals-en/
The Real Mechanism
Neural networks don’t “learn” in a human sense.
They optimize.
Every step is:
forward → loss → backward → update
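That loop can be sketched in plain Python for a one-parameter model. The model (y = w · x), the data point, and the learning rate below are all made up for illustration:

```python
# Minimal sketch of forward → loss → backward → update
# for a one-parameter model y = w * x (illustrative data).
x, y_true = 2.0, 10.0   # a single training example
w = 0.0                 # the parameter to learn
lr = 0.1                # learning rate

for step in range(50):
    y_pred = w * x                      # forward
    loss = (y_pred - y_true) ** 2       # loss (squared error)
    grad_w = 2 * (y_pred - y_true) * x  # backward (dloss/dw via chain rule)
    w = w - lr * grad_w                 # update

# w converges toward 5.0, since 5.0 * 2.0 == 10.0
```

Every framework training loop is this cycle, just with millions of parameters and automatic gradients.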
Training vs Inference
Training:
- compute loss
- run backward()
- update weights
Inference:
- forward only
No backward pass = no learning
Two Signals
- Forward → prediction
- Backward → gradients
Think:
- forward = what happened
- backward = how to fix it
Loss Function (Why Errors Matter)
Binary Cross-Entropy example (assuming true label y = 1, so loss = −ln(ŷ)):
ŷ = 0.8 → loss ≈ 0.223
ŷ = 0.1 → loss ≈ 2.303
Key idea:
Wrong predictions create stronger gradients
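You can verify both the numbers and the "stronger gradients" claim in a few lines (the function name `bce` is mine, not a library API):

```python
import math

def bce(y_true, y_pred):
    # binary cross-entropy for a single prediction
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# true label y = 1, matching the numbers in the text
loss_good = bce(1.0, 0.8)   # ≈ 0.223
loss_bad  = bce(1.0, 0.1)   # ≈ 2.303

# for y = 1, dloss/dŷ = -1/ŷ: the worse prediction gets the bigger gradient
grad_good = -1 / 0.8        # -1.25
grad_bad  = -1 / 0.1        # -10.0
```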
Gradients = Direction
gradient = ∂loss / ∂parameter
This tells us how to update weights.
Chain Rule (Core)
y = f(g(h(x)))
dL/dx = dL/dy · dy/dg · dg/dh · dh/dx
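A numeric check makes the chain rule concrete. The three inner functions here are arbitrary choices for illustration (h(x) = 3x, g(u) = u², f(v) = v + 1):

```python
# Chain rule check for y = f(g(h(x))) with
# h(x) = 3x, g(u) = u**2, f(v) = v + 1 (illustrative choices)
def compose(x):
    return (3 * x) ** 2 + 1

x = 2.0
# chain rule: dy/dx = f'(g(h(x))) * g'(h(x)) * h'(x) = 1 * 2*(3x) * 3
analytic = 1 * (2 * (3 * x)) * 3    # = 36 at x = 2

# finite-difference approximation agrees
eps = 1e-6
numeric = (compose(x + eps) - compose(x - eps)) / (2 * eps)
```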
The Only Rule You Need
gradient = upstream × local derivative
Example: Multiplication
z = x * y
dL/dx = dL/dz * y
dL/dy = dL/dz * x
Example: Square
out = x²
dL/dx = upstream * 2x
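Both local-derivative rules can be checked against finite differences (the helper `numgrad` is mine; the input values are arbitrary):

```python
# Check "gradient = upstream × local derivative" for both examples
# against a finite-difference approximation.
def numgrad(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x, y, upstream = 3.0, 4.0, 2.0   # upstream = dL/dz, chosen arbitrarily

# multiplication: z = x * y
dx = upstream * y       # dL/dx = 2.0 * 4.0 = 8.0
dy = upstream * x       # dL/dy = 2.0 * 3.0 = 6.0
assert abs(dx - upstream * numgrad(lambda a: a * y, x)) < 1e-4
assert abs(dy - upstream * numgrad(lambda b: x * b, y)) < 1e-4

# square: out = x**2
dsq = upstream * 2 * x  # dL/dx = 2.0 * 2 * 3.0 = 12.0
assert abs(dsq - upstream * numgrad(lambda a: a ** 2, x)) < 1e-4
```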
Why Autograd Exists
Manual chain rule doesn’t scale.
Frameworks:
- store forward values
- build computation graph
- apply backward automatically
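Those three steps fit in a toy, micrograd-style sketch. Every name below is invented for illustration; real frameworks do the same thing with far more machinery:

```python
# Toy autograd: each Value records how it was produced, so backward()
# can traverse the graph in reverse and apply the chain rule.
class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                  # stored forward value
        self.grad = 0.0
        self._parents = parents           # nodes that produced this one
        self._local_grads = local_grads   # d(self)/d(parent) for each parent

    def __mul__(self, other):
        return Value(self.data * other.data,
                     parents=(self, other),
                     local_grads=(other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data,
                     parents=(self, other),
                     local_grads=(1.0, 1.0))

    def backward(self, upstream=1.0):
        # gradient = upstream × local derivative, accumulated per parent
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

x, y = Value(3.0), Value(4.0)
z = x * y + x      # z = 3*4 + 3 = 15
z.backward()
# dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Note that x appears twice in the graph, and its two gradient contributions are summed automatically, which is exactly why gradients accumulate by default.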
What Happens in Code
y_pred = model(x)            # forward pass
loss = criterion(y_pred, y)  # measure the error
loss.backward()              # backward pass: compute gradients
optimizer.step()             # update parameters using those gradients
optimizer.zero_grad()        # reset gradients for the next iteration
Important Implementation Details
Why zero_grad()?
Gradients accumulate by default.
Without reset:
grad_total = grad_step1 + grad_step2 + ...
Why backward() first?
Because gradients must exist before updating:
loss.backward() → gradients computed
optimizer.step() → parameters updated
Why reverse traversal?
Because each layer’s gradient depends on the gradient of the layer after it.
So computation flows:
output → input
Computational Graph Intuition
- forward = build graph
- backward = traverse graph in reverse
Intermediate values from the forward pass are stored and reused during the backward pass.
Gradient Descent
θ = θ − η ∇L
η = learning rate
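The update rule in action, on a deliberately simple made-up loss L(θ) = (θ − 3)²:

```python
# Gradient descent θ ← θ − η ∇L on L(θ) = (θ − 3)**2 (illustrative loss).
theta = 0.0
eta = 0.1                   # learning rate η

for _ in range(100):
    grad = 2 * (theta - 3)  # ∇L
    theta -= eta * grad     # θ = θ − η ∇L

# theta converges to 3.0, the minimum of L
```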
Final Takeaway
Neural networks learn by:
propagating error backward and updating parameters
That’s the entire system.
What helped you understand backprop the most—math, visualization, or code? Let’s discuss 👇