Boopathi
Originally published at programmerraja.is-a.dev

How Machines Learn: Understanding the Core Concepts of Neural Networks

Imagine trying to teach a child who’s never seen the world to recognize a face, feel that fire is hot, or sense when it might rain. How would you do it?

For centuries, we thought intelligence required something mystical: a soul, consciousness, a divine spark. But what if it’s just pattern recognition at an extraordinary scale? What if learning is simply tuning millions of tiny parameters until inputs map correctly to outputs?

That’s the bold idea behind deep learning: mathematical systems that can learn any pattern, approximate any function, and tackle problems once thought uniquely human.

In 1989, mathematicians proved the Universal Approximation Theorem, showing that a neural network with even a single hidden layer can approximate any continuous function. In theory, such a network can learn to translate, recognize, play, or predict anything.

But theory alone isn’t enough. The theorem says such a network exists, not how to build or train it. That’s where the real craft of deep learning begins: finding the right weights, training efficiently, and learning patterns that generalize.

Let’s unpack the six core ideas that make this possible.

Note: This is a deep dive, not a skim. Grab a coffee, settle in, and take your time. By the end, you’ll understand neural networks from the ground up, not just in words but in logic.

1. Neural Networks: Universal Function Approximators

What Are We Trying to Do?

Before we understand neural networks, let's start with something simpler: what is a function?

In mathematics, a function is a relationship that maps inputs to outputs. f(x) = 2x + 1 is a function. You give it x = 3, it returns 7. Simple, deterministic, predictable.

But real-world problems involve functions we can't write down. Consider:

  • f(image) = "cat" or "dog"
  • f(email_text) = "spam" or "not spam"
  • f(patient_symptoms) = disease_probability

These are still functions (they map inputs to outputs), but we don't know their mathematical form. Traditional programming can't help us here because we can't write explicit rules for every possible image or email.

Building Blocks: The Artificial Neuron


Let's build from the ground up. Start with a single neuron, the atomic unit of a neural network.

A neuron does three things:

  1. Receives multiple inputs (x₁, x₂, x₃, ...)
  2. Multiplies each input by a weight (w₁, w₂, w₃, ...)
  3. Sums everything up and adds a bias: z = w₁x₁ + w₂x₂ + w₃x₃ + ... + b

Why this structure? Because it's the simplest way to combine multiple pieces of information into a single decision.

Geometry of a Neuron: Drawing a Line

Let’s ground this in a real example.

Problem: You're a bank deciding whether to approve loans. You have two pieces of information:

  • x₁ = Annual income (in thousands)
  • x₂ = Credit score

Goal: Separate "approve" from "reject" applications.

A Single Neuron Creates a Line (2D) or Hyperplane (Higher Dimensions)

The equation z = w₁x₁ + w₂x₂ + b is actually the equation of a line! Let's see how:

Example neuron with specific weights:

z = 0.5·income + 2·credit_score - 150

This neuron outputs positive values for "approve" and negative for "reject". The decision boundary is where z = 0:

0 = 0.5·income + 2·credit_score - 150
credit_score = 75 - 0.25·income

This is a line!

What the weights mean geometrically:

  • w₁ = 0.5: For every $1000 increase in income, the decision shifts by 0.5 units toward approval
  • w₂ = 2.0: For every 1-point increase in credit score, the decision shifts by 2 units toward approval (4× more important than income!)
  • b = -150: The bias shifts the entire line. Without it, the line would pass through origin (0,0)
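
To make this concrete, here's a minimal Python sketch that evaluates the neuron above and uses the sign of z as the decision rule. The applicant numbers are made up for illustration.

# Minimal sketch of the loan neuron above; the applicants are made up.
def loan_score(income_k, credit_score):
    # z = 0.5·income + 2·credit_score - 150
    return 0.5 * income_k + 2 * credit_score - 150

for income_k, credit in [(40, 80), (100, 55), (20, 60)]:
    z = loan_score(income_k, credit)
    decision = "approve" if z > 0 else "reject"
    print(f"income={income_k}k, credit={credit}: z={z:+.1f} -> {decision}")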

The learning process is finding the right line:

  • Start with a random line (random weights)
  • See which points it classifies wrong
  • Adjust the weights to rotate and shift the line
  • Repeat until the line best separates the two groups

What One Neuron Can and Cannot Do


Cannot separate (non-linearly separable):

XOR is the classic example: you need (0,1) and (1,0) to be class 1, but (0,0) and (1,1) to be class 0. No single line can achieve this separation.

This is why we need multiple layers.

Multiple Neurons, Multiple Lines: Building Complex Boundaries

If one neuron creates one line, what happens with multiple neurons in one layer?

Example: 3 neurons in one layer


Neuron 1: z₁ = w₁₁x₁ + w₁₂x₂ + b₁ [Line 1]
Neuron 2: z₂ = w₂₁x₁ + w₂₂x₂ + b₂ [Line 2]
Neuron 3: z₃ = w₃₁x₁ + w₃₂x₂ + b₃ [Line 3]

Each neuron draws a different line! But without additional layers, we still can't solve XOR. Why? Because we're just drawing multiple lines without combining them in complex ways.

The key insight: We need to combine these lines non-linearly. This is where activation functions and depth come in.

The Layer Abstraction

Now stack multiple neurons side by side: that's a layer. Each neuron in the layer:

  • Receives the same inputs
  • Has its own unique weights and bias
  • Produces its own output

A layer with 10 neurons transforms one input vector into 10 different outputs, each representing a different "feature" or "pattern" it has detected.

Solving XOR: A Complete Example

Let's solve XOR step-by-step to understand how layers work together.

The XOR Problem:

Input (x₁, x₂) → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0

Two-Layer Solution:

Layer 1: Create useful features (2 neurons with ReLU)

Neuron 1: Detects "at least one input is 1"

z₁ = x₁ + x₂ - 0.5
a₁ = ReLU(z₁)

Testing:

(0,0): z₁ = -0.5, a₁ = 0
(0,1): z₁ = 0.5, a₁ = 0.5
(1,0): z₁ = 0.5, a₁ = 0.5
(1,1): z₁ = 1.5, a₁ = 1.5

Neuron 2: Detects "both inputs are 1"

z₂ = x₁ + x₂ - 1.5
a₂ = ReLU(z₂)

Testing:

(0,0): z₂ = -1.5, a₂ = 0
(0,1): z₂ = -0.5, a₂ = 0
(1,0): z₂ = -0.5, a₂ = 0
(1,1): z₂ = 0.5, a₂ = 0.5

Layer 2: Combine features (1 neuron with Sigmoid)

z₃ = a₁ - 3·a₂ - 0.25

output = Sigmoid(z₃)

Testing (output ≈ 1 when z₃ > 0, since Sigmoid(z₃) > 0.5):

(0,0): z₃ = 0 - 0 - 0.25 = -0.25 → ≈0 ✓
(0,1): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,0): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,1): z₃ = 1.5 - 1.5 - 0.25 = -0.25 → ≈0 ✓

What happened geometrically?


Layer 1 transformed the space:

The first layer created new features where the problem becomes linearly separable!

  • a₁ captures "OR-ness" (at least one is true)
  • a₂ captures "AND-ness" (both are true)

Layer 2 drew a simple line in this new space:

a₁ - 3·a₂ = 0.25 [decision boundary]

This line easily separates XOR in the transformed space!
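
Here's a minimal NumPy sketch (assuming NumPy is available) that wires up exactly these weights and verifies the four XOR cases:

# Minimal sketch verifying the hand-built XOR network above.
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = np.array([[1.0, 1.0],    # neuron 1: "at least one input is 1"
               [1.0, 1.0]])   # neuron 2: "both inputs are 1"
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -3.0])    # combine: OR-ness minus 3 × AND-ness
b2 = -0.25

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a1 = relu(W1 @ np.array(x) + b1)
    y = sigmoid(W2 @ a1 + b2)
    print(x, "->", round(float(y)))   # prints 0, 1, 1, 0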

The key insight:

  • Layer 1: Creates useful intermediate features by drawing multiple lines/planes
  • Layer 2: Combines these features with another line/plane
  • Together: They can represent any decision boundary!

The Complete Architecture

A typical neural network:

Input Layer (raw data) 
    → Hidden Layer 1 (low-level features)
    → Hidden Layer 2 (mid-level features)
    → Hidden Layer 3 (high-level features)
    → Output Layer (predictions)

The power lies not in any single neuron, but in the billions of connections between them, each with its own weight, collectively forming a function approximator of extraordinary flexibility.

Universal Approximation Theorem

The Universal Approximation Theorem (1989) proves:

A neural network with just one hidden layer can approximate any continuous function, given enough neurons.

But “enough” might mean billions, which is impractical.

Deep (multi-layer) networks achieve the same expressive power more efficiently through hierarchical composition, like compression for abstractions.

So, in theory, neural networks can learn any mapping; in practice, depth makes it tractable.

2. Activation Functions: Breaking Linearity

The Linear Trap: A Fundamental Problem

Imagine we build a neural network with three layers, but we don't use activation functions. Let's trace through what happens mathematically:

Layer 1: z₁ = W₁x + b₁

Layer 2: z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂

Layer 3: z₃ = W₃z₂ + b₃ = W₃(W₂W₁x + W₂b₁ + b₂) + b₃

Simplifying: z₃ = (W₃W₂W₁)x + (W₃W₂b₁ + W₃b₂ + b₃)

Notice what happened? No matter how many layers we add, we always end up with Wx + b, a simple linear function. A product of matrices is still a matrix, so we've built an expensive way to do linear regression.

This is catastrophic. Linear functions can only model linear relationships. The real world is non-linear. The path of a thrown ball, the spread of a virus, the relationship between study time and test scores—all non-linear.
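
You can see this collapse numerically. The sketch below (random matrices, no activation functions) shows that two stacked linear layers equal one linear layer:

# Sketch: stacking linear layers (no activation) collapses to a single linear map.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
x = rng.normal(size=3)

deep = W2 @ (W1 @ x + b1) + b2        # two "layers"
W, b = W2 @ W1, W2 @ b1 + b2          # one equivalent layer
print(np.allclose(deep, W @ x + b))   # True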

The Solution: Non-Linear Activation Functions

After each neuron computes its weighted sum, we pass it through a non-linear activation function: a = σ(z)

This single addition breaks the linear trap. Now our layers actually do different things, building increasingly complex representations.

What Makes a Good Activation Function?

Let's think about what properties we need:

  1. Non-linearity (obviously, or we're back where we started)
  2. Differentiability (we'll need derivatives for learning)
  3. Computational efficiency (we'll apply it billions of times)
  4. Avoid saturation (outputs shouldn't always be at extremes)
  5. Zero-centered or positive (depending on the problem)

Common Activation Functions

ReLU (Rectified Linear Unit): f(x) = max(0, x)


Why it works:

  • Dead simple: if input is positive, output equals input; if negative, output is zero
  • Non-linear despite looking linear (the "kink" at zero creates non-linearity)
  • Computationally trivial: just one comparison and zero multiplication
  • Doesn't saturate for positive values (unlike sigmoid)
  • Induces sparsity: many neurons output exactly zero, creating efficient representations

The problem:

  • "Dying ReLU": if a neuron's weights push it permanently into negative territory, its gradient becomes zero and it stops learning forever
  • Not zero-centered: all outputs are positive, which can slow convergence

Variants:

  • Leaky ReLU: f(x) = max(0.01x, x) — allows small gradients when x < 0, preventing death
  • ELU (Exponential Linear Unit): Smooth curve for negative values, better learning dynamics

Sigmoid: f(x) = 1/(1 + e^(-x))


Why it exists:

  • Squashes any input into range (0, 1)
  • Historically motivated by biological neurons (firing rates between 0 and 1)
  • Output can be interpreted as probability

What's happening?

  • For large positive x: e^(-x) approaches 0, so output approaches 1
  • For large negative x: e^(-x) approaches infinity, so output approaches 0
  • At x = 0: output is 0.5

Why it's problematic:

  • Vanishing gradients: For large positive or negative inputs, the sigmoid is nearly flat, so its derivative approaches zero. During backpropagation, gradients get multiplied across layers, and near-zero factors compound into vanishingly small values. Deep networks can't learn.
  • Not zero-centered: Outputs always positive (0 to 1), causing zig-zagging during optimization
  • Computationally expensive: Exponential function

Where it's still used:

  • Output layer for binary classification (want probability between 0 and 1)

Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x))

Advantages over sigmoid:

  • Zero-centered: outputs range from -1 to 1
  • Stronger gradients: derivative at zero is 1 (compared to 0.25 for sigmoid)

Still suffers from:

  • Vanishing gradients for extreme values
  • Computational cost of exponentials

Softmax: f(x_i) = e^(x_i) / Σe^(x_j)

Completely different purpose:

  • Not used between hidden layers
  • Exclusively for multi-class classification output layers

In simple terms:

  • Takes a vector of arbitrary values (logits)
  • Converts them into probabilities that sum to 1
  • Exponentiation ensures all values are positive
  • Division by sum ensures they sum to 1
  • Higher inputs get exponentially higher probabilities

Example:

  • Input: [2.0, 1.0, 0.1]
  • After softmax: [0.659, 0.242, 0.099]
  • Notice: still ordered the same way, but now they're probabilities
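
As a quick check, here's a small Python sketch of these activations that reproduces the softmax example (subtracting the max before exponentiating is a standard numerical-stability trick, not required by the math):

# Sketch of the activations above; reproduces the softmax example.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099]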

Why Different Layers Need Different Activations

  • Hidden Layers: ReLU family (efficiency, avoiding vanishing gradients)
  • Binary Classification Output: Sigmoid (get probability for one class)
  • Multi-class Classification Output: Softmax (get probability distribution over all classes)
  • Regression Output: Often no activation (or linear) — we want the raw value, not a bounded one

3. Forward Propagation: The Prediction Process

What is Propagation?

"Propagation" is just a fancy word for "passing information through." Forward propagation is the process of taking input data and pushing it through every layer until we get a prediction.

Let's build this concept from absolute scratch.

The Single Neuron Case

You have:

  • Input: x = 3
  • Weight: w = 2
  • Bias: b = 1

Step 1: Linear combination z = wx + b = 2(3) + 1 = 7

Step 2: Activation a = ReLU(z) = max(0, 7) = 7

That's it. The neuron outputs 7. This output might be the final prediction (if it's the only neuron), or it might be input to the next layer.

Multiple Inputs, Single Neuron

Now you have three inputs:

  • Inputs: x = [x₁=2, x₂=3, x₃=1]
  • Weights: w = [w₁=0.5, w₂=-1, w₃=2]
  • Bias: b = 1

Step 1: Weighted sum

z = w₁x₁ + w₂x₂ + w₃x₃ + b
z = 0.5(2) + (-1)(3) + 2(1) + 1
z = 1 - 3 + 2 + 1 = 1

Step 2: Activation a = ReLU(1) = 1

Single Layer: Multiple Neurons

Now suppose we have 3 neurons in one layer, all receiving the same 3 inputs.

Neuron 1:

  • Weights: [w₁₁, w₁₂, w₁₃], Bias: b₁
  • Output: a₁ = ReLU(w₁₁x₁ + w₁₂x₂ + w₁₃x₃ + b₁)

Neuron 2:

  • Weights: [w₂₁, w₂₂, w₂₃], Bias: b₂
  • Output: a₂ = ReLU(w₂₁x₁ + w₂₂x₂ + w₂₃x₃ + b₂)

Neuron 3:

  • Weights: [w₃₁, w₃₂, w₃₃], Bias: b₃
  • Output: a₃ = ReLU(w₃₁x₁ + w₃₂x₂ + w₃₃x₃ + b₃)

The layer transforms input vector [x₁, x₂, x₃] into output vector [a₁, a₂, a₃]

Matrix Representation: Scaling to Thousands of Neurons

Writing out every neuron individually is tedious. We use matrix notation:

Weight Matrix W:

W = [w₁₁  w₁₂  w₁₃]
    [w₂₁  w₂₂  w₂₃]
    [w₃₁  w₃₂  w₃₃]

Each row represents one neuron's weights.

Input Vector x:

x = [x₁]
    [x₂]
    [x₃]

Forward propagation for the layer:

z = Wx + b
a = ReLU(z)

This single matrix multiplication computes all neurons simultaneously. With modern GPUs optimized for matrix operations, we can process thousands of neurons in parallel.

Deep Networks: Chaining Layers

Now stack multiple layers. The output of layer 1 becomes the input to layer 2:

Layer 1:

z¹ = W¹x + b¹
a¹ = ReLU(z¹)

Layer 2:

z² = W²a¹ + b²
a² = ReLU(z²)

Layer 3 (output):

z³ = W³a² + b³
ŷ = softmax(z³)  [if classification]

The final output ŷ is our prediction.

Concrete Example: Digit Recognition


Input: 28×28 pixel image of a handwritten digit (flattened to 784 values)

Architecture:

  • Input layer: 784 neurons
  • Hidden layer 1: 128 neurons (with ReLU)
  • Hidden layer 2: 64 neurons (with ReLU)
  • Output layer: 10 neurons (with softmax for digits 0-9)

Forward propagation:

z¹ = W¹x + b¹           [128 values]
a¹ = ReLU(z¹)           [128 values]

z² = W²a¹ + b²          [64 values]
a² = ReLU(z²)           [64 values]

z³ = W³a² + b³          [10 values]
ŷ = softmax(z³)         [10 probabilities summing to 1]

Output might be: [0.01, 0.02, 0.05, 0.7, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01]

The network predicts "3" with 70% confidence (index 3 has highest probability).
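
As a sketch (random, untrained weights, so the probabilities are meaningless), the full forward pass for this architecture is only a few lines of NumPy:

# Forward pass for the 784-128-64-10 network with random (untrained) weights.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
Ws = [rng.normal(0, np.sqrt(2 / m), size=(n, m)) for m, n in zip(sizes, sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a = rng.random(784)                     # stand-in for a flattened 28×28 image
for W, b in zip(Ws[:-1], bs[:-1]):
    a = relu(W @ a + b)                 # hidden layers
y_hat = softmax(Ws[-1] @ a + bs[-1])    # 10 probabilities summing to 1
print(y_hat.argmax(), y_hat.sum())      # predicted digit, sum ≈ 1.0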

Why "Forward"?

Because information flows in one direction: from input → through hidden layers → to output. No loops, no feedback (in standard feedforward networks). Each layer only looks forward, never backward.

Later, during learning, we'll propagate in the opposite direction (backward) to adjust weights. But prediction is always forward.


4. Loss Functions: Quantifying Error


Why Do We Need Loss?

Imagine you're teaching a child to draw circles. They draw something. How do you tell them how "wrong" it is? You need a measurement some way to quantify the difference between what they drew and a perfect circle.

Neural networks face the same problem. After forward propagation, we have a prediction ŷ. We also have the true answer y. The loss function L(ŷ, y) measures how wrong the prediction is.

This single number is crucial because:

  1. It tells us how well the model is performing
  2. It guides the learning process (we'll adjust weights to minimize this number)
  3. Different problems need different ways of measuring "wrongness"

Property Requirements for Loss Functions

  1. Non-negative: L ≥ 0 always (can't be "negative wrong")
  2. Zero when perfect: L = 0 when ŷ = y exactly
  3. Increases with error: Worse predictions → higher loss
  4. Differentiable: We need gradients for learning (calculus requirement)
  5. Appropriate for the task: Regression vs classification need different measures

Mean Squared Error (MSE): For Regression

The Problem: Predict a continuous value (house price, temperature, stock price)

The most intuitive approach: absolute difference |ŷ - y|

  • If true value is 100 and we predict 90, error = 10
  • Simple, interpretable

But there's a problem: absolute value isn't differentiable at zero (the derivative has a discontinuity). This complicates learning algorithms.

Better approach: Square the difference

L = (ŷ - y)²

Why squaring?

  • Always positive (negative errors don't cancel positive ones)
  • Differentiable everywhere: dL/dŷ = 2(ŷ - y)
  • Penalizes large errors more (error of 10 contributes 100, but error of 1 contributes only 1)
  • Mathematically convenient (leads to elegant solutions)

For multiple predictions (a batch):

MSE = (1/n) Σᵢ(ŷᵢ - yᵢ)²

We average across all samples to get a single loss value.

Concrete Example:

  • Predicting house prices
  • True prices: [200k, 300k, 250k]
  • Predicted: [210k, 280k, 255k]
  • Errors: [10k, -20k, 5k]
  • Squared errors: [100M, 400M, 25M]
  • MSE = (100M + 400M + 25M) / 3 = 175M

The large middle error dominates the loss, signaling that's where improvement is needed most.

Variant: MAE (Mean Absolute Error)

MAE = (1/n) Σᵢ|ŷᵢ - yᵢ|
  • More robust to outliers (doesn't square them)
  • Less sensitive to large errors
  • Harder to optimize (non-smooth at zero)
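
A quick sketch reproducing the house-price numbers above (prices in thousands of dollars, so the squared errors of 100, 400, and 25 correspond to the 100M, 400M, and 25M figures):

# MSE and MAE for the house-price example (prices in $k).
import numpy as np

y_true = np.array([200, 300, 250])
y_pred = np.array([210, 280, 255])

mse = np.mean((y_pred - y_true) ** 2)    # (100 + 400 + 25) / 3 = 175
mae = np.mean(np.abs(y_pred - y_true))   # (10 + 20 + 5) / 3 ≈ 11.7
print(mse, mae)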

Cross-Entropy Loss: For Classification

The Problem: Predict discrete categories (cat vs dog, spam vs ham, digit 0-9)

MSE doesn't work well here. Why? Because classification outputs are probabilities, and we need to measure "how wrong" a probability distribution is.

Binary Cross-Entropy (Two Classes)

Setup:

  • True label: y ∈ {0, 1} (e.g., 0 = not spam, 1 = spam)
  • Predicted probability: ŷ ∈ [0, 1] (from sigmoid activation)

If true label is 1 (positive class):

  • If we predict ŷ = 1.0 (certain it's positive): perfect, loss should be 0
  • If we predict ŷ = 0.9 (very confident): small loss
  • If we predict ŷ = 0.5 (uncertain): moderate loss
  • If we predict ŷ = 0.1 (confident it's negative): large loss
  • If we predict ŷ = 0.0 (certain it's negative): infinite loss (catastrophically wrong)

The formula that captures this:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

Why this works:

Case 1: y = 1 (true class is positive)

L = -log(ŷ)
  • If ŷ = 1: L = -log(1) = 0 ✓
  • If ŷ = 0.5: L = -log(0.5) ≈ 0.69
  • If ŷ = 0.1: L = -log(0.1) ≈ 2.30
  • If ŷ → 0: L → ∞ (massive penalty for confident wrong answer)

Case 2: y = 0 (true class is negative)

L = -log(1-ŷ)
  • If ŷ = 0: L = -log(1) = 0 ✓
  • If ŷ = 0.5: L = -log(0.5) ≈ 0.69
  • If ŷ = 0.9: L = -log(0.1) ≈ 2.30
  • If ŷ → 1: L → ∞

The logarithm creates the right penalty structure: small errors have small losses, but confident mistakes are punished severely.

Why "cross-entropy"?

It comes from information theory. Cross-entropy measures the average number of bits needed to encode data from one distribution using another distribution. Here, we're measuring the "distance" between the true distribution (y) and predicted distribution (ŷ).

Categorical Cross-Entropy (Multiple Classes)

Setup:

  • True label: one-hot encoded vector (e.g., [0, 0, 1, 0, 0] for class 3)
  • Predicted: probability distribution from softmax (e.g., [0.1, 0.2, 0.5, 0.15, 0.05])

Formula:

L = -Σᵢ yᵢ·log(ŷᵢ)

Since y is one-hot (only one element is 1, rest are 0), this simplifies to:

L = -log(ŷ_true_class)

Example: Digit classification (0-9)

  • True label: 7 → one-hot: [0,0,0,0,0,0,0,1,0,0]
  • Predicted: [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]

Loss = -log(0.4) ≈ 0.916

If the model had predicted 7 with 0.9 probability: Loss = -log(0.9) ≈ 0.105 (much better)

Intuition: We only care about the probability assigned to the correct class. The loss increases as this probability decreases.
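
The sketch below reproduces both numbers: the binary case where the true label is 1 but we predict 0.1, and the digit example where the correct class gets probability 0.4:

# Cross-entropy examples from above.
import numpy as np

# Binary cross-entropy: true label y = 1, predicted probability 0.1
y, y_hat = 1, 0.1
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(bce)                 # ≈ 2.30

# Categorical cross-entropy: true class 7, softmax output below
probs = np.array([0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05])
print(-np.log(probs[7]))   # ≈ 0.916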

Choosing the Right Loss Function

Regression (predicting continuous values):

  • MSE: Standard choice, penalizes large errors heavily
  • MAE: More robust to outliers
  • Huber Loss: Combines benefits of both (MSE for small errors, MAE for large)

Binary Classification:

  • Binary Cross-Entropy: Standard choice when using sigmoid output

Multi-class Classification:

  • Categorical Cross-Entropy: When labels are one-hot encoded
  • Sparse Categorical Cross-Entropy: When labels are integers (more memory efficient)

Custom Loss Functions: Sometimes you need domain-specific losses. For example:

  • Medical diagnosis: False negatives might be more costly than false positives
  • Image generation: Perceptual losses that compare high-level features, not pixels
  • Reinforcement learning: Reward-based losses

The loss function is the objective we're optimizing. Choose it carefully—your model will become excellent at minimizing it, for better or worse.


5. Backpropagation: The Learning Algorithm


This step is crucial: it’s where the real learning happens.

Our neural network has millions of tiny adjustable numbers called weights. We make a prediction, compare it with the correct answer, and realize we’re off. The big question is: how do we tweak those millions of weights to make the next prediction better?

It’s not as simple as it sounds. Each weight affects many others, and changing even one can ripple through the entire network. Should we increase it or decrease it? And by how much?

That’s where backpropagation comes in: a beautifully systematic way to figure out exactly how every single weight should change to reduce the overall error.

To really grasp what’s happening here, you’ll need a bit of comfort with calculus, especially with derivatives and how small changes in one variable affect another.

The Core Insight: The Chain Rule of Calculus

Everything in backpropagation stems from one calculus concept: the chain rule.

Simple example: If z = f(y) and y = g(x), then:

dz/dx = (dz/dy) · (dy/dx)

In words: The rate of change of z with respect to x equals the rate of change of z with respect to y, multiplied by the rate of change of y with respect to x.

This might seem abstract, so let's make it concrete.

Concrete Example: A Tiny Network

Architecture:

  • One input: x = 2
  • One weight: w = 3
  • One bias: b = 1
  • Activation: ReLU
  • True output: y = 15

Forward pass:

z = wx + b = 3(2) + 1 = 7
a = ReLU(z) = 7
L = (a - y)² = (7 - 15)² = 64

Loss is 64. We want to reduce it. Should we increase or decrease w?

Backward pass (backpropagation):

We need dL/dw (how much does loss change when we change w?).

Using the chain rule:

dL/dw = (dL/da) · (da/dz) · (dz/dw)

Let's calculate each piece:

Step 1: dL/da (how does loss change with activation?)

L = (a - y)²
dL/da = 2(a - y) = 2(7 - 15) = -16

Step 2: da/dz (how does activation change with pre-activation?)

a = ReLU(z) = max(0, z)
For z > 0: da/dz = 1
For z ≤ 0: da/dz = 0
Since z = 7 > 0: da/dz = 1

Step 3: dz/dw (how does pre-activation change with weight?)

z = wx + b
dz/dw = x = 2

Combine them:

dL/dw = (dL/da) · (da/dz) · (dz/dw)
dL/dw = (-16) · (1) · (2) = -32

Interpretation: The gradient is -32. This means:

  • If we increase w by a tiny amount, the loss will decrease by approximately 32 times that amount
  • The negative sign tells us to increase w (move opposite to the gradient)
  • The magnitude (32) tells us how sensitive the loss is to changes in w

Update the weight:

w_new = w_old - learning_rate · (dL/dw)
w_new = 3 - 0.01 · (-32) = 3 + 0.32 = 3.32

We've just learned! The network adjusted its weight to reduce the loss.
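
The whole example fits in a few lines of plain Python; this sketch reproduces the forward pass, the gradient, and the weight update:

# Sketch reproducing the tiny-network example: forward pass, gradient, one update.
x, w, b, y = 2.0, 3.0, 1.0, 15.0
lr = 0.01

z = w * x + b                  # 7
a = max(0.0, z)                # ReLU -> 7
loss = (a - y) ** 2            # 64

dL_da = 2 * (a - y)            # -16
da_dz = 1.0 if z > 0 else 0.0  # 1
dz_dw = x                      # 2
dL_dw = dL_da * da_dz * dz_dw  # -32

w = w - lr * dL_dw             # 3.32
print(loss, dL_dw, w)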

Scaling to Deep Networks

In real networks with many layers, we calculate gradients layer by layer, moving backward from the output.

Example: 3-layer network

Forward pass:

Layer 1: z¹ = W¹x + b¹,  a¹ = ReLU(z¹)
Layer 2: z² = W²a¹ + b², a² = ReLU(z²)
Layer 3: z³ = W³a² + b³, ŷ = softmax(z³)
Loss: L = CrossEntropy(ŷ, y)

Backward pass:

Layer 3 (output layer):

dL/dz³ = ŷ - y  [derivative of softmax + cross-entropy]
dL/dW³ = (dL/dz³) · a²ᵀ
dL/db³ = dL/dz³
dL/da² = W³ᵀ · (dL/dz³)  [pass gradient to previous layer]

Layer 2:

dL/dz² = (dL/da²) ⊙ ReLU'(z²)  [⊙ is element-wise multiplication]
dL/dW² = (dL/dz²) · a¹ᵀ
dL/db² = dL/dz²
dL/da¹ = W²ᵀ · (dL/dz²)

Layer 1:

dL/dz¹ = (dL/da¹) ⊙ ReLU'(z¹)
dL/dW¹ = (dL/dz¹) · xᵀ
dL/db¹ = dL/dz¹

Notice the pattern:

  1. Calculate gradient with respect to pre-activation (z)
  2. Calculate gradient for weights: dL/dW = (dL/dz) · inputᵀ
  3. Calculate gradient for bias: dL/db = dL/dz
  4. Pass gradient backward: dL/d(previous_activation) = Wᵀ · (dL/dz)

Why "Backpropagation"?

Because we propagate gradients backward through the network, from output to input. Each layer receives the gradient from the layer ahead, computes its own gradients, and passes gradients to the layer behind.

The Vanishing Gradient Problem


Fundamental issue in deep networks:

When we multiply many small numbers (gradients) together through many layers, the product can become vanishingly small—approaching zero.

Example: If each layer has gradient 0.1, after 10 layers:

0.1¹⁰ = 0.0000000001

The early layers receive essentially zero gradient and stop learning. The network is deep but only the last few layers are actually training.

Solutions:

  • ReLU activation: Gradient is 1 for positive inputs (doesn't shrink)
  • Residual connections: Skip connections that allow gradients to bypass layers
  • Batch normalization: Keeps activations in a healthy range
  • Careful initialization: Start with weights that don't lead to extreme activations

The Exploding Gradient Problem

The opposite issue: gradients grow exponentially.

If each layer has gradient 2, after 10 layers:

2¹⁰ = 1024

Weights update by huge amounts, causing wild oscillations and instability. The model never converges.

Solutions:

  • Gradient clipping: Cap gradients at a maximum value
  • Careful initialization: Start with smaller weights
  • Batch normalization: Stabilizes the scale of activations and gradients
  • Lower learning rates: Smaller update steps

Computational Efficiency: Why Backpropagation is Brilliant

Naive approach to finding gradients: For each weight, we could:

  1. Make a tiny change: w → w + ε
  2. Recalculate the entire loss
  3. Compute: (L_new - L_old) / ε

For a network with 1 million weights, this requires 1 million forward passes. Computationally prohibitive.

Backpropagation insight: Calculate all gradients in a single backward pass by reusing intermediate calculations. For N weights, we need:

  • 1 forward pass
  • 1 backward pass

That's it. Backpropagation computes all million gradients with just two passes through the network. This is why deep learning became practical.
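
You can sanity-check any analytic gradient against the naive finite-difference approach. For the tiny network from earlier, nudging w both ways recovers the same -32:

# Finite-difference check of dL/dw for the tiny network (x=2, b=1, y=15).
def loss_fn(w, x=2.0, b=1.0, y=15.0):
    a = max(0.0, w * x + b)   # ReLU
    return (a - y) ** 2

w, eps = 3.0, 1e-6
numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
print(numeric)   # ≈ -32.0, matching backpropagation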

The Mathematics: Derivatives of Common Components

ReLU:

f(x) = max(0, x)
f'(x) = 1 if x > 0, else 0

Sigmoid:

σ(x) = 1/(1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))

Tanh:

tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
tanh'(x) = 1 - tanh²(x)

Softmax + Cross-Entropy (combined):

dL/dz = ŷ - y

This remarkably simple gradient is why we use softmax with cross-entropy.

MSE:

L = (ŷ - y)²
dL/dŷ = 2(ŷ - y)

Memory Requirements

Backpropagation requires storing all activations from the forward pass to compute gradients in the backward pass. For a network with:

  • Batch size: 32
  • 4 layers with 1000 neurons each

We must store: 32 × 4 × 1000 = 128,000 activation values in memory.

This is why training large models requires substantial GPU memory, and why techniques like gradient checkpointing (recomputing some activations rather than storing them) become necessary.


6. Gradient Descent: The Optimization Algorithm

Imagine you're standing on a mountain in thick fog. You can't see the bottom of the valley, but you can feel the slope beneath your feet. Your goal: reach the lowest point.

Strategy: Take a step in the direction of steepest descent.

This is gradient descent. The "mountain" is the loss landscape—a high-dimensional surface where each dimension represents one weight, and the height represents the loss.

The Mathematical Foundation

After backpropagation, we have gradients: ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ

Each gradient tells us:

  • Direction: Positive gradient means loss increases when weight increases
  • Magnitude: Large gradient means weight strongly affects loss

Gradient descent update rule:

w_new = w_old - α · (∂L/∂w)

Where α (alpha) is the learning rate.

Why subtract? The gradient points in the direction of increasing loss. We want to decrease loss, so we move in the opposite direction (negative gradient).

The Learning Rate: The Most Critical Hyperparameter

The learning rate controls the step size. Choosing it is an art and science.

Too large (α = 1.0):

Iteration 1: Loss = 100
Iteration 2: Loss = 250  [overshot the minimum]
Iteration 3: Loss = 80
Iteration 4: Loss = 300  [wild oscillations]
...never converges

Too small (α = 0.000001):

Iteration 1: Loss = 100.00
Iteration 2: Loss = 99.99
Iteration 3: Loss = 99.98
...painfully slow, might get stuck in local minimum

Just right (α = 0.01):

Iteration 1: Loss = 100
Iteration 2: Loss = 85
Iteration 3: Loss = 73
...steady progress toward minimum

Typical ranges:

  • Small networks: 0.001 - 0.01
  • Large networks: 0.0001 - 0.001
  • With Adam optimizer: 0.001 (default)
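
To see these three regimes for yourself, here's a toy sketch minimizing L(w) = (w - 5)² with plain gradient descent; the specific numbers are illustrative, not from the article:

# Gradient descent on L(w) = (w - 5)² with three learning rates.
def grad(w):
    return 2 * (w - 5)                 # dL/dw

for lr in (1.1, 0.0001, 0.1):          # too large, too small, reasonable
    w = 0.0
    for _ in range(20):
        w = w - lr * grad(w)
    print(f"lr={lr}: w after 20 steps = {w:.3f} (target is 5)")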

Variants of Gradient Descent

1. Batch Gradient Descent

Approach: Use the entire dataset to compute one gradient update.

for epoch in range(num_epochs):
    # Compute gradient using ALL training samples
    gradient = compute_gradient(all_data)
    weights = weights - learning_rate * gradient

Pros:

  • Smooth convergence
  • Guaranteed to find the minimum (for convex functions)

Cons:

  • Slow: One update per epoch
  • Memory intensive: Must load entire dataset
  • Gets stuck in local minima (for non-convex functions)

2. Stochastic Gradient Descent (SGD)

Approach: Use one random sample at a time.

for epoch in range(num_epochs):
    shuffle(data)
    for sample in data:
        # Compute gradient using ONE sample
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient

Pros:

  • Fast updates: One update per sample
  • Can escape local minima (due to noise)
  • Memory efficient

Cons:

  • Noisy updates: path to minimum is erratic
  • Doesn't fully utilize parallel computing (GPUs)
  • May oscillate around minimum without settling

3. Mini-Batch Gradient Descent (Most Common)

Approach: Use a small batch of samples (typically 32, 64, 128, or 256).

for epoch in range(num_epochs):
    shuffle(data)
    for batch in create_batches(data, batch_size=32):
        # Compute gradient using BATCH of samples
        gradient = compute_gradient(batch)
        weights = weights - learning_rate * gradient

Pros:

  • Balanced: More stable than SGD, faster than batch GD
  • Efficient: Perfect for GPU parallelization
  • Moderate memory usage
  • Noise helps escape local minima, but not too much

Cons:

  • Another hyperparameter to tune (batch size)

This is the standard in modern deep learning.

Advanced Optimizers: Beyond Basic Gradient Descent

Basic gradient descent treats all parameters equally and uses a fixed learning rate. Modern optimizers are more sophisticated.

Momentum

Problem with basic GD: Imagine a narrow valley: steep sides, gentle slope toward minimum. Basic GD oscillates between sides while slowly progressing forward.

Solution: Momentum

velocity = 0
for iteration:
    gradient = compute_gradient()
    velocity = β * velocity - learning_rate * gradient
    weights = weights + velocity

Intuition: Remember previous gradients. If we keep going in the same direction, accelerate. If we oscillate, dampen the movement.

Effect:

  • Faster convergence in consistent directions
  • Reduced oscillations
  • Can roll through small local minima

Typical β: 0.9 (use 90% of previous velocity)

RMSprop (Root Mean Square Propagation)

Problem: Some parameters need large updates, others need small ones. A single learning rate is suboptimal.

Solution: Adapt the learning rate for each parameter based on recent gradient magnitudes.

squared_gradient_avg = 0
for iteration:
    gradient = compute_gradient()
    squared_gradient_avg = β * squared_gradient_avg + (1-β) * gradient²
    adjusted_gradient = gradient / (sqrt(squared_gradient_avg) + ε)
    weights = weights - learning_rate * adjusted_gradient

Intuition:

  • Parameters with consistently large gradients get smaller effective learning rates (divided by large number)
  • Parameters with small gradients get larger effective learning rates (divided by small number)

Effect: Each parameter gets its own adaptive learning rate.

Adam (Adaptive Moment Estimation)

The gold standard: Combines momentum and RMSprop.

m = 0  # first moment (momentum)
v = 0  # second moment (RMSprop)

for iteration:
    gradient = compute_gradient()

    # Update moments
    m = β₁ * m + (1-β₁) * gradient
    v = β₂ * v + (1-β₂) * gradient²

    # Bias correction (important in early iterations)
    m_corrected = m / (1 - β₁^t)
    v_corrected = v / (1 - β₂^t)

    # Update weights
    weights = weights - learning_rate * m_corrected / (sqrt(v_corrected) + ε)

Why Adam dominates:

  • Combines best of both worlds: momentum + adaptive learning rates
  • Robust to hyperparameter choices (default values work well)
  • Efficient and converges quickly
  • Works across diverse problem types

Default hyperparameters:

  • learning_rate = 0.001
  • β₁ = 0.9 (momentum)
  • β₂ = 0.999 (RMSprop)
  • ε = 1e-8 (numerical stability)
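
Here's a minimal single-parameter Adam sketch following the update rule above (the learning rate is bumped to 0.1 just so the toy problem converges in a few hundred steps):

# Minimal Adam optimizer for one parameter, following the update rule above.
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = 0.0, 0.0, 0

    def update(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Minimize L(w) = (w - 5)² starting from w = 0
opt, w = Adam(lr=0.1), 0.0
for _ in range(300):
    w = opt.update(w, 2 * (w - 5))
print(round(w, 3))   # close to 5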

Learning Rate Schedules

Even with Adam, learning rates can be adjusted during training.

1. Step Decay

Epochs 1-30:   lr = 0.001
Epochs 31-60:  lr = 0.0001
Epochs 61+:    lr = 0.00001

Why: Start with larger steps to quickly find the general region, then smaller steps to fine-tune.

2. Exponential Decay

lr(t) = lr₀ * e^(-kt)

Smoothly decreases learning rate over time.

3. Cosine Annealing

lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(πt/T))

Gradually reduces learning rate following a cosine curve.

4. Warm Restarts

Periodically reset learning rate to initial value. Helps escape local minima by occasionally taking large steps again.

5. Learning Rate Warmup

Start with very small learning rate, gradually increase to target value over first few epochs. Prevents instability in early training.

The Convergence Question: When to Stop?

Training loss keeps decreasing, but should we keep training?

Early Stopping

Concept: Monitor performance on a validation set (data the model hasn't trained on).

Epoch 1:  Train Loss = 2.5, Val Loss = 2.6
Epoch 5:  Train Loss = 1.2, Val Loss = 1.3
Epoch 10: Train Loss = 0.8, Val Loss = 0.9
Epoch 15: Train Loss = 0.4, Val Loss = 0.85  [val loss barely improving]
Epoch 20: Train Loss = 0.2, Val Loss = 0.9   [val loss increasing!]

Stop around epoch 15: Model is starting to overfit (memorizing training data rather than learning generalizable patterns).

Implementation:

best_val_loss = infinity
patience = 5  # epochs to wait for improvement
patience_counter = 0

for epoch:
    train()
    val_loss = validate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model()
        patience_counter = 0
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print("Early stopping!")
        break

Challenges in the Optimization Landscape

Local Minima

The loss surface has multiple valleys. Gradient descent might settle into a shallow local minimum instead of the deep global minimum.

Solutions:

  • Momentum (can roll over small bumps)
  • Multiple random initializations
  • Stochastic updates (noise helps escape)

Saddle Points

Points where gradient is zero but it's neither a minimum nor maximum—a "saddle" shape. More common than local minima in high dimensions.

Solutions:

  • Momentum helps push through
  • Second-order methods (Newton's method)

Plateaus

Flat regions where gradients are nearly zero. Progress stalls.

Solutions:

  • Adaptive learning rates (Adam)
  • Patience (eventually gradients increase again)

Batching and Parallelization

Why batches matter for GPUs:

Modern GPUs have thousands of cores. Computing gradients for 32 samples independently is slow. Computing them in parallel is fast.

Matrix operations on batches:

Input batch:  [32 × 784] (32 images, 784 pixels each)
Weights:      [784 × 128]
Output:       [32 × 128] (32 outputs, 128 neurons)

Single matrix multiplication computes all 32 samples simultaneously. This is why GPUs are essential for deep learning.

Batch size trade-offs:

Small batches (e.g., 8-32):

  • More frequent updates
  • More noise (helps generalization)
  • Less memory
  • Slower per epoch

Large batches (e.g., 256-1024):

  • Fewer updates per epoch
  • Smoother gradients
  • More memory required
  • Faster per epoch
  • Risk of poor generalization (too smooth)

Sweet spot: Usually 32-128 for most applications.


The Complete Training Loop: Putting It All Together

Now we understand all the pieces. Here's how they work together:

Initialization

# Initialize weights (Xavier/He initialization)
for layer in network:
    layer.weights = random_normal(0, sqrt(2/n_inputs))
    layer.biases = zeros()

# Initialize optimizer
optimizer = Adam(learning_rate=0.001)

Why careful initialization matters:

  • Too large: Exploding activations and gradients
  • Too small: Vanishing gradients
  • Xavier/He initialization: Scaled to maintain activation variance across layers

The Training Loop

for epoch in range(num_epochs):
    # Shuffle data for randomness
    shuffle(training_data)

    for batch in create_batches(training_data, batch_size=32):
        # 1. FORWARD PROPAGATION
        x, y_true = batch

        z1 = W1 @ x + b1
        a1 = relu(z1)

        z2 = W2 @ a1 + b2
        a2 = relu(z2)

        z3 = W3 @ a2 + b3
        y_pred = softmax(z3)

        # 2. COMPUTE LOSS
        loss = cross_entropy(y_pred, y_true)

        # 3. BACKPROPAGATION
        dL_dz3 = y_pred - y_true
        dL_dW3 = dL_dz3 @ a2.T
        dL_db3 = sum(dL_dz3, axis=0)
        dL_da2 = W3.T @ dL_dz3

        dL_dz2 = dL_da2 * relu_derivative(z2)
        dL_dW2 = dL_dz2 @ a1.T
        dL_db2 = sum(dL_dz2, axis=0)
        dL_da1 = W2.T @ dL_dz2

        dL_dz1 = dL_da1 * relu_derivative(z1)
        dL_dW1 = dL_dz1 @ x.T
        dL_db1 = sum(dL_dz1, axis=0)

        # 4. OPTIMIZATION (using Adam)
        W3, b3 = optimizer.update(W3, b3, dL_dW3, dL_db3)
        W2, b2 = optimizer.update(W2, b2, dL_dW2, dL_db2)
        W1, b1 = optimizer.update(W1, b1, dL_dW1, dL_db1)

    # 5. VALIDATION
    val_loss = evaluate(validation_data)
    print(f"Epoch {epoch}: Train Loss = {loss:.4f}, Val Loss = {val_loss:.4f}")

    # 6. EARLY STOPPING CHECK
    if should_stop(val_loss):
        break

# 7. FINAL EVALUATION
test_accuracy = evaluate(test_data)
print(f"Final Test Accuracy: {test_accuracy:.2%}")

What Happens Over Time

Epoch 1:

  • Weights are random
  • Predictions are terrible (10% accuracy on 10 classes = random guessing)
  • Loss is high (maybe 2.3)
  • Large gradients
  • Big weight updates

Epoch 10:

  • Network learned basic patterns
  • Accuracy improved to 60%
  • Loss decreased to 1.2
  • Moderate gradients
  • Steady learning

Epoch 50:

  • Network refined understanding
  • Accuracy at 92%
  • Loss at 0.3
  • Small gradients
  • Fine-tuning details

Epoch 100:

  • Diminishing returns
  • Accuracy 93% (validation starting to plateau)
  • Risk of overfitting
  • Time to stop

Monitoring Training: What to Watch

1. Training Loss

  • Should decrease steadily
  • If fluctuating wildly: learning rate too high
  • If barely moving: learning rate too low or stuck in minimum

2. Validation Loss

  • Should track training loss initially
  • If diverging: overfitting
  • If much higher from start: train/val data distribution mismatch

3. Gradient Norms

  • Should be moderate (0.001 - 1.0)
  • If very small (< 0.0001): vanishing gradients
  • If very large (> 10): exploding gradients

4. Activation Statistics

  • Mean should be near zero
  • Std should be moderate (~1)
  • If activations saturate (all 0 or all max): architectural problem

5. Learning Rate

  • Can be adjusted based on progress
  • Too aggressive: divergence
  • Too conservative: slow progress

Conclusion: The Symphony of Learning

Machine learning is not one algorithm—it's a carefully orchestrated system:

  1. Architecture provides the capacity to represent complex functions (Universal Approximation Theorem)
  2. Activation functions enable non-linear transformations
  3. Forward propagation generates predictions
  4. Loss functions quantify error
  5. Backpropagation computes gradients efficiently
  6. Gradient descent iteratively improves weights

Each component is essential. Remove any one, and learning fails.

The beauty lies in the simplicity of each piece and the power of their combination. From these building blocks—matrix multiplications, non-linear functions, derivatives, and iterative updates—emerges the capability to:

  • Recognize faces in photos
  • Translate between languages
  • Generate realistic images
  • Play games at superhuman levels
  • Predict protein structures
  • Drive cars autonomously

All from the same fundamental algorithm, repeated billions of times, gradually sculpting random weights into a representation of the world's patterns.

This is how machines learn: not through magic, but through mathematics, iteration, and the elegant interplay of calculus and optimization across high-dimensional spaces.
