DEV Community

Nishanthan K

How a Model Really Learns: From Loss to Learning in Machine Learning & Deep Learning

Machine Learning and Deep Learning are often treated as black boxes filled with complex math and jargon. But at their core, they are built on a few simple ideas: measuring error, understanding direction, and making small improvements over time.

In this article, I break down how a model actually learns — from the moment it makes a mistake, to how that mistake travels backward through the network to update weights and improve future predictions. Starting from a simple equation, we’ll build up to neural networks and the complete training loop, step by step.


1. Intelligence, Patterns, and Models

Human intelligence works by building mental models of the world.

Example:

  • Black clouds + strong wind → We expect rain
  • Sometimes this is wrong (it could be a cloud shadow)

Over time, the brain refines these models.

Machine Learning works the same way:

  • Input → Pattern → Output
  • The system learns relationships from data, not rules written manually.

A model is simply a mathematical way of representing this pattern.


2. The Core Idea: Model as an Equation

The simplest model looks like this:

```
y = mX + b
```

Where:

  • X = input
  • y = output
  • m and b = parameters (values to be learned)

In Machine Learning, this is often written as:

```
y = wX + b
```

Here:

  • w = weight (importance of the input)
  • b = bias (base value added to the output)

The entire goal of training is to find the best values of w and b so the prediction matches reality.
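As a minimal sketch in Python (the parameter values 2.0 and 0.5 are made up for illustration), the whole model is just this one function:

```python
# A linear model: the prediction is weight * input + bias.
def predict(x, w, b):
    return w * x + b

# Hypothetical learned parameters: w = 2.0, b = 0.5
print(predict(3.0, 2.0, 0.5))  # 6.5
```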

3. Loss Function — Measuring Mistakes

Once the model makes a prediction, it needs to know: “How wrong am I?” That’s the job of the loss function.

A loss function:

  • Compares actual output and predicted output
  • Returns a number that shows how bad the prediction was
  • High loss → very wrong
  • Low loss → close to correct

This number becomes the main guiding signal for learning.
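One common choice of loss function is mean squared error; a small sketch (the numbers are illustrative):

```python
def mse_loss(y_true, y_pred):
    # Mean squared error: average of the squared differences.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse_loss([3.0, 5.0], [3.1, 4.9]))    # low loss: close to correct
print(mse_loss([3.0, 5.0], [10.0, -2.0]))  # high loss: very wrong
```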

4. Gradient — Finding the Direction of Improvement

After loss is calculated, the model asks: “In which direction should I change my parameters to reduce this error?” The gradient is that direction.

Intuitively:

  • A positive gradient means: if we increase this parameter, loss will go up
  • A negative gradient means: if we increase this parameter, loss will go down
  • A zero gradient means: we are at a flat point (often a minimum)
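This intuition can be checked numerically. The sketch below uses a made-up toy loss, (w − 3)², whose minimum is at w = 3:

```python
def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central finite-difference approximation of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

print(numerical_gradient(loss, 5.0))  # ~ +4: increasing w raises the loss
print(numerical_gradient(loss, 1.0))  # ~ -4: increasing w lowers the loss
print(numerical_gradient(loss, 3.0))  # ~  0: flat point (the minimum)
```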

5. Learning Rate — Controlling the Speed of Learning

The learning rate controls how much the parameter is allowed to change in each step.

Conceptually, the update rule is:

```
new_weight = old_weight - (learning_rate × gradient)
```

  • If the learning rate is too large → the model jumps wildly, overshoots, and becomes unstable
  • If the learning rate is too small → the model moves very slowly and takes forever to learn
  • If the learning rate is reasonable → learning is smooth and stable

So the learning rate is simply the step size of learning.
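The effect of the step size can be seen on the same kind of toy loss, (w − 3)²; the learning rates below are illustrative:

```python
def train(w, lr, steps=20):
    # Gradient descent on loss = (w - 3)^2, whose gradient is 2 * (w - 3).
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - lr * gradient  # the update rule from above
    return w

print(train(0.0, lr=0.1))  # reasonable step size: ends close to 3
print(train(0.0, lr=1.5))  # too large: overshoots and diverges
```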

6. Epoch — Repeating the Learning Process

  • One epoch means the model has seen the entire training dataset once.
  • During an epoch, the model:
    • Makes predictions on all training data
    • Calculates the loss
    • Updates the parameters
    • Repeats this process in the next epoch
  • More epochs → more opportunities to learn
  • Too many epochs → the model may start memorizing instead of generalizing (this is called overfitting)

7. Batch — Dividing the Dataset

Instead of feeding the entire dataset to the model at once, we split it into batches.

Example:

```
Dataset size: 1000 samples
Batch size:   100
Batches per epoch: 1000 / 100 = 10
```

So the model updates its parameters 10 times in one epoch.

Benefits of batching:

  • Reduces memory usage
  • Gives more frequent parameter updates
  • Often leads to better and more stable learning
There is no single “perfect” batch size. It depends on:

  • Dataset size
  • Available hardware (CPU/GPU, RAM/VRAM)
  • Model size and complexity
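The batch arithmetic from the example above can be sketched directly:

```python
def make_batches(data, batch_size):
    # Split the dataset into consecutive batches.
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(1000))           # 1000 samples
batches = make_batches(dataset, 100)  # batch size 100
print(len(batches))                   # 10 batches -> 10 updates per epoch
```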

8. Weights and Bias — Making Real-World Sense

Earlier, we saw:

```
y = mX + b
```

In ML terms:

```
y = wX + b
```

Where:

  • w (weight) represents how important an input is
  • b (bias) is a base value added regardless of the input

Example: Courier Charges

  • ₹100 per kg → weight
  • ₹50 base charge → bias
  • Even if the parcel weighs 0 kg, you still pay ₹50.
  • Without the bias term, the model would be forced to predict ₹0 when the input is 0. Bias lets us shift the line up or down and represent real “base” behavior.

So bias is needed: it means the model is not forced to pass through (0, 0) and can better fit real-world data.
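The courier example as code, with ₹100 per kg as the weight and the ₹50 base charge as the bias:

```python
def courier_charge(weight_kg):
    w, b = 100, 50          # w: rate per kg, b: base charge
    return w * weight_kg + b

print(courier_charge(0))    # 50 -- the bias is paid even for a 0 kg parcel
print(courier_charge(2))    # 250
```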

9. Neural Network — From One Equation to Many

A single neuron performs a calculation like:

```
y = w1·x1 + w2·x2 + ... + wn·xn + b
```

A neuron:

  • Takes many inputs
  • Multiplies each by its own weight
  • Adds them up and adds a bias

  • A layer is a group of neurons working in parallel on the same input.
  • A neural network is a stack of such layers.
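A single neuron is just that weighted sum; a sketch with made-up numbers:

```python
def neuron(inputs, weights, bias):
    # y = w1*x1 + w2*x2 + ... + wn*xn + b
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Three inputs, each with its own (hypothetical) weight, plus a bias.
print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.1))  # close to 4.6
```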

Why use neural networks?

  • A single linear equation is limited to straight-line relationships
  • Neural networks, combined with activation functions, can model complex, non-linear patterns
  • This is what enables them to handle images, text, audio, and other complex data

10. Activation Function — The Source of Non-Linearity

After a neuron computes:

```
z = w·x + b
```

We apply an activation function to get the final output:

```
a = activation(z)
```

Activation functions:

  • Introduce non-linearity
  • Allow the network to learn complex patterns and decision boundaries
  • Enable tasks such as:
    • Image classification
    • Object detection
    • Natural language processing

Without activation functions:

  • Every neuron is just a linear transformation
  • Stacking multiple linear layers still gives a linear function
  • The network would be no more powerful than a single linear model

In short:

Activation functions are what make neural networks more powerful than simple linear models.
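Two common activation functions, sketched for illustration:

```python
import math

def relu(z):
    # ReLU: passes positive values through, zeroes out negatives.
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes any value into the range (0, 1).
    return 1 / (1 + math.exp(-z))

print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
```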

11. Forward Propagation — Data Moving Through the Network

  • Forward propagation is the process of passing data through the network to produce an output.
  • Steps:
    • Input data is fed into the input layer
    • Each neuron in the next layer:
      • Multiplies inputs by weights
      • Adds bias
      • Applies an activation function
    • The output of one layer becomes the input to the next layer
    • This continues until the final output layer produces the prediction
  • Flow: Input → Hidden layer 1 → Hidden layer 2 → ... → Output layer

Important:

  • Forward propagation only computes the outputs.
  • No learning or weight updates happen in this step.
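A tiny forward pass through two layers, with made-up weights and ReLU in the hidden layer:

```python
def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    # Each neuron: weighted sum of all inputs, plus bias, then activation.
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 2.0]                                                    # input
h = dense_layer(x, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], relu)  # hidden
y = dense_layer(h, [[1.0, 1.0]], [0.5], lambda z: z)              # output
print(h, y)  # [0.0, 2.0] [2.5]
```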

12. Backward Propagation — Learning From Mistakes

After forward propagation, we:

  • Compare the prediction with the actual target
  • Compute the loss (error)

Now the model needs to adjust its weights to reduce this loss. This is where backward propagation (backprop) comes in.

Backpropagation:

  • Sends the error signal backward from the output layer toward the input layer
  • Computes, for each weight and bias, how much it contributed to the error (its gradient)

Once gradients are known, we update parameters using gradient descent:

```
new_weight = old_weight - (learning_rate × gradient)
new_bias   = old_bias   - (learning_rate × gradient)
```

So:

  • Backpropagation calculates gradients
  • Gradient descent + learning rate perform the actual update

Backpropagation is the core of how the network learns.
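For a single linear neuron with a squared-error loss, backprop reduces to the chain rule; the input, target, and learning rate below are illustrative:

```python
# One neuron, one sample: pred = w*x + b, loss = (pred - t)^2
x, t = 2.0, 10.0        # input and target (made up)
w, b, lr = 1.0, 0.0, 0.1

for _ in range(20):
    pred = w * x + b              # forward pass
    dloss_dpred = 2 * (pred - t)  # derivative of loss w.r.t. the prediction
    grad_w = dloss_dpred * x      # chain rule: dloss/dw = dloss/dpred * dpred/dw
    grad_b = dloss_dpred * 1.0    # chain rule: dloss/db
    w -= lr * grad_w              # gradient-descent update
    b -= lr * grad_b

print(w * x + b)  # now very close to the target 10.0
```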

13. The Complete Training Cycle

Putting everything together, a full training loop for a neural network looks like this:

  1. Input data is passed into the network
  2. Forward propagation computes the prediction
  3. Loss function measures how wrong the prediction is
  4. Backward propagation computes gradients for all weights and biases
  5. Gradient descent updates the parameters using the learning rate
  6. This is repeated for:
    • Every batch
    • Across multiple epochs

In one line:

Input → Forward Propagation → Loss → Backpropagation → Update Weights → Repeat

With each cycle:

  • The model becomes a bit less wrong
  • The loss usually goes down
  • The predictions improve

This is the entire training process in a nutshell.
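The whole loop can be sketched end to end for a toy target y = 2x + 3; all values here are assumptions for illustration:

```python
# Fit y = 2x + 3: forward -> loss -> gradients -> update, over batches and epochs.
data = [(x, 2 * x + 3) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b, lr, batch_size = 0.0, 0.0, 0.01, 2

for epoch in range(500):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, t in batch:
            pred = w * x + b       # forward propagation
            err = pred - t         # loss signal: d(err^2)/dpred = 2*err
            grad_w += 2 * err * x / len(batch)
            grad_b += 2 * err / len(batch)
        w -= lr * grad_w           # update weight
        b -= lr * grad_b           # update bias

print(round(w, 2), round(b, 2))    # close to 2.0 and 3.0
```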

14. Inference — Using the Trained Model

Once the model is trained:

  • We freeze the learned weights and biases
  • We give it new input data it has never seen
  • It runs only forward propagation to produce predictions

No backpropagation or weight updates happen during inference.

This is the phase where the model is actually used in production:
    • Recommending products
    • Classifying images
    • Translating text
    • Detecting fraud, etc.

Final Thought

From a simple line like y = wX + b to deep neural networks, everything in Machine Learning and Deep Learning is built on a few core ideas:

  • Representing patterns with models
  • Measuring error with a loss function
  • Using gradients to find a better direction
  • Controlling change with the learning rate
  • Repeating the process over batches and epochs

The concepts are simple.
The power comes from scale, data, and iteration.
