DEV Community

Nishanthan K

How a Model Really Learns: From Loss to Learning in Machine Learning & Deep Learning

Machine Learning and Deep Learning are often treated as black boxes filled with complex math and jargon. But at their core, they are built on a few simple ideas: measuring error, understanding direction, and making small improvements over time.

In this article, I break down how a model actually learns — from the moment it makes a mistake, to how that mistake travels backward through the network to update weights and improve future predictions. Starting from a simple equation, we’ll build up to neural networks and the complete training loop, step by step.


1. Intelligence, Patterns, and Models

Human intelligence works by building mental models of the world.

Example:

  • Black clouds + strong wind → We expect rain
  • Sometimes this is wrong (it could be a cloud shadow)

Over time, the brain refines these models.

Machine Learning works the same way:

  • Input → Pattern → Output
  • The system learns relationships from data, not rules written manually.

A model is simply a mathematical way of representing this pattern.


2. The Core Idea: Model as an Equation

The simplest model looks like this:

```
y = mX + b
```

Where:

  • X = input
  • y = output
  • m and b = parameters (values to be learned)

In Machine Learning, this is often written as:

```
y = wX + b
```

Here:

  • w = weight (importance of the input)
  • b = bias (base value added to the output)

The entire goal of training is to find the best values of w and b so the prediction matches reality.
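As a minimal sketch in Python (the parameter values 2.0 and 0.5 are made up for illustration), the whole model is just this one function:

```python
# A linear model: the prediction is weight * input + bias.
def predict(x, w, b):
    return w * x + b

# Hypothetical learned parameters: w = 2.0, b = 0.5
print(predict(3.0, 2.0, 0.5))  # 6.5
```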

3. Loss Function — Measuring Mistakes

Once the model makes a prediction, it needs to know: “How wrong am I?” That’s the job of the loss function.

A loss function:

  • Compares actual output and predicted output
  • Returns a number that shows how bad the prediction was
  • High loss → very wrong
  • Low loss → close to correct

This number becomes the main guiding signal for learning.
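One common choice of loss function is mean squared error; a small sketch (the numbers are illustrative):

```python
def mse_loss(y_true, y_pred):
    # Mean squared error: average of the squared differences.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse_loss([3.0, 5.0], [3.1, 4.9]))    # low loss: close to correct
print(mse_loss([3.0, 5.0], [10.0, -2.0]))  # high loss: very wrong
```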

4. Gradient — Finding the Direction of Improvement

After loss is calculated, the model asks: “In which direction should I change my parameters to reduce this error?” The gradient is that direction.

Intuitively:

  • A positive gradient means: if we increase this parameter, loss will go up
  • A negative gradient means: if we increase this parameter, loss will go down
  • A zero gradient means: we are at a flat point (often a minimum)
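This intuition can be checked numerically. The sketch below uses a made-up toy loss, (w − 3)², whose minimum is at w = 3:

```python
def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central finite-difference approximation of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

print(numerical_gradient(loss, 5.0))  # ~ +4: increasing w raises the loss
print(numerical_gradient(loss, 1.0))  # ~ -4: increasing w lowers the loss
print(numerical_gradient(loss, 3.0))  # ~  0: flat point (the minimum)
```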

5. Learning Rate — Controlling the Speed of Learning

The learning rate controls how much the parameter is allowed to change in each step.

Conceptually, the update rule is:

```
new_weight = old_weight - (learning_rate × gradient)
```

  • If the learning rate is too large → the model jumps wildly, overshoots, and becomes unstable
  • If the learning rate is too small → the model moves very slowly and takes forever to learn
  • If the learning rate is reasonable → learning is smooth and stable

So the learning rate is simply the step size of learning.
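The effect of the step size can be seen on the same kind of toy loss, (w − 3)²; the learning rates below are illustrative:

```python
def train(w, lr, steps=20):
    # Gradient descent on loss = (w - 3)^2, whose gradient is 2 * (w - 3).
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - lr * gradient  # the update rule from above
    return w

print(train(0.0, lr=0.1))  # reasonable step size: ends close to 3
print(train(0.0, lr=1.5))  # too large: overshoots and diverges
```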

6. Epoch — Repeating the Learning Process

  • One epoch means the model has seen the entire training dataset once.
  • During an epoch, the model:
    • Makes predictions on all training data
    • Calculates the loss
    • Updates the parameters
    • Repeats this process in the next epoch
  • More epochs → more opportunities to learn
  • Too many epochs → the model may start memorizing instead of generalizing (this is called overfitting)

7. Batch — Dividing the Dataset

Instead of feeding the entire dataset to the model at once, we split it into batches.

Example:

```
Dataset size: 1000 samples
Batch size:   100
Batches per epoch: 1000 / 100 = 10
```

So the model updates its parameters 10 times in one epoch.

Benefits of batching:

  • Reduces memory usage
  • Gives more frequent parameter updates
  • Often leads to better and more stable learning
There is no single “perfect” batch size. It depends on:

  • Dataset size
  • Available hardware (CPU/GPU, RAM/VRAM)
  • Model size and complexity
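The batch arithmetic from the example above can be sketched directly:

```python
def make_batches(data, batch_size):
    # Split the dataset into consecutive batches.
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(1000))           # 1000 samples
batches = make_batches(dataset, 100)  # batch size 100
print(len(batches))                   # 10 batches -> 10 updates per epoch
```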

8. Weights and Bias — Making Real-World Sense

Earlier, we saw:

```
y = mX + b
```

In ML terms:

```
y = wX + b
```

Where:

  • w (weight) represents how important an input is
  • b (bias) is a base value added regardless of the input

Example: Courier Charges

  • ₹100 per kg → weight
  • ₹50 base charge → bias
  • Even if the parcel weighs 0 kg, you still pay ₹50.
  • Without the bias term, the model would be forced to predict ₹0 when the input is 0. Bias lets us shift the line up or down and represent real “base” behavior.

So bias is needed: it means the model is not forced to pass through (0, 0) and can better fit real-world data.
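The courier example as code, with ₹100 per kg as the weight and the ₹50 base charge as the bias:

```python
def courier_charge(weight_kg):
    w, b = 100, 50          # w: rate per kg, b: base charge
    return w * weight_kg + b

print(courier_charge(0))    # 50 -- the bias is paid even for a 0 kg parcel
print(courier_charge(2))    # 250
```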

9. Neural Network — From One Equation to Many

A single neuron performs a calculation like:

```
y = w1·x1 + w2·x2 + ... + wn·xn + b
```

A neuron:

  • Takes many inputs
  • Multiplies each by its own weight
  • Adds them up and adds a bias

  • A layer is a group of neurons working in parallel on the same input.
  • A neural network is a stack of such layers.
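A single neuron is just that weighted sum; a sketch with made-up numbers:

```python
def neuron(inputs, weights, bias):
    # y = w1*x1 + w2*x2 + ... + wn*xn + b
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Three inputs, each with its own (hypothetical) weight, plus a bias.
print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.1))  # close to 4.6
```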

Why use neural networks?

  • A single linear equation is limited to straight-line relationships
  • Neural networks, combined with activation functions, can model complex, non-linear patterns
  • This is what enables them to handle images, text, audio, and other complex data

10. Activation Function — The Source of Non-Linearity

After a neuron computes:

```
z = w·x + b
```

We apply an activation function to get the final output:

```
a = activation(z)
```

Activation functions:

  • Introduce non-linearity
  • Allow the network to learn complex patterns and decision boundaries
  • Enable tasks such as:
    • Image classification
    • Object detection
    • Natural language processing

Without activation functions:

  • Every neuron is just a linear transformation
  • Stacking multiple linear layers still gives a linear function
  • The network would be no more powerful than a single linear model

In short:

Activation functions are what make neural networks more powerful than simple linear models.
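Two common activation functions, sketched for illustration:

```python
import math

def relu(z):
    # ReLU: passes positive values through, zeroes out negatives.
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes any value into the range (0, 1).
    return 1 / (1 + math.exp(-z))

print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
```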

11. Forward Propagation — Data Moving Through the Network

  • Forward propagation is the process of passing data through the network to produce an output.
  • Steps:
    • Input data is fed into the input layer
    • Each neuron in the next layer:
      • Multiplies inputs by weights
      • Adds bias
      • Applies an activation function
    • The output of one layer becomes the input to the next layer
    • This continues until the final output layer produces the prediction
  • Flow: Input → Hidden layer 1 → Hidden layer 2 → ... → Output layer

Important:

  • Forward propagation only computes the outputs.
  • No learning or weight updates happen in this step.
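A tiny forward pass through two layers, with made-up weights and ReLU in the hidden layer:

```python
def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    # Each neuron: weighted sum of all inputs, plus bias, then activation.
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 2.0]                                                    # input
h = dense_layer(x, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], relu)  # hidden
y = dense_layer(h, [[1.0, 1.0]], [0.5], lambda z: z)              # output
print(h, y)  # [0.0, 2.0] [2.5]
```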

12. Backward Propagation — Learning From Mistakes

After forward propagation, we:

  • Compare the prediction with the actual target
  • Compute the loss (error)

Now the model needs to adjust its weights to reduce this loss. This is where backward propagation (backprop) comes in.

Backpropagation:

  • Sends the error signal backward from the output layer toward the input layer
  • Computes, for each weight and bias, how much it contributed to the error (its gradient)

Once gradients are known, we update parameters using gradient descent:

```
new_weight = old_weight - (learning_rate × gradient)
new_bias   = old_bias   - (learning_rate × gradient)
```

So:

  • Backpropagation calculates gradients
  • Gradient descent + learning rate perform the actual update

Backpropagation is the core of how the network learns.
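For a single linear neuron with a squared-error loss, backprop reduces to the chain rule; the input, target, and learning rate below are illustrative:

```python
# One neuron, one sample: pred = w*x + b, loss = (pred - t)^2
x, t = 2.0, 10.0        # input and target (made up)
w, b, lr = 1.0, 0.0, 0.1

for _ in range(20):
    pred = w * x + b              # forward pass
    dloss_dpred = 2 * (pred - t)  # derivative of loss w.r.t. the prediction
    grad_w = dloss_dpred * x      # chain rule: dloss/dw = dloss/dpred * dpred/dw
    grad_b = dloss_dpred * 1.0    # chain rule: dloss/db
    w -= lr * grad_w              # gradient-descent update
    b -= lr * grad_b

print(w * x + b)  # now very close to the target 10.0
```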

13. The Complete Training Cycle

Putting everything together, a full training loop for a neural network looks like this:

  1. Input data is passed into the network
  2. Forward propagation computes the prediction
  3. Loss function measures how wrong the prediction is
  4. Backward propagation computes gradients for all weights and biases
  5. Gradient descent updates the parameters using the learning rate
  6. This is repeated for:
    • Every batch
    • Across multiple epochs

In one line:

Input → Forward Propagation → Loss → Backpropagation → Update Weights → Repeat

With each cycle:

  • The model becomes a bit less wrong
  • The loss usually goes down
  • The predictions improve

This is the entire training process in a nutshell.
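The whole loop can be sketched end to end for a toy target y = 2x + 3; all values here are assumptions for illustration:

```python
# Fit y = 2x + 3: forward -> loss -> gradients -> update, over batches and epochs.
data = [(x, 2 * x + 3) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b, lr, batch_size = 0.0, 0.0, 0.01, 2

for epoch in range(500):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, t in batch:
            pred = w * x + b       # forward propagation
            err = pred - t         # loss signal: d(err^2)/dpred = 2*err
            grad_w += 2 * err * x / len(batch)
            grad_b += 2 * err / len(batch)
        w -= lr * grad_w           # update weight
        b -= lr * grad_b           # update bias

print(round(w, 2), round(b, 2))    # close to 2.0 and 3.0
```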

14. Inference — Using the Trained Model

Once the model is trained:

  • We freeze the learned weights and biases
  • We give it new input data it has never seen
  • It runs only forward propagation to produce predictions

No backpropagation or weight updates happen during inference.

This is the phase where the model is actually used in production:
    • Recommending products
    • Classifying images
    • Translating text
    • Detecting fraud, etc.

Final Thought

From a simple line like y = wX + b to deep neural networks, everything in Machine Learning and Deep Learning is built on a few core ideas:

  • Representing patterns with models
  • Measuring error with a loss function
  • Using gradients to find a better direction
  • Controlling change with the learning rate
  • Repeating the process over batches and epochs

The concepts are simple.
The power comes from scale, data, and iteration.
