Olabamipe Taiwo
Loss Functions for Beginners

Loss functions are the quiet engine behind every machine learning model. They serve as the critical feedback loop, translating the abstract concept of error into a value that a computer can minimize. By quantifying the difference between a model’s prediction and the ground truth, the loss function provides the gradient signal that the optimizer uses to update the network's weights.

In essence, if the model architecture is the body of an AI, the data is its fuel, and the loss function is its central nervous system, constantly measuring pain (error) and instructing the model how to move to avoid it. Understanding which loss function to use is often the difference between a model that converges in minutes and one that never learns.

This guide introduces loss functions from first principles, explains the most common ones, and shows how to use them effectively in PyTorch.

The Two Pillars of Machine Learning: Regression vs. Classification

Nearly every machine learning objective falls into one of two broad classes: regression and classification. Once you see this, it becomes clear that the choice of a loss function is not arbitrary; it is a direct consequence of the mathematical nature of your output.

Once we understand whether our task is predicting continuous values (regression) or discrete categories (classification), the landscape of loss functions becomes far easier to navigate. Every loss function in PyTorch is essentially a specialized tool built on top of these two pillars.

With that foundation in place, we can now explore how this split shapes the design of loss functions in PyTorch and how different tasks extend these two core ideas into more advanced forms, such as multi‑label classification, segmentation, and detection.

Regression Losses (Continuous Outputs)

Regression problems involve predicting continuous numerical values such as house prices, a person’s age, tomorrow’s temperature, or even pixel intensities in an image. In these tasks, the “error” is simply the distance between two points on a number line: the true value and the predicted value.

As a result, regression loss functions are fundamentally distance‑based. They quantify how far predictions deviate from targets and penalize larger deviations more heavily (or more gently), depending on the specific loss function.

Common PyTorch Regression Losses

  • MSELoss: penalizes squared error
  • L1Loss (MAE): penalizes absolute error

All these losses share one goal: They measure the distance between predicted and true values.

Mean Squared Error (MSE)

Mean Squared Error (MSE) is the most widely used loss function for regression. It measures the average of the squared differences between predicted and actual values.
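
Written as a formula, for a batch of N samples with predictions ŷᵢ and targets yᵢ:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$$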

By squaring the error, MSE ensures two important properties:

  • The loss is always non‑negative
  • Larger errors are penalized significantly more than smaller ones. If a prediction is off by 10 units, the penalty is 100; if off by 2, the penalty is only 4.

Minimizing MSE is equivalent to maximizing the likelihood of the data under a Gaussian (Normal) noise model. This makes MSE particularly effective when you want the model to strongly avoid large deviations.

PyTorch Implementation

import torch.nn as nn

criterion = nn.MSELoss()

You would typically use it inside a training loop like:

loss = criterion(predictions, targets)

Here is a complete example:

import torch
import torch.nn as nn

# 1. Initialize the Loss
criterion = nn.MSELoss()

# 2. Example Data (Batch size of 2)
predictions = torch.tensor([2.5, 0.0], requires_grad=True)
targets = torch.tensor([3.0, -0.5])

# 3. Calculate Loss
loss = criterion(predictions, targets)
print(f"MSE Loss: {loss.item()}")

# Manual Calculation:
# ((2.5 - 3.0)**2 + (0.0 - (-0.5))**2) / 2
# = (0.25 + 0.25) / 2
# = 0.25

L1 Loss (MAE)

L1 Loss, also known as Mean Absolute Error (MAE), measures the average absolute difference between predicted and true values. Unlike MSE, which squares the error, L1 applies a linear penalty. This makes it more robust to outliers, since large errors do not explode quadratically. If your dataset contains corrupted data or extreme anomalies, MSE tends to overfit to them (skewing the model), whereas MAE treats them with less urgency.
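
In formula form, using the same notation as for MSE:

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert \hat{y}_i - y_i \rvert$$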

Where MSE aggressively punishes large deviations, L1 treats all errors proportionally. This often leads to models that learn the median of the target distribution rather than the mean. As with most things in engineering, there is a trade-off: the gradient of L1 is constant (either +1 or -1), so it does not shrink as predictions approach the target. This can make fine-tuned adjustments at the very end of training harder than with MSE.

L1 Loss is useful when:

  • Your data contains outliers
  • You want a model that is robust rather than overly sensitive
  • You prefer a sparser gradient signal

Optimization can be slower and less smooth than with MSE. However, the trade‑off is improved stability in noisy environments.

import torch.nn as nn

criterion = nn.L1Loss()

Usage inside a training loop:

loss = criterion(predictions, targets)

Here is a complete example:

import torch
import torch.nn as nn

# 1. Initialize the Loss
criterion = nn.L1Loss()

# 2. Example Data (Batch size of 2)
predictions = torch.tensor([2.5, 0.0], requires_grad=True)
targets = torch.tensor([3.0, -0.5])

# 3. Calculate Loss
loss = criterion(predictions, targets)
print(f"L1 Loss (MAE): {loss.item()}")

# Manual Calculation:
# (|2.5 - 3.0| + |0.0 - (-0.5)|) / 2
# = (0.5 + 0.5) / 2
# = 0.5

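To see the robustness difference between MSE and L1 in practice, here is a small side‑by‑side sketch. The numbers are made up, and the last prediction is an artificial outlier:

import torch
import torch.nn as nn

# Same data as before, plus one badly wrong prediction (an outlier)
predictions = torch.tensor([2.5, 0.0, 10.0])
targets = torch.tensor([3.0, -0.5, 1.0])

mse = nn.MSELoss()(predictions, targets)
mae = nn.L1Loss()(predictions, targets)

# MSE is dominated by the outlier: (0.25 + 0.25 + 81) / 3 ≈ 27.17
print(f"MSE: {mse.item():.3f}")
# MAE grows only linearly: (0.5 + 0.5 + 9) / 3 ≈ 3.33
print(f"MAE: {mae.item():.3f}")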

Classification Loss Functions

Classification problems deal with discrete categories, not continuous values. Instead of predicting a single numeric output, the model produces a probability distribution over possible classes. The goal is not to minimize distance on a number line, but to assign high probability to the correct class and low probability to all others.

Because of this, classification loss functions measure how well the predicted probability distribution aligns with the true distribution. They quantify the uncertainty, surprise, or information mismatch between what the model believes and what is actually correct.

At their core, classification losses answer one fundamental question:

“How wrong is the model’s predicted probability for the correct class, and how confidently wrong is it?”

This matters because a model that is confidently wrong should be penalized more heavily than one that is merely uncertain. Classification itself comes in several flavors:

  • Binary classification involves choosing between two possible classes, where the model outputs a single probability representing the likelihood of the positive class.
  • Multi‑class classification involves selecting exactly one correct class from three or more categories, with the model predicting a probability distribution over all classes.
  • Multi‑label classification allows multiple classes to be correct simultaneously, treating each class as an independent binary decision with its own probability.
  • Multi‑class, multi‑target classification predicts multiple independent labels simultaneously, where each label has its own multi‑class distribution and loss term.
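
To make these output and target structures concrete, here is a minimal sketch of the tensor shapes involved (the values are made up for illustration):

import torch

batch_size, num_classes = 4, 3

# Binary: one logit per sample, one float label per sample
binary_logits = torch.randn(batch_size)                    # shape: (4,)
binary_targets = torch.tensor([1.0, 0.0, 1.0, 1.0])        # shape: (4,)

# Multi-class: one logit per class, targets are class indices
multiclass_logits = torch.randn(batch_size, num_classes)   # shape: (4, 3)
multiclass_targets = torch.tensor([0, 2, 1, 2])            # shape: (4,)

# Multi-label: one logit per class, targets are multi-hot floats
multilabel_logits = torch.randn(batch_size, num_classes)   # shape: (4, 3)
multilabel_targets = torch.tensor([[1.0, 0.0, 1.0],
                                   [0.0, 0.0, 1.0],
                                   [1.0, 1.0, 0.0],
                                   [0.0, 1.0, 1.0]])        # shape: (4, 3)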

BCELoss (Binary Classification)

BCE is used when the task has two classes and the model outputs a single probability (after the sigmoid activation). It measures how close the predicted probability is to the true binary label. To use this function, the input must be probabilities (values between 0 and 1), and the Sigmoid activation function must be applied to your model's last layer before passing the output to this loss.
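
For a predicted probability pᵢ and true label yᵢ ∈ {0, 1}, averaged over a batch of N samples:

$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log(p_i) + (1-y_i)\log(1-p_i)\Big]$$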

Note that if the model outputs a probability of exactly 0 or 1, the log term becomes −∞, which leads to numerical instability. Because of this sensitivity, BCEWithLogitsLoss is usually preferred.

import torch
import torch.nn as nn

criterion = nn.BCELoss()

preds = torch.tensor([0.8, 0.2], requires_grad=True)  # probabilities
targets = torch.tensor([1.0, 0.0])

loss = criterion(preds, targets)
print(f"BCE Loss: {loss.item()}")

CrossEntropyLoss (Multi‑Class Classification)

CrossEntropyLoss is the standard loss function for multi‑class, single‑label classification, where each input belongs to exactly one class (e.g., MNIST digits 0-9 or ImageNet). It combines nn.LogSoftmax() and nn.NLLLoss() in a single class and quantifies the information lost when the model's predicted distribution replaces the true distribution. High probability for the correct class leads to low loss, while confident wrong predictions lead to a very high loss.
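
For raw logits zᵢ and correct class index yᵢ, the loss is the negative log of the softmax probability assigned to the correct class, averaged over the batch:

$$\text{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{e^{z_{i,y_i}}}{\sum_{c} e^{z_{i,c}}}\right)$$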

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[2.0, 1.0, 0.1]])  # raw scores
targets = torch.tensor([0])               # correct class index

loss = criterion(logits, targets)
print(f"CrossEntropy Loss: {loss.item()}")

NLLLoss (Negative Log‑Likelihood Loss)

NLLLoss computes the negative log‑likelihood of the correct class. It is used when the model outputs log‑probabilities, typically via nn.LogSoftmax. It is essentially Cross‑Entropy without the softmax step: it doesn't compute logs or likelihoods itself, it simply selects the log‑probability of the correct class from your model's output (e.g., picking −0.5 from [-1.2, -0.5, -2.3] when the target index is 1) and returns its negative as the loss.

You must apply log_softmax manually before passing values to NLLLoss.

import torch
import torch.nn as nn

# 1. The Model Output (Must be Log-Probabilities!)
# Imagine we have 3 classes.
# We MUST use LogSoftmax first.
m = nn.LogSoftmax(dim=1)
logits = torch.tensor([[0.1, 2.0, -1.0]]) # Raw scores
log_probs = m(logits) 
# log_probs is now approx [-2.1, -0.2, -3.2]

# 2. The Target
target = torch.tensor([1]) # The correct class is index 1

# 3. The Loss
criterion = nn.NLLLoss()
loss = criterion(log_probs, target)

print(f"Calculated Loss: {loss.item()}") 
# It simply grabbed the value at index 1 (-0.2), 
# and flipped the sign to 0.2.

BCEWithLogitsLoss (Binary or Multi‑Label)

BCEWithLogitsLoss is simply Binary Cross‑Entropy applied directly to raw logits, with a built‑in sigmoid activation. Instead of asking you to apply sigmoid() yourself and then compute BCE, PyTorch wraps both steps into one stable operation.

This matters because manually applying a sigmoid can cause numerical instability: extremely large or small logits can overflow or underflow when converted to probabilities. By combining the sigmoid and BCE into a single optimized function, PyTorch avoids these issues and produces more reliable gradients.

This makes BCEWithLogitsLoss the recommended choice for both binary classification and multi‑label classification, where each class is treated as an independent yes/no prediction.

It accepts raw logits, applies sigmoid internally, and then computes BCE safely and efficiently.

import torch
import torch.nn as nn

# 1. Initialize the Loss
criterion = nn.BCEWithLogitsLoss()

# 2. Example Data (Binary or Multi‑Label)
logits = torch.tensor([1.2, -0.8], requires_grad=True)  # raw model outputs
targets = torch.tensor([1.0, 0.0])                      # true labels

# 3. Calculate Loss
loss = criterion(logits, targets)
print(f"BCEWithLogits Loss: {loss.item()}")

# Internally:
# - Applies sigmoid to logits
# - Computes Binary Cross‑Entropy on the resulting probabilities
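# - For these inputs, roughly: sigmoid(1.2) ≈ 0.769, sigmoid(-0.8) ≈ 0.310
# - Loss ≈ (-log(0.769) + -log(1 - 0.310)) / 2 ≈ (0.263 + 0.371) / 2 ≈ 0.317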



How to Choose the Right Loss Function

Choosing the right loss function is one of the most important decisions in any machine learning project. The loss determines what the model learns, how it learns, and how stable training will be. A model can have the perfect architecture and optimizer, but with the wrong loss function, it will fail to converge or learn the wrong objective entirely.

The key is to match the loss function to three things:

  • The type of prediction you are making: This matters because every loss function is designed for a specific output structure. Continuous values require distance‑based losses like MSE or MAE, single‑class predictions require softmax‑based losses like CrossEntropyLoss, and multi‑label or binary predictions require sigmoid‑based losses like BCEWithLogitsLoss.

  • The distribution of your data: Losses behave differently when classes are imbalanced, noisy, or skewed. Imbalanced datasets require class weights to prevent the model from collapsing onto majority classes, while noisy or heavy‑tailed regression data may call for a more robust loss such as MAE to keep learning stable.

  • The structure of your outputs: Every loss function expects predictions in a specific shape: a single logit for binary tasks, a vector of class logits for multi‑class tasks, or one logit per class (matched against multi‑hot targets) for multi‑label tasks. If your model's output format doesn't match what the loss is designed for, the gradients become meaningless and training breaks down.

Once you understand these three dimensions, choosing a loss becomes systematic rather than a matter of guesswork.
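
As a rough, purely illustrative summary of that mapping, here is a small helper sketch (the function name pick_loss is hypothetical, not part of PyTorch, and the choices are sensible defaults rather than rules):

import torch.nn as nn

def pick_loss(task: str) -> nn.Module:
    # Hypothetical helper: maps a task description to a common default loss
    if task == "regression":           # continuous targets
        return nn.MSELoss()            # or nn.L1Loss() if outliers are a concern
    if task == "binary":               # one yes/no decision per sample
        return nn.BCEWithLogitsLoss()  # expects raw logits
    if task == "multiclass":           # exactly one class per sample
        return nn.CrossEntropyLoss()   # expects raw logits + class indices
    if task == "multilabel":           # several independent yes/no decisions
        return nn.BCEWithLogitsLoss()  # one logit per class
    raise ValueError(f"Unknown task: {task}")

criterion = pick_loss("multiclass")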

Common Mistakes When Using Loss Functions in PyTorch

  • Using softmax or sigmoid before the loss: CrossEntropyLoss and BCEWithLogitsLoss are designed to take raw logits; adding these activations manually distorts the gradients, causes numerical instability, and leads to slower or failed training.

  • Choosing the wrong loss for the task: Each loss is designed for a specific prediction structure. Using CrossEntropyLoss for multi‑label data or BCE for multi‑class problems produces incorrect gradients and prevents the model from learning the intended objective.

  • Incorrect target format: Each loss function expects labels in a very specific structure. CrossEntropyLoss requires class indices (not one‑hot vectors), while BCEWithLogitsLoss requires float labels for each class, so giving the wrong format leads to shape mismatches, silent errors, or completely incorrect gradients.

  • Ignoring class imbalance: Models naturally favor majority classes. Without class weights or pos_weight (see the sketch after this list), the loss can look misleadingly low while the model learns to ignore rare but important classes.

  • Misunderstanding logits: Logits are raw, unbounded scores, not probabilities, and treating them as probabilities leads to incorrect preprocessing and broken training.

  • Shape mismatches: They are equally common because loss functions expect predictions and targets to have compatible dimensions, and even a missing or extra batch or class dimension can cause cryptic runtime errors or silently incorrect learning.
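
To illustrate the class‑imbalance point above, here is a small sketch of how weights are passed in. The weight values are made up; in practice they are often derived from class frequencies (e.g., num_negatives / num_positives):

import torch
import torch.nn as nn

# Multi-class: give the rare class (index 2 here) a larger weight
class_weights = torch.tensor([1.0, 1.0, 5.0])
ce_criterion = nn.CrossEntropyLoss(weight=class_weights)

# Binary / multi-label: pos_weight scales the loss on positive examples
pos_weight = torch.tensor([10.0])
bce_criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)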
