<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Olabamipe Taiwo</title>
    <description>The latest articles on DEV Community by Olabamipe Taiwo (@olabamipetaiwo).</description>
    <link>https://dev.to/olabamipetaiwo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1121643%2F894b254f-0be1-4d0e-a5e7-b6828eea8925.jpeg</url>
      <title>DEV Community: Olabamipe Taiwo</title>
      <link>https://dev.to/olabamipetaiwo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/olabamipetaiwo"/>
    <language>en</language>
    <item>
      <title>Loss Functions for Beginners</title>
      <dc:creator>Olabamipe Taiwo</dc:creator>
      <pubDate>Sun, 01 Feb 2026 06:25:57 +0000</pubDate>
      <link>https://dev.to/olabamipetaiwo/loss-functions-for-beginners-5gfl</link>
      <guid>https://dev.to/olabamipetaiwo/loss-functions-for-beginners-5gfl</guid>
      <description>&lt;p&gt;Loss functions are the quiet engine behind every machine learning model. They serve as the critical feedback loop, translating the abstract concept of error into a value that a computer can minimize. By quantifying the difference between a model’s prediction and the ground truth, the loss function provides the gradient signal that the optimizer uses to update the network's weights.&lt;/p&gt;

&lt;p&gt;In essence, if the model architecture is the body of an AI, the data is its fuel, and the loss function is its central nervous system, constantly measuring pain (error) and instructing the model how to move to avoid it. Understanding which loss function to use is often the difference between a model that converges in minutes and one that never learns.&lt;/p&gt;

&lt;p&gt;This guide introduces loss functions from first principles, explains the most common ones, and shows how to use them effectively in PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Two Pillars of Machine Learning: Regression vs. Classification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the base of every machine learning problem, the objective falls into one of two main classes: regression or classification. Once this is understood, we can see that the choice of a loss function is not arbitrary; it is a direct consequence of the mathematical nature of your output.&lt;/p&gt;

&lt;p&gt;Once we understand whether our task is predicting continuous values (regression) or discrete categories (classification), the landscape of loss functions becomes far easier to navigate. Every loss function in PyTorch is essentially a specialized tool built on top of these two pillars.&lt;/p&gt;

&lt;p&gt;With that foundation in place, we can now explore how this split shapes the design of loss functions in PyTorch and how different tasks extend these two core ideas into more advanced forms, such as multi‑label classification, segmentation, and detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression Losses (Continuous Outputs)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Regression problems involve predicting continuous numerical values such as house prices, a person’s age, tomorrow’s temperature, or even pixel intensities in an image. In these tasks, the “error” is simply the distance between two points on a number line: the true value and the predicted value.&lt;/p&gt;

&lt;p&gt;As a result, regression loss functions are fundamentally distance‑based. They quantify how far predictions deviate from targets and penalize larger deviations more heavily (or more gently), depending on the specific loss function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common PyTorch Regression Losses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MSELoss&lt;/strong&gt;: penalizes squared error
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L1Loss (MAE)&lt;/strong&gt;: penalizes absolute error
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both losses share one goal: they measure the distance between predicted and true values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mean Squared Error (MSE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mean Squared Error (MSE) is the most widely used loss function for regression. It measures the average of the squared differences between predicted and actual values.&lt;/p&gt;

&lt;p&gt;By squaring the error, MSE ensures two important properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The loss is always non‑negative
&lt;/li&gt;
&lt;li&gt;Larger errors are penalized significantly more than smaller ones. If a prediction is off by 10 units, the penalty is 100; if off by 2, the penalty is only 4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Minimizing MSE is equivalent to maximizing the likelihood of the data under a Gaussian (Normal) noise model. This makes MSE particularly effective when you want the model to strongly avoid large deviations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyTorch Implementation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MSELoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would typically use it inside a training loop like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize the Loss
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MSELoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Example Data (Batch size of 2)
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Calculate Loss
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MSE Loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Manual Calculation:
# ((2.5 - 3.0)**2 + (0.0 - (-0.5))**2) / 2
# = (0.25 + 0.25) / 2
# = 0.25
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;L1 Loss (MAE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;L1 Loss, also known as Mean Absolute Error (MAE), measures the average absolute difference between predicted and true values. Unlike MSE, which squares the error, L1 applies a linear penalty. This makes it more robust to outliers, since large errors do not explode quadratically. If your dataset contains corrupted data or extreme anomalies, MSE tends to overfit to them (skewing the model), whereas MAE treats them with less urgency.&lt;/p&gt;

&lt;p&gt;Where MSE aggressively punishes large deviations, L1 treats all errors proportionally. This often leads to models that learn the median of the target distribution rather than the mean. As usual in engineering, there is a trade-off: the gradient is constant (either 1 or -1), meaning it doesn’t shrink as predictions approach the target. This can make it harder for the model to make fine-tuned adjustments at the very end of training compared to MSE.&lt;/p&gt;

&lt;p&gt;L1 Loss is useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data contains outliers &lt;/li&gt;
&lt;li&gt;You want a model that is robust rather than overly sensitive
&lt;/li&gt;
&lt;li&gt;You prefer a sparser gradient signal &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optimization can be slower and less smooth than with MSE. However, the trade‑off is improved stability in noisy environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;L1Loss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage inside a training loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize the Loss
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;L1Loss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Example Data (Batch size of 2)
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Calculate Loss
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L1 Loss (MAE): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Manual Calculation:
# (|2.5 - 3.0| + |0.0 - (-0.5)|) / 2
# = (0.5 + 0.5) / 2
# = 0.5
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
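&lt;p&gt;The robustness claim is easy to verify. The sketch below reuses the example data but corrupts one target with an outlier: the squared penalty lets that single point dominate MSE, while MAE grows only linearly.&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Same style of data as above, except the last target is a corrupted outlier
predictions = torch.tensor([2.5, 0.0, 1.0])
targets = torch.tensor([3.0, -0.5, 50.0])  # 50.0 is the outlier

mse = nn.MSELoss()(predictions, targets)
mae = nn.L1Loss()(predictions, targets)

print(f"MSE: {mse.item():.2f}")  # dominated by the squared outlier error
print(f"MAE: {mae.item():.2f}")  # grows only linearly with the outlier
```

One corrupted point inflates MSE by orders of magnitude while MAE stays on a human scale, which is exactly why MAE is preferred for noisy data.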



&lt;p&gt;&lt;strong&gt;Classification Loss Functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Classification problems deal with discrete categories, not continuous values. Instead of predicting a single numeric output, the model produces a probability distribution over possible classes. The goal is not to minimize distance on a number line, but to assign high probability to the correct class and low probability to all others.&lt;/p&gt;

&lt;p&gt;Because of this, classification loss functions measure how well the predicted probability distribution aligns with the true distribution. They quantify the uncertainty, surprise, or information mismatch between what the model believes and what is actually correct.&lt;/p&gt;

&lt;p&gt;At their core, classification losses answer one fundamental question:&lt;/p&gt;

&lt;p&gt;“How wrong is the model’s predicted probability for the correct class, and how confidently wrong is it?”&lt;/p&gt;

&lt;p&gt;This matters because a model that is confidently wrong should be penalized more heavily than one that is uncertain. We have different types of classification, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary classification involves choosing between two possible classes, where the model outputs a single probability representing the likelihood of the positive class.&lt;/li&gt;
&lt;li&gt;Multi‑class classification involves selecting exactly one correct class from three or more categories, with the model predicting a probability distribution over all classes.&lt;/li&gt;
&lt;li&gt;Multi‑label classification allows multiple classes to be correct simultaneously, treating each class as an independent binary decision with its own probability.&lt;/li&gt;
&lt;li&gt;Multi‑class, multi‑target classification predicts multiple independent labels simultaneously, where each label has its own multi‑class distribution and loss term.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BCELoss (Binary Classification)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BCE is used when the task has two classes and the model outputs a single probability (after a sigmoid activation). It measures how close the predicted probability is to the true binary label. To use this function, the inputs must be probabilities (values between 0 and 1), so a Sigmoid activation must follow your model’s last layer before the output is passed to this loss.&lt;/p&gt;

&lt;p&gt;Note that if the model outputs exactly 0 or 1, the log term becomes −∞. Because of this numerical instability, &lt;code&gt;BCEWithLogitsLoss&lt;/code&gt; is generally preferred.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BCELoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# probabilities
&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BCE Loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
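&lt;p&gt;BCELoss also makes the earlier point concrete: a confidently wrong prediction should cost far more than an uncertain one. The minimal sketch below compares the two cases for the same positive-class target.&lt;/p&gt;

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
target = torch.tensor([1.0])  # true label is the positive class

uncertain = bce(torch.tensor([0.45]), target)         # hedging near 0.5
confident_wrong = bce(torch.tensor([0.01]), target)   # sure of the wrong class

print(f"Uncertain:         {uncertain.item():.3f}")
print(f"Confidently wrong: {confident_wrong.item():.3f}")
```

The log term is what drives this: the loss for the confident mistake is several times larger, so the gradient pushes hardest against overconfident errors.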



&lt;p&gt;&lt;strong&gt;CrossEntropyLoss (Multi‑Class Classification)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CrossEntropyLoss is the standard loss function for multi‑class, single‑label classification, where each input belongs to exactly one class (e.g., MNIST digits 0-9 or ImageNet). It combines &lt;code&gt;nn.LogSoftmax()&lt;/code&gt; and &lt;code&gt;nn.NLLLoss()&lt;/code&gt; in a single class. The resulting cross‑entropy quantifies the information lost when the model’s predicted distribution replaces the true distribution: high probability for the correct class leads to low loss, while a confident wrong prediction leads to a very high loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;  &lt;span class="c1"&gt;# raw scores
&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;               &lt;span class="c1"&gt;# correct class index
&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CrossEntropy Loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
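&lt;p&gt;Because CrossEntropyLoss is just LogSoftmax followed by NLLLoss, the claim can be checked directly. The sanity check below runs both paths on the same logits and confirms they agree.&lt;/p&gt;

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]])  # raw scores, one sample, three classes
targets = torch.tensor([0])               # correct class index

# Path 1: the fused loss
fused = nn.CrossEntropyLoss()(logits, targets)

# Path 2: LogSoftmax then NLLLoss, applied manually
manual = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(fused, manual, atol=1e-6))  # the two paths agree
```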



&lt;p&gt;&lt;strong&gt;NLLLoss (Negative Log‑Likelihood Loss)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NLLLoss computes the negative log‑likelihood of the correct class. It is used when the model outputs log‑probabilities, typically via &lt;code&gt;nn.LogSoftmax&lt;/code&gt;. It is essentially Cross‑Entropy without the softmax step: despite its name, it doesn’t compute any logs or likelihoods itself. It simply selects the log‑probability of the correct class from your model’s output (e.g., picking −0.5 from [-1.2, -0.5, -2.3] when the target index is 1) and returns its negative as the loss.&lt;/p&gt;

&lt;p&gt;You must apply &lt;code&gt;log_softmax&lt;/code&gt; manually before passing values to NLLLoss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="c1"&gt;# 1. The Model Output (Must be Log-Probabilities!)
# Imagine we have 3 classes.
# We MUST use LogSoftmax first.
&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LogSoftmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="c1"&gt;# Raw scores
&lt;/span&gt;&lt;span class="n"&gt;log_probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="c1"&gt;# log_probs is now approx [-2.1, -0.2, -3.2]
&lt;/span&gt;
&lt;span class="c1"&gt;# 2. The Target
&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# The correct class is index 1
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. The Loss
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NLLLoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculated Loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="c1"&gt;# It simply grabbed the value at index 1 (-0.2), 
# and flipped the sign to 0.2.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;BCEWithLogitsLoss (Binary or Multi‑Label)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BCEWithLogitsLoss is simply Binary Cross‑Entropy applied directly to raw logits, with a built‑in sigmoid activation. Instead of asking you to apply sigmoid() yourself and then compute BCE, PyTorch wraps both steps into one stable operation.&lt;/p&gt;

&lt;p&gt;This matters because manually applying a sigmoid can cause numerical instability: extremely large or small logits can overflow or underflow when converted to probabilities. By combining the sigmoid and BCE into a single optimized function, PyTorch avoids these issues and produces more reliable gradients.&lt;/p&gt;

&lt;p&gt;This makes BCEWithLogitsLoss the recommended choice for both binary classification and multi‑label classification, where each class is treated as an independent yes/no prediction.&lt;/p&gt;

&lt;p&gt;It accepts raw logits, applies sigmoid internally, and then computes BCE safely and efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize the Loss
&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BCEWithLogitsLoss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Example Data (Binary or Multi‑Label)
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# raw model outputs
&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                      &lt;span class="c1"&gt;# true labels
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. Calculate Loss
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;criterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BCEWithLogits Loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Internally:
# - Applies sigmoid to logits
# - Computes Binary Cross‑Entropy on the resulting probabilities
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
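&lt;p&gt;You can verify the fused behavior by comparing it against a manual sigmoid followed by BCELoss on the same logits. The values match; the fused version is simply the numerically safer way to compute them.&lt;/p&gt;

```python
import torch
import torch.nn as nn

logits = torch.tensor([1.2, -0.8])     # raw model outputs
targets = torch.tensor([1.0, 0.0])     # true labels

# Path 1: the fused, numerically stable loss
fused = nn.BCEWithLogitsLoss()(logits, targets)

# Path 2: sigmoid applied manually, then plain BCE
manual = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(fused, manual, atol=1e-6))  # same value for moderate logits
```

For well-behaved logits like these the two agree; the difference only shows up at extreme logit values, where the manual path can overflow while the fused path stays stable.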



&lt;p&gt;&lt;strong&gt;How to Choose the Right Loss Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing the right loss function is one of the most important decisions in any machine learning project. The loss determines what the model learns, how it learns, and how stable training will be. A model can have the perfect architecture and optimizer, but with the wrong loss function, it will fail to converge or learn the wrong objective entirely.&lt;/p&gt;

&lt;p&gt;The key is to match the loss function to three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The type of prediction you are making:&lt;/strong&gt; The type of prediction you are making matters because every loss function is designed for a specific output structure. Continuous values require distance‑based losses like MSE or MAE, single‑class predictions require softmax‑based losses like CrossEntropyLoss, and multi‑label or binary predictions require sigmoid‑based losses like BCEWithLogitsLoss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The distribution of your data:&lt;/strong&gt; It matters because losses behave differently when classes are imbalanced, noisy, or skewed. Imbalanced datasets require class weights to prevent the model from collapsing onto majority classes, while noisy or heavy‑tailed regression data may call for a more robust loss like MAE to ensure stable learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The structure of your outputs:&lt;/strong&gt; Every loss function expects predictions in a specific shape. Single logits for binary tasks, a vector of class logits for multi‑class tasks, or multi‑hot vectors for multi‑label tasks, and if your model’s output format doesn’t match what the loss is designed for, the gradients become meaningless and training breaks down.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you understand these three dimensions, choosing a loss becomes systematic rather than a matter of guesswork.&lt;/p&gt;
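&lt;p&gt;Once those three dimensions are identified, the mapping can be written down mechanically. The helper below is a hypothetical sketch (the function and its return strings are ours, not a PyTorch API) of the usual task-to-loss pairings:&lt;/p&gt;

```python
def choose_loss(task_type: str) -> str:
    """Map a prediction task to a commonly used PyTorch loss.

    task_type is one of: "regression", "multiclass", "binary", "multilabel".
    """
    mapping = {
        "regression": "nn.MSELoss or nn.L1Loss (MAE)",  # continuous targets
        "multiclass": "nn.CrossEntropyLoss",            # exactly one class per sample
        "binary": "nn.BCEWithLogitsLoss",               # a single logit per sample
        "multilabel": "nn.BCEWithLogitsLoss",           # one independent logit per label
    }
    return mapping[task_type]

print(choose_loss("multiclass"))  # nn.CrossEntropyLoss
```

&lt;p&gt;Note that binary and multi‑label tasks land on the same loss: a multi‑label problem is simply many independent binary decisions, one per class.&lt;/p&gt;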

&lt;p&gt;&lt;strong&gt;Common Mistakes When Using Loss Functions in PyTorch&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using softmax or sigmoid before the loss:&lt;/strong&gt; CrossEntropyLoss and BCEWithLogitsLoss are designed to take raw logits; adding these activations manually distorts the gradients, causes numerical instability, and leads to slower or failed training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Choosing the wrong loss for the task:&lt;/strong&gt; Each loss is designed for a specific prediction structure. Using CrossEntropyLoss for multi‑label data or BCE for multi‑class problems produces incorrect gradients and prevents the model from learning the intended objective.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incorrect target format:&lt;/strong&gt; Each loss function expects labels in a very specific structure. CrossEntropyLoss requires class indices (not one‑hot vectors), while BCEWithLogitsLoss requires float labels for each class, so giving the wrong format leads to shape mismatches, silent errors, or completely incorrect gradients.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignoring class imbalance:&lt;/strong&gt; This is a common mistake because models naturally favor majority classes, and without using class weights or &lt;code&gt;pos_weight&lt;/code&gt;, the loss becomes misleadingly low, and the model learns to ignore rare but important classes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Misunderstanding logits:&lt;/strong&gt; Logits are raw, unbounded scores, not probabilities, and treating them as probabilities leads to incorrect preprocessing and broken training.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shape mismatches:&lt;/strong&gt; They are equally common because loss functions expect predictions and targets to have compatible dimensions, and even a missing or extra batch or class dimension can cause cryptic runtime errors or silently incorrect learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
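&lt;p&gt;To make the target-format mistake concrete, here is a small sketch in plain Python (lists standing in for tensors) of the same three-class batch expressed for each loss:&lt;/p&gt;

```python
# Three-class problem, a batch of two samples whose true classes are 2 and 0.

# nn.CrossEntropyLoss expects integer class indices, one per sample:
ce_targets = [2, 0]                  # shape (batch,), integer dtype

# nn.BCEWithLogitsLoss (multi-label) expects float multi-hot vectors,
# one entry per class, so a sample may belong to several classes at once:
bce_targets = [
    [0.0, 0.0, 1.0],                 # sample 0: class 2 active
    [1.0, 0.0, 1.0],                 # sample 1: classes 0 and 2 active
]                                    # shape (batch, num_classes), float dtype

# The model output is a (batch, num_classes) logit matrix in both cases,
# but the *targets* have entirely different shapes and dtypes.
assert all(isinstance(t, int) for t in ce_targets)
assert all(len(row) == 3 for row in bce_targets)
```

&lt;p&gt;Recent PyTorch versions also accept floating‑point class‑probability targets for &lt;code&gt;CrossEntropyLoss&lt;/code&gt;, but integer class indices remain the standard, unambiguous format.&lt;/p&gt;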

</description>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Practical Guide to Classifying Non-Linear Datasets with Pytorch</title>
      <dc:creator>Olabamipe Taiwo</dc:creator>
      <pubDate>Mon, 22 Dec 2025 11:34:58 +0000</pubDate>
      <link>https://dev.to/olabamipetaiwo/a-practical-guide-to-classifying-non-linear-datasets-with-pytorch-4he1</link>
      <guid>https://dev.to/olabamipetaiwo/a-practical-guide-to-classifying-non-linear-datasets-with-pytorch-4he1</guid>
      <description>&lt;p&gt;Classification is a fundamental form of supervised learning in which we predict a target variable (or label) from a set of input features. In traditional workflows using libraries like Scikit-Learn, the heavy lifting is often abstracted away, making the application feel straightforward. However, when we move to Deep Learning with PyTorch, we lose that 'black box' simplicity. Suddenly, we must make some technical-based decisions. Chief among these is analyzing the geometry of our data: is it linear, or does it require the complex non-linear capabilities of a neural network?&lt;/p&gt;

&lt;p&gt;This article details the implementation of non-linear classification models from a foundational standpoint. It is based on technical coursework from the PyTorch for Deep Learning Bootcamp.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before we dive in: What is Data Linearity?
&lt;/h3&gt;

&lt;p&gt;Linearity of data refers to the geometric relationship between your input features and your target labels. Simply put, linear data is a dataset that can be modeled or separated by a straight line, e.g., student exam scores. Conversely, non-linear data is a dataset in which the relationships are curved, complex, or clustered, meaning a straight line cannot capture the pattern or effectively separate the classes, e.g., stock prices over time.&lt;/p&gt;

&lt;p&gt;A common misconception is that deeper networks (two or more hidden layers) automatically solve complex problems. However, without non-linearity, a neural network of any depth is mathematically equivalent to a single linear regression model.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem Space: Concentric Circles
&lt;/h4&gt;

&lt;p&gt;Consider a binary classification dataset generated via Scikit-Learn's &lt;code&gt;make_circles&lt;/code&gt;. The data consists of two concentric circles: a smaller inner circle and a larger outer circle. Geometrically, there is no single straight line that can separate these two classes.&lt;/p&gt;
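&lt;p&gt;For readers who want to reproduce the setup without Scikit-Learn, the sketch below approximates &lt;code&gt;make_circles&lt;/code&gt; using only the standard library (the function name and defaults are ours; sklearn's version differs in its noise and sampling details):&lt;/p&gt;

```python
import math
import random

def make_two_circles(n_per_class=100, inner_r=0.5, outer_r=1.0,
                     noise=0.05, seed=42):
    """Generate two concentric rings of 2-D points with binary labels."""
    rng = random.Random(seed)
    X, y = [], []
    for label, r in ((0, outer_r), (1, inner_r)):
        for _ in range(n_per_class):
            theta = rng.uniform(0.0, 2.0 * math.pi)  # angle around the ring
            x1 = r * math.cos(theta) + rng.gauss(0.0, noise)
            x2 = r * math.sin(theta) + rng.gauss(0.0, noise)
            X.append((x1, x2))
            y.append(label)
    return X, y

X, y = make_two_circles()
# The inner ring (label 1) sits closer to the origin than the outer ring
# (label 0), yet no single straight line can separate the two classes.
```

&lt;p&gt;A radial feature such as the distance from the origin would separate these classes trivially; the point of the exercise is that the network must discover such a transformation itself.&lt;/p&gt;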

&lt;h4&gt;
  
  
  The Failed Architecture (Linear Stack)
&lt;/h4&gt;

&lt;p&gt;If we build a PyTorch model using only &lt;code&gt;nn.Linear&lt;/code&gt; layers, we restrict the model to learning linear transformations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LinearBaseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if you stack 100 such layers, the composition of linear functions remains a linear function. The model will essentially try to draw a straight line through the circles, resulting in a maximum accuracy of roughly 50% (random guessing). This is a classic case of underfitting (when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test sets).&lt;/p&gt;
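&lt;p&gt;This collapse can be verified numerically. In the sketch below (plain Python lists standing in for &lt;code&gt;nn.Linear&lt;/code&gt; weights and biases), two stacked linear layers produce exactly the same output as a single merged layer with weight W2·W1 and bias W2·b1 + b2:&lt;/p&gt;

```python
def matvec(W, v):
    # Matrix-vector product for list-of-lists matrices.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Two "layers": y = W2 @ (W1 @ x + b1) + b2
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.3]
W2, b2 = [[2.0, 1.0], [-1.0, 0.5]], [0.0, 0.2]
x = [0.7, -1.2]

stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One merged layer: y = (W2 @ W1) @ x + (W2 @ b1 + b2)
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
merged = add(matvec(W, x), b)

assert all(abs(s - m) < 1e-9 for s, m in zip(stacked, merged))
```

&lt;p&gt;The same algebra applies for any number of stacked layers, which is why depth alone buys nothing without a non-linearity in between.&lt;/p&gt;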

&lt;p&gt;To classify non-linear data, the model must warp the input space. We achieve this by injecting non-linear activation functions between the linear layers.&lt;/p&gt;

&lt;p&gt;An activation function is a mathematical component in a neural network that determines whether a specific neuron should activate and passes a transformed signal to the next layer. To handle non-linear data, we modify our architecture to include ReLU (Rectified Linear Unit). &lt;/p&gt;

&lt;p&gt;Mathematically, it is defined as:&lt;/p&gt;

&lt;p&gt;f(x) = max(0, x)&lt;/p&gt;

&lt;p&gt;ReLU forces all negative input values to zero. This simple operation introduces the necessary nonlinearity, allowing the network to learn complex, curved boundaries rather than just straight lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircleModelV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layer_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layer_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layer_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# The non-linear activation
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Linear -&amp;gt; ReLU -&amp;gt; Linear -&amp;gt; ReLU -&amp;gt; Output
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layer_3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layer_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layer_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)))))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's experiment with our newly built architecture to see if it can really handle non-linear data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study: Classifying Spiral Data
&lt;/h2&gt;

&lt;p&gt;Applying these principles, let's conduct an experiment using a spiral dataset from the Stanford deep learning course, which is geometrically more complex than circular data.&lt;/p&gt;

&lt;p&gt;The dataset consists of 3 distinct classes arranged in a spiral pattern, totaling 300 samples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggxw0bfqibmlhcymsutg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggxw0bfqibmlhcymsutg.png" alt="Stanford Deep learning course dataset implementation" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment I: The Linear Baseline (Confirmation of Failure)
&lt;/h2&gt;

&lt;p&gt;We first attempted to fit the spiral data using a pure linear model with no hidden layers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; The model failed to capture the spiral structure, stalling at an accuracy of approximately 45% on the test data after 1,000 epochs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Analysis:&lt;/strong&gt; The decision boundaries formed rigid straight lines that sliced through the spirals, misclassifying significant portions of the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dk5j0d0x56m19xg92n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dk5j0d0x56m19xg92n4.png" alt="Experiment 1 results" width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the results of this initial experiment, it is evident that our model failed to capture the structure of the spiral data. Now, it is time to put our ReLU activation to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment II: Adding Non-Linearity
&lt;/h3&gt;

&lt;p&gt;To solve the curvature problem, we constructed a model (&lt;code&gt;NLinear&lt;/code&gt;) with two hidden layers (10 neurons each) and &lt;strong&gt;ReLU&lt;/strong&gt; activation functions between them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NLinear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear_layer_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear_layer_stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NLinear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we are dealing with a multi-class classification problem, the architecture must change in two key areas: the Loss Function and the Activation Strategy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Binary Classification&lt;/th&gt;
&lt;th&gt;Multi-Class Classification (Our Goal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Final Activation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sigmoid (Output 0 to 1)&lt;/td&gt;
&lt;td&gt;Softmax (Probabilities sum to 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loss Function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BCEWithLogitsLoss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CrossEntropyLoss&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical Note on CrossEntropyLoss&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PyTorch's &lt;code&gt;nn.CrossEntropyLoss&lt;/code&gt; expects raw logits as input, not probabilities. The function internally applies &lt;code&gt;LogSoftmax&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Do not apply Softmax to your model's output layer before passing it to this loss function, or softmax will effectively be applied twice, degrading model performance.&lt;/p&gt;
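&lt;p&gt;A small pure-Python sketch (our own implementation of softmax and cross-entropy, mirroring what the loss does internally) shows why the double application hurts: softmax squashes an already-normalized distribution toward uniform, so even a confident, correct prediction keeps a large loss:&lt;/p&gt;

```python
import math

def softmax(z):
    m = max(z)                          # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    # What nn.CrossEntropyLoss does: softmax the raw logits,
    # then take the negative log-probability of the target class.
    return -math.log(softmax(logits)[target])

logits, target = [4.0, 1.0, 0.0], 0     # model is confident in class 0

correct = cross_entropy(logits, target)            # softmax applied once
doubled = cross_entropy(softmax(logits), target)   # softmax applied twice

print(correct, doubled)  # ~0.066 vs ~0.593
```

&lt;p&gt;The doubled version reports a loss roughly ten times higher for the same confident prediction, because the second softmax compresses the probabilities toward uniform, which also weakens the gradient signal the optimizer receives.&lt;/p&gt;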

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fradyose8aw6glw7b80oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fradyose8aw6glw7b80oj.png" alt="Relu expreiment on stanford dataset" width="800" height="976"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The addition of non-linearity yielded a drastic improvement. The model achieved 91.25% accuracy on the training set and 96.67% on the test set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visual Analysis:&lt;/strong&gt;  The decision boundary successfully adjusted to separate the three spiral arms, validating that the combination of hidden layers and ReLU units enables the approximation of complex nonlinear functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Experiment III: ReLU vs. Tanh&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To test the stability of different activation strategies, we replicated our successful architecture but replaced the &lt;code&gt;nn.ReLU()&lt;/code&gt; activation with &lt;code&gt;nn.Tanh()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqyvs7rejlz2f35tkle9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqyvs7rejlz2f35tkle9.png" alt="Change ReLu to TanH" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Tanh?&lt;/strong&gt; While ReLU cuts off negative values at zero, Tanh is a smooth, S-shaped curve that squashes input values to a range of -1 to 1.&lt;br&gt;
Tanh is zero-centered, meaning its output has an average closer to 0, which theoretically helps center the data for the next layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1ca33ybqahrgd2590q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1ca33ybqahrgd2590q.png" alt="TanH Model performance" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Results&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Despite having the same parameter count and architecture depth, the Tanh model flatlined at 36.67% accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparison:&lt;/strong&gt; The ReLU model achieved high accuracy, while Tanh failed to improve beyond the baseline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Diagnosis: The Vanishing Gradient Problem&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;With three distinct classes, an accuracy of ~36% is equivalent to random guessing. Effectively, the model learned nothing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Saturation:&lt;/strong&gt; In deep networks, inputs can easily become large (positive or negative). On the Tanh curve, large inputs land in the flat regions where the slope is nearly horizontal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Gradients:&lt;/strong&gt; When the slope is near zero, the gradient calculated during backpropagation is also near zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Learning:&lt;/strong&gt; Because the gradient determines how much we update the weights, a near-zero gradient means the weights never change. The signal vanishes before it reaches the early layers, halting the learning process.&lt;/li&gt;
&lt;/ol&gt;
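&lt;p&gt;The saturation argument can be checked directly from the derivatives. Tanh's gradient is 1 - tanh²(x), which collapses toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input:&lt;/p&gt;

```python
import math

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

def relu_grad(x):
    # d/dx max(0, x): 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:>5}: tanh grad={tanh_grad(x):.2e}, relu grad={relu_grad(x):.0f}")
```

&lt;p&gt;At x = 5 the Tanh gradient is already below 2e-4, and at x = 10 it is effectively zero. Multiplied across several layers during backpropagation, the update signal vanishes, which matches the flatlined accuracy we observed.&lt;/p&gt;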

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary of Findings&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geometry Dictates Architecture&lt;/strong&gt;&lt;br&gt;
Linear models are fundamentally incapable of solving non-linear problems. As demonstrated by our baseline experiment, no amount of hyperparameter tuning or extended training duration (epochs) can force a linear model to capture curved data structures, such as spirals. The limitation is mathematical, not computational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Activation Functions Are Critical&lt;/strong&gt;&lt;br&gt;
The choice of activation function is not just a detail; it is a structural necessity. Our experiments revealed that ReLU is superior to Tanh for this type of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Tanh model suffered from the vanishing gradient problem, causing it to flatline at a random-guess accuracy of  ~36%.&lt;/p&gt;

&lt;p&gt;The ReLU model maintained healthy gradient flow, enabling it to learn the complex decision boundary and achieve a final accuracy of 96.67%.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Building neural networks is not just about stacking layers; it is about matching the model's geometric capacity to the data's shape. &lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Navigating Early Career Hurdles: Security (Keeping User Data Safe as a Frontend Engineer)</title>
      <dc:creator>Olabamipe Taiwo</dc:creator>
      <pubDate>Mon, 29 Apr 2024 23:47:03 +0000</pubDate>
      <link>https://dev.to/olabamipetaiwo/navigating-early-career-hurdles-security-keeping-user-data-safe-as-a-frontend-engineer-4b71</link>
      <guid>https://dev.to/olabamipetaiwo/navigating-early-career-hurdles-security-keeping-user-data-safe-as-a-frontend-engineer-4b71</guid>
      <description>&lt;p&gt;In software product development, security is one foundational principle on which your product relies heavily. Put, security encompasses all practices aimed at safeguarding digital solutions from attackers, ensuring unauthorized access to systems and data is prevented. With this understanding in mind, it becomes a shared responsibility for every engineer to prioritize. While security comprises various components, this post will focus on addressing one of these integral parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the early stages of my software engineering journey, I encountered a notable obstacle: securely storing essential user data within the browser while ensuring it remained encrypted and inaccessible to unauthorized individuals. This challenge proved to be a frequent hurdle as I navigated the beginning of my career. Unfortunately, many of the resources available to me at the time overlooked the critical nature of this issue. Consequently, I found myself inadvertently storing, and thereby exposing, sensitive user data within the browser environment.&lt;/p&gt;

&lt;p&gt;Primarily, there are three common methods for storing data in a browser:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cookies&lt;/strong&gt;: These are small pieces of data stored as key-value pairs. Cookies are commonly used to store user-specific information and are sent with every HTTP request to the server.&lt;br&gt;
&lt;strong&gt;Local Storage&lt;/strong&gt;: Offering a larger storage capacity than cookies, local storage does not accompany every HTTP request, thereby improving performance. Data stored in local storage persists even after the browser is closed, remaining on the user's device as key-value pairs.&lt;br&gt;
&lt;strong&gt;Session Storage&lt;/strong&gt;: Similar to local storage, session storage also stores data as key-value pairs. However, it lasts only for the duration of the page session: data stored in session storage is cleared when the tab or browser is closed.&lt;/p&gt;

&lt;p&gt;While each of these methods continues to serve its purpose across different scenarios, a crucial aspect often goes unnoticed: the accessibility of stored data during the lifespan of these storage methods. This oversight directly conflicts with one of the fundamental principles of the CIA Triad: Confidentiality. Depending on the sensitivity of the data, exposing it to potential threats poses substantial risks. Thus, it becomes imperative to devise a solution that guarantees the full security and confidentiality of this data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After thorough consideration of the issue and collaborating with a well-structured team comprising seasoned engineers, I discovered a viable solution: &lt;strong&gt;ENCRYPTION&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Encryption is a process of encoding information in such a way that only authorized parties can access it. It involves converting plaintext (readable data) into ciphertext (encoded data) using an encryption algorithm and a cryptographic key. The encrypted data can only be decoded back into plaintext by someone possessing the corresponding decryption key.&lt;/p&gt;

&lt;p&gt;The fundamental aim of encryption lies in safeguarding sensitive data by upholding both its confidentiality and integrity. Encrypted data is unintelligible to unauthorized users or attackers who may attempt to intercept or access it unlawfully. Encryption serves as a powerful tool for protecting data privacy, preventing unauthorized access, and securing communication channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementing the solution involved integrating encryption techniques into the application's data storage mechanism. I encrypted the user data before storing it in the browser's sessionStorage or localStorage.&lt;/p&gt;

&lt;p&gt;The implementation process begins with a thorough assessment of the options available, ranging from symmetric and asymmetric encryption to related techniques such as hashing and tokenization. Each method comes with its own set of strengths and weaknesses, which must be weighed against the requirements of the application.&lt;/p&gt;

&lt;p&gt;Once a suitable method has been selected, it is essential to ensure proper integration within the frontend architecture. This may involve leveraging built-in encryption libraries, implementing custom solutions, or integrating with third-party services for enhanced security features.&lt;/p&gt;

&lt;p&gt;For most of my apps, I tend to gravitate towards using AES (Advanced Encryption Standard) encryption as my preferred method for several reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Strength&lt;/strong&gt;: AES is widely acknowledged as a highly secure encryption algorithm, having undergone rigorous testing and scrutiny by cryptographers.&lt;br&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: AES is optimized for efficient execution across a wide array of hardware and software platforms. Its fast encryption and decryption speeds make it ideal for front-end applications where responsiveness is paramount.&lt;br&gt;
&lt;strong&gt;Versatility&lt;/strong&gt;: AES offers support for various key lengths, ranging from 128-bit to 256-bit. This flexibility enables developers to tailor the level of security to meet specific requirements, striking a balance between security and performance.&lt;br&gt;
&lt;strong&gt;Ease of Implementation&lt;/strong&gt;: AES encryption is well-supported in popular programming languages and libraries commonly used in frontend engineering. Additionally, there are third-party libraries available that simplify the setup and utilization of AES encryption, further streamlining the implementation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The implementation of encryption proved highly effective. Users' sensitive information remained shielded from prying eyes, even in the event of unauthorized access to the browser's storage. This not only bolstered the security of the application but also instilled confidence in users regarding the protection of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigating the challenges of early career endeavors in data security underscored the importance of proactive measures and innovative solutions. By leveraging different encryption algorithms, I not only addressed a pressing issue but also gained valuable insights into the intricacies of safeguarding user data. This experience led me to understand security as a core component of software systems, emphasizing the critical role of security in building trust and reliability in digital systems.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>javascript</category>
      <category>security</category>
      <category>frontendchallenge</category>
    </item>
  </channel>
</rss>
