Olabamipe Taiwo

A Practical Guide to Classifying Non-Linear Datasets with PyTorch

Classification is a fundamental form of supervised learning in which we predict a target variable (or label) from a set of input features. In traditional workflows using libraries like Scikit-Learn, the heavy lifting is often abstracted away, making the task feel straightforward. However, when we move to deep learning with PyTorch, we lose that 'black box' simplicity and must make deliberate technical decisions ourselves. Chief among these is analyzing the geometry of our data: is it linear, or does it require the non-linear capabilities of a neural network?

This article details the implementation of non-linear classification models from a foundational standpoint. It is based on technical coursework from the PyTorch for Deep Learning Bootcamp.

Before we dive in: What is Data Linearity?

Linearity of data refers to the geometric relationship between your input features and your target labels. Simply put, linear data is a dataset that can be modeled or separated by a straight line, e.g., student exam scores. Conversely, non-linear data is a dataset in which the relationships are curved, complex, or clustered, meaning a straight line cannot capture the pattern or effectively separate the classes, e.g., stock prices over time.

A common misconception is that deeper networks (two or more hidden layers) automatically solve complex problems. However, without non-linearity, a neural network of any depth is mathematically equivalent to a single linear model.

The Problem Space: Concentric Circles

Consider a binary classification dataset generated via Scikit-Learn's make_circles. The data consists of two concentric circles: a smaller inner circle and a larger outer circle. Geometrically, there is no single straight line that can separate these two classes.
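For context, such a dataset can be generated in a couple of lines; the sample count and noise level below are illustrative choices, not the exact values used in the course.

from sklearn.datasets import make_circles
import torch

# illustrative settings: 1,000 samples with a little noise
X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)

# convert the NumPy arrays to float tensors for PyTorch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

print(X.shape, y.shape)  # torch.Size([1000, 2]) torch.Size([1000])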

The Failed Architecture (Linear Stack)

If we build a PyTorch model using only nn.Linear layers, we restrict the model to learning linear transformations.

from torch import nn

class LinearBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        # A single linear layer: 2 input features -> 1 output logit for binary classification
        self.linear = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        return self.linear(x)


Even if you stack 100 such layers, the composition of linear functions remains a linear function. The model will essentially try to draw a straight line through the circles, resulting in a maximum accuracy of roughly 50% (random guessing). This is a classic case of underfitting (when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test sets).
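A quick sanity check of that claim: the sketch below composes two nn.Linear layers by hand and compares them to a single collapsed layer (weight shapes follow PyTorch's (out_features, in_features) convention).

import torch
from torch import nn

torch.manual_seed(42)

# Two stacked linear layers with no activation in between...
stack = nn.Sequential(nn.Linear(2, 10), nn.Linear(10, 1))

# ...are equivalent to one affine map: W = W2 @ W1, b = W2 @ b1 + b2
collapsed = nn.Linear(2, 1)
with torch.no_grad():
    W1, b1 = stack[0].weight, stack[0].bias
    W2, b2 = stack[1].weight, stack[1].bias
    collapsed.weight.copy_(W2 @ W1)
    collapsed.bias.copy_(W2 @ b1 + b2)

x = torch.randn(5, 2)
print(torch.allclose(stack(x), collapsed(x), atol=1e-6))  # True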

To classify non-linear data, the model must warp the input space. We achieve this by injecting non-linear activation functions between the linear layers.

An activation function is a mathematical component in a neural network that determines whether a specific neuron should activate and passes a transformed signal to the next layer. To handle non-linear data, we modify our architecture to include ReLU (Rectified Linear Unit).

Mathematically, ReLU is defined as

f(x) = max(0, x)

ReLU forces all negative input values to zero. This simple operation introduces the necessary nonlinearity, allowing the network to learn complex, curved boundaries rather than just straight lines.
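A quick check of that behaviour in PyTorch:

import torch

x = torch.tensor([-3.0, -0.5, 0.0, 1.0, 4.0])
print(torch.relu(x))  # tensor([0., 0., 0., 1., 4.])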

class CircleModelV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=10)
        self.layer_2 = nn.Linear(in_features=10, out_features=10)
        self.layer_3 = nn.Linear(in_features=10, out_features=1)
        self.relu = nn.ReLU() # The non-linear activation

    def forward(self, x):
        # Linear -> ReLU -> Linear -> ReLU -> Output
        return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x)))))


Now let's experiment with our newly built architecture to see if it can really handle non-linear data.

Case Study: Classifying Spiral Data

Applying these principles, let's conduct an experiment using a spiral dataset adapted from the Stanford Deep Learning course, which is geometrically more complex than the circular data above.

The dataset consists of 3 distinct classes arranged in a spiral pattern, totaling 300 samples.

Stanford Deep Learning course dataset implementation
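For reference, here is a sketch along the lines of that course's spiral generator; the 100 points per class is an assumption chosen to match the 300 samples and 3 classes described above.

import numpy as np
import torch

N, D, K = 100, 2, 3                     # points per class, dimensionality, number of classes
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype='uint8')

for j in range(K):
    ix = range(N * j, N * (j + 1))
    r = np.linspace(0.0, 1, N)                                           # radius
    t = np.linspace(j * 4, (j + 1) * 4, N) + np.random.randn(N) * 0.2    # angle, plus noise
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.LongTensor)  # integer class labels for CrossEntropyLoss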

Experiment I: The Linear Baseline (Confirmation of Failure)

We first attempted to fit the spiral data using a pure linear model with no hidden layers.
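A minimal sketch of that baseline (the class name is illustrative): a single nn.Linear mapping the 2 input features directly to 3 class logits, with no hidden layers and no activation.

from torch import nn

class SpiralLinearBaseline(nn.Module):  # illustrative name
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=2, out_features=3)

    def forward(self, x):
        return self.linear(x)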

  • Result: The model failed to capture the spiral structure, stalling at an accuracy of approximately 45% on the test data after 1,000 epochs.
  • Visual Analysis: The decision boundaries formed rigid straight lines that sliced through the spirals, misclassifying significant portions of the data.

Experiment I results

From the results of this initial experiment, it is evident that our model failed to capture the structure of the spiral data. Now it is time to put our ReLU activation to work.

Experiment II: Adding Non-Linearity with ReLU

To solve the curvature problem, we constructed a model (NLinear) with two hidden layers (10 neurons each) and ReLU activation functions between them.


class NLinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(in_features=2, out_features=10),
            nn.ReLU(),
            nn.Linear(in_features=10, out_features=10),
            nn.ReLU(),
            nn.Linear(in_features=10, out_features=3),
        )

    def forward(self, x):
        return self.linear_layer_stack(x)

model = NLinear()


Since we are dealing with a multi-class classification problem, the architecture must change in two key areas: the Loss Function and the Activation Strategy.

Component          | Binary Classification       | Multi-Class Classification (Our Goal)
Final Activation   | Sigmoid (output 0 to 1)     | Softmax (probabilities sum to 1)
Loss Function      | BCEWithLogitsLoss           | CrossEntropyLoss

Critical Note on CrossEntropyLoss

PyTorch's nn.CrossEntropyLoss expects raw logits as input, not probabilities. The function internally applies LogSoftmax.

Do not apply Softmax to your model's output layer before passing it to this loss function, or you will effectively use the operation twice, degrading model performance.
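A minimal sketch of how that fits together, assuming the NLinear model above, the spiral tensors X_train / y_train (with y_train as integer class indices), and an illustrative learning rate:

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()                       # expects raw logits and integer labels
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

logits = model(X_train)                               # shape [n_samples, 3]; no Softmax here
loss = loss_fn(logits, y_train)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Softmax is only applied when converting logits into probabilities for predictions
pred_labels = torch.softmax(logits, dim=1).argmax(dim=1)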

ReLU experiment on the Stanford dataset

  • Result: The addition of non-linearity yielded a drastic improvement. The model achieved 91.25% accuracy on the training set and 96.67% on the test set.

  • Visual Analysis: The decision boundary successfully adjusted to separate the three spiral arms, validating that the combination of hidden layers and ReLU units enables the approximation of complex nonlinear functions.

Experiment III: ReLU vs. Tanh

To test the stability of different activation strategies, we replicated our successful architecture but replaced the nn.ReLU() activation with nn.Tanh().

Changing ReLU to Tanh
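In code, the swap looks roughly like this (the class name NLinearTanh is illustrative):

from torch import nn

class NLinearTanh(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(in_features=2, out_features=10),
            nn.Tanh(),                                # Tanh instead of ReLU
            nn.Linear(in_features=10, out_features=10),
            nn.Tanh(),
            nn.Linear(in_features=10, out_features=3),
        )

    def forward(self, x):
        return self.linear_layer_stack(x)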

What is Tanh? While ReLU cuts off negative values at zero, Tanh is a smooth, S-shaped curve that squashes input values to a range of -1 to 1.
Tanh is zero-centered, meaning its output has an average closer to 0, which theoretically helps center the data for the next layer.
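For reference, Tanh is defined as

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

so its output flattens out near -1 and 1 for large negative and positive inputs.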

Tanh model performance

The Results

  • Performance: Despite having the same parameter count and architecture depth, the Tanh model flatlined at 36.67% accuracy.

  • Comparison: The ReLU model achieved high accuracy, while Tanh failed to improve beyond the baseline.

Diagnosis: The Vanishing Gradient Problem

With three distinct classes, an accuracy of ~36% is equivalent to random guessing. Effectively, the model learned nothing.

  1. Saturation: In deep networks, inputs can easily become large (positive or negative). On the Tanh curve, large inputs land in the flat regions where the slope is nearly horizontal.
  2. Zero Gradients: When the slope is near zero, the gradient calculated during backpropagation is also near zero (the short check after this list shows this directly).
  3. No Learning: Because the gradient determines how much we update the weights, a near-zero gradient means the weights never change. The signal vanishes before it reaches the early layers, halting the learning process.
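A quick way to see this in PyTorch (printed values are approximate):

import torch

x = torch.tensor([0.5, 3.0, 10.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)   # roughly [0.79, 0.0099, 0.0000000082]; shrinks toward zero as inputs grow

x = torch.tensor([0.5, 3.0, 10.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # [1., 1., 1.]; constant gradient for every positive input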

Summary of Findings

  1. Geometry Dictates Architecture
    Linear models are fundamentally incapable of solving non-linear problems. As demonstrated by our baseline experiment, no amount of hyperparameter tuning or extended training duration (epochs) can force a linear model to capture curved data structures, such as spirals. The limitation is mathematical, not computational.

  2. Activation Functions Are Critical
    The choice of activation function is not just a detail; it is a structural necessity. Our experiments showed that ReLU is superior to Tanh for this type of data.

The Tanh model suffered from the vanishing gradient problem, causing it to flatline at a random-guess accuracy of ~36%.

The ReLU model maintained healthy gradient flow, enabling it to learn the complex decision boundary and achieve a final accuracy of 96.67%.

Final Conclusion

Building neural networks is not just about stacking layers; it is about matching the model's geometric capacity to the data's shape.
