Fitting KNN: From Overfit to Underfit and Everything Between

Julie Fisher

Machine learning models, like clothes, are all about fit — too tight, and they can’t move; too loose, and they lose their shape.

TL;DR

Fit Type        Characteristics                               Model Behavior
Overfit         High variance, low bias, jagged boundaries    Memorizes noise
Generalizable   Balanced bias/variance                        Follows data boundaries
Underfit        Low variance, high bias, smooth boundaries    Overgeneralizes

Finding the Right Fit

Just like choosing clothes that flatter the shape of the wearer, a well-fit model captures the underlying pattern of data without clinging too tightly to its quirks or hanging too loosely from its actual structure. In this post, we’ll explore what it means for a KNN model to fit “just right” — not too tight, not too loose — and how we can visualize that balance in action.

There are three terms to be familiar with when discussing model fit:

  • Generalization: a model is able to make accurate predictions on unseen data, i.e. it is able to generalize from the training set to the test set
  • Overfit: a model is fit too strictly to the training set, including its noise and outliers, making it perform poorly on new data
  • Underfit: a model is fit too loosely or too simply and can't capture the underlying patterns in the training data, so it also performs poorly on new data

A generalizable model is the goal of model training. You want a trained model that can correctly predict outcomes on data it has never seen.

There are many reasons that a model may not be able to generalize to new, unseen data. The two reasons we'll explore in this post are overfitting and underfitting.

The outcome of overfitting and underfitting is the same: the model isn't able to generalize to unseen data. However, the reasons for the failure differ, and so do the methods for fixing it.
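Before we get into the KNN-specific plots, here's a minimal, self-contained aside showing how the two failure modes show up in the numbers. It uses scikit-learn's make_moons toy dataset (not the dataset from this series) purely for illustration: an overfit model scores much higher on training data than on test data, while an underfit model scores poorly on both.

# A minimal sketch on toy make_moons data (not the Social Network Ads dataset)
# showing how over- and underfitting appear as train/test accuracy gaps.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

for k in [1, 15, 200]:  # very few, moderate, and very many neighbors
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:>3}  train acc={model.score(X_train, y_train):.3f}  "
          f"test acc={model.score(X_test, y_test):.3f}")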

Taking Model Measurements

Before you tailor anything, you need good measurements. Here, those “measurements” come from our dataset, our preprocessing, and our choice of model parameters. We’ll prepare the data, define our helper functions, and run KNN models across a range of neighbor values to see how the fit changes.

Just like in the last post, we'll train KNN models with the number of neighbors ranging from 2 to 100. Why 100? Because this range covers the full spectrum from overfit to generalizable to underfit.

I picked accuracy as the single performance metric to use. From the Evaluating KNN: From Training Field to Scoreboard post, you'll recognize this as the "impress stakeholders" (i.e., you the reader) metric. I'm (kinda) joking. We'll explore all the metrics in a later post once we've tackled model fit, variance, and bias, but accuracy is a very common metric, so we'll develop the code using accuracy.

We'll use the same helper functions from the last post and add a couple more. We'll be doing a lot of visualizations in this post, so I threw the accuracy plots and fit plots into functions as well.

I updated the fit plots to display a grid of results at a fixed neighbor value, one panel per random_state. This helps us confirm that observed patterns hold across different data splits rather than being artifacts of one specific split.

# Load in our libraries
# These should always be at the top of your notebook/script
import os
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import kagglehub
# Helper functions
def split_data(df, random_state=52, train_size=0.7, target_col='Purchased'):
    # Stratified train/test split, then separate the features (X) from the target (Y)
    train_data, test_data = train_test_split(df, train_size=train_size, random_state=random_state, stratify=df[target_col])

    trainX = train_data.drop(target_col, axis=1)
    trainY = train_data[target_col]

    testX = test_data.drop(target_col, axis=1)
    testY = test_data[target_col]
    return trainX, trainY, testX, testY

def train_eval(model, trainX, trainY, testX, testY):
    # Fit the model on the training set and return its accuracy on the test set
    model.fit(trainX, trainY)
    test_preds = model.predict(testX)
    accuracy = accuracy_score(testY, test_preds)
    return accuracy

def plot_accuracy_vs_neighbors(df, nbr_neighbors=range(2, 100), random_states=[52, 9, 130, 404, 20, 119]):
    # One subplot per random_state: test accuracy as a function of the number of neighbors
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
    axes = axes.flatten()

    all_accuracies = []

    for idx, random_state in enumerate(random_states):
        trainX, trainY, testX, testY = split_data(df, random_state=random_state)

        # Scale features using statistics learned from the training split only
        scaler = StandardScaler()
        trainX_scaled = scaler.fit_transform(trainX)
        testX_scaled = scaler.transform(testX)

        accuracies = []
        for x in nbr_neighbors:
            model = KNeighborsClassifier(n_neighbors=x)
            metrics = train_eval(model, trainX_scaled, trainY, testX_scaled, testY)
            accuracies.append(metrics)

        all_accuracies.extend(accuracies)

        ax = axes[idx]
        ax.plot(nbr_neighbors, accuracies, marker='o')
        ax.set_title(f'Random state: {random_state}')
        ax.set_xlabel('Number of Neighbors (k)')
        ax.set_ylabel('Accuracy')
        ax.grid(True)

    # Set consistent y-axis limits
    y_min = min(all_accuracies) - 0.01
    y_max = max(all_accuracies) + 0.01
    for ax in axes:
        ax.set_ylim([y_min, y_max])

    plt.tight_layout()
    plt.show()

def plot_fit_at_fixed_neighbors(df, fixed_n_neighbors, random_states = [52, 9, 130, 404, 20, 119]):
    # Create subplots
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
    axes = axes.flatten()

    for idx, random_state in enumerate(random_states):
        # Split and scale the data
        trainX, trainY, testX, testY = split_data(df, random_state=random_state)

        # Scale features; only the scaled training data is used for the boundary plot below
        scaler = StandardScaler()
        trainX_scaled = scaler.fit_transform(trainX)
        testX_scaled = scaler.transform(testX)

        # Create mesh grid for decision boundary
        x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
        y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                            np.linspace(y_min, y_max, 100))

        # Scale mesh grid (wrapped in a DataFrame so the column names match what the scaler was fit on)
        mesh_scaled = scaler.transform(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['Age', 'EstimatedSalary']))

        # Fit KNN model
        model = KNeighborsClassifier(n_neighbors=fixed_n_neighbors)
        model.fit(trainX_scaled, trainY)

        # Predict on mesh grid
        Z = model.predict(mesh_scaled)
        Z = Z.reshape(xx.shape)

        # Plot decision boundary and scatter plot
        ax = axes[idx]
        ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
        sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary',
                        hue='Purchased', palette='bright', ax=ax)
        ax.set_title(f'Random State: {random_state}, Nbr Neighbors {fixed_n_neighbors}')
        ax.set_xlabel('Age')
        ax.set_ylabel('EstimatedSalary')
        ax.grid(True)

    plt.tight_layout()
    plt.show()

The code to load and prepare the data should look familiar from the previous posts.

# Download the dataset from Kaggle and load it into a DataFrame
path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

# Keep only the two numeric features (Age, EstimatedSalary) plus the Purchased target
df = df.drop(columns=["User ID", "Gender"])

As a reminder from the last post, here is the set of plots showing model performance for different random_states.

plot_accuracy_vs_neighbors(df)

KNN Accuracies

I see three distinct areas in each of these plots (the short sketch after this list puts rough numbers on them):

  • Few neighbors: unstable, lower performance: overfit
  • Moderate neighbors: stable, higher performance: generalizable
  • Many neighbors: declining performance: underfit
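To back up that eyeballing, here's a quick sketch, reusing the split_data and train_eval helpers defined above, that reports which neighbor count gives the highest test accuracy for each random_state. The exact winner will likely shift from split to split, which is itself a hint that the stable region matters more than any single "best" k.

# For each random_state, find the neighbor count with the highest test accuracy
# (reuses split_data and train_eval from the helper functions above)
neighbor_range = list(range(2, 100))
for random_state in [52, 9, 130, 404, 20, 119]:
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    scaler = StandardScaler()
    trainX_scaled = scaler.fit_transform(trainX)
    testX_scaled = scaler.transform(testX)

    accuracies = [train_eval(KNeighborsClassifier(n_neighbors=k),
                             trainX_scaled, trainY, testX_scaled, testY)
                  for k in neighbor_range]
    best_idx = int(np.argmax(accuracies))
    print(f"random_state={random_state}: best k={neighbor_range[best_idx]}, "
          f"accuracy={accuracies[best_idx]:.3f}")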

Overfit: The Restrictive Fit

Sometimes a model clings to the data so closely that it captures every wrinkle and crease, even the ones that shouldn't matter. That's overfitting: when a KNN model memorizes the training data instead of learning its general shape. The result looks impressive on known data but uncomfortable and restrictive when faced with something new.

Let's zoom in on the region with few neighbors to see this in action.

plot_accuracy_vs_neighbors(df, range(2, 15))

Low Neighbor Count KNN Accuracies

Each of the random_state plots is slightly different, but we see increasing performance that stabilizes somewhere between 5 and 12 neighbors.

I'm interested in how fit changes at these low values depending on the train/test split. Specifically, whether a KNN model really can find a generalizable fit with as few as 5 neighbors.

For a baseline that we know is overfit, let's first take a look at the fit at 2 neighbors for each of these random_states.

plot_fit_at_fixed_neighbors(df, 2)

KNN Fit 2 Neighbors

At n_neighbors == 2 the overfitting is obvious for all random_states. The decision boundary is jagged and even has islands of fit for single points that are mixed in with the opposing class. With such a small number of neighbors, each prediction is heavily influenced by just one or two nearby points, leading to decision boundaries that perfectly trace the training data but fail to generalize.
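One way to confirm the memorization story beyond eyeballing the boundaries is to compare training accuracy with test accuracy at n_neighbors=2. Here's a quick sketch, again reusing the helpers defined above; if the model is memorizing, the training score should sit noticeably above the test score.

# Compare train vs test accuracy at n_neighbors=2 for each random_state.
# A large gap (high train score, lower test score) is the signature of overfitting.
for random_state in [52, 9, 130, 404, 20, 119]:
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    scaler = StandardScaler()
    trainX_scaled = scaler.fit_transform(trainX)
    testX_scaled = scaler.transform(testX)

    model = KNeighborsClassifier(n_neighbors=2).fit(trainX_scaled, trainY)
    train_acc = accuracy_score(trainY, model.predict(trainX_scaled))
    test_acc = accuracy_score(testY, model.predict(testX_scaled))
    print(f"random_state={random_state}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")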

Now let's take a look at 5 neighbors. This is where the plots for random_state values 20, 119, and 9 seemed like they might have started generalizing.

plot_fit_at_fixed_neighbors(df, 5)

KNN Fit 5 Neighbors

These plots show an improved ability to generalize to the patterns we can see. However, random_state values 9 and 119 still have islands of prediction mixed in, and all of the boundaries are still pretty jagged. These models all still look overfit to me.

Generalizable: The Perfect Fit

A perfectly tailored model moves with the data: it's flexible enough to adapt, but structured enough to hold its shape. In this middle zone, the KNN model generalizes well: it captures the key relationships without being distorted by noise. Here, we'll look at what that balance looks like in both accuracy plots and decision boundaries.

Let's jump to 12 neighbors. By this point, all the accuracy plots show performance stabilizing, indicating that the model has reached a more generalizable state.

plot_fit_at_fixed_neighbors(df, 12)

KNN Fit 12 Neighbors

At this point we can see that the decision boundary has become much more stable and does a good job of separating the regions belonging to each class.

For most of our data splits, this stability lasts until around 50 neighbors. Let's zoom in on the 12 - 50 neighbor region of the accuracy plots.

plot_accuracy_vs_neighbors(df, range(12, 51))

KNN Accuracy, Midrange Neighbor Count

The random_states of 52 and 20 see a sharp decline in performance around 45 neighbors, while random_states 9 and 130 look like they continue to enjoy stability beyond 50 neighbors.

Let's look at 40 neighbors. This number of neighbors should show good results for all of our random_states and give us a comparison against the beginning of our stable range of 12 neighbors.

plot_fit_at_fixed_neighbors(df, 40)

KNN Fit 40 Neighbors

The decision boundaries here all look pretty clean and still very similar to the plots from n_neighbors=12. There are no extreme attempts to include or exclude any particular point.
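As a sanity check on the "generalizable" label, here's a sketch comparing train and test accuracy at 12 and 40 neighbors for each split, reusing the helpers defined above. If these models really are generalizing, the two scores should sit much closer together than they did at 2 neighbors.

# Train vs test accuracy at k=12 and k=40 for each random_state.
# A small gap between the two scores is what we expect from a generalizable fit.
for random_state in [52, 9, 130, 404, 20, 119]:
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    scaler = StandardScaler()
    trainX_scaled = scaler.fit_transform(trainX)
    testX_scaled = scaler.transform(testX)

    for k in [12, 40]:
        model = KNeighborsClassifier(n_neighbors=k).fit(trainX_scaled, trainY)
        train_acc = accuracy_score(trainY, model.predict(trainX_scaled))
        test_acc = accuracy_score(testY, model.predict(testX_scaled))
        print(f"random_state={random_state}, k={k}: train={train_acc:.3f}, test={test_acc:.3f}")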

Underfit: The Baggy Fit

At the far end of the spectrum, an underfit model is like clothing that's too baggy: it smooths over every detail, losing definition and shape. In KNN terms, this happens when we use too many neighbors. The model becomes overly simple, predicting broad averages instead of meaningful distinctions.

Let's zoom in on our third region and see what happens as accuracy declines.

plot_accuracy_vs_neighbors(df, range(45, 101))

KNN Accuracy Underfit range

The decline in accuracy is obvious for all of our random_states. Since we already know what a good fit looks like, let's jump right to the underfit extreme of n_neighbors=100.

plot_fit_at_fixed_neighbors(df, 100)

KNN Fit 100 Neighbors

All of the plots for the different random_states show that we've lost the ability to predict class 1 across a wide band where our previous decision boundaries existed.

We’ve lost too much detail in the decision boundary. It becomes overly smooth and shifts toward the upper-right region of the plots, showing that the model is averaging across both classes rather than distinguishing between them.

If we look at the counts of Purchased 0 vs. 1 below, we can see that 0 makes up about two-thirds of our observations. As a result, the model defaults more and more toward the majority value as it becomes more underfit.

We often use the class distribution itself as a baseline, sometimes called a "naive" or "majority class" model. If our trained model performs better than simply predicting the majority class, we’ve successfully improved beyond baseline.

p_counts = df['Purchased'].value_counts()
p_ratios = df['Purchased'].value_counts(normalize=True)

frequency_df = pd.DataFrame({
    'Count': p_counts,
    'Ratio': p_ratios
})
frequency_df
Purchased  Count   Ratio
0            257  0.6425
1            143  0.3575
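To make that comparison concrete, here's a sketch using scikit-learn's DummyClassifier, which always predicts the most frequent class, alongside a mid-range KNN at 40 neighbors and an underfit KNN at 100 neighbors on one split (random_state=52, chosen arbitrarily). As the model gets more underfit, its test accuracy should drift toward the majority-class baseline of roughly 0.64.

# Majority-class baseline vs. a mid-range and an underfit KNN on one split
from sklearn.dummy import DummyClassifier

trainX, trainY, testX, testY = split_data(df, random_state=52)

scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

# Baseline: always predict the most frequent class seen in the training data
baseline = DummyClassifier(strategy='most_frequent').fit(trainX_scaled, trainY)
print(f"Majority-class baseline accuracy: {baseline.score(testX_scaled, testY):.3f}")

for k in [40, 100]:
    model = KNeighborsClassifier(n_neighbors=k).fit(trainX_scaled, trainY)
    print(f"KNN (k={k}) test accuracy: {model.score(testX_scaled, testY):.3f}")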

Tailoring Model Fit

Every good fit, whether in fashion or in machine learning, comes from iteration. We measure, test, adjust, and refine until the result balances structure and flexibility. By exploring overfitting and underfitting side by side, we've built an intuition for what "fit" really means in KNN and how to choose parameters that let the model move gracefully between precision and generalization.

By visualizing model performance and fit, we were able to see three distinct areas:

  • Overfit: 2 - 12 neighbors
  • Generalizable: 12 - 45 neighbors
  • Underfit: 45+ neighbors

These visualizations give us an intuitive understanding of fit, from overfit to generalizable to underfit, and a foundation for reasoning about what's happening under the hood.

Based on these visualizations, if I were going to choose a number of neighbors for a production model for this use case, I would pick something in the range of 20 - 40 neighbors.
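If I wanted to formalize that choice rather than read it off the plots, one common option (a sketch here, not something from the earlier posts) is k-fold cross-validation over candidate neighbor counts, with the scaler wrapped in a pipeline so it is re-fit on each training fold:

# Cross-validated accuracy over candidate neighbor counts.
# The pipeline ensures the scaler is fit only on each training fold.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = df.drop(columns='Purchased')
y = df['Purchased']

for k in [5, 12, 20, 30, 40, 60, 100]:
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    print(f"k={k:>3}: mean CV accuracy={scores.mean():.3f} (+/- {scores.std():.3f})")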

However, this model was built with only two features. Two features are easy to visualize. As we add more features, which is common in machine learning, it gets harder and harder to visualize how the features interact and how that impacts model performance. In the next post, we'll explore how we can determine model fit based on performance metrics alone.
