Julie Fisher

KNN: The Importance of Being Scaled

What is Scaling?

In short, scaling is a data preprocessing step that transforms features so they have similar ranges or distributions. This is especially important when features have vastly different units or magnitudes. For example, in our Social_Network_Ads dataset, the Age column ranges from 18 to 60, whereas the EstimatedSalary column ranges from 15,000 to 150,000. That's a substantial difference in magnitude.

Without scaling, models that rely on distance (like KNN) will be biased toward features with larger numeric ranges. According to Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurelien Geron: “With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.”

In my experience, that's not entirely true. Tree-based algorithms generally don't care about scale, and those are the algorithms that tend to fit my use cases best. The way I think about tree-based models is that they bin features, then split based on the bins; either way, only the ordering of values matters, not their magnitude. All of which is to say: I'd basically stopped bothering to scale features.
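
If you want to see that tree-based indifference for yourself, here's a minimal sketch on synthetic data (not from this post's dataset): because tree splits depend only on the ordering of values within each feature, and scaling is monotonic, the scaled and unscaled trees should make identical predictions.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [1, 10_000]       # two features on wildly different scales
y = (X[:, 0] + X[:, 1] / 10_000 > 0).astype(int)  # label depends on both features

scaler = StandardScaler().fit(X)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

# Should print True: same splits (up to scale), same predictions
print((tree_raw.predict(X) == tree_scaled.predict(scaler.transform(X))).all())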

When using a distance-based algorithm such as KNN, though, it turns out that ignoring this difference in scale is a mistake. When calculating the distance between points, a feature with a much larger range than the others can dominate the distance calculation and skew the results. We'll look at the different distance metrics in another post in the series. For now, let's prove that scaling matters.
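
To make that concrete before we run any models, here's a tiny self-contained example with made-up points in (Age, EstimatedSalary) space. Customer b is 30 years older than a but earns nearly the same; customer c is almost the same age as a but earns 20,000 more. Euclidean distance says b is the closer neighbor, purely because salary units dwarf age units.

import numpy as np

a = np.array([20, 50_000])
b = np.array([50, 51_000])   # 30 years apart, salaries nearly identical
c = np.array([21, 70_000])   # ages nearly identical, salaries 20,000 apart

print(np.linalg.norm(a - b))  # ~1,000: the age gap barely registers
print(np.linalg.norm(a - c))  # ~20,000: the salary gap dominates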

Want to know more about scaling? I covered it more thoroughly in an old blog post Entry 8: Centering and Scaling. Or feel free to turn to your favorite data science/machine learning book. They pretty much all cover scaling, as does section 7.3 Preprocessing Data of the scikit-learn documentation. It's a common data transformation.

Unscaled Features: KNN Baseline

We're going to train nearly 100 models using neighbor values from 2 to 99 (range(2, 100) in the code below) and visualize their fit. We'll do this using the untransformed features, then compare it to a scaled version of the features.

The code to load the modules we'll need should look familiar by now. I turned the train/test split code into a function, as well as the code to fit a model and predict on our test values. These helper functions will let us easily run the ~200 different models we'll be training.

# Load in our libraries
# These should always be at the top of your notebook/script
import os

import kagglehub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
def split_data(df, random_state=52, train_size=0.7, target_col='Purchased'):
    # Stratify on the target so both splits keep the same class balance
    train_data, test_data = train_test_split(
        df, train_size=train_size, random_state=random_state, stratify=df[target_col]
    )

    trainX = train_data.drop(target_col, axis=1)
    trainY = train_data[target_col]

    testX = test_data.drop(target_col, axis=1)
    testY = test_data[target_col]
    return trainX, trainY, testX, testY

def train_eval(model, trainX, trainY, testX, testY):
    # Fit on the training split and return accuracy on the test split
    model.fit(trainX, trainY)
    test_preds = model.predict(testX)
    accuracy = accuracy_score(testY, test_preds)
    return accuracy

Next we load the data. This code should look familiar from the last two posts.

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

# Keep only the two numeric features we need
df = df.drop(columns=["User ID", "Gender"])
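
Before we start modeling, it's worth eyeballing the magnitude gap mentioned earlier. A quick sanity check, nothing fancy:

# Confirm the scale difference between our two features
print(df[['Age', 'EstimatedSalary']].agg(['min', 'max']))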


Here comes the fun part. (Yes, I'm aware of how that sounds. Don't worry, I already know that I'm a nerd.)

Now we'll iterate through the neighbor range of 2 to 99 and store the accuracies in a list. Just because we can, we'll also run it using six different random_state values. Using multiple random_state values helps us understand how sensitive the model is to different train/test splits: if performance varies wildly, it may indicate instability in the model or the dataset.

random_states = [52, 9, 130, 404, 20, 119]
nbr_neighbors = range(2, 100)

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 18))
axes = axes.flatten()

all_accuracies = []

for idx, random_state in enumerate(random_states):
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    accuracies = []
    for x in nbr_neighbors:
        model = KNeighborsClassifier(n_neighbors=x)
        metrics = train_eval(model, trainX, trainY, testX, testY)
        accuracies.append(metrics)

    all_accuracies.extend(accuracies)

    ax = axes[idx]
    ax.plot(nbr_neighbors, accuracies, marker='o')
    ax.set_title(f'Random state: {random_state}')
    ax.set_xlabel('Number of Neighbors (k)')
    ax.set_ylabel('Accuracy')
    ax.grid(True)

# Set consistent y-axis limits
y_min = min(all_accuracies) - 0.01
y_max = max(all_accuracies) + 0.01
for ax in axes:
    ax.set_ylim([y_min, y_max])

plt.tight_layout()
plt.show()

Unscaled Accuracies

Across the different random_state values, there is high variability in the first 10 or so neighbors; after that, the pattern changes based on the random_state we used. The range of accuracy across the different versions is pretty consistently between 0.725 and 0.9, mostly declining as the number of neighbors grows. In short, with the current, unscaled features, KNN's performance is sensitive to both the number of neighbors and the train/test split.

Visualizing Fit: An Unscaled Hot Mess

The real kicker that proves how scaling impacts our model, though, turns out to be visualizing the fit.

To create the below plot, I gave CoPilot the code for the scatter plot from the first post of the series (sns.scatterplot(data=df, x="Age", y="EstimatedSalary", hue="Purchased")) and asked it to give me code that would visualize the boundary.

When I first ran this code and saw the output, I thought CoPilot had given me bad code. Take a look for yourself.

# Note: random_state here is whatever was left over from the loop above (119)
trainX, trainY, testX, testY = split_data(df, random_state=random_state)

# Create mesh grid for decision boundary
x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 22))
axes = axes.flatten()

for idx, nbr_neighbors in enumerate([2, 5, 7, 10, 20, 40, 70, 100]):
    model = KNeighborsClassifier(n_neighbors=nbr_neighbors)
    model.fit(trainX, trainY)

    # Predict on mesh grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and scatter plot
    ax = axes[idx]
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
    sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary', hue='Purchased', palette='bright', ax=ax)
    ax.set_title(f'KNN Decision Boundary for {nbr_neighbors} Neighbors')
    ax.set_xlabel('Age')
    ax.set_ylabel('EstimatedSalary')
    ax.grid(True)

plt.tight_layout()
plt.show()

Unscaled Fit

What the what is that hot mess? The decision boundaries look more like a printout from an InkJet that was running out of ink. All of the boundaries are horizontal and don't adhere to our values very well at all. It doesn't look anything like the nice, neat, highly intuitive plot in An Introduction to Statistical Learning with Applications in Python.

I took a screenshot of one of the above plots and fed it back into CoPilot with the highly restrained and professional remark of "Why am I not getting a nice decision boundary?" I 100% expected it to give me new code.

Instead I got a paragraph back on the "scale of your features" along with a suggestion to use sklearn.preprocessing.StandardScaler.

Visualizing Fit: Beautiful Scaling

Ah ha.

Turns out our models were weighting the EstimatedSalary feature so much that the boundaries were almost exclusively determined by that single feature. As CoPilot succinctly stated, "the model is overly influenced by the EstimatedSalary feature."
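
You can watch this happen directly by asking which neighbors get picked for a single query point before and after scaling. A minimal sketch (the query point is made up; NearestNeighbors uses the same neighbor search that KNeighborsClassifier relies on):

from sklearn.neighbors import NearestNeighbors

X = df[['Age', 'EstimatedSalary']].to_numpy()
query = [[30, 87_000]]  # a hypothetical customer

# Unscaled: the five nearest neighbors are chosen almost entirely by salary
_, idx = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(query)
print(X[idx[0]])

# Scaled: the neighbors are now close in both Age and EstimatedSalary
demo_scaler = StandardScaler().fit(X)
_, idx = (NearestNeighbors(n_neighbors=5)
          .fit(demo_scaler.transform(X))
          .kneighbors(demo_scaler.transform(query)))
print(X[idx[0]])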

That's easy enough to test with some minor alterations to our earlier plotting code. We can just transform our features using StandardScaler.

StandardScaler doesn't have a set value range (i.e. it won't always fall between 0 and 1). Technically, it could fall anywhere from negative infinity to positive infinity. However, it uses mean and standard deviation to bring it into a generally useful magnitude. The actual equation is z = (x - u) / s where:

  • z: standard score
  • x: the observed value
  • u: the mean of the training samples
  • s: the standard deviation of the training samples
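
As a quick sanity check on that formula, here's a minimal sketch with made-up values confirming that StandardScaler matches the manual calculation (note it uses the population standard deviation, which is NumPy's default):

import numpy as np
from sklearn.preprocessing import StandardScaler

vals = np.array([[18.0], [30.0], [60.0]])     # made-up ages
z_manual = (vals - vals.mean()) / vals.std()  # z = (x - u) / s

print(np.allclose(StandardScaler().fit_transform(vals), z_manual))  # True

With that confirmed, here's the plotting code with the scaling added:
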
# As before, random_state carries over from the accuracy loop above (119)
trainX, trainY, testX, testY = split_data(df, random_state=random_state)

scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

# Create mesh grid for decision boundary
x_min, x_max = trainX['Age'].min() - 1, trainX['Age'].max() + 1
y_min, y_max = trainX['EstimatedSalary'].min() - 1, trainX['EstimatedSalary'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Transform the mesh into the scaled space so predictions match the model,
# while the plot axes stay in the original units
xx_scaled, yy_scaled = scaler.transform(np.c_[xx.ravel(), yy.ravel()]).T

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 22))
axes = axes.flatten()

for idx, nbr_neighbors in enumerate([2, 5, 7, 10, 20, 40, 70, 100]):
    model = KNeighborsClassifier(n_neighbors=nbr_neighbors)
    model.fit(trainX_scaled, trainY)

    # Predict on mesh grid
    Z = model.predict(np.c_[xx_scaled.ravel(), yy_scaled.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and scatter plot
    ax = axes[idx]
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
    sns.scatterplot(data=trainX.assign(Purchased=trainY), x='Age', y='EstimatedSalary', hue='Purchased', palette='bright', ax=ax)
    ax.set_title(f'KNN Decision Boundary for {nbr_neighbors} Neighbors')
    ax.set_xlabel('Age')
    ax.set_ylabel('EstimatedSalary')
    ax.grid(True)

plt.tight_layout()
plt.show()

Scaled Fit

Those plots look much better. We can now clearly see how the model is overfitting at neighbor numbers 2 and 7. We can also see underfitting as we get up into the neighbor range of 70 to 100.

Scaling and Model Performance

Let's do a quick test to see if this impacts our accuracy too.

random_states = [52, 9, 130, 404, 20, 119]
nbr_neighbors = range(2, 100)

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 18))
axes = axes.flatten()

all_accuracies = []

for idx, random_state in enumerate(random_states):
    trainX, trainY, testX, testY = split_data(df, random_state=random_state)

    scaler = StandardScaler()
    trainX_scaled = scaler.fit_transform(trainX)
    testX_scaled = scaler.transform(testX)

    accuracies = []
    for x in nbr_neighbors:
        model = KNeighborsClassifier(n_neighbors=x)
        metrics = train_eval(model, trainX_scaled, trainY, testX_scaled, testY)
        accuracies.append(metrics)

    all_accuracies.extend(accuracies)

    ax = axes[idx]
    ax.plot(nbr_neighbors, accuracies, marker='o')
    ax.set_title(f'Random state: {random_state}')
    ax.set_xlabel('Number of Neighbors (k)')
    ax.set_ylabel('Accuracy')
    ax.grid(True)

# Set consistent y-axis limits
y_min = min(all_accuracies) - 0.01
y_max = max(all_accuracies) + 0.01
for ax in axes:
    ax.set_ylim([y_min, y_max])

plt.tight_layout()
plt.show()

Scaled Accuracy

Our accuracy now ranges from around 0.77 to 0.95, which is better than when the features were unscaled. More importantly, we get a more stable trend in accuracy across the different random_state and n_neighbors values:

  • Fewer than 10 neighbors gives low or highly variable results because the model has overfit to the data
  • Then there is a steady stretch where we mostly get the same accuracy because the model fits well
  • Finally, we see a steady decline as the model underfits
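
If you'd rather summarize those curves programmatically than eyeball them, one option (a sketch that assumes the all_accuracies list from the scaled run above, which holds six splits' worth of 98 accuracies each) is to average across the splits for each k and look for the peak:

import numpy as np

# Reshape the flat list into (n_splits, n_k_values) and average across splits
acc = np.array(all_accuracies).reshape(len(random_states), -1)
mean_acc = acc.mean(axis=0)

best_k = list(nbr_neighbors)[int(mean_acc.argmax())]
print(f"Best average accuracy {mean_acc.max():.3f} at k={best_k}")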

We'll discuss what overfitting and underfitting mean in the next post. For now, the takeaway is that while tree-based models may shrug off unscaled features, distance-based models like KNN demand careful preprocessing. Scaling isn't just a formality; it can make or break your model's performance.
