
Julie Fisher


Evaluating KNN: From Training Field to Scoreboard

In the last post we looked at the Social_Network_Ads dataset and figured out what features we'll use. In this post, we'll build a KNN model and discuss how to figure out if it's any good.

TLDR;

To train a model and return performance metrics, follow these steps:

1. Load the data

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

2. Prepare the data

df = df.drop(columns=["User ID", "Gender"])

3. Split the data into train and test sets

train_data, test_data = train_test_split(df, train_size=.7, random_state=52, stratify=df['Purchased'])

4. Separate features from the target

trainX = train_data.drop('Purchased', axis=1)
trainY = train_data['Purchased']

testX = test_data.drop('Purchased', axis=1)
testY = test_data['Purchased']

5. Train/Fit the model

model = KNeighborsClassifier(n_neighbors=2)
model.fit(trainX, trainY)

6. Predict on the test set

test_preds = model.predict(testX)

7. Evaluate the model's performance

accuracy = accuracy_score(testY, test_preds)
precision = precision_score(testY, test_preds)
recall = recall_score(testY, test_preds)
f1 = f1_score(testY, test_preds)
roc_auc = roc_auc_score(testY, test_preds)

Pre-Train Prep: Loading and Splitting the Data

The first step is to load the data and reduce it to just the columns we'll be using. This code was created and explained in the last post. Review Exploring K-NN Data: A Beginner’s Guide to EDA and Feature Selection for more information on this process.

# import all the libraries
# imports always go at the top of the file so they're easy to find
import os
import pandas as pd
import kagglehub
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix)
from IPython.display import display, HTML
# load the data
path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
# drop unneeded columns
df = df.drop(columns=["User ID", "Gender"])

Draft Day: Picking the Training and Test Sets

Next we need to split the data into training and test datasets. Here are some terms I'll be using:

  • Dataset: the full set of observations
  • Training dataset or training set: a group of observations, usually comprising 70-90% of the dataset
  • Test dataset or test set: a group of observations, usually comprising 10-30% of the dataset

Why split the data, you ask? Surely we can get better results if we train with all the data, you say?

Using all the data to train a model is like studying from a practice test that never changes the questions. Regardless of your intentions, you'll end up memorizing most of the answers without understanding how to solve the problems.

Models have the same problem. They'll memorize the answers to the training data they're given, but then fall flat when you put them into production and they start seeing observations that weren't in the training data. Don't believe me? Scroll down to the section titled "Practice Scores Don’t Win Championships" to see this phenomenon in action.

To get an accurate approximation for how a model will perform after being put into production, you need to test it on data it's never seen before. This test set is sometimes called a "hold-out" dataset because you keep it in reserve and never let the model see it during training.

The Draft Board: Choosing Players for Training and Testing

Splitting the dataset into training and test datasets is straightforward with scikit-learn's train_test_split function. Just feed it your dataset and tell it the portion of data you want to use for training via the train_size parameter (alternatively, you can specify the portion to hold out for testing with test_size; use one of these parameters, not both).

train_test_split also takes care of other housekeeping, like shuffling the data. This means that if your data was sorted on some value, that value won't be overrepresented in either the training or test dataset.

You can also include the optional random_state parameter. I recommend using this parameter as it allows you to reproduce the same results over and over. In the context of a tutorial like this, it also allows you to get the same results I did (assuming you use the same dataset).

The last parameter I recommend is stratify. The datasets I work with are large and highly imbalanced, meaning the true observations are significantly outnumbered by the false observations. Only 2-5% of my data tends to be true. If I'm looking at 1 million observations and only 2% of them are true, those true values could end up concentrated in either the training or the test dataset. The stratify parameter makes sure that doesn't happen: it ensures the target classes appear in the same proportion in both datasets.

If you need to ensure some other feature is evenly distributed between training and test datasets, you can specify this feature instead. I only ever use it to make sure my target values are evenly represented.

train_data, test_data = train_test_split(df, train_size=.7, random_state=52, stratify=df['Purchased'])
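
If you want to confirm the stratification worked, an optional sanity check is to compare the class proportions across the full dataset and the two splits, using the variables we just created:

# the Purchased proportions should be nearly identical in all three,
# since we stratified the split on the target
print(df['Purchased'].value_counts(normalize=True))
print(train_data['Purchased'].value_counts(normalize=True))
print(test_data['Purchased'].value_counts(normalize=True))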

Training Day: Fitting the KNN Model

Next we train the model. Pretty much all of scikit-learn's estimators expect the features and the target to be passed separately. So first we'll split each dataset into features (commonly referred to as "X") and target (commonly referred to as "y").

There are various naming schemes out there for what to call the X and y dataframes for the training and test datasets. I lean toward the "Readability counts" precept from the Zen of Python. Code is read much more often than it's written, so do your future self a favor and name all of your variables something that will still make sense to you six months from now.

trainX = train_data.drop('Purchased', axis=1)
trainY = train_data['Purchased']

testX = test_data.drop('Purchased', axis=1)
testY = test_data['Purchased']

Now comes the easy part: training our KNN model using scikit-learn's KNeighborsClassifier.

We're going to start simple with a single KNN model that uses an n_neighbors value of two. This just means that the model looks at the two closest neighbors to decide what label to assign to the observation it's currently predicting.

There are a bunch of options, called distance metrics, for how to calculate which neighbors are "closest." We'll get into how those metrics affect the results later in the series.

model = KNeighborsClassifier(n_neighbors=2)
model.fit(trainX, trainY)
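
If you're curious which training observations the model is actually leaning on, KNeighborsClassifier exposes a kneighbors method. Here's an optional sketch that looks up the two nearest training rows for the first observation in the test set (distances are Euclidean by default):

# find the two nearest training neighbors of one test observation
distances, indices = model.kneighbors(testX.iloc[[0]])
print(distances)                # distance to each of the two neighbors
print(trainX.iloc[indices[0]])  # the neighboring training rows themselves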

There's only so much I can cover in a blog post. If you want to know more about how KNN works or the math behind it, I recommend the following resources:

Game Time: Evaluating KNN’s Performance

There are a lot of metrics for measuring both regression models (the output prediction is a number) and classification models (the output prediction is whether the observation belongs to a category). The ones implemented in scikit-learn can be found on the sklearn.metrics page.

In my experience, the most useful metrics for classification are F1, precision, and recall - not necessarily in that order. When I use these metrics on real-world data, they allow me to pick a "best" model based on my needs:

  • Recall: targets models that are better at identifying true positives/reducing false negatives
  • Precision: targets models that reduce false positives
  • F1: targets models that are the best of both worlds (F1 is the harmonic mean of precision and recall)

Accuracy and ROC AUC are very common metrics, so I'm including them in our analysis.

I'd love to tell you about all of the metrics, but I already did that in my old blog:

Yes, I dedicated six blog posts to evaluation metrics. Evaluating how well a model performs has two very important uses:

  • Choosing a "best" model for use/deployment
  • Approximating how well a model will perform once deployed

So it's extremely important to choose the right metric for your use case.

* Pro Tip *: Keep in mind that accuracy can be misleading. In my dataset example, where only 2-5% of observations have a positive target value, if I guess that everything is false I'll be right 95-98% of the time. That's a tough score to beat, but it's totally useless as an actual predictive model.
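
To see this pitfall in code, one option is scikit-learn's DummyClassifier, which can "predict" the majority class for everything. This is just an illustrative sketch using the split from above; our dataset is only mildly imbalanced, but on a 2-5% positive dataset this baseline's accuracy would look impressive while its recall stays at zero:

from sklearn.dummy import DummyClassifier

# a baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(trainX, trainY)
baseline_preds = baseline.predict(testX)

# decent-looking accuracy despite never identifying a single positive
print(accuracy_score(testY, baseline_preds))
print(recall_score(testY, baseline_preds))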

* Advanced Topic Introduction *: ROC AUC is usually the top choice for imbalanced data. In practice I've found that it performs similarly to F1, but I get slightly better results with F1 for my use case. If you have imbalanced data, I'd recommend looking at the results from both and figuring out which works best for your particular situation.
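
One small note on computing it: roc_auc_score is typically given predicted probabilities (or decision scores) rather than hard 0/1 predictions. A common pattern with scikit-learn looks like this sketch:

# use the predicted probability of the positive class for ROC AUC
roc_auc_probs = roc_auc_score(testY, model.predict_proba(testX)[:, 1])
print(roc_auc_probs)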

Detailed descriptions of the five metrics we'll be using as well as their equations are below. Most of these metrics are derived from the confusion matrix though, so we'll discuss that first.

The Confusion Matrix: Fundamentals First

Aptly named, the confusion matrix inspires many a glassy-eyed stare whenever I present it. Joking aside, the term was used by J.T. Townsend in 1971, spread through the literature, and was in popular use by the 1980s. It's purportedly called a "confusion" matrix because it shows which values the model confused, i.e. the errors.

In practice, it's pretty simple, and very helpful.

When making a prediction, there are two states that any single prediction can be in: correct or incorrect.

In my experience, binary classification is the most common type of classification problem. This just means that there are only two possible values for the target. For example, in our Social_Network_Ads dataset, the Purchased value can only be 1 or 0. This means that there are only two states an observation can be in: the target value or not the target value.
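
You can confirm this for our dataset by checking the unique values of the target column:

# the target only ever takes the values 0 and 1
print(df['Purchased'].unique())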

When we combine these states, we get a 2 x 2 grid:

                    Predicted Negative      Predicted Positive
Actual Negative     True Negative (TN)      False Positive (FP)
Actual Positive     False Negative (FN)     True Positive (TP)

The true values are easiest to understand:

  • True Positive: the prediction was Positive and the observation was Positive, i.e. the model correctly identified a Positive observation
  • True Negative: the prediction was Negative and the observation was Negative, i.e. the model correctly identified a Negative observation

The false values are a little trickier to keep straight:

  • False Positive: the prediction was Positive and the observation was Negative
  • False Negative: the prediction was Negative and the observation was Positive

Let's say we're trying to predict whether a vegetable is a carrot or not a carrot and we have pictures that contain either a carrot or an eggplant. Here are the possibilities for our model's prediction:

  • True Positive: the model predicts that a carrot is a carrot
  • True Negative: the model predicts that an eggplant is not a carrot
  • False Positive: the model predicts that an eggplant is a carrot
  • False Negative: the model predicts that a carrot is not a carrot

For more details, explanations, and examples check out the Wikipedia Confusion Matrix entry. I also discuss it in a little more detail in Entry 23: Scoring Classification Models - Theory.
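
To tie the carrot example back to code, here's a tiny sketch with made-up labels (1 = carrot, 0 = not a carrot) that unpacks the four cells using the confusion_matrix function we imported earlier:

# made-up labels for the carrot/eggplant example: 1 = carrot, 0 = not a carrot
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# rows are actual values, columns are predicted values, ordered [negative, positive]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=3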

Accuracy, Precision, Recall, and More — the Stats that Separate MVPs from Benchwarmers

The evaluation metric you optimize for impacts what your model is good at, especially once we get into hyperparameter tuning. Choose the right metric for your use case. When in doubt, run multiple metrics and compare them to determine which best fits what you need out of your model.

Here's a quick reference guide to the most common classification metrics. For more information on each of these metrics, just click on the metric name, which links to the specific section of Entry 23: Scoring Classification Models - Theory for that metric.

  • Accuracy: how often predictions were correct, regardless of whether the correct prediction was for the positive or negative class. Use case: when you want to impress your boss, your boss's boss, or stakeholders. Equation: \frac{TP + TN}{TP + TN + FP + FN}
  • Precision: of all positive predictions, how often the prediction was correct (rate of correctly identified positive predictions). Use case: when you need to reduce false positives. Equation: \frac{TP}{TP + FP}
  • Recall: of all positive observations, how often they were correctly identified by the model (true positive rate). Use case: when you need to increase true positives. Equation: \frac{TP}{TP + FN}
  • F1: the harmonic mean of precision and recall. Use case: when you need the sweet spot between increasing true positives and reducing false positives. Equation: 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}
  • ROC AUC: ROC plots the TPR (recall) against the FPR; AUC is the area under that curve (higher is better). Use case: recommended for imbalanced datasets.
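
If it helps to see the equations in action, here's a small sketch that computes them by hand from made-up confusion-matrix counts:

# made-up counts purely for illustration
tp, fp, fn, tn = 8, 2, 4, 16

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # approximately 0.8, 0.8, 0.667, 0.727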

Practice Scores Don’t Win Championships

Now that we have our trained model and understand the metrics we'll use to evaluate it, let's compare how the model does on the training data and on the test data.

train_preds = model.predict(trainX)
test_preds = model.predict(testX)

train_conf_matrix = confusion_matrix(trainY, train_preds)
train_accuracy = accuracy_score(trainY, train_preds)
train_precision = precision_score(trainY, train_preds)
train_recall = recall_score(trainY, train_preds)
train_f1 = f1_score(trainY, train_preds)
train_roc_auc = roc_auc_score(trainY, train_preds)

test_conf_matrix = confusion_matrix(testY, test_preds)
test_accuracy = accuracy_score(testY, test_preds)
test_precision = precision_score(testY, test_preds)
test_recall = recall_score(testY, test_preds)
test_f1 = f1_score(testY, test_preds)
test_roc_auc = roc_auc_score(testY, test_preds)


For anyone who doubted the importance of evaluating a model on data it's never seen before, the proof is in the pudding. Even for our simple example, the model does substantially better on the training data than it does on the test data.

metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
train_values = [train_accuracy, train_precision, train_recall, train_f1, train_roc_auc]
test_values = [test_accuracy, test_precision, test_recall, test_f1, test_roc_auc]

# put the metrics side by side for comparison
# (use a new variable name so we don't overwrite our original df)
metrics_df = pd.DataFrame({
    'Metric': metrics,
    'Training': train_values,
    'Testing': test_values
})

metrics_df
   Metric     Training   Testing
0  Accuracy   0.900000  0.808333
1  Precision  1.000000  0.833333
2  Recall     0.720000  0.581395
3  F1 Score   0.837209  0.684932
4  ROC AUC    0.860000  0.758230

A precision score of 1 should always raise a red flag. It’s extremely rare for a model to perfectly predict all positives without any false positives. This could indicate data leakage—where the model has access to information it shouldn’t—or another issue in your pipeline.

Additionally, the large gap between our training and test metrics suggests that our model may be overfitting. For example, the drop in recall from 0.72 to 0.58 means it's missing more true positives in the test set. In the next post, we'll discuss identifying overfitting and underfitting and what they mean for our trained model. Later in the series we'll explore how to tune our model to better generalize beyond the training set.

We can also look at the confusion matrices, but they're harder to compare directly since the training and test datasets contain different numbers of observations.

Training confusion matrix:

                    Predicted Negative    Predicted Positive
Actual Negative     180                   0
Actual Positive     28                    72

Testing confusion matrix:

                    Predicted Negative    Predicted Positive
Actual Negative     72                    5
Actual Positive     18                    25
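
One way to put them on the same scale is to normalize each matrix so its rows sum to one; scikit-learn's confusion_matrix supports this via its normalize parameter. A quick sketch:

# normalize='true' divides each row by the number of actual observations
# in that class, so the training and test matrices are directly comparable
train_conf_norm = confusion_matrix(trainY, train_preds, normalize='true')
test_conf_norm = confusion_matrix(testY, test_preds, normalize='true')

print(train_conf_norm)
print(test_conf_norm)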

Recap

In this post, we trained a KNN model and saw how performance metrics like precision, recall, F1, and ROC AUC help us evaluate it. The key takeaway? Models often perform better on training data than on unseen data, and that gap is crucial to understand before deploying. In the next post we'll dive into overfitting and underfitting—what they mean, how to detect them, and why they matter when building robust models.
