How do we validate a machine learning model?
When evaluating a machine learning model, training and testing on the same dataset is not a great idea. Why? Let us draw a relatable analogy.
Ever since our school days, we have been taking exams. How are those exams designed? They are designed to test our understanding of the subject rather than our ability to memorize! The same analogy carries over to our machine learning model as well.
Here’s the answer to the question ‘Why can we not evaluate a model on the same data that it was trained on?’ Evaluating this way inherently encourages the model to memorize the training data: it performs extremely well on the training data but generalizes poorly, doing badly on data it has never seen before. In other words, it overfits the training dataset.
Because a model's performance on data it has never seen before is a more reliable estimate of its real-world performance, we usually validate a model by checking how it performs on such out-of-sample data. If you remember, it is for this reason that we use the train_test_split method from our very friendly and nifty library, scikit-learn.
train_test_split splits the available data into two sets, a training set and a test set, in a certain proportion; for example, train on 70% of the available data and test on the remaining 30%. This ensures that every record in the dataset is in either the training set or the test set, but not both, so we are testing the model's performance on unseen data.
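As a quick sketch on a toy dataset (the names X_toy and y_toy are illustrative and not part of the iris example that follows), the proportion is controlled by the test_size parameter:
# a tiny illustrative dataset: 10 samples with one feature each
from sklearn.model_selection import train_test_split
X_toy = [[i] for i in range(10)]
y_toy = [0, 1] * 5
# keep 30% of the samples aside for testing
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))  # 7 training samples, 3 test samples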
But is this good enough? Or should we take it with a pinch of salt?
Let us train a simple KNeighborsClassifier in scikit-learn on the iris dataset. As shown below, we import the necessary modules.
# Necessary imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
Let us load the iris data and separate out the features (sepal length, sepal width, petal length, and petal width) and the target variable, which is the class label indicating the iris type (Setosa, Versicolour, or Virginica).
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
Let’s create the train and test sets with random_state=4. Setting random_state ensures reproducibility; in this case, it ensures that the records that go into the train and test sets stay the same every time our code is run.
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
We now instantiate the KNeighborsClassifier with n_neighbors=5, fit it on the training set, and predict on the test set.
# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
# Output
0.9736842105263158
The accuracy obtained is 0.9736842105263158. Now, let us change random_state to a different value, say 20. What do you think the accuracy score would be? It is now 0.9473684210526315. Set random_state to yet another value, and we would get yet another accuracy score.
The evaluation metric obtained this way is therefore subject to high variance: it depends heavily on which data points end up in the training set and which end up in the test set, so the result can differ significantly depending on how the split is made. Clearly, this does not seem like the best way to validate our model's performance! The short experiment below makes the variance visible.
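Here is a minimal sketch of that experiment, reusing the X, y and imports defined above and looping over a handful of arbitrary random_state values (the particular seeds are illustrative):
# evaluate the same KNN model on several different train/test splits
for seed in [1, 4, 20, 42, 99]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(seed, metrics.accuracy_score(y_test, y_pred))
The accuracies printed for different seeds will generally differ, which is exactly the variance we would like to average away.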
How do we reach a consensus on how to calculate the accuracy score?
One very natural thing to do would be to create multiple train/test splits, calculate the accuracy for each split, and compute the average of all the accuracy scores obtained. This certainly seems like a better estimate of the accuracy, doesn’t it? That is precisely the essence of cross-validation, which we shall see in the subsequent section.
Understanding K-fold cross-validation
Steps in K-fold cross-validation
- Split the dataset into K equal partitions (or “folds”).
- Use fold 1 for testing and the union of the other folds as the training set.
- Calculate accuracy on the test set.
- Repeat steps 2 and 3 K times, using a different fold for testing each time.
- Use the average accuracy on different test sets as the estimate of out-of-sample accuracy.
Let us visualize this by splitting a dataset of 25 observations (numbered 0 through 24) into 5 equal folds, as shown below; 5-fold cross-validation then runs for 5 iterations.
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False).split(range(25))
# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
    print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
# Output
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
We observe the following:
- For each iteration, every observation is either in the training set or the testing set, but not both.
- Every observation is in the test set exactly once.
- Each fold is used as the test set exactly once and in the training set (K-1) times.
The average accuracy thus obtained is a more reliable estimate of out-of-sample accuracy. This process also uses the data more efficiently, as every observation is used for both training and testing. It is recommended to use stratified sampling when creating the folds, so that each fold preserves the class proportions of the full dataset; scikit-learn’s cross_val_score does this by default for classification problems.
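As a small illustration, StratifiedKFold can be used directly on the iris labels to check that every test fold contains all three classes in the expected proportions (this sketch reuses the X and y defined above):
# count the class labels that land in each stratified test fold
import numpy as np
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    classes, counts = np.unique(y[test_idx], return_counts=True)
    print(fold, dict(zip(classes, counts)))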
In practice, we can do even better by doing the following:
- “Hold out” a portion of the data before beginning the model building process.
- Find the best model using cross-validation on the remaining data, and test it using the hold-out set.
- This gives a more reliable estimate of out-of-sample performance, since the hold-out set is truly out-of-sample; a short sketch of this workflow follows the list.
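Here is a minimal sketch of that workflow, assuming we keep 20% of the iris data aside as a hold-out set (the 20% proportion and the final K=5 model are illustrative choices, not prescriptions):
# hold out 20% of the data before any model selection
from sklearn.model_selection import cross_val_score, train_test_split
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y)
# model selection with cross-validation on the remaining data
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X_dev, y_dev, cv=10, scoring='accuracy').mean())
# final check of the chosen model on the untouched hold-out set
knn.fit(X_dev, y_dev)
print(knn.score(X_holdout, y_holdout))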
Cross-validation for hyperparameter tuning
For the KNN classifier on the iris dataset, can we use cross-validation to find the optimal value of K, that is, to search for the optimal value of n_neighbors?
Remember, K in the KNN classifier is the number of neighbors (n_neighbors) that we take into account when predicting the class label of a test sample. It is not to be confused with the K in K-fold cross-validation.
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
# Output
[1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
# Output
0.9666666666666668
Now, let us run 10-fold cross-validation for models with different values of n_neighbors, as shown below.
# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)
# Output k_scores
[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]
That is hard to read as a raw list of numbers; let us plot the values to get a better idea.
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
We see that n_neighbors (K) values from 13 to 20 yield higher accuracy, especially K=13, 18, and 20. As a larger value of K yields a less complex model, we choose K=20.
This process of searching for the optimal values of hyperparameters is called hyperparameter tuning.
In this example, we chose the value of K that resulted in the highest mean accuracy score under 10-fold cross-validation.
This is how cross-validation can be used to search for the best hyperparameters, and this process can be done much more efficiently in scikit-learn, as sketched below.
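For instance, GridSearchCV runs the same cross-validated search over a parameter grid for us. Here is a minimal sketch, reusing X and y from above with the same 1-30 range for n_neighbors:
# grid-search n_neighbors from 1 to 30 with 10-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)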