Bala Priya C

Hyperparameter Tuning: Understanding Randomized Search

Hi everyone!
This is the third in a series of articles on cross-validation and hyperparameter tuning; you may want to give the first two a read before going ahead with this post.

Let us quickly outline the key ideas that we’ve covered thus far.

  • In the first article, we observed how the performance estimate from a single train/test split depends heavily on the particular records that end up in the training and test sets, which motivated cross-validation as a more reliable way to evaluate a model’s performance.

  • We then saw how cross-validation can be used in hyperparameter search: we chose the hyperparameter values that yielded the highest cross-validated accuracy.

  • In the second article, we looked at searching for the best hyperparameters more conveniently using GridSearchCV in scikit-learn. If you remember, towards the end of the article, we touched upon the fact that performing Grid Search to tune hyperparameters is actually computationally inefficient.

Let us now state this a little more formally.

  • Say we have to search over M hyperparameters, p_1, p_2, …, p_M. Let the number of candidate values for p_1 be n_1, for p_2 be n_2, and so on, with n_M values for p_M.
  • Grid Search takes all possible hyperparameter settings (combinations) into account and builds a model for each setting in order to pick the one with the best score.

  • To see how this adds up, suppose we freeze the values of all hyperparameters except one, say the M-th parameter p_M. Grid Search then searches through the list of n_M candidate values for p_M, so n_M models are created.

  • Now suppose we freeze all hyperparameters except the last two, p_(M-1) and p_M, which have n_(M-1) and n_M candidate values respectively. We have to search through all possible combinations of the two.

  • One way to do this: fix a value of p_(M-1) and search through all n_M values of p_M; to cover all combinations, we repeat this for each of the n_(M-1) values of p_(M-1). This process leaves us with n_(M-1) * n_M models.

Extending this argument to all M hyperparameters, we would have n_1 * n_2 * … * n_M models, so the cost grows multiplicatively with the number of candidate values per hyperparameter. This is why we said that things could scale up quickly and become computationally intractable with Grid Search; the short sketch below makes the count concrete.
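
To make the count concrete, here is a small sketch with a hypothetical search space of M = 3 hyperparameters; Grid Search would build one model per element of the Cartesian product of the candidate lists:

from itertools import product

# hypothetical candidate lists: n_1 = 4, n_2 = 3, n_3 = 2 values
p_1 = [1, 5, 10, 20]           # e.g. n_neighbors
p_2 = [0.1, 1.0, 10.0]         # e.g. a regularization strength
p_3 = ['uniform', 'distance']  # e.g. weights

all_settings = list(product(p_1, p_2, p_3))
print(len(all_settings))  # 24 models, i.e. n_1 * n_2 * n_3 = 4 * 3 * 2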

With this motivation to make hyperparameter search computationally more efficient, let us proceed to understand Randomized Search.

Understanding RandomizedSearchCV

In contrast to GridSearchCV, RandomizedSearchCV does not try out all parameter values; instead, a fixed number of parameter settings is sampled from the specified distributions or lists of values.

If some of the hyperparameters we’re searching over are continuous, we should specify a distribution rather than a list of values when defining the parameter grid, as in the sketch below. How do we fix the number of parameter settings to try?
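
For instance, a continuous hyperparameter can be described by a scipy.stats distribution object instead of a list; RandomizedSearchCV samples values from any object that provides an rvs method. Here is a minimal, purely illustrative sketch (the C parameter of a hypothetical regularized model, not part of our KNN example):

from scipy.stats import uniform

# sample a continuous hyperparameter uniformly from [0.1, 10.1)
# (illustrative only; the KNN search below uses discrete lists)
param_dist_continuous = {'C': uniform(loc=0.1, scale=10)}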

The number of parameter settings that are tried is given by n_iter, and there’s a quality vs. computational cost trade-off in picking it.

A very small value of n_iter would imply that we’re more likely to find a suboptimal solution, because we are actually considering too few combinations.

A very high value of n_iter means we are more likely to get close to the hyperparameters that yield the best model, but this again comes with a high computational cost. In fact, if we set n_iter = n_1 * n_2 * … * n_M in the previous example (with lists of values rather than distributions), we are essentially considering all possible hyperparameter combinations, and Randomized Search and Grid Search become equivalent.

Let us build on the same KNeighborsClassifier example from the previous blog posts. In addition to n_neighbors, let us also search for the optimal weighting strategy: 'uniform', where all points are weighted equally, and 'distance', which weights points by the inverse of their distance. Now, let us implement Randomized Search in scikit-learn with the following steps, as we did for Grid Search.
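
The snippets below reuse the setup from the previous posts. For completeness, here is a minimal sketch of that setup; I'm assuming the iris dataset and the candidate lists used earlier (30 values of n_neighbors and the two weight options, which is what gives the 60 Grid Search combinations mentioned below):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# features and target (iris dataset, as in the earlier posts)
iris = load_iris()
X, y = iris.data, iris.target

# candidate hyperparameter values to sample from
k_range = list(range(1, 31))              # n_neighbors: 1 to 30
weight_options = ['uniform', 'distance']  # weighting strategies

# the estimator we are tuning
knn = KNeighborsClassifier()
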
1. Import RandomizedSearchCV class

from sklearn.model_selection import RandomizedSearchCV

2. Define the parameter grid

# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)

3. Instantiate the randomized search with n_iter=10, fit it, and view the results

# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

# Output: results DataFrame

mean_test_score std_test_score  params
0   0.973333    0.032660    {'weights': 'distance', 'n_neighbors': 16}
1   0.966667    0.033333    {'weights': 'uniform', 'n_neighbors': 22}
2   0.980000    0.030551    {'weights': 'uniform', 'n_neighbors': 18}
3   0.966667    0.044721    {'weights': 'uniform', 'n_neighbors': 27}
4   0.953333    0.042687    {'weights': 'uniform', 'n_neighbors': 29}
5   0.973333    0.032660    {'weights': 'distance', 'n_neighbors': 10}
6   0.966667    0.044721    {'weights': 'distance', 'n_neighbors': 22}
7   0.973333    0.044222    {'weights': 'uniform', 'n_neighbors': 14}
8   0.973333    0.044222    {'weights': 'distance', 'n_neighbors': 12}
9   0.973333    0.032660    {'weights': 'uniform', 'n_neighbors': 15}

4. Examine the best score and best hyperparameters

# examine the best model
print(rand.best_score_)
print(rand.best_params_)

# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}

Parameters of the best model

  • Surprisingly, we see that the highest accuracy score obtained in this case, where we only looked at 10 different parameter settings instead of 60 in Grid Search, is the same as before: 0.98 ✔
  • And the best value found, n_neighbors = 18, is also one of the optimal values we got when we initially searched over n_neighbors alone (recall the earlier blog post with the same example).
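
Since refit=True by default, the search object also holds the model refit on the whole dataset with these best hyperparameters, so we can make predictions with it directly. A small usage sketch (the sample observation below is made up):

# the KNN model refit on all of X, y with the best hyperparameters
best_knn = rand.best_estimator_

# predict the class of a new (hypothetical) observation
print(best_knn.predict([[5.9, 3.0, 5.1, 1.8]]))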

Maybe we just got lucky?
What is the guarantee that we will always get the best results?
Ah, this question makes perfect sense, doesn’t it?

Let us now run RandomizedSearchCV multiple times and see how many times we really end up getting lucky!

➡️ Run RandomizedSearchCV 20 times and see what happens; we log the best_score_ of every run.

# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score of each run
best_scores = []
for _ in range(20):
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
    rand.fit(X, y)
    best_scores.append(round(rand.best_score_, 3))

# examine all 20 best scores
print(best_scores)

# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]

Upon examining the best scores for all 20 runs, we see that we get the best accuracy score of 0.98 in 13 of them.
Looks like we’re lucky indeed! What about the other 7 runs, where we didn't quite get the best accuracy score? Those scores are 0.973, which is pretty close to 0.98.
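
To tally these outcomes instead of counting by eye, here is a quick follow-up on the best_scores list above:

from collections import Counter

# how often each best score occurred across the 20 runs
print(Counter(best_scores))
# e.g. Counter({0.98: 13, 0.973: 7}) for the run shown above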

This suggests that even though Randomized Search may not always find the hyperparameters of the best-performing model, the models it does find do not perform much worse than the best model obtained from Grid Search; they are very close to optimal.

In essence, the hyperparameters found may not be the very best, but they are close to the best, and they are found at a fraction of the computational cost. Hope you now have a good sense of how to use Randomized Search for hyperparameter tuning.
Thanks for reading! Happy Learning! Until next time!

References

[1] Here’s the link to the Google Colab notebook for the example discussed above
[2] Introduction to Machine Learning in Python with scikit-learn by DataSchool.
[3] scikit-learn documentation: RandomizedSearchCV
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Photo by Chris Lawton on Unsplash
