Writing Custom Cross-Validation Methods For Grid Search in Scikit-learn

#machinelearning #python #datascience

Recently I wanted to apply Blocking Time Series Split, following this lovely post, in a Grid Search hyper-parameter tuning setting with the scikit-learn library, in order to maintain the time order and prevent information leakage. In this post, I will document what I learned while reading articles, documentation, and blog posts about custom cross-validation generators in Python.

It is great that scikit-learn provides a class called TimeSeriesSplit, which we can use to generate fixed time interval training and test sets. Here is a basic example using scikit-learn data generators. I generate a regression dataset with 5 features and 30 samples. Then I generate 3 splits. For those 3 splits, we obtain at most 10 training examples (max_train_size) and n_samples // (n_splits + 1) = 7 test examples:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import TimeSeriesSplit

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

tscv = TimeSeriesSplit(max_train_size=10, n_splits=3)

for idx, (x, y) in enumerate(tscv.split(X_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

The output follows a Walk Forward Cross Validation pattern:

Split number: 0
Training indices: [0 1 2 3 4 5 6 7 8]
Test indices: [ 9 10 11 12 13 14 15]

Split number: 1
Training indices: [ 6  7  8  9 10 11 12 13 14 15]
Test indices: [16 17 18 19 20 21 22]

Split number: 2
Training indices: [13 14 15 16 17 18 19 20 21 22]
Test indices: [23 24 25 26 27 28 29]
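The sizes in this output can be checked directly. The following is a small sketch, not part of the original experiment, that uses a placeholder array of zeros instead of the regression data (only the number of rows matters for the split) and asserts the properties described above: a fixed test size, a capped training size, and training indices that always precede the test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits, max_train_size = 30, 3, 10
tscv = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size)

# Placeholder data: only the number of rows matters for the split
X_placeholder = np.zeros((n_samples, 1))

sizes = []
for train_idx, test_idx in tscv.split(X_placeholder):
    # Every test fold has n_samples // (n_splits + 1) rows
    assert len(test_idx) == n_samples // (n_splits + 1)
    # The training window never exceeds max_train_size
    assert len(train_idx) <= max_train_size
    # Training rows always come before test rows (no look-ahead)
    assert train_idx.max() < test_idx.min()
    sizes.append((len(train_idx), len(test_idx)))

print(sizes)  # [(9, 7), (10, 7), (10, 7)]
```

The first split has only 9 training rows because fewer than max_train_size rows exist before the first test fold.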

However, the dataset I was working with used dates instead of timestamps. This led to discrete numeric values as anchor points for the cross-validation splits, instead of continuous ones. Hence, I was not able to leverage TimeSeriesSplit from scikit-learn. Instead, I wrote a simple generator with groupings for date splits to use in Grid Search.

import numpy as np
import pandas as pd


class CustomCrossValidation:

    @classmethod
    def split(cls,
              X: pd.DataFrame,
              y: np.ndarray = None,
              groups: np.ndarray = None):
        """Returns a grouped time series split generator."""
        assert len(X) == len(groups), (
            "Length of the predictors is not "
            "matching with the groups.")
        # Groups are integers whose order follows time, so we walk
        # from the oldest group up to the second newest
        for group_idx in range(groups.min(), groups.max()):

            training_group = group_idx
            # Gets the next group right after
            # the training as test
            test_group = group_idx + 1
            training_indices = np.where(
                groups == training_group)[0]
            test_indices = np.where(groups == test_group)[0]
            # Skip gaps in the group numbering (empty train or test)
            if len(training_indices) > 0 and len(test_indices) > 0:
                # Yielding training and testing indices
                # for cross-validation generator
                yield training_indices, test_indices

CustomCrossValidation is a simple class with one method (split) that uses X (predictors), y (target values), and groups corresponding to the date groups. Those can be months or quarters for your dataset; I assumed that they can be mapped into integers that keep the order of time. Hence, if I have 3 quarters in the dataset, the date values are Q1, Q2, and Q3, and I can simply map those into 0, 1, 2 to keep the order and use them in my validation generator class method.
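As a minimal sketch of that mapping, np.unique can turn date labels into ordered integer groups. This assumes the labels sort chronologically as plain strings, which holds for Q1, Q2, Q3 but is an assumption for other date formats. The date_labels array is made up for illustration.

```python
import numpy as np

# Hypothetical date labels, one per row of the dataset
date_labels = np.array(["Q1", "Q1", "Q2", "Q3", "Q2", "Q3"])

# np.unique returns the sorted unique labels; return_inverse gives, for
# each row, the position of its label in that sorted array
unique_labels, groups = np.unique(date_labels, return_inverse=True)

print(unique_labels)  # ['Q1' 'Q2' 'Q3']
print(groups)         # [0 0 1 2 1 2]
```

The rows do not need to be sorted by date for this to work, since each row simply receives the integer of its own label.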

The method is named split to mirror scikit-learn's own cross-validation splitters, although what GridSearchCV receives here is the generator that the method returns. Here, I created a range of integers (groups) to keep the order of date. Then I assigned the indices of the first group (t) to be training indices and those of the next group (t + 1) to be validation indices. In the end, the method yields training and testing indices, because the cv parameter of GridSearchCV accepts an iterable that yields pairs of training and testing index arrays.
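Besides an iterable of index pairs, the cv parameter also accepts a splitter object that exposes split and get_n_splits. The sketch below is one hedged way to wrap the same idea in such an object; the class name GroupedWalkForwardSplit is hypothetical and not from the original post. Unlike the generator above, it pairs consecutive observed group labels rather than t and t + 1, and it receives the groups through the fit call.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


class GroupedWalkForwardSplit:
    """Hypothetical splitter: train on one group, validate on the next."""

    def split(self, X, y=None, groups=None):
        groups = np.asarray(groups)
        ordered = np.unique(groups)  # sorted unique group labels
        for train_group, test_group in zip(ordered[:-1], ordered[1:]):
            yield (np.where(groups == train_group)[0],
                   np.where(groups == test_group)[0])

    def get_n_splits(self, X=None, y=None, groups=None):
        # One split per pair of consecutive groups
        return len(np.unique(groups)) - 1


X, y = make_regression(n_samples=30, n_features=5, noise=0.2,
                       random_state=0)
groups = np.repeat([0, 1, 2, 3], [5, 10, 10, 5])

search = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.1, 1, 10]},
    scoring="neg_root_mean_squared_error",
    cv=GroupedWalkForwardSplit())
# The groups passed to fit are forwarded to the splitter
search.fit(X, y, groups=groups)
print(search.n_splits_)  # 3
```

An object like this can be reused across several searches, whereas a generator is exhausted after it has been iterated once.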

The following example displays how the custom split works with the groups. To have different sizes of date groups, I created 4 groups with 5 instances of 0s, 10 instances of 1s, 10 instances of 2s, and 5 instances of 3s:

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

groups_experiment = np.concatenate([np.zeros(5),  # 5 0s
                                    np.ones(10),  # 10 1s
                                    2 * np.ones(10),  # 10 2s
                                    3 * np.ones(5)  # 5 3s
                                    ]).astype(int)

for idx, (x, y) in enumerate(
    CustomCrossValidation.split(X_experiment,
                                y_experiment,
                                groups_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

With the groupings, the example dataset looks like this:

# The first 5 predictor values...
          0         1         2         3         4
0 -0.566298  0.099651  2.190456 -0.503476 -0.990536
1  0.174578  0.257550  0.404051 -0.074446  1.886186
2  0.314247 -0.908024 -0.562288 -1.412304 -1.012831
3 -1.106335 -1.196207 -0.479174  0.812526 -0.185659
4 -0.013497 -1.057711 -0.601707  0.822545  1.852278

# The first 5 target values...
            0
0   73.398681
1  195.221637
2 -139.402678
3 -124.863423
4   94.753517

# Groupings for the example dataset...
# The 0s are older date anchor values, whereas 3s the newest...
[0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3]

The groups impose an order on the validation flow. First the 0s are used as the training set and the 1s as validation. Then the 1s are used as training and the 2s as validation, and so on. The indices generated for the example are:

Split number: 0
Training indices: [0 1 2 3 4]
Test indices: [ 5  6  7  8  9 10 11 12 13 14]

Split number: 1
Training indices: [ 5  6  7  8  9 10 11 12 13 14]
Test indices: [15 16 17 18 19 20 21 22 23 24]

Split number: 2
Training indices: [15 16 17 18 19 20 21 22 23 24]
Test indices: [25 26 27 28 29]

As an example setup, I will use Lasso Regression and try to optimize alpha with Grid Search. In Lasso, a larger alpha forces more coefficients to be 0. It is very common to search for the optimum value of alpha in a Lasso Regression.
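The effect of alpha on sparsity can be seen with a short sketch. The dataset, the seed, and the alpha values here are illustrative assumptions, chosen only to show the trend. They are not the values used in the search below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=30, n_features=5, noise=0.2,
                       random_state=0)

# Count how many coefficients are exactly zero for each alpha
zero_counts = {}
for alpha in [0.1, 1, 10, 1000]:
    model = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: {zero_counts[alpha]} zero coefficients")
```

With a sufficiently large alpha, such as the last value here, the penalty dominates and all 5 coefficients are driven to 0.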

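from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Instantiating the Lasso estimator
reg_estimator = linear_model.Lasso()
# Parameters
parameters_to_search = {"alpha": [0.1, 1, 10]}
# Splitter: a generator of (train, test) index pairs
custom_splitter = CustomCrossValidation.split(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Search setup
reg_search = GridSearchCV(
    estimator=reg_estimator,
    param_grid=parameters_to_search,
    scoring="neg_root_mean_squared_error",
    cv=custom_splitter)
# Fitting; fit returns the fitted search object itself
reg_search.fit(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)
# The refitted best estimator and the number of splits
best_model = reg_search.best_estimator_
print(f"Best model: {best_model}")
print(f"Number of splits: {reg_search.n_splits_}")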

Using the custom cross-validation, the search selects the following best estimator. There are 3 splits because we used 4 groups.

# Best model:
Lasso(alpha=0.1)

# Number of splits:
3
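To look at the score of each split rather than only the winner, the cv_results_ attribute of the fitted search can be inspected. The sketch below is self-contained and rebuilds the same 4 groups with a fixed seed, which is an assumption for reproducibility. It also materializes the index pairs into a list, since a list can be reused while a generator can be iterated only once.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=30, n_features=5, noise=0.2,
                       random_state=0)
groups = np.repeat([0, 1, 2, 3], [5, 10, 10, 5])

# Train on group g, validate on group g + 1, stored as a reusable list
splits = [(np.where(groups == g)[0], np.where(groups == g + 1)[0])
          for g in range(groups.min(), groups.max())]

search = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.1, 1, 10]},
    scoring="neg_root_mean_squared_error",
    cv=splits)
search.fit(X, y)

print(search.best_params_)
# One array per split, with one score per candidate alpha
for split_idx in range(search.n_splits_):
    scores = search.cv_results_[f"split{split_idx}_test_score"]
    print(f"Split {split_idx}: {scores}")
```

Because the scoring is the negated root mean squared error, values closer to 0 are better.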

Voila, a simple generator gave me a custom validation flow in a Grid Search optimization. I enjoy reading the scikit-learn documentation. Besides the fact that reading is fun, it helps me understand some statistical implementations better and tweak them whenever necessary.

For a complete set of examples, please refer to the GitHub repository. Happy reading the documentation!
