I recently wrote an article about the risks of using the train_test_split() function provided by the scikit-learn Python package. That article drew many comments, some positive and others raising concerns. My thesis was: be careful when you use the train_test_split() function, because different seeds may produce very different models.
The main concern was that the train_test_split() function does not behave strangely; the problem is that I used a small dataset to demonstrate my thesis.
In this article, I investigate how the performance of a Linear Regression model changes as the dataset size varies. In addition, I compare that performance with the one obtained by varying the random seed passed to the train_test_split() function.
I organize the article as follows:
- Possible issues with a small dataset
- Possible countermeasures
- Practical Example
A small dataset is a dataset with a small number of samples. What counts as small depends on the nature of the problem to solve. For example, if we want to analyze the average opinion about a given product, 100,000 reviews may be a lot, but if we use the same number of samples to find the most discussed topic on Twitter, that number is really small.
Let us suppose that we have a small dataset, i.e. the number of samples is not sufficient to represent our problem. We could encounter at least the following issues:
- Outliers: an outlier is a sample that significantly deviates from the rest of the dataset.
- Overfitting: a model performs well on the training set, but poorly on the test set.
- Sampling Bias: the dataset does not reflect reality.
- Missing Values: a sample is not complete, because some features are missing.
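Two of these issues, missing values and outliers, can be checked directly on the data. The following is a minimal sketch, assuming a pandas DataFrame with a numeric column; the helper name summarize_data_issues and the 1.5 * IQR rule for flagging outliers are illustrative choices, not something taken from the original experiment.

```python
import numpy as np
import pandas as pd


def summarize_data_issues(df: pd.DataFrame, column: str) -> dict:
    """Count missing values and IQR-rule outliers in one numeric column."""
    values = df[column]
    missing = int(values.isna().sum())

    # IQR rule (an assumed convention): flag points lying more than
    # 1.5 * IQR below the first quartile or above the third quartile.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    return {"missing": missing, "outliers": int(is_outlier.sum())}


# Hypothetical toy data: one missing value and one obvious outlier.
toy = pd.DataFrame({"temp": [20.0, 21.0, 19.5, 20.5, np.nan, 95.0]})
report = summarize_data_issues(toy, "temp")
print(report)
```

In a small dataset, a single flagged sample can move the fitted model noticeably, which is why such a check is worth running before training.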
One obvious countermeasure to the issue of having a small dataset could be to increase the size of the dataset. We could achieve this result by collecting new data or producing new synthetic data.
Another possible solution is an ensemble approach: instead of relying on a single best model, we train several models and combine their predictions.
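As one possible illustration of the ensemble idea, the sketch below uses bagging: several linear regressors are trained on bootstrap resamples of a small dataset and their predictions are averaged. The synthetic data, the choice of bagging, and the number of estimators are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical small dataset: 40 samples, roughly y = x + 10 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(40, 1))
y = 1.0 * X[:, 0] + 10.0 + rng.normal(0, 2.0, size=40)

# Each of the 25 base models is fitted on a bootstrap resample of (X, y);
# the ensemble prediction is the average of the individual predictions.
ensemble = BaggingRegressor(LinearRegression(), n_estimators=25, random_state=0)
ensemble.fit(X, y)

prediction = ensemble.predict(np.array([[15.0]]))
print(prediction)
```

Averaging over resamples tends to reduce the variance caused by any single unlucky sample, although for a model as simple as linear regression the gain may be modest.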
Other countermeasures include regularization, confidence intervals, and a consortium approach, as described in the very interesting article Problems of Small Data and How to Handle Them.
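To give an idea of what regularization does on a small dataset, here is a minimal sketch that compares ordinary least squares with Ridge regression. The synthetic data and the value alpha=10.0 are assumptions chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical small dataset: 15 samples, 5 features, only the first one matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 5))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=15)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the strength of the L2 penalty

# The L2 penalty shrinks the coefficient vector toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

Smaller coefficients make the model less sensitive to noise in the few available samples, at the cost of some bias.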
In this example, we use the Weather Conditions in World War Two dataset, available on Kaggle under the U.S. Government Works license. The experiment builds a very simple linear regression model that tries to predict the maximum temperature given the minimum temperature.
We run two batteries of tests: the first varies the dataset size, the second varies the random seed provided as input to the train_test_split() function.
In the first battery of tests, we run 1190 tests with a variable number of samples (from 100 up to the full dataset size), extracted randomly, and then, for each test, we calculate the Root Mean Squared Error (RMSE).
In the second battery of tests, we run another 1000 tests, each with a different value of the random_state parameter provided as input to train_test_split(), and we calculate the RMSE for each test. Finally, we compare the results of the two batteries of tests in terms of mean and standard deviation.
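The two batteries described above might be sketched as follows. This is only an outline under stated assumptions: the helper names, the 20% test split, the use of DataFrame.sample() to draw the subsets, and the fixed seed in the first battery are not taken from the original experiment, and the code assumes the MinTemp and MaxTemp columns contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_for_split(df, seed, test_size=0.2):
    """Fit MaxTemp ~ MinTemp on one train/test split and return the test RMSE."""
    X, y = df[["MinTemp"]], df["MaxTemp"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


def vary_dataset_size(df, sizes, seed=0):
    """First battery: one random subsample per size, with a fixed split seed."""
    return [rmse_for_split(df.sample(n=n, random_state=seed), seed) for n in sizes]


def vary_random_seed(df, seeds):
    """Second battery: the full dataset, with a different split seed each time."""
    return [rmse_for_split(df, seed) for seed in seeds]


def summarize(rmses):
    """Mean and standard deviation of a list of RMSE values."""
    return {"mean": float(np.mean(rmses)), "std": float(np.std(rmses))}
```

With the loaded dataset, one could call vary_dataset_size(df, range(100, len(df) + 1, 100)) and vary_random_seed(df, range(1000)), then compare summarize() on the two lists. The step of 100 is an assumption; it yields 1190 sizes for a dataset of 119,040 rows.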
First, we load the dataset as a Pandas dataframe:
import pandas as pd
df = pd.read_csv('Summary of Weather.csv')
The dataset has 119,040 rows and 31 columns. For our experiment, we use only the MinTemp and MaxTemp columns.
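Restricting the DataFrame to the two columns of interest could look like the sketch below. The helper name is hypothetical, and dropping rows with missing values is an added precaution rather than a step described in the article; the toy DataFrame only stands in for the real file.

```python
import pandas as pd


def select_temperature_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only MinTemp and MaxTemp, dropping rows where either is missing."""
    return df[["MinTemp", "MaxTemp"]].dropna()


# Hypothetical stand-in for the loaded dataset, with one extra column
# and one incomplete row.
sample = pd.DataFrame(
    {
        "Station": [1, 1, 2],
        "MinTemp": [22.2, None, 21.7],
        "MaxTemp": [25.6, 28.9, 26.1],
    }
)
temps = select_temperature_columns(sample)
print(temps)
```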