
Bala Priya C


How to Handle Missing Data Better [A scikit-learn Tutorial]

Why is Data Cleaning Important?

Real-world data is often messy, and cleaning it up is an integral step in any machine learning project. Because machine learning algorithms are sensitive to the quality of the data we feed in, data cleaning is important for ensuring good model performance.

Dealing with missing data, experimenting with suitable imputation strategies, and ensuring that the data is ready for use in the rest of the pipeline are therefore crucial.

In this blog post, we shall look at useful features of scikit-learn that help us handle missing data more gracefully! In general, how do we handle missing values in input data? The following are the usual approaches; a minimal pandas sketch of each follows the list.

  • By dropping columns containing NaNs.
  • By dropping rows containing NaNs.
  • By imputing the missing values suitably.
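
Here is a minimal sketch of those three approaches, assuming a small toy DataFrame df made up purely for illustration:

import numpy as np
import pandas as pd

# A tiny toy DataFrame with one missing value (made up for illustration)
df = pd.DataFrame({'Age': [20, 30, np.nan], 'Fare': [7.25, 71.28, 8.05]})

df.dropna(axis=1)     # drop columns containing NaNs
df.dropna(axis=0)     # drop rows containing NaNs
df.fillna(df.mean())  # impute missing values (here, with each column's mean)

All three return a DataFrame without NaNs, but the first two throw away data and the third bakes in a single imputation choice, which is what motivates the alternatives below.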

Wouldn’t it be cool if we could do the following instead?

  • Encode ‘missingness’ as a feature.
  • Use HistGradientBoostingClassifier, which handles missing values natively.

Encoding ‘Missingness’ as a Feature

When imputing missing values, if we would like to preserve the information about which values were missing and use that as a feature, we can do so by setting the add_indicator parameter of scikit-learn’s SimpleImputer to True. Here’s an example. Let’s import numpy and pandas using their usual aliases np and pd.

# Necessary imports
import numpy as np
import pandas as pd

Let’s create a pandas DataFrame X with one missing value.

X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})

Now, we shall import the SimpleImputer class from scikit-learn.

from sklearn.impute import SimpleImputer

We shall now instantiate a SimpleImputer, which by default performs mean imputation, replacing each missing value with the average of the values that are present. Here, the imputed value is (20+30+10+10)/4 = 17.5. Let’s verify the output.

# Mean Imputation
imputer = SimpleImputer()
imputer.fit_transform(X)

# After Imputation
array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])

In order to encode the missingness of values as a feature, we can set the add_indicator argument to True and observe the output.

# impute the mean and add an indicator matrix (new in scikit-learn 0.21)
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)

# After adding missingness indicator
array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])

In the output, we observe that an indicator value of 1 is inserted at index 3, where the original data was missing. This feature is available in scikit-learn version 0.21 and above.
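
Since the whole point of add_indicator is to feed the missingness information to a downstream model, here is a minimal sketch of how the imputer could slot into a pipeline. The toy labels y and the choice of LogisticRegression are assumptions made up purely for illustration, not part of the original example.

# A hedged sketch: the missing-indicator column becomes an extra feature
# for a downstream model. y is a made-up toy target for the five rows of X.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

y = [0, 1, 0, 1, 0]

pipe = make_pipeline(
    SimpleImputer(add_indicator=True),  # outputs [imputed Age, missing indicator]
    LogisticRegression()
)
pipe.fit(X, y)
pipe.predict(X)

In the next section, we shall see how we can use HistGradientBoostingClassifier, which handles missing values natively.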


Using HistGradientBoostingClassifier

To use this feature, available in scikit-learn version 0.22 and above, let’s download the very popular Titanic: Machine Learning from Disaster dataset from Kaggle.

import pandas as pd
train = pd.read_csv('http://bit.ly/kaggletrain')
test = pd.read_csv('http://bit.ly/kaggletest', nrows=175)

Now that we’ve loaded the dataset, let’s go ahead and select the columns we’ll use for training and testing.

train = train[['Survived', 'Age', 'Fare', 'Pclass']]
test = test[['Age', 'Fare', 'Pclass']]

To better understand the missing values, let’s compute the number of missing values in each column of the training and test sets.

# count the number of NaNs in each column
print(train.isna().sum())

Survived      0
Age         177
Fare          0
Pclass        0
dtype: int64

print(test.isna().sum())

Age       36
Fare       1
Pclass     0
dtype: int64

We see that both the train and test subsets contain missing values. The output label for the classifier is Survived, which is 1 if the passenger survived and 0 if the passenger did not.

label = train.pop('Survived')

Let’s import HistGradientBoostingClassifier from scikit-learn. Since the estimator is still experimental in these versions, we first need the enable_hist_gradient_boosting import.

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

As always, let us instantiate the classifier, fit it on the training set train, and predict on the test set test. Note that we did not impute the missing values; ordinarily, fitting an estimator on data containing NaNs raises an error.
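
For example, here is a hedged sketch using LogisticRegression (not part of this post’s workflow) that fails on the same data precisely because of the NaNs:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
# This raises a ValueError because the input contains NaN
lr.fit(train, label)

Let us check what happens with HistGradientBoostingClassifier instead.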

clf = HistGradientBoostingClassifier()
# no errors, despite NaNs in train and test sets!
clf.fit(train, label)
clf.predict(test)

# Output
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

Surprisingly, there are no errors, and we get predictions for all records in the test set even though there were missing values. Isn’t this cool?
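
If you would like a quick sanity check of how well this no-imputation workflow performs, here is a hedged sketch (not part of the original example) that cross-validates the classifier on the training data, NaNs and all:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, train, label, cv=5)
print(scores.mean())

Be sure to try out these features in your next project. Happy Learning!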



Cover Image: Photo by Andrew Neel on Unsplash
