- Basic knowledge of Python
- A working Jupyter notebook for testing the code
To understand sampling bias, let us consider an example.
Suppose we are statisticians conducting a study on the effects of a new drug, recently introduced in the market, that is meant to increase serotonin production. We have been asked to identify whether the new drug is more effective on the male or the female population.
We collect data on the people who have taken the drug and create more features around it, where our target (y) is gender: 1 if the person is female, 0 if the person is male.
As you may have guessed, we have a binary classification problem on our hands.
When we load the dataset, we realize that there are 4 females and 1 male in it. (Since it's a hypothetical scenario and my drawing is horrible, I could only make 5 stick figures.)
But imagine we get a dataset where the classes in the target are imbalanced in that same ratio: for every 4 females there is 1 male.
Any analysis we do and any model we build on such data would be unreliable: because the target is highly imbalanced, the model becomes biased towards the majority class.
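To make the bias concrete, here is a minimal sketch on toy labels (not the article's dataset). It assumes a hypothetical "model" that always predicts the majority class, and shows how overall accuracy can look acceptable while the minority class is never identified.

```python
from collections import Counter

# Toy target mirroring the 4:1 ratio above: 1 = female, 0 = male.
# These labels are illustrative, not taken from the article's data.
y_true = [1, 1, 1, 1, 0] * 20  # 80 females, 20 males

# A hypothetical "model" that ignores its input and always predicts
# the most frequent class in the target.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Overall accuracy looks respectable ...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ... but no male sample is ever predicted correctly.
male_total = sum(t == 0 for t in y_true)
male_correct = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
male_recall = male_correct / male_total

print(f"accuracy: {accuracy:.2f}")              # accuracy: 0.80
print(f"recall on males: {male_recall:.2f}")    # recall on males: 0.00
```

An 80% accuracy here says nothing about how well males are classified, which is why the class ratio needs attention before modelling.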
Feeling a bit stressed, eh?
Either we reduce the number of samples where gender is female until both classes are in equal proportion, which is undersampling of the female class.
Or we increase the number of samples where gender is male until both classes are in equal proportion, which is oversampling of the male class.
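The two options can be sketched by hand with the standard library alone. This is only an illustration on made-up rows (the ids and the 80/20 split are assumptions), not how imblearn implements its samplers.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

# Toy labelled rows: (row_id, gender) with 1 = female, 0 = male.
rows = [(i, 1) for i in range(80)] + [(i, 0) for i in range(80, 100)]
females = [r for r in rows if r[1] == 1]
males = [r for r in rows if r[1] == 0]

# Undersampling: keep a random subset of the majority class (females)
# that is the same size as the minority class (males).
undersampled = random.sample(females, k=len(males)) + males

# Oversampling: draw extra minority rows with replacement until the
# minority class matches the majority class in size.
extra_males = random.choices(males, k=len(females) - len(males))
oversampled = females + males + extra_males

under_counts = Counter(gender for _, gender in undersampled)
over_counts = Counter(gender for _, gender in oversampled)
print(sorted(under_counts.items()))  # 20 rows per class
print(sorted(over_counts.items()))   # 80 rows per class
```

Undersampling throws information away (60 female rows are dropped), while oversampling repeats existing male rows, so each option has a cost.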
There are several techniques for oversampling, such as:
- Random oversampling
For undersampling we have:
- Random undersampling
- Tomek links
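Tomek links are less self-explanatory than random sampling, so here is a rough sketch of the idea. A Tomek link is a pair of samples from different classes that are each other's nearest neighbours; undersampling then drops the majority-class member of each pair. The points and helper names below are made up for illustration; imblearn provides this technique as `imblearn.under_sampling.TomekLinks`.

```python
import math

# Toy 2-D points as (x, y, label); 1 = majority class, 0 = minority class.
points = [
    (0.0, 0.0, 1), (0.2, 0.1, 1), (0.0, 0.4, 1), (1.0, 1.0, 1),
    (1.1, 1.0, 0), (3.0, 3.0, 0), (3.2, 3.1, 0),
]

def nearest_neighbor(i, pts):
    """Return the index of the point closest to pts[i], excluding itself."""
    others = (j for j in range(len(pts)) if j != i)
    return min(others, key=lambda j: math.dist(pts[i][:2], pts[j][:2]))

def tomek_links(pts):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels."""
    nn = [nearest_neighbor(i, pts) for i in range(len(pts))]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and pts[i][2] != pts[j][2]
    ]

links = tomek_links(points)

# Undersample by dropping the majority-class member of every link.
to_drop = {i if points[i][2] == 1 else j for i, j in links}
cleaned = [p for k, p in enumerate(points) if k not in to_drop]

print(links)         # [(3, 4)]
print(len(cleaned))  # 6
```

Only the majority point sitting right next to a minority point is removed, so this technique cleans the class boundary rather than balancing the counts.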
In this article, we will learn about imblearn and use it to resolve a case of sampling bias.
The imblearn library offers methods for generating a dataset with an equal proportion of classes.
Let us build a classification model and see this working. For the dataset and code used in the article please refer here.
The data has been generated using Excel.
As we can see above, the distribution of female vs. male in our target variable is imbalanced. Let us use sampling techniques to overcome this.
    # Importing all libraries.
    import imblearn
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from collections import Counter
    from xgboost import XGBClassifier

    # Creating target and feature set.
    y = data["Gender"]
    X = data.drop("Gender", axis=1)

    # Splitting data to create train, valid & test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, shuffle=True, test_size=0.2, random_state=42, stratify=y)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train, y_train, shuffle=True, test_size=0.2, random_state=42,
        stratify=y_train)

    # Let's see what the group count looks like in our training set prior to sampling.
    print(sorted(Counter(y_train).items()))
As we can see, before resampling we have 261 instances that belong to class 0 and 635 instances that belong to class 1. If we oversample class 0, we will have a balanced dataset. Let's do that using random oversampling (RandomOverSampler).
The important thing to focus on here is that we fit the sampling object only on the training data and not on the validation and test sets; otherwise data leakage occurs.
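One way to picture the leakage is to trace duplicated rows. The sketch below uses toy rows with unique ids and hand-written helpers (`stratified_split`, `oversample`, and `shared_ids` are hypothetical names, not imblearn or scikit-learn functions) to compare the two orderings.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy rows: (row_id, label). Ids are unique, so copies are easy to trace.
rows = [(i, 1) for i in range(80)] + [(i, 0) for i in range(80, 100)]

def stratified_split(data, valid_fraction=0.2):
    """Split each class separately so both sets keep the class ratio."""
    train, valid = [], []
    for label in (0, 1):
        group = [r for r in data if r[1] == label]
        random.shuffle(group)
        cut = round(len(group) * valid_fraction)
        valid += group[:cut]
        train += group[cut:]
    return train, valid

def oversample(data):
    """Duplicate random minority rows (label 0) until the classes match."""
    majority = [r for r in data if r[1] == 1]
    minority = [r for r in data if r[1] == 0]
    extra = random.choices(minority, k=len(majority) - len(minority))
    return data + extra

def shared_ids(train, valid):
    """Count the row ids that appear in both sets."""
    return len({r[0] for r in train} & {r[0] for r in valid})

# Recommended order: split first, then resample the training set only.
train, valid = stratified_split(rows)
train_balanced = oversample(train)
overlap_split_first = shared_ids(train_balanced, valid)

# Risky order: resample first, then split. Copies of one original row
# can now land on both sides of the split.
leaky_train, leaky_valid = stratified_split(oversample(rows))
overlap_resample_first = shared_ids(leaky_train, leaky_valid)

print("split first:", overlap_split_first)
print("resample first:", overlap_resample_first)
```

When the split comes first, no validation row can also appear in the training set, and the validation set keeps its original class ratio. When resampling comes first, the model may be scored on rows it has already seen during training.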
    ros = imblearn.over_sampling.RandomOverSampler(random_state=42)
    # Fitting the oversampling object to our training set.
    x_ros, y_ros = ros.fit_resample(X_train, y_train)
    print(sorted(Counter(y_ros).items()))
Now the classes are balanced. We will build an XGBoost model on the new data.
    model = XGBClassifier(objective='binary:logistic')
    model.fit(x_ros, y_ros)
    y_pred = model.predict(X_valid)
    print(model.score(X_valid, y_valid))
We get a 0.59 accuracy score on the validation dataset without any hyperparameter tuning. Now let us try undersampling the majority class instead.
    ros_under = imblearn.under_sampling.RandomUnderSampler(random_state=42)
    # Fitting the undersampling object to our training set.
    x_ros_under, y_ros_under = ros_under.fit_resample(X_train, y_train)
    print(sorted(Counter(y_ros_under).items()))

    model = XGBClassifier(objective='binary:logistic')
    model.fit(x_ros_under, y_ros_under)
    y_pred = model.predict(X_valid)
    print(model.score(X_valid, y_valid))
We get a 0.51 accuracy score on the validation dataset when we undersample the data, again without any hyperparameter tuning.
One reason for the lower accuracy with undersampling is that we give the model less data to learn from. Hence, in our case, oversampling wins on the validation dataset when no hyperparameter tuning is done.
In this article, we discussed how to pre-process an imbalanced dataset before building predictive models.
We first did oversampling and then performed undersampling. You can check the documentation here.
Before we part: most ensemble algorithms today offer a parameter for handling imbalanced datasets. In XGBoost we have the scale_pos_weight parameter for such cases, but it leaves less room for configuration. With resampling techniques such as SMOTE, we have more control over how the bias is resolved.
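As a rough sketch of the scale_pos_weight route: the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. The helper name below is hypothetical, and the labels are rebuilt from the training-set counts reported earlier (261 in class 0, 635 in class 1) rather than loaded from the actual data.

```python
from collections import Counter

def suggested_scale_pos_weight(y):
    """Ratio of negative (0) to positive (1) samples in a binary target."""
    counts = Counter(y)
    return counts[0] / counts[1]

# Labels rebuilt from the class counts of the training set in this
# article; an assumption standing in for the real y_train.
y_train_like = [0] * 261 + [1] * 635

weight = suggested_scale_pos_weight(y_train_like)
print(round(weight, 3))  # 0.411
```

The value could then be passed when creating the classifier, for example `XGBClassifier(objective='binary:logistic', scale_pos_weight=weight)`. Because the positive class (1) is the majority here, the weight is below 1 and down-weights the positive samples.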
Edit 01: You may notice a difference in the accuracy score, since I didn't set the random_state parameter while creating my classifier.
Shoutout to Barbara Bredner for pointing this out :)