DEV Community: Herumb Shandilya

PyTorch Lightning: DataModules, Callbacks, TPU, and Loggers

Herumb Shandilya — Wed, 09 Jun 2021 19:11:21 +0000

When I was a young man,

I had liberty but I didn’t see it,

I had time but I didn’t know it,

And I had PyTorch Lightning but I didn't use it.

- Newbie PyTorch User

Another Blog another great video game quote butchered by these hands. Anyways, when I was getting started with PyTorch one of the things that made me jealous was the fact that Tensorflow has so much support for monitoring the model performance. I mean I have to write a training loop with redundant steps while Tensorflow beginners were just passing and chilling.

So I did what most PyTorch newbies do: learned and wrote the training loop code until it became muscle memory, and that is something you shouldn't do. My one tip is to understand why each step is there; once you do that, every line of code starts to make sense. Trust me, once I did that I never forgot it again.

Back to the topic, one thing that I love about PyTorch is the extent to which I can customize it and how easy it is to debug. But still, it would be better if somehow I was able to keep those features and reduce the redundancy.

Well there you have it, the solution is PyTorch Lightning. I mean, anything with such a cool name is already destined for greatness, but if that reason doesn't convince you then I hope that by the end of this article you will be.

P.S. it's actually better if you have some basic idea about Lightning. If you wanna learn its basics, try this article I wrote a while back on Training Neural Networks using PyTorch Lightning.

Datasets, DataLoaders and DataModule

There are two things that give me immense satisfaction in life: the first is watching a newbie Vim user trying to exit Vim, and the second is watching a newbie PyTorch user trying to create a DataLoader for custom data. So let's take a look at how you can create DataLoaders in PyTorch using an awesome utility called Dataset.

The Dataset Class

One of the essentials when learning PyTorch is how to create DataLoaders. There are many ways to go about it: for images you have the ImageFolder utility in torchvision, and for text data we have BucketIterator in torchtext, which I won't lie are quite handy. But what if the data you load isn't in the desired format? In that case, you can use the Dataset class, and the thing that I love about Dataset classes is that they are customizable to an extent you can't imagine.

In a Dataset class you have 3 main methods that you must define, let's take a look at them:-

__init__(self, *): The constructor. There are infinite possibilities of things you can do here. Think of it as the staging area of our data. Usually, you pass either a path, a dataframe, or an array as the argument. Here you usually define instance attributes like the feature matrix and target vector. If you are working with images you can assign a transform attribute, if you are working with text you can assign a tokenizer attribute, etc.
__len__(self): The length function. This is where you return the length of the dataset, basically the length of the feature matrix or the dataset in whatever format you have.
__getitem__(self, idx): The fetching function. This is where you define how your data will be returned. You can apply any preprocessing to the sample at index idx and then return its tensor along with other variables like mask or target by packing them in a list, tuple, dict, etc. I used to pack them in a list but when I saw Abhishek Thakur's video on this I started using dicts and I never looked back since that day.

The guy is a PyTorch madlad. Go watch his videos if you haven't already: brilliant, to-the-point, hands-on videos on many interesting topics. Let's keep all that aside for now and take an example, shall we? I'll be using make_classification to create the dataset to keep things simple.

import torch
from sklearn.datasets import make_classification
from torch.utils.data import Dataset

class TabularData(Dataset):
    def __init__(self, X, Y, train = True):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        features = self.X[idx]

        if self.Y is not None:
            target = self.Y[idx]
            return {
                'X' : torch.tensor(features, dtype = torch.float32),
                'Y' : torch.tensor(target, dtype = torch.long)
            }

        else:
            return {
                'X' : torch.tensor(features, dtype = torch.float32)
            }

X, Y = make_classification()
train_data = TabularData(X, Y)
train_data[0]

Output:-

{'X': tensor([ 1.1018, -0.0042,  2.1382, -0.7926, -0.6016,  1.5499, -0.4010,  0.3327,
          0.1973, -1.3655,  0.4870,  0.7568, -0.7460, -0.8977,  0.1395,  0.0814,
         -1.4849, -0.2000,  1.2643,  0.4178]), 'Y': tensor(1)}

Subarashi! That seems to be working correctly. You can experiment with what happens if you change what the class returns a bit and check the results. But for this tutorial, we'll be working on Fashion MNIST data, and thankfully torchvision already has a dataset for that, so let's load that.

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor()
])

train = datasets.FashionMNIST('',train = True, download = True, transform=transform)
test = datasets.FashionMNIST('',train = False, download = True, transform=transform)

Now that we have the data it's time to move onto DataLoaders.

DataLoaders

DataLoaders take a dataset as input, pack its data into batches, and give you an iterator over those batches. They really make the whole batching process easier while keeping customizability to the fullest. I mean, you can define how to batch your data by writing your own collate_fn, what more do you want?
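To make the batching idea concrete, here is a rough plain-Python sketch of what a loader does with a collate function. This is not PyTorch's actual implementation; `toy_loader`, `default_collate`, and `toy_dataset` are made-up names for illustration only.

```python
# Plain-Python sketch of batching; the real DataLoader also handles
# shuffling, workers, tensors, and much more.

def default_collate(samples):
    """Turn a list of (features, label) pairs into (all_features, all_labels)."""
    features = [f for f, _ in samples]
    labels = [l for _, l in samples]
    return features, labels


def toy_loader(dataset, batch_size, collate_fn=default_collate):
    """Yield collated batches from anything that supports len() and indexing."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        yield collate_fn(samples)


# Hypothetical toy dataset: 5 samples, each a (features, label) pair.
toy_dataset = [([i, i + 1], i % 2) for i in range(5)]

batches = list(toy_loader(toy_dataset, batch_size=2))
for features, labels in batches:
    print(features, labels)
```

Swapping in your own collate_fn is how you would pad sequences or build dicts per batch; the loop itself stays the same.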

We saw how to create a Dataset class; to create its DataLoader you just pass that Dataset instance to DataLoader and you are done. Let's see how to do it for the Fashion MNIST dataset that we loaded above and check its output.

import matplotlib.pyplot as plt
from torch.utils.data import DataLoader

trainloader = DataLoader(train, batch_size= 32, shuffle=True)
testloader = DataLoader(test, batch_size= 32, shuffle=True)

# Plotting a batch from the DataLoader
# (next(iter(...)) works on current PyTorch; iterators no longer have a .next() method)
images, labels = next(iter(trainloader))
plt.figure(figsize = (12,16))
for e,(img, lbl) in enumerate(zip(images, labels)):
    plt.subplot(8,4,e+1)
    plt.imshow(img[0])
    plt.title(f'Class: {lbl.item()}')

plt.subplots_adjust(hspace=0.6)

Output:-

Damn, that looks beautiful and correct, well there you go seems like our DataLoaders work as expected. But is it just me or does this seem too messy? I mean it's great we batched the data but the variables seem to be everywhere, it doesn't seem much organized. Well, that's basically where DataModules come in handy 😎.

DataModules

A DataModule is a reusable and shareable class that encapsulates the DataLoaders along with the steps required to process data. Creating DataLoaders can get messy, that's why it's better to club the dataset into a DataModule. A DataModule has a few methods that you should define; its format is as follows:-

import pytorch_lightning as pl

class DataModuleClass(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        # Define class attributes here

    def prepare_data(self):
        # Define steps that should be done
        # on only one GPU, like getting data.
        ...

    def setup(self, stage=None):
        # Define steps that should be done on 
        # every GPU, like splitting data, applying
        # transform etc.
        ...

    def train_dataloader(self):
        # Stage DataLoader for Training Data
        ...

    def val_dataloader(self):
        # Stage DataLoader for Validation Data
        ...

    def test_dataloader(self):
        # Stage DataLoader for Testing Data
        ...

That seems great so let's go ahead and create DataModule for our Fashion MNIST Data.

import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split

class DataModuleFashionMNIST(pl.LightningDataModule):
    def __init__(self):
        super().__init__()

        self.dir = ''
        self.batch_size = 32
        self.transform = transforms.Compose([
            transforms.ToTensor()
        ])

    def prepare_data(self):
        datasets.FashionMNIST(self.dir, train = True, download = True)
        datasets.FashionMNIST(self.dir, train = False, download = True)

    def setup(self, stage=None):
        data = datasets.FashionMNIST(self.dir,
                                     train = True, 
                                     transform = self.transform)

        self.train, self.valid = random_split(data, [52000, 8000])

        # self.dir is the attribute defined in __init__
        self.test = datasets.FashionMNIST(self.dir,
                                          train = False,
                                          transform = self.transform)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size = self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.valid, batch_size = self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size = self.batch_size)

data = DataModuleFashionMNIST()

Perfect, that's basically it, but if you want an in-depth explanation you can refer to this article I wrote explaining DataModules.

Lightning Callbacks & Hooks

Callbacks are basically self-contained pieces of code that run at specific points of training. When a callback runs and what it does is defined using callback hooks such as on_epoch_end, on_validation_epoch_end, etc. You can define logic to monitor a metric, save a model, or do various other cool stuff. If I had to define callbacks as a meme it'd be the following.

Well, not necessary for training but useful for other stuff. There are many inbuilt callbacks that can be used for various important tasks. A few of them are:-

CallBack	Description
EarlyStopping	Monitor a metric and stop training when it stops improving.
LearningRateMonitor	Automatically monitors and logs learning rate for learning rate schedulers during training.
ModelCheckpoint	Save the model periodically by monitoring a quantity.
Callback	Base Class to define custom callbacks.
LambdaCallback	Basically Lambda Function of Callbacks.

We'll be using the first two for this article but you can refer to docs to learn more about them. Lightning's video tutorials on their website are pretty good.
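To make the hook idea concrete, here is a minimal plain-Python sketch of a trainer calling a hook on its callbacks. It is a toy and not how Lightning is implemented: `ToyTrainer` and `ToyEarlyStopping` are hypothetical names, and the hook signature is simplified (the real hook receives the trainer and the module, not the loss).

```python
# Toy illustration of callback hooks; Lightning's real Trainer and
# EarlyStopping are far richer than this.

class ToyEarlyStopping:
    """Ask the trainer to stop after `patience` epochs without improvement."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def on_validation_epoch_end(self, trainer, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                trainer.should_stop = True


class ToyTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks
        self.should_stop = False
        self.epochs_run = 0

    def fit(self, val_losses):
        # `val_losses` stands in for what a real validation loop would compute.
        for val_loss in val_losses:
            self.epochs_run += 1
            for cb in self.callbacks:
                cb.on_validation_epoch_end(self, val_loss)
            if self.should_stop:
                break


trainer = ToyTrainer(callbacks=[ToyEarlyStopping(patience=2)])
trainer.fit([0.9, 0.7, 0.8, 0.75, 0.6])
print(trainer.epochs_run)
```

The trainer never needs to know what the callback does; it only promises to call the hook at the right moment, which is why callbacks are so easy to mix and match.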

TPU - Hulkified GPU?

TPUs are accelerators used to speed up machine learning tasks. The catch is that they are platform dependent, i.e. they are optimized mainly for TensorFlow, which I think is quite selfish given PyTorch is so awesome.

But we can actually use them in PyTorch by creating a TPU sampler and passing it to the DataLoader. It's one hell of a messy task: you have to replace the device type with xm.xla_device(), add 2 extra steps for the optimizer, and not just that, you'll have to install PyTorch/XLA to do all that. It goes something like this:-

import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader

# dataset and optimizer are assumed to be defined already

dev = xm.xla_device()

# TPU sampler
data_sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=xm.xrt_world_size(),
    rank=xm.get_ordinal())

dataloader = DataLoader(dataset, batch_size=32, sampler = data_sampler)

# Training loop
for batch in dataloader:
    ...
    xm.optimizer_step(optimizer)
    xm.mark_step()
    ...

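Man, the above code is a mess, but in Lightning all of that is reduced to a single line. All you need to do is pass the number of TPU cores to be used via tpu_cores and you are done for the day. Really, it's that simple.

# tpu_cores is the argument name in the Lightning version used here;
# newer releases use pl.Trainer(accelerator='tpu', devices=1) instead.
trainer = pl.Trainer(tpu_cores = 1)
trainer.fit(model)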

I mean it literally couldn't get any simpler than this but still, there is one more thing that's left to talk about and that is Loggers. Let's take a look at that.

Loggers

Loggers are utilities that you can use to monitor metrics, hyperparameters, and much more cool stuff. In fact you have probably already tried one: TensorBoard. Ever heard of it? If you are here, chances are you have. Along with TensorBoard, PyTorch Lightning supports various 3rd party loggers like Weights and Biases, Comet.ml, MLflow, etc.

In fact, in Lightning you can use multiple loggers together. To use a logger, create its instance and pass it to the Trainer class via the logger parameter, either individually or as a list of loggers.

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Single Logger
wandb_logger = WandbLogger(project='Fashion MNIST', log_model='all')
trainer = Trainer(logger=wandb_logger)

# Multiple Loggers
from pytorch_lightning.loggers import TensorBoardLogger
tb_logger = TensorBoardLogger('tb_logs', name='my_model')
trainer = Trainer(logger=[wandb_logger, tb_logger])

The question is, how do we log the values? We'll take a look at how you can do that along with applying all the stuff I talked about above.

Putting it All Together

The main ingredient of a model is the data that you feed it. We talked about DataModules in this article, so let's start by creating a DataModule for our Fashion MNIST data.

import pytorch_lightning as pl 
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

class DataModuleFashionMNIST(pl.LightningDataModule):
    def __init__(self, batch_size = 32):
        super().__init__()

        self.dir = ''
        self.batch_size = batch_size
        self.transform = transforms.Compose([
            transforms.ToTensor()
        ])

    def prepare_data(self):
        datasets.FashionMNIST(self.dir, train = True, download = True)
        datasets.FashionMNIST(self.dir, train = False, download = True)

    def setup(self, stage=None):
        data = datasets.FashionMNIST(self.dir,
                                     train = True, 
                                     transform = self.transform)

        self.train, self.valid = random_split(data, [52000, 8000])

        # self.dir is the attribute defined in __init__
        self.test = datasets.FashionMNIST(self.dir,
                                          train = False,
                                          transform = self.transform)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size = self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.valid, batch_size = self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size = self.batch_size)

data = DataModuleFashionMNIST()

Now let's create our model class and setup logs for the WandbLogger and then create a model instance for the same.

from torch import nn, optim
import torch.nn.functional as F

class FashionMNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Input: 28 * 28 * 1 (grayscale); each comment below is the shape after conv + pool
        self.conv1 = nn.Conv2d(1,16, stride = 1, padding = 1, kernel_size = 3)
        # 14 * 14 * 16
        self.conv2 = nn.Conv2d(16,32, stride = 1, padding = 1, kernel_size = 3)
        # 7 * 7 * 32
        self.conv3 = nn.Conv2d(32,64, stride = 1, padding = 1, kernel_size = 3)
        # 3 * 3 * 64

        self.fc1 = nn.Linear(3*3*64,128)
        self.fc2 = nn.Linear(128,64)
        self.out = nn.Linear(64,10)

        self.pool = nn.MaxPool2d(2,2)
        self.loss = nn.CrossEntropyLoss()

    def forward(self,x):
        x = F.relu(self.pool(self.conv1(x)))
        x = F.relu(self.pool(self.conv2(x)))
        x = F.relu(self.pool(self.conv3(x)))

        batch_size, _, _, _ = x.size()

        x = x.view(batch_size,-1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)

    def configure_optimizers(self):
        return optim.Adam(self.parameters())

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.loss(logits,y)

        # Logging the loss
        self.log('train/loss', loss, on_epoch=True)
        return loss

    def validation_step(self, valid_batch, batch_idx):
        x, y = valid_batch
        logits = self.forward(x)
        loss = self.loss(logits,y)

        # Logging the loss
        self.log('valid/loss', loss, on_epoch=True)
        return loss

As you can see in the above code we are using self.log() to log our loss values for which chart will be generated. Now let's create our logger and fit our trainer.

from pytorch_lightning.loggers import WandbLogger

model = FashionMNISTModel()
wandb_logger = WandbLogger(project='Fashion MNIST', log_model='all')

trainer = pl.Trainer(max_epochs=10, tpu_cores = 1, logger = wandb_logger)
wandb_logger.watch(model)

trainer.fit(model, data)

Once you run the above code the logs will be plotted in runtime. I am plotting loss values using the self.log() and logging gradients using watch().

If you get MisconfigurationException: No TPU devices were found. then run the following command to fix it.

%%capture
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py > /dev/null
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev > /dev/null
!pip install pytorch-lightning > /dev/null

Since the model is extremely simple I didn't use any callbacks, but you can use them if you like. Note that the LearningRateMonitor callback needs a scheduler to work, and the ModelCheckpoint callback is passed in checkpoint_callback instead of callbacks.

From Me to You...

Wow, that seemed like an exciting ride. Honestly, I think Lightning has a lot of cool stuff that a person can utilize, and the fact that we can use all of it while staying so close to PyTorch proves how powerful it is. I hope now you are convinced that Lightning is a great tool to learn and use.

Class Imbalance comes in Like a Lion

Herumb Shandilya — Wed, 09 Jun 2021 19:00:27 +0000

In a world without class imbalance we might've been heroes.

- Neural Networks

Keeping aside the fact that I butchered one of the greatest video game quotes of all time, class imbalance can be a tricky thing to handle, especially if you are a beginner. When I first encountered class imbalance I treated it like any normal dataset, I know right, and not just that, I used accuracy to judge the performance. Needless to say, that went quite badly, so let me help you avoid an embarrassing situation in front of your teacher or whoever you report to.

Class imbalance refers to a condition where the number of data points belonging to one class overpowers the others in a significant way. This could happen because of bias towards a particular class during data collection, errors during labeling, etc. The cause doesn't matter much once the data is served, so all you can do now is see what you can do with whatever you have.

We'll see how to tackle class imbalance in different domains like structured data, NLP, and CV. We'll see some of the techniques you can use to modify your data to balance out the class ratio and we'll talk about how you can fix this thing on a model level without modifying the data itself.

Loading Our Data

I believe that the correct way to learn a concept is by applying what you learn in theory and that's why I'll be putting code for you to see how we are actually going to apply what we are talking about. For the purpose of this article I've decided to use the classic dataset used to teach class imbalance i.e. Credit Card Fraud Detection.

import pandas as pd

df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')

X = df.drop('Class', axis = 1)
Y = df['Class']

Y.value_counts()

Output:-

0    284315
1     492
Name: Class, dtype: int64

I guess it's safe to say that our data is messed up. YAY! So now that we have our data let's get some action.

Choosing The Correct Metric

First things first, whenever you see class imbalance you have to ditch accuracy then and there, no questions asked. Think about it: you give your model data to tell if a patient is diabetic or not, but only 10% of the samples are diabetic, which means the model can attain 90% accuracy just by predicting not diabetic every time. What you wanna see is how well the model can classify the diabetic entries, i.e. our minority class. For this, we can use various metrics:-

Precision: Out of all entries classified as class A, how many actually belong to class A.
Recall: Out of all entries that actually belong to class A, how many our model was able to find.
F1-Score: Harmonic mean of Precision and Recall.
ROC-AUC Score: Area under the curve of True Positive Rate (Sensitivity) plotted against False Positive Rate (1 - Specificity) at different thresholds.
PR Curve: Plot of Precision against Recall at different thresholds.
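Here's a tiny sketch of the diabetic example in plain Python, assuming a made-up set of 100 patients and a "model" that always predicts not diabetic. The numbers are invented purely to show why accuracy lies here.

```python
# Toy version of the diabetic example: 10% positives, and a model
# that always answers "negative" (0).
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

Accuracy looks great while recall on the diabetic class is zero, which is exactly the trap.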

MCC and Kappa Score

So you get the gist, right? Accuracy is not always accurate. But apart from the above-mentioned, I wanna talk about one more metric. The dark horse of the evaluation metrics and arguably the best classification metric: the Matthews Correlation Coefficient, or MCC score if you may.

The above metrics are fine too but MCC Score is much more reliable since it gives a good score only when all the portions of the confusion matrix give good results i.e. TP, FP, TN, and FN.

MCC was designed for binary classification, but it has a multi-class generalization too. We also have the Kappa score, which can be used for both imbalanced and multi-class data. Note that there are many papers that question the reliability of the Kappa score and many that defend it, mostly revolving around its unwanted behavior, but let's leave that as a topic for another blog.
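For the binary case, MCC can be computed straight from the four cells of the confusion matrix. Below is a minimal sketch; the helper name `mcc` and the toy counts are my own, and returning 0 when the denominator vanishes is a common convention rather than part of the definition.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Happens when a whole row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# A model that always predicts the majority class on 90/10 data.
always_negative = mcc(tp=0, fp=0, tn=90, fn=10)
# A model that gets everything right on the same data.
perfect = mcc(tp=10, fp=0, tn=90, fn=0)

print(always_negative, perfect)
```

The lazy always-negative model that would score 90% accuracy gets an MCC of 0, the same as random guessing, which is why the metric is hard to fool.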

Scoring The Baseline Results

Well, I hope you were able to grasp the importance of proper metrics when dealing with an imbalanced dataset. So let's start by checking the performance of our baseline model. Now in order to score our model, we'll have to split the data into training and testing splits but our data is imbalanced so we can't just do random splits. We need the data to retain the original class ratio and for that, we have the stratify parameter in train_test_split itself.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size = 0.25, 
                                                    random_state = 1, 
                                                    stratify = Y)

Well that was easy, wasn't it? Now let's go ahead and train our baseline model and check its baseline metrics using classification_report and confusion_matrix.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

clf = RandomForestClassifier()
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[71073     6]
 [   17   106]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.95      0.86      0.90       123

    accuracy                           1.00     71202
   macro avg       0.97      0.93      0.95     71202
weighted avg       1.00      1.00      1.00     71202
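If you want to see where the class 1 row of that report comes from, you can redo the arithmetic by hand from the confusion matrix above. This is just a sanity-check sketch; sklearn lays the matrix out with true classes as rows and predicted classes as columns.

```python
# Cells copied from the confusion matrix printed above.
tn, fp = 71073, 6
fn, tp = 17, 106

precision = tp / (tp + fp)   # of everything flagged as fraud, how much was fraud
recall = tp / (tp + fn)      # of all actual fraud, how much was flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```

Notice that accuracy barely moves from 1.00 even though 17 of the 123 fraud cases were missed, so recall on class 1 is the number to watch.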

Choosing Suitable Algorithm?

Now that we understand the importance of metrics in measuring the performance of a model in class imbalance, we can move on and check if there are any algorithms that aren't really bothered by class imbalance.

I mean, on paper KNN shouldn't be bothered by class imbalance, but there is also something called Hellinger Distance Decision Trees: basically decision trees that use Hellinger distance as the split criterion. They were created to tackle the effect of imbalance on decision trees.

There is also a way by which you can modify your algorithm to give importance to minority class prediction by the use of class weights. Let's talk more about this cause why not.

Cost-Sensitive Algorithms

Please don't be intimidated by the name, it's a rather simple concept. Basically you assign a weight to each class, and that weight signifies how much the algorithm will be penalized for misclassifying an entry of that class. Many algorithms in sklearn support class weighting and a few don't.

The weights are assigned to the class such that the minority class has a higher weight than the majority class. We would expect that the algorithm trained on class weights will perform better as compared to the standard one.

In order to pass weights to the algorithm, you can simply pass the dictionary with key as class and value as the weight to the corresponding key to the class_weight parameter. Let's try doing this in our classifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Class 1 (fraud) is the minority class, so it gets the larger weight
clf = RandomForestClassifier(verbose = 100, class_weight = {0:1,1:600})
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.92      0.88      0.90       123

    accuracy                           1.00     71202
   macro avg       0.96      0.94      0.95     71202
weighted avg       1.00      1.00      1.00     71202

Recall value seems to have increased a bit that means our cost-sensitive model is able to recall more values from minority class in the testing set. You can try out different combinations to check if you can get better results.

Now one question that may arise in your mind is, what weights should I assign to which class? There is a simple answer to this question: tune the weights. You can select a range of values and use GridSearchCV to find which ones work best.

Let's say you are a daredevil and don't wanna tune the weights; then you can try the compute_class_weight() utility in sklearn to compute class weights and use them as the weights for the algorithm. In my experience, it rarely gives the best result as compared to tuned ones. But tuning is actually pretty simple, below I'll show you how to do it and I want you to try it out by redefining the param_grid according to what you think is best. You can comment on your findings if you like too 😃

from sklearn.model_selection import GridSearchCV

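param_grid = {
    'class_weight' : [{0:1,1:600}, {0:1,1:700}, {0:1,1:800}]
}

# Score on F1 of the minority class instead of the default accuracy
grid = GridSearchCV(RandomForestClassifier(), param_grid = param_grid, scoring = 'f1')
grid.fit(x_train, y_train)

# Predict with the best estimator found by the search
y_pred = grid.predict(x_test)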

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))
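As for the compute_class_weight() route mentioned above, its 'balanced' mode boils down to one formula: n_samples / (n_classes * count of that class). Here's a plain-Python sketch of that heuristic using the class counts from our data; `balanced_class_weights` is my own helper name, not sklearn's function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Class counts taken from Y.value_counts() earlier in the article.
labels = [0] * 284315 + [1] * 492
weights = balanced_class_weights(labels)

print(weights)
```

The fraud class ends up with a weight in the high two hundreds while the majority class sits around 0.5, which gives you a sensible starting point for the grid you tune around.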

In neural networks too you can train your model with assigned class weights to tackle the issue of class imbalance. The syntax is pretty similar in the sense that you just pass the class weights to the network. In TensorFlow you pass the weights in the fit() function and in PyTorch you pass the weights to the loss function.

import torch
from torch import nn

# PyTorch - Pass weight tensor in loss function
# (index = class label; class 1 is the minority, so it gets the larger weight)
pytorch_weights = torch.tensor([0.1, 0.99])
criterion = nn.NLLLoss(weight = pytorch_weights)

# TensorFlow - Pass weight dictionary in fit function
# (model is assumed to be an already compiled Keras model)
tf_weights = { 0 : 1, 
               1 : 99 }
model.fit(x_train, y_train, batch_size = 50, class_weight = tf_weights)

Handling Class Imbalance with Data Modification

No no, I'm not talking about modifying the values of the entries, I'm talking about how we can remove or add entries of a class to the existing data in order to balance the class ratio. There are quite a few ways to go about it and we'll explore almost all of them in-depth. Once you have balanced out the classes it's basically your good old classification problem. The ones that we'll learn about are:-

Undersampling - Fix for the Lazy
Oversampling - Jugaad
SMOTE - Fix from Logic
ADASYN - SMOTE's Sibling

It'll be better if you install the imbalanced-learn library (imported as imblearn) since that's what we'll be using to implement the above. You can install it via the following command:-

pip install imbalanced-learn

Undersampling - Fix for the Lazy

Random undersampling is a way to balance classes by randomly removing entries from the majority class. I know right, makes no sense, you are basically deleting entries from the majority class at random. Not many people are fond of this, mainly cuz it leads to loss of information. I too think it's stupid to lose that precious labeled data.
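The idea fits in a few lines of plain Python. This is only a sketch of the concept with a made-up helper `random_undersample` and toy data, not what imblearn does internally.

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)

    target = min(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        # rng.sample picks `target` rows without replacement.
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

# Toy data: 10 majority samples (class 0) and 2 minority samples (class 1).
X = [[i] for i in range(12)]
y = [0] * 10 + [1] * 2

X_small, y_small = random_undersample(X, y)
print(y_small)
```

Eight of the ten majority rows are simply thrown away, which is the information loss people complain about. The imblearn version below does the same job in two lines.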

from imblearn.under_sampling import RandomUnderSampler

us = RandomUnderSampler()
# Keep the resampled data in new variables so the original split stays untouched
x_train_us, y_train_us = us.fit_resample(x_train, y_train)

clf = RandomForestClassifier()
clf.fit(x_train_us, y_train_us)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[68666  2413]
 [    8   115]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98     71079
           1       0.05      0.93      0.09       123

    accuracy                           0.97     71202
   macro avg       0.52      0.95      0.53     71202
weighted avg       1.00      0.97      0.98     71202

But what if we undersample our data smartly? There are algorithms that can be used to remove redundant data and reduce the issue of losing important information. Not just that, sometimes a combination of both oversampling and undersampling can achieve good results too. But enough with that, let's learn about those smart undersampling methods.

Near Miss Undersampling

This type of undersampling basically determines the samples from the majority class that should be retained based on the distance between that sample and minority class samples. It has 3 variations to it:-

Version - 1: Keeps the majority samples that have the minimum average distance to their three nearest minority class samples.
Version - 2: Keeps the majority samples that have the minimum average distance to their three farthest minority class samples.
Version - 3: For each minority class sample, keeps a given number of the majority class samples closest to it.
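To see what the first variation does, here's a rough plain-Python sketch of it. The helper `near_miss_1` and the 2D points are hypothetical, and imblearn's implementation is more involved; this only shows the selection rule.

```python
import math

def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k nearest minority points."""
    def score(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=score)[:n_keep]

# Toy 2D data: three minority points near the origin, four majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]

kept = near_miss_1(majority, minority, n_keep=2)
print(kept)
```

The two majority points sitting right next to the minority cluster survive and the far-away ones are dropped, so the model is forced to learn the boundary where the classes actually meet. The imblearn version below lets you pick the variation with the version parameter.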

from imblearn.under_sampling import NearMiss

nm_us = NearMiss(version = 3)
# Keep the resampled data in new variables so the original split stays untouched
x_train_nm, y_train_nm = nm_us.fit_resample(x_train, y_train)

clf = RandomForestClassifier()
clf.fit(x_train_nm, y_train_nm)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[68681  2398]
 [    8   115]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98     71079
           1       0.05      0.93      0.09       123

    accuracy                           0.97     71202
   macro avg       0.52      0.95      0.54     71202
weighted avg       1.00      0.97      0.98     71202

I've tried quite a few examples and in most cases version 3 really works better as compared to the other 2.

Tomek Links Undersampling

This type of undersampling has a simple yet neat approach: it focuses on removing the majority sample of each Tomek link. A Tomek link is a pair of points from different classes that are each other's nearest neighbors, kinda like Romeo and Juliet except here only one of them dies, an apology to all R&J fans. Anyways, let's try our hand at Tomek links.

from imblearn.under_sampling import TomekLinks

tomek_us = TomekLinks()
# Keep the resampled data in new variables so the original split stays untouched
x_train_tl, y_train_tl = tomek_us.fit_resample(x_train, y_train)

clf = RandomForestClassifier(verbose = 100)
clf.fit(x_train_tl, y_train_tl)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Here since only Tomek links are deleted the class balance isn't completely achieved rather only ambiguous points are removed.
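If the definition feels abstract, here's a small brute-force sketch that finds Tomek links in plain Python. The helpers `nearest_index` and `tomek_links` and the toy points are my own; imblearn uses a proper nearest-neighbour search instead of this O(n²) loop.

```python
import math

def nearest_index(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Index pairs (i, j) that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        # i < j makes sure each pair is reported once.
        if i < j and labels[i] != labels[j] and nearest_index(j, points) == i:
            links.append((i, j))
    return links

# Toy data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.0, 3.5), (6.0, 6.0)]
labels = [0, 1, 0, 0, 1]

links = tomek_links(points, labels)

# Undersampling removes only the majority member of each link.
majority_label = 0
to_drop = {i if labels[i] == majority_label else j for i, j in links}
print(links, to_drop)
```

Only the one majority point that sits practically on top of a minority point gets removed, and the class counts go from 3 vs 2 to 2 vs 2 here purely by coincidence of the toy data.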

Random Oversampling - Jugaad

This is the exact opposite of what we did in undersampling: instead of removing points from the majority class, we explode the minority class by duplicating minority samples chosen at random with repetition until we achieve class balance. There are ways to explode the minority class smartly and logically, like generating synthetic data, but we'll talk about them shortly.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()  # named ros so it doesn't shadow Python's built-in os module
x_train, y_train = ros.fit_resample(x_train, y_train)

clf = RandomForestClassifier(verbose = 100)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[71072     7]
 [   17   106]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.94      0.86      0.90       123

    accuracy                           1.00     71202
   macro avg       0.97      0.93      0.95     71202
weighted avg       1.00      1.00      1.00     71202
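If you're curious what "chosen at random with repetition" looks like under the hood, here's a rough pure-Python sketch. The function name `random_oversample` and the toy data are hypothetical, and this is only an illustration of the idea, not the library's actual implementation.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random samples of every smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement; k is 0 for the largest class, so nothing is added there.
        extra = rng.choices(pool, k=target - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Toy data (hypothetical): five majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
y = [0, 0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Notice that every new minority row is an exact copy of an existing one, which is the main reason people look at synthetic approaches next.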

SMOTE - Fix from Logic

SMOTE, or Synthetic Minority Oversampling TEchnique, is a way through which we oversample the data by generating synthetic data based on the provided samples of the minority classes. The steps required to create synthetic data are actually quite simple:-

Take samples from the minority class
Join these samples via lines
Pick points that lie on these lines at random until you achieve class balance

Take the above picture, which took me 30 mins to make, as an example: we only have 4 red points in the minority sample, but if we join those points and add points that lie on the red lines to the existing dataset, then we can balance the class ratio. This is basically how SMOTE works. Simple, right!
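The "pick points on the lines" idea can be sketched in a few lines of plain Python. Treat this as a simplified illustration: the name `smote_like`, the toy points, and the choice of `k=2` neighbors are my assumptions, and the real SMOTE in imblearn has more options and details than this.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the chosen sample.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbors)]
        t = rng.random()  # how far along the line from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Four toy minority points (hypothetical), like the red points in the picture.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so nothing is invented outside the region the minority class already covers.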

from imblearn.over_sampling import SMOTE

smote = SMOTE()
x_train, y_train = smote.fit_resample(x_train, y_train)

clf = RandomForestClassifier(verbose = 100)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[71066    13]
 [   14   109]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.89      0.89      0.89       123

    accuracy                           1.00     71202
   macro avg       0.95      0.94      0.94     71202
weighted avg       1.00      1.00      1.00     71202

Hmm, that seems fine: compared to random oversampling, recall went up a bit but precision took a hit. There are other variants of SMOTE that you can explore and learn about too. But for now, it's time to move on to the next method.

ADASYN - SMOTE's Sibling

ADASYN, or Adaptive Synthetic sampling, is a way through which we oversample the data by generating synthetic data based on the density of majority neighbors around a minority sample. The difference between SMOTE and ADASYN is that in ADASYN the number of samples generated around a point depends on the density distribution r_x of that point, whereas in SMOTE all minority samples have equal weights.

For example, in the above image the purple point will have more samples generated around it than the one with the green arrow. Let's understand the steps required in ADASYN:-

Calculate the no. of points to be generated, denoted by G. Here beta is the balance factor which, if kept at 1, generates enough points to achieve perfect class balance.

G = (n_M - n_m) × beta

where n_M <- No. of majority class samples and n_m <- No. of minority class samples

For all i that belong to the minority class we find the ratio:-

rᵢ = Δᵢ / K

where K is the no. of neighbors and Δᵢ represents the no. of samples belonging to the majority class out of those K neighbors.

Convert the above to a probability distribution by dividing each rᵢ by the sum of all of them:-

r̂ᵢ = rᵢ / Σ rᵢ

Calculate the no. of data points to be generated around each minority sample, gᵢ = r̂ᵢ × G. Then for each minority sample, you generate gᵢ no. of samples.
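The steps above can be sketched in plain Python to see how many samples each minority point would get. This is only a toy illustration: the function name `adasyn_allocation`, the toy points, `k=2`, and rounding gᵢ to the nearest integer are all my assumptions, and it assumes a binary problem.

```python
import math

def adasyn_allocation(points, labels, minority_label=1, k=2, beta=1.0):
    """Return {minority index: number of synthetic samples to generate around it}."""
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority_label]
    n_m = len(minority_idx)
    n_M = len(labels) - n_m
    G = (n_M - n_m) * beta  # total number of synthetic points

    ratios = []
    for i in minority_idx:
        # K nearest neighbors of sample i among all other points.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        delta = sum(1 for j in neighbors if labels[j] != minority_label)
        ratios.append(delta / k)

    total = sum(ratios)
    if total == 0:  # no minority point has a majority neighbor
        return {i: 0 for i in minority_idx}
    # Normalize the ratios to a distribution, then scale by G.
    return {i: round(r / total * G) for i, r in zip(minority_idx, ratios)}

# Toy data (hypothetical): 8 majority points followed by 4 minority points.
points = [
    (0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1), (0, 2), (11.5, 10),
    (0.5, 0.5), (10, 10), (10, 11), (11, 10),
]
labels = [0] * 8 + [1] * 4

alloc = adasyn_allocation(points, labels)
print(alloc)
```

The minority point buried inside the majority cluster gets most of the new samples, while the ones surrounded by their own class get none, which is exactly the "adaptive" part.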

from imblearn.over_sampling import ADASYN

adasyn = ADASYN()
x_train, y_train = adasyn.fit_resample(x_train, y_train)

clf = RandomForestClassifier(verbose = 100)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(f'Confusion Matrix:-\n {confusion_matrix(y_test,y_pred)}\n')
print(classification_report(y_test, y_pred))

Output:-

Confusion Matrix:-
 [[71065    14]
 [   14   109]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.89      0.89      0.89       123

    accuracy                           1.00     71202
   macro avg       0.94      0.94      0.94     71202
weighted avg       1.00      1.00      1.00     71202

Handling Class Imbalance in Image Classification

In Image Classification, the usual way to handle class imbalance is to explode the data using Oversampling or via Data Augmentation. Oversampling is pretty simple: you copy and you replicate, the good old way.

In Oversampling by Data Augmentation, you can generate samples by changing the nature of the image itself. For those who don't know, Image Augmentation refers to changing aspects of an image like its scale, angle, orientation, etc. Keep in mind I'm not talking about just applying augmentation to the images as they are loaded, but about saving those augmented images as new samples; if you only do the former, the dataset will still be imbalanced.

Another way to tackle this is by passing class_weights to the CNN model. Now unless you are a person who loves pain, chances are you'll be using a CNN model or a pre-trained model for image classification, for which we can define class_weights and train the network accordingly.
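As a rough sketch of where such weights could come from, one common heuristic is the "balanced" formula n_samples / (n_classes × class count). The function name `balanced_class_weights` and the 90/10 split below are just made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (hypothetical): 90 images of class 0, 10 images of class 1.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class ends up with the larger weight, so mistakes on it cost more during training. How you hand the weights to the model depends on your framework, for example a weight tensor for the loss function in PyTorch or a dictionary passed at fit time in Keras.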

Handling Class Imbalance in Text Classification

Text Classification for an imbalanced set is similar: we can use class_weights to define the penalty for misclassifying each class. Chances are you might be using LSTM, BERT, etc., for which we can utilize class_weights.

We can also remove duplicate sentences. Like if you have 2 sentences, "Bag is in the room" and "Bag is in room", welp, they are basically the same, so we can remove one of them. Removing such duplicate messages will help you reduce the size of your majority class.
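Here's a small sketch of one way such near-duplicates could be caught: normalize each sentence and keep only the first one per normalized form. The tiny `STOPWORDS` set and the helper names are hypothetical; in practice you'd likely want a proper stop word list or a similarity threshold.

```python
import string

# Hypothetical stop word list, kept tiny on purpose.
STOPWORDS = {"a", "an", "the", "is", "in"}

def normalize(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return tuple(word for word in cleaned.split() if word not in STOPWORDS)

def drop_near_duplicates(sentences):
    """Keep the first sentence seen for each normalized form."""
    seen, kept = set(), []
    for sentence in sentences:
        key = normalize(sentence)
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept

sentences = ["Bag is in the room", "Bag is in room", "The cat is asleep"]
unique = drop_near_duplicates(sentences)
print(unique)
```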

The next way is oversampling the good old random way, i.e. Random Oversampling, or you can explode minority samples with text augmentations. Wait what? Text Augmentations!? I mean in images we rotate, scale up, crop, etc., but what can we do with text? In text augmentation we start by tokenizing sentences, then we can shuffle the tokens and rejoin them to generate new texts. We can also replace adjectives, verbs, etc. with their synonyms to generate text with the same meaning.
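Both tricks can be sketched with nothing but the standard library. The `SYNONYMS` dictionary here is a hypothetical stand-in for a real thesaurus, and splitting on whitespace is a deliberately naive tokenizer.

```python
import random

# Hypothetical synonym table; a real setup would use a proper thesaurus.
SYNONYMS = {"big": ["large", "huge"], "quick": ["fast", "speedy"]}

def shuffle_tokens(sentence, rng):
    """Tokenize on whitespace, shuffle the tokens, and rejoin them."""
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def replace_synonyms(sentence, rng):
    """Swap every word that has an entry in SYNONYMS for a random synonym."""
    return " ".join(
        rng.choice(SYNONYMS[word]) if word in SYNONYMS else word
        for word in sentence.split()
    )

rng = random.Random(0)
original = "the quick dog chased a big ball"
shuffled = shuffle_tokens(original, rng)
swapped = replace_synonyms(original, rng)
print(shuffled)
print(swapped)
```

Shuffling keeps the same bag of words but breaks word order, so it fits models that don't rely heavily on sequence, while synonym replacement keeps the sentence readable.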

There is another way, often called back-translation, where you translate the English text into another language and then translate it back to English using language translation.

From Me to You...

Boy, was that a lot of information to grasp. Imbalanced classification can be a pain. I mean, in a perfect world there would be no imbalance; sadly we live in a world where 4-koma mangas rarely get an anime adaptation, so unfair. The point is, where there is a will there is a way, and I hope I was able to guide you through that way properly. See you in the next article!