The minefield of classifier training

Why do augmentations work with classifiers?

Ever since I started training classifiers, the process of model training has been quite infuriating, to say the least. Training any competent model that can at least accurately distinguish between the different classes is an astronomical task, given the lack of readily available datasets. This is where augmentation techniques take the spotlight: at a baseline, augmentations introduce variation into the training data and thereby implicitly enlarge the training dataset. As an overview, there are techniques like AutoAugment (AA), RandAugment (RA), and TrivialAugment (TA), each of which tries to improve on the previous one at some cost.

AA learns a dataset-specific augmentation policy; however, the policy search carries a large computational overhead for moderate, though comparable, gains. RA builds on AA by drastically shrinking a core component, the search space, while delivering results very close to AA at a fraction of the overhead. TA goes further and shows that a single randomly chosen augmentation applied per image is nearly comparable to the results of RA and AA. Combine this with the "regularizers", i.e. the CutMix, Cutout, Mixup, and YOCO families, and you get a vast augmentation landscape. With this many choices clouding any fellow rookie's question of 'which augmentations should I use?', it's a given that working with them can be a bit overwhelming.
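If you're working in torchvision, all three policies ship as drop-in transforms; here's a minimal sketch of how they would slot into a training pipeline (the resize/crop sizes and normalization values are the usual ImageNet defaults, not anything specific to your data):

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy, RandAugment, TrivialAugmentWide

# Pick ONE of the three policies; each drops into an ordinary transform pipeline.
augment = TrivialAugmentWide()                       # TA: one random op per image, nothing to tune
# augment = RandAugment(num_ops=2, magnitude=9)      # RA: small search space (N ops + a magnitude)
# augment = AutoAugment(AutoAugmentPolicy.IMAGENET)  # AA: policy learned on ImageNet

train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    augment,                                         # applied to PIL images, before ToTensor()
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```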

The irony of the perfect solution:

A similarly frustrating scenario is what I've been dealing with for the past two years of coding. In my experience (obviously so), the size of the dataset inherently decides how well a model performs. Unbalanced datasets lead to memorization of patterns, while a shallow, many-classed dataset produces an underconfident model. And the blatant irritant is simply not having a large dataset at all, which leads to overfitting. This is where possibly the greatest augmentation asset comes in: pretrained models. Pretrained models are models whose weights have already been learned for a certain type of problem by training on a much larger dataset (ranging from thousands to hundreds of thousands of images). Thanks to these models, for, say, a facial identification module, I won't need to gather 50–100k images just to hit 70% accuracy at telling a human face from a tree stump. They are great resources and perform very efficiently, sometimes with as few as 200 images per class (sometimes even fewer). But just like everything mentioned before, even these aren't perfect.

Pretrained failures make the perfect pitfalls

When it comes to readily available, easy-to-set-up pretrained models trained on a large dataset, one is bound to mention the usual ImageNet-pretrained families: ResNet, AlexNet, Vision Transformers (ViTs), and MobileNet, to name a few. As I mentioned earlier, if we are dealing with smaller datasets, using their smaller variants seems like the best bet. Such pretrained models converge faster with considerably fewer training epochs. Sometimes, when working with classifiers, just swapping out the classifier head and training only that also works. The fallacy begins with a closer look at the architectures of these models. 224 x 224: that is the expected input for most of these families. One must wonder how many of our Region of Interest (ROI) pixels would fit in those 50,176 pixels; the answer, from my experience: 'not many.' One reasonable take would be to use the 512 x 512 or 1024 x 1024 variants. However, if we are working with just 200 or 300 images per class across 2–3 classes, how helpful can these large models even be? This leads to the paradoxical mix-and-match of 'just enough for my dataset, but not too little or too much.'
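As a rough sketch of the head-swap idea with torchvision (the class count and learning rate are placeholders): load an ImageNet-pretrained ResNet-18, freeze the backbone, and train only a new final layer.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # placeholder: set this to your own number of classes

# Load ImageNet-pretrained weights and freeze the backbone.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

# Swap in a fresh classifier head; only this layer gets trained.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```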

Duct tape to the rescue: 

The term "duct tape heroes" refers to a few tactics I've used to overcome, or at least subdue, the pitfalls mentioned above. Now that we've carefully walked through this metaphorical minefield of classifier pitfalls, here's some handy "duct tape" to help you build your bridges (or rather, your classifiers).

As I've mostly worked with only the small-scale 224 input variants, here's what I've done to (sometimes partially) solve these problems.

  1. Set up a Grad-CAM: Grad-CAM, simply put, produces a heat map of the regions the model considers important. This gives a preliminary check on whether the model is looking at features relevant to the task. I'd consider this a best practice for any classification task (see the first sketch after this list).

  2. RandomCrop() might be your culprit: If you've already confirmed that your training images are being read correctly and are still struggling with under- or overfitting, I would urge you to check which portion of your image is actually being used. Try saving the images output by the RandomCrop() transform (a sketch follows after this list), then run Grad-CAM on those cropped images to make sure the model "pays attention" to your ROI.

  3. Pad your images: If '1.' and '2.' don't solve it, try padding your image with black or white borders to reach 224 x 224 (see the padding sketch after this list). If you already have an image that captures your ROI properly and can simply be cropped to 224 x 224, use that instead; otherwise, if your images are lacking horizontally or vertically, use padding. For this particular step, though, proceed with caution: in my experience the model sometimes treats the pitch-black padding as an important feature in itself, which probably has something to do with the data these models were originally trained on. Since padding can also cost you contextual features and effective resolution on smaller datasets, I would only suggest it if you can make do with losing a few features.

  4. Use crops to your advantage: If you can, the most effective tactic I've found is to crop out the ROI, resize it to the 224 x 224 limit, and pass those crops to the classifier (see the last sketch after this list). As with '3.', this has caveats; most importantly, it can bias the model toward the cropped context and skew its predictions accordingly.
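For '1.', one convenient way to set up a Grad-CAM is the pytorch-grad-cam package; below is a minimal sketch, assuming an ImageNet-pretrained ResNet-18, with the image path and class index as placeholders:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms
# pip install grad-cam  (the pytorch-grad-cam package)
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()

# Preprocess one image the same way your training pipeline does.
img = Image.open("sample.jpg").convert("RGB").resize((224, 224))
rgb = np.asarray(img, dtype=np.float32) / 255.0
tensor = torch.from_numpy(rgb).permute(2, 0, 1)
input_tensor = transforms.Normalize([0.485, 0.456, 0.406],
                                    [0.229, 0.224, 0.225])(tensor).unsqueeze(0)

# For a ResNet, the last conv block is the usual target layer for the heat map.
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # class index 0 is just an example

# Overlay the heat map on the original image and save it for inspection.
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)
Image.fromarray(overlay).save("gradcam_overlay.jpg")
```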
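For '2.', the quickest way to see what RandomCrop() is feeding the model is simply to dump a few crops to disk; a rough sketch (the image path is a placeholder):

```python
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

# The same crop your training pipeline uses, plus ToTensor() so we can save it.
crop = transforms.Compose([
    transforms.RandomCrop(224, pad_if_needed=True),
    transforms.ToTensor(),
])

img = Image.open("sample.jpg").convert("RGB")

# Dump a handful of random crops so you can see exactly what the model receives.
for i in range(5):
    save_image(crop(img), f"crop_check_{i}.png")
```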
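For '3.', here's one way I'd sketch the padding idea: shrink the image so its longer side fits 224, then pad the remainder with a constant border (file names and the fill value are placeholders):

```python
from PIL import Image
import torchvision.transforms.functional as TF

def pad_to_square(img: Image.Image, size: int = 224, fill: int = 0) -> Image.Image:
    """Shrink the longer side to `size`, then pad out to size x size with a constant border."""
    img = img.copy()
    img.thumbnail((size, size))  # keeps aspect ratio and never upscales
    pad_w, pad_h = size - img.width, size - img.height
    padding = [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2]
    return TF.pad(img, padding, fill=fill)

# fill=0 gives black borders, fill=255 white ones.
padded = pad_to_square(Image.open("sample.jpg").convert("RGB"), fill=0)
padded.save("padded_224.png")
```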
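And for '4.', the crop-the-ROI tactic boils down to a couple of lines; the box coordinates below are purely hypothetical stand-ins for whatever your annotations or detector actually give you:

```python
from PIL import Image

# Hypothetical ROI box as (left, upper, right, lower); in practice this comes
# from your own labels or a detector, not hard-coded numbers.
roi_box = (120, 80, 620, 580)

img = Image.open("sample.jpg").convert("RGB")
roi = img.crop(roi_box).resize((224, 224))  # fit the ROI to the model's 224 x 224 input
roi.save("roi_224.png")
```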

There you go. Hopefully, if you are new to classifier training, this gives you a comprehensive picture of the subtle fallacies, or rather 'landmines,' in classifier training. And to fellow rookies like me, I hope this adds a few lookouts to your journal as well. In the end, training models is equal parts science and improvisation, and sometimes all you really need is a bit of duct tape.

Do consider adding your own duct tape and wrenches down below.
Until next time, keep that duct tape handy!
