Jimmy Guerrero for Voxel51

Originally published at voxel51.com

Finding Outliers in Your Vision Datasets

Author: Dan Gural (Machine Learning Engineer at Voxel51)

Automatically parse your dataset for outliers with FiftyOne Plugins

Data is the heart of AI. As models continue to evolve at a rapid pace, it is more important than ever to have a high-quality dataset to train on. Lurking in our datasets are poor samples that are not representative of our problem and drag down the performance of our model. Data curation helps solve this problem by giving an ML engineer the tools needed to take their datasets to the next level.

Within FiftyOne, there are many different ways to find poor samples in your datasets. Just a few examples are: 

Today we will be showing how to do outlier detection in FiftyOne using embeddings and sklearn! To start, install the outlier detection plugin to gain access to the outlier_detection operator. It can be installed with:

fiftyone plugins download https://github.com/danielgural/outlier_detection

Finding Outliers in the Entire Dataset

Once installed, we can kick open the FiftyOne app with the dataset of our choice. If you need help loading your dataset, check out the documentation on how to get started. We will be using the MSCOCO 2017 training split for our example. We can get started with:

import fiftyone as fo
import fiftyone.zoo as foz
import numpy as np

dataset = foz.load_zoo_dataset(
    "coco-2017",
    shuffle=True,
    split="train",  # change to "validation" for a smaller split
)

session = fo.launch_app(dataset)

Once you are in the app, hit the backtick key ( ` ) or the browse operations button to open the operators list. Search for the outlier detection operator and you will be met with the following menu.

[Image: the outlier detection operator menu]

From here, you can configure how you want to find your outliers. You can choose any of the FiftyOne embedding models and set the percentage of your dataset you think is contaminated with outliers. Optional inputs let you restrict the search to a specific class or tag the samples found as outliers! Let’s try an example using CLIP on the training set of MSCOCO.

[Image: outliers found with CLIP on the MSCOCO training set]

What’s particularly interesting about the Outlier Detection plugin is how many unique and unusual samples it surfaces. From over 100,000 images, we can quickly view the 1% that are most relevant to data curation and decide what to keep and what to remove. A quick look at the detected outliers reveals these issues:

  • Black and white photos
  • Distorted or warped images
  • Duplicated images in an attempt to maintain aspect ratio
  • Backgrounds dominated by a single color (snow, ocean, sky, etc.)
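
For readers curious about what "embeddings plus sklearn" can look like in code, here is a minimal sketch. It is not the plugin's source: it assumes the detection step can be approximated with scikit-learn's `LocalOutlierFactor` run on one embedding vector per image, and it uses random vectors in place of real CLIP embeddings. The function name `find_outliers` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def find_outliers(embeddings, contamination=0.01, n_neighbors=20):
    """Return a boolean mask marking the samples flagged as outliers.

    Assumption: LocalOutlierFactor stands in for whatever detector the
    plugin actually uses. `contamination` is the fraction of the dataset
    expected to be outliers (the percentage option in the menu).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    detector = LocalOutlierFactor(
        n_neighbors=n_neighbors, contamination=contamination
    )
    # fit_predict returns -1 for outliers and 1 for inliers
    predictions = detector.fit_predict(embeddings)
    return predictions == -1


# Synthetic stand-in for real embeddings: 990 points in one cluster
# plus 10 points shifted far away from it.
rng = np.random.default_rng(0)
typical = rng.normal(size=(990, 32))
unusual = rng.normal(size=(10, 32)) + 25.0
embeddings = np.vstack([typical, unusual])

mask = find_outliers(embeddings, contamination=0.01)
print(f"flagged {mask.sum()} of {len(mask)} samples")
```

With a real dataset, the embeddings would come from FiftyOne instead of a random generator, for example from `dataset.compute_embeddings(model)` with a zoo model loaded via `foz.load_zoo_model(...)`. The `contamination` value plays the same role as the percentage you enter in the operator menu: a higher value flags more samples for review.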

Finding Outliers in a Single Class

Different problems require different data curation decisions, and FiftyOne brings the most important samples right in front of you. We can perform outlier detection on a single class as well! Let’s check out a few examples of “airplane” outliers!

[Image: examples of “airplane” outliers]

Once again, we can easily find these unusual edge cases in our dataset with outlier detection. Now we can be sure that we have no birthday cake airplanes in our training set!
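
Restricting detection to one class can be sketched in a similar way. The example below is an illustration under assumptions, not the plugin's implementation: it again uses scikit-learn's `LocalOutlierFactor`, represents each image's labels as a plain Python set, and uses toy data. The helper `class_outlier_indices` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def class_outlier_indices(embeddings, sample_labels, target, contamination=0.05):
    """Run outlier detection only on samples that contain `target`.

    Returns indices into the full dataset, so flagged samples can be
    tagged or reviewed in place.
    """
    # Keep only the samples whose label set includes the target class
    subset = np.array(
        [i for i, labels in enumerate(sample_labels) if target in labels]
    )
    detector = LocalOutlierFactor(
        n_neighbors=min(20, len(subset) - 1), contamination=contamination
    )
    flags = detector.fit_predict(np.asarray(embeddings, dtype=float)[subset])
    # -1 marks an outlier; map subset positions back to dataset indices
    return subset[flags == -1]


# Hypothetical toy data: 200 "airplane" images, 5 of them unusual,
# mixed in with 300 images of another class.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(500, 16))
sample_labels = [{"airplane"} if i < 200 else {"person"} for i in range(500)]
embeddings[:5] += 30.0  # push five airplane samples far from the rest

flagged = class_outlier_indices(
    embeddings, sample_labels, "airplane", contamination=0.025
)
print(sorted(flagged.tolist()))
```

Fitting the detector on the class subset alone matters here: a sample that looks ordinary against the whole dataset can still be unusual among airplanes. In FiftyOne itself, a comparable subset could be built with a view such as `dataset.filter_labels("ground_truth", F("label") == "airplane")`, assuming the labels live in a `ground_truth` field as they do for the zoo's COCO dataset.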

Conclusion

Outliers are found in almost every dataset. Finding them, especially across hundreds of thousands if not millions of samples, can be a daunting task, but with FiftyOne the workflow can be made simple with the Outlier Detection plugin. If you are interested in finding more FiftyOne plugins, check out our community repo to optimize your workflows with plugins or contribute one of your own! Plugins are highly flexible and always open source, so you can customize them exactly to your needs! Have fun exploring!