PythicCoder

Posted on • Originally published at devblog.pytorchlightning.ai

5 Steps to Training your first Video Classifier in a Flash

Learn step by step how to build, train, and make predictions with a video classification model using 3 simple tools: Lightning Flash, PyTorchVideo, and Kornia.

Video understanding enables computers to recognize actions, objects, and events in videos. From retail and health care to agriculture, video understanding enables the automation of countless industry use cases.

“A case in point: Walmart has developed an AI system that inspects fresh food for signs of defects and spoilage. This system is helping Walmart monitor the temperature and freshness of produce and perishable foods, improve visual inspection at distribution centers, and route perishable food to the nearest store.” [Source]

Facebook AI Research recently released a new library called PyTorchVideo, powered by PyTorch, that simplifies video understanding by providing SOTA pre-trained video models, datasets, and video-specific transforms. All the library’s models are highly optimized for inference and support different datasets and transforms.

Lightning Flash is a collection of tasks for fast prototyping, baselining, and fine-tuning scalable Deep Learning models, built on PyTorch Lightning. It allows you to train and finetune models without being overwhelmed by all the details, and then seamlessly override and experiment with Lightning for full flexibility. Learn more about Flash here.

PyTorchLightning/lightning-flash

In its upcoming release, Lightning Flash will provide a deep integration with PyTorchVideo; its backbones and transforms can be used alongside Kornia’s to provide a seamless preprocessing, training, and fine-tuning experience for SOTA pre-trained video classification models.

Kornia is a differentiable computer vision library for PyTorch that consists of a set of routines and differentiable modules to solve generic computer vision problems.

In this article, you will learn how to train a custom video classification model in 5 simple steps with PyTorchVideo, Lightning Flash, and Kornia, using the Kinetics dataset.

The Kinetics human action video dataset, released by DeepMind, consists of annotated ~10s video clips sourced from YouTube.

Keep reading to train your own model in a flash!

5 Simple Steps for Video Classification

Before we get started: all the steps in this tutorial are reproducible and can be run with the free trial of Grid by clicking the Run button below.

The Grid platform enables you to seamlessly train hundreds of machine learning models on the cloud from your laptop, without modifying your code, in a fully reproducible manner.

Find the full end-to-end code for this tutorial here and the full reproducible run here on **Grid**.

Prerequisite Imports

https://gist.github.com/tchaton/f3ceafeb246b46f8c8f7d9e4498760b0
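For reference, the imports boil down to something like the following (a minimal sketch; module paths follow the Flash release current at the time of writing and have moved in later versions):

```python
import os

import torch
import kornia.augmentation as K
from torchvision.transforms import Compose

import flash
from flash.core.classification import Labels
from flash.core.data.utils import download_data  # flash.data.utils in some earlier releases
from flash.video import VideoClassificationData, VideoClassifier
```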

Step 1- Download The Kinetics Dataset

https://gist.github.com/tchaton/a4744780de0f534a613fc738afbac8c2
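A minimal sketch of the download step, assuming the subsampled Kinetics archive referenced in the Flash documentation (swap in your own copy of the dataset if the URL has moved):

```python
# Download and extract a small, pre-packaged subset of Kinetics into ./data
# (assumed URL from the Flash docs; any folder of labeled videos works too).
download_data("https://pl-flash-data.s3.amazonaws.com/kinetics.zip", "./data")
```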

Step 2- Specify Kornia Transforms to Apply to Video

Flash helps you place your transforms exactly where you want them, making video pre- and post-processing as easy as a few lines of initialization code.

As videos are memory-heavy objects, data sampling and data augmentation play a key role in the video model training procedure and need to be carefully crafted to reach SOTA results.

Flash streamlines this process: transforms can be provided as a simple dictionary mapping hooks to transforms. Flash hooks are simple functions that can be overridden to extend Flash’s functionality.

In the code below, we use the post_tensor_transform hook to uniformly select 8 frames from a sampled video clip and resize them to 224 pixels. This is done in parallel within each of the PyTorch DataLoader workers.

We also use the per_batch_transform_on_device hook to enable Kornia normalization to be applied directly to a batch already moved to the GPU or TPU.

https://gist.github.com/tchaton/b627d56e6d3da89eb456aede48c3c171
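For reference, a sketch of what such a transform dictionary can look like, continuing from the imports above (the hook names match the prose; the crop size and normalization statistics are illustrative):

```python
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

# Per-channel statistics commonly used with Kinetics-pretrained backbones
# (illustrative; use the statistics your chosen backbone was trained with).
mean = torch.tensor([0.45, 0.45, 0.45])
std = torch.tensor([0.225, 0.225, 0.225])

train_transform = {
    # Runs per sample, on CPU, inside each DataLoader worker.
    "post_tensor_transform": ApplyTransformToKey(
        key="video",
        transform=Compose([
            UniformTemporalSubsample(8),  # keep 8 uniformly spaced frames
            ShortSideScale(224),          # resize so the short side is 224 px
        ]),
    ),
    # Runs per batch, after the batch has been moved to the GPU/TPU.
    "per_batch_transform_on_device": ApplyTransformToKey(
        key="video",
        transform=K.VideoSequential(
            K.Normalize(mean, std),
            data_format="BCTHW",
            same_on_frame=False,
        ),
    ),
}
```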

Step 3- Load the Data

Lightning DataModules are shareable, reusable objects that encapsulate all data-related code. Flash enables you to quickly load videos and labels from various formats such as config files or folders into DataModules.

In this example, we use the Flash from_folders helper function to generate a DataModule. To use the from_folders function, videos should be organized into one folder per class, as follows:

`dir_path/<class_name>/<video_name>.ext`

Code for initializing a DataModule using Flash on the Kinetics dataset:
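A minimal sketch of that initialization, assuming the archive from Step 1 was extracted to ./data and the train_transform dictionary from Step 2 (parameter names follow the Flash docs of the time):

```python
datamodule = VideoClassificationData.from_folders(
    train_folder=os.path.join(os.getcwd(), "data/kinetics/train"),
    val_folder=os.path.join(os.getcwd(), "data/kinetics/val"),
    predict_folder=os.path.join(os.getcwd(), "data/kinetics/predict"),
    train_transform=train_transform,
    clip_sampler="uniform",  # how clips are sampled from each video
    clip_duration=2,         # length of each sampled clip, in seconds
    batch_size=8,
    num_workers=4,
)
```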

Step 4- Instantiate the VideoClassifier with a Pre-Trained Model Backbone

Training a SOTA video classifier from scratch on Kinetics can take around 2 days on 64 GPUs. Thankfully, the Flash backbone registry API makes it easy to integrate individual models, as well as entire model hubs, with Flash.

By default, the PyTorchVideo model hub is pre-registered within the VideoClassifier task’s backbone registry.

You can list any of the available backbones and get more details as follows:

https://gist.github.com/tchaton/d2f2b52f77d7ed5a537d686e124c2b60
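That lookup can be as short as the following (a sketch assuming the registry API of the Flash release at the time):

```python
# List every backbone registered for the video classification task,
# including the PyTorchVideo model hub that is pre-registered by default.
print(VideoClassifier.available_backbones())
```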

Once you pick a backbone, you can initialize it with just one line of code:

https://gist.github.com/tchaton/0e83bff7e27e125cc675bcdd31d3bf23

Note: Serializers are optional components that can be passed to Flash classifiers to configure how the task’s predictions are formatted: as Logits, Probabilities, or Labels. Read more here.
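Concretely, that one line can look like this (the x3d_xs backbone name and Labels serializer are drawn from the Flash and PyTorchVideo docs of the time; treat the exact arguments as assumptions):

```python
model = VideoClassifier(
    backbone="x3d_xs",                   # a small X3D model from PyTorchVideo
    num_classes=datamodule.num_classes,  # inferred from the training folders
    serializer=Labels(),                 # format predictions as class labels
    pretrained=True,                     # start from pre-trained weights
)
```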

Step 5- Fine-Tune the Model

Once we have a DataModule and VideoClassifier, we can configure the Flash Trainer. Out of the box, the Flash Trainer supports all the Lightning Trainer flags and tricks you know and love including distributed training, checkpointing, logging, mixed precision, SWA, quantization, early stopping, and more.

In addition to enabling you to train models from scratch, the Flash Trainer provides quick access to a host of strategies for fine-tuning pre-trained models on new data.

https://gist.github.com/tchaton/685d58357241fe74c656a90798c1e0fe

In this example, we use the NoFreeze strategy, in which the backbone is entirely trainable from the start. You can even define your own custom fine-tuning strategies.
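Put together, configuring the trainer and kicking off fine-tuning looks roughly like this (the flags are illustrative; any Lightning Trainer argument can be passed):

```python
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())
trainer.finetune(model, datamodule=datamodule, strategy="no_freeze")
```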

Conclusion

With these 5 simple steps, you can now train your own video classification model on any video dataset. Once a model is trained, it can easily be saved and reused later for making predictions.

https://gist.github.com/tchaton/6fc6ba4454a6d7aca0b790f8cbf38fe6
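A sketch of reloading a fine-tuned task and predicting on new videos (the checkpoint path is a placeholder, saved in the export sketch below):

```python
# Reload the fine-tuned task and predict on a folder of unlabeled videos.
model = VideoClassifier.load_from_checkpoint("video_classifier.pt")
predictions = model.predict(os.path.join(os.getcwd(), "data/kinetics/predict"))
print(predictions)
```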

One major advantage of using PyTorchVideo backbones is that they are optimized for mobile. With Lightning, models can be saved as classic PyTorch checkpoints or exported for optimized inference on the edge with ONNX or TorchScript.
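For example, a fine-tuned Flash task can be checkpointed and exported with the standard Lightning calls (file paths are placeholders):

```python
# Save a classic Lightning checkpoint for later reuse.
trainer.save_checkpoint("video_classifier.pt")

# Export to TorchScript for optimized edge inference
# (scripting may not support every backbone; tracing is an alternative).
model.to_torchscript(file_path="video_classifier.ts")
```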

Find the end-to-end tutorial code **here**.

We thank Tullie Murrell from the PyTorchVideo team for the support and feedback leading to this integration in Flash.

Next Steps

Now that you have trained your first model, it’s time to experiment. Flash Tasks provide full access to all the Lightning hooks, which can be overridden as you iterate towards the state of the art.

Check out the following links to learn more:

About the Authors

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies, and computational medicine. As Head of Developer Advocacy at Grid.ai, he collaborates with the machine learning community to solve real-world problems with game-changing technologies that are then documented, open-sourced, and shared with the rest of the world.

Thomas Chaton is a Research Engineering Manager at PyTorch Lightning. Previously a Senior Research Engineer at Fujitsu AI and Comcast-Sky Labs, he is also the creator of TorchPoints3D.
