DEV Community: pytorch-ignite

Introducing PyTorch-Ignite's Code Generator v0.2.0

vfdev-5 — Tue, 31 Aug 2021 10:49:35 +0000

Along with the PyTorch-Ignite 0.4.5 release, we are excited to announce the new release of the web application for generating PyTorch-Ignite's training pipelines. This blog post is an overview of the key features and updates of the Code Generator v0.2.0 project release.

Deep Learning As a Routine

In deep learning applications, neural networks are typically accompanied by code to preprocess the input and output data, visualize the results, define proper training and evaluation pipelines, and more. A significant part of this supporting code consists of reusable components, like data loaders, training loops, logging, and tracking. Therefore, deep learning practitioners usually organize their boilerplate codebases into collections of reusable components to speed up development.

PyTorch-Ignite is one such practical solution, a high-level library from the PyTorch ecosystem for training neural networks designed to simplify workflow development while maintaining maximum control, flexibility, and reproducibility. PyTorch-Ignite feels like a natural extension to PyTorch.

Ignite Your Training Pipelines

PyTorch-Ignite's Code Generator is an open-source tool developed to jump-start your training scripts, carefully designed by PyTorch-Ignite's contributors to promote PyTorch-Ignite's best practices. The application has a user-friendly and intuitive web interface, simple enough for day-to-day use, and it is an excellent choice for quickly generating custom templates for training PyTorch models.

In this release, we moved to a new application development stack to improve the user experience. For the UI, we switched to a JavaScript stack. The generated PyTorch and PyTorch-Ignite code remains the same.

Getting Started

The best way to "ignite your training pipeline" is to visit Code Generator's homepage and select your task's template by clicking on the "Getting Started" button.

You can choose a template from a list of templates located on the left in the Templates tab. The app will start to render the template with the preconfigured default settings. You will see all the generated files with the rendered code on the right in different tabs as in a regular IDE. The current state of the configurations is reflected partially in the config YAML file.

Currently, we offer four customizable templates for widely used deep learning tasks: Vision classification and segmentation, Text classification, and DCGAN.

Start adjusting the code in the template by visiting different tabs on the left side:

Training: To turn on Distributed training
Handlers: To set up Checkpointing, Termination on NaNs, Early Stopping, etc
Loggers: To configure Logging

Once you choose the appropriate settings, press the "Download" or "Open in Colab" button at the top to export generated code as a zip archive or a notebook, and follow any given additional steps. The resulting archive contains generated files bundled together. The requirements.txt file contains all the required dependencies, and the README contains all the necessary information for launching the script.

You are now ready to add in your data and model and run the code!

I Want To Contribute!

We encourage open source contributors from both frontend and data science communities to collaborate on the project. If you are interested, please visit the Contribution Guide. If you have any questions, do not hesitate to ask them on our Discord. Here are some good first issues.

Next Steps

In future releases, we plan to extend our template store and add more features, for example, configuration systems, data loaders, datasets and models, optimizers, and schedulers. We will continue improving the app's reliability and usability. To stay in touch, follow us on Twitter and Facebook. We would love to get your feedback on the project.

Acknowledgements

The development of this project is supported by a NumFOCUS Small Development Grant. We are very grateful to them for this support!

Distributed Training Made Easy with PyTorch-Ignite

vfdev-5 — Tue, 10 Aug 2021 23:06:33 +0000

Writing backend-agnostic distributed code that supports different platforms, hardware configurations (GPUs, TPUs) and communication frameworks is tedious. In this blog post, we will discuss how PyTorch-Ignite solves this problem with minimal code changes.

Prerequisites

This blog assumes you have some knowledge about:

PyTorch's distributed package, the backends and collective functions it provides. In this blog, we will focus on distributed data parallel code.
PyTorch-Ignite. Refer to this blog for an overview.

Introduction

PyTorch-Ignite's ignite.distributed (idist) submodule, introduced in v0.4.0 (July 2020), quickly turns single-process code into its data-distributed version.

Thus, you will now be able to run the same version of the code across all supported backends seamlessly:

backends from native torch distributed configuration: nccl, gloo, mpi.
Horovod framework with gloo or nccl communication backend.
XLA on TPUs via pytorch/xla.

In this blog post we will compare PyTorch-Ignite's API with native torch distributed code and highlight the differences and the ease of use of the former. We will also show how Ignite's auto_* methods automatically make your code compatible with the aforementioned distributed backends so that you only have to bring your own model, optimizer and data loader objects.

Code snippets, as well as commands for running all the scripts, are provided in a separate repository.

We will also cover several ways of spawning processes, via the native torch.multiprocessing.spawn and via several distributed launchers (torch.distributed.launch, horovodrun and slurm), to highlight how PyTorch-Ignite's idist handles each of them without any changes to the code.

More information on the launcher experiments can be found here.

PyTorch-Ignite Unified Distributed API

Different distributed backends normally require different code. This can be tedious, especially if you would like to run your code on different hardware configurations. PyTorch-Ignite's idist does this work for you through its high-level helper methods.

Focus on the helper `auto_*` methods

auto_model()

This method adapts the logic for non-distributed and available distributed configurations. Here are the equivalent code snippets for distributed model instantiation:

[Figure: side-by-side code snippets for PyTorch-Ignite, PyTorch DDP, Horovod and Torch XLA]

Additionally, it is compatible with NVIDIA/apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model = idist.auto_model(model)

and with Torch native AMP:

from torch.cuda.amp import autocast

model = idist.auto_model(model)

with autocast():
    y_pred = model(x)

auto_optim()

This method adapts the optimizer logic for non-distributed and available distributed configurations seamlessly. Here are the equivalent code snippets for distributed optimizer instantiation:

[Figure: side-by-side code snippets for PyTorch-Ignite, PyTorch DDP, Horovod and Torch XLA]

auto_dataloader()

This method adapts the data loading logic for non-distributed and available distributed configurations seamlessly on target devices.

Additionally, auto_dataloader() automatically scales the batch size according to the distributed configuration context resulting in a general way of loading sample batches on multiple devices.
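
As a rough illustration of that scaling, the sketch below splits a global batch size and a worker count across processes. The function name per_process_loader_kwargs and the exact rounding rules are assumptions made for this example; it is not a copy of the library's implementation.

```python
def per_process_loader_kwargs(batch_size, num_workers, world_size, nproc_per_node):
    """Split a global batch size and worker count across processes (illustrative only)."""
    if world_size > 1 and batch_size >= world_size:
        # Each process loads an equal share of the global batch.
        batch_size = batch_size // world_size
    if nproc_per_node > 1:
        # Share one node's data-loading workers between its processes (ceil division).
        num_workers = (num_workers + nproc_per_node - 1) // nproc_per_node
    return {"batch_size": batch_size, "num_workers": num_workers}


# 4 processes on one node, global batch of 128 samples, 8 workers in total
print(per_process_loader_kwargs(batch_size=128, num_workers=8, world_size=4, nproc_per_node=4))
```

With these made-up numbers every process would load 32 samples per step, so the effective global batch size stays at 128 regardless of how many processes are used.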

Here are the equivalent code snippets for the distributed data loading step:

[Figure: side-by-side code snippets for PyTorch-Ignite, PyTorch DDP, Horovod and Torch XLA]

Additionally, idist provides collective operations like all_reduce, all_gather, and broadcast that can be used with all supported distributed frameworks. Please see our documentation for more details.
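
For readers new to collective operations, the following single-process sketch mimics what each rank ends up holding after the three operations named above. The simulate_* functions are hypothetical stand-ins written for this example; they do not call idist or any communication backend.

```python
def simulate_all_reduce(values_per_rank, op=sum):
    """After all_reduce, every rank holds the same reduced value."""
    reduced = op(values_per_rank)
    return [reduced for _ in values_per_rank]


def simulate_all_gather(values_per_rank):
    """After all_gather, every rank holds the values of all ranks."""
    return [list(values_per_rank) for _ in values_per_rank]


def simulate_broadcast(values_per_rank, src=0):
    """After broadcast, every rank holds the value of rank `src`."""
    return [values_per_rank[src] for _ in values_per_rank]


# One made-up value per simulated rank, e.g. correct predictions counted locally
correct_per_rank = [3, 5, 2, 6]
print(simulate_all_reduce(correct_per_rank))   # [16, 16, 16, 16]
print(simulate_all_gather(correct_per_rank)[0])  # [3, 5, 2, 6]
print(simulate_broadcast(correct_per_rank, src=1))  # [5, 5, 5, 5]
```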

Examples

The code snippets below highlight the API specifics of each distributed backend on the same use case, compared with the idist API. PyTorch native code is available for DDP, Horovod, and for XLA/TPU devices.

PyTorch-Ignite's unified code snippet can be run with the standard PyTorch backends like gloo and nccl and also with Horovod and XLA for TPU devices. Note that the code is less verbose, yet the user still has full control of the training loop.

The following examples are introductory. For a more robust, production-grade example that uses PyTorch-Ignite, refer here.

The complete source code of these experiments can be found here.

PyTorch-Ignite - Torch native Distributed Data Parallel - Horovod - XLA/TPUs

[Figure: side-by-side code snippets for PyTorch-Ignite, PyTorch DDP, Horovod and Torch XLA, each with a link to its source code]

Note: You can also mix the usage of idist with other distributed APIs as below:

dist.init_process_group(backend, store=..., world_size=world_size, rank=rank)
rank = idist.get_rank()
ws = idist.get_world_size()
model = idist.auto_model(model)
dist.destroy_process_group()

Running Distributed Code

PyTorch-Ignite's idist also unifies the way distributed code is launched and makes the distributed configuration setup easier with the ignite.distributed.launcher.Parallel (idist Parallel) context manager.

This context manager can either spawn nproc_per_node (passed as a script argument) child processes and initialize a process group according to the provided backend, or work with tools like torch.distributed.launch, slurm and horovodrun by initializing the process group from the backend argument alone.

With `torch.multiprocessing.spawn`

In this case idist Parallel uses the native torch.multiprocessing.spawn method under the hood to run the distributed configuration. Here nproc_per_node is passed as a spawn argument.

Running multiple distributed configurations with the same code. Source: ignite_idist.py:

# Running with gloo
python -u ignite_idist.py --nproc_per_node 2 --backend gloo

# Running with nccl
python -u ignite_idist.py --nproc_per_node 2 --backend nccl

# Running with horovod with gloo controller (gloo or nccl support)
python -u ignite_idist.py --backend horovod --nproc_per_node 2

# Running on xla/tpu
python -u ignite_idist.py --backend xla-tpu --nproc_per_node 8 --batch_size 32
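
To make the role of these flags more concrete, here is a minimal sketch of how a script could parse them. The flag names come from the commands above; the parse_args helper, the defaults and the help texts are assumptions for illustration and may differ from the actual ignite_idist.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the launch flags used in the commands above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of a distributed training CLI")
    parser.add_argument("--backend", type=str, default=None,
                        help="e.g. gloo, nccl, horovod, xla-tpu")
    parser.add_argument("--nproc_per_node", type=int, default=None,
                        help="number of processes to spawn per node")
    parser.add_argument("--batch_size", type=int, default=32)  # default value is assumed
    return parser.parse_args(argv)


args = parse_args(["--nproc_per_node", "2", "--backend", "gloo"])
print(args.backend, args.nproc_per_node, args.batch_size)

# The parsed values would then be handed to the launcher, for example:
#   with idist.Parallel(backend=args.backend, nproc_per_node=args.nproc_per_node) as parallel:
#       parallel.run(training_function, config)
```

Leaving nproc_per_node unset by default matches the launcher-based runs in the next sections, where the external tool already creates the processes.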

With Distributed launchers

PyTorch-Ignite's idist Parallel context manager is also compatible with multiple distributed launchers.

With torch.distributed.launch

Here we are using the torch.distributed.launch script to spawn the processes:

python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend gloo

With horovodrun

horovodrun -np 4 -H hostname1:2,hostname2:2 python ignite_idist.py --backend horovod

Note: To run this example without going through the installation procedure, you can pull one of PyTorch-Ignite's Docker images with Horovod pre-installed. It includes Horovod with the gloo controller and nccl support.

docker run --gpus all -it -v $PWD:/project pytorchignite/hvd-vision:latest /bin/bash
cd project

With slurm

The same result can be achieved by using slurm without any modification to the code:

srun --nodes=2 \
    --ntasks-per-node=2 \
    --job-name=pytorch-ignite \
    --time=00:01:00 \
    --partition=gpgpu \
    --gres=gpu:2 \
    --mem=10G \
    python ignite_idist.py --backend nccl

or using sbatch script.bash with the script file script.bash:

#!/bin/bash
#SBATCH --job-name=pytorch-ignite
#SBATCH --output=slurm_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:01:00
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:2
#SBATCH --mem=10G

srun python ignite_idist.py --backend nccl
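
Under slurm, each task started by srun can find its place in the job from environment variables. The sketch below reads three standard ones (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS); the helper name and the example values are made up, and the way idist itself consumes the slurm environment may differ.

```python
def slurm_process_info(env):
    """Read rank information from slurm-style environment variables."""
    return {
        "rank": int(env["SLURM_PROCID"]),         # global rank of this task
        "local_rank": int(env["SLURM_LOCALID"]),  # rank of this task on its node
        "world_size": int(env["SLURM_NTASKS"]),   # total number of tasks
    }


# Made-up values for the last task of a 2 nodes x 2 tasks job;
# inside a real job one would pass os.environ instead.
example_env = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1", "SLURM_NTASKS": "4"}
print(slurm_process_info(example_env))
```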

Closing Remarks

As we saw in the examples above, PyTorch-Ignite makes it much easier to manage multiple configurations and specifications for distributed computing. In just a few lines we can parallelize and execute code wherever it needs to run while maintaining control and simplicity.

References

idist-snippets: complete code used in this post.
why-ignite: examples with distributed data parallel: native pytorch, pytorch-ignite, slurm.
CIFAR10 example of distributed training on CIFAR10 with multiple configurations: 1 or multiple GPUs, multiple nodes and GPUs, TPUs.

Next steps

To learn more about PyTorch-Ignite, please check out our website: https://pytorch-ignite.ai and our tutorials and how-to guides.

We also provide the PyTorch-Ignite code-generator application: https://code-generator.pytorch-ignite.ai/ to start working on tasks without rewriting everything from scratch.

PyTorch-Ignite's code is available on GitHub: https://github.com/pytorch/ignite . The project is a community effort, and everyone is welcome to contribute and join the contributors community, no matter your background and skills!

Keep updated with all PyTorch-Ignite news by following us on Twitter and Facebook.

Introduction to PyTorch-Ignite

vfdev-5 — Tue, 10 Aug 2021 22:39:51 +0000

This post is a general introduction to PyTorch-Ignite. It intends to give a brief but illustrative overview of what PyTorch-Ignite can offer to Deep Learning enthusiasts, professionals and researchers. Following the same philosophy as PyTorch, PyTorch-Ignite aims to stay simple, flexible and extensible while remaining performant and scalable.

Throughout this tutorial, we will introduce the basic concepts of PyTorch-Ignite with the training and evaluation of a MNIST classifier as a beginner application case. We also assume that the reader is familiar with PyTorch.

PyTorch-Ignite: What and Why?

PyTorch-Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

PyTorch-Ignite is designed to be at the crossroads of high-level Plug & Play features and under-the-hood expansion possibilities.
PyTorch-Ignite aims to improve the deep learning community's technical skills by promoting best practices.
Things are not hidden behind a divine tool that does everything, but remain within the reach of users.

PyTorch-Ignite takes a "Do-It-Yourself" approach as research is unpredictable and it is important to capture its requirements without blocking things.

🔥 PyTorch + Ignite 🔥

PyTorch-Ignite wraps native PyTorch abstractions such as Modules, Optimizers, and DataLoaders in thin abstractions which allow your models to be separated from their training framework completely. This is achieved by inverting control using an abstraction known as the Engine. The Engine is responsible for running an arbitrary function, typically a training or evaluation function, and emitting events along the way.

A built-in event system represented by the Events class ensures Engine's flexibility, thus facilitating interaction on each step of the run.
With this approach users can completely customize the flow of events during the run.

In summary, PyTorch-Ignite is

Extremely simple engine and event system = Training loop abstraction
Out-of-the-box metrics to easily evaluate models
Built-in handlers to compose training pipelines, save artifacts and log parameters and metrics

Additional benefits of using PyTorch-Ignite are

Less code than pure PyTorch while ensuring maximum control and simplicity
More modular code

[Figure: side-by-side code comparison of PyTorch-Ignite and PyTorch]

About the design of PyTorch-Ignite

PyTorch-Ignite allows you to compose your application without being focused on a super multi-purpose object, but rather on weakly coupled components allowing advanced customization.

The design of the library is guided by:

Anticipating new software or use-cases to come in the future without centralizing everything in a single class.
Avoiding configurations with a ton of parameters that are complicated to manage and maintain.
Providing tools targeted to maximizing cohesion and minimizing coupling.
Keeping it simple.

Quick-start example

In this section we will use PyTorch-Ignite to build and train a classifier of the well-known MNIST dataset. This simple example will introduce the principal concepts behind PyTorch-Ignite.

For additional information and details about the API, please refer to the project's documentation.

pip install pytorch-ignite

Common PyTorch code

First, we define our model, training and validation datasets, optimizer and loss function:

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader

from torchvision.transforms import Compose, ToTensor, Normalize
from torchvision.datasets import MNIST

# transform to normalize the data
transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])

# Download and load the training data
trainset = MNIST("data", download=True, train=True, transform=transform)
train_loader = DataLoader(trainset, batch_size=128, shuffle=True)

# Download and load the test data
validationset = MNIST("data", train=False, transform=transform)
val_loader = DataLoader(validationset, batch_size=256, shuffle=False)

# Define a CNN model class (any architecture could be used here)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)

device = "cuda"

# Define a model and move it to the CUDA device
model = Net().to(device)

# Define a NLL loss
criterion = nn.NLLLoss()

# Define a SGD optimizer
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.8)

The code above is pure PyTorch: it is typically user-defined and is required for any pipeline.
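
One detail of the model above that may look arbitrary is the 320 input features of fc1. The short calculation below, written as a sketch with two helper functions introduced only for this example, follows a 28x28 MNIST image through the two convolution and pooling stages.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    """Output size of max pooling whose stride equals its kernel size."""
    return size // kernel_size


size = 28                              # MNIST images are 28x28
size = pool_out(conv_out(size, 5), 2)  # conv1 + pooling: 28 -> 24 -> 12
size = pool_out(conv_out(size, 5), 2)  # conv2 + pooling: 12 -> 8 -> 4
flat_features = 20 * size * size       # conv2 produces 20 channels
print(flat_features)  # 320
```

This is why the forward pass reshapes with x.view(-1, 320) before the first linear layer.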

Trainer and evaluator's setup

A model's trainer is an engine that loops multiple times over the training dataset and updates the model parameters. Let's see how we define such a trainer using PyTorch-Ignite. To do this, PyTorch-Ignite introduces the generic class Engine, an abstraction that loops over the provided data, executes a processing function and returns a result. The only argument needed to construct the trainer is a train_step function.

from ignite.engine import Engine

def train_step(engine, batch):
    x, y = batch
    x = x.to(device)
    y = y.to(device)

    model.train()
    y_pred = model(x)
    loss = criterion(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss

# Define a trainer engine
trainer = Engine(train_step)

Please note that the train_step function must accept engine and batch arguments. In the example above, engine is not used inside train_step, but we can easily imagine a use-case where we would like to fetch certain information like the current iteration, epoch or custom variables from the engine.

Similarly, model evaluation can be done with an engine that runs a single time over the validation dataset and computes metrics.

def validation_step(engine, batch):
    model.eval()
    with torch.no_grad():
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        y_pred = model(x)

        return y_pred, y

evaluator = Engine(validation_step)

This allows the construction of training logic from the simplest to the most complicated scenarios.

The type of output of the process functions (i.e. loss or y_pred, y in the above examples) is not restricted. These functions can return anything the user wants. The output is stored in the engine's state as engine.state.output and can be used further for any type of processing.

Events and Handlers

To improve the engine’s flexibility, a configurable event system is introduced to facilitate interaction at each step of the run. Namely, Engine allows adding handlers on various Events that are triggered during the run. When an event is triggered, the attached handlers (named functions, lambdas, class functions) are executed. Here is a schema of when the built-in events are triggered by default:

fire_event(Events.STARTED)
while epoch < max_epochs:
    fire_event(Events.EPOCH_STARTED)
    # run once on data
    for batch in data:
        fire_event(Events.ITERATION_STARTED)

        output = process_function(batch)

        fire_event(Events.ITERATION_COMPLETED)
    fire_event(Events.EPOCH_COMPLETED)
fire_event(Events.COMPLETED)

Note that each engine (i.e. trainer and evaluator) has its own event system, which allows each one to define its own processing logic.
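
To see the schema above in action without any dependency, here is a toy engine written for this post. MiniEngine, its string event names and its attributes are simplifications invented for illustration; the real Engine and Events classes offer far more (state object, filters, termination, and so on).

```python
from collections import defaultdict


class MiniEngine:
    """Toy stand-in for an engine: runs a function on each batch and fires events."""

    def __init__(self, process_function):
        self.process_function = process_function
        self.handlers = defaultdict(list)
        self.epoch = 0
        self.iteration = 0
        self.output = None

    def on(self, event_name):
        """Decorator that registers a handler for `event_name`."""
        def decorator(handler):
            self.handlers[event_name].append(handler)
            return handler
        return decorator

    def fire_event(self, event_name):
        for handler in self.handlers[event_name]:
            handler(self)

    def run(self, data, max_epochs=1):
        self.fire_event("STARTED")
        while self.epoch < max_epochs:
            self.epoch += 1
            self.fire_event("EPOCH_STARTED")
            for batch in data:
                self.iteration += 1
                self.fire_event("ITERATION_STARTED")
                self.output = self.process_function(self, batch)
                self.fire_event("ITERATION_COMPLETED")
            self.fire_event("EPOCH_COMPLETED")
        self.fire_event("COMPLETED")


mini_trainer = MiniEngine(lambda engine, batch: batch * 2)
log = []


@mini_trainer.on("EPOCH_COMPLETED")
def record_epoch(engine):
    log.append((engine.epoch, engine.iteration, engine.output))


mini_trainer.run([1, 2, 3], max_epochs=2)
print(log)  # [(1, 3, 6), (2, 6, 6)]
```

The loop itself never changes; all customization happens by registering handlers, which is the inversion of control described earlier.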

Using Events and handlers, it is possible to completely customize the engine's runs in a very intuitive way:

from ignite.engine import Events

# Show a message when the training begins
@trainer.on(Events.STARTED)
def start_message():
    print("Start training!")

# A handler can be whatever you want, here a lambda!
trainer.add_event_handler(
    Events.COMPLETED,
    lambda _: print("Training completed!")
)

# Run evaluator on val_loader each time the trainer completes an epoch
@trainer.on(Events.EPOCH_COMPLETED)
def run_validation():
    evaluator.run(val_loader)

In the code above, the run_validation function is attached to the trainer and will be triggered at each completed epoch to launch the model's validation with evaluator. This shows that engines can be embedded to create complex pipelines.

Handlers offer more flexibility than classical callbacks as they can be any function: e.g., a lambda, a simple function, a class method, etc. Thus, we are not required to inherit from an interface and override its abstract methods, which could unnecessarily bulk up your code and its complexity.

The possibilities of customization are endless as PyTorch-Ignite allows you to get hold of your application workflow. As mentioned before, there is nothing magic or fully automated in PyTorch-Ignite.

Model evaluation metrics

Metrics are another nice example of what the handlers for PyTorch-Ignite are and how to use them. In our example, we use the built-in metrics Accuracy and Loss.

from ignite.metrics import Accuracy, Loss

# Accuracy and loss metrics are defined
val_metrics = {
  "accuracy": Accuracy(),
  "loss": Loss(criterion)
}

# Attach metrics to the evaluator
for name, metric in val_metrics.items():
    metric.attach(evaluator, name)

PyTorch-Ignite metrics can be elegantly combined with each other.

from ignite.metrics import Precision, Recall

# Build F1 score
precision = Precision(average=False)
recall = Recall(average=False)
F1 = (precision * recall * 2 / (precision + recall)).mean()

# and attach it to evaluator
F1.attach(evaluator, "f1")
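
The arithmetic behind this combined metric is easy to check by hand. The sketch below applies the same formula to plain Python lists of per-class precision and recall; the function name and the numbers are made up for this example.

```python
def f1_from_per_class(precision, recall):
    """Per-class F1 (harmonic mean of precision and recall), averaged over classes."""
    per_class = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
    return sum(per_class) / len(per_class)


precision = [0.5, 1.0]  # made-up per-class precision
recall = [0.5, 0.5]     # made-up per-class recall
print(f1_from_per_class(precision, recall))
```

Here the per-class F1 values are 0.5 and 2/3, so the averaged score is 7/12. Note that this simple version assumes precision + recall is never zero for any class.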

To make general things even easier, helper methods are available for the creation of a supervised Engine as above. Thus, let's define another evaluator applied to the training dataset in this way.

from ignite.engine import create_supervised_evaluator

# Define another evaluator with default validation function and attach metrics
train_evaluator = create_supervised_evaluator(model, metrics=val_metrics, device="cuda")

# Run train_evaluator on train_loader each time the trainer completes an epoch
@trainer.on(Events.EPOCH_COMPLETED)
def run_train_validation():
    train_evaluator.run(train_loader)

The reason why we want to have two separate evaluators (evaluator and train_evaluator) is that they can have different attached handlers and logic to perform. For example, if we would like to store the best model defined by the validation metric value, this role is delegated to evaluator, which computes metrics over the validation dataset.

Common training handlers

We now have a trainer that calls the evaluators evaluator and train_evaluator at every completed epoch. Thus, each evaluator will run and compute the corresponding metrics. In addition, it would be very helpful to display those metrics.

Using the customization potential of the engine's system, we can add simple handlers for this logging purpose:

@evaluator.on(Events.COMPLETED)
def log_validation_results():
    metrics = evaluator.state.metrics
    print("Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f} Avg F1: {:.2f}"
          .format(trainer.state.epoch, metrics["accuracy"], metrics["loss"], metrics["f1"]))

@train_evaluator.on(Events.COMPLETED)
def log_train_results():
    metrics = train_evaluator.state.metrics
    print("  Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
          .format(trainer.state.epoch, metrics["accuracy"], metrics["loss"]))

Here we attached the log_validation_results and log_train_results handlers on Events.COMPLETED since evaluator and train_evaluator each run a single epoch over their dataset.

Let's see how to add some other helpful features to our application.

PyTorch-Ignite provides a ProgressBar handler to show an engine's progression.

from ignite.contrib.handlers import ProgressBar

ProgressBar().attach(trainer, output_transform=lambda x: {'batch loss': x})

The ModelCheckpoint handler can be used to periodically save objects which have a state_dict attribute.

from ignite.handlers import ModelCheckpoint, global_step_from_engine

# Score function to select relevant metric, here f1
def score_function(engine):
    return engine.state.metrics["f1"]

# Checkpoint to store n_saved best models wrt score function
model_checkpoint = ModelCheckpoint(
    "quick-start-mnist-output",
    n_saved=2,
    filename_prefix="best",
    score_function=score_function,
    score_name="f1",
    global_step_transform=global_step_from_engine(trainer),
)

# Save the model (if relevant) every epoch completed of evaluator
evaluator.add_event_handler(Events.COMPLETED, model_checkpoint, {"model": model})
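
The idea of keeping only the n_saved best models according to a score can be sketched in a few lines. BestKeeper below is a hypothetical class written for this illustration, with made-up file names; it is not how ModelCheckpoint is implemented and it does not write any files.

```python
class BestKeeper:
    """Keeps the names of the `n_saved` best-scoring checkpoints (illustrative only)."""

    def __init__(self, n_saved):
        self.n_saved = n_saved
        self.saved = []  # list of (score, name), worst score first

    def __call__(self, score, name):
        """Return True if the candidate was kept, False if it was rejected."""
        if len(self.saved) < self.n_saved or score > self.saved[0][0]:
            self.saved.append((score, name))
            self.saved.sort()
            if len(self.saved) > self.n_saved:
                self.saved.pop(0)  # a real handler would also delete this file
            return True
        return False


keeper = BestKeeper(n_saved=2)
for epoch, f1 in enumerate([0.94, 0.96, 0.95, 0.98], start=1):
    keeper(f1, "best_model_{}_f1={:.2f}".format(epoch, f1))

print([name for _, name in keeper.saved])
# ['best_model_2_f1=0.96', 'best_model_4_f1=0.98']
```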

PyTorch-Ignite provides wrappers around modern experiment tracking tools. For example, the TensorboardLogger handler allows logging metric results, the model's and optimizer's parameters, gradients, and more to TensorBoard during training and validation.

from ignite.contrib.handlers import TensorboardLogger

# Define a Tensorboard logger
tb_logger = TensorboardLogger(log_dir="quick-start-mnist-output")

# Attach handler to plot trainer's loss every 100 iterations
tb_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED(every=100),
    tag="training",
    output_transform=lambda loss: {"batchloss": loss},
)

# Attach handler to dump each evaluator's metrics at every completed epoch
# (the loop variable is named eval_engine to avoid shadowing `evaluator`)
for tag, eval_engine in [("training", train_evaluator), ("validation", evaluator)]:
    tb_logger.attach_output_handler(
        eval_engine,
        event_name=Events.EPOCH_COMPLETED,
        tag=tag,
        metric_names="all",
        global_step_transform=global_step_from_engine(trainer),
    )

It is possible to extend the use of the TensorBoard logger very simply by integrating user-defined functions. For example, here is how to display images and predictions during training:

import matplotlib.pyplot as plt

# Store predictions and scores using matplotlib
def predictions_gt_images_handler(engine, logger, *args, **kwargs):
    x, _ = engine.state.batch
    y_pred, y = engine.state.output
    # y_pred is log softmax value
    num_x = num_y = 8
    le = num_x * num_y
    probs, preds = torch.max(torch.exp(y_pred[:le]), dim=1)
    fig = plt.figure(figsize=(20, 20))
    for idx in range(le):
        ax = fig.add_subplot(num_x, num_y, idx + 1, xticks=[], yticks=[])
        ax.imshow(x[idx].squeeze(), cmap="Greys")
        ax.set_title("{0} {1:.1f}% (label: {2})".format(
            preds[idx],
            probs[idx] * 100.0,
            y[idx]),
            color=("green" if preds[idx] == y[idx] else "red")
        )
    logger.writer.add_figure('predictions vs actuals', figure=fig, global_step=trainer.state.epoch)

# Attach custom function to evaluator at first iteration
tb_logger.attach(
    evaluator,
    log_handler=predictions_gt_images_handler,
    event_name=Events.ITERATION_COMPLETED(once=1),
)
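
The handler above relies on the fact that exponentiating log-softmax outputs gives back class probabilities. The following dependency-free sketch shows that step on a single made-up score vector; the log_softmax helper is written here only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


log_probs = log_softmax([2.0, 0.5, -1.0])             # made-up scores for 3 classes
probs = [math.exp(v) for v in log_probs]              # back to probabilities
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
print(pred, "{:.1f}%".format(probs[pred] * 100.0))
```

torch.max(torch.exp(y_pred), dim=1) does the same for a whole batch at once, returning both the top probability and the predicted class index.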

All that is left to do now is to run the trainer on data from train_loader for a number of epochs.

trainer.run(train_loader, max_epochs=5)

# Once everything is done, let's close the logger
tb_logger.close()

Start training!

Validation Results - Epoch: 1  Avg accuracy: 0.94 Avg loss: 0.20 Avg F1: 0.94
  Training Results - Epoch: 1  Avg accuracy: 0.94 Avg loss: 0.21

Validation Results - Epoch: 2  Avg accuracy: 0.96 Avg loss: 0.12 Avg F1: 0.96
  Training Results - Epoch: 2  Avg accuracy: 0.96 Avg loss: 0.13

Validation Results - Epoch: 3  Avg accuracy: 0.97 Avg loss: 0.10 Avg F1: 0.97
  Training Results - Epoch: 3  Avg accuracy: 0.97 Avg loss: 0.10

Validation Results - Epoch: 4  Avg accuracy: 0.98 Avg loss: 0.07 Avg F1: 0.98
  Training Results - Epoch: 4  Avg accuracy: 0.97 Avg loss: 0.09

Validation Results - Epoch: 5  Avg accuracy: 0.98 Avg loss: 0.07 Avg F1: 0.98
  Training Results - Epoch: 5  Avg accuracy: 0.98 Avg loss: 0.08
Training completed!

We can inspect results using tensorboard. We can observe two tabs "Scalars" and "Images".

%load_ext tensorboard

%tensorboard --logdir=.

5 takeaways

Almost any training logic can be coded as a train_step method and a trainer built using this method.
The essence of the library is the Engine class that loops a given number of times over a dataset and executes a processing function.
A highly customizable event system simplifies interaction with the engine on each step of the run.
PyTorch-Ignite provides a set of built-in handlers and metrics for common tasks.
PyTorch-Ignite is easy to extend.

Advanced features

In this section we would like to present some advanced features of PyTorch-Ignite for experienced users. We will cover events, handlers and metrics in more detail, as well as distributed computations on GPUs and TPUs. Feel free to skip this section now and come back later if you are a beginner.

Power of Events & Handlers

We have seen throughout the quick-start example that events and handlers are well suited to executing any number of functions whenever you wish. In addition, we provide several ways to extend this even further:

Built-in events filtering
Stacking events to share the action
Adding custom events to go beyond built-in standard events

Let's look at these features in more detail.

Built-in events filtering

Users can filter events so that a handler is triggered only on some of them. Let's create a dummy trainer:

from ignite.engine import Engine, Events

trainer = Engine(lambda e, batch: None)

Let's consider a use-case where we would like to train a model and periodically run its validation on several development datasets, e.g. devset1 and devset2:

# We run the validation on devset1 every 5 epochs
@trainer.on(Events.EPOCH_COMPLETED(every=5))
def run_validation1():
    print("Epoch {}: Validation on devset 1".format(trainer.state.epoch))
    # evaluator.run(devset1)  # commented out for demo purposes

# We run another validation on devset2 every 10 epochs
@trainer.on(Events.EPOCH_COMPLETED(every=10))
def run_validation2():
    print("Epoch {}: Validation on devset 2".format(trainer.state.epoch))
    # evaluator.run(devset2)  # commented out for demo purposes

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=50)

Epoch 5: Validation on devset 1
Epoch 10: Validation on devset 1
Epoch 10: Validation on devset 2
Epoch 15: Validation on devset 1
Epoch 20: Validation on devset 1
Epoch 20: Validation on devset 2
Epoch 25: Validation on devset 1
Epoch 30: Validation on devset 1
Epoch 30: Validation on devset 2
Epoch 35: Validation on devset 1
Epoch 40: Validation on devset 1
Epoch 40: Validation on devset 2
Epoch 45: Validation on devset 1
Epoch 50: Validation on devset 1
Epoch 50: Validation on devset 2

Let's now consider another situation where we would like to make a single change once we reach a certain epoch or iteration. For example, let's change the training dataset on the 5-th epoch from low resolution images to high resolution images:

def train_step(e, batch):
    print("Epoch {} - {} : batch={}".format(e.state.epoch, e.state.iteration, batch))

trainer = Engine(train_step)

small_res_data = [0, 1, 2, ]
high_res_data = [10, 11, 12]

# We run the following handler once, when the 5th epoch starts
@trainer.on(Events.EPOCH_STARTED(once=5))
def change_train_dataset():
    print("Epoch {}: Change training dataset".format(trainer.state.epoch))
    trainer.set_data(high_res_data)

trainer.run(small_res_data, max_epochs=10)

Epoch 1 - 1 : batch=0
Epoch 1 - 2 : batch=1
Epoch 1 - 3 : batch=2
Epoch 2 - 4 : batch=0
Epoch 2 - 5 : batch=1
Epoch 2 - 6 : batch=2
Epoch 3 - 7 : batch=0
Epoch 3 - 8 : batch=1
Epoch 3 - 9 : batch=2
Epoch 4 - 10 : batch=0
Epoch 4 - 11 : batch=1
Epoch 4 - 12 : batch=2
Epoch 5: Change training dataset
Epoch 5 - 13 : batch=10
Epoch 5 - 14 : batch=11
Epoch 5 - 15 : batch=12
Epoch 6 - 16 : batch=10
Epoch 6 - 17 : batch=11
Epoch 6 - 18 : batch=12
Epoch 7 - 19 : batch=10
Epoch 7 - 20 : batch=11
Epoch 7 - 21 : batch=12
Epoch 8 - 22 : batch=10
Epoch 8 - 23 : batch=11
Epoch 8 - 24 : batch=12
Epoch 9 - 25 : batch=10
Epoch 9 - 26 : batch=11
Epoch 9 - 27 : batch=12
Epoch 10 - 28 : batch=10
Epoch 10 - 29 : batch=11
Epoch 10 - 30 : batch=12

Let's now consider another situation where we would like to trigger a handler with completely custom logic. For example, we would like to dump model gradients if the training loss satisfies a certain condition:

# For simplicity, let's predefine the training losses
train_losses = [2.0, 1.9, 1.7, 1.5, 1.6, 1.2, 0.9, 0.8, 1.0, 0.8, 0.7, 0.4, 0.2, 0.1, 0.1, 0.01]

trainer = Engine(lambda e, batch: train_losses[e.state.iteration - 1])

# Custom logic that decides when the handler should be executed
def custom_event_filter(trainer, event):
    return 0.1 < trainer.state.output < 1.0

# We run the following handler every iteration completed under our custom_event_filter condition:
@trainer.on(Events.ITERATION_COMPLETED(event_filter=custom_event_filter))
def dump_model_grads():
    print("{} - loss={}: dump model grads".format(trainer.state.iteration, trainer.state.output))

train_data = [0, ]
trainer.run(train_data, max_epochs=len(train_losses))

7 - loss=0.9: dump model grads
8 - loss=0.8: dump model grads
10 - loss=0.8: dump model grads
11 - loss=0.7: dump model grads
12 - loss=0.4: dump model grads
13 - loss=0.2: dump model grads
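
To make the three kinds of filtering above more concrete, here is a minimal plain-Python sketch that treats each filter as a predicate over the event counter. The helpers `every`, `once` and `fired` are hypothetical names introduced only for this illustration; they are not PyTorch-Ignite functions, and the sketch only assumes that `every=n` fires when the counter is a multiple of `n` and `once=n` fires when the counter equals `n`, which matches the outputs shown above.

```python
def every(n):
    """Predicate that holds on every n-th occurrence of an event."""
    return lambda count: count % n == 0


def once(n):
    """Predicate that holds only on the n-th occurrence of an event."""
    return lambda count: count == n


def fired(predicate, total):
    """Return the 1-based occurrence counts for which the predicate holds."""
    return [count for count in range(1, total + 1) if predicate(count)]


# Epochs at which the two validation handlers would run (50 epochs in total)
validation1_epochs = fired(every(5), 50)
validation2_epochs = fired(every(10), 50)

# Epoch at which the dataset would be switched (10 epochs in total)
switch_epochs = fired(once(5), 10)

# Custom condition from the last example, applied to the predefined losses
train_losses = [2.0, 1.9, 1.7, 1.5, 1.6, 1.2, 0.9, 0.8, 1.0, 0.8, 0.7, 0.4, 0.2, 0.1, 0.1, 0.01]
dump_iterations = [i for i, loss in enumerate(train_losses, start=1) if 0.1 < loss < 1.0]

print(validation1_epochs)
print(validation2_epochs)
print(switch_epochs)
print(dump_iterations)
```

The printed lists line up with the epochs and iterations reported in the three outputs above.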

Stack events to share the action

A user can trigger the same handler on events of different types. For example, let's run a model validation handler every 3 epochs and when the training is completed:

trainer = Engine(lambda e, batch: None)

@trainer.on(Events.EPOCH_COMPLETED(every=3) | Events.COMPLETED)
def run_validation():
    print("Epoch {} - event={}: Validation".format(trainer.state.epoch, trainer.last_event_name))
    # evaluator.run(devset)

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=20)

Epoch 3 - event=epoch_completed: Validation
Epoch 6 - event=epoch_completed: Validation
Epoch 9 - event=epoch_completed: Validation
Epoch 12 - event=epoch_completed: Validation
Epoch 15 - event=epoch_completed: Validation
Epoch 18 - event=epoch_completed: Validation
Epoch 20 - event=completed: Validation

Add custom events

A user can add their own events to go beyond built-in standard events. For example, let's define new events related to backward and optimizer step calls. This can help us to attach specific handlers on these events in a configurable manner.

from ignite.engine import EventEnum


class BackpropEvents(EventEnum):
    BACKWARD_STARTED = 'backward_started'
    BACKWARD_COMPLETED = 'backward_completed'
    OPTIM_STEP_COMPLETED = 'optim_step_completed'


def update(engine, batch):
    # ...
    # loss = criterion(y_pred, y)
    engine.fire_event(BackpropEvents.BACKWARD_STARTED)
    # loss.backward()
    engine.fire_event(BackpropEvents.BACKWARD_COMPLETED)
    # optimizer.step()
    engine.fire_event(BackpropEvents.OPTIM_STEP_COMPLETED)
    # ...

trainer = Engine(update)
trainer.register_events(*BackpropEvents)

def function_before_backprop():
    print("{} - before backprop".format(trainer.state.iteration))

trainer.add_event_handler(BackpropEvents.BACKWARD_STARTED, function_before_backprop)

def function_after_backprop():
    print("{} - after backprop".format(trainer.state.iteration))

trainer.add_event_handler(BackpropEvents.BACKWARD_COMPLETED, function_after_backprop)

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=2)

1 - before backprop
1 - after backprop
2 - before backprop
2 - after backprop
3 - before backprop
3 - after backprop
4 - before backprop
4 - after backprop
5 - before backprop
5 - after backprop
6 - before backprop
6 - after backprop
7 - before backprop
7 - after backprop
8 - before backprop
8 - after backprop
9 - before backprop
9 - after backprop
10 - before backprop
10 - after backprop
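
For readers who wonder what registering and firing custom events amounts to, the following plain-Python sketch shows one possible shape of such a mechanism. `TinyEngine` is a hypothetical stand-in written for this illustration only; it is not how PyTorch-Ignite's Engine is implemented, and it ignores state, filtering and handler arguments.

```python
from collections import defaultdict
from enum import Enum


class BackpropEvents(Enum):
    BACKWARD_STARTED = "backward_started"
    BACKWARD_COMPLETED = "backward_completed"


class TinyEngine:
    """Hypothetical minimal event dispatcher (not Ignite's Engine)."""

    def __init__(self):
        self._allowed = set()
        self._handlers = defaultdict(list)

    def register_events(self, *events):
        # Only registered events may receive handlers
        self._allowed.update(events)

    def add_event_handler(self, event, handler):
        if event not in self._allowed:
            raise ValueError(f"{event} is not registered")
        self._handlers[event].append(handler)

    def fire_event(self, event):
        # Call the handlers in the order they were attached
        for handler in self._handlers[event]:
            handler()


log = []
engine = TinyEngine()
engine.register_events(*BackpropEvents)
engine.add_event_handler(BackpropEvents.BACKWARD_STARTED, lambda: log.append("before"))
engine.add_event_handler(BackpropEvents.BACKWARD_COMPLETED, lambda: log.append("after"))

for _ in range(2):  # two simulated iterations
    engine.fire_event(BackpropEvents.BACKWARD_STARTED)
    engine.fire_event(BackpropEvents.BACKWARD_COMPLETED)

print(log)
```

The point of the sketch is the separation of concerns: the update function only announces that something happened, and what to do about it is configured from the outside.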

Out-of-the-box metrics

PyTorch-Ignite provides an ensemble of metrics dedicated to many Deep Learning tasks (classification, regression, segmentation, etc.). Most of these metrics provide a way to compute various quantities of interest in an online fashion without having to store the entire output history of a model.

For classification: Precision, Recall, Accuracy, ConfusionMatrix and more!
For segmentation: DiceCoefficient, IoU, mIoU and more!
~20 regression metrics, e.g. MSE, MAE, MedianAbsoluteError, etc.
Metrics that store the entire output history per epoch
- Possible to use with scikit-learn metrics, e.g. EpochMetric, AveragePrecision, ROC_AUC, etc.
Easily composable to assemble a custom metric
Easily extendable to create custom metrics

Complete lists of the metrics provided by PyTorch-Ignite can be found in the documentation for ignite.metrics and ignite.contrib.metrics.

Two kinds of public APIs are provided:

attaching the metric to an Engine
calling the metric's reset, update and compute methods directly

More on the `reset`, `update`, `compute` public API

Let's demonstrate this API with a simple example using the Accuracy metric. The idea behind this API is that certain counters are accumulated internally on each update call. The metric's value is computed on each compute call, and the counters are reset on each reset call.

import torch
from ignite.metrics import Accuracy

acc = Accuracy()

# Start accumulation
acc.reset()

y_target = torch.tensor([0, 1, 2, 1,])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [10.0, 0.1, -1.0],  # correct
    [2.0, -1.0, -2.0],  # incorrect
    [1.0, -1.0, 4.0],   # correct
    [0.0, 5.0, -1.0],   # correct
])
acc.update((y_pred, y_target))

# Compute accuracy on 4 samples
print("After 1st update, accuracy=", acc.compute())

y_target = torch.tensor([1, 2, 0, 2])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [2.0, 1.0, -1.0],   # incorrect
    [0.0, 1.0, -2.0],   # incorrect
    [2.6, 1.0, -4.0],   # correct
    [1.0, -3.0, 2.0],   # correct
])
acc.update((y_pred, y_target))

# Compute accuracy on 8 samples
print("After 2nd update, accuracy=", acc.compute())

After 1st update, accuracy= 0.75
After 2nd update, accuracy= 0.625
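
The counter-based idea can be sketched without any library. The class below is a hypothetical plain-Python accumulator written for this illustration; it is not PyTorch-Ignite's Accuracy implementation, it only mirrors the reset, update, compute pattern on the same data as above.

```python
class RunningAccuracy:
    """Hypothetical accumulator following the reset/update/compute pattern."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Counters start from zero at the beginning of each accumulation
        self.num_correct = 0
        self.num_examples = 0

    def update(self, output):
        y_pred, y = output
        for scores, target in zip(y_pred, y):
            # Predicted class is the index of the largest logit
            predicted = max(range(len(scores)), key=scores.__getitem__)
            self.num_correct += int(predicted == target)
            self.num_examples += 1

    def compute(self):
        if self.num_examples == 0:
            raise ValueError("compute() called before any update()")
        return self.num_correct / self.num_examples


acc_sketch = RunningAccuracy()

acc_sketch.update((
    [[10.0, 0.1, -1.0], [2.0, -1.0, -2.0], [1.0, -1.0, 4.0], [0.0, 5.0, -1.0]],
    [0, 1, 2, 1],
))
first = acc_sketch.compute()

acc_sketch.update((
    [[2.0, 1.0, -1.0], [0.0, 1.0, -2.0], [2.6, 1.0, -4.0], [1.0, -3.0, 2.0]],
    [1, 2, 0, 2],
))
second = acc_sketch.compute()

print(first, second)
```

Only two integers are kept between calls, which is why such a metric never needs the full history of model outputs.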

Composable metrics

Users can compose their own metrics with ease from existing ones using arithmetic operations or PyTorch methods. For example, an error metric defined as 100 * (1.0 - accuracy) can be coded in a straightforward manner:

import torch
from ignite.metrics import Accuracy

acc = Accuracy()
error = 100.0 * (1.0 - acc)

# Start accumulation
acc.reset()

y_target = torch.tensor([0, 1, 2, 1,])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [10.0, 0.1, -1.0],  # correct
    [2.0, -1.0, -2.0],  # incorrect
    [1.0, -1.0, 4.0],   # correct
    [0.0, 5.0, -1.0],   # correct
])
acc.update((y_pred, y_target))

# Compute error on 4 samples
print("After 1st update, error=", error.compute())

y_target = torch.tensor([1, 2, 0, 2])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [2.0, 1.0, -1.0],   # incorrect
    [0.0, 1.0, -2.0],   # incorrect
    [2.6, 1.0, -4.0],   # correct
    [1.0, -3.0, 2.0],   # correct
])
acc.update((y_pred, y_target))

# Compute error on 8 samples
print("After 2nd update, error=", error.compute())

After 1st update, error= 25.0
After 2nd update, error= 37.5
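
One way to picture why `error` follows `acc` without being updated itself is lazy evaluation through operator overloading. The sketch below is a hypothetical plain-Python illustration of that idea; `LazyValue` and `_value` are names invented here and are not PyTorch-Ignite classes, and the library's actual mechanism may differ.

```python
def _value(x):
    """Return the current value of a LazyValue, or x itself for plain numbers."""
    return x.compute() if isinstance(x, LazyValue) else x


class LazyValue:
    """Hypothetical metric-like object whose value is computed on demand."""

    def __init__(self, fn):
        self._fn = fn

    def compute(self):
        return self._fn()

    # Each operator returns a new LazyValue that defers the arithmetic
    def __mul__(self, other):
        return LazyValue(lambda: self.compute() * _value(other))

    def __rmul__(self, other):
        return LazyValue(lambda: _value(other) * self.compute())

    def __sub__(self, other):
        return LazyValue(lambda: self.compute() - _value(other))

    def __rsub__(self, other):
        return LazyValue(lambda: _value(other) - self.compute())


# Counters that a real metric would accumulate in update()
counts = {"correct": 0, "total": 0}
acc_sketch = LazyValue(lambda: counts["correct"] / counts["total"])
error_sketch = 100.0 * (1.0 - acc_sketch)

counts.update(correct=3, total=4)
error_first = error_sketch.compute()

counts.update(correct=5, total=8)
error_second = error_sketch.compute()

print(error_first, error_second)
```

Because the arithmetic is deferred until compute is called, the composed object always reflects the latest state of the metric it was built from.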

If a custom metric cannot be expressed as arithmetic operations on base metrics, please follow this guide to implement the custom metric.

Out-of-the-box handlers

PyTorch-Ignite provides various commonly used handlers to simplify application code:

Common training handlers: Checkpoint, EarlyStopping, Timer, TerminateOnNan
Optimizer's parameter scheduling (learning rate, momentum, etc.)
- concatenate schedulers, add warm-up, cyclical scheduling, piecewise-linear scheduling, and more! See examples.
Time profiling
Logging to experiment tracking systems:
- Tensorboard, Visdom, MLflow, Polyaxon, Neptune, Trains, etc.

Complete lists of the handlers provided by PyTorch-Ignite can be found in the documentation for ignite.handlers and ignite.contrib.handlers.

Common training handlers

With the out-of-the-box Checkpoint handler, a user can easily save the training state or the best models to the filesystem or to cloud storage.

EarlyStopping and TerminateOnNan help to stop the training when the model overfits or diverges.

All of these handlers can be added to the trainer one by one or with helper methods.

Let's consider an example of using helper methods.

import torch
import torch.nn as nn
import torch.optim as optim

from ignite.engine import create_supervised_trainer, create_supervised_evaluator, Events
from ignite.metrics import Accuracy
import ignite.contrib.engines.common as common

train_data = [[torch.rand(2, 4), torch.randint(0, 5, size=(2, ))] for _ in range(10)]
val_data = [[torch.rand(2, 4), torch.randint(0, 5, size=(2, ))] for _ in range(10)]
epoch_length = len(train_data)

model = nn.Linear(4, 5)
optimizer = optim.SGD(model.parameters(), lr=0.01)
# step_size is expressed in iterations
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=epoch_length, gamma=0.88)

# Let's define some dummy trainer and evaluator
trainer = create_supervised_trainer(model, optimizer, nn.CrossEntropyLoss())
evaluator = create_supervised_evaluator(model, metrics={"accuracy": Accuracy()})


@trainer.on(Events.EPOCH_COMPLETED)
def run_validation():
    evaluator.run(val_data)

# training state to save
to_save = {
    "trainer": trainer, "model": model,
    "optimizer": optimizer, "lr_scheduler": lr_scheduler
}
metric_names = ["batch loss", ]

common.setup_common_training_handlers(
    trainer=trainer,
    to_save=to_save,
    output_path="checkpoints",
    save_every_iters=epoch_length,
    lr_scheduler=lr_scheduler,
    output_names=metric_names,
    with_pbars=True,
)

tb_logger = common.setup_tb_logging("tb_logs", trainer, optimizer, evaluators=evaluator)

common.save_best_model_by_val_score(
    "best_models",
    evaluator=evaluator,
    model=model,
    metric_name="accuracy",
    n_saved=2,
    trainer=trainer,
    tag="val",
)

trainer.run(train_data, max_epochs=5)

tb_logger.close()

ls -all "checkpoints"
ls -all "best_models"
ls -all "tb_logs"

total 12
drwxr-xr-x 2 root root 4096 Aug 31 11:27 .
drwxr-xr-x 1 root root 4096 Aug 31 11:27 ..
-rw------- 1 root root 1657 Aug 31 11:27 training_checkpoint_50.pt
total 16
drwxr-xr-x 2 root root 4096 Aug 31 11:27  .
drwxr-xr-x 1 root root 4096 Aug 31 11:27  ..
-rw------- 1 root root 1145 Aug 31 11:27 'best_model_2_val_accuracy=0.3000.pt'
-rw------- 1 root root 1145 Aug 31 11:27 'best_model_3_val_accuracy=0.3000.pt'
total 12
drwxr-xr-x 2 root root 4096 Aug 31 11:27 .
drwxr-xr-x 1 root root 4096 Aug 31 11:27 ..
-rw-r--r-- 1 root root  325 Aug 31 11:27 events.out.tfevents.1598873224.3aa7adc24d3d.115.1

In the above code, the common.setup_common_training_handlers method adds TerminateOnNan, adds a handler that steps lr_scheduler (expressed in iterations), adds training state checkpointing, exposes the batch loss output as an exponential moving average metric for logging, and adds a progress bar to the trainer.

Next, the common.setup_tb_logging method returns a TensorBoard logger which is automatically configured to log the trainer's metrics (i.e. batch loss), the optimizer's learning rate and the evaluator's metrics.

Finally, common.save_best_model_by_val_score sets up a handler to save the best two models according to the validation accuracy metric.
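
Since the scheduler's step_size is expressed in iterations, it can help to work out the learning rate the example would use in each epoch. The sketch below is plain arithmetic under two assumptions that are not stated in the code above: the scheduler is stepped once per completed iteration, and StepLR multiplies the learning rate by gamma every step_size steps. The function `lr_at_iteration` is a hypothetical helper for this illustration.

```python
base_lr = 0.01
gamma = 0.88
step_size = 10  # iterations; equals epoch_length in the example above
max_epochs = 5


def lr_at_iteration(iteration):
    """Learning rate used during a 1-based iteration (closed form for a step decay)."""
    completed_steps = iteration - 1  # scheduler steps taken before this iteration
    return base_lr * gamma ** (completed_steps // step_size)


# Learning rate seen at the first iteration of each epoch
lr_per_epoch = [
    lr_at_iteration((epoch - 1) * step_size + 1) for epoch in range(1, max_epochs + 1)
]

for epoch, lr in enumerate(lr_per_epoch, start=1):
    print(f"epoch {epoch}: lr={lr:.6f}")
```

With step_size equal to the epoch length, the decay effectively happens once per epoch even though the scheduler is driven by iterations.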

Distributed and XLA device support

PyTorch offers a distributed communication package for writing and running parallel applications on multiple devices and machines.
The native interface provides commonly used collective operations and makes it possible to run multi-CPU and multi-GPU computations seamlessly using the torch DistributedDataParallel module and the well-known mpi, gloo and nccl backends.
More recently, users can also run PyTorch on XLA devices, like TPUs, with the torch_xla package.

However, writing distributed training code that works on both GPUs and TPUs is not a trivial task due to some API specificities.
The purpose of the ignite.distributed package, introduced in PyTorch-Ignite 0.4, is to unify the code for the native torch.distributed API and the torch_xla API on XLA devices, while also supporting other distributed frameworks (e.g. Horovod).
To make distributed configuration setup easier, the Parallel context manager has been introduced:

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    print(idist.get_rank(), ': run with config:', config, '- backend=', idist.backend())
    # do the training ...

backend = 'gloo' # or "nccl" or "xla-tpu"
dist_configs = {'nproc_per_node': 2}
# dist_configs["start_method"] = "fork"  # If using Jupyter Notebook
config = {'c': 12345}

with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)

2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'gloo'
2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: 
    nproc_per_node: 2
    nnodes: 1
    node_rank: 0
2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: Spawn function '<function training at 0x7f32b8ac9d08>' in 2 processes
0 : run with config: {'c': 12345} - backend= gloo
1 : run with config: {'c': 12345} - backend= gloo
2020-08-31 11:27:09,959 ignite.distributed.launcher.Parallel INFO: End of run

With a single modification, the above code can run on a single GPU, on multiple GPUs of a single node, on single or multiple TPUs, etc. It can be executed with the torch.distributed.launch tool or directly with Python, in which case the required number of processes is spawned. For more details, see the documentation.

In addition, methods like auto_model(), auto_optim() and auto_dataloader() transparently adapt the provided model, optimizer and data loaders to the existing configuration:

# main.py

import torch.optim as optim
from torchvision.models import resnet50

import ignite.distributed as idist

def training(local_rank, config, **kwargs):

    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())

    # `dataset` is assumed to be a user-provided torch.utils.data.Dataset defined elsewhere
    train_loader = idist.auto_dataloader(dataset, batch_size=32, num_workers=12, shuffle=True, **kwargs)
    # batch size, num_workers and sampler are automatically adapted to the existing configuration
    # ...
    model = resnet50()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # if training with Nvidia/Apex for Automatic Mixed Precision (AMP)
    # model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    model = idist.auto_model(model)
    # model is DDP or DP or just itself according to existing configuration
    # ...
    optimizer = idist.auto_optim(optimizer)
    # optimizer is returned as is, except in an XLA configuration where its `step()` method is overridden.
    # User can safely call `optimizer.step()` (behind the scenes, `xm.optimizer_step(optimizer)` is performed)

backend = "nccl"  # torch native distributed configuration on multiple GPUs
# backend = "xla-tpu"  # XLA TPUs distributed configuration
# backend = None  # no distributed configuration
with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)

Please note that these auto_* methods are optional; a user is free to use some of them and manually set up certain parts of the code if required. The advantage of this approach is that objects are never unavoidably patched or overridden under the hood.
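
To give an intuition for what "automatically adapted" could mean for a data loader, here is a small arithmetic sketch. It assumes that the requested batch size is a global one that gets split evenly across all processes and that the worker budget is divided among the processes of a node; this is an assumption made for the illustration, so please check the auto_dataloader documentation for the exact rules. `adapt_loader_kwargs` is a hypothetical helper, not a PyTorch-Ignite function.

```python
def adapt_loader_kwargs(batch_size, num_workers, world_size, nproc_per_node):
    """Hypothetical helper: split a global batch size and a worker budget per process."""
    if world_size > 1:
        if batch_size >= world_size:
            # Each process handles an equal share of the global batch
            batch_size //= world_size
        if num_workers >= nproc_per_node:
            # Ceiling division so the per-node worker budget is not undershot
            num_workers = (num_workers + nproc_per_node - 1) // nproc_per_node
    return {"batch_size": batch_size, "num_workers": num_workers}


# Values from the snippet above, on one node with two processes
per_process = adapt_loader_kwargs(batch_size=32, num_workers=12, world_size=2, nproc_per_node=2)

# Without a distributed configuration nothing changes
single_process = adapt_loader_kwargs(batch_size=32, num_workers=12, world_size=1, nproc_per_node=1)

print(per_process)
print(single_process)
```

Under these assumptions the same script keeps the same effective global batch size whether it runs on one process or several.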

More details about distributed helpers provided by PyTorch-Ignite can be found in the documentation.
A complete example of training on CIFAR10 can be found here.

A detailed tutorial with distributed helpers is published here.

Next steps

To learn more about PyTorch-Ignite, please check out our website: https://pytorch-ignite.ai and our tutorials and how-to guides.

We also provide the PyTorch-Ignite Code Generator application: https://code-generator.pytorch-ignite.ai/ to help you start working on tasks without rewriting everything from scratch.

Keep updated with all PyTorch-Ignite news by following us on Twitter and Facebook.
