DEV Community: Jakub Czakon

Text Classification: All Tips and Tricks from 5 Kaggle Competitions

Jakub Czakon — Fri, 29 May 2020 08:45:26 +0000

This article was originally posted by Shahul ES on the Neptune blog.

In this article, I will discuss some great tips and tricks to improve the performance of your text classification model. These tricks are obtained from solutions of some of Kaggle’s top NLP competitions.

Namely, I’ve gone through:

Jigsaw Unintended Bias in Toxicity Classification – $65,000
Toxic Comment Classification Challenge – $35,000
Quora Insincere Questions Classification – $25,000
Google QUEST Q&A Labeling – $25,000
TensorFlow 2.0 Question Answering – $50,000

and found a ton of great ideas.

Without further ado, let’s begin.

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3GB or more is already a lot for Kaggle kernels and basic laptops), you may find it difficult to load and process it with limited resources. Here are some approaches from articles and kernels that I have found useful in such situations.

Optimize the memory by reducing the size of some attributes
Use open-source libraries such as Dask to read and manipulate the data; it performs parallel computing and saves memory
Use cudf
Convert data to parquet format
Convert data to feather format
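To make the first tip concrete, here is a minimal sketch of reducing the size of an integer attribute by storing it in the smallest dtype that fits. The function name `downcast_integers` and the example column are my own assumptions for illustration; the same idea is usually applied column by column to a data frame.

```python
import numpy as np

def downcast_integers(values: np.ndarray) -> np.ndarray:
    """Return a copy of an integer array stored in the smallest signed dtype that fits."""
    lo, hi = int(values.min()), int(values.max())
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return values.astype(dtype)
    return values

# Hypothetical attribute: comment lengths loaded as 64-bit integers by default.
comment_lengths = np.array([12, 250, 3100, 87], dtype=np.int64)
compact = downcast_integers(comment_lengths)
print(compact.dtype, comment_lengths.nbytes, "->", compact.nbytes)
```

The values are unchanged; only the storage width shrinks, which adds up quickly over millions of rows.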

Small datasets and external data

But, what can one do if the dataset is small? Let’s see some techniques to tackle this situation.

One way to increase the performance of any machine learning model is to use external data that contains variables that influence the target variable.

Let’s see some of the external datasets.

Use of squad data for Question Answering tasks
Other datasets for QA tasks
Wikitext long term dependency language modeling dataset
Stackexchange data
Prepare a dictionary of commonly misspelled words and corrected words
Use of helper datasets for cleaning
Pseudo labeling is the process of adding confidently predicted test data to your training data
Use different data sampling methods
Text augmentation by Exchanging words with synonyms
Text augmentation by noising in RNN
Text augmentation by translation to other languages and back
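Pseudo labeling from the list above can be sketched in a few lines. This is only an illustration: the function name and the 0.05 / 0.95 confidence thresholds are assumptions of mine, and in practice the thresholds are something you would tune on validation data.

```python
import numpy as np

def select_pseudo_labels(test_probs, low=0.05, high=0.95):
    """Return (indices, labels) of test rows that the model predicts with high confidence."""
    test_probs = np.asarray(test_probs, dtype=float)
    confident = (test_probs <= low) | (test_probs >= high)
    indices = np.flatnonzero(confident)
    labels = (test_probs[indices] >= high).astype(int)
    return indices, labels

# Hypothetical predicted probabilities for five test comments.
probs = [0.99, 0.50, 0.02, 0.70, 0.97]
idx, labels = select_pseudo_labels(probs)
print(idx, labels)
```

The selected rows, together with their predicted labels, would then be appended to the training set before retraining.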

Data Exploration and Gaining insights

Data exploration always helps to better understand the data and gain insights from it. Before starting to develop machine learning models, top competitors typically read and perform a lot of exploratory data analysis. This helps in feature engineering and cleaning of the data.

Twitter data exploration methods
Simple EDA for tweets
EDA for Quora data
EDA in R for Quora data
Complete EDA with stack exchange data
My previous article on EDA for natural language processing

Data Cleaning

Data cleaning is an important and integral part of any NLP problem. Text data always needs some preprocessing and cleaning before we can represent it in a suitable form.

Use this notebook to clean social media data
Data cleaning for BERT
Use textblob to correct misspellings
Cleaning for pre-trained embeddings
Language detection and translation for multilingual tasks
Preprocessing for Glove part 1 and part 2
Increasing word coverage to get more from pre-trained word embeddings
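A quick way to see whether cleaning is paying off is to measure how much of your text is covered by the pre-trained embedding vocabulary. The sketch below assumes whitespace tokenization and a plain Python set as the vocabulary; `embedding_coverage` is a hypothetical helper name.

```python
from collections import Counter

def embedding_coverage(texts, embedding_vocab):
    """Return (share of unique words covered, share of all tokens covered, OOV counts)."""
    counts = Counter(word for text in texts for word in text.split())
    known = {w: c for w, c in counts.items() if w in embedding_vocab}
    oov = Counter({w: c for w, c in counts.items() if w not in embedding_vocab})
    vocab_share = len(known) / len(counts)
    token_share = sum(known.values()) / sum(counts.values())
    return vocab_share, token_share, oov

texts = ["the movie was great", "the plot was gr8"]
vocab = {"the", "movie", "was", "great", "plot"}  # stand-in for the embedding vocabulary
vocab_share, token_share, oov = embedding_coverage(texts, vocab)
print(vocab_share, token_share, oov.most_common(5))
```

Sorting the out-of-vocabulary words by frequency shows which misspellings or symbols are worth fixing first.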

Text Representations

Before we feed our text data to the Neural network or ML model, the text input needs to be represented in a suitable format. These representations determine the performance of the model to a large extent.

Pretrained Glove vectors
Pretrained fasttext vectors
Pretrained word2vec vectors
My previous article on these 3 embeddings
Combining pre-trained vectors. This can help in better representation of text and decreasing OOV words
Paragram embeddings
Universal Sentence Encoder
Use USE to generate sentence-level features
3 methods to combine embeddings
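One simple way to combine pre-trained vectors is concatenation, which also reduces OOV words because a word only needs to exist in one of the sources. The tiny dictionaries below are made-up stand-ins for real GloVe or fastText lookups, and `combine_embeddings` is a name I chose for the sketch.

```python
import numpy as np

def combine_embeddings(word, sources, dims):
    """Concatenate a word's vectors from several sources; missing ones become zeros."""
    parts = [src.get(word, np.zeros(dim)) for src, dim in zip(sources, dims)]
    return np.concatenate(parts)

# Toy lookups standing in for two different pre-trained embedding files.
glove_like = {"cat": np.array([0.1, 0.2]), "dog": np.array([0.3, 0.4])}
fasttext_like = {"cat": np.array([0.5, 0.6, 0.7]), "catz": np.array([0.5, 0.5, 0.6])}

cat = combine_embeddings("cat", [glove_like, fasttext_like], [2, 3])
catz = combine_embeddings("catz", [glove_like, fasttext_like], [2, 3])
print(cat, catz)
```

Averaging is another common option, but it requires the sources to share the same dimensionality.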

Contextual embeddings models

BERT Bidirectional Encoder Representations from Transformers
GPT
Roberta a Robustly Optimized BERT
Albert a Lite BERT for Self-supervised Learning of Language Representations
Distilbert a lighter version of BERT
XLNET

Modeling

Model architecture

Choosing the right architecture is important for developing a good machine learning model. Recurrent sequence models such as LSTMs and GRUs perform well on NLP problems and are always worth trying. Stacking 2 layers of LSTM/GRU networks is a common approach.

Loss functions

Choosing a proper loss function for your NN model can noticeably improve its performance by allowing it to optimize well on the loss surface.

You can try different loss functions or even write a custom loss function that matches your problem. Some of the popular loss functions are:

Binary cross-entropy for binary classification
Categorical cross-entropy for multi-class classification
Focal loss used for unbalanced datasets
Weighted focal loss for multilabel classification
Weighted kappa for multiclass classification
BCE with logit loss to get sigmoid cross-entropy
Custom mimic loss used in Jigsaw unintended bias classification competition
MTL custom loss used in jigsaw unintended bias classification competition
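As an illustration of one loss from this list, here is a minimal NumPy sketch of the binary focal loss, which down-weights easy examples so that hard ones dominate the gradient. The defaults `gamma=2.0` and `alpha=0.25` are commonly quoted values that I am assuming here, not settings taken from any of the competitions.

```python
import numpy as np

def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1], [0.95])  # confident and correct
hard = binary_focal_loss([1], [0.30])  # wrong side of 0.5
print(easy, hard)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which is a handy sanity check.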

Optimizers

Stochastic gradient descent
RMSprop
Adagrad allows the learning rate to adapt per parameter
Adam for fast and easy convergence
Adam with warmup to add a learning rate warmup phase to the Adam algorithm
Bert Adam for Bert based models
Rectified Adam for stabilizing training and accelerating convergence
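To show what warmup means in practice, here is a small sketch of a schedule that raises the learning rate linearly and then decays it linearly to zero. The function name, the base rate of 2e-5, and the 10% warmup fraction are assumptions for illustration; the value returned would be passed to whichever optimizer you use.

```python
def warmup_linear_decay(step, total_steps, base_lr=2e-5, warmup_fraction=0.1):
    """Learning rate that rises linearly during warmup, then decays linearly to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [warmup_linear_decay(s, total_steps=100) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```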

Callback methods

Callbacks are always useful to monitor the performance of your model while training and trigger some necessary actions that can enhance the performance of your model.

Model checkpoint for monitoring and saving weights
Learning rate scheduler to change the learning rate based on model performance to help converge easily
Simple custom callbacks using lambda callbacks
Custom Checkpointing
Building your custom callbacks for various use cases
Reduce on plateau to reduce the learning rate when a metric has stopped improving
Early Stopping to stop training when the model stops improving
Snapshot ensembling to get a variety of model checkpoints in one training
Fast geometric ensembling
Stochastic Weight Averaging (SWA)
Dynamic learning rate decay
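The logic behind early stopping is simple enough to sketch without any framework. The class below is a hypothetical stand-alone version, not the Keras callback itself; `patience` and `min_delta` mirror the usual meaning of those parameters.

```python
class EarlyStopping:
    """Signal a stop once the monitored loss fails to improve for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Note that training stops before the last epoch would have improved the loss, which is the usual trade-off when choosing a small patience.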

Evaluation and cross-validation

Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor performance of the model in the private test set.

The traditional 80:20 split doesn’t work in many cases. In most cases, cross-validation gives a better estimate of model performance than a single train-validation split.

There are different variations of KFold cross-validation such as group k-fold that should be chosen accordingly.

K-fold cross-validation
Stratified KFold cross-validation
Group KFold
Adversarial validation to check if train and test distributions are similar or not
CV analysis of different strategies
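To illustrate why group k-fold matters, here is a rough sketch that assigns whole groups to folds so that the same group never appears in both training and validation. The greedy balancing rule and the name `group_kfold_assign` are my own simplification; in practice a library implementation such as scikit-learn's GroupKFold would normally be used.

```python
from collections import Counter

def group_kfold_assign(groups, n_splits=3):
    """Map each group to a fold, largest groups first, always filling the lightest fold."""
    sizes = Counter(groups)
    fold_load = [0] * n_splits
    group_to_fold = {}
    for group, size in sizes.most_common():
        fold = fold_load.index(min(fold_load))
        group_to_fold[group] = fold
        fold_load[fold] += size
    return [group_to_fold[g] for g in groups]

# Hypothetical case: several answers belong to the same question.
question_ids = ["q1", "q1", "q2", "q3", "q3", "q3", "q4", "q5", "q5"]
folds = group_kfold_assign(question_ids, n_splits=3)
print(folds)
```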

Runtime tricks

You can perform some tricks to decrease the runtime and also improve model performance at the runtime.

Sequence bucketing to save runtime and improve performance
Keep the head and tail of the input when it is longer than 512 tokens
Use the GPU efficiently
Free keras memory
Save and load models to save runtime and memory
Don’t Save Embedding in RNN Solutions
Load word2vec vectors without key vectors
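Two of the tricks above, head-and-tail truncation and sequence bucketing, can be sketched in plain Python. The 128-token head is an assumed split, the padding value 0 is an assumption, and both function names are hypothetical.

```python
def head_tail_truncate(tokens, max_len=512, head_len=128):
    """Keep the first `head_len` tokens and the last `max_len - head_len` tokens."""
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    return list(tokens[:head_len]) + list(tokens[len(tokens) - tail_len:])

def bucket_by_length(sequences, batch_size):
    """Group sequences of similar length so each batch is padded only to its own max."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        batch = [sequences[i] for i in order[start:start + batch_size]]
        width = max(len(seq) for seq in batch)
        batches.append([seq + [0] * (width - len(seq)) for seq in batch])
    return batches

long_input = list(range(1000))
short = head_tail_truncate(long_input)
batches = bucket_by_length([[1] * 3, [1] * 40, [1] * 4, [1] * 38], batch_size=2)
print(len(short), [len(b[0]) for b in batches])
```

Without bucketing, every sequence in this toy example would be padded to length 40; with it, the short batch is padded only to length 4.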

Model ensembling

In a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.

Let’s see some of the popular ensembling techniques used in Kaggle competitions:

Weighted average ensemble
Stacked generalization ensemble
Out of folds predictions
Blending with linear regression
Use optuna to determine blending weights
Power average ensemble
Power 3.5 blending strategy
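Here is a minimal sketch of two of the simplest techniques in this list. I am assuming that a power average means raising each model's predictions to a power before averaging, which is one common reading of the term; the weights and predictions are made up.

```python
import numpy as np

def weighted_average(predictions, weights):
    """Blend model predictions; np.average normalizes the weights to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return np.average(np.asarray(predictions, dtype=float), axis=0, weights=weights)

def power_average(predictions, power=3.5):
    """Average predictions after raising each one to `power`."""
    return np.mean(np.asarray(predictions, dtype=float) ** power, axis=0)

# Hypothetical probabilities from two models for three test rows.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.8]
blend = weighted_average([model_a, model_b], weights=[3, 1])
powered = power_average([model_a, model_b])
print(blend, powered)
```

The power average shrinks all probabilities towards 0, so it is mainly useful for ranking-based metrics such as AUC where only the order of predictions matters.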

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your NLP classification model. Hopefully, you will find them useful in your projects.

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

6 GAN Architectures You Really Should Know

Jakub Czakon — Mon, 25 May 2020 14:45:22 +0000

This article was originally posted by Shibsankar Das on the Neptune blog where you can find more in-depth articles for machine learning practitioners.

Generative Adversarial Networks (GANs) were first introduced in 2014 by Ian Goodfellow et al., and since then this topic has opened up a new area of research.

Within a few years, the research community came up with plenty of papers on this topic some of which have very interesting names :). You have CycleGAN, followed by BiCycleGAN, followed by ReCycleGAN and so on.

With the invention of GANs, Generative Models started showing promising results in generating realistic images. GANs have shown tremendous success in Computer Vision. In recent times, they have started showing promising results in Audio and Text as well.

Some of the most popular GAN formulations are:

Transforming an image from one domain to another(CycleGAN),
Generating an image from a textual description (text-to-image),
Generating very high-resolution images (ProgressiveGAN) and many more.

In this article, we will talk about some of the most popular GAN architectures, particularly 6 architectures that you should know to have a diverse coverage on Generative Adversarial Networks (GANs).

Namely:

CycleGAN
StyleGAN
pixelRNN
text-2-image
DiscoGAN
lsGAN

GAN 101 and Vanilla GAN

There are 2 kinds of models in machine learning: Generative and Discriminative Models. Discriminative Models are primarily used to solve classification tasks, where the model usually learns a decision boundary to predict which class a data point belongs to. Generative Models, on the other hand, are primarily used to generate synthetic data points that follow the same probability distribution as the training data. Our topic of discussion, Generative Adversarial Networks (GANs), is an example of a Generative Model.

The primary objective of a Generative Model is to learn the unknown probability distribution of the population from which the training observations are sampled. Once the model is successfully trained, you can sample new, “generated” observations that follow the training distribution.

Let’s discuss the core concepts of GAN formulation.

A GAN comprises two separate networks, a Generator and a Discriminator.

The Generator generates synthetic samples from random noise [sampled from a latent space], and the Discriminator is a binary classifier that decides whether the input sample is real [output a scalar value 1] or fake [output a scalar value 0].

Samples generated by the Generator are termed fake samples. As you can see in Fig1 and Fig2, when a data point from the training dataset is given as input to the Discriminator, it calls it out as a real sample, whereas it calls out a data point as fake when it is generated by the Generator.

Fig1: Generator and Discriminator as GAN building blocks

The beauty of this formulation is the adversarial nature between the Generator and the Discriminator.

The Discriminator wants to do its job in the best possible way. When a fake sample [which are generated by the Generator] is given to the Discriminator, it wants to call it out as fake but the Generator wants to generate samples in a way so that the Discriminator makes a mistake in calling it out as a real one. In some sense, the Generator is trying to fool the Discriminator.

Fig2: Generator and Discriminator as GAN building blocks

Let us have a quick look at the objective function and how the optimization is done. It’s a min-max optimization formulation where the Generator wants to minimize the objective function whereas the Discriminator wants to maximize the same objective function.

Fig3 depicts the objective function being optimized. The Discriminator function is termed D and the Generator function is termed G. Pz is the probability distribution of the latent space, which is usually a random Gaussian distribution. Pdata is the probability distribution of the training dataset. When x is sampled from Pdata, the Discriminator wants to classify it as a real sample. G(z) is a generated sample; when G(z) is given as input to the Discriminator, it wants to classify it as a fake one.

Fig3: Objective function in GAN formulation

The Discriminator wants to drive D(G(z)), its estimate of the probability that a generated sample is real, to 0. Hence it wants to maximize (1-D(G(z))), whereas the Generator wants to force D(G(z)) to 1 so that the Discriminator makes a mistake in calling out a generated sample as real. Hence the Generator wants to minimize (1-D(G(z))).
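For reference, the min-max objective described above is usually written as follows, using the same symbols D, G, Pdata and Pz as in the text:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim P_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The first term rewards the Discriminator for scoring real samples close to 1, and the second term is the (1-D(G(z))) quantity that the two networks pull in opposite directions.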

Fig4: Objective function in GAN formulation

CycleGAN:

CycleGAN is a very popular GAN architecture primarily being used to learn transformation between images of different styles.

As an example, this kind of formulation can learn:

a map between artistic and realistic images,
a transformation between images of horse and zebra,
a transformation between winter image and summer image
and so on

FaceApp, which transforms human faces into different age groups, is often cited as a popular example of this kind of image transformation.

As an example, let’s say X is a set of images of horses and Y is a set of images of zebras.

The goal is to learn a mapping function G: X -> Y such that images generated by G(X) are indistinguishable from the images of Y. This objective is achieved using an Adversarial loss. This formulation not only learns G, but it also learns an inverse mapping function F: Y -> X and uses a cycle-consistency loss to enforce F(G(X)) ≈ X and vice versa.

Training data for image-to-image translation comes in 2 kinds.

Paired data consists of examples {Xi, Yi} where each Xi has its Yi counterpart.
Unpaired data consists of a set of images from X and another set of images from Y without any match between Xi and Yi.

CycleGAN is designed for the unpaired setting.

Fig5: The training procedure for CycleGAN.

As mentioned earlier, there are 2 functions being learned: G, which transforms X to Y, and F, which transforms Y to X. Together they make up two individual GAN models, so you will find 2 Discriminator functions, Dx and Dy.

As part of the Adversarial formulation, the Discriminator Dx classifies whether a transformed image F(Y) can be told apart from real images in X. Similarly, the Discriminator Dy classifies whether a transformed image G(X) can be told apart from real images in Y.

Along with the Adversarial loss, CycleGAN uses a cycle-consistency loss to enable training without paired images. This additional loss helps the model minimize the reconstruction error, so that F(G(X)) ≈ X and G(F(Y)) ≈ Y.

All in all, the CycleGAN formulation comprises 3 individual losses (two adversarial losses, one per mapping direction, and a cycle-consistency loss), and their weighted sum is optimized during training.
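In the notation used above (G, F, Dx, Dy), the losses from the original CycleGAN paper can be written as follows. The weight λ controls the relative importance of the cycle-consistency term.

```latex
\begin{align}
\mathcal{L}_{GAN}(G, D_Y, X, Y) &= \mathbb{E}_{y}\left[\log D_Y(y)\right]
  + \mathbb{E}_{x}\left[\log\left(1 - D_Y(G(x))\right)\right] \\
\mathcal{L}_{cyc}(G, F) &= \mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right]
  + \mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right] \\
\mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{GAN}(G, D_Y, X, Y)
  + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F) \\
G^{*}, F^{*} &= \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y)
\end{align}
```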

Let’s take a look at some of the results from CycleGAN. As you see, the model has learned a transformation to convert an image of a zebra to a horse, a summer time image to the winter counterpart and vice-versa.

The following code snippet shows how the different loss functions are combined. Please refer to the reference below for the complete code flow.

CycleGAN

# Generator G translates X -> Y
# Generator F translates Y -> X.
fake_y = generator_g(real_x, training=True)
cycled_x = generator_f(fake_y, training=True)

fake_x = generator_f(real_y, training=True)
cycled_y = generator_g(fake_x, training=True)

# same_x and same_y are used for identity loss.
same_x = generator_f(real_x, training=True)
same_y = generator_g(real_y, training=True)

disc_real_x = discriminator_x(real_x, training=True)
disc_real_y = discriminator_y(real_y, training=True)

disc_fake_x = discriminator_x(fake_x, training=True)
disc_fake_y = discriminator_y(fake_y, training=True)

# calculate the loss
gen_g_loss = generator_loss(disc_fake_y)
gen_f_loss = generator_loss(disc_fake_x)

total_cycle_loss = (calc_cycle_loss(real_x, cycled_x) +
                    calc_cycle_loss(real_y, cycled_y))

# Total generator loss = adversarial loss + cycle loss + identity loss
total_gen_g_loss = gen_g_loss + total_cycle_loss + identity_loss(real_y, same_y)
total_gen_f_loss = gen_f_loss + total_cycle_loss + identity_loss(real_x, same_x)

disc_x_loss = discriminator_loss(disc_real_x, disc_fake_x)
disc_y_loss = discriminator_loss(disc_real_y, disc_fake_y)
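The snippet above calls four helper functions without defining them. As a rough idea of what they compute, here is a NumPy sketch under my own assumptions: discriminator outputs are raw logits, the adversarial terms use sigmoid cross-entropy, the cycle and identity terms use mean absolute error, and `LAMBDA = 10` is an assumed weight. These are stand-ins for illustration, not the code from the referenced tutorial.

```python
import numpy as np

LAMBDA = 10  # assumed weight for the cycle and identity terms

def _bce_from_logits(target, logits):
    """Mean sigmoid cross-entropy between a constant target (0 or 1) and raw logits."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|))
    per_item = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(per_item))

def discriminator_loss(disc_real, disc_fake):
    # Real samples should score 1, generated samples 0.
    return 0.5 * (_bce_from_logits(1.0, disc_real) + _bce_from_logits(0.0, disc_fake))

def generator_loss(disc_fake):
    # The generator is rewarded when the discriminator scores its output as real.
    return _bce_from_logits(1.0, disc_fake)

def calc_cycle_loss(real_image, cycled_image):
    diff = np.asarray(real_image, dtype=float) - np.asarray(cycled_image, dtype=float)
    return LAMBDA * float(np.mean(np.abs(diff)))

def identity_loss(real_image, same_image):
    diff = np.asarray(real_image, dtype=float) - np.asarray(same_image, dtype=float)
    return LAMBDA * 0.5 * float(np.mean(np.abs(diff)))

print(generator_loss([10.0]), generator_loss([-10.0]))
```

A generated image that the discriminator scores highly gives the generator a loss near 0, while one it confidently rejects gives a loss near 10.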

Following is an example where an image of a horse has been transformed into an image that looks like a zebra.

References:

Research Paper:

TensorFlow has a well-documented tutorial on CycleGAN. Please refer to the following URL as a reference:

https://www.tensorflow.org/tutorials/generative/cyclegan

StyleGAN:

Can you guess which image (from the following 2 images) is real and which one is generated by GAN?

The fact is that both the images are imagined by a GAN formulation called StyleGAN.

StyleGAN is a GAN formulation which is capable of generating very high-resolution images, up to 1024×1024. The idea is to build a stack of layers where the initial layers generate low-resolution images (starting from 4×4) and further layers gradually increase the resolution.

The easiest way for a GAN to generate high-resolution images would be to memorize images from the training dataset and add random noise to an existing image when generating new ones. In reality, StyleGAN doesn’t do that. Rather, it learns features of the human face and generates a new image of a face that doesn’t exist in reality. If this sounds interesting, visit https://thispersondoesnotexist.com/. Each visit to this URL will generate a new image of a human face that doesn’t exist.

This figure depicts the typical architecture of StyleGAN. The latent space vector z is passed through a mapping network that comprises 8 fully connected layers, whereas the synthesis network comprises 18 layers, two for each resolution from 4×4 to 1024×1024. The output of the last layer is converted to an RGB image using a separate convolution layer. This architecture has 26.2 million parameters, and because of this very high number of trainable parameters, the model requires a huge number of training images to train successfully.

Each layer is normalized using the Adaptive Instance Normalization (AdaIN) operation: each feature map xi is normalized separately, and then scaled and biased using the corresponding scalar components from style y. Thus the dimensionality of y is twice the number of feature maps on that layer.
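The AdaIN operation described above can be sketched in NumPy. This is a simplified illustration that assumes feature maps of shape (channels, height, width) and a style y already split into a scale vector and a bias vector; the function name and the small `eps` constant are my own additions.

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-5):
    """AdaIN: y_scale * (x - mean(x)) / std(x) + y_bias, computed per feature map."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8, 8))   # 3 feature maps of size 8x8
y_scale = np.array([2.0, 0.5, 1.0])     # one scale per feature map
y_bias = np.array([1.0, -1.0, 0.0])     # one bias per feature map
styled = adain(features, y_scale, y_bias)
print(styled.mean(axis=(1, 2)), styled.std(axis=(1, 2)))
```

After the operation, each feature map has roughly the mean and standard deviation dictated by the style, which is why y needs two numbers per feature map.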

References:

Paper: https://arxiv.org/pdf/1812.04948.pdf

Github: https://github.com/NVlabs/stylegan

PixelRNN

PixelRNN is an example of the auto-regressive Generative Model.

In the era of social media, plenty of images are out there. But it’s extremely difficult to learn the distribution of natural images in an unsupervised setting. PixelRNN models the discrete probability distribution of the raw pixel values of an image and predicts the pixels sequentially along the two spatial dimensions.

RNNs are powerful at learning conditional distributions, and LSTMs in particular are good at learning long-term dependencies in a series of pixels. This formulation works in a progressive fashion where the model predicts the next pixel Xi+1 when all pixels X0 to Xi are provided.
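Written out, this progressive prediction corresponds to factorizing the joint distribution of an image with N pixels into a product of conditionals, using the same indexing from X0 as above:

```latex
p(\mathbf{x}) = \prod_{i=0}^{N-1} p\left(x_i \mid x_0, \ldots, x_{i-1}\right)
```

Each factor is the distribution of one pixel given all the pixels that come before it in the scanning order.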

Compared to GANs, auto-regressive models like PixelRNN learn an explicit data distribution, whereas GANs learn an implicit probability distribution. Because of that, a GAN doesn’t explicitly expose the probability distribution; it only allows us to sample observations from the learned distribution.

The Figure depicts the individual residual blocks of PixelRNN. The network is trained with several such layers stacked in depth. The input map to the PixelRNN LSTM layer has 2h features. The input-to-state component reduces the number of features by producing h features per gate. After applying the recurrent layer, the output map is upsampled back to 2h features per position via a 1 × 1 convolution and the input map is added to the output map.

[Source: https://arxiv.org/pdf/1601.06759.pdf]

References:

Paper: https://arxiv.org/pdf/1601.06759.pdf

Github: https://github.com/carpedm20/pixel-rnn-tensorflow

text-2-image

Generative Adversarial Networks are good at generating random images. As an example, a GAN which was trained on images of cats can generate random images of a cat having two eyes, two ears and whiskers. But the color pattern on the cat could be very random. So, random images are often not useful for solving business use cases. Asking a GAN to generate an image that matches our expectations is an extremely difficult task.

In this section, we will talk about a GAN architecture that made significant progress in generating meaningful images based on an explicit textual description. This GAN formulation takes a textual description as input and generates an RGB image that matches that description.

As an example, given “this flower has a lot of small round pink petals” as input, it will generate an image of a flower having round pink petals.

In this formulation, instead of giving only noise as input to the Generator, the textual description is first transformed into a text embedding, concatenated with noise vector and then given as input to the Generator.

As an example, the textual description has been transformed into a 256-dimensional embedding and concatenated with a 100-dimensional noise vector [which was sampled from a latent space which is usually a random Normal distribution].
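The shape bookkeeping of that example can be sketched as follows. The random vector standing in for the text embedding is purely a placeholder; in a real model it would come from a trained text encoder.

```python
import numpy as np

rng = np.random.default_rng(42)
text_embedding = rng.normal(size=256)  # placeholder for a learned caption embedding
noise = rng.normal(size=100)           # z sampled from a standard Normal latent space

# The Generator receives the caption information and the noise as one vector.
generator_input = np.concatenate([text_embedding, noise])
print(generator_input.shape)
```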

This formulation will help the Generator to generate images that are aligned with the input description instead of generating random images.

For the Discriminator, instead of having only the image as input, a pair of image and text embedding is sent as input. Output signals are either 0 or 1. Earlier, the Discriminator’s responsibility was just to predict whether a given image is real or fake.

Now, the Discriminator has one additional responsibility. Along with identifying whether the given image is real or fake, it also predicts the likelihood that the given image and text are aligned with each other.

This formulation forces the Generator not only to generate images that look real but also to generate images that are aligned with the input textual description.

To fulfill the purpose of the 2-fold responsibility of the Discriminator, during training time, a series of different (image, text) pairs are given as input to the model which are as follows:

1. Pair of (Real Image, Real Caption) as input and target variable is set to 1
2. Pair of (Wrong Image, Real Caption) as input and target variable is set to 0
3. Pair of (Fake Image, Real Caption) as input and target variable is set to 0

The pair of Real Image and Real Caption is given so that the model learns whether a given image and text pair are aligned with each other. Wrong Image, Real Caption means the image is not as described in the caption. In this case, the target variable is set to 0 so that the model learns that the given image and caption are not aligned. Here Fake Image means an image generated by the Generator. In this case, the target variable is set to 0 so that the Discriminator model can distinguish between real and fake images.
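The three kinds of training pairs can be sketched as a tiny helper that builds the Discriminator's targets for one caption. The function name and the file names are hypothetical; real code would pass image tensors and text embeddings instead of strings.

```python
def discriminator_targets(real_image, wrong_image, fake_image, caption):
    """Return the three (image, caption, target) examples used for one caption."""
    return [
        (real_image, caption, 1),   # real image that matches the caption
        (wrong_image, caption, 0),  # real image that does not match the caption
        (fake_image, caption, 0),   # image produced by the Generator
    ]

examples = discriminator_targets(
    "real.png", "other_real.png", "generated.png",
    "this flower has a lot of small round pink petals",
)
for image, caption, target in examples:
    print(image, target)
```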

Each image in the training dataset comes with 10 different textual descriptions that describe properties of the image.

The following are some of the results from a trained text-2-image model.

References:

Research Paper: https://arxiv.org/pdf/1605.05396.pdf

Github: https://github.com/paarthneekhara/text-to-image

DiscoGAN

In recent times, DiscoGAN became very popular because of its ability to learn cross-domain relations given unsupervised data.

For humans, cross-domain relations are very natural. Given images of two different domains, a human can figure out how they are related to each other. As an example, in the following figure, we have images from 2 different domains and just by one glance at these images, we can figure out very easily that they are related by the nature of their exterior color.

Now, building a Machine Learning model to figure out such a relation given unpaired images from 2 different domains is an extremely difficult task.

DiscoGAN has shown promising results in learning such a relation across 2 different domains.

The core concept of DiscoGAN is very similar to CycleGAN:

Both learn 2 individual transformation functions: one learns a transformation from domain X to domain Y whereas the other one learns the reverse mapping, and both use a reconstruction loss as a measure of how well the original image is reconstructed after being transformed twice across domains.
Both follow the principle that if we transform an image from domain1 to domain2 and then back to domain1 again, it should match the original image.
The primary difference between DiscoGAN and CycleGAN is that DiscoGAN uses two reconstruction losses, one for each domain, whereas CycleGAN uses a single cycle-consistency loss.

Figure: (a) Vanilla GAN (b) GAN with reconstruction loss (c) DiscoGAN architecture

Like CycleGAN, DiscoGAN is also built on the idea of a reconstruction loss. The idea is that when an image is transformed from one domain to another and then transformed back to the original domain, the generated image should be as close as possible to the original one. The quantitative difference between the two is the reconstruction loss, and during training the model tries to minimize this loss.

So, the model comprises two GAN networks called GAB and GBA. In the above figure, the model is trying to learn the cross-domain relation in terms of their direction. After the reconstruction of an image, the direction should be the same as the original one.

References:

Research Paper: https://arxiv.org/pdf/1703.05192.pdf

Github: https://github.com/SKTBrain/DiscoGAN

lsGAN

In recent times, Generative Adversarial Networks have demonstrated impressive performance for unsupervised tasks.

In a regular GAN, the discriminator uses the cross-entropy loss function, which sometimes leads to vanishing gradient problems. Instead, lsGAN proposes to use the least-squares loss function for the discriminator. This formulation yields higher-quality generated images.

Earlier, in vanilla GAN, we saw the min-max optimization formulation where the Discriminator is a binary classifier that uses the sigmoid cross-entropy loss during optimization.

As mentioned earlier, this formulation often causes vanishing gradient problems for data points which are on the correct side of the decision boundary but far away from the dense area. The Least Squares formulation addresses this issue, provides more stable learning and generates better images.

Following is the reformulated optimization formulation for lsGAN where:

a is the label for fake sample,
b is the label for real sample and
c denotes the value that the Generator wants the Discriminator to believe for a fake sample.
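With those labels, the two lsGAN objectives from the referenced paper can be written as follows, with D, G, Pdata and Pz as in the vanilla GAN section:

```latex
\begin{align}
\min_{D} V_{LSGAN}(D) &= \tfrac{1}{2}\, \mathbb{E}_{x \sim P_{data}}\left[(D(x) - b)^2\right]
  + \tfrac{1}{2}\, \mathbb{E}_{z \sim P_{z}}\left[(D(G(z)) - a)^2\right] \\
\min_{G} V_{LSGAN}(G) &= \tfrac{1}{2}\, \mathbb{E}_{z \sim P_{z}}\left[(D(G(z)) - c)^2\right]
\end{align}
```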

Now, we have 2 individual loss functions that are being optimized. One is being minimized with respect to the Discriminator and the other one is being minimized with respect to the Generator.

lsGAN has a big advantage over vanilla GAN. In vanilla GAN, as the Discriminator uses binary cross-entropy loss, the loss for an observation is close to 0 once it is well on the correct side of the decision boundary.

But in the case of lsGAN, the model penalizes an observation if it’s a long way from the decision boundary even if it’s on the correct side of the decision boundary.

This penalization forces the Generator to generate samples towards the decision boundary. Along with that, it also mitigates the vanishing gradient problem, as far-away points generate larger gradients while updating the Generator.
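A small numerical sketch makes the gradient argument tangible. It treats the Discriminator output for one generated sample as a single scalar d and compares the gradient the Generator receives under sigmoid cross-entropy with the one under least squares. The choice c = 1 and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_sigmoid_cross_entropy(d_logit):
    """Gradient of -log(sigmoid(d)) with respect to d (the Generator wants 'real')."""
    return -(1.0 - sigmoid(d_logit))

def grad_least_squares(d_out, c=1.0):
    """Gradient of 0.5 * (d - c)**2 with respect to d."""
    return d_out - c

# d grows as the sample moves further onto the 'real' side of the boundary.
for d in (0.0, 2.0, 8.0):
    print(d, grad_sigmoid_cross_entropy(d), grad_least_squares(d))
```

For the far-away point the cross-entropy gradient is almost zero, while the least-squares gradient keeps growing with the distance, pulling the sample back towards the boundary.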

References:

Research Paper: https://arxiv.org/pdf/1611.04076.pdf

Github: https://github.com/xudonmao/LSGAN

Final thoughts

One thing is common to the GAN architectures we have talked about. Each one of them is built on the principle of adversarial loss, and they all have a Generator and a Discriminator that compete to fool each other. (PixelRNN, as an auto-regressive model, is the exception.) GANs have shown tremendous success over the last few years and have become one of the most popular research topics in the machine learning research community. In the future, we will see a lot of progress in this domain.

The following Git repository has consolidated an extensive list of GAN papers.

https://github.com/hindupuravinash/the-gan-zoo


Image Segmentation: Tips and Tricks from 39 Kaggle Competitions

Jakub Czakon — Tue, 19 May 2020 13:39:22 +0000

This article was originally posted by Derrick Mwiti on the Neptune blog where you can find more in-depth articles for machine learning practitioners.

Imagine if you could get all the tips and tricks you need to hammer a Kaggle competition. I have gone over 39 Kaggle competitions, including:

Data Science Bowl 2017 – $1,000,000
Intel & MobileODT Cervical Cancer Screening – $100,000
2018 Data Science Bowl – $100,000
Airbus Ship Detection Challenge – $60,000
Planet: Understanding the Amazon from Space – $60,000
APTOS 2019 Blindness Detection – $50,000
Human Protein Atlas Image Classification – $37,000
SIIM-ACR Pneumothorax Segmentation – $30,000
Inclusive Images Challenge – $25,000

– and extracted that knowledge for you. Dig in.

External Data
Preprocessing
Data Augmentations
Modeling
Hardware Setups
Loss Functions
Training Tips
Evaluation and Cross-validation
Ensembling Methods
Post Processing

External Data

Use of the LUng Nodule Analysis (LUNA) Grand Challenge data because it contains detailed annotations from radiologists
Use of the LIDC-IDRI data because it had radiologist descriptions of each tumor that they found
Use Flickr CC, Wikipedia Commons datasets
Use Human Protein Atlas Dataset
Use IDRiD dataset

Data Exploration and Gaining insights

Clustering of 3d segmentation with the 0.5 threshold
Identify if there is a substantial difference in train/test label distributions

Preprocessing

Perform blob detection using the Difference of Gaussian (DoG) method, using the implementation available in the skimage package
Use of patch-based inputs for training in order to reduce the time of training
Use cudf for loading data instead of Pandas because it has a faster reader
Ensure that all the images have the same orientation
Apply contrast limited adaptive histogram equalization
Use OpenCV for all general image preprocessing
Employ automatic active learning and adding manual annotations
Resize all images to the same resolution in order to apply the same model to scans of different thicknesses
Convert scan images into normalized 3D numpy arrays
Apply single Image Haze Removal using Dark Channel Prior
Convert all data to Hounsfield units
Find duplicate images using pair-wise correlation on RGBY
Make labels more balanced by developing a sampler
Apply pseudo labeling to test data in order to improve score
Scale down images/masks to 320×480
Histogram equalization (CLAHE) with kernel size 32×32
Convert DCM to PNG
Calculate the md5 hash for each image when there are duplicate images
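Finding exact duplicates with md5 hashes needs only the standard library. The sketch below hashes raw bytes held in a dictionary; with real data you would read each file's bytes from disk instead. The function name and the toy byte strings are made up.

```python
import hashlib
from collections import defaultdict

def find_exact_duplicates(images):
    """Group image names by the MD5 hash of their raw bytes; return groups with >1 member."""
    by_hash = defaultdict(list)
    for name, raw_bytes in images.items():
        by_hash[hashlib.md5(raw_bytes).hexdigest()].append(name)
    return [names for names in by_hash.values() if len(names) > 1]

# Toy stand-ins for file contents.
images = {
    "a.png": b"\x89PNG-pixels-1",
    "b.png": b"\x89PNG-pixels-2",
    "a_copy.png": b"\x89PNG-pixels-1",
}
duplicates = find_exact_duplicates(images)
print(duplicates)
```

This only catches byte-identical files; near-duplicates (resized or re-encoded images) need a similarity measure such as the pair-wise correlation mentioned above.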

Data Augmentations

Use albumentations package for augmentations
Apply random rotation by 90 degrees
Use horizontal, vertical or both flips
Attempt heavy geometric transformations: Elastic Transform, PerspectiveTransform, Piecewise Affine transforms, pincushion distortion
Apply random HSV
Use of loss-less augmentation for generalization to prevent loss of useful image information
Apply channel shuffling
Do data augmentation based on class frequency
Apply gaussian noise
Use lossless permutations of 3D images for data augmentation
Rotate by a random angle from 0 to 45 degrees
Scale by a random factor from 0.8 to 1.2
Brightness changing
Randomly change hue, saturation and value
Apply D4 augmentations
Contrast limited adaptive histogram equalization
Use the AutoAugment augmentation strategy
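D4 augmentations are the 8 rotations and reflections of a square image, and they are lossless because no pixel values are interpolated. Here is a minimal NumPy sketch; `d4_transforms` is a name I chose, and for segmentation the same transform would also have to be applied to the mask.

```python
import numpy as np

def d4_transforms(image):
    """Return the 8 rotations/reflections of a square image (the dihedral group D4)."""
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k)          # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))   # mirrored version of that rotation
    return variants

image = np.arange(9).reshape(3, 3)  # toy 3x3 "image" with distinct pixel values
variants = d4_transforms(image)
print(len(variants))
```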

Modeling

Architectures

Use of a U-net based architecture. Adopted the concepts and applied them to 3D input tensors
Employing automatic active learning and adding manual annotations
The inception-ResNet v2 architecture for training features with different receptive fields
Siamese networks with adversarial training
ResNet50, Xception, Inception ResNet v2 x 5 with Dense (FC) layer as the final layer
Use of a global max-pooling layer which returns a fixed-length output no matter the input size
Use of stacked dilated convolutions
VoxelNet
Replace plus sign in LinkNet skip connections with concat and conv1x1
Generalized mean pooling
Keras NASNetLarge to train the model from scratch using 224x224x3
Use of the 3D convnet to slide over the images
Imagenet-pre-trained ResNet152 as the feature extractor
Replace the final fully-connected layers of ResNet by 3 fully connected layers with dropout
Use ConvTranspose in the decoder
Applying the VGG baseline architecture
Implementing the C3D network with adjusted receptive fields and a 64 unit bottleneck layer on the end of the network
Use of UNet type architectures with pre-trained weights to improve convergence and performance of binary segmentation on 8-bit RGB input images
LinkNet since it’s fast and memory efficient
MASKRCNN
BN-Inception
Fast Point R-CNN
Seresnext
UNet and Deeplabv3
Faster RCNN
SENet154
ResNet152
NASNet-A-Large
EfficientNetB4
ResNet101
GAPNet
PNASNet-5-Large
Densenet121
AC-GAN
XceptionNet (96), XceptionNet (299), Inception v3 (139), InceptionResNet v2 (299), DenseNet121 (224)
AlbuNet (resnet34) from ternausnets
SpaceNet
Resnet50 from selim_sef SpaceNet 4
SCSEUnet (seresnext50) from selim_sef SpaceNet 4
A custom Unet and Linknet architecture
FPNetResNet50 (5 folds)
FPNetResNet101 (5 folds)
FPNetResNet101 (7 folds with different seeds)
PANetDilatedResNet34 (4 folds)
PANetResNet50 (4 folds)
EMANetResNet101 (2 folds)
RetinaNet
Deformable R-FCN
Deformable Relation Networks

Hardware Setups

Loss Functions

Dice Coefficient because it works well with imbalanced data
Weighted boundary loss whose aim is to reduce the distance between the predicted segmentation and the ground truth
MultiLabelSoftMarginLoss that creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input and target
Balanced cross entropy (BCE) with logit loss that involves weighting the positive and negative examples by a certain coefficient
Lovasz that performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses
FocalLoss + Lovasz obtained by summing the Focal and Lovasz losses
Arc margin loss that incorporates margin in order to maximise face class separability
Npairs loss that computes the npairs loss between y_true and y_pred.
A combination of BCE and Dice loss functions
LSEP – a pairwise ranking loss that is smooth everywhere and thus is easier to optimize
Center loss that simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers
Ring Loss that augments standard loss functions such as Softmax
Hard triplet loss that trains a network to embed features of the same class close together while maximizing the embedding distance between different classes
1 + BCE – Dice that involves subtracting the BCE and DICE losses then adding 1
Binary cross-entropy – log(dice) that is the binary cross-entropy minus the log of the dice loss
Combinations of BCE, dice and focal
Lovasz loss that performs direct optimization of the mean intersection-over-union loss
BCE + DICE – the Dice loss is obtained by calculating the smooth dice coefficient function
Focal loss with Gamma 2 that is an improvement to the standard cross-entropy criterion
BCE + DICE + Focal – this is basically a summation of the three loss functions
Active Contour Loss that incorporates the area and size information and integrates the information in a dense deep learning model
1024 * BCE(results, masks) + BCE(cls, cls_target)
Focal + kappa – Kappa is a loss function for multi-class classification of ordinal data in deep learning. In this case we sum it and the focal loss
ArcFaceLoss — Additive Angular Margin Loss for Deep Face Recognition
soft Dice trained on positives only – Soft Dice uses predicted probabilities
2.7 * BCE(pred_mask, gt_mask) + 0.9 * DICE(pred_mask, gt_mask) + 0.1 * BCE(pred_empty, gt_empty) which is a custom loss used by the Kaggler
nn.SmoothL1Loss() that creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise
Use of the Mean Squared Error objective function in scenarios where it seems to work better than binary-cross entropy objective function.
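
Several items above combine BCE with Dice. As a rough sketch of how such a combination can be computed, here is BCE + (1 - Dice) on predicted probabilities, which is the same quantity as the "1 + BCE – Dice" item. The function names and the smoothing constant are assumptions made for illustration, not taken from any specific solution.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy for predicted probabilities in (0, 1)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def soft_dice(pred, target, smooth=1.0):
    """Smoothed Dice coefficient computed on probabilities (1.0 = perfect overlap)."""
    intersection = np.sum(pred * target)
    return float((2 * intersection + smooth) / (np.sum(pred) + np.sum(target) + smooth))

def bce_dice_loss(pred, target):
    """BCE + (1 - Dice): both terms shrink as the prediction improves."""
    return bce(pred, target) + (1 - soft_dice(pred, target))

target = np.array([[0, 1], [1, 1]], dtype=float)
good = np.array([[0.1, 0.9], [0.8, 0.9]])   # close to the target mask
bad = np.array([[0.9, 0.1], [0.2, 0.1]])    # mostly wrong
```

Weighted variants such as the 2.7 * BCE + 0.9 * DICE item only differ in the coefficients placed in front of each term.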

Training tips

Try different learning rates
Try different batch sizes
Use SGD with momentum and manual learning rate scheduling
Too much augmentation will reduce the accuracy
Train on image crops and predict on full images
Use Keras’s ReduceLROnPlateau() to reduce the learning rate
Train without augmentation until plateau then apply soft and hard augmentation to some epochs
Freeze all layers except the last one and use 1000 images from Stage1 for tuning
Make labels more balanced by developing a sampler
Use of class aware sampling
Use dropout and augmentation while tuning the last layer
Pseudo Labeling to improve score
Use Adam reducing LR on plateau with patience 2–4
Use Cyclic LR with SGD
Reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs
Repeat the worst batch out of 10 batches
Train with default UNET
Overlap tiles so that each edge pixel is covered twice
Hyperparameter tuning: learning rate on training, non-maximum suppression and score threshold on inference
Remove bounding boxes with a low confidence score
Train different convolutional neural networks then build an ensemble
Stop training when the F1 score is decreasing
Differential learning rate with gradual reducing
Train ANNs in a stacking way using 5 folds and 30 repeats (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge/discussion/48207)
Keep track of your experiments using Neptune.
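
The "reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs" tip can be sketched without any framework. The helper below is hypothetical and simply replays a list of validation losses; the starting rate of 0.01 is an arbitrary choice for illustration.

```python
def reduce_on_plateau(val_losses, lr=0.01, factor=0.5, patience=2):
    """Replay per-epoch validation losses and return the learning rate used in each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr *= factor      # halve the rate after `patience` epochs without improvement
                bad_epochs = 0
    return history

schedule = reduce_on_plateau([1.0, 0.8, 0.85, 0.9, 0.7, 0.75, 0.72, 0.71])
```

In a real training loop, a built-in callback or scheduler would normally play this role; the sketch only shows the bookkeeping behind it.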

Evaluation and cross-validation

Split on non-uniform stratified by classes
Avoid overfitting by applying cross-validation while tuning the last layer
10-fold CV ensemble for classification
Combination of 5 10-fold CV ensembles for detection
Sklearn’s stratified K fold function
5 KFold Cross-Validation
Adversarial Validation & Weighting
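
The idea behind stratified K-fold splitting can be shown in plain Python: every fold should receive roughly the same share of each class. This is a simplified sketch (no shuffling, hypothetical function name), not scikit-learn's implementation.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold so that every class is spread evenly.

    Indices of each class are dealt out to the folds in round-robin order.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

labels = [0] * 20 + [1] * 5   # imbalanced toy labels
folds = stratified_folds(labels, n_folds=5)
```

With these toy labels, each of the five folds ends up with four samples of class 0 and one of class 1, mirroring the 80/20 ratio of the full set.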

Ensembling methods

Use simple majority voting for ensemble
XGBoost on the max malignancy at 3 zoom levels, the z-location and the amount of strange tissue
LightGBM for models with too many classes. This was done for raw data features only.
CatBoost for a second-layer model
Training with 7 features for the gradient boosting classifier
Use ‘curriculum learning’ to speed up model training. In this technique, models are first trained on simple samples then progressively moving to hard ones.
Ensemble with ResNet50, InceptionV3, and InceptionResNetV2
Ensemble method for object detection
An ensemble of Mask RCNN, YOLOv3, and Faster RCNN architectures with a classification network (DenseNet-121 architecture)
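
Simple majority voting, the first item in this list, needs very little code. The sketch below assumes each model outputs one class label per sample; the model names and labels are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models, sample by sample.

    `predictions` holds one list of predicted labels per model.
    Ties go to the label that appears first among the models.
    """
    voted = []
    for sample_votes in zip(*predictions):
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

model_a = ["cat", "dog", "dog", "cat"]
model_b = ["cat", "cat", "dog", "dog"]
model_c = ["dog", "cat", "dog", "cat"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Using an odd number of models avoids ties in binary problems.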

Post Processing

Apply test time augmentation: present an image to a model several times with different random transformations and average the predictions you get
Equalize test prediction probabilities instead of only using predicted classes
Apply geometric mean to the predictions
Overlap tiles during inferencing so that each edge pixel is covered at least thrice because UNET tends to have bad predictions around edge areas.
Non-maximum suppression and bounding box shrinkage
Watershed post processing to detach objects in instance segmentation problems.
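
Test time augmentation for segmentation has one detail that is easy to miss: each predicted mask has to be transformed back before averaging. The sketch below uses a fixed set of flips rather than random transformations, and `predict` is a hypothetical function that maps an (H, W) image to an (H, W) probability mask.

```python
import numpy as np

def predict_with_tta(predict, image):
    """Average predictions over the original image and its horizontal/vertical flips.

    Each prediction is flipped back before averaging so that all masks
    are aligned with the original image.
    """
    flips = [
        (lambda x: x, lambda y: y),   # identity
        (np.fliplr, np.fliplr),       # horizontal flip and its inverse
        (np.flipud, np.flipud),       # vertical flip and its inverse
    ]
    masks = [undo(predict(apply(image))) for apply, undo in flips]
    return np.mean(masks, axis=0)

# toy "model": a per-pixel threshold, which behaves identically under flips
def toy_predict(image):
    return (image > 0.5).astype(float)

image = np.array([[0.9, 0.1], [0.2, 0.8]])
mask = predict_with_tta(toy_predict, image)
```

With a real network the three predictions differ slightly, and the average is usually smoother than any single one. Replacing the arithmetic mean with a geometric mean gives the variant listed above.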

Final Thoughts

Hopefully, this article gave you some background on image segmentation tips and tricks, as well as some tools and frameworks that you can use to start competing.

We’ve covered tips on:

architectures,
training tricks,
losses,
pre-processing,
post-processing,
ensembling,
tools and frameworks.

If you want to go deeper down the rabbit hole, simply follow the links and see how the best image segmentation models are built.

Happy segmenting!

Image segmentation in 2020: Architectures, Losses, Datasets, and Frameworks

Jakub Czakon — Mon, 11 May 2020 10:23:59 +0000

This article was originally posted by Derrick Mwiti on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

In this piece, we’ll take a plunge into the world of image segmentation using deep learning. We’ll talk about:

what image segmentation is and the two main types of image segmentation
Image segmentation architectures
Loss functions used in image segmentation
Frameworks that you can use for your image segmentation projects

Let's dive in.

What is Image Segmentation?

As the term suggests this is the process of dividing an image into multiple segments. In this process, every pixel in the image is associated with an object type. There are two major types of image segmentation - semantic segmentation and instance segmentation.
In semantic segmentation, all objects of the same type are marked using one class label while in instance segmentation similar objects get their own separate labels.

Image Segmentation Architectures

The basic architecture in image segmentation consists of an encoder and a decoder.

The encoder extracts features from the image through filters. The decoder is responsible for generating the final output which is usually a segmentation mask containing the outline of the object. Most segmentation models follow this design or a variant of it.
Let's look at a couple.

U-Net

U-Net is a convolutional neural network originally developed for segmenting biomedical images. When visualized its architecture looks like the letter U and hence the name U-Net. Its architecture is made up of two parts, the left part - the contracting path and the right part - the expansive path. The purpose of the contracting path is to capture context while the role of the expansive path is to aid in precise localization.

The contracting path repeatedly applies two three-by-three convolutions, each followed by a rectified linear unit (ReLU), and then a two-by-two max-pooling operation for downsampling.
U-Net's full implementation can be found here.
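
To see what the contracting path does to the spatial size of the input, the arithmetic can be traced step by step. This sketch assumes the setup of the original U-Net paper: unpadded 3x3 convolutions, four downsampling steps, and a 572x572 input tile. Many modern implementations use padded convolutions instead, in which case only the pooling changes the size.

```python
def contracting_path_sizes(size=572, depth=4):
    """Trace the spatial size of a square input through a U-Net-style contracting path.

    Each step applies two unpadded 3x3 convolutions (each trims 2 pixels)
    followed by a 2x2 max-pooling with stride 2 (halves the size).
    """
    sizes = []
    for _ in range(depth):
        size -= 2            # first 3x3 convolution, no padding
        size -= 2            # second 3x3 convolution, no padding
        sizes.append(size)   # feature-map size handed to the expansive path
        size //= 2           # 2x2 max-pooling
    return sizes, size

skip_sizes, bottleneck_input = contracting_path_sizes()
```

Here `skip_sizes` holds the feature-map size at each level before pooling, and `bottleneck_input` is the size that enters the lowest level of the U.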

FastFCN - Fast Fully Convolutional Network

In this architecture, a Joint Pyramid Upsampling (JPU) module is used to replace dilated convolutions since they consume a lot of memory and time. It uses a fully convolutional network at its core while applying JPU for upsampling. JPU upsamples the low-resolution feature maps to high-resolution feature maps.

Gated-SCNN

This architecture consists of a two-stream CNN architecture. In this model, a separate branch is used to process image shape information. The shape stream is used to process boundary information.

You can implement it by checking out the code here.

DeepLab

In this architecture, convolutions with upsampled filters (atrous convolutions) are used for tasks that involve dense prediction. Segmentation of objects at multiple scales is done via atrous spatial pyramid pooling. Finally, the localization of object boundaries is improved by combining DCNNs with fully connected conditional random fields. Atrous convolution is achieved by upsampling the filters through the insertion of zeros or sparse sampling of input feature maps.
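
The two ways of achieving atrous convolution mentioned above, inserting zeros into the filter and sparsely sampling the input, give the same result. The 1D sketch below checks this on a toy signal; the function names are hypothetical, and "convolution" is meant in the deep learning sense (no kernel flipping, i.e. cross-correlation).

```python
import numpy as np

def upsample_filter(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring filter taps."""
    dilated = np.zeros((len(kernel) - 1) * rate + 1)
    dilated[::rate] = kernel
    return dilated

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D atrous convolution: read the input with a gap of `rate` between taps."""
    span = (len(kernel) - 1) * rate + 1
    return np.array([
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ])

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])

sparse = atrous_conv1d(signal, kernel, rate=2)                          # sparse sampling of the input
dense = np.correlate(signal, upsample_filter(kernel, 2), mode="valid")  # zero-inserted filter
```

With rate 2, the three-tap filter covers five input positions without adding any parameters, which is how atrous convolutions enlarge the receptive field.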

You can try its implementation on either PyTorch or TensorFlow.

Mask R-CNN

In this architecture, objects are classified and localized using a bounding box and semantic segmentation that classifies each pixel into a set of categories. Every region of interest gets a segmentation mask. A class label and a bounding box are produced as the final output. The architecture is an extension of the Faster R-CNN. The Faster R-CNN is made up of a deep convolutional network that proposes the regions and a detector that utilizes the regions.

Here is an image of the result obtained on the COCO test set.

Image Segmentation Loss functions

Semantic segmentation models usually use a simple categorical cross-entropy loss function during training. However, if you are interested in getting the granular information of an image, then you have to resort to slightly more advanced loss functions.
Let's go through a couple of them.

Focal Loss

This loss is an improvement to the standard cross-entropy criterion. This is done by changing its shape such that the loss assigned to well-classified examples is down-weighted. Ultimately, this reduces the impact of class imbalance, since the many easy examples no longer dominate the loss. In this loss function, the cross-entropy loss is scaled by a factor that decays to zero as the confidence in the correct class increases. The scaling factor automatically down-weights the contribution of easy examples at training time and focuses on the hard ones.
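
A small numeric sketch makes the scaling factor tangible. It assumes the common form of the focal loss, -(1 - p)^gamma * log(p), where p is the predicted probability of the true class, and leaves out the optional class-balancing weight. The probabilities 0.95 and 0.3 are arbitrary illustrative values.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the predicted probability of its true class.

    (1 - p_true) ** gamma is the scaling factor: it tends to 0 as the model
    becomes confident in the correct class. gamma = 0 recovers cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def cross_entropy(p_true):
    return -math.log(p_true)

easy, hard = 0.95, 0.3   # predicted probability of the correct class
easy_ratio = focal_loss(easy) / cross_entropy(easy)   # scaling factor (1 - 0.95) ** 2
hard_ratio = focal_loss(hard) / cross_entropy(hard)   # scaling factor (1 - 0.3) ** 2
```

The easy example keeps only a tiny fraction of its cross-entropy loss, while the hard example keeps about half of it, so hard examples dominate the gradient.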

Dice loss

This loss is obtained by calculating a smooth Dice coefficient function. It is one of the most commonly used losses in segmentation problems.

Intersection over Union (IoU)-balanced Loss

The IoU-balanced classification loss aims at increasing the gradient of samples with high IoU and decreasing the gradient of samples with low IoU. In this way, the localization accuracy of machine learning models is increased.

Boundary loss

One variant of the boundary loss is applied to tasks with highly unbalanced segmentations. This loss takes the form of a distance metric on the space of contours rather than regions. In this manner, it tackles the problem posed by regional losses for highly imbalanced segmentation tasks.

Weighted cross-entropy

In one variant of cross-entropy, all positive examples are weighted by a certain coefficient. It is used in scenarios that involve class imbalance.

Lovász-Softmax loss

This loss performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses.

Other losses worth mentioning are:

TopK loss whose aim is to ensure that networks concentrate on hard samples during the training process.
Distance penalized CE loss that directs the network to boundary regions that are hard to segment.
Sensitivity-Specificity (SS) loss that computes the weighted sum of the mean squared difference of specificity and sensitivity.
Hausdorff distance (HD) loss that estimates the Hausdorff distance from a convolutional neural network.

These are just a couple of loss functions used in image segmentation. To explore many more check out this repo.

Image Segmentation Datasets

If you are still here, chances are that you might be asking yourself where you can get some datasets to get started.
Let's look at a few.

Common Objects in COntext - Coco Dataset

COCO is a large-scale object detection, segmentation, and captioning dataset. The dataset contains 91 classes. It has 250,000 people with key points. Its download size is 37.57 GiB. It contains 80 object categories. It is available under the Apache 2.0 License and can be downloaded from here.

PASCAL Visual Object Classes (PASCAL VOC)

PASCAL has 9963 images with 20 different classes. The training/validation set is a 2GB tar file. The dataset can be downloaded from the official website.

The Cityscapes Dataset

This dataset contains images of city scenes. It can be used to evaluate the performance of vision algorithms in urban scenarios. The dataset can be downloaded from here.

The Cambridge-driving Labeled Video Database - CamVid

This is a motion-based segmentation and recognition dataset. It contains 32 semantic classes. This link contains further explanations and download links to the dataset.

Image Segmentation Frameworks

Now that you are armed with possible datasets, let's mention a few tools/frameworks that you can use to get started.

FastAI library - given an image this library is able to create a mask of the objects in the image.
Sefexa Image Segmentation Tool - Sefexa is a free tool that can be used for Semi-automatic image segmentation, analysis of images, and creation of ground truth
Deepmask - Deepmask by Facebook Research is a Torch implementation of DeepMask and SharpMask
MultiPath - This is a Torch implementation of the object detection network from "A MultiPath Network for Object Detection".
OpenCV - This is an open-source computer vision library with over 2500 optimized algorithms.
MIScnn - This is an open-source medical image segmentation library. It allows setting up pipelines with state-of-the-art convolutional neural networks and deep learning models in a few lines of code.
Fritz - Fritz offers several computer vision tools including image segmentation tools for mobile devices.

Final Thoughts

Hopefully, this article gave you some background on image segmentation, as well as some tools and frameworks that you can use to get started.

We’ve covered:

what image segmentation is,
a couple of image segmentation architectures,
some image segmentation losses,
image segmentation tools and frameworks.

For more information check out the links attached to each of the architectures and frameworks.

Happy segmenting!


Document Classification: 7 pragmatic approaches for small datasets

Jakub Czakon — Thu, 30 Apr 2020 07:59:25 +0000

This article was originally posted by Shahul ES on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Document or text classification is one of the predominant tasks in Natural language processing. It has many applications including news type classification, spam filtering, toxic comment identification, etc.

In big organizations, datasets are large and training deep learning text classification models from scratch is feasible. For the majority of real-life problems, however, your dataset is small, and if you want to build a machine learning model you need to be smart about it.

In this article, I will talk about pragmatic approaches towards text representation which make document classification on small datasets doable.

Text Classification 101

The text classification workflow begins by cleaning and preparing the corpus out of the dataset. Then this corpus is represented by any of the different text representation methods which are then followed by modeling.

In this article, we will focus on the “Text Representation” step of this pipeline.

Example text classification dataset

We will use the data from the Real or Not? NLP with Disaster Tweets Kaggle competition. Here, the task is to predict which tweets are about real disasters and which ones are not.

If you want to follow the article step-by-step you may want to install all the libraries that I used for the analysis.

Let’s take a look at our data,

import pandas as pd

tweet= pd.read_csv('../input/nlp-getting-started/train.csv')
test=pd.read_csv('../input/nlp-getting-started/test.csv')

tweet.head(3)

The data consists of id, keyword, location, text, and target, which is binary. We will only use the tweet text to predict the target.

print('There are {} rows and {} columns in train'.format(tweet.shape[0],tweet.shape[1]))
print('There are {} rows and {} columns in test'.format(test.shape[0],test.shape[1]))

The training dataset has fewer than 8000 tweets. That, combined with the fact that tweets are 280 characters at most, makes it a tricky, small(ish) dataset.

Text data preparation

Before we get into any NLP task, we need to do some data preprocessing and basic cleaning. It is not a focus of this article but if you want to read more about this step check out this article.

In short, we will:

Tokenize: the process by which sentences are converted to a list of tokens or words.
Remove stopwords: drop words like ‘a’ or ‘the’
Lemmatize: reduce the inflectional forms of each word into a common base or root (“studies”, “studying” -> “study”).

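from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt', 'stopwords' and 'wordnet' resources (nltk.download)
stop = set(stopwords.words('english'))

def preprocess_news(df, text_col="text"):
    '''Function to preprocess and create corpus'''
    new_corpus = []

    lem = WordNetLemmatizer()
    for text in df[text_col]:
        # tokenize and drop stopwords
        words = [w for w in word_tokenize(text) if w not in stop]
        # reduce each word to its base form
        words = [lem.lemmatize(w) for w in words]

        new_corpus.append(words)
    return new_corpus

# `tweet` is the training DataFrame loaded earlier
corpus = preprocess_news(tweet)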

Now, let’s see how to represent this corpus so that we can feed this into any machine learning algorithm.

Text Representation

Text cannot be used directly as input to a machine learning model but needs to be represented in the numeric format first. This is known as text representation.

Countvectorizer

Countvectorizer provides an easy method to vectorize and represent a collection of text documents. It tokenizes the input text and builds a vocabulary of known words and then represents the documents using this vocabulary.

Let’s understand it by using an example,

from sklearn.feature_extraction.text import CountVectorizer

text = ["She sells seashells in the seashore"]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

You can see that CountVectorizer has built a vocabulary out of the given text and then represented the text as a SciPy sparse matrix of word counts. We can transform another text using this vocabulary and observe the output to get a better understanding.

vector=vectorizer.transform(["I sell seashells in the seashore"])
vector.toarray()

You can see that:

The index positions 3 and 4 hold zeroes, meaning that the vocabulary words at those positions do not occur in the new sentence, while all other positions hold 1, meaning that those vocabulary words do occur. The two missing words are “sells” and “she”. Words that are not in the vocabulary at all, such as “sell”, are simply ignored.

Now that you understand how CountVectorizer works, we can fit and transform our corpus using it.

vec=CountVectorizer(max_df=10,max_features=10000)
vec.fit(tweet.text.values)
vector=vec.transform(tweet.text.values)

CountVectorizer has a few important parameters that you should adjust to your problem:

max_features: build a vocabulary that only considers the top n tokens ordered by term frequency across the corpus.
min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
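
To illustrate how these three parameters interact, here is a plain-Python sketch of the filtering idea. It is not scikit-learn's implementation: `build_vocabulary` is a hypothetical helper, it tokenizes on whitespace only, and it treats min_df and max_df as absolute document counts.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=None, max_features=None):
    """Pick vocabulary terms using document-frequency and size limits.

    A term's document frequency is the number of documents that contain
    it at least once; its term frequency is the total number of occurrences.
    """
    doc_freq = Counter()
    term_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [t for t in doc_freq
            if doc_freq[t] >= min_df and (max_df is None or doc_freq[t] <= max_df)]
    # max_features keeps the most frequent of the remaining terms
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = ["the storm hit the coast", "the coast is clear", "the storm warning"]
vocab = build_vocabulary(docs, min_df=2, max_df=2)
```

In this toy corpus, "the" is dropped by max_df because it occurs in all three documents, and words that occur in a single document are dropped by min_df.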

What usually helps with selecting reasonable values (or ranges for hyperparameter optimization methods) is good exploratory data analysis. Check out my other article to read about it.

TfidfVectorizer

One issue with CountVectorizer is that common words like “the” will appear many times (unless you remove them at the preprocessing stage) and these words are not actually important. One popular alternative is TfidfVectorizer. TF-IDF stands for term frequency-inverse document frequency.

Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear a lot across documents.

Let’s look at an example:

from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["She sells seashells by the seashore","The sea.","The seashore"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

The vocabulary now consists of 7 words and the inverse document frequency is calculated for each word, assigning the lowest score to “the”, which appears in all 3 documents.

Each document vector is then normalized to unit length, so the scores lie between 0 and 1, and this text representation can be used as input into any machine learning model.
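
The inverse document frequencies can also be worked out by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The documents are lowercased and stripped of punctuation by hand to keep the tokenization trivial.

```python
import math

docs = [
    "she sells seashells by the seashore",
    "the sea",
    "the seashore",
]

def smoothed_idf(term, docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 (smoothed inverse document frequency)."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

idf_the = smoothed_idf("the", docs)            # appears in all 3 documents
idf_seashore = smoothed_idf("seashore", docs)  # appears in 2 documents
idf_sells = smoothed_idf("sells", docs)        # appears in 1 document
```

The rarer a word is across documents, the higher its idf, which is why “the” receives the lowest weight.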

Word2vec

The big issue with the above approaches is that the context of the word is lost when representing it. Word embeddings provide a much better representation of the words in NLP by encoding some context information. It provides a mapping from a word to a corresponding n-dimensional vector.

Word2Vec was developed at Google by Tomas Mikolov, et al. and uses a shallow neural network to learn word embeddings. The vectors are learned by understanding the context in which the word occurs. Specifically, it looks at co-occurring words.

Given below is the co-occurrence matrix for the sentence “The cat sat on the mat”.
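
A co-occurrence matrix for this sentence can be computed in a few lines. The sketch below assumes a window of one word on each side, which is our choice for illustration; `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=1):
    """Count how often each ordered pair of words appears within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".lower().split()
counts = cooccurrence_counts(tokens, window=1)

vocab = sorted(set(tokens))   # ['cat', 'mat', 'on', 'sat', 'the']
# rows and columns follow the alphabetical order in `vocab`
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]
```

Each row describes a word through its neighbours: "cat" and "on" get identical rows here because both appear next to "sat" and "the".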

Word2vec is composed of two different models:

Continuous Bag of Words (CBOW) model can be thought of as learning word embeddings by training a model to predict a word given its context.
Skip-Gram model is the opposite, learning word embeddings by training a model to predict context given a word.

The basic idea of word embeddings is that words that occur in similar contexts tend to be closer to each other in vector space. Let’s check how to implement word2vec in Python.

from gensim.models import Word2Vec

# vector_size is called `size` in gensim versions before 4.0
model = Word2Vec(corpus,
                 min_count=1, vector_size=100, window=5)

Now that you have created your word2vec model, here are some of the important parameters that you can change to observe the differences:

vector_size (size in gensim versions before 4.0): the dimensionality of the resulting vector for each word.
min_count: when building the vocabulary, ignore words whose total frequency is lower than the given threshold.
window: the number of surrounding words considered when building the representation, also known as the window size.

In this article, we focus on pragmatic approaches for small datasets, so we will use pre-trained word vectors instead of training vectors on our own corpus. With a small corpus this usually yields better performance.

First, you will have to download the trained vectors from here. Then you can load the vectors using gensim.

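from gensim.models import KeyedVectors

def load_word2vec():
    # load the pre-trained Google News vectors (binary word2vec format)
    word2vec = KeyedVectors.load_word2vec_format(
        '../input/word2vec-google/GoogleNews-vectors-negative300.bin',
        binary=True, unicode_errors='ignore')
    # build a plain dict mapping each word to its vector
    # (gensim >= 4.0 exposes the vocabulary as key_to_index; older versions use .vocab)
    embeddings_index = dict()
    for word in word2vec.key_to_index:
        embeddings_index[word] = word2vec[word]

    return embeddings_index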

Let’s check the embedding,

w2v_model=load_word2vec()
w2v_model['London'].shape

You can see that the word is represented using a 300-dimensional vector. So every word in your corpus can be represented like this and this embedding matrix is used to train your model.

FastText

Now, let’s learn about fastText which is an extremely useful module available in gensim. FastText has been developed by Facebook and yields great performance and speed in text classification tasks.

It supports both Continuous Bag of Words and Skip-Gram models. The main difference between the previous models and FastText is that it breaks each word into several character n-grams.

Let’s take the word orange for example.

The trigrams of the word orange are ora, ran, ang, nge (ignoring the starting and ending boundaries of the word).

The word embedding vector (text representation) for orange will be the sum of the vectors of these n-grams. Rare words or typos can now be properly represented since it is highly likely that some of their n-grams also appear in other words.

For example, for a word like stupedofantabulouslyfantastic, which might never have been in any corpus, a word-level model such as word2vec has no learned vector, so the usual fallback is one of two options: a zero vector or a random vector with low magnitude.

FastText, however, can produce better vectors by breaking the word into chunks and using the vectors for those chunks to create a final vector for the word. In this particular case, the final vector might be closer to the vectors of fantastic and fantabulous.
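
The n-gram splitting is easy to reproduce. The sketch below fixes n = 3 for simplicity (fastText itself works with a range of n-gram lengths) and uses '<' and '>' as word boundary markers; `char_ngrams` is a hypothetical helper, not part of the fastText or gensim API.

```python
def char_ngrams(word, n=3, boundaries=True):
    """Return the character n-grams of a word, optionally with '<' and '>' boundary markers."""
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]

plain = char_ngrams("orange", boundaries=False)
marked = char_ngrams("orange")

# n-grams that the made-up word shares with a known word
shared = set(char_ngrams("fantastic")) & set(char_ngrams("stupedofantabulouslyfantastic"))
```

Because the unseen word shares most of its trigrams with "fantastic", a vector built from n-gram vectors lands near the vector of that known word.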

Again, we will use a pre-trained model rather than training our own word embeddings.

For this, you can download pre-trained vectors from here.

Each line of this file contains a word and its corresponding n-dimensional vector. We will create a dictionary using this file for mapping each word to its vector representation.

import numpy as np
from tqdm import tqdm

def load_fasttext():

    print('loading word embeddings...')
    embeddings_index = {}
    with open('../input/fasttext/wiki.simple.vec', encoding='utf-8') as f:
        next(f)  # skip the header line (vocabulary size and vector dimension)
        for line in tqdm(f):
            values = line.strip().rsplit(' ')
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('found %s word vectors' % len(embeddings_index))

    return embeddings_index

embeddings_index=load_fasttext()

Let’s check the embedding for a word,

embeddings_index['london'].shape

GloVe (Global Vectors for Word Representation)

GloVe stands for global vectors for word representation. It is an unsupervised learning algorithm developed by Stanford. The basic idea of GloVe is to derive a semantic relationship between words using a co-occurrence matrix. The idea is very similar to word2vec but there are slight differences. Go here to read more.

For this, we will use pre-trained GloVe vectors which are trained on large corpora. On a small dataset this tends to perform better than training vectors from scratch. You can download them from here.

After downloading we can load our pre-trained word model. Before that, you should understand the format in which it is made available. Each line contains a word and its corresponding n-dimensional vector representation. Like this,

So, to use this you should first prepare a dictionary that contains the mapping between word and corresponding vector. This can be called an embedding dictionary.

Let’s create one for our purpose.

import numpy as np

def load_glove():
    embedding_dict = {}
    path = '../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt'
    # the with-block closes the file automatically
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], 'float32')
            embedding_dict[word] = vectors

    return embedding_dict


embeddings_index = load_glove()

Now we have a dictionary mapping every word in the pre-trained GloVe vocabulary to its corresponding vector. Let’s check the embedding for a word.

embeddings_index['london'].shape

Universal Sentence Encoding

Until now we were dealing with representing words, and these techniques are most useful for word-level operations. Sometimes we need sentence-level operations instead. Models that produce such representations are called sentence encoders.

A good sentence encoder is expected to encode sentences in such a way that the vectors of similar sentences have a minimal distance between them in the vector space.

For example,

It is sunny today
It is rainy today
It is cloudy today.

These sentences will be encoded and represented so that they are close to each other in the vector space.
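
Closeness in the vector space is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real sentence embeddings have hundreds of dimensions and would come from the encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up "sentence embeddings", for illustration only
sunny = [0.9, 0.1, 0.2]      # "It is sunny today"
rainy = [0.8, 0.2, 0.3]      # "It is rainy today"
invoice = [0.1, 0.9, -0.4]   # an unrelated sentence

sim_weather = cosine_similarity(sunny, rainy)
sim_unrelated = cosine_similarity(sunny, invoice)
```

A good encoder should give the two weather sentences a similarity close to 1 and the unrelated pair a much lower value.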

Let’s move on and check how to implement universal sentence encoder and find similar sentences using it.

You can download the pre-trained model from here.

We will load the module using the TensorFlow hub.

import tensorflow_hub as hub

module_url = "../input/universalsentenceencoderlarge4"
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.load(module_url)

Next, we will create the embedding for each sentence in our list.

sentence_list=tweet.text.values.tolist()
sentence_emb=embed(sentence_list)['outputs'].numpy()

Here is an article to read more about universal sentence encoder.

Elmo, BERT, and others.

When using any of the above embedding methods one thing we forget about is the context in which the word was used. This is one of the main drawbacks of such word representation models.

For example, the word “stick” will be represented using the same vector regardless of the context in which it was used, which doesn’t make much sense. With the recent developments in the field of NLP and models like BERT (Bidirectional Encoder Representations from Transformers), context-dependent word representations have become possible. Here is an article to read more.

Text Classification

In this section, we will prepare the embedding matrix which is passed to the Keras Embedding layer to learn text representations. You can use the same steps to prepare the corpus for any word-level embedding methods.

Let’s create a word index, fix a maximum sentence length, and pad each sentence in our corpus using the Keras Tokenizer and pad_sequences.

MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences=tokenizer_obj.texts_to_sequences(corpus)

tweet_pad=pad_sequences(sequences,
                        maxlen=MAX_LEN,
                        truncating='post',
                        padding='post')
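To make the padding step concrete, here is a plain-Python sketch of what padding='post' and truncating='post' do to lists of token ids. The helper pad_post is hypothetical and only mimics the behavior described above. It is not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end and right-pad shorter ones with value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # truncating='post' drops trailing tokens
        seq = seq + [value] * (maxlen - len(seq))   # padding='post' appends the fill value
        padded.append(seq)
    return padded

# Toy token-id sequences of different lengths (made-up values).
toy_sequences = [[4, 7, 2], [5], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

Every row ends up with exactly maxlen entries, which is what lets the corpus be stacked into a single 2D array for the network.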

Let’s check the number of unique words in our corpus,

word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))

Using this word index dictionary and embedding dictionary you can create an embedding matrix for our corpus. This embedding matrix is passed on to the embedding layer of the neural network to learn word representations.

def prepare_matrix(embedding_dict, emb_size=300):
    # Tokenizer indices start at 1, so reserve row 0 for padding.
    num_words = len(word_index) + 1
    embedding_matrix = np.zeros((num_words, emb_size))

    for word, i in tqdm(word_index.items()):
        if i >= num_words:
            continue

        # Words missing from the pretrained vectors keep an all-zero row.
        emb_vec = embedding_dict.get(word)
        if emb_vec is not None:
            embedding_matrix[i] = emb_vec

    return embedding_matrix
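How rows map to words is easiest to check on a toy vocabulary. The sketch below assumes a tokenizer-style word index whose ids start at 1 (row 0 is left for padding) and a made-up 2-dimensional embedding dictionary. The names build_matrix, toy_word_index and toy_embeddings are illustrative only.

```python
import numpy as np

# Hypothetical toy inputs: word ids start at 1, and "qwertyz" has no pretrained vector.
toy_word_index = {"london": 1, "paris": 2, "qwertyz": 3}
toy_embeddings = {"london": np.array([0.1, 0.2]), "paris": np.array([0.3, 0.4])}

def build_matrix(word_index, embedding_dict, emb_size):
    # One extra row because index 0 is reserved for padding.
    matrix = np.zeros((len(word_index) + 1, emb_size))
    for word, i in word_index.items():
        vec = embedding_dict.get(word)
        if vec is not None:      # out-of-vocabulary words keep a zero row
            matrix[i] = vec
    return matrix

toy_matrix = build_matrix(toy_word_index, toy_embeddings, emb_size=2)

# Fraction of the vocabulary that has a pretrained vector.
covered = sum(w in toy_embeddings for w in toy_word_index) / len(toy_word_index)
print(toy_matrix.shape, round(covered, 2))
```

Checking the covered fraction before training is a cheap way to see how much of the corpus vocabulary the pretrained vectors actually know about.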

We can now define our neural network and pass this embedding matrix to its Embedding layer. Setting trainable=False prevents the pretrained weights from being updated during training.

def new_model(embedding_matrix):
    inp = Input(shape=(MAX_LEN,))

    # The vocabulary size is the number of rows in the embedding matrix.
    x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  trainable=False)(inp)

    x = Bidirectional(
        LSTM(60, return_sequences=True, name='lstm_layer', 
             dropout=0.1, recurrent_dropout=0.1))(x)

    x = GlobalAveragePooling1D()(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

For example, to run the model using word2vec embeddings,

embeddings_index=load_word2vec()
embedding_matrix=prepare_matrix(embeddings_index)
model=new_model(embedding_matrix)

history=model.fit(X_train,y_train,
                  batch_size=8,
                  epochs=5,
                  validation_data=(X_test,y_test),
                  verbose=2)

You can call your desired type of embeddings and follow the same steps to implement any of them.

Comparison

So which text classification method worked best in our example problem?

You can use Neptune to compare the performance of our model using different embeddings by simply setting up an experiment.

GloVe embeddings performed slightly better on the test set than the other two embeddings. You may be able to get better results by cleaning the data more extensively and tuning the model.

You can explore experiments here if you want to.

Final Thoughts

In this article, we discussed and implemented different feature representation methods for text classification that you can use for smaller datasets.

Hopefully, you will find them useful in your projects.

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

8 Creators and Core Contributors Talk About Their Model Training Libraries From PyTorch Ecosystem

Jakub Czakon — Tue, 21 Apr 2020 10:41:57 +0000

I started using PyTorch to train my models back in early 2018 with the 0.3.1 release. I got hooked by the Pythonic feel, ease of use, and flexibility.

It was just so much easier to do things in PyTorch than in TensorFlow or Theano.
But one thing I missed was a Keras-like high-level interface to PyTorch, and there was not much out there back then.

Fast-forward to 2020, and we have 6 high-level training APIs in the PyTorch Ecosystem:

Skorch
Catalyst
Fastai
PyTorch Ignite
PyTorch Lightning
TorchBearer

But which one should you choose?
What are the pros and cons of using each one?

I thought: who can explain the differences between those libraries better than the authors themselves?
I picked up my proverbial phone and asked them to write an article with me. They all agreed and this is how this post was created!

So, I asked the authors to talk about the following aspects of their libraries:

Philosophy of the project
API structure
The learning curve for new users
Built-in features (what you get out-of-the-box)
Extension capabilities (simplicity of integration in research)
Reproducibility
Distributed training
Productionalization
Popularity

… and they really did answer thoroughly 🙂

Skorch

The philosophy behind skorch development can be summarized as follows:

follow the sklearn API
don’t hide PyTorch
don’t reinvent the wheel
be hackable

These principles laid out the design space within which we operate. Regarding the scikit-learn API, it presents itself, most obviously, in how you train and predict:

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(...)
net.fit(X_train, y_train)
net.predict(X_test)

Because skorch uses this simple and well-established API, everyone should be able to start using it very quickly.

But the sklearn integration goes deeper than calling “fit” and “predict”. You can seamlessly integrate your skorch model within sklearn Pipelines, use sklearn’s numerous metrics (no need to re-implement F1, R², etc.), and use it with GridSearchCV.

When it comes to parameter sweeps: you can use any other hyperparameter search strategy as long as there is a sklearn-compatible implementation.

We are especially proud that you can search on almost any hyper-parameter without additional work. For example, if your module has an initialization parameter called num_units, you can grid search that parameter right away.

Here is a list of things you can grid search out-of-the-box:

any parameter on your Module (number of units and layers, nonlinearity, dropout rate, …)
optimizer (learning rate, momentum…)
criterion
DataLoader (batch size, shuffling, …)
callbacks (any parameter, even on your custom callbacks)

This is what it looks like in code:

from sklearn.model_selection import GridSearchCV

params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20],
    'module__num_units': [10, 20],
    'optimizer__momentum': [0.6, 0.9, 0.95],
    'iterator_train__shuffle': [True, False],
    'callbacks__mycallback__someparam': [1, 2, 3],
}

net = NeuralNetClassifier(...)
gs = GridSearchCV(net, params, cv=3, scoring='accuracy')
gs.fit(X, y)

print(gs.best_score_, gs.best_params_)

As far as I’m aware, no other framework provides this flexibility. On top of that, by using the dask parallel backend, you can distribute the hyper-parameter search across your cluster without too much hassle.
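The double-underscore names in the params dictionary follow the scikit-learn convention for nested parameters: the part before `__` names a component, and the rest names a parameter on that component. Here is a rough plain-Python illustration of that routing idea. The helper split_nested_params is hypothetical and is not how skorch or scikit-learn implement it.

```python
from collections import defaultdict

def split_nested_params(params):
    """Group 'component__name' keys by component; keys without '__' go under 'net'."""
    grouped = defaultdict(dict)
    for key, value in params.items():
        component, sep, name = key.partition("__")  # split at the first '__' only
        if sep:
            grouped[component][name] = value
        else:
            grouped["net"][key] = value
    return dict(grouped)

# One candidate combination, as a grid search might produce it.
candidate = {
    "lr": 0.01,
    "module__num_units": 20,
    "optimizer__momentum": 0.9,
    "callbacks__mycallback__someparam": 2,
}
routed = split_nested_params(candidate)
print(routed)
```

Note how the callbacks entry still carries a nested name (mycallback__someparam), which would be resolved one level further down in the same way.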

Using the mature sklearn API, skorch users can avoid the boilerplate code that is typically seen when writing train loops, validation loops, and hyper-parameter search in pure PyTorch.

From the PyTorch side, we decided not to hide the backend behind an abstraction layer, as is the case in keras, for example. Instead, we expose numerous components known from PyTorch. As a user, you can use PyTorch’s Dataset (think torchvision, including TTA), DataLoader, and learning rate schedulers. Most importantly, you can use PyTorch Modules with almost no restrictions.

We thus made a conscious effort to re-use as many existing features from sklearn and PyTorch as possible instead of re-inventing the wheel. This makes skorch easy to use on top of your existing codebase or to remove it after your initial experimentation phase without any lock-in effect.

For instance, you can replace the neural net with any sklearn model or you can extract the PyTorch module and use it without skorch.

On top of re-using existing features, we added some of our own. Most notably, skorch works with many common data types out-of-the-box. On top of Datasets, you can use:

numpy arrays,
torch tensors,
pandas DataFrames,
Python dictionaries holding heterogeneous data,
external/custom datasets like ImageFolder from torchvision.

We’ve put extra effort to make these work well with sklearn.

Additionally, we implemented a simple yet powerful callback system, which you can use to adapt most of skorch’s behavior to your liking. Some of the callbacks that we provide are:

learning rate schedulers,
scoring functions (using custom or sklearn metrics),
early stopping,
checkpointing,
parameter freezing,
and TensorBoard and Neptune integration.

If this is not enough to satisfy your customization needs, we took pains to facilitate implementing your own callbacks or your own model trainers. Our documentation contains examples of how to implement custom callbacks and custom trainers, modifying every possible behavior right down to the training step.

The philosophy of not re-inventing the wheel should make skorch easy to learn for anyone who is familiar with sklearn and PyTorch. And since we designed skorch around customization and flexibility, it shouldn’t be too hard to master. To learn more about skorch check out these examples and notebooks.

Skorch is geared towards, and used in, production. We addressed some common issues regarding productionalization, specifically:

we make sure to be backward compatible and to give a sufficiently long deprecation period where necessary,
you can train on GPU and serve on CPU,
you can pickle a whole sklearn Pipeline containing the skorch model for later re-use,
we provide a helper function to turn your training code into a command line script that exposes all your model parameters, including their documentation, as command line arguments, with just three lines of extra code.

That being said, I have implemented, or know people who have implemented, more research-y stuff, like GANs and numerous types of semi-supervised learning techniques. This does require more profound knowledge of skorch, though, so you might have to dig deeper in the docs or ask us for pointers on github.

I personally haven’t come across anyone using skorch with reinforcement learning, but I would like to hear what experience people had with that.

Since our initial release of skorch in the summer of 2017, the project has matured a lot and an active community has grown around it. In a typical week, a handful of issues are opened on github or a question is asked on stackoverflow. We answer most questions within a day, and if there is a good feature request or bug report, we try to guide the reporter towards implementing it themselves.

This way, we have had more than 20 contributors over the project’s lifetime, with 3 of them being regulars, which means the project’s health is not dependent on a single person.

The big difference between skorch and some other higher-level frameworks, say fastai, is that skorch doesn’t come “batteries-included”. That means, it’s up to the user to implement their own modules or to use the modules of one of the many existing collections (say, torchvision). Skorch provides the skeleton, but you have to bring the meat.

When not to use Skorch

super custom PyTorch code, possibly reinforcement learning
backend agnostic code (switch between PyTorch, tensorflow, …)
there is no need at all for the sklearn API
avoid a very slight performance overhead

When to use skorch

gain sklearn API and all associated benefits like hyper-parameter search
most PyTorch workflows just work
avoid boilerplate, standardize code
use some of the many utilities discussed above

Catalyst

Philosophy

The idea behind Catalyst is quite simple:

collect all the technical, dev-heavy, Deep Learning stuff in a framework,
make it easy to re-use boring day-to-day components,
focus on research and hypothesis testing in our projects.

To make that happen we looked at a typical Deep Learning project, which usually has the following structure:

for stage in stages:
    for epoch in epochs:
        for dataloader in dataloaders:
            for batch in dataloader:
                handle(batch)

If you think about it, most of the time, all you need to do is specify the handle method for the new model and how batches of data should be fed to that model. Why, then, is so much of our time spent implementing pipelines and debugging training loops rather than developing something new or testing a hypothesis?
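The loop nest above can be written once as a small reusable driver in which only handle changes between projects. This is a plain-Python sketch of that idea, not Catalyst code. run_pipeline and the toy data are invented for illustration.

```python
def run_pipeline(stages, num_epochs, dataloaders, handle):
    """Generic driver: the loop nest is fixed, only `handle` is project-specific."""
    for stage in stages:
        for epoch in range(num_epochs):
            for loader_name, dataloader in dataloaders.items():
                for batch in dataloader:
                    handle(stage, epoch, loader_name, batch)

seen = []

def handle(stage, epoch, loader_name, batch):
    # Project-specific logic would go here (forward pass, loss, optimizer step).
    seen.append((stage, epoch, loader_name, sum(batch)))

run_pipeline(
    stages=["warmup", "finetune"],
    num_epochs=2,
    dataloaders={"train": [[1, 2], [3, 4]], "valid": [[5, 6]]},
    handle=handle,
)
print(len(seen), seen[0])
```

Everything outside handle is engineering that does not need to be rewritten per project, which is the separation the following paragraphs build on.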

We realized that it is possible to separate the engineering from the research so that we can invest our time once in the high-quality, reusable engineering backbone and use it across all the projects.

That is how Catalyst was born: an open-source PyTorch framework that lets you write compact but full-featured pipelines, abstracts engineering boilerplate away, and lets you focus on the main part of your project.

Our mission at Catalyst.Team is to use our software engineering and deep learning expertise to standardize workflows and enable cross-domain communication between deep learning and reinforcement learning researchers.

We believe that reduced development friction and free flow of ideas will lead to future breakthroughs in DL and such an R&D Ecosystem will help make that happen.

The learning curve

Catalyst can be easily adopted by both DL newcomers and seasoned experts thanks to two APIs:

Notebook API, which was developed with a focus on easy experimentation and Jupyter Notebooks usage - to start your path into reproducible DL research.
Config API, which mostly focuses on scalability and CLI interface - to bring the power of DL/RL even on large clusters.

When it comes to PyTorch user experience we really want to keep it as simple as possible:

You define your loaders, model, criterion, optimizer, and scheduler as you usually would:

import torch

# data
loaders = {"train": ..., "valid": ...}

# model, criterion, optimizer
model = Net()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

and you pass those PyTorch objects to the Catalyst Runner:

from catalyst.dl import SupervisedRunner

# experiment setup
logdir = "./logdir"
num_epochs = 42

# model runner
runner = SupervisedRunner()

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=num_epochs,
    verbose=True,)

Engineering is clearly decoupled from deep learning, with almost no boilerplate. This is what we feel deep learning code should look like.

To get started with both APIs you can follow our tutorials and pipelines or, if you don’t want to choose, just check out the most common ones: classification and segmentation.

Design and Architecture

The most interesting part about Notebook and Config API is that they use the same “backend” logic – Experiment, Runner, State and Callback abstractions, which are the core features of Catalyst.

Experiment: an abstraction that contains information about the experiment – a model, a criterion, an optimizer, a scheduler, and their hyperparameters. It also contains information about the data and transformations used. In general, the Experiment knows what you would like to run.
Runner: a class that knows how to run an experiment. It contains all the logic of how to run the experiment, stages (another distinctive feature of Catalyst), epoch and batches.
State: some intermediate storage between Experiment and Runner that saves the current state of the Experiment – model, criterion, optimizer, schedulers, metrics, loggers, loaders, etc.
Callback: a powerful abstraction that lets you customize your experiment run logic. To give users maximum flexibility and extensibility we allow callback execution anywhere in the training loop:

on_stage_start
    on_epoch_start
        on_loader_start
            on_batch_start
            # ...
            on_batch_end
        on_loader_end
    on_epoch_end
on_stage_end

on_exception

By implementing these methods you can make any additional logic possible.
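To show how such hooks can be wired into a loop, here is a minimal plain-Python sketch covering a subset of the events. It is not Catalyst’s implementation. The class names, the fire helper and the dictionary used as state are all hypothetical.

```python
class Callback:
    """Base class: every hook is a no-op unless a subclass overrides it."""
    def on_epoch_start(self, state): pass
    def on_batch_start(self, state): pass
    def on_batch_end(self, state): pass
    def on_epoch_end(self, state): pass

class EventLogger(Callback):
    """Toy callback that records which hooks fired and in what order."""
    def __init__(self):
        self.events = []
    def on_epoch_start(self, state):
        self.events.append(f"epoch_start:{state['epoch']}")
    def on_batch_end(self, state):
        self.events.append(f"batch_end:{state['batch']}")
    def on_epoch_end(self, state):
        self.events.append(f"epoch_end:{state['epoch']}")

def fire(callbacks, hook, state):
    # Call the named hook on every registered callback, in order.
    for cb in callbacks:
        getattr(cb, hook)(state)

def run(num_epochs, batches, callbacks):
    state = {}
    for epoch in range(num_epochs):
        state["epoch"] = epoch
        fire(callbacks, "on_epoch_start", state)
        for batch in batches:
            state["batch"] = batch
            fire(callbacks, "on_batch_start", state)
            # ... the actual training step would run here ...
            fire(callbacks, "on_batch_end", state)
        fire(callbacks, "on_epoch_end", state)

logger = EventLogger()
run(num_epochs=1, batches=["a", "b"], callbacks=[logger])
print(logger.events)
```

Because the loop itself never changes, new behavior such as logging, checkpointing or early stopping is added by writing another callback rather than editing the loop.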

As a result, you can implement any Deep Learning pipeline in a few lines of code (and after Catalyst.RL 2.0 release – Reinforcement Learning pipeline), combining it from available primitives (thanks to the community, their number is growing every day).

Everything else (Models, Criterions, Optimizers, Schedulers) are pure PyTorch primitives. Catalyst does not create any wrappers or abstractions on top but rather makes it easy to reuse those building blocks between different frameworks and domains.

Extension capabilities / Simplicity of integration in research

Thanks to flexible framework design and Callbacks-mechanism, Catalyst is easily extendable for a large number of DL-based projects. You can check out our Catalyst-powered repositories on awesome-catalyst-list.

If you are interested in Reinforcement Learning – there are a large number of RL-based repos and competition solutions also. To compare Catalyst.RL with other RL frameworks you could check out Open Source RL list.

Other built-in features (what you get out of the box)

Knowing that you can extend it easily gives comfort but there are a ton of features that you get out-of-the-box. Some of them include:

Based on a flexible callback system, Catalyst integrates common Deep Learning best practices such as gradient accumulation, gradient clipping, weight decay correction, top-K best checkpoints saving, tensorboard integration, and many other useful day-to-day deep learning utils.
Thanks to our contributors and contrib modules, Catalyst has access to all recent SOTA features, like AdamW, OneCycle, SWA, Ranger, LookAhead, and many other research developments.
Moreover, we integrate with popular libraries such as Nvidia apex, Albumentations, SMP, transformers, wandb, and neptune.ai out of the box to make your research more user-friendly. Thanks to these integrations, Catalyst has full support for test-time augmentations, mixed precision, and distributed training.
For the industry needs, we also have framework-wise support for PyTorch tracing which makes putting models in production easier. Furthermore, we deploy predefined Catalyst-based docker images with each release for easier integration.
Finally, we support additional solutions for both model serving – ReAction (industry-oriented) and experiments monitoring – Alchemy (research-oriented).

Everything is integrated into the library and covered by CI tests (we have a dedicated gpu-server for that). And thanks to Catalyst scripts, you can schedule a large number of experiments and run them in parallel over all available GPUs from the command line (check catalyst-parallel-run for more info).
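Of the practices listed above, gradient accumulation is perhaps the least self-explanatory: gradients from several small batches are combined before a single optimizer step, which emulates a larger batch. Here is a toy plain-Python sketch for a one-parameter model y = w * x with squared error. It only illustrates the arithmetic, assumes equally sized micro-batches, and is not Catalyst code.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for y_hat = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up data and starting weight.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Gradient computed on the full batch of 4 samples at once.
full_grad = grad_mse(w, data)

# The same gradient accumulated over 2 micro-batches of 2 samples each.
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro in micro_batches:
    # Scale each contribution so that the running sum is a mean over micro-batches.
    accumulated += grad_mse(w, micro) / len(micro_batches)

print(full_grad, accumulated)
```

The two values agree up to floating-point rounding, which is why accumulation is a common workaround when the desired batch size does not fit in GPU memory.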

Reproducibility

We’ve put a lot of work into making the experiments that you run with Catalyst reproducible. Thanks to library-wide determinism, Catalyst-based experiments are reproducible not only between several runs on one server but also between runs on different servers and different hardware (with docker encapsulation, of course). See experiments here if interested.

Moreover, Reinforcement Learning experiments are also reproducibility-oriented (as far as RL can be reproducible). For example, with synchronous experiment runs, you can achieve very close performance, thanks to determinism in sampled trajectories. This is notoriously hard and as far as I am aware Catalyst has the most reproducible RL pipelines out there.
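Determinism of this kind usually starts with seeding every random number generator in use. The sketch below shows the general pattern. It is not Catalyst’s own utility, the name set_global_seed is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

def set_global_seed(seed):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # Ask cuDNN for deterministic kernels; this can cost some speed.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

set_global_seed(42)
first = [random.random() for _ in range(3)]
set_global_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding alone does not guarantee identical results across different hardware or library versions, which is where environment capture and docker encapsulation come in.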

To achieve this new level of reproducibility in DL and RL we had to create several additional features:

Full source code dumping: thanks to the Experiment, Runner and Callback abstractions, it’s quite easy to save these primitives for further usage.
Catalyst source code dumping: with this feature, even when working with the dev version of Catalyst, you can always reproduce experiment results.
Environment versioning: Catalyst dumps pip and conda packages versions (it can be later used to define your docker images)
Finally, Catalyst supports several monitoring tools, like Alchemy, Neptune.ai, Wandb to store all your experiment metrics and additional info for better research progress tracking and reproducibility.

Thanks to those library-wise solutions, you can be sure that the pipelines you implement in Catalyst are reproducible with all the experiment logs and checkpoints saved for future reference.
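Environment versioning, as described in the list above, boils down to recording which package versions were installed when an experiment ran. Here is a small sketch using only the Python standard library. It is not how Catalyst does it, and the function name freeze_environment is made up for illustration.

```python
from importlib import metadata

def freeze_environment():
    """Return sorted 'name==version' strings for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)

requirements = freeze_environment()
print("\n".join(requirements[:5]))
```

Writing this list next to the experiment logs gives you a starting point for rebuilding the same environment later, for example in a docker image.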

Distributed training

Based on our integrations, Catalyst already has native support for distributed training. Moreover, we support Slurm training and are working on better Kubernetes integration for both DL and RL pipelines.

Productionalization

Now that we know how Catalyst helps with deep learning research we can talk about deploying trained models to production.

As was already mentioned, Catalyst supports model tracing out-of-the-box. It lets you convert PyTorch models (that use Python code) to TorchScript models (that have everything integrated). TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.

Additionally, to help Catalyst users deploy their pipelines into production systems, Catalyst.Team has a Docker Hub with pre-built Catalyst-based images (including fp16 support).

Moreover, to help researchers bring their ideas into production and real-world applications, we’ve created Catalyst.Ecosystem:

Reaction: our own PyTorch Serving solution with sync/async API, batch mode support, quest, and all other typical backends that you would expect from a well-designed production system.
Alchemy: our monitoring tools for experiment tracking, model comparison and research results sharing.

Popularity

Since the first pypi release 12 months ago Catalyst has gained ~1.5k stars on Github and over 100k downloads. We are proud to be part of such an Open Source Ecosystem and extremely grateful to all our users and contributors for constant support and feedback.

One of the online communities that was especially helpful was ods.ai: one of the largest slack channels for Data Scientists and Machine learning practitioners in the world (40k+ users). Without their ideas and feedback, Catalyst wouldn’t have gotten where it is today.

Special thanks to our early-adopters,

Bac Nguyen Xuan
Eugene Khvedchenya
Alex Gaziev

and to the contributors that make it all worth it.

Acknowledgments

Since the beginning of Catalyst’s development, a lot of people have influenced it in a lot of different ways. As a token of my appreciation, I want to express a HUGE THANK YOU to:

Roman Tezikov for great Catalyst tutorials
Eugene Kachan for many Config API improvements and pipelines
David Kuryakin for ReAction design
Aleksey Grinchuk and Valentin Khrulkov for many RL algorithms implemented together
Alex Gaziev for a bunch of Config API improvements
Andrey Zharkov and Artem Zolkin for Catalyst.GAN initiative
Yury Kashnitsky for Catalyst.NLP movement
Evgeny Semyonov for MLComp creation
Eugene Khvedchenya for Pytorch-toolbelt library
Nguyen Xuan Bac and Andrey Lukyanenko for many Kaggle Catalyst-based solutions
Vsevolod Poletaev for Experiment idea and PoC
Aleksandr Belskikh for Callbacks-based system inspiration
Artur Kuzin for multi-stage pipelines support requirement
Vladimir Iglovikov for countless pieces of useful advice and Ivan Stepanenko for awesome Catalyst.Ecosystem design

Thanks to all that support, Catalyst has become a part of the Kaggle docker image and was added to the PyTorch Ecosystem, and now we are developing our own DL R&D Ecosystem to accelerate your research and production needs.

To read more about Catalyst.Ecosystem, please check our vision and project manifesto.

Finally, we are always happy to help our Catalyst.Friends: companies/startups/research labs, who are already using Catalyst or are considering using it for their next project.

Thanks for reading, and…
Break the cycle – use Catalyst!

When to use Catalyst

You want a flexible and reusable codebase without boilerplate.
You want to share your expertise with other researchers from different Deep Learning areas.
You want to boost your research speed with Catalyst.Ecosystem.

When not to use Catalyst

You have only just started your deep learning path; in that case, low-level PyTorch is a great introduction.
You want to create very specific, custom, pipelines with a bunch of irreproducible tricks 🙂

Fastai

Note:

What follows is about version 2 of fastai, which will be released in July 2020. You can preview it here and it is documented here. If you read this post after it has been released, it will be in the main repository and will be documented there.

Fastai is a deep learning library which provides:

practitioners: with high-level components that can quickly and easily provide state of the art results in standard deep learning domains,
researchers: with low-level components that can be mixed and matched to build new things.

It aims to do both things without substantial compromises in ease of use, flexibility, or performance.

This is possible thanks to a carefully layered architecture. It expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions. What is important is that these abstractions can be expressed clearly and concisely which makes fastai approachable and rapidly productive, but also deeply hackable and configurable.

A high-level API offers customizable models with sensible defaults, which is built on top of a hierarchy of lower-level building blocks.

This article covers a representative subset of the features of the library. For details, see the fastai paper and the documentation.

API

When talking about the fastai API, one needs to distinguish between the high-level API and the mid/low-level API.
We will talk about both in the following sections.

High-level API

The high-level API is very useful to beginners and practitioners who are mainly interested in applying pre-existing deep learning methods.

It offers concise APIs for main application areas:

vision,
text,
tabular,
time-series analysis,
recommendation (collaborative filtering)

These APIs choose intelligent default values and behaviors based on all available information.

For instance, fastai provides a Learner class which brings together architecture, optimizer, and data, and automatically chooses an appropriate loss function where possible.

To give another example, generally, a training set should be shuffled, and a validation set should not be shuffled. fastai provides a single DataLoaders class which automatically constructs validation and training data loaders with these details already handled.
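The shuffling rule can be captured in one place so that it is never forgotten. Below is a plain-Python sketch of that design idea. It is not fastai’s implementation, and make_loader and make_loaders are hypothetical helpers operating on simple lists.

```python
import random

def make_loader(items, batch_size, shuffle, seed=0):
    """Return a list of batches, optionally after shuffling the item order."""
    order = list(items)
    if shuffle:
        random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def make_loaders(train_items, valid_items, batch_size):
    """Bundle both loaders so the shuffle decision is made exactly once."""
    return {
        "train": make_loader(train_items, batch_size, shuffle=True),
        "valid": make_loader(valid_items, batch_size, shuffle=False),
    }

loaders = make_loaders(train_items=range(10), valid_items=range(100, 104), batch_size=4)
print(loaders["valid"])
```

The validation batches keep their original order, so metrics computed on them are comparable from epoch to epoch, while the training order is randomized.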

To see those “clear and concise code” principles in action, let’s fine-tune an ImageNet model on the Oxford-IIIT Pet dataset and achieve close to state-of-the-art accuracy within a couple of minutes of training on a single GPU:

from fastai.vision.all import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(path=path, bs=64,
    fnames = get_image_files(path/"images"), pat = r'/([^/]+)_\d+.jpg$',
    item_tfms=RandomResizedCrop(450, min_scale=0.75), 
    batch_tfms=[*aug_transforms(size=224, max_warp=0.), 
                Normalize.from_stats(*imagenet_stats)])

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(4)

This is not an excerpt. These are all of the lines of code necessary for this task.
Each line of code does one important task, allowing the user to focus on what they need to do, rather than minor details:

from fastai.vision.all import *

imports all the necessary pieces from the library. It’s important to note that the library has been designed carefully to avoid these styles of imports cluttering the namespace.

path = untar_data(URLs.PETS)

downloads a standard dataset from the fast.ai datasets collection (if not previously downloaded) to a configurable location, extracts it (if not previously extracted), and returns a pathlib.Path object with the extracted location.

dls = ImageDataLoaders.from_name_re(path=path, bs=64,
    fnames = get_image_files(path/"images"), pat = r'/([^/]+)_\d+.jpg$',
    item_tfms=RandomResizedCrop(450, min_scale=0.75), 
    batch_tfms=[*aug_transforms(size=224, max_warp=0.), 
                Normalize.from_stats(*imagenet_stats)])

sets up the DataLoaders. Note the separation of item level and batch level transforms:

item transforms are applied to individual images on the CPU
batch transforms are applied to a mini batch on the GPU (if available).

aug_transforms() selects a set of data augmentations. As always in fastai, a default that works well across a variety of vision datasets is chosen but can be fully customized if needed.
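The pat argument in the call above is a regular expression whose first capture group becomes the label. Its effect can be checked with Python’s re module alone. The file paths below are hypothetical examples in the style of the Pets dataset, not actual entries from it.

```python
import re

# Same pattern as in the example above: capture everything between the last
# "/" and the trailing "_<digits>.jpg".
pat = r'/([^/]+)_\d+.jpg$'

# Hypothetical file paths; the real dataset location depends on your setup.
fnames = [
    "/data/oxford-iiit-pet/images/great_pyrenees_173.jpg",
    "/data/oxford-iiit-pet/images/Bengal_12.jpg",
]
labels = [re.search(pat, f).group(1) for f in fnames]
print(labels)
```

Breed names that contain underscores are kept whole because only the final underscore followed by digits is treated as the separator.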

learn = cnn_learner(dls, resnet34, metrics=error_rate)

creates a Learner, which combines an optimizer, a model, and the data to train on. Each application (vision, text, tabular) has a customized function that creates a Learner, which automatically handles whatever details it can for the user. For instance, in this image classification problem, it will:

download an ImageNet-pretrained model, if not already available,
remove the classification head of the model,
replace it with a head appropriate for this particular dataset,
set appropriate optimizer, weight decay, learning rate, and so forth

learn.fine_tune(4)

fine-tunes the model. In this case, it is using the 1-cycle policy, which is a recent best practice for training deep learning models but is not widely available in other libraries. A lot of things happen under the hood in .fine_tune():

annealing both the learning rates and the momentums,
printing metrics on the validation set,
displaying results in an HTML or console table,
recording losses and metrics after every batch, and so forth.

A GPU will be used if one is available.
It will first train the head for one epoch while the body of the model is frozen, then fine-tune for as many epochs as given (here 4) using discriminative learning rates.
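To give a feel for what annealing both the learning rate and the momentum means under a 1-cycle policy, here is a small standalone sketch of such a schedule. The shape (warm-up followed by cosine annealing) follows the general idea. The specific default values below are assumptions chosen for illustration and may not match what fastai uses internally.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle(pos, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25,
              moms=(0.95, 0.85, 0.95)):
    """Return (lr, momentum) at training progress pos in [0, 1]."""
    if pos < pct_start:
        p = pos / pct_start                    # warm-up: lr rises, momentum falls
        return cos_anneal(lr_max / div, lr_max, p), cos_anneal(moms[0], moms[1], p)
    p = (pos - pct_start) / (1 - pct_start)    # annealing: lr falls, momentum rises
    return cos_anneal(lr_max, lr_max / div_final, p), cos_anneal(moms[1], moms[2], p)

for pos in (0.0, 0.25, 1.0):
    lr, mom = one_cycle(pos)
    print(f"pos={pos:.2f} lr={lr:.2e} mom={mom:.2f}")
```

The learning rate and the momentum move in opposite directions: when steps are large the momentum is lowered, and it is raised again as the learning rate decays.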

One of the strengths of the fastai library is how consistent the API is across applications.

For example, fine-tuning a pretrained model on the IMDB dataset (a text classification task) using ULMFiT can be done in 6 lines of code:

from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path, valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)

Users get a very similar experience in other domains like tabular, time series or recommendation systems. Once a Learner has been trained, you can explore the results with the command learn.show_results(). How those results are presented depends on the application: in vision you get labeled pictures, and in text you get a dataframe summarizing samples, targets and predictions. In our pets classification example you would get something like this:

In the IMDb classification problem, you’d get something like this:

Another important high-level API component is the data block API, which is an expressive API for data loading. It is the first attempt we are aware of to systematically define all of the steps necessary to prepare data for a deep learning model, and to give users a mix and match recipe book for combining these pieces (which we refer to as data blocks).

Here is an example of how to use the data block API to get the MNIST dataset ready for modeling:

mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock), 
    get_items=get_image_files, 
    splitter=GrandparentSplitter(),
    get_y=parent_label)
dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY), batch_tfms=Normalize)

Mid and low-level API

In the previous section, you saw how you can get a lot done quickly with the high-level API, which has a ton of out-of-the-box functionality. However, there are situations when you need to tweak things or extend what is already there.

This is where middle and low-level APIs come into the picture:

mid-level API provides the core deep learning and data-processing methods for each of these applications,
low-level API provides a library of optimized primitives and functional and object-oriented foundations, which allows the mid-level to be developed and customized.

The training loop can be customized using the Learner’s novel two-way callback system. It allows gradients, data, losses, control flow, and anything else to be read and changed at any point during training.

There is a rich history of using callbacks to allow for customization of numeric software, and today nearly all modern deep learning libraries provide this functionality. However, fastai’s callback system is the first that we are aware of that supports the design principles necessary for complete two-way callbacks:

A callback should be available at every single point during training, which gives users full flexibility.

Every callback should be able to access every piece of information available at that stage in the training loop, including hyper-parameters, losses, gradients, input and target data, and so forth.

Every callback should be able to modify all these pieces of information at any time before they are used.
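To make the "two-way" part concrete, here is a minimal pure-Python sketch. It is not fastai's implementation: the `Callback`, `Loop`, and hook names are hypothetical, and the "model" is a single weight fitted by gradient descent. What it shows is that one callback only reads the loop's state, while another overwrites the gradient before the weight update uses it.

```python
class Callback:
    """Base class: every hook receives the loop and may read or modify its attributes."""
    def before_batch(self, loop): pass
    def after_loss(self, loop): pass
    def after_backward(self, loop): pass


class RecordLoss(Callback):
    """Read-only callback: stores each batch loss."""
    def __init__(self):
        self.losses = []

    def after_loss(self, loop):
        self.losses.append(loop.loss)


class ClipGradient(Callback):
    """Writing callback: changes the gradient before the update step consumes it."""
    def __init__(self, max_abs):
        self.max_abs = max_abs

    def after_backward(self, loop):
        loop.grad = max(-self.max_abs, min(self.max_abs, loop.grad))


class Loop:
    def __init__(self, weight, lr, callbacks):
        self.weight, self.lr, self.callbacks = weight, lr, callbacks
        self.x = self.y = self.loss = self.grad = None

    def _fire(self, name):
        for cb in self.callbacks:
            getattr(cb, name)(self)

    def fit(self, data, epochs):
        for _ in range(epochs):
            for x, y in data:
                self.x, self.y = x, y
                self._fire("before_batch")
                pred = self.weight * self.x
                self.loss = (pred - self.y) ** 2
                self._fire("after_loss")
                self.grad = 2 * (pred - self.y) * self.x
                self._fire("after_backward")
                self.weight -= self.lr * self.grad  # uses the possibly modified gradient


recorder = RecordLoss()
loop = Loop(weight=0.0, lr=0.1, callbacks=[ClipGradient(max_abs=1.0), recorder])
loop.fit([(1.0, 2.0), (2.0, 4.0)], epochs=50)
```

Because the loop re-reads `self.grad` after firing the hook, any callback can change the training behaviour without the loop itself being edited.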

All the tweaks of the training loop (different schedulers, mixed-precision training, reporting on TensorBoard, wandb, Neptune, or equivalent, MixUp, oversampling strategies, distributed training, GAN training…) are implemented in callbacks that the end-user can mix and match with their own, making it easier to experiment with things and do ablation studies. Convenience methods are there to add those callbacks for the user, making training in mixed precision as easy as saying

learn = learn.to_fp16()

or training in a distributed environment as easy as

learn = learn.to_distributed()

fastai also provides a new, generic optimizer abstraction that allows recent optimization techniques, like LAMB, RAdam or AdamW, to be implemented in a few lines of code.

It is possible thanks to refactoring optimizer abstractions into two basic pieces:

stats, which track and aggregate statistics such as gradient moving averages
steppers, which combine stats and hyper-parameters to “step” the weights using some function.

This foundation has allowed us to write most of fastai’s optimizers in 2-3 lines of code, while in other popular libraries that would take you 50+.
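The split into stats and steppers can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not fastai's optimizer: the names `average_grad`, `momentum_step`, `weight_decay`, and this `Optimizer` class are hypothetical, and weights are plain floats rather than tensors.

```python
def average_grad(state, grad, mom=0.9, **_):
    """Stat: keep a momentum-style running sum of past gradients."""
    state["grad_avg"] = mom * state.get("grad_avg", 0.0) + grad
    return state


def weight_decay(weight, lr, wd=0.0, **_):
    """Stepper: shrink the weight slightly (a no-op when wd is 0)."""
    return weight * (1 - lr * wd)


def momentum_step(weight, state, lr, **_):
    """Stepper: move the weight using the tracked statistic and the learning rate."""
    return weight - lr * state["grad_avg"]


class Optimizer:
    def __init__(self, weights, stats, steppers, **hypers):
        self.weights = list(weights)
        self.stats, self.steppers, self.hypers = stats, steppers, hypers
        self.state = [{} for _ in self.weights]

    def step(self, grads):
        for i, grad in enumerate(grads):
            for stat in self.stats:
                self.state[i] = stat(self.state[i], grad, **self.hypers)
            for stepper in self.steppers:
                self.weights[i] = stepper(self.weights[i], state=self.state[i], **self.hypers)


# SGD with momentum is then just one stat plus two steppers
opt = Optimizer([0.0], stats=[average_grad], steppers=[weight_decay, momentum_step],
                lr=0.05, mom=0.9, wd=0.0)
for _ in range(300):
    grads = [2 * (w - 3.0) for w in opt.weights]  # gradient of (w - 3)^2
    opt.step(grads)
```

A different optimizer would reuse the same `Optimizer` class and only swap the list of stats and steppers.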

There are many other mid-tier and low-level APIs that make it easy for researchers and developers to build new methods on top of a fast and flexible foundation.

The library is already in wide use in research, industry, and teaching.
We have used it to create a complete, and very popular deep learning course: Practical deep learning for coders (the first video of the last iteration has 256k views).

The repository has 16.9k stars and is used in more than 2,000 projects at the time of writing. The community is very active on the fast.ai forum, be it to clarify points of the course that are unclear, help with debugging or team up to tackle a new deep learning project.

When to use fastai

The goal is to have something easy enough for beginners but flexible enough for researchers/practitioners.

When not to use fastai

The only thing I can think of is that you wouldn’t use fastai to serve in production a model you trained in a different framework, since we don’t deal with that aspect.

PyTorch Ignite

PyTorch Ignite is a high-level library that helps with training neural networks in PyTorch. Since its beginning in 2018, our goal has been to:

“make the common things easy and the hard things possible”.

Why use Ignite?

Ignite’s high level of abstraction assumes little about the type of model (or models) that the user is training. We only require the user to define the closure to be run in the training loop and in the optional validation loop. This gives users a lot of flexibility and allows them to use Ignite in tasks such as co-training multiple models (e.g. GANs) or tracking multiple losses and metrics in the training loop.

Ignite concepts and API

There are a few core objects in Ignite’s API that you need to learn:

Engine: the essence of the library
Events & Handlers: interaction with the Engine (e.g. early stopping, checkpoints, logging)
Metrics: out-of-the-box metrics for various tasks

We will present some basics to understand the main ideas but feel free to dig deeper into examples in the repository.

Engine

It simply loops over provided data, executes a processing function and returns a result.

A Trainer is an Engine with model’s weights update as processing function.

from ignite.engine import Engine

def update_model(trainer, batch):
    model.train()
    optimizer.zero_grad()
    x, y = prepare_batch(batch)
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(update_model)
trainer.run(data, max_epochs=100)

An Evaluator (object to validate model) is an Engine with on-line metric computation logic as processing function.

import torch
from ignite.engine import Engine

total_loss = []
def compute_metrics(_, batch):
    x, y = batch
    model.eval()
    with torch.no_grad():
        y_pred = model(x)
        loss = criterion(y_pred, y)
        total_loss.append(loss.item())

    return loss.item()

evaluator = Engine(compute_metrics)
evaluator.run(data, max_epochs=1)
print(f"Loss: {torch.tensor(total_loss).mean()}")

This code can silently train a model and compute total loss.

In the next section we will see how to make the training and validation more user-friendly.

Events & Handlers

In order to improve the flexibility of Engine and allow users to interact at each step of the run, we introduced events and handlers. The idea is that users can execute custom code inside the training loop as an event handler, similar to callbacks in other libraries.

fire_event(Events.STARTED)

while epoch < max_epochs:
    fire_event(Events.EPOCH_STARTED)
    # run once on data
    for batch in data:
        fire_event(Events.ITERATION_STARTED)

        output = process_function(batch)

        fire_event(Events.ITERATION_COMPLETED)
    fire_event(Events.EPOCH_COMPLETED)
fire_event(Events.COMPLETED)

At each fire_event call, all its event handlers are executed. For example, users may want to set up some run-dependent variables at the beginning of training (Events.STARTED) and update the learning rate on each iteration (Events.ITERATION_COMPLETED). With Ignite the code will look like this:

from ignite.engine import Engine, Events

train_loader = ...
model = ...
optimizer = ...
criterion = ...
lr_scheduler = ...

def process_function(engine, batch):
    # ... user function to update model weights
    ...

trainer = Engine(process_function)

@trainer.on(Events.STARTED)
def setup_logging_folder(_):
    # create a folder for the run
    # set up some run dependent variables
    ...

@trainer.on(Events.ITERATION_COMPLETED)
def update_lr(engine):
    lr_scheduler.step()

trainer.run(train_loader, max_epochs=50)

The cool thing about handlers (vs “callback” interfaces) is that a handler can be any function with the correct signature (we only require the first argument to be engine), e.g. a lambda, a simple function, a class method, etc. We do not require you to inherit from an interface and override its abstract methods.

trainer.add_event_handler(
    Events.STARTED, lambda engine: print("Start training"))

# attach handler with args, kwargs
mydata = [1, 2, 3, 4]


def on_training_ended(engine, data):
    print("Training is ended. mydata={}".format(data))


trainer.add_event_handler(
    Events.COMPLETED, on_training_ended, mydata)
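To see how little machinery the event/handler idea needs, here is a toy engine in plain Python. It is a sketch, not Ignite's implementation: `ToyEngine` and this local `Events` enum are stand-ins written for this example, and they only mimic the loop and the handler signature described above.

```python
from collections import defaultdict
from enum import Enum, auto


class Events(Enum):
    STARTED = auto()
    EPOCH_STARTED = auto()
    ITERATION_STARTED = auto()
    ITERATION_COMPLETED = auto()
    EPOCH_COMPLETED = auto()
    COMPLETED = auto()


class ToyEngine:
    def __init__(self, process_function):
        self.process_function = process_function
        self.handlers = defaultdict(list)
        self.epoch = 0
        self.iteration = 0
        self.output = None

    def add_event_handler(self, event, handler, *args, **kwargs):
        self.handlers[event].append((handler, args, kwargs))

    def on(self, event):
        """Decorator form of add_event_handler."""
        def decorator(handler):
            self.add_event_handler(event, handler)
            return handler
        return decorator

    def fire_event(self, event):
        # every handler gets the engine first, then its extra args
        for handler, args, kwargs in self.handlers[event]:
            handler(self, *args, **kwargs)

    def run(self, data, max_epochs=1):
        self.fire_event(Events.STARTED)
        while self.epoch < max_epochs:
            self.epoch += 1
            self.fire_event(Events.EPOCH_STARTED)
            for batch in data:
                self.iteration += 1
                self.fire_event(Events.ITERATION_STARTED)
                self.output = self.process_function(self, batch)
                self.fire_event(Events.ITERATION_COMPLETED)
            self.fire_event(Events.EPOCH_COMPLETED)
        self.fire_event(Events.COMPLETED)
        return self.output


log = []
engine = ToyEngine(lambda engine, batch: sum(batch))


@engine.on(Events.EPOCH_COMPLETED)
def record_epoch(engine):
    log.append(("epoch", engine.epoch, engine.output))


engine.add_event_handler(
    Events.COMPLETED, lambda engine, tag: log.append((tag, engine.iteration)), "done")
engine.run([[1, 2], [3, 4]], max_epochs=2)
```

Both a decorated function and a lambda with an extra argument work as handlers here, because the only contract is "first argument is the engine".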

Built-in events filtering

There are cases when users would like to execute the code periodically/once or with a custom rule like:

run the validation every 5 epochs,
store a checkpoint every 1000 iterations,
change a variable on 20th epoch,
log gradients on the first 10 iterations.
etc.

Ignite provides such flexibility to separate “the code to execute” from the logic “when to execute the code”.

For example, to run the validation every 5 epochs it is simply coded:

@trainer.on(Events.EPOCH_COMPLETED(every=5))
def run_validation(_):
    # run validation
    ...

Similarly, to change some training variable once on 20th epoch:

@trainer.on(Events.EPOCH_STARTED(once=20))
def change_training_variable(_):
    # change the variable here
    ...

More generally, users can provide their own event filtering function:

def first_x_iters(_, event):
    if event < 10:
        return True
    return False

@trainer.on(Events.ITERATION_COMPLETED(event_filter=first_x_iters))
def log_gradients(_):
    # log the gradients here
    ...
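The separation between "what to run" and "when to run it" can be sketched without Ignite at all. The helpers below (`every`, `once`, `filtered`) are hypothetical names for this example; they mirror the `(engine, event)` filter signature used above, where the second argument is the running event count.

```python
def every(n):
    """Accept every n-th occurrence of the event (n, 2n, 3n, ...)."""
    return lambda engine, count: count % n == 0


def once(n):
    """Accept only the n-th occurrence of the event."""
    return lambda engine, count: count == n


def first_x_iters(engine, count):
    """Custom rule: accept occurrences while the count is below 10."""
    return count < 10


def filtered(event_filter, handler):
    """Wrap a handler so it only runs when the filter accepts the current count."""
    def wrapped(engine, count):
        if event_filter(engine, count):
            handler(engine, count)
    return wrapped


fired = {"every_5": [], "once_20": [], "first_iters": []}
handlers = [
    filtered(every(5), lambda engine, count: fired["every_5"].append(count)),
    filtered(once(20), lambda engine, count: fired["once_20"].append(count)),
    filtered(first_x_iters, lambda engine, count: fired["first_iters"].append(count)),
]

# simulate 25 occurrences of one event; the engine argument is unused here
for count in range(1, 26):
    for handler in handlers:
        handler(None, count)
```

The handlers themselves never check the epoch or iteration number; all scheduling logic lives in the filter.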

Out-of-the-box handlers

Ignite provides a list of handlers and metrics to simplify user’s code:

Checkpoint: saves training checkpoints (composed of trainer, model(s), optimizer(s), lr scheduler(s), etc.) and the best models (by validation score)
EarlyStopping: stops the training if no progress is done (by validation score)
TerminateOnNan: stops the training if NaN is encountered
Optimizer Parameters Scheduling: concatenate, add a warm-up, setup linear or cosine annealing, linear piecewise scheduling of any optimizer parameter (lr, momentum, betas, …)

Logging to common platforms: TensorBoard, Visdom, MLflow, Polyaxon or Neptune (batch losses, metrics GPU mem/utilization, optimizer parameters and more).

Metrics

Ignite also provides a list of out-of-the-box metrics for various tasks: Precision, Recall, Accuracy, Confusion Matrix, IoU, etc., plus about 20 regression metrics.

For example, below we compute validation accuracy on the validation dataset:

from ignite.engine import Engine
from ignite.metrics import Accuracy

def compute_predictions(_, batch):
    # …
    return y_pred, y_true

evaluator = Engine(compute_predictions)
metric = Accuracy()
metric.attach(evaluator, "val_accuracy")
evaluator.run(val_loader)
# evaluator.state.metrics["val_accuracy"] now holds the accuracy, e.g. 0.98765

Go here and here to see the full list of available metrics.

Ignite metrics have this cool property that users can compose their own metrics by using basic arithmetical operations or torch methods:

from ignite.metrics import Precision, Recall

precision = Precision(average=False)
recall = Recall(average=False)
F1_per_class = (precision * recall * 2 / (precision + recall))
F1_mean = F1_per_class.mean()  # torch mean method
F1_mean.attach(engine, "F1")
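One way such arithmetic on metrics can work is lazy evaluation through operator overloading. The sketch below is an assumption about the general technique, not Ignite's code: `Metric`, `Lazy`, `BinaryPrecision`, and `BinaryRecall` are hypothetical classes written for this example, and they handle binary labels only.

```python
import operator


class Metric:
    def compute(self):
        raise NotImplementedError

    # arithmetic builds a lazy expression instead of computing a number right away
    def __add__(self, other): return Lazy(operator.add, self, other)
    def __mul__(self, other): return Lazy(operator.mul, self, other)
    def __truediv__(self, other): return Lazy(operator.truediv, self, other)


class Lazy(Metric):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def compute(self):
        def resolve(value):
            return value.compute() if isinstance(value, Metric) else value
        return self.op(resolve(self.left), resolve(self.right))


class BinaryPrecision(Metric):
    def __init__(self):
        self.tp = self.fp = 0

    def update(self, y_pred, y):
        for p, t in zip(y_pred, y):
            if p == 1:
                self.tp += int(t == 1)
                self.fp += int(t != 1)

    def compute(self):
        return self.tp / (self.tp + self.fp)


class BinaryRecall(Metric):
    def __init__(self):
        self.tp = self.fn = 0

    def update(self, y_pred, y):
        for p, t in zip(y_pred, y):
            if t == 1:
                self.tp += int(p == 1)
                self.fn += int(p != 1)

    def compute(self):
        return self.tp / (self.tp + self.fn)


precision, recall = BinaryPrecision(), BinaryRecall()
f1 = precision * recall * 2 / (precision + recall)  # nothing is computed yet

batches = [([1, 1, 0, 0], [1, 0, 1, 0]), ([1, 0, 1, 1], [1, 0, 1, 0])]
for y_pred, y in batches:
    precision.update(y_pred, y)
    recall.update(y_pred, y)

f1_value = f1.compute()  # evaluated from the accumulated counts
```

The composed `f1` object holds references to the base metrics, so it always reflects whatever they have accumulated at the time `compute()` is called.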

Library structure

The library is composed of two main modules:

Core module contains bases like Engine, metrics, some essential handlers. It has PyTorch as the only dependency.
Contrib module may depend on other libraries (e.g. scikit-learn, tensorboardX, visdom, tqdm, etc) and can potentially have backward compatibility breaking changes between versions.

Both modules are largely covered by unit tests.

Extension capabilities / Simplicity of integration in research

We believe that our event/handler system is rather flexible and gives people the ability to interact with every part of the training process. Because of that, we’ve seen Ignite being used to train GANs (we provide two basic examples to train DCGAN and CycleGAN) or Reinforcement Learning models.

According to Github’s “Used by”, Ignite was used by researchers for their papers:

BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning, github
A Model to Search for Synthesizable Molecules, github
Localised Generative Flows, github
Extracting T Cell Function and Differentiation Characteristics from the Biomedical Literature, github

Because of those (and other research projects) we strongly believe that Ignite gives you enough flexibility to do deep learning research.

Integrations with other libraries/frameworks

Ignite plays nicely with other libraries or frameworks if their features do not overlap. Some cool integrations that we have include:

hyperparameter tuning with Ax (Ignite example).
hyperparameter tuning with Optuna (Optuna example).
logging to TensorBoard, Visdom, MLflow, Polyaxon, Neptune (Ignite’s code), Chainer UI (Chainer’s code).
Training with mixed precision using Nvidia Apex (Ignite’s examples).

Reproducibility

We’ve put a lot of effort into making Ignite training reproducible:

Ignite’s Engine automatically handles the random states and, where possible, forces the data loaders to provide the same data samples on different runs;
Ignite integrates with experiment tracking systems like MLflow, Polyaxon, Neptune. This helps to keep track of software, parameter, and data dependencies of ML experiments;
We provide several examples and “references” (inspired from torchvision) of reproducible training on vision tasks (e.g. classification on CIFAR10, ImageNet, and segmentation on Pascal VOC12).

Distributed training

Distributed training is also supported by Ignite, but we leave it up to the user to choose and set up the type of parallelism: model or data.

For example, in data distributed configuration users are required to correctly set up the distributed process group, wrap the model, use distributed sampler etc. Ignite handles metrics computation: reduction of the value across all processes.
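Why the reduction across processes matters can be shown with a small simulation in plain Python. This is only an illustration: `all_reduce_sum` is a hypothetical stand-in for a collective sum, and the three "processes" are just entries in a list.

```python
def all_reduce_sum(values_per_process):
    """Stand-in for a collective sum: every process ends up with the same total."""
    total = sum(values_per_process)
    return [total] * len(values_per_process)


# (num_correct, num_examples) seen locally by three simulated processes
local_counts = [(45, 50), (30, 50), (9, 10)]

correct = all_reduce_sum([c for c, _ in local_counts])
seen = all_reduce_sum([n for _, n in local_counts])

# every process computes the same global accuracy from the reduced counts
global_accuracy = [c / n for c, n in zip(correct, seen)]

# averaging per-process accuracies instead gives a different number
# whenever the processes saw different numbers of examples
naive_mean = sum(c / n for c, n in local_counts) / len(local_counts)
```

Reducing the raw counts rather than the per-process ratios is what keeps the metric identical to the single-process result.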

We provide several examples (e.g. distributed CIFAR10) to display how to use Ignite in a distributed configuration.

Popularity

At the moment of writing, Ignite had about 2.5k stars and, according to GitHub’s “Used by” feature, is used by 205 repositories.
Some honorable mentions are:

State-of-the-Art Conversational AI with Transfer Learning by HuggingFace
Tutorial on Transfer Learning in NLP held at NAACL 2019 by HuggingFace

Thomas Wolf from HuggingFace also left some awesome feedback for the library in one of his blog articles (Thanks, Thomas!):

“Using the awesome PyTorch ignite framework and the new API for Automatic Mixed Precision (FP16/32) provided by NVIDIA’s apex, we were able to distill our +3k lines of competition code in less than 250 lines of training code with distributed and FP16 options!”

Deep-Reinforcement-Learning-Hands-On-Second-Edition by Max Lapan. This is a book on Deep Reinforcement Learning whose second-edition examples are made with Ignite.
Project MONAI: AI Toolkit for Healthcare Imaging. This project, which focuses on healthcare research and develops DL models for medical imaging, uses Ignite for end-to-end training.

For other use-cases, please take a look at Ignite’s GitHub page and its “Used by”.

When to use Ignite

Remove boilerplate and standardize your code using highly customizable modules of Ignite’s API.
When you require factorized code but don’t want to sacrifice on flexibility to support your complicated training strategies
Use the rich array of utilities like metrics, handlers, and loggers available to evaluate/debug your model with ease

When not to use Ignite

When you have highly custom PyTorch code for which Ignite’s API would only be overhead.
When completely satisfied by pure PyTorch API or another high-level library

Thank you for reading! Pytorch-Ignite presented to you with love by the PyTorch community!

PyTorch Lightning

Philosophy

PyTorch Lightning is a very lightweight wrapper on PyTorch which is more like a coding standard than a framework. The format allows you to get rid of a ton of boilerplate code while keeping it easy to follow.

The use of hooks, standard across every part of the training, means you can override any part of the internal functionality down to how the backward pass is done - it is extremely flexible.

The result is a framework that gives researchers, students, and production teams the ultimate flexibility to try crazy ideas without having to learn yet another framework while automating away all the engineering details.

Lightning has two additional, more ambitious motivations: reproducibility of research and democratization of best practices in the deep learning community.

Notable features

Train on CPU, GPU or TPUs without changing your code!
Only library to support TPU training (Trainer(num_tpu_cores=8))
Trivial multi-node training
Trivial multi-GPU training
Trivial 16 bit precision support
Built-in performance profiler (Trainer(profiler=True))
Tons of integrations with libraries like tensorboard, comet.ml, neptune.ai, etc… (Trainer(logger=NeptuneLogger(...)))

Team

Lightning has 90+ contributors and a core team of 8 contributors who make sure the project moves forward lightning fast.

Documentation

Lightning documentation is extremely thorough yet simple and easy to use.

API

At the core, Lightning has an API that centers around two objects, the Trainer and the LightningModule.

The Trainer abstracts away all the engineering details and the LightningModule captures all the science/research code. This decoupling makes the research code more readable and allows it to run on arbitrary hardware.

LightningModule

All the research logic goes into LightningModule.

For example, in a cancer detection system, this part would handle the main things like the object detection model, data loaders for medical images etc.

It groups the core ingredients you need to build a deep learning system:

The computations (init, forward).
What happens in the training loop (training_step).
What happens in the validation loop (validation_step).
What happens in the testing loop (test_step).
The optimizer(s) to use (configure_optimizers).
The data to use (train, test, val dataloaders).

Let’s take a look at the example from the docs and unpack what is happening there.

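import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl


class MNISTExample(pl.LightningModule):

    def __init__(self):
        super().__init__()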
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss']
                                for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss']
                                for x in outputs]).mean()
        tensorboard_logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers
        # and learning_rate schedulers
        # (LBFGS it is automatically supported,
        # no need for closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(
            MNIST(os.getcwd(), train=True, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(
            MNIST(os.getcwd(), train=True, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        return DataLoader(
            MNIST(os.getcwd(), train=False, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

As you can see, the LightningModule builds on top of pure PyTorch code and simply organizes them in nine methods:

__init__(): Defines our model or multiple models, and initializes the weights
forward(): You can think of it as your standard PyTorch forward method but with additional flexibility to define what you want to happen at the prediction/inference level.
training_step(): Defines what happens in the training loop. It combines a forward pass, loss calculation, and any other logic you want to execute during training.
validation_step(): Defines what happens in the validation loop. For example, you can go calculate loss or accuracy for each batch and store them in the logs.
validation_end(): Everything that you want to happen after the validation loop ends. For example, you may want to calculate the average loss or accuracy over validation batches
test_step(): What you want to happen to each batch at inference time. You can put your Test Time Augmentation logic or other things here.
test_end(): Similarly to validation_end, you can use it to aggregate the batch results calculated during test_step
configure_optimizers(): initialize an optimizer or multiple optimizers
train/val/test_dataloader(): returns your PyTorch DataLoaders for train, validation, and test sets.

Since every PyTorch Lightning system needs to implement those methods it is really easy to see exactly what is happening in the research.

For example, to understand what a paper is doing, all you have to do is look at the training_step of the LightningModule!

This readability and a close mapping between the core research concepts and implementation lies at the core of Lightning.

Trainer

This is where the engineering part of deep learning happens.

In the cancer detection system, this might mean how many GPUs you use, when you save checkpoints, when you stop training, etc… These are details that make up a lot of the “secret sauce” of research but are standard best practices across deep learning projects (i.e. not hugely relevant to cancer detection).

Notice that the LightningModule has nothing about GPUs or 16-bit precision or early stopping or logging or anything like that. All of that is automatically handled by the trainer.

from pytorch_lightning import Trainer

model = MNISTExample()

# most basic trainer, uses good defaults
trainer = Trainer()    
trainer.fit(model)

That’s all it takes to train this model! The trainer handles everything for you including:

Early stopping
Automatic logging to Tensorboard (or comet, mlflow, neptune, etc…)
Auto checkpointing
And more (we’ll talk about that in the next sections)

All of this is free out of the box!
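The division of labour between the two objects can be illustrated with a toy version in plain Python. This is a sketch under stated assumptions, not Lightning's internals: `ToyModule` and `ToyTrainer` are hypothetical, the "model" is one weight, and the gradient is computed by hand instead of by autograd.

```python
class ToyModule:
    """Research code only: what the model is and how a loss is computed."""
    def __init__(self):
        self.weight = 0.0

    def forward(self, x):
        return self.weight * x

    def training_step(self, batch, batch_idx):
        x, y = batch
        error = self.forward(x) - y
        return {"loss": error ** 2, "grad": 2 * error * x}

    def configure_optimizers(self):
        return {"lr": 0.05}

    def train_dataloader(self):
        return [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


class ToyTrainer:
    """Engineering code only: the loop, the stopping rule, the bookkeeping."""
    def __init__(self, max_epochs=100, early_stop_loss=1e-12):
        self.max_epochs = max_epochs
        self.early_stop_loss = early_stop_loss
        self.history = []

    def fit(self, module):
        lr = module.configure_optimizers()["lr"]
        for _ in range(self.max_epochs):
            epoch_loss = 0.0
            for batch_idx, batch in enumerate(module.train_dataloader()):
                out = module.training_step(batch, batch_idx)
                module.weight -= lr * out["grad"]
                epoch_loss += out["loss"]
            self.history.append(epoch_loss)   # stands in for logging
            if epoch_loss < self.early_stop_loss:
                break                          # stands in for early stopping


module = ToyModule()
trainer = ToyTrainer(max_epochs=100)
trainer.fit(module)
```

Note that `ToyModule` contains no loop, no stopping rule, and no logging; swapping the trainer changes how training runs without touching the research code.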

The learning curve

Since LightningModule is simply reorganizing pure Pytorch objects and everything is “out in the open” it is trivial to refactor your PyTorch code to the Lightning format.

For more information about making the switch from pure PyTorch to Lightning read this article.

Built-in features (what you get out of the box)

Lightning gives a ton of advanced features out-of-the-box.
For instance, it takes a one-liner to use things like:

Multi-gpu training

Trainer(gpus=8)

TPU training

Trainer(num_tpu_cores=8)

Multi-node training

Trainer(gpus=8, num_nodes=8, distributed_backend='ddp')

Gradient Clipping

Trainer(gradient_clip_val=2.0)

Accumulated Gradients

Trainer(accumulate_grad_batches=12)

16-bit precision

Trainer(use_amp=True)

Truncated back-propagation through time

Trainer(truncated_bptt_steps=3)

and a lot more.

If you would like to see the full list of free-magic features go here.
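To show what two of these flags conceptually do, here is a hand-written loop in plain Python. It is a sketch, not Lightning's implementation: the function `train_with_accumulation` is hypothetical, the model is a single weight, gradients are averaged over the accumulated batches (an assumption), and clipping is applied to the absolute value since there is only one parameter.

```python
def train_with_accumulation(batches, weight, lr,
                            accumulate_grad_batches=1, gradient_clip_val=None):
    """Fit y = weight * x, stepping only once every `accumulate_grad_batches` batches."""
    accumulated, steps = 0.0, 0
    for i, (x, y) in enumerate(batches, start=1):
        grad = 2 * (weight * x - y) * x
        accumulated += grad / accumulate_grad_batches
        if i % accumulate_grad_batches == 0:
            if gradient_clip_val is not None:
                # clip the accumulated gradient before it is used
                accumulated = max(-gradient_clip_val, min(gradient_clip_val, accumulated))
            weight -= lr * accumulated
            accumulated = 0.0
            steps += 1
    # in this sketch, gradients from a trailing incomplete group are simply dropped
    return weight, steps


batches = [(1.0, 2.0)] * 12
weight, steps = train_with_accumulation(
    batches, weight=0.0, lr=0.1, accumulate_grad_batches=4)
```

With 12 batches and `accumulate_grad_batches=4` the optimizer steps three times, which is how accumulation simulates a batch four times larger than what fits in memory.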

Extension capabilities / Simplicity of integration in research

Having a bunch of in-built functionalities is great but for researchers, it’s crucial to not have to learn yet another library, and directly control key parts of research such as data-processing without having other abstractions operate on those.

This flexible format allows for the most freedom in training and validating. This interface should be thought of as a system, not as a model. The system might have multiple models (GANs, seq-2-seq, etc…) or just one model, such as this simple MNIST example.

Thus researchers are free to try as many crazy things as they want, and ONLY have to worry about the LightningModule.

But maybe you need even MORE flexibility. In this case, you can do things like:

Change how the backward step is done.
Change how 16-bit is initialized.
Add your own way of doing distributed training.
Add Learning rate schedulers.
Use multiple optimizers.
Change the frequency of optimizer updates.
And many many more things.

Under the hood, everything in Lightning is implemented as hooks that can be overridden by the user. This makes EVERY single aspect of training highly configurable — which is exactly the flexibility a research or production team needs.

But wait you say… this is too simple for your use case? No worries, Lightning was designed while doing research at NYU and Facebook AI Research for my PhD to be as flexible as possible for researchers.

Here are some examples:

Need your own backward pass? Override this hook:

def backward(self, use_amp, loss, optimizer):
    if use_amp:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()

Need your own amp init? Override this hook:

def configure_apex(self, amp, model, optimizers, amp_level):
    model, optimizers = amp.initialize(
        model, optimizers, opt_level=amp_level,
    )

    return model, optimizers

Want to go as deep as adding your own DDP implementation? Override these two hooks:

def configure_ddp(self, model, device_ids):
    # Lightning DDP simply routes to test_step, val_step, etc...
    model = LightningDistributedDataParallel(
        model,
        device_ids=device_ids,
        find_unused_parameters=True
    )
    return model

def init_ddp_connection(self):
    # use slurm job id for the port number
    # guarantees unique ports across jobs from same grid search
    try:
        # use the last 4 numbers in the job id as the id
        default_port = os.environ['SLURM_JOB_ID']
        default_port = default_port[-4:]

        # all ports should be in the 10k+ range
        default_port = int(default_port) + 15000

    except Exception as e:
        default_port = 12910

    # if user gave a port number, use that one instead
    try:
        default_port = os.environ['MASTER_PORT']
    except Exception:
        os.environ['MASTER_PORT'] = str(default_port)

    # figure out the root node addr
    try:
        root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
    except Exception:
        root_node = '127.0.0.2'

    root_node = self.trainer.resolve_root_node_address(root_node)
    os.environ['MASTER_ADDR'] = root_node
    dist.init_process_group(
        'nccl',
        rank=self.proc_rank,
        world_size=self.world_size
    )
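The port-selection part of the hook above is easy to test in isolation if it is written as a pure function over a dictionary. The helper below is a hypothetical rewrite for illustration (the name `resolve_master_port` is not part of Lightning); it follows the same rules as the hook: derive the port from the last four digits of the SLURM job id, fall back to 12910, and let an explicit MASTER_PORT win.

```python
def resolve_master_port(environ, fallback=12910):
    """Pick a rendezvous port from a dict shaped like os.environ."""
    try:
        # last 4 digits of the job id, shifted into the 15000+ range
        port = int(environ["SLURM_JOB_ID"][-4:]) + 15000
    except (KeyError, ValueError):
        port = fallback
    # a port given explicitly by the user takes precedence
    return int(environ.get("MASTER_PORT", port))


port_from_job = resolve_master_port({"SLURM_JOB_ID": "1234567"})
port_default = resolve_master_port({})
port_explicit = resolve_master_port({"SLURM_JOB_ID": "1234567", "MASTER_PORT": "29500"})
```

Jobs launched by the same grid search get different job ids, so they end up on different ports without any coordination.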


There are dozens of hooks like these and we add more as researchers request them.

The bottom line is that Lightning is trivial to use for a new user and infinitely extensible if you’re a researcher or production team working with the bleeding-edge AI research.

Readability and moving towards Reproducibility

As I mentioned, Lightning was created with a second more ambitious broad motivation: Reproducibility. While true reproducibility requires standard code, standard seeds, standard hardware, etc… Lightning contributes to reproducible research in two ways:

standardizing the format of the ML code,

decoupling the engineering from the science so that the approach can be tested in different systems.

The result is an expressive, powerful API for doing research.

If every research project and paper was implemented using the LightningModule template, it would be very easy to find out what’s going on (but perhaps not easy to understand haha)

Distributed training

Lightning makes multi-GPU or even multi-GPU multi-node training trivial.

For instance, if you want to train the above example on multiple GPUs just add the following flags to the trainer:

trainer = Trainer(gpus=4, distributed_backend='dp')    
trainer.fit(model)

Using the above flags will run this model on 4 GPUs.
If you want to run on say 16 GPUs, where you have 4 machines each with 4 GPUs, change the trainer flags to this:

trainer = Trainer(gpus=4, num_nodes=4, distributed_backend='ddp')
trainer.fit(model)

And submit the following SLURM job:

#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=0
#SBATCH --time=0-02:00:00

# activate conda env
source activate $1

# -------------------------
# debugging flags (optional)
 export NCCL_DEBUG=INFO
 export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo

# might need the latest cuda
# module load NCCL/2.4.7-1-cuda.10.0
# -------------------------

# run script from above
srun python3 mnist_example.py

This is crazy simple considering how much happens under the hood.

For more information about distributed training with PyTorch Lightning read this article about “How To Train A GAN On 128 GPUs Using PyTorch”.

Productionalization

Lightning models can be easily deployed because they’re still simple PyTorch models under the hood. This means we can leverage all the engineering advancements from the PyTorch community on supporting deployment.

Popularity

PyTorch Lightning has over 3800 stars on GitHub and has recently hit 110k downloads.
More importantly, the community is growing rapidly with over 90 contributors, many from the top AI labs in the world adding new features daily.
You can talk to us on Github or Slack.

When to use PyTorch Lightning

Lightning is made for professional researchers and production teams working on cutting edge research. It’s great when you know what you need to do. This focus means it adds advanced features for people looking to test/build things very quickly without getting bogged down in the details.

When not to use PyTorch Lightning

Although Lightning is made for professional researchers and data scientists, newcomers can still benefit. For newcomers, we recommend building a simple MNIST system from scratch using pure PyTorch. This will show them how to set up a training loop, etc. Once they understand how that works and how the forward/backward passes work, they can move to Lightning.

Torchbearer

Our part of the blog will be a little different from the others because torchbearer is coming to an end (sort of). In particular, we are joining the PyTorch-Lightning team. The move came about from a meeting with William Falcon at NeurIPS 2019, and was recently announced on the PyTorch blog.

So, instead of trying to sell you torchbearer, we thought we should write about what we did well, what we did wrong, and why we are moving to Lightning.

What we did well

The lib got pretty popular and got to 500+ stars on GitHub which was far more than we had ever imagined.
We became a part of the PyTorch ecosystem. It was an important experience for us that allowed us to feel like a valued part of a wider community.
We’ve built a comprehensive set of built-in callbacks and metrics. This was one of our key successes; a lot of powerful outcomes can be achieved in a single line of code with torchbearer.
An important feature of torchbearer that enables extreme flexibility is the state object. This is a mutable dictionary that houses all of the variables that are in use by the core training loop. By editing these variables in callbacks at different points in the loop, most highly complex outcomes can be achieved.
It was always important to us that torchbearer had good documentation. We focused on example-led docs that can be executed in your browser with Google Colab. The example library has been a success, giving quick information on the more powerful use cases of torchbearer.
A final thing to note is that torchbearer has been used by both of us over the past two years for our PhD research. We count this as a success because we have almost never had to change the torchbearer API in order to prototype our ideas, even the ridiculous ones!

What we did wrong

The state object, which makes this library so flexible, is also problematic. The ability to access any part of the library from any other lends itself to abuse in the same way that global variables do. In particular, determining how and when a particular variable in the state object was changed is challenging once more than one object is acting on it. Additionally, for state to be effective you need to know what each variable is and in which callbacks you can access it, so the learning curve is steep.
By its nature, torchbearer does not lend itself to distributed training, or even to some extent low precision training. Since every part of state is available at all times, how do you chunk this and distribute it across devices? PyTorch can deal with this in some way, in that torchbearer can be used when distributed, but it is unclear exactly what is happening to state at these times.
Changing the core training loop was non-trivial. Torchbearer offers a way to completely write your own core loop, but you then have to manually write in callback points to ensure all the built-in Torchbearer functionality. Coupling this with a lower standard of documentation compared to other aspects of the library, custom loops were overly complicated and likely completely unknown to most users.
Managing an open-source project while working on our PhDs ended up being more difficult than expected. As a result, some parts of the library were thoroughly tested and stable (since they were important for our PhD work), while others were under-developed and buggy.
During our initial growth, we decided to dramatically change the core API. This significantly improved Torchbearer, but also meant a lot of effort moving from one version to the next. It felt justified as we were still pre 1.0.0 stable release but it certainly contributed to some users choosing other libraries.

Why are we joining PyTorch Lightning?

The first key reason for our willingness to move to Lightning is its popularity. With Lightning we become part of the fastest-growing PyTorch training library, which has already eclipsed many of its competitors.
The second key reason for our move, and a key part of the success of Lightning, is that it was built from the ground up to support distributed training and low precision, both challenging to implement in torchbearer. These practical considerations made in the early stages of Lightning’s development are invaluable to the modern deep learning practitioner and would be challenging to retro-fit in torchbearer.
In addition, at Lightning we will be part of a larger team of core developers. This will enable us to ensure greater stability and to support a broader range of use cases than is possible with just two developers as we have now.

Ultimately, we have always believed that the best way to move things forward would be to join efforts with another library. This is our chance to do that and help Lightning become the best training library for PyTorch.

(Subjective) Comparison and Final Thoughts

At this point, I want to give a…

huge THANK YOU to all the authors!

Wow, this is a lot of first-hand info and I hope it will make it easier to choose the library that works for you.

As I was working on this article with them and looking closer at what their libraries have to offer (and creating some Pull Requests), I gained my own personal perspective that I want to share with you here.

Skorch

If you want the sklearn-like API then Skorch is your lib. It is well tested and documented. It actually gives more flexibility than I had anticipated before working on this article, which was a nice surprise. That said, the focus of this lib is not cutting-edge research but rather production applications. I feel that it really delivers on its promise and does exactly what it was built to do. I really respect tools/libs like that.

Fastai

Fastai for a long time has been a great choice for people getting into deep learning. It can get you state-of-the-art results in 10 lines of almost magical code. But there is another side to the library, perhaps lesser-known, that lets you access lower-level APIs and create custom building blocks that give researchers and practitioners flexibility to implement very complex systems. Maybe it was the uber-popular fastai deep learning course that created a false image of this library in my mind but I will definitely take it for a spin in the future, especially with the recent v2 pre-release.

Pytorch Ignite

Ignite is an interesting animal. With its engine, event, and handler API, which is a bit exotic for my personal taste, you can do pretty much whatever you want. It has a ton of features out-of-the-box and I definitely understand why many researchers use it in their daily work. It took me a moment to get familiar with the framework but you just need to stop thinking in “callback terms” and you’ll be fine. That said, the API doesn’t speak to me as clearly as some other libs. You should check it out though, as it may be a great choice for you.

Catalyst

Before looking into Catalyst I thought it was a heavy(ish) framework for creating deep learning pipelines. Now my view is completely different. It decouples engineering stuff from research in a beautiful way. Pure PyTorch objects go into a trainer that deals with the training. It is very flexible and has a separate module that deals with Reinforcement Learning. It also gives you a lot of features out-of-the-box when it comes to reproducibility, and serving models in production. And those multistage pipelines I told you about? You can easily create them with minimal overhead. Overall I think it is a great project and a lot of people out there could benefit from using it.

Pytorch Lightning

Lightning also wants to separate science from engineering and I think it does a great job at that. There are just a ton of in-built features that make it even more appealing.
But something that makes this library a bit different is that it enables reproducibility by making deep learning research implementations readable. It is really easy to follow the logic inside of the LightningModule where the training step (among other things) is not abstracted away. I think communicating research projects in this way can be extremely effective. It is getting very popular very quickly and with authors of Torchbearer joining the core developer team I think that this project has a bright future in front of it, Lightning bright even 🙂

So which one should you choose?
As always it depends but I think you now have enough information to make a good decision!

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Keras Metrics: Everything You Need To Know

Jakub Czakon — Mon, 13 Apr 2020 07:58:50 +0000

This article was originally posted by Derrick Mwiti on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Keras metrics are functions that are used to evaluate the performance of your deep learning model. Choosing a good metric for your problem is usually a difficult task.

you need to understand which metrics are already available in Keras and tf.keras and how to use them,
in many situations you need to define your own custom metric because the metric you are looking for doesn’t ship with Keras.
sometimes you want to monitor model performance by looking at charts like ROC curve or Confusion Matrix after every epoch.

Lucky for you, this article explains all that!

Keras metrics 101

In Keras, metrics are passed during the compile stage as shown below. You can pass several metrics by putting them in a list.

from keras import metrics

model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=[metrics.mae,
                       metrics.categorical_accuracy])

How should you choose those evaluation metrics?

Some of them are available in Keras, others in tf.keras. Sometimes you need to implement your own custom metrics.

Let’s go over all of those situations.

Which metrics are available in Keras?

Keras provides a rich pool of inbuilt metrics. Depending on your problem, you’ll use different ones.

Let’s look at some of the problems you may be working on.

Binary classification

Binary classification metrics are used on computations that involve just two classes. A good example is building a deep learning model to predict cats and dogs. We have two classes to predict and the threshold determines the point of separation between them. binary_accuracy and accuracy are two such functions in Keras.

binary_accuracy, for example, computes the mean accuracy rate across all predictions for binary classification problems.

keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
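
To see what the threshold does, here is a small plain-Python sketch of the same calculation. It is an illustration of the idea rather than Keras' implementation, and the labels and probabilities are made up.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    hits = [int((p > threshold) == bool(t)) for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


y_true = [1, 0, 1, 1, 0]
y_pred = [0.9, 0.2, 0.4, 0.7, 0.6]  # predicted probabilities (made up)

print(binary_accuracy(y_true, y_pred))                 # 0.6
print(binary_accuracy(y_true, y_pred, threshold=0.3))  # 0.8
```

Lowering the threshold to 0.3 turns the 0.4 prediction into a positive, which fixes one mistake here; on other data it could just as easily introduce new ones.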

The accuracy metric computes the accuracy rate across all predictions. y_true represents the true labels while y_pred represents the predicted ones.

keras.metrics.accuracy(y_true, y_pred)

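A confusion matrix is a table showing the true positives, true negatives, false positives, and false negatives. Keras does not ship a confusion_matrix function of its own, so a common choice is the one from scikit-learn, called on predicted class labels rather than probabilities.

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)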

In the above confusion matrix, the model made 3305 + 375 correct predictions and 106 + 714 wrong predictions.
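
The accuracy implied by those four counts can be worked out directly. A quick sketch of the arithmetic:

```python
correct = 3305 + 375  # entries on the diagonal of the confusion matrix
wrong = 106 + 714     # off-diagonal entries
total = correct + wrong

accuracy = correct / total
print(total)               # 4500
print(round(accuracy, 4))  # 0.8178
```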

You can also visualize it as a matplotlib chart which we will cover later.

Multiclass classification

These metrics are used for classification problems involving more than two classes. Extending our animal classification example you can have three animals, cats, dogs, and bears. Since we are classifying more than two animals, this is a multiclass classification problem.

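When the labels are given as integers, the shape of y_true is the number of entries by 1, that is (n, 1), while the shape of y_pred is the number of entries by the number of classes, (n, c). One-hot encoded labels have the same (n, c) shape as y_pred.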

categorical_accuracy metric computes the mean accuracy rate across all predictions.

keras.metrics.categorical_accuracy(y_true, y_pred)

sparse_categorical_accuracy is similar to the categorical_accuracy but mostly used when making predictions for sparse targets. A great example of this is working with text in deep learning problems such as word2vec. In this case, one works with thousands of classes with the aim of predicting the next word. This task produces a situation where the y_true is a huge matrix that is almost all zeros, a perfect spot to use a sparse matrix.

keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
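
The difference between the two is only in how the labels are encoded. A plain-Python sketch with made-up predictions (not the Keras implementation) shows that one-hot and integer labels lead to the same accuracy:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def categorical_accuracy(y_true_onehot, y_pred):
    hits = [argmax(t) == argmax(p) for t, p in zip(y_true_onehot, y_pred)]
    return sum(hits) / len(hits)


def sparse_categorical_accuracy(y_true_int, y_pred):
    hits = [t == argmax(p) for t, p in zip(y_true_int, y_pred)]
    return sum(hits) / len(hits)


y_pred = [[0.1, 0.7, 0.2],
          [0.8, 0.1, 0.1],
          [0.3, 0.3, 0.4]]
y_true_onehot = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]  # one row per sample
y_true_sparse = [1, 0, 1]                          # same labels as integers

print(categorical_accuracy(y_true_onehot, y_pred))
print(sparse_categorical_accuracy(y_true_sparse, y_pred))
```

Both calls report two correct predictions out of three; the sparse form simply avoids storing a mostly-zero matrix.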

top_k_categorical_accuracy computes the top-k categorical accuracy rate. We take the top k predicted classes from our model and check whether the correct class is among them. If it is, we count the prediction as correct.

keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
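
Here is a rough sketch of the top-k idea in plain Python. For brevity it takes integer labels, whereas the Keras function above expects one-hot targets, and the scores are made up.

```python
def top_k_accuracy(y_true, y_pred, k=5):
    """Share of samples whose true class is among the k highest scores."""
    hits = 0
    for label, scores in zip(y_true, y_pred):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(y_true)


y_true = [2, 0, 3]
y_pred = [[0.05, 0.5, 0.3, 0.15],
          [0.2, 0.4, 0.3, 0.1],
          [0.6, 0.2, 0.15, 0.05]]

for k in (1, 2, 3):
    print(k, round(top_k_accuracy(y_true, y_pred, k=k), 3))
```

The score can only go up as k grows, which is why top-k accuracy is usually reported alongside plain (top-1) accuracy.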

Regression

The metrics used in regression problems include Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error. These metrics are used when predicting numerical values such as sales and prices of houses. Check out this resource for a complete guide on regression metrics.

from keras import metrics

model.compile(loss='mse', optimizer='adam',
              metrics=[metrics.mean_squared_error,
                       metrics.mean_absolute_error,
                       metrics.mean_absolute_percentage_error])
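
For intuition, the three regression metrics can be computed by hand on a tiny example. This is a plain-Python sketch with made-up house prices, not the Keras implementation.

```python
y_true = [100.0, 200.0, 400.0]  # made-up prices
y_pred = [110.0, 180.0, 400.0]  # made-up predictions

errors = [p - t for t, p in zip(y_true, y_pred)]
n = len(errors)

mse = sum(e ** 2 for e in errors) / n                            # Mean Squared Error
mae = sum(abs(e) for e in errors) / n                            # Mean Absolute Error
mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n  # in percent

print(round(mse, 2), round(mae, 2), round(mape, 2))
```

MSE punishes the 20-unit miss much more than the 10-unit one, MAE treats them proportionally, and MAPE scores both as the same 10% error.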

How to create a custom metric in Keras?

As we had mentioned earlier, Keras also allows you to define your own custom metrics.

The function you define has to take y_true and y_pred as arguments and must return a single tensor value. These objects are of type Tensor with float32 data type. The shape of the object is the number of rows by 1. For example, if you have 4,500 entries the shape will be (4500, 1).

You can use the function by passing it at the compilation stage of your deep learning model.

model.compile(..., metrics=[your_custom_metric])

How to calculate F1 score in Keras (precision, and recall as a bonus)?

Let’s see how you can compute the f1 score, precision and recall in Keras. We will create it for the multiclass scenario but you can also use it for binary classification.

The f1 score is the harmonic mean of precision and recall. So to calculate f1 we need to create functions that calculate precision and recall first. Note that in the multiclass scenario you need to look at all classes, not just the positive class (which is the case for binary classification).

from keras import backend as K


def recall(y_true, y_pred):
    # Fraction of actual positives that the model predicted as positive.
    # y_true is expected as 0/1 (or one-hot) labels.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())


def precision(y_true, y_pred):
    # Fraction of predicted positives that are actually positive.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())


def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall; epsilon guards against division by zero.
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
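
A small worked example makes the relationship between the three numbers concrete. This plain-Python sketch uses made-up binary labels and assumes there is at least one actual and one predicted positive, so no division by zero can occur.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # 2 true positives, 1 false positive, 2 false negatives

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.6667 0.5 0.5714
```

Because F1 is a harmonic mean, it sits below the arithmetic mean of precision and recall (about 0.583 here) and is pulled towards the weaker of the two.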

The next step is to use these functions at the compilation stage of our deep learning model. We are also adding the Keras accuracy metric that is available by default.

model.compile(...,metrics=['accuracy', f1_score, precision, recall])

Let’s now fit the model to the training set.

model.fit(x_train, y_train, epochs=5)

Now you can evaluate your model and access the metrics you have just created.

# Use new names for the results so the metric functions defined above are not overwritten.
loss, accuracy, f1, prec, rec = model.evaluate(x_test, y_test, verbose=1)

Great, you now know how to create custom metrics in keras.

That said, sometimes you can use something that is already there, just in a different library like tf.keras 🙂

Which metrics are available in tf.keras?

Recently Keras has become a standard API in TensorFlow and there are a lot of useful metrics that you can use.

Let’s look at some of them.
Unlike in Keras where you just call the metrics using keras.metrics functions, in tf.keras you have to instantiate a Metric class.

For example:

tf.keras.metrics.Accuracy()

There is quite a bit of overlap between keras metrics and tf.keras. However, there are some metrics that you can only find in tf.keras.

Let’s take a look at those.

tf.keras Classification Metrics

tf.keras.metrics.AUC computes the approximate AUC (Area under the curve) for ROC curve via the Riemann sum.

model.compile('sgd', loss='mse', metrics=[tf.keras.metrics.AUC()])
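
The Riemann-sum idea can be sketched in a few lines of plain Python: sweep a threshold from 1 down to 0, record the false and true positive rates at each step, and add up the area under the resulting curve. This is a simplified illustration under my own assumptions (evenly spaced thresholds, trapezoidal summation, scores in the range 0 to 1), not the tf.keras implementation, which offers more options.

```python
def roc_auc(y_true, scores, num_thresholds=200):
    """Approximate the area under the ROC curve by sweeping a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives

    # Evenly spaced thresholds from 1 down to 0. The two ends are nudged so the
    # curve starts at (0, 0) and finishes at (1, 1).
    steps = num_thresholds - 1
    thresholds = [1.0 - i / steps for i in range(num_thresholds)]
    thresholds[0] += 1e-7
    thresholds[-1] -= 1e-7

    points = []  # (false positive rate, true positive rate) per threshold
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s > threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s > threshold)
        points.append((fp / negatives, tp / positives))

    # Riemann (trapezoidal) sum over consecutive points of the curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
print(roc_auc(y_true, scores))  # 0.75
```

With only a finite number of thresholds the result is an approximation; scores that fall between the same two thresholds cannot be told apart.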

You can use precision and recall that we have implemented before, out of the box in tf.keras.

model.compile('sgd', loss='mse', 
               metrics=[tf.keras.metrics.Precision(), 
                        tf.keras.metrics.Recall()])

tf.keras Segmentation Metrics

tf.keras.metrics.MeanIoU – Mean Intersection-Over-Union is a metric used for the evaluation of semantic image segmentation models. We first calculate the IOU for each class as true_positives / (true_positives + false_positives + false_negatives) and then average over the classes:

model.compile(..., metrics=[tf.keras.metrics.MeanIoU(num_classes=2)])
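
As a sanity check of the definition, here is a plain-Python sketch that computes the per-class IoU as TP / (TP + FP + FN) and averages it. It works on flat lists of class labels (think of them as flattened segmentation masks) and is not the tf.keras implementation.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average of per-class intersection-over-union."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        union = tp + fp + fn
        if union:  # skip classes that never appear in labels or predictions
            ious.append(tp / union)
    return sum(ious) / len(ious)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 0, 1]
print(round(mean_iou(y_true, y_pred, num_classes=2), 4))  # 0.3333
```

Each class has one true positive, one false positive, and one false negative, so both per-class IoUs are 1/3 and so is their mean.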

tf.keras Regression Metrics

Just like Keras, tf.keras has similar regression metrics. We won’t dwell on them much but there is an interesting metric to highlight called MeanRelativeError.

MeanRelativeError takes the absolute error for an observation and divides it by a constant. This constant, normalizer, can be the same for all observations or different for each sample.

Therefore, the mean relative error is the average of the relative errors.

tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])
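
To make the formula concrete, here is the same calculation done by hand in plain Python with the normalizer from the line above. The labels and predictions are made up for illustration, and this is a sketch of the formula rather than the tf.keras code.

```python
def mean_relative_error(y_true, y_pred, normalizer):
    """Average of |prediction - truth| / normalizer over all samples."""
    relative_errors = [abs(p - t) / n
                       for t, p, n in zip(y_true, y_pred, normalizer)]
    return sum(relative_errors) / len(relative_errors)


y_true = [1, 3, 2, 3]
y_pred = [2, 4, 6, 8]
normalizer = [1, 3, 2, 3]  # one constant per sample

print(round(mean_relative_error(y_true, y_pred, normalizer), 4))  # 1.25
```

The individual relative errors are 1/1, 1/3, 4/2, and 5/3, which sum to 5 and average to 1.25.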

How to create a custom metric in tf.keras?

In tf.keras you can create a custom metric by extending the tf.keras.metrics.Metric class.
To do so you have to override the update_state, result, and reset_states functions (reset_states is called reset_state in newer TensorFlow releases):

update_state() does all the updates to state variables and calculates the metric,
result() returns the value for the metric from state variables,
reset_states() sets the metric value at the beginning of each epoch to a predefined constant (typically 0)

import tensorflow as tf


class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)

Then we simply pass it at compile stage:

model.compile(...,metrics=[MulticlassTruePositives()])
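
If the update/result/reset cycle feels abstract, the following framework-free sketch mimics it. RunningTruePositives is a hypothetical stand-in that borrows the three method names; the batches are made up, and the outer loop plays the role that Keras plays during training.

```python
class RunningTruePositives:
    """Toy stateful metric that counts correct class predictions."""

    def __init__(self):
        self.true_positives = 0

    def update_state(self, y_true, y_pred_class):
        # Called once per batch: accumulate into the running total.
        self.true_positives += sum(t == p for t, p in zip(y_true, y_pred_class))

    def result(self):
        # Called whenever the current value needs to be reported.
        return self.true_positives

    def reset_states(self):
        # Called at the start of each epoch so counts do not leak across epochs.
        self.true_positives = 0


metric = RunningTruePositives()
epochs = [
    [([0, 1, 2], [0, 1, 1]), ([2, 2], [2, 0])],  # epoch 0: two batches
    [([0, 1, 2], [0, 1, 2]), ([2, 2], [2, 2])],  # epoch 1: two batches
]

results = []
for batches in epochs:
    metric.reset_states()
    for y_true, y_pred_class in batches:
        metric.update_state(y_true, y_pred_class)
    results.append(metric.result())

print(results)  # [3, 5]
```

Without the reset, the second epoch would report 8 instead of 5, because the count from the first epoch would carry over.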

Performance charts: ROC curve and Confusion Matrix in Keras

Sometimes the performance cannot be represented as one number but rather as a performance chart. Examples of such charts are ROC curve or confusion matrix. In those cases, you may want to log those charts somewhere for further inspection.

To do it you need to create a callback that will track the performance of your model on every epoch end. Then, you can take a look at the improvement in a folder or an experiment tracking tool.
So let’s do that.

First, we need a callback that creates ROC curve and confusion matrix at the end of each epoch.

import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc


class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]             
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16,12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16,12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))

Now we simply pass it to the model.fit() callbacks argument.

performance_cbk = PerformanceVisualizationCallback(
                      model=model,
                      validation_data=validation_data,
                      image_dir='performance_visualizations')

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk])

You can have multiple callbacks if you want to.

Now you will be able to look at those visualizations as your model trains:

Note:

If you want to log everything to an experiment tracking tool like Neptune, your callback would look a bit different:

from keras.callbacks import Callback
import neptune
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
import matplotlib.pyplot as plt

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')

class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]

        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)

Notice that you don’t need to create folders for images as the charts will be sent to your tool directly. On the flip side you have to create an experiment to start tracking your runs.
Once you have that it is business as usual.

neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[neptune_logger])

You can explore metrics and performance charts in the app.

How to plot the Keras History object?

Whenever fit() is called, it returns a History object that can be used to visualize the training history. It contains a dictionary with loss and metric values at each epoch calculated both for training and validation datasets.

For example, let’s extract the ‘accuracy’ metric and use matplotlib to plot it.

import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, 
                    validation_split=0.25, 
                    epochs=50, batch_size=16, verbose=1)

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
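
Since history.history is an ordinary dictionary of lists, you can also query it without plotting. The sketch below uses a hand-written dictionary with made-up values in place of a real History object, just to show the shape of the data.

```python
# Stand-in for history.history: one list per metric, one entry per epoch.
# The numbers are invented for illustration.
history_dict = {
    "accuracy": [0.71, 0.80, 0.86, 0.90],
    "val_accuracy": [0.70, 0.78, 0.81, 0.80],
}

val_accuracy = history_dict["val_accuracy"]
best_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__)
final_gap = history_dict["accuracy"][-1] - val_accuracy[-1]

print(best_epoch)           # 2 (epochs are counted from 0)
print(round(final_gap, 2))  # 0.1
```

A growing gap between training and validation accuracy, as in the last epoch here, is the kind of pattern the plot above is meant to reveal.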

Keras Metrics Example

Ok, so you’ve come a long way and learned a bunch. To refresh your memory let’s put it all together in a single example.
We’ll start by taking the mnist dataset and creating a simple neural network model:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
validation_data = x_test, y_test

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

We’ll create a custom metric, multiclass f1 score in keras:

from tensorflow.keras import backend as K


def recall(y_true, y_pred):
    # Fraction of actual positives that the model predicted as positive.
    # y_true is expected as 0/1 (or one-hot) labels.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())


def precision(y_true, y_pred):
    # Fraction of predicted positives that are actually positive.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())


def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall; epsilon guards against division by zero.
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))

We’ll create a custom tf.keras metric: MulticlassTruePositives to be exact:

class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        # The state of the metric will be reset at the start of each epoch.
        self.true_positives.assign(0.)

We’ll compile the keras model with our metrics:

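model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              # MNIST labels are integers, so the sparse metric variants are used here.
              # f1_score, recall and precision as defined above assume one-hot labels.
              metrics=['accuracy',
                       tf.keras.metrics.sparse_categorical_accuracy,
                       f1_score,
                       recall,
                       precision,
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5),
                       MulticlassTruePositives()])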

We’ll implement a keras callback that plots the ROC curve and confusion matrix to a folder:

import os

from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc

class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir

    def on_epoch_end(self, epoch, logs={}):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]             
        y_pred_class = np.argmax(y_pred, axis=1)

        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16,12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))

        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16,12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))

performance_viz_cbk = PerformanceVisualizationCallback(
                                       model=model,
                                       validation_data=validation_data,
                                       image_dir='performance_charts')

We’ll run training and monitor the performance:

history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_viz_cbk])

We’ll visualize metrics from keras history object:

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

You can monitor and explore your experiments in a tool like TensorBoard or Neptune. You just need to add another callback or modify the one you have created before:

Tensorboard

from tensorflow.keras.callbacks import TensorBoard

tensorboard_cbk = TensorBoard(log_dir="logs/training-example/")

history = model.fit(..., callbacks=[performance_viz_cbk, 
                                    tensorboard_cbk])

With TensorBoard you need to start a local server and explore your runs in the browser.

tensorboard --logdir logs/training-example/

Neptune

neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')

class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data

    def on_batch_end(self, batch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'batch_{log_name}', log_value)

    def on_epoch_end(self, epoch, logs={}):
        for log_name, log_value in logs.items():
            neptune.log_metric(f'epoch_{log_name}', log_value)

        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]

        y_pred_class = np.argmax(y_pred, axis=1)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)

        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)

neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)

history = model.fit(..., callbacks=[neptune_logger])

Check this example experiment run if you are interested:

Final Thoughts

Hopefully, this article gave you some background into model evaluation techniques in keras.

We’ve covered:

built-in methods in keras and tf.keras,
implementation of your own custom metrics,
how you can visualize custom performance charts as your model is training.

For more information check out the Keras Repository and TensorFlow Metrics documentation.

Happy training!

How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps

Jakub Czakon — Wed, 25 Mar 2020 10:12:48 +0000

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

You wrote a Python script that trains and evaluates your machine learning model. Now, you would like to automatically tune hyperparameters to improve its performance?

I got you!

In this article, I will show you how to convert your script into an objective function that can be optimized with any hyperparameter optimization library.

It will take just 3 steps and you will be tuning model parameters like there is no tomorrow.

Ready?

Let's go!

I suppose your main.py script looks something like this one:

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

data = pd.read_csv('data/train.csv', nrows=10000)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

params = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.4,
          'max_depth': 15,
          'num_leaves': 20,
          'feature_fraction': 0.8,
          'subsample': 0.2}

model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])

score = model.best_score['valid']['auc']
print('validation AUC:', score)

Step 1: Decouple search parameters from code

Take the parameters that you want to tune and put them in a dictionary at the top of your script. By doing that you effectively decouple search parameters from the rest of the code.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}

data = pd.read_csv('../data/train.csv', nrows=10000)

X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

params = {'objective': 'binary',
          'metric': 'auc',
          **SEARCH_PARAMS}

model = lgb.train(params, train_data,
                  num_boost_round=300,
                  early_stopping_rounds=30,
                  valid_sets=[valid_data],
                  valid_names=['valid'])

score = model.best_score['valid']['auc']
print('validation AUC:', score)

Step 2: Wrap training and evaluation into a function

Now, you can put the entire training and evaluation logic inside of a train_evaluate function. This function takes parameters as input and outputs the validation score.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

SEARCH_PARAMS = {'learning_rate': 0.4,
                 'max_depth': 15,
                 'num_leaves': 20,
                 'feature_fraction': 0.8,
                 'subsample': 0.2}

def train_evaluate(search_params):
    data = pd.read_csv('../data/train.csv', nrows=10000)
    X = data.drop(['ID_code', 'target'], axis=1)
    y = data['target']
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)
    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
    params = {'objective': 'binary',
              'metric': 'auc',
              **search_params}
    model = lgb.train(params, train_data,
                      num_boost_round=300,
                      early_stopping_rounds=30,
                      valid_sets=[valid_data],
                      valid_names=['valid'])
    score = model.best_score['valid']['auc']
    return score

if __name__ == '__main__':
    score = train_evaluate(SEARCH_PARAMS)
    print('validation AUC:', score)

Step 3: Run Hyperparameter Tuning script

We are almost there.

All you need to do now is to use this train_evaluate function as an objective for the black-box optimization library of your choice.
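To make the black-box idea concrete, here is a minimal sketch of a plain random search written with only the standard library. It is not the method used in the rest of this article: toy_train_evaluate is a hypothetical stand-in for train_evaluate so that the snippet runs without any data, and only two of the parameters are sampled.

```python
import math
import random


def toy_train_evaluate(params):
    # Hypothetical stand-in for train_evaluate: returns a score in (0, 1]
    # that peaks at learning_rate=0.1 and max_depth=8.
    lr_penalty = (math.log10(params['learning_rate']) + 1.0) ** 2
    depth_penalty = ((params['max_depth'] - 8) / 10.0) ** 2
    return 1.0 / (1.0 + lr_penalty + depth_penalty)


def sample_params(rng):
    # Draw one configuration: log-uniform learning rate, uniform integer depth.
    return {'learning_rate': 10 ** rng.uniform(-2, math.log10(0.5)),
            'max_depth': rng.randint(1, 30)}


def random_search(objective, n_calls, seed=0):
    # Evaluate n_calls random configurations and keep the best one.
    rng = random.Random(seed)
    best_score, best_params = float('-inf'), None
    for _ in range(n_calls):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


best_score, best_params = random_search(toy_train_evaluate, n_calls=100)
print('best result: ', best_score)
print('best parameters: ', best_params)
```

The only thing the search loop needs is a function that maps a parameter dictionary to a score, which is exactly what train_evaluate provides.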

I will use Scikit Optimize which I have described in great detail in another article but you can use any hyperparameter optimization library out there.

In a nutshell I:

define the search SPACE,
create the objective function that will be minimized,
run the optimization via skopt.forest_minimize function.

In this example, I will try 100 different configurations starting with 10 randomly chosen parameter sets.

import skopt
from script_step2 import train_evaluate

SPACE = [
    skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
    skopt.space.Integer(1, 30, name='max_depth'),
    skopt.space.Integer(2, 100, name='num_leaves'),
    skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
    skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    return -1.0 * train_evaluate(params)

results = skopt.forest_minimize(objective, SPACE, n_calls=100, n_random_starts=10)

best_auc = -1.0 * results.fun
best_params = results.x

print('best result: ', best_auc)
print('best parameters: ', best_params)

This is it.

The results object contains information about the best score and parameters that produced it.
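Note that results.x is a plain list of values in the same order as the dimensions in SPACE. A small sketch of how you could turn it back into a named dictionary is shown below. The values in example_x are made-up placeholders, not real search results.

```python
# Names in the same order as the dimensions in SPACE.
param_names = ['learning_rate', 'max_depth', 'num_leaves',
               'feature_fraction', 'subsample']

# Hypothetical stand-in for results.x; real values come from the search.
example_x = [0.05, 12, 40, 0.7, 0.9]

# Pair every name with the value found at the same position.
best_params = dict(zip(param_names, example_x))
print(best_params)
```

A dictionary in this form can be passed straight back to train_evaluate to retrain with the best configuration.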

Note:

If you want to visualize your training and save diagnostic charts after it finishes you can add one callback and one function call to log every hyperparameter search to Neptune.

Just use this skopt monitoring helper function.

import neptune
import neptunecontrib.monitoring.skopt as sk_utils
import skopt
from script_step2 import train_evaluate

neptune.init('jakub-czakon/blog-hpo')
neptune.create_experiment('hpo-on-any-script', upload_source_files=['*.py'])

SPACE = [
    skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
    skopt.space.Integer(1, 30, name='max_depth'),
    skopt.space.Integer(2, 100, name='num_leaves'),
    skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
    skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    return -1.0 * train_evaluate(params)

monitor = sk_utils.NeptuneMonitor()
results = skopt.forest_minimize(objective, SPACE, n_calls=100, n_random_starts=10, callback=[monitor])

sk_utils.log_results(results)
neptune.stop()

Now, when you run your parameter sweep you will see the following:

Check out the skopt hyperparameter sweep experiment with all the code, charts and results.

Final thoughts

In this article, you've learned how to optimize hyperparameters of pretty much any Python script in just 3 steps.

Hopefully, with this knowledge, you will build better machine learning models with less effort.

Happy training!

Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools

Jakub Czakon — Mon, 23 Mar 2020 06:40:31 +0000

This article was originally posted by Shahul ES on the Neptune blog.

Exploratory data analysis is one of the most important parts of any machine learning workflow and Natural Language Processing is no different. But which tools should you choose to explore and visualize text data efficiently?

In this article, we will discuss and implement nearly all the major techniques that you can use to understand your text data and give you a complete(ish) tour of the Python tools that get the job done.

Before we start: Dataset and Dependencies

In this article, we will use a million news headlines dataset from Kaggle.

If you want to follow the analysis step-by-step you may want to install the following libraries:

pip install \
   pandas matplotlib numpy \
   nltk seaborn scikit-learn gensim pyldavis \
   wordcloud textblob spacy textstat

Now, we can take a look at the data.

import pandas as pd
news= pd.read_csv('data/abcnews-date-text.csv',nrows=10000)
news.head(3)

The dataset contains only two columns, the published date, and the news heading.

For simplicity, I will be exploring the first 10000 rows from this dataset. Since the headlines are sorted by publish_date, these rows actually cover about 2 months, from February/19/2003 until April/07/2003.

Ok, I think we are ready to start our data exploration!

Analyzing text statistics

Text statistics visualizations are simple but very insightful techniques.

They include:

word frequency analysis,
sentence length analysis,
average word length analysis,
etc.

Those really help explore the fundamental characteristics of the text data.

To do so, we will be mostly using histograms (continuous data) and bar charts (categorical data).
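Before moving to the real data, here is a tiny standard-library sketch of the three statistics listed above, computed on a few hypothetical headlines that I made up for illustration.

```python
from statistics import mean


def text_stats(text):
    # Return (number of characters, number of words, average word length).
    words = text.split()
    return len(text), len(words), mean(len(w) for w in words)


# Hypothetical sample headlines, written in the style of the dataset.
headlines = ['council approves new water plan',
             'police investigate city fire',
             'farmers welcome rain']

for text in headlines:
    n_chars, n_words, avg_word_len = text_stats(text)
    print(n_chars, n_words, round(avg_word_len, 2), text)
```

The pandas one-liners in the rest of this section compute the same numbers for every headline and plot their distribution.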

First, I’ll take a look at the number of characters present in each sentence. This can give us a rough idea about the news headline length.

news['headline_text'].str.len().hist()


The histogram shows that news headlines range from 10 to 70 characters, with most of them between 25 and 55 characters.

Now, we will move on to data exploration at a word-level. Let’s plot the number of words appearing in each news headline.

news['headline_text'].str.split().\
    map(lambda x: len(x)).\
    hist()


It is clear that the number of words in news headlines ranges from 2 to 12 and mostly falls between 5 and 7 words.

Up next, let’s check the average word length in each sentence.

import numpy as np
news['headline_text'].str.split().\
   apply(lambda x : [len(i) for i in x]). \
   map(lambda x: np.mean(x)).hist()


The average word length ranges from 3 to 9, with 5 being the most common length. Does it mean that people are using really short words in news headlines?

Let’s find out.

One reason why this may not be true is stopwords. Stopwords are the words that are most commonly used in any language, such as “the”, “a”, “an”, etc. As these words are usually short, they may have pulled the distribution in the graph above toward smaller values.
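A quick way to see this effect is to compute the average word length of a single headline with and without stopwords. The sketch below uses a made-up headline and a tiny hand-picked stopword set (an assumption for illustration, the NLTK list is much longer).

```python
from statistics import mean

# A tiny hand-picked stopword set (an assumption; NLTK's list is much longer).
toy_stopwords = {'the', 'a', 'an', 'to', 'in', 'for', 'of', 'on'}

# Hypothetical headline, made up for illustration.
headline = 'man charged for attack on officer in the city'
words = headline.split()
content_words = [w for w in words if w not in toy_stopwords]

avg_all = mean(len(w) for w in words)              # average over all words
avg_content = mean(len(w) for w in content_words)  # average without stopwords
print(round(avg_all, 2), round(avg_content, 2))
```

Dropping the stopwords raises the average, which is why it makes sense to look at them separately.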

Analyzing the amount and the types of stopwords can give us some good insights into the data.

To get the corpus containing stopwords you can use the nltk library. NLTK contains stopwords for many languages. Since we are only dealing with English news, I will take the English stopwords from the corpus.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop=set(stopwords.words('english'))

Now, we’ll create the corpus.

corpus=[]
new= news['headline_text'].str.split()
new=new.values.tolist()
corpus=[word for i in new for word in i]
from collections import defaultdict
dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word]+=1

and plot top stopwords.
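The plotting code is not shown here, so below is one possible sketch of this step, not necessarily the code that produced the original chart. The helper names top_stopwords and plot_top and the toy inputs are my own assumptions. In the article's setting you would pass corpus and stop instead.

```python
from collections import Counter


def top_stopwords(words, stop, k=10):
    # Count only the words that are in the stopword set, most frequent first.
    counts = Counter(w for w in words if w in stop)
    return counts.most_common(k)


def plot_top(pairs):
    # matplotlib is imported here so the counting above runs without it.
    import matplotlib.pyplot as plt
    labels, values = zip(*pairs)
    plt.bar(labels, values)
    plt.show()


# Hypothetical toy inputs; in the article these are `corpus` and `stop`.
toy_corpus = 'police to probe fire in city to appeal for calm in town to'.split()
toy_stop = {'to', 'in', 'for'}
top = top_stopwords(toy_corpus, toy_stop, k=3)
print(top)
```

Calling plot_top(top) draws the bar chart of the most frequent stopwords.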


We can evidently see that stopwords such as “to”, “in” and “for” dominate in news headlines.

So now we know which stopwords occur frequently in our text, let’s inspect which words other than these stopwords occur frequently.

We will use the Counter class from the collections module to count the occurrences of each word and store them in a list of tuples. This is a very useful tool when we deal with word-level analysis in natural language processing.

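from collections import Counter
import seaborn as sns

counter=Counter(corpus)
most=counter.most_common()
x, y= [], []
# keep the non-stopwords among the 40 most frequent words
for word,count in most[:40]:
    if (word not in stop):
        x.append(word)
        y.append(count)

sns.barplot(x=y,y=x)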


Wow! The words “us”, “Iraq” and “war” dominate the headlines in our sample from early 2003.

Here “us” could mean either the USA or us (you and me). “us” is not a stopword, but the other words in the graph are all related to the US-Iraq war, so “us” here probably indicates the USA.

Ngram exploration

Ngrams are simply contiguous sequences of n words. For example “river bank”, “The three musketeers”, etc.
If the number of words is two, it is called a bigram. For 3 words it is called a trigram and so on.

Looking at most frequent n-grams can give you a better understanding of the context in which the word was used.

To implement n-grams we will use ngrams function from nltk.util. For example:

from nltk.util import ngrams
list(ngrams(['I' ,'went','to','the','river','bank'],2))
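If you are curious what happens under the hood, the same sliding window can be sketched in plain Python. This is my own illustration of the idea, not the nltk implementation, and make_ngrams is a hypothetical helper name.

```python
def make_ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['I', 'went', 'to', 'the', 'river', 'bank']
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A list of 6 tokens gives 5 bigrams and 4 trigrams, each one a tuple of neighbouring words.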

Now that we know how to create n-grams, let’s visualize them.

To build a representation of our vocabulary we will use CountVectorizer. CountVectorizer is a simple method used to tokenize, vectorize and represent the corpus in an appropriate form. It is available in sklearn.feature_extraction.text.

So with all this, we will analyze the top bigrams in our news headlines.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=2):
    # count every n-gram of exactly n words in the corpus
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    # most frequent n-grams first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_n_bigrams=get_top_ngram(news['headline_text'],2)
x,y=map(list,zip(*top_n_bigrams))
sns.barplot(x=y,y=x)


We can observe that bigrams related to war, such as ‘anti war’ and ‘killed in’, dominate the news headlines.

How about trigrams?

top_tri_grams=get_top_ngram(news['headline_text'],n=3)
x,y=map(list,zip(*top_tri_grams))
sns.barplot(x=y,y=x)


We can see that many of these trigrams are variations of “to face court” and “anti war protest”. It means that we should put some effort into data cleaning and see if we can combine those synonymous terms into one clean token.

Topic Modeling exploration with pyLDAvis

Topic modeling is the process of using unsupervised learning techniques to extract the main topics that occur in a collection of documents.

Latent Dirichlet Allocation (LDA) is an easy to use and efficient model for topic modeling. Each document is represented by the distribution of topics and each topic is represented by the distribution of words.

Once we categorize our documents in topics we can dig into further data exploration for each topic or topic group.

But before getting into topic modeling we have to pre-process our data a little. We will:

tokenize: the process by which sentences are converted to a list of tokens or words.
remove stopwords
lemmatize: reduces the inflectional forms of each word into a common base or root.
convert to the bag of words: Bag of words is a dictionary where the keys are words(or ngrams/tokens) and values are the number of times each word occurs in the corpus.
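To make the last step concrete, here is a minimal standard-library sketch of a bag of words built with Counter on two hypothetical, already cleaned headlines. Gensim, which we use below, stores the same information as (token id, count) pairs instead of a dictionary.

```python
from collections import Counter

# Hypothetical, already tokenized and cleaned headlines.
docs = [['iraq', 'war', 'protest'],
        ['war', 'protest', 'march', 'protest']]

# One bag of words per document: token -> number of occurrences.
bags = [Counter(doc) for doc in docs]
# Bag of words for the whole corpus.
corpus_bag = sum(bags, Counter())
print(bags[1])
print(corpus_bag)
```

Word order is thrown away on purpose: only the counts are kept.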

With NLTK you can tokenize and lemmatize easily:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
def preprocess_news(df):
    corpus=[]
    lem=WordNetLemmatizer()
    for headline in df['headline_text']:
        # tokenize and drop stopwords
        words=[w for w in word_tokenize(headline) if (w not in stop)]
        # lemmatize, skipping very short tokens
        words=[lem.lemmatize(w) for w in words if len(w)>2]
        corpus.append(words)
    return corpus
corpus=preprocess_news(news)

Now, let’s create the bag of words model using gensim

import gensim
dic=gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]

and we can finally create the LDA model:

lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 4, 
                                   id2word = dic,                                    
                                   passes = 10,
                                   workers = 2)
lda_model.show_topics()

The topic 0 indicates something related to the Iraq war and police. Topic 3 shows the involvement of Australia in the Iraq war.

You can print all the topics and try to make sense of them but there are tools that can help you run this data exploration more efficiently. One such tool is pyLDAvis which visualizes the results of LDA interactively.

import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.3 and later
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
vis


On the left side, the area of each circle represents the importance of the topic relative to the corpus. As there are four topics, we have four circles.

The distance between the centers of the circles indicates the similarity between the topics. Here you can see that topic 3 and topic 4 overlap, which indicates that these topics are more similar.
On the right side, the bar chart of each topic shows the top 30 relevant words. For example, in topic 1 the most relevant words are police, new, may, war, etc.

So in our case, we can see a lot of words and topics associated with war in the news headlines.

Wordcloud

Wordcloud is a great way to represent text data. The size and color of each word that appears in the wordcloud indicate its frequency or importance.

Creating a wordcloud in Python is easy, but we need the data in the form of a corpus. Luckily, I prepared it in the previous section.

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
def show_wordcloud(data):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=100,
        max_font_size=30,
        scale=3,
        random_state=1)

    wordcloud=wordcloud.generate(str(data))
    plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()
show_wordcloud(corpus)


Again, you can see that the terms associated with the war are highlighted which indicates that these words occurred frequently in the news headlines.

There are many parameters that can be adjusted. Some of the most prominent ones are:

stopwords: The set of words that are blocked from appearing in the image.
max_words: Indicates the maximum number of words to be displayed.
max_font_size: maximum font size.

There are many more options to create beautiful word clouds. For more details, you can refer here.

Sentiment analysis

Sentiment analysis is a very common natural language processing task in which we determine whether the text is positive, negative or neutral. This is very useful for finding the sentiment associated with reviews and comments, which can give us valuable insights from text data.

There are many projects that will help you do sentiment analysis in python. I personally like TextBlob and Vader Sentiment.

Textblob

Textblob is a python library built on top of nltk. It has been around for some time and is very easy and convenient to use.
The sentiment property of TextBlob returns two values:

polarity: is a floating-point number that lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement.
subjectivity: refers to how someone’s judgment is shaped by personal opinions and feelings. Subjectivity is represented as a floating-point value which lies in the range of [0,1].

I will run this function on our news headlines.

from textblob import TextBlob
TextBlob('100 people killed in Iraq').sentiment

TextBlob claims that the text “100 people killed in Iraq” is negative and is not an opinion or feeling but rather a factual statement. I think we can agree with TextBlob here.

Now that we know how to calculate those sentiment scores we can visualize them using a histogram and explore data even further.

def polarity(text):
    return TextBlob(text).sentiment.polarity
news['polarity_score']=news['headline_text'].\
   apply(lambda x : polarity(x))
news['polarity_score'].hist()


You can see that the polarity mainly ranges between 0.00 and 0.20. This indicates that the majority of the news headlines are neutral.

Let’s dig a bit deeper by classifying the news as negative, positive and neutral based on the scores.

def sentiment(x):
    if x<0:
        return 'neg'
    elif x==0:
        return 'neu'
    else:
        return 'pos'

news['polarity']=news['polarity_score'].\
   map(lambda x: sentiment(x))
plt.bar(news.polarity.value_counts().index,
        news.polarity.value_counts())


Yep, 70% of news is neutral with only 18% positive and 11% negative.

Let’s take a look at some of the positive and negative headlines.

news[news['polarity']=='pos']['headline_text'].head()

Positive news headlines are mostly about some victory in sports.

news[news['polarity']=='neg']['headline_text'].head()

Yep, pretty negative news headlines indeed.

Vader Sentiment Analysis

The next library we are going to discuss is VADER. Vader works better in detecting negative sentiment. It is very useful in the case of social media text sentiment analysis.

VADER, or Valence Aware Dictionary and Sentiment Reasoner, is a rule/lexicon-based, open-source sentiment analysis library released under the MIT license.

VADER's sentiment analyzer returns a dictionary with the negative, neutral and positive scores of the text, plus an overall compound score. We can then pick the sentiment with the highest score.

We will do the same analysis using VADER and check if there is much difference.

import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
def get_vader_score(sent):
    # polarity_scores returns a dictionary with neg, neu, pos and compound
    ss = sid.polarity_scores(sent)
    # drop the compound score and return the index of the largest of neg/neu/pos
    return np.argmax(list(ss.values())[:-1])
news['polarity']=news['headline_text'].\
    map(lambda x: get_vader_score(x))
polarity=news['polarity'].replace({0:'neg',1:'neu',2:'pos'})
plt.bar(polarity.value_counts().index,
        polarity.value_counts())


Yep, there is a slight difference in distribution. Even more headlines are classified as neutral (85%) and the share of negative news headlines has increased (to 13%).
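Picking the largest of the three scores tends to favour the neutral class, because the neutral share is usually the biggest one. An alternative that is commonly suggested for VADER is to threshold the compound score at plus or minus 0.05. The sketch below shows that rule. It is not what produced the chart above, label_from_compound is a hypothetical helper name, and the example scores are made up.

```python
def label_from_compound(scores, threshold=0.05):
    # `scores` has the shape VADER returns: neg, neu, pos and compound.
    compound = scores['compound']
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


# Hypothetical score dictionary, made up for illustration.
example = {'neg': 0.45, 'neu': 0.55, 'pos': 0.0, 'compound': -0.67}
print(label_from_compound(example))
```

Here the argmax rule would return neutral, while the compound rule labels the same text as negative.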

Named Entity Recognition

Named entity recognition is an information extraction method in which entities that are present in the text are classified into predefined entity types like “Person”, “Place”, “Organization”, etc. By using NER we can get great insights about the types of entities present in the given text dataset.

Let us consider an example of a news article.

In the above news, the named entity recognition model should be able to identify entities such as RBI as an organization, Mumbai and India as Places, etc.

There are three standard libraries to do Named Entity Recognition:

In this tutorial, I will use spaCy which is an open-source library for advanced natural language processing tasks. It is written in Cython and is known for its industrial applications. Besides NER, spaCy provides many other functionalities like pos tagging, word to vector transformation, etc.

SpaCy’s named entity recognition has been trained on the OntoNotes 5 corpus and it supports the following entity types:

There are three pre-trained models for English in spaCy. I will use en_core_web_sm for our task but you can try other models.

To use it we have to download it first:

python -m spacy download en_core_web_sm

Now we can initialize the language model:

import spacy
nlp = spacy.load("en_core_web_sm")

One of the nice things about spaCy is that we only need to call the nlp function once, and the entire background pipeline will return the objects we need.

doc=nlp('India and Iran have agreed to boost the economic viability \
of the strategic Chabahar port through various measures, \
including larger subsidies to merchant shipping firms using the facility, \
people familiar with the development said on Thursday.')
[(x.text,x.label_) for x in doc.ents]

We can see that India and Iran are recognized as Geographical locations (GPE), Chabahar as Person and Thursday as Date.

We can also visualize the output using displacy module in spaCy.

from spacy import displacy
displacy.render(doc, style='ent')

This creates a very neat visualization of the sentence with the recognized entities where each entity type is marked in different colors.

Now that we know how to perform NER we can explore the data even further by doing a variety of visualizations on the named entities extracted from our dataset.

First, we will run the named entity recognition on our news headlines and store the entity types.

def ner(text):
    doc=nlp(text)
    return [X.label_ for X in doc.ents]
ent=news['headline_text'].\
    apply(lambda x : ner(x))
ent=[x for sub in ent for x in sub]
counter=Counter(ent)
count=counter.most_common()

Now, we can visualize the entity frequencies:

x,y=map(list,zip(*count))
sns.barplot(x=y,y=x)


Now we can see that the GPE and ORG dominate the news headlines followed by the PERSON entity.

We can also visualize the most common tokens per entity. Let’s check which places appear the most in news headlines.

def ner(text,ent="GPE"):
    doc=nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]
gpe=news['headline_text'].apply(lambda x: ner(x))
gpe=[i for x in gpe for i in x]
counter=Counter(gpe)
x,y=map(list,zip(*counter.most_common(10)))
sns.barplot(x=y,y=x)


I think we can confirm the fact that the “us” means the USA in news headlines. Let’s also find the most common names that appeared in news headlines.

per=news['headline_text'].apply(lambda x: ner(x,"PERSON"))
per=[i for x in per for i in x]
counter=Counter(per)
x,y=map(list,zip(*counter.most_common(10)))
sns.barplot(x=y,y=x)


Saddam Hussein and George Bush were the presidents of Iraq and the USA during wartime. Also, we can see that the model is far from perfect, classifying “vic govt” or “nsw govt” as a person rather than a government agency.

Exploration through Parts of Speech Tagging in Python

Parts of speech (POS) tagging is a method that assigns part of speech labels to words in a sentence. There are eight main parts of speech:

Noun (NN)- Joseph, London, table, cat, teacher, pen, city
Verb (VB)- read, speak, run, eat, play, live, walk, have, like, are, is
Adjective(JJ)- beautiful, happy, sad, young, fun, three
Adverb(RB)- slowly, quietly, very, always, never, too, well, tomorrow
Preposition (IN)- at, on, in, from, with, near, between, about, under
Conjunction (CC)- and, or, but, because, so, yet, unless, since, if
Pronoun(PRP)- I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection (UH)- Ouch! Wow! Great! Help! Oh! Hey! Hi!

This is not a straightforward task, as the same word may be used in different sentences in different contexts. However, once you do it, there are a lot of helpful visualizations that you can create that can give you additional insights into your dataset.

I will use the nltk to do the parts of speech tagging but there are other libraries that do a good job (spacy, textblob).

Let’s look at an example.

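import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by nltk.pos_tag
sentence="The greatest comeback stories in 2019"
tokens=word_tokenize(sentence)
nltk.pos_tag(tokens)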

Note

You can also visualize the sentence parts of speech and its dependency graph with spacy.displacy module.

doc = nlp('The greatest comeback stories in 2019')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

We can observe various dependency tags here. For example, DET tag denotes the relationship between the determiner “the” and the noun “stories”.

You can check the list of dependency tags and their meanings here.

Ok, now that we know what POS tagging is, let’s use it to explore our headlines dataset.

def pos(text):
    pos=nltk.pos_tag(word_tokenize(text))
    pos=list(map(list,zip(*pos)))[1]
    return pos
tags=news['headline_text'].apply(lambda x : pos(x))
tags=[x for l in tags for x in l]
counter=Counter(tags)
x,y=list(map(list,zip(*counter.most_common(7))))
sns.barplot(x=y,y=x)


We can clearly see that the noun (NN) dominates in news headlines followed by the adjective (JJ). This is typical for news articles, while in more artistic forms a higher adjective (JJ) frequency is quite common.

You can dig deeper into this by investigating which singular nouns occur most commonly in news headlines. Let us find out.

def get_nouns(text):
    # collect the words tagged as singular nouns (NN)
    nouns=[]
    pos=nltk.pos_tag(word_tokenize(text))
    for word,tag in pos:
        if tag=='NN':
            nouns.append(word)
    return nouns
words=news['headline_text'].apply(lambda x : get_nouns(x))
words=[x for l in words for x in l]
counter=Counter(words)
x,y=list(map(list,zip(*counter.most_common(7))))
sns.barplot(x=y,y=x)


Nouns such as “war”, “iraq”, “man” dominate in the news headlines. You can visualize and examine other parts of speech using the above function.

Exploring through text complexity

It can be very informative to know how readable (or how difficult to read) the text is and what type of reader can fully understand it. Do we need a college degree to understand the message, or can a first-grader clearly see what the point is?

You can actually put a number called readability index on a document or text. Readability index is a numeric value that indicates how difficult (or easy) it is to read and understand a text.

There are many readability score formulas available for the English language. Some of the most prominent ones are:

Textstat is a cool Python library that provides an implementation of all these text statistics calculation methods. Let’s use Textstat to implement Flesch Reading Ease index.
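For intuition, the standard Flesch Reading Ease formula combines the average sentence length with the average number of syllables per word. Below is a small sketch that computes it from raw counts. The function name and the example counts are hypothetical, and Textstat has to estimate syllables and sentences from the text itself, so its scores can differ slightly from a hand calculation.

```python
def flesch_reading_ease_from_counts(n_words, n_sentences, n_syllables):
    # Standard Flesch Reading Ease formula; higher means easier to read.
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a short text: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease_from_counts(100, 5, 150)
print(round(score, 3))
```

Longer sentences and longer words both push the score down, which is why headlines full of long words score low.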

Now, you can plot a histogram of the scores and visualize the output.

from textstat import flesch_reading_ease
# keep the scores so that we can filter on them later
reading = news['headline_text'].\
   apply(lambda x : flesch_reading_ease(x))
reading.hist()


Almost all of the readability scores fall above 60. This means that an average 11-year-old student can read and understand the news headlines. Let’s check all news headlines that have a readability score below 5.

x=[i for i in range(len(reading)) if reading[i]<5]
news.iloc[x]['headline_text'].head()

You can see some of the complex words being used in news headlines like “capitulation”, “interim”, “entrapment”, etc. These words may have caused the scores to fall under 5.

Final Thoughts

In this article, we discussed and implemented various exploratory data analysis methods for text data. Some common, some lesser-known but all of them could be a great addition to your data exploration toolkit.

Hopefully, you will find some of them useful in your current and future projects.

To make data exploration even easier, I have created an “Exploratory Data Analysis for Natural Language Processing Template” that you can use for your work.



Happy exploring!

Optuna vs Hyperopt: Which Hyperparameter Optimization Library Should You Choose?

Jakub Czakon — Mon, 13 Jan 2020 10:39:50 +0000

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Wondering which library you should choose for hyperparameter optimization?

Been using Hyperopt for a while and feel like changing?

Just heard about Optuna and you want to see how it works?

Good!

In this article I will:

show you an example of using Optuna and Hyperopt on a real problem,
compare Optuna vs Hyperopt on API, documentation, functionality, and more,
give you my overall score and recommendation on which hyperparameter optimization library you should use.
Let’s do it.

Evaluation criteria

Ease of use and API
Options, methods, and hyper(hyperparameters)

Search Space
Optimization Methods
Callbacks
Persisting and Restarting
Run Pruning
Handling Exceptions

Documentation
Visualizations
Speed and Parallelization
Experimental Results

Ease of use and API

In this section I want to see how to run a basic hyperparameter tuning script with both libraries, how natural and easy to use each one is, and what the API looks like.

Optuna

You define your search space and objective in one function.

Moreover, you sample the hyperparameters from the trial object. Because of that, the parameter space is defined at execution. For those of you who like Pytorch because of this imperative approach, Optuna will feel natural.

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 1000),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_uniform('subsample', 0.1, 1.0)}
    return train_evaluate(params)

Then, you create the study object and optimize it. What is great is that you can choose whether you want to maximize or minimize your objective. That is useful when optimizing a metric like AUC because you don’t have to change the sign of the objective before training and then convert best results after training to get a positive score.

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

That is it.

Everything you may want to know about the optimization is available in the study object.

What I love about Optuna is that I get to define how I want to sample my search space on-the-fly which gives me a lot of flexibility. Ability to choose a direction of optimization is also pretty nice.

If you want to see the full code example you can scroll down to the Example script.

10 / 10

Hyperopt

You start by defining your parameter search space:

import numpy as np
from hyperopt import hp

SPACE = {'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.5)),
         'max_depth': hp.choice('max_depth', range(1, 30, 1)),
         'num_leaves': hp.choice('num_leaves', range(2, 100, 1)),
         'subsample': hp.uniform('subsample', 0.1, 1.0)}
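Notice that hp.loguniform takes the bounds in log space, which is why the learning rate limits are wrapped in np.log. The sketch below shows, with the standard library only, what such a log-uniform draw looks like. sample_loguniform is a hypothetical helper written for illustration, not a Hyperopt function.

```python
import math
import random


def sample_loguniform(low, high, rng):
    # Sample uniformly in log space, then map back with exp.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(0.01, 0.5, rng) for _ in range(1000)]
# Roughly half of the draws land below the geometric mean of the bounds.
geometric_mean = math.sqrt(0.01 * 0.5)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(min(samples), max(samples), share_below)
```

Compared to a plain uniform draw, this spends as many trials between 0.01 and 0.07 as between 0.07 and 0.5, which is usually what you want for a learning rate.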

Then, you create an objective function that you want to minimize. That means you will have to flip the sign of your objective for the-higher-the-better metric like AUC.

def objective(params):
    return -1.0 * train_evaluate(params)

Finally, you instantiate the Trials() object and minimize your objective on the parameter search SPACE.

from hyperopt import fmin, tpe, Trials

trials = Trials()
_ = fmin(objective, SPACE, trials=trials, algo=tpe.suggest, max_evals=100)

…and done!

All the information about the hyperparameters that were tested and the corresponding score are kept in the trials object.

The thing that I don’t like is the fact that I need to instantiate the Trials() even in the simplest of cases. I would rather have fmin return the trials and do the instantiation by default.

9 / 10

Both libraries do a good job here but I feel that Optuna is slightly better because of the flexibility, imperative approach to sampling parameters and a bit less boilerplate.

Ease of use and API
Optuna > Hyperopt

Jump back to the evaluation criteria ->

Options, methods, and hyper(hyperparameters)

In real-life scenarios running hyperparameter optimization requires a lot of additional options away from the golden path. Areas that I am particularly interested in are:

search space
optimization methods/algorithms
callbacks
persisting and restarting parameter sweeps
pruning unpromising runs
handling exceptions

In this section, I will compare Optuna and Hyperopt on exactly those.

Search Space

In this section I want to compare the search space definition, flexibility in defining a complex space and sampling options for each parameter type (Float, Integer, Categorical).

Optuna

You can find sampling options for all hyperparameter types:

for categorical parameters you can use trial.suggest_categorical
for integers there is trial.suggest_int
for float parameters you have trial.suggest_uniform, trial.suggest_loguniform and even, more exotic, trial.suggest_discrete_uniform

Especially for the integer parameters, you could wish for more options but it deals with most use-cases.

A great feature of this library is that you sample from the parameter space on-the-fly and you can do it however you like.

You can use if statements, you can change intervals from which you search, you can use the information from the trial object to guide your search.

def objective(trial):
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        classifier_obj = sklearn.svm.SVC(C=svc_c)
    else:
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        classifier_obj = sklearn.ensemble.RandomForestClassifier(max_depth=rf_max_depth)

    ...

This is awesome, you can do literally anything!

10 / 10

Hyperopt

Search space is where Hyperopt really gives you a ton of sampling options:

for categorical parameters you have hp.choice
for integers you get hp.randint, hp.quniform, hp.qloguniform and hp.qlognormal
for floats we have hp.normal, hp.uniform, hp.lognormal and hp.loguniform

As far as I know this is the most extensive sampling functionality out there.
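One detail that is easy to trip over: hp.loguniform takes its bounds in log space (hence the np.log(0.01) and np.log(0.5) in the SPACE example earlier), while Optuna's suggest_loguniform takes the raw bounds. Here is a minimal, library-free sketch of what log-uniform and quantized sampling do. The helper names loguniform and quniform are mine and only illustrate the idea, they are not the libraries' implementations.

```python
import math
import random


def loguniform(rng, low, high):
    """Sample so that log(value) is uniform on [log(low), log(high)]; bounds are raw values."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def quniform(rng, low, high, q):
    """Sample uniformly on [low, high], then round to the nearest multiple of q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(0)
learning_rates = [loguniform(rng, 0.01, 0.5) for _ in range(1000)]
depths = [quniform(rng, 1, 30, 1) for _ in range(1000)]

# Roughly half of the log-uniform samples fall below the geometric mean of the bounds,
# which is why this distribution suits parameters that span orders of magnitude.
geometric_mean = math.sqrt(0.01 * 0.5)
share_below = sum(lr < geometric_mean for lr in learning_rates) / len(learning_rates)
print(f"share below {geometric_mean:.3f}: {share_below:.2f}")
```

With a plain uniform distribution on the same interval, only about 12% of the samples would land below that point, so small learning rates would barely get explored.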

You define your search space before you run optimization but you can create very complex parameter spaces:

SPACE = hp.choice('classifier_type', [
    {
        'type': 'naive_bayes',
    },
    {
        'type': 'svm',
        'C': hp.lognormal('svm_C', 0, 1),
        'kernel': hp.choice('svm_kernel', [
            {'ktype': 'linear'},
            {'ktype': 'RBF', 'width': hp.lognormal('svm_rbf_width', 0, 1)},
            ]),
    },
    {
        'type': 'dtree',
        'criterion': hp.choice('dtree_criterion', ['gini', 'entropy']),
        'max_depth': hp.choice('dtree_max_depth',
            [None, hp.qlognormal('dtree_max_depth_int', 3, 1, 1)]),
        'min_samples_split': hp.qlognormal('dtree_min_samples_split', 2, 1, 1),
    },
    ])

By combining hp.choice with other sampling methods we can have conditional spaces. This is useful when you are optimizing hyperparameters for a machine learning pipeline that involves preprocessing, feature engineering and model training.

10 / 10

I have to say I like them both. I can define nested search spaces easily and I have a lot of sampling options for all the parameter types. Optuna has an imperative parameter definition, which gives more flexibility while Hyperopt has more parameter sampling options.

Search Space
Optuna = Hyperopt

Jump back to the evaluation criteria ->

Optimization methods

Both Optuna and Hyperopt are using the same optimization methods under the hood. They have:

rand.suggest (Hyperopt) and samplers.random.RandomSampler (Optuna)

Your standard random search over the parameters.

tpe.suggest (Hyperopt) and samplers.tpe.sampler.TPESampler (Optuna)

Tree of Parzen Estimators (TPE). The idea behind this method is similar to what was explained in the previous blog post about Scikit Optimize. We use a cheap surrogate model to estimate the performance of the expensive objective function on a set of parameters.

The difference between the methods used in Scikit Optimize and Tree of Parzen Estimators (TPE) is that instead of estimating the actual performance (point estimation) we want to estimate the density in the tails. We want to be able to tell whether a run will be good (right tail) or bad (left tail).

I like the following explanation taken from the AutoML_Book by amazing folks over at AutoML.org Freiburg.

Instead of modeling the probability p(y|λ) of observations y given the configurations λ, the Tree Parzen Estimator models density functions p(λ|y < α) and p(λ|y ≥ α). Given a percentile α (usually set to 15%), the observations are divided in good observations and bad observations and simple 1-d Parzen windows are used to model the two distributions.

By using p(λ|y < α) and p(λ|y ≥ α) you can estimate the expected improvement of a parameter configuration over previous best.
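To make the good/bad split more concrete, here is a toy, one-dimensional sketch of the idea. It is a simplification under my own assumptions (fixed kernel bandwidth, a single float parameter, lower loss is better, at least a handful of past observations) and not the code that Optuna or Hyperopt actually run. The names parzen_density, tpe_suggest and toy_loss are made up for this illustration.

```python
import math
import random


def parzen_density(points, bandwidth):
    """Build a 1-d Parzen-window (Gaussian kernel) density estimate from points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_suggest(history, rng, low, high, alpha=0.15, n_candidates=24, bandwidth=0.1):
    """Suggest the next value to try, given (value, loss) pairs; lower loss is better."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(alpha * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # the best alpha share of observations
    bad = [x for x, _ in ranked[n_good:]]    # everything else
    l, g = parzen_density(good, bandwidth), parzen_density(bad, bandwidth)

    # Draw candidates from l (a random good point plus kernel noise), clipped to the bounds.
    candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                  for _ in range(n_candidates)]
    # Keep the candidate that is most likely under "good" relative to "bad".
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


def toy_loss(x):
    return (x - 0.7) ** 2  # hypothetical objective with its minimum at x = 0.7


rng = random.Random(0)
history = [(x, toy_loss(x)) for x in (rng.uniform(0, 1) for _ in range(20))]
for _ in range(30):
    x = tpe_suggest(history, rng, low=0.0, high=1.0)
    history.append((x, toy_loss(x)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, loss = {best_loss:.5f}")
```

Maximizing the ratio l(x) / g(x) is what stands in for the expected improvement here: candidates that look like past good runs and unlike past bad runs get picked first.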

Interestingly, both for Optuna and Hyperopt, there are no options to specify the α parameter in the optimizer.

Optuna

integration.SkoptSampler

Optuna lets you use samplers from Scikit-Optimize (skopt).

Skopt offers a bunch of Tree-Based methods as a choice for your surrogate model.

In order to use them you need to:

create a SkoptSampler instance specifying the parameters of the surrogate model and acquisition function in the skopt_kwargs argument,
pass the sampler instance to the optuna.create_study method

from optuna.integration import SkoptSampler

sampler = SkoptSampler(skopt_kwargs={'base_estimator': 'RF',  # 'ET' (extra trees) is another tree-based option
                                     'n_random_starts': 10,
                                     'acq_func': 'EI',
                                     'acq_func_kwargs': {'xi': 0.02}})
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)

pruners.SuccessiveHalvingPruner

You can also use one of the multiarmed bandit methods called Asynchronous Successive Halving Algorithm (ASHA). If you are interested in the details please read the paper but the general idea is to:

run a bunch of parameter configurations for some time
prune (half of) the least promising runs
run the surviving parameter configurations for some more time
prune (half of) the least promising runs again
stop when only one configuration is left

By doing so, the search can focus on the more promising runs. However, the static allocation of the budgets to configurations is a problem in practice (which a newer approach called HyperBand solves).
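The loop described above can be sketched in a few lines of plain Python. Keep in mind that this is the simple synchronous version of successive halving, not the asynchronous ASHA variant, and that toy_evaluate is a hypothetical stand-in for "train this configuration for a given budget and return the validation score".

```python
import random


def successive_halving(configs, evaluate, min_budget=1, reduction=2):
    """Evaluate survivors on a growing budget, keeping the top 1/reduction each round."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda config: evaluate(config, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // reduction)]
        budget *= reduction  # the remaining runs get more training time
    return survivors[0]


def toy_evaluate(config, budget):
    """Hypothetical score: best near learning_rate=0.1, improving as the budget grows."""
    return (1 - abs(config['learning_rate'] - 0.1)) * (1 - 0.5 ** budget)


rng = random.Random(0)
configs = [{'learning_rate': rng.uniform(0.01, 0.5)} for _ in range(16)]
winner = successive_halving(configs, toy_evaluate)
print(winner)
```

With 16 starting configurations the rounds go 16, 8, 4, 2, 1, with the budget doubling each time. In this toy the ranking never changes between rounds. With real, noisy training curves a slow starter can get pruned too early, which is the price you pay for the speedup.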

It is very easy to use ASHA in Optuna. Just pass a SuccessiveHalvingPruner to .create_study() and you are good to go:

from optuna.pruners import SuccessiveHalvingPruner

study = optuna.create_study(pruner=SuccessiveHalvingPruner())
study.optimize(objective, n_trials=100)

Nice and simple.

If you would like to learn more, you may want to check out my article about Scikit Optimize.

Overall, there are a lot of options when it comes to optimization functions right now. However, some important ones, like Hyperband or BOHB, are missing.

8 / 10

Hyperopt

atpe.suggest

Recently added, adaptive TPE was invented at ElectricBrain and it is actually a series of (not so) little improvements that they experimented with on top of TPE.

The authors explain their approach and modifications they made to TPE thoroughly in this fascinating blog post.

It is super easy to use. Instead of tpe.suggest you need to pass atpe.suggest to your fmin function.

from hyperopt import fmin, atpe

best = fmin(objective, SPACE, 
            max_evals=100, 
            algo=atpe.suggest)

I really like this effort to include new optimization algorithms in the library, especially since it’s a new original approach not just an integration with the existing algorithm.

Hopefully, in the future, multi-armed bandit methods like Hyperband, BOHB, or tree-based methods like SMAC3 will be included as well.

8 / 10

Optimization methods
Optuna = Hyperopt

Jump back to the evaluation criteria ->

Callbacks

In this section, I want to see how easy it is to define callbacks to monitor/snapshot/modify training after each iteration. It is useful, especially when your training is long and/or distributed.

Optuna

User callbacks are nicely supported with the callbacks argument of the .optimize() method. Just pass a list of callables that take study and trial as input and you are good to go.

def neptune_monitor(study, trial):
    neptune.log_metric('run_score', trial.value)
    neptune.log_text('run_parameters', str(trial.params))
...
study.optimize(objective, n_trials=100, callbacks=[neptune_monitor])

Because you can access both study and trial you have all the flexibility you can possibly want to checkpoint, do early stopping or modify future search.

10 / 10

Hyperopt

There are no callbacks per se, but you can put your callback function inside the objective and it will be executed every time the objective is called.

def monitor_callback(params, score):
    neptune.send_metric('run_score', score)
    neptune.send_text('run_parameters', str(params))

def objective(params):
    score = -1.0 * train_evaluate(params) 
    monitor_callback(params, score)
    return score

I don’t love it but I guess I can live with that.
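If you would rather keep the objective itself untouched, one option is a small wrapper that calls your callbacks after every evaluation. This is a generic Python pattern, not a Hyperopt feature, and the names with_callbacks, record and toy_objective are made up for this sketch.

```python
def with_callbacks(objective, callbacks):
    """Wrap an objective so every callback receives (params, score) after each evaluation."""
    def wrapped(params):
        score = objective(params)
        for callback in callbacks:
            callback(params, score)
        return score
    return wrapped


history = []


def record(params, score):
    """Example callback: keep a copy of every evaluated configuration and its score."""
    history.append((dict(params), score))


def toy_objective(params):
    return (params['x'] - 3) ** 2  # hypothetical loss with its minimum at x = 3


monitored = with_callbacks(toy_objective, [record])
for x in range(6):
    monitored({'x': x})

best_params, best_score = min(history, key=lambda item: item[1])
print(best_params, best_score)  # prints {'x': 3} 0
```

The wrapped function takes the same params argument as the original objective, so it should be usable wherever the plain objective was passed before.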

6 / 10

Optuna makes it really easy with the callbacks argument while in Hyperopt you have to modify the objective.

Callbacks
Optuna > Hyperopt

Note:

If you want to monitor your Optuna experiments and log all the charts, visualizations, and results you can use Neptune helpers:

opt_utils.NeptuneMonitor: logs run scores and run parameters and plots the scores so far
opt_utils.log_study: logs best results, best param, and the study object itself

Just add this to your script:

import neptune
import neptunecontrib.monitoring.optuna as opt_utils

neptune.init('jakub-czakon/blog-hpo')
neptune.create_experiment(name='optuna sweep')

monitor = opt_utils.NeptuneMonitor()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, callbacks=[monitor])
opt_utils.log_study(study)

Persisting and restarting

Saving and loading your hyperparameter searches can save you time, money, and can help get better results. Let’s compare both frameworks on that.

Optuna

Simply use joblib.dump to pickle the study object.

study.optimize(objective, n_trials=100)
joblib.dump(study, 'artifacts/study.pkl')

… and you can load it later with joblib.load to restart your search.

study = joblib.load('../artifacts/study.pkl')
study.optimize(objective, n_trials=200)

That’s it.

For distributed setups, you can use the name of the study and the URL of the database where your distributed study lives to instantiate a new study. For example:

study = optuna.create_study(
                    study_name='example-study', 
                    storage='sqlite:///example.db', 
                    load_if_exists=True)

Nice and easy.

More about running distributed hyperparameter optimization with Optuna in the Speed and Parallelization section.

10 / 10

Hyperopt

Similarly to Optuna, use joblib.dump to pickle the trials object.

trials = Trials()  
_ = fmin(objective, SPACE, trials=trials, 
         algo=tpe.suggest, max_evals=100)
joblib.dump(trials, 'artifacts/hyperopt_trials.pkl')

… load it with joblib.load and restart.

trials = joblib.load('artifacts/hyperopt_trials.pkl')
_ = fmin(objective, SPACE, trials=trials, 
         algo=tpe.suggest, max_evals=200)

Simple and works with no problems.

If you are optimizing hyperparameters in a distributed fashion you can load a MongoTrials() object that connects to MongoDB. More about running distributed hyperparameter optimization with Hyperopt in the Speed and Parallelization section.

10 / 10

Both make it easy and get the job done.

Persisting and restarting
Optuna = Hyperopt

Jump back to the evaluation criteria ->

Run Pruning

Not all hyperparameter configurations are created equal. For some of them, you can tell very quickly that they will not produce high scores. Ideally, you would like to stop those runs as soon as possible and try different parameters instead.

Optuna gives you an option to do that with Pruning Callbacks. Many machine learning frameworks are supported:

KerasPruningCallback, TFKerasPruningCallback
TensorFlowPruningHook
PyTorchIgnitePruningHandler, PyTorchLightningPruningCallback
FastAIPruningCallback
LightGBMPruningCallback
XGBoostPruningCallback
and more

You can read about them in the docs.

For example, in the case of lightGBM training you would pass this callback to the lgb.train function.

def train_evaluate(X, y, params, pruning_callback=None):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    callbacks = [pruning_callback] if pruning_callback is not None else None

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data],
                      valid_names=['valid'],
                      callbacks=callbacks)
    score = model.best_score['valid']['auc']
    return score

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 1000),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_uniform('subsample', 0.1, 1.0)}

    pruning_callback = LightGBMPruningCallback(trial, 'auc', 'valid')
    return train_evaluate(X, y, params, pruning_callback)

Only Optuna gives you this option so it is a clear win.

Run Pruning
Optuna > Hyperopt

Jump back to the evaluation criteria ->

Handling Exceptions

If one of your runs fails due to the wrong parameter combination, random training error or some other problem you could lose all the parameter_configuration:score pairs evaluated so far in a study.

You can use callbacks to save this information after every iteration or use a DB to store it as explained in the Speed and Parallelization section.

However, you may want to let this study continue even when the exception happens. To make it possible, Optuna lets you pass the allowed exceptions to the .optimize() method.

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100)}

    print(non_existent_variable)

    return train_evaluate(params)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, catch=(NameError,))

Again, only Optuna supports this.
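If your library of choice has no such option, a rough workaround is to wrap the objective so that a failed run returns a penalty score instead of raising. To be clear, this is a generic pattern and my own assumption about how you might handle it, not an API of either library. The names tolerate_failures and toy_objective are hypothetical.

```python
def tolerate_failures(objective, exceptions, failure_score):
    """Wrap an objective so listed exceptions yield a penalty score instead of stopping the sweep."""
    def wrapped(params):
        try:
            return objective(params)
        except exceptions as error:
            print(f"run failed ({type(error).__name__}: {error}), params={params}")
            return failure_score
    return wrapped


def toy_objective(params):
    """Hypothetical higher-is-better objective that rejects one invalid parameter combination."""
    if params['num_leaves'] > 2 ** params['max_depth']:
        raise ValueError('num_leaves exceeds what max_depth allows')
    return params['num_leaves'] / 100


# 0.0 is assumed to be the worst possible score for this higher-is-better metric.
safe_objective = tolerate_failures(toy_objective, (ValueError,), failure_score=0.0)
grid = [{'max_depth': 2, 'num_leaves': 3},
        {'max_depth': 2, 'num_leaves': 50},   # invalid: 50 > 2 ** 2
        {'max_depth': 6, 'num_leaves': 50}]
scores = [safe_objective(params) for params in grid]
print(scores)  # prints [0.03, 0.0, 0.5]
```

The downside is that the optimizer sees the penalty as a real score, so pick a value that is clearly bad but not so extreme that it distorts the search. If you flip the sign for a minimizer, flip the penalty too.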

Handling Exceptions
Optuna > Hyperopt

Jump back to the evaluation criteria ->

Documentation

When you are a user of a library or a framework it is absolutely crucial to find the information you need when you need it. This is where documentation/support channels come into the picture and they can make or break a library.

Let’s see how Optuna and Hyperopt compare on that.

Optuna

It is really good.

There is a proper webpage that explains all the basic concepts and shows you where to find more information.

Also, there is complete and very easy-to-understand documentation on read-the-docs.

It contains:

Tutorials with both simple and advanced examples
API Reference with all the functions containing beautiful docstrings. To give you an idea imagine having charts inside of your docstrings so that you can understand what is happening inside your function better. Check out the BaseSampler if you don’t believe me.

It is also important to mention that the supporting team from Preferred Networks really takes care of this project. They respond to GitHub issues and the community is growing around it with great feature ideas and PRs coming in. Check out the GitHub project issues section to see what is going on there.

10 / 10

Hyperopt

It was recently updated and now it is quite alright.

You can find it here.

You can easily find information about:

how to get started
how to define both simple and advanced search spaces
how to run the installation
how to run Hyperopt in parallel via MongoDB or Spark

Unfortunately, there were some things that I didn’t like:

missing API reference with the docstrings for all functions/methods
docstrings themselves are missing for most methods/functions, which forces you to read the implementation (there are some positive side effects here:) )
no examples of using Adaptive TPE. I wasn’t sure if I was using it correctly or whether I should specify some additional (hyper)hyperparameters. Missing docstrings didn’t help me here either.
some links in the docs lead to 404 pages.

Overall, it has improved a lot lately, but I was still a bit lost at times. I hope that with time it will get even better so stay tuned.

The good thing is, there are a lot of blog posts about it. Some of them that I found useful are:

“Parameter Tuning with Hyperopt” by District Data Labs
“Hyperopt tutorial for Optimizing Neural Networks Hyperparameters” by Vooban
“On Using Hyperopt: Advanced Machine Learning” by Tanay Agrawal
“An Introductory Example of Bayesian Optimization in Python with Hyperopt” by Will Koehrsen

The documentation is not the strongest side of this project but because it’s a classic there are a lot of resources out there.

6 / 10

Documentation
Optuna > Hyperopt

Jump back to the evaluation criteria ->

Visualizations

Visualizing hyperparameter searches can be very useful. You can gain information on interactions between parameters and see where you should search next.

That is why I want to compare the visualization suites that Optuna and Hyperopt offer.

Optuna

A few great visualizations are available in the optuna.visualization module:

plot_contour: plots parameter interactions on an interactive chart. You can choose which hyperparameters you would like to explore.

plot_contour(study, params=['learning_rate',
                            'max_depth',
                            'num_leaves',
                            'min_data_in_leaf',
                            'feature_fraction',
                            'subsample'])

plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.

plot_optimization_history(study)

plot_parallel_coordinate: interactively visualizes the hyperparameters and scores

plot_parallel_coordinate(study)

plot_slice: shows the evolution of the search. You can see where in the hyperparameter space your search went and which parts of the space were explored more.

plot_slice(study)

Overall, visualizations in Optuna are incredible!

They let you zoom in on the hyperparameter interactions and help you decide on how to run your next parameter sweep. Amazing job.

10 / 10

Hyperopt

There are three visualization functions in the hyperopt.plotting module:

main_plot_history: shows you the results of each iteration and highlights the best score.

main_plot_history(trials)

main_plot_histogram: shows you the histogram of results over all iterations.

main_plot_histogram(trials)

main_plot_vars: I don’t really know what it does as I couldn’t get it to run and there were no docstrings nor examples (again, the documentation is far from perfect).

Summing up, there are some basic visualization utilities but they are not super useful.

3 / 10

I am very impressed by the visualizations available in Optuna. Useful, interactive, and beautiful.

Visualizations
Optuna > Hyperopt

Note:

If you want to play with those visualizations you can use the study object that I saved as ‘study.pkl’ for each experiment.

For example go to artifacts of this one.

Jump back to the evaluation criteria ->

Speed and Parallelization

When it comes to hyperparameter optimization, being able to distribute your training on your machine or many machines (cluster) can be crucial.

That is why I checked the distributed training options for both Optuna and Hyperopt.

Optuna

You can run distributed hyperparameter optimization on one machine or a cluster of machines and it is actually really simple.

For one machine you simply change the n_jobs parameter in your .optimize() method.

study.optimize(objective, n_trials=100, n_jobs=12)

To run it on a cluster, all you need to do is create a study that resides in a database (you can choose among many relational DBs).

There are two options to do that. You can do it via command-line interface:

optuna create-study \
    --study-name "distributed-example" \
    --storage "sqlite:///example.db"

You can also create a study in your optimization script.

By using load_if_exists=True you can treat your master script and worker scripts in the same way which simplifies things a lot!

study = optuna.create_study(
    study_name='distributed-example', 
    storage='sqlite:///example.db',
    load_if_exists=True)
study.optimize(objective, n_trials=100)

Finally, you can run your worker scripts from many machines and they will all use the same information from the study database.

terminal-1$ python run_worker.py
terminal-25$ python run_worker.py

Easy and works like a charm!

10 / 10

Hyperopt

You can distribute your computation over a cluster of machines. Good, step-by-step instructions can be found in this blog post by Tanay Agrawal but in a nutshell, you need to:

Start a server with MongoDB on it which will consume results from your worker training scripts and send out the next parameter set to try,
In your training script, instead of Trials() create a MongoTrials() object pointing to the database server you have started in the previous step,
Move your objective function to a separate objective.py script and rename it to function,
Compile your Python training script,
Run hyperopt-mongo-worker

Though it gets the job done, it doesn’t feel quite perfect. You need to do some juggling around the objective function, and starting MongoDB could have been provided in the CLI to make things easier.

It is also important to mention that integration with Spark via the SparkTrials object was recently added. There is a step-by-step guide to help you get started and you can even use the spark-installation script to make things easier.

best = hyperopt.fmin(fn = objective,
                     space = search_space,
                     algo = hyperopt.tpe.suggest,
                     max_evals = 64,
                     trials = hyperopt.SparkTrials())

Works exactly the way you would expect it to work.

Nice and simple!

9 / 10

Both libraries support distributed training, which is great. However, Optuna does a slightly better job with a simpler, more user-friendly interface.

Speed and Parallelization
Optuna = Hyperopt

Jump back to the evaluation criteria ->

Experimental results*

Just to be clear those are the results on just one example problem and one run per lib/configuration and they do not guarantee generalization. To run a proper benchmark, you would run it multiple times on various datasets.

That being said, as a practitioner, I would hope to see some improvements over the random search for each problem. Otherwise, why bother with an HPO library?

Ok, so as an example let’s tweak the hyperparameters of the lightGBM model on a tabular, binary classification problem. If you want to use the same dataset as I did you should:

download it from kaggle
use the first 10000 rows from the train.csv file

To make the training quick I fixed the number of boosting rounds to 300 with a 30 round early stopping.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30

def train_evaluate(X, y, params):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                          test_size=0.2, 
                                                          random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data], 
                      valid_names=['valid'])

    score = model.best_score['valid']['auc']
    return score

All the training and evaluation logic is put inside the train_evaluate function. We can treat it as a black box that takes the data and hyperparameter set and produces the AUC evaluation score.

Note:

You can actually turn every script that takes parameters as inputs and outputs the score into such a train_evaluate function. Once that is done you can treat it as a black box and tune your parameters.

I show how to do that step-by-step in a different post “How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps”.

To train a model on a set of parameters you need to run something like this:

import pandas as pd

N_ROWS=10000
TRAIN_PATH = '/mnt/ml-team/minerva/open-solutions/santander/data/train.csv'

data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']

MODEL_PARAMS = {'boosting': 'gbdt',
                'objective':'binary',
                'metric': 'auc',
                'num_threads': 12,
                'learning_rate': 0.3,
                }

score = train_evaluate(X, y, MODEL_PARAMS)
print('Validation AUC: {}'.format(score))

For this study, I tried to find the best parameters within a 100-run budget.

I ran 6 experiments:

Random search (from hyperopt) as a reference
Tree of Parzen Estimator search strategies for both Optuna and Hyperopt
Adaptive TPE from Hyperopt
TPE from Optuna with a pruning callback for more runs but within the same time frame. It turns out that 400 runs with pruning takes as much time as 100 runs without it.
Optuna with Random Forest surrogate model from SkoptSampler

You may want to scroll down to the Example Script at the end.

If you want to explore all of those experiments in more detail you can simply go to the experiment dashboard.

Note:

Both Optuna and Hyperopt improved over the random search which is good.

TPE implementation from Optuna was slightly better than Hyperopt’s Adaptive TPE but not by much. On the other hand, when running hyperparameter optimization, those small improvements are exactly what you are going for.

What is interesting is that the TPE implementations from Hyperopt and Optuna give noticeably different results on this problem. Maybe the cutoff point between good and bad parameter configurations λ is chosen differently or the sampling methods have defaults that work better for this particular problem.

Moreover, using pruning decreased training time by 4x. I could run 400 searches in the time it takes to run 100 without pruning. On the flip side, using pruning got a lower score. It may be different for your problem but it is important to consider that when deciding whether to use pruning or not.

For this section, I assigned points based on the improvements over the random search strategy.

Hyperopt got (0.850 - 0.844) × 1000 = 6
Optuna got (0.854 - 0.844) × 1000 = 10

Experimental results
Optuna = Hyperopt

Jump back to the evaluation criteria ->

Conclusions

Let’s take a look at the overall scores:

Even if you look at it generously and consider only the features that both libraries share, Optuna is a better framework.

It is on-par or slightly better on all criteria and:

it has better documentation
it has a way better visualization suite
it has some features like pruning, callbacks, and exception handling that hyperopt doesn’t support

After doing all this research I am convinced that Optuna is a great library for hyperparameter optimization.

Moreover, I think that you should strongly consider switching from Hyperopt if you were using that in the past.

Example script

import lightgbm as lgb
import neptune
import neptunecontrib.monitoring.optuna as opt_utils
import optuna
import pandas as pd
from sklearn.model_selection import train_test_split

N_ROWS = 10000
TRAIN_PATH = '../data/train.csv'
NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30
STATIC_PARAMS = {'boosting': 'gbdt',
                 'objective': 'binary',
                 'metric': 'auc',
                 }
N_TRIALS = 100
data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)

X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']


def train_evaluate(X, y, params):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data],
                      valid_names=['valid'])
    score = model.best_score['valid']['auc']
    return score


def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 1000),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_uniform('subsample', 0.1, 1.0)}
    all_params = {**params, **STATIC_PARAMS}

    return train_evaluate(X, y, all_params)


neptune.init('jakub-czakon/blog-hpo')
neptune.create_experiment(name='optuna sweep')

monitor = opt_utils.NeptuneMonitor()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=N_TRIALS, callbacks=[monitor])
opt_utils.log_study(study)

neptune.stop()

24 Evaluation Metrics for Binary Classification (And When to Use Them)

Jakub Czakon — Fri, 20 Dec 2019 20:57:10 +0000

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Not sure which evaluation metric you should choose for your binary classification problem? After reading this blog post you should have a good idea.

You will learn about a bunch of common and lesser-known evaluation metrics and charts to understand how to choose the model performance metric for your problem. Specifically, for each metric, I will talk about:

What is the definition and intuition behind it,
The non-technical explanation that you can communicate to business stakeholders,
How to calculate or plot it,
When should you use it.

With that, you will understand the trade-offs so that making metric related decisions will be easier.

I will present all the good stuff in a moment, but first, let’s define our classification problem.

Before we start: problem definition

You will be using those evaluation metrics in the context of a project, so I prepared an example fraud-detection problem based on a recent Kaggle competition.

I selected 43 features and sampled 66000 observations from the original dataset adjusting the fraction of positive class to 0.09.

Then I trained a bunch of lightGBM classifiers with different hyperparameters. I only used learning_rate and n_estimators parameters because I wanted to have an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, as we use more trees and smaller learning rates, it gets tricky, but I think it is a decent proxy.

So for combinations of learning_rate and n_estimators, I did the following:

defined hyperparameter values:

MODEL_PARAMS = {'random_state': 1234,    
                'learning_rate': 0.1,                
                'n_estimators': 10}

trained the model:

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

predicted on test data:

y_test_pred = model.predict_proba(X_test)

logged all the metrics for each run:

log_binary_classification_metrics(y_test, y_test_pred)

For full code base go to this repository or scroll down to the example script.

You can also explore experiment runs with:

evaluation metrics
performance charts
metric by threshold plots

Ok, now we are ready to talk about those classification metrics!

Learn about the following evaluation metrics

Confusion Matrix
False positive rate | Type-I error
False negative rate | Type-II error
True negative rate | Specificity
Negative predictive value
False discovery rate
True positive rate | Recall | Sensitivity
Positive predictive value | Precision
Accuracy
F beta score
F1 score
F2 score
Cohen Kappa
Matthews correlation coefficient | MCC
ROC curve
ROC AUC score
Precision-Recall curve
PR AUC | Average precision
Log loss
Brier score
Cumulative gain chart
Lift curve | Lift chart
Kolmogorov-Smirnov plot
Kolmogorov-Smirnov statistic

I know it is a lot to go over at once. That is why you can jump to the section that is interesting to you and read just that.

1. Confusion Matrix

How to compute:

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first.

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
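
To make the thresholding step concrete, here is a minimal pure-Python sketch of what the matrix counts. The helper name `confusion_counts` and the toy labels and scores are made up for illustration; in practice scikit-learn's `confusion_matrix` does this for you.

```python
def confusion_counts(y_true, y_scores, threshold=0.5):
    """Count (tn, fp, fn, tp) after thresholding positive-class scores."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, y_scores):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data, purely illustrative (not taken from the fraud dataset)
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.05, 0.9]

tn, fp, fn, tp = confusion_counts(toy_labels, toy_scores, threshold=0.5)
print(tn, fp, fn, tp)
```

Changing `threshold` moves observations between the cells, which is why every metric derived from the matrix depends on it.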

How does it look:

So in this example, we can see that:

11918 predictions were true negatives,
872 were true positives,
82 were false positives,
333 predictions were false negatives.

Also, as we already know, this is an imbalanced problem. By the way, if you want to read more about imbalanced problems I recommend taking a look at this article by Tom Fawcett.
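
As a rough cross-check, the four counts listed above are enough to derive the rate metrics defined in the sections that follow. The sketch below only reuses those counts and the formulas given later in this post; the dictionary layout is my own choice.

```python
# Counts reported in the example confusion matrix above
tn, fp, fn, tp = 11918, 82, 333, 872

rates = {
    "false_positive_rate": fp / (fp + tn),
    "false_negative_rate": fn / (tp + fn),
    "true_negative_rate": tn / (tn + fp),
    "negative_predictive_value": tn / (tn + fn),
    "false_discovery_rate": fp / (tp + fp),
    "recall": tp / (tp + fn),
    "precision": tp / (tp + fp),
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
}

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

The recall and accuracy computed this way come out close to the 0.72 and 0.9686 quoted later for the best model at threshold 0.5, which suggests the matrix belongs to that model.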

When to use it:

Pretty much always. I like to see the nominal values rather than normalized ones to get a feeling for how the model is doing on different, often imbalanced, classes.

2. False Positive Rate | Type I error

When we predict that something is positive when it isn’t, we are contributing to the false positive rate. You can think of it as the fraction of false alerts that will be raised based on your model predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)

How models score in this metric (threshold=0.5):

For all the models type-1 error alerts are pretty low but by adjusting the threshold we can get an even lower ratio. Since we have true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.

How does it depend on the threshold:

Obviously, if we increase the threshold only higher scored observations will be classified as positive. In our example, we can see that to reach a perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that only very few predictions are classified as positive.

When to use it:

You rarely would use this metric alone. Usually as an auxiliary one with some other metric,
If the cost of dealing with an alert is high you should consider increasing the threshold to get fewer alerts.

3. False Negative Rate | Type II error

When we don’t predict something when it is, we are contributing to the false negative rate. You can think of it as a fraction of missed fraudulent transactions that your model lets through.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)

How models score in this metric (threshold=0.5):

We can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly, our BIN-98 experiment that had the lowest type-1 error has the highest type-2 error. There is a simple explanation based on the fact that our dataset is imbalanced and with type-2 error we don’t have true negatives in the denominator.

How does it depend on the threshold:

If we decrease the threshold, more observations will be classified as positive. At a certain threshold, we will mark everything as positive (fraudulent for example). We can actually get to the FNR of 0.083 by decreasing the threshold to 0.01.
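
The opposite pull of the two error types can be seen in a small threshold sweep. This is a sketch on made-up toy data; `error_rates` is a hypothetical helper, not part of any library.

```python
def error_rates(y_true, y_scores, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold."""
    fp = sum(1 for y, s in zip(y_true, y_scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_scores) if s <= threshold and y == 1)
    negatives = sum(1 for y in y_true if y == 0)
    positives = sum(1 for y in y_true if y == 1)
    return fp / negatives, fn / positives


# Toy data, purely illustrative
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
toy_scores = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.4, 0.7, 0.9, 0.15]

sweep = [(t, *error_rates(toy_labels, toy_scores, t)) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
for threshold, fpr, fnr in sweep:
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

As the threshold rises the false positive rate can only fall or stay flat, while the false negative rate can only rise or stay flat, so the choice is always a trade-off between the two.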

When to use it:

Usually, it is not used alone but rather with some other metric,
If the cost of letting the fraudulent transactions through is high and the value you get from the users isn’t you can consider focusing on this number.

4. True Negative Rate | Specificity

It measures how many observations out of all negative observations have we classified as negative. In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we marked as clean.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)

How models score in this metric (threshold=0.5):

Very high specificity for all the models. If you think about it, in our imbalanced problem you would expect that. Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.

How does it depend on the threshold:

The higher the threshold, the more truly negative observations we can recall. We can see that starting from, say, threshold=0.4 our model is doing really well in classifying negative cases as negative.

When to use it:

Usually, you don’t use it alone but rather as an auxiliary metric,
When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient “you are healthy”. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.

5. Negative Predictive Value

It measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. With our example, it tells us what is the fraction of correctly predicted clean transactions in all non-fraudulent predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)

How models score in this metric (threshold=0.5):

All models score really high and no wonder, since with an imbalanced problem it is easy to predict negative class.

How does it depend on the threshold:

The higher the threshold the more cases are classified as negative and the score goes down. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.

When to use it:

When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.

6. False Discovery Rate

It measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)

How models score in this metric (threshold=0.5):

The “best model” is incredibly shallow lightGBM which we expect to be incorrect (deeper model should work better).

That is an important takeaway, looking at precision (or recall) alone can lead to you selecting a suboptimal model.

How does it depend on the threshold:

The higher the threshold, the fewer positive predictions. With fewer positive predictions, the ones that are classified as positive have higher certainty scores. Hence, the false discovery rate goes down.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and when you want all the positive predictions to be worth looking at you should optimize for precision.

7. True Positive Rate | Recall | Sensitivity

It measures how many observations out of all positive observations have we classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions.

When you are optimizing recall you want to put all guilty in prison.

How to compute:

from sklearn.metrics import confusion_matrix, recall_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)

# or simply

recall_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

Our best model can recall 0.72 of fraudulent transactions at the threshold 0.5. The difference in recall between our models is quite significant and we can clearly see better and worse models. Of course, for every model, we can adjust the threshold to recall all fraudulent transactions.

How does it depend on the threshold:

For the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get really high recall of 0.917. As the threshold increases the recall falls.

When to use it:

Usually, you will not use it alone but rather coupled with other metrics like precision,
That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.

8. Positive Predictive Value | Precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what is the ratio of transactions correctly classified as fraudulent.

When you are optimizing precision you want to make sure that people that you put in prison are guilty.

How to compute:

from sklearn.metrics import confusion_matrix, precision_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)

# or simply

precision_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

It seems like all the models have pretty high precision at this threshold. The “best model” is incredibly shallow lightGBM which obviously smells fishy. That is an important takeaway, looking at precision (or recall) alone can lead to you selecting a suboptimal model.

Of course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high scoring predictions the precision on those will likely be high.

How does it depend on the threshold:

The higher the threshold the better the precision and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn’t classify anything as positive and so we don’t plot it.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and when you want all the positive predictions to be worth looking at, you should optimize for precision.

9. Accuracy

It measures how many observations, both positive and negative, were correctly classified.

You shouldn’t use accuracy on imbalanced problems. Then, it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
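
A quick sanity check of that claim, using the class counts implied by the example confusion matrix earlier in this post (tn + fp negatives, tp + fn positives). This is only a back-of-the-envelope sketch, not part of the original experiments.

```python
# Class counts implied by the example confusion matrix (tn + fp, tp + fn)
n_negative = 11918 + 82
n_positive = 872 + 333

# A "model" that labels every transaction as non-fraudulent
dummy_tn, dummy_fp = n_negative, 0
dummy_fn, dummy_tp = n_positive, 0

dummy_accuracy = (dummy_tp + dummy_tn) / (n_negative + n_positive)
dummy_recall = dummy_tp / (dummy_tp + dummy_fn)

print(f"accuracy={dummy_accuracy:.4f}, recall={dummy_recall:.4f}")
```

The dummy model scores above 0.9 on accuracy while catching no fraud at all, which is exactly why accuracy alone is misleading here.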

How to compute:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)

# or simply

accuracy_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

We can see that for all the models we beat the dummy model (all clean transactions) by a large margin. Also, the models that we’d expect to be better are in fact at the top.

How does it depend on the threshold:

With accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over standard 0.5 could bump the score by a tiny bit 0.9686->0.9688.

When to use it:

When your problem is balanced using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project,
When every class is equally important to you.

10. F beta score

Simply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

When choosing beta in your F-beta score the more you care about recall over precision the higher beta you should choose. For example, with F1 score we care equally about recall and precision, while with F2 score recall is twice as important to us.

With 0 < beta < 1 we care more about precision, so the optimal threshold moves toward higher values; with beta > 1 our optimal threshold moves toward lower thresholds, and with beta=1 it is somewhere in the middle.

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
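
To see how beta shifts the balance, here is a minimal sketch that applies the standard F-beta definition by hand. It reuses the counts from the example confusion matrix earlier in this post; the helper name `f_beta` is my own.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall from the example confusion matrix (tp=872, fp=82, fn=333)
precision = 872 / (872 + 82)
recall = 872 / (872 + 333)

for beta in (0.5, 1, 2):
    print(f"beta={beta}: {f_beta(precision, recall, beta):.4f}")
```

Because precision is higher than recall in this example, the score drops as beta grows and recall gets more weight. The beta=1 and beta=2 values land close to the F1 and F2 scores quoted in the next two sections, assuming that matrix belongs to the best model.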

11. F1 score (beta=1)

It’s the harmonic mean between precision and recall.

How models score in this metric (threshold=0.5):

As we can see combining precision and recall gave us a more realistic view of our models. We get 0.808 for the best one and a lot of room for improvement.

What is good is that it seems to be ranking our models correctly with those larger lightGBMs at the top.

How does it depend on the threshold:

We can adjust the threshold to optimize F1 score. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. Good thing is, you can find a sweet spot for F1 metric. As you can see, getting the threshold just right can actually improve your score by a bit 0.8077->0.8121.

When to use it:

Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.

12. F2 score (beta=2)

It’s a metric that combines precision and recall, putting 2x emphasis on recall.

How models score in this metric (threshold=0.5):

This score is even lower for all the models than F1 but can be increased by adjusting the threshold considerably.
Again, it seems to be ranking our models correctly, at least in this simple example.

How does it depend on the threshold:

We can see that with a lower threshold and therefore more true positives recalled we get a higher score. You can usually find a sweet spot for the threshold. Possible gain from 0.755 -> 0.803 shows how important threshold adjustments can be here.

When to use it:

I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.

13. Cohen Kappa Metric

In simple words, Cohen Kappa tells you how much better your model is than the random classifier that predicts based on class frequencies.

To calculate it one needs to calculate two things: “observed agreement” (po) and “expected agreement” (pe). Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the random classifier that samples according to class frequencies agree with the ground truth, or accuracy of the random classifier.

From an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.

How to compute:

from sklearn.metrics import cohen_kappa_score

y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
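
For intuition, the two agreement terms can also be computed by hand. The sketch below uses the usual definition kappa = (po - pe) / (1 - pe), with the expected agreement built from the class frequencies of both the ground truth and the predictions, and plugs in the counts from the example confusion matrix earlier in this post.

```python
# Counts from the example confusion matrix
tn, fp, fn, tp = 11918, 82, 333, 872
n = tn + fp + fn + tp

# Observed agreement: plain accuracy
p_o = (tp + tn) / n

# Expected agreement: chance that truth and prediction match if they were
# independent, given how often each class appears in both
p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o={p_o:.4f}, p_e={p_e:.4f}, kappa={kappa:.4f}")
```

On an imbalanced dataset the expected agreement is already high, so kappa ends up well below accuracy even for a good model.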

How models score in this metric (threshold=0.5):

We can easily distinguish the worst/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.

How does it depend on the threshold:

With the chart just like the one above we can find a threshold that optimizes cohen kappa. In this case, it is at 0.31 giving us some improvement 0.7909 -> 0.7947 from the standard 0.5.

When to use it:

This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.

14. Matthews Correlation Coefficient | MCC

It’s a correlation between predicted classes and ground truth. It can be calculated based on values from the confusion matrix:

$$MCC = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

Alternatively, you could also calculate the correlation between y_true and y_pred.
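
Both routes can be checked against each other. The following sketch computes MCC from the confusion-matrix formula and as a plain Pearson correlation between 0/1 labels and 0/1 predictions, using the counts from the example confusion matrix earlier in this post. The `pearson` helper is written out by hand only to keep the example dependency-free.

```python
import math

tn, fp, fn, tp = 11918, 82, 333, 872

# Closed-form MCC from the confusion matrix
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Rebuild 0/1 label and prediction vectors that produce the same matrix
y_true = [1] * tp + [1] * fn + [0] * fp + [0] * tn
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


correlation = pearson(y_true, y_pred)
print(f"mcc={mcc:.4f}, pearson={correlation:.4f}")
```

The two numbers agree up to floating-point error, which is what makes MCC easy to explain as "the correlation between what we predicted and what happened".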

How to compute:

from sklearn.metrics import matthews_corrcoef

y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

We can clearly see improvements in our model quality and a lot of room to grow, which I really like. Also, it ranks our models reasonably and puts models that you’d expect to be better on top. Of course, MCC depends on the threshold that we choose.

How does it depend on the threshold:

We can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.

When to use it:

When working on imbalanced problems,
When you want to have something easily interpretable.

15. ROC Curve

It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher the TPR and the lower the FPR for each threshold the better, so classifiers whose curves sit closer to the top-left corner are better.

Extensive discussion of ROC Curve and ROC AUC score can be found in this article by Tom Fawcett.

How to compute:

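import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc

# y_pred holds the per-class probabilities returned by predict_proba
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)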

How does it look:

We can see a healthy ROC curve, pushed towards the top-left side both for positive and negative class. It is not clear which one performs better across the board as with FPR < ~0.15 positive class is higher and starting from FPR~0.15 the negative class is above.

16. ROC AUC score

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The more top-left your curve is the higher the area and hence higher ROC AUC score.

Alternatively, it can be shown that ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, it is more useful because it tells us how good your model is at ranking predictions. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.

How to compute:

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_true, y_pred_pos)
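
The ranking interpretation can be made concrete with a brute-force sketch: compare every positive with every negative and count how often the positive gets the higher score (ties count as half). The helper `pairwise_auc` and the toy data are made up for illustration; it is quadratic in the number of observations, so it is only meant for building intuition.

```python
def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, purely illustrative
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

auc = pairwise_auc(toy_labels, toy_scores)
print(f"{auc:.4f}")
```

On the same inputs this should agree with `roc_auc_score`, and it makes clear why the score does not depend on any classification threshold: only the ordering of the scores matters.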

How models score in this metric:

We can see improvements and the models that one would guess to be better are indeed scoring higher. Also, the score is independent of the threshold which comes in handy.

When to use it:

You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.

17. Precision-Recall Curve

It is a curve that combines precision (PPV) and Recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot it. The higher on y-axis your curve is the better your model performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Obviously, the higher the recall the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall

fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)

How does it look:

We can see that for the negative class we maintain high precision and high recall almost throughout the entire range of thresholds. For the positive class precision is starting to fall as soon as we are recalling 0.2 of true positives and by the time we hit 0.8, it decreases to around 0.7.

18. PR AUC score | Average precision

Similarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.

How to compute:

from sklearn.metrics import average_precision_score

average_precision_score(y_true, y_pred_pos)
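
As a rough illustration of the "average of precision scores" view, the sketch below ranks observations by score and averages the precision measured at each positive. It assumes there are no tied scores; the helper name and toy data are made up for illustration.

```python
def average_precision(y_true, y_scores):
    """Mean of the precision values measured at each positive, ranked by score."""
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy data, purely illustrative
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

ap = average_precision(toy_labels, toy_scores)
print(f"{ap:.4f}")
```

Every negative that is ranked above a positive drags the score down, while negatives ranked at the bottom do not affect it at all. That is the reason this metric focuses on the positive class.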

How models score in this metric:

The models that we suspect to be “truly” better are in fact better in this metric which is definitely a good thing. Overall, we can see high scores but way less optimistic than ROC AUC scores (0.96+).

When to use it:

When you want to communicate precision/recall decision to other stakeholders,
When you want to choose the threshold that fits the business problem,
When your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
When you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).

19. Log loss

Log loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.

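Basically, we calculate the difference between ground truth and predicted score for every observation and average those errors over all observations. For one observation the error formula reads:

$$logloss(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$

where y is the true label (0 or 1) and p is the predicted probability of the positive class.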

The more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. It is good to take a look at how the error changes as that difference increases:

So our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.
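
Here is a small sketch of that effect using the standard per-observation log loss. The clipping bounds of 0.01 and 0.99 are arbitrary values picked for illustration, and both helper names are my own.

```python
import math


def log_loss_single(y, p):
    """Log loss for one observation with true label y and predicted score p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def clip(p, low=0.01, high=0.99):
    """Keep a predicted score away from 0 and 1 (bounds chosen arbitrarily here)."""
    return min(max(p, low), high)


confident_and_right = log_loss_single(1, 0.9999)
confident_and_wrong = log_loss_single(0, 0.9999)
clipped_and_wrong = log_loss_single(0, clip(0.9999))

print(f"{confident_and_right:.4f} {confident_and_wrong:.4f} {clipped_and_wrong:.4f}")
```

The confidently wrong prediction costs several orders of magnitude more than the confidently right one earns, and clipping caps that penalty at the price of slightly worse scores on predictions that were right.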

If you want to learn more about log-loss read this article by Daniel Godoy.

How to compute:

from sklearn.metrics import log_loss

log_loss(y_true, y_pred)

How models score in this metric:

It is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.

When to use it:

There is almost always a performance metric that better matches your business problem. Because of that, I would use log-loss as an objective for your model and some other metric to evaluate performance.

20. Brier score

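It is a measure of how far your predictions lie from the true values. For one observation it simply reads:

$$Brier(y, p) = (y - p)^2$$

where y is the true label (0 or 1) and p is the predicted probability of the positive class.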

Basically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this article by Jason Brownlee.

It can be a great supplement to your ROC AUC score and other metrics that focus on other things.

How to compute:

from sklearn.metrics import brier_score_loss

brier_score_loss(y_true, y_pred_pos)
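
Written out by hand, the score is just a mean squared error between 0/1 labels and predicted probabilities. The sketch below uses made-up toy values; `brier_score` is a hypothetical helper that mirrors what `brier_score_loss` computes for binary labels.

```python
def brier_score(y_true, y_scores):
    """Mean squared difference between 0/1 labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_scores)) / len(y_true)


# Toy data, purely illustrative
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.6, 0.4]
score = brier_score(toy_labels, toy_scores)

# The square root puts the score back on the probability scale
typical_error = score ** 0.5
print(f"brier={score:.4f}, typical_error={typical_error:.4f}")
```

Taking the square root, as done for the BIN-101 result discussed next, gives a number that reads as "how far off a typical predicted probability is".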

How models score in this metric:

Model from the experiment BIN-101 has the best calibration and for that model, on average our predictions were off by 0.16 (√0.0263309).

When to use it:

When you care about calibrated probabilities.

21. Cumulative gains chart

In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.

Simply put:

you order your predictions from highest to lowest and
for every percentile you calculate the fraction of true positive observations up to that percentile.

It makes it easy to see the benefits of using your model to target given groups of users/accounts/transactions especially if you really care about sorting them.
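
The two steps above can be sketched in a few lines. The helper `cumulative_gain` and the toy data are assumptions made for illustration, and the code assumes there are no tied scores.

```python
def cumulative_gain(y_true, y_scores, fraction):
    """Share of all positives found in the top `fraction` of scored observations."""
    ranked = [y for _, y in sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)]
    top_n = round(fraction * len(ranked))
    return sum(ranked[:top_n]) / sum(ranked)


# Toy data, purely illustrative: 3 positives among 10 observations
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

gains = {f: cumulative_gain(toy_labels, toy_scores, f) for f in (0.2, 0.4, 1.0)}
print(gains)
```

A random ordering would, on average, capture a share of positives equal to the fraction inspected, so any gain above that diagonal is what the model adds.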

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain

fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)

How does it look:

We can see that our cumulative gains chart shoots up very quickly as we increase the sample of highest-scored predictions. By the time we get to the 20th percentile over 90% of positive cases are covered. You could use this chart to prioritize and filter out possible fraudulent transactions for processing.

Say we were to use our model to assign possible fraudulent transactions for processing and we needed to prioritize. We could use this chart to tell us where it makes the most sense to choose a cutoff.

When to use it:

Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

22. Lift curve | lift chart

It is pretty much just a different representation of the cumulative gains chart:

we order the predictions from highest to lowest
for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
we calculate the ratio of those fractions and plot it.

It tells you how much better your model is than a random model for the given percentile of top scored predictions.
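
A minimal sketch of that ratio, on made-up toy data and assuming no tied scores. The helper name `lift` is my own.

```python
def lift(y_true, y_scores, fraction):
    """Positive rate in the top `fraction` of predictions divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)]
    top_n = round(fraction * len(ranked))
    top_rate = sum(ranked[:top_n]) / top_n
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


# Toy data, purely illustrative: 3 positives among 10 observations
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

lifts = {f: lift(toy_labels, toy_scores, f) for f in (0.2, 0.4, 1.0)}
print(lifts)
```

Lift can never exceed 1 divided by the overall positive rate, so with roughly 9% positives as in this post's dataset the ceiling is about 11x, and it always falls back to 1 once the whole dataset is included.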

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_lift_curve

fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)

How does it look:

So for the top 10% of predictions, our model is over 10x better than random, for the top 20% it is over 4x better, and so on.

When to use it:

Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

23. Kolmogorov-Smirnov plot

KS plot helps to assess the separation between prediction distributions for positive and negative classes.

In order to create it you:

sort your observations by the prediction score,
for every cutoff point [0.0, 1.0] of the sorted dataset (depth) calculate the proportion of true positives and true negatives in this depth,
plot those fractions, positive(depth)/positive(all), negative(depth)/negative(all), on Y-axis and dataset depth on X-axis.

So it works similarly to cumulative gains chart but instead of just looking at positive class it looks at the separation between positive and negative class.

A good explanation of KS plot and KS statistic can be found in this article by Riaz Khan.

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_ks_statistic

fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)

How does it look:

So we can see that the largest difference is at a cutoff point of 0.034 of top predictions. After that threshold, it decreases at a moderate rate as we increase the percentage of top predictions. Around 0.8 it starts getting worse really fast. So even though the best separation is at 0.034 we could potentially push it a bit higher to get more positively classified observations.

24. Kolmogorov-Smirnov statistic

If we want to take the KS plot and get one number that we can use as a metric we can look at all thresholds (dataset cutoffs) from KS plot and find the one for which the distance (separation) between the distributions of true positive and true negative observations is the highest.

If there is a threshold for which all observations above are truly positive and all observations below are truly negative we get a perfect KS statistic of 1.0.

How to compute:

from scikitplot.helpers import binary_ks_curve

res = binary_ks_curve(y_true, y_pred_pos)
ks_stat = res[3]
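
For intuition, the statistic can be sketched as the largest gap between the cumulative score distributions of the two classes, which is the usual two-sample Kolmogorov-Smirnov definition. The helper below is an illustration on toy data and may differ in implementation details from what scikit-plot does internally.

```python
def ks_statistic(y_true, y_scores):
    """Largest gap between the score CDFs of the negative and positive class."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(y_scores)):
        pos_cdf = sum(1 for s in pos if s <= threshold) / len(pos)
        neg_cdf = sum(1 for s in neg if s <= threshold) / len(neg)
        best = max(best, abs(neg_cdf - pos_cdf))
    return best


# Toy data, purely illustrative
separable = ks_statistic([0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.8, 0.9])
overlapping = ks_statistic([0, 1, 0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(separable, overlapping)
```

When a single threshold splits the classes cleanly the statistic reaches 1.0, and the more the two score distributions interleave the closer it gets to 0.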

How models score in this metric:

By using the KS statistic as the metric we were able to rank BIN-101 as the best model, which is the one we expect to be the “truly” best model.

When to use it:

When your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.


Final Thoughts

In this blog post, you’ve learned about various classification metrics and performance charts.

We went over metric definitions and interpretations, learned how to calculate them, and talked about when to use them.

Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.

Bonus

To help you use the information from this blog post to the fullest, I have prepared:

a logging helper function that calculates and logs all the metrics, performance charts, and metric-by-threshold charts
a binary classification metrics cheatsheet with everything I talked about digested into a few pages.

Check those out below!

Logging helper function

You can log all of the metrics and performance charts that we covered for your machine learning project with just one function call and then explore them in Neptune:

install the package:

pip install neptune-contrib[all]

import and run:

import neptunecontrib.monitoring.metrics as npt_metrics

npt_metrics.log_binary_classification_metrics(y_true, y_pred)

explore everything in the app:

Binary classification metrics cheatsheet

We’ve created a cheatsheet that takes all the content I went over in this blog post and puts it into a digestible, few-page document which you can print and use whenever you need anything related to binary classification metrics.

Download binary classification metrics cheatsheet

Example script

import lightgbm
import matplotlib.pyplot as plt
import neptune
from neptunecontrib.monitoring.utils import pickle_and_send_artifact
from neptunecontrib.monitoring.metrics import log_binary_classification_metrics
from neptunecontrib.versioning.data import log_data_version
import pandas as pd

plt.rcParams.update({'font.size': 18})
plt.rcParams.update({'figure.figsize': [16, 12]})
plt.style.use('seaborn-whitegrid')

# Define parameters
PROJECT_NAME = 'neptune-ml/binary-classification-metrics'

TRAIN_PATH = 'data/train.csv'
TEST_PATH = 'data/test.csv'
NROWS = None

MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 1500}

# Load data
train = pd.read_csv(TRAIN_PATH, nrows=NROWS)
test = pd.read_csv(TEST_PATH, nrows=NROWS)

feature_names = [col for col in train.columns if col not in ['isFraud']]

X_train, y_train = train[feature_names], train['isFraud']
X_test, y_test = test[feature_names], test['isFraud']

# Start experiment
neptune.init(PROJECT_NAME)
neptune.create_experiment(name='lightGBM training',
                          params=MODEL_PARAMS,
                          upload_source_files=['train.py', 'environment.yaml'])
log_data_version(TRAIN_PATH, prefix='train_')
log_data_version(TEST_PATH, prefix='test_')

# Train model
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

# Evaluate model
y_test_pred = model.predict_proba(X_test)

log_binary_classification_metrics(y_test, y_test_pred)
pickle_and_send_artifact((y_test, y_test_pred), 'test_predictions.pkl')

neptune.stop()

DEV Community: Jakub Czakon

Text Classification: All Tips and Tricks from 5 Kaggle Competitions

Dealing with larger datasets

Small datasets and external data

Data Exploration and Gaining insights

Data Cleaning

Text Representations

Modeling

Loss functions

Optimizers

Callback methods

Evaluation and cross-validation

Runtime tricks

Model ensembling

Final thoughts

6 GAN Architectures You Really Should Know

GAN 101 and Vanilla GAN

CycleGAN:

PixelRNN

text-2-image

DiscoGAN

lsGAN

Final thoughts

Image Segmentation: Tips and Tricks from 39 Kaggle Competitions

Contents

External Data

Data Exploration and Gaining insights

Preprocessing

Data Augmentations

Modeling

Architectures

Hardware Setups

Loss Functions

Training tips

Evaluation and cross-validation

Ensembling methods

Post Processing

Final Thoughts

Image segmentation in 2020: Architectures, Losses, Datasets, and Frameworks

What is Image Segmentation?

Image Segmentation Architectures

U-Net

FastFCN - Fast Fully-connected network

Gated-SCNN

DeepLab

Mask R-CNN

Image Segmentation Loss functions

Focal Loss

Dice loss

Intersection over Union (IoU)-balanced Loss

Boundary loss

Weighted cross-entropy

Lovász-Softmax loss

Image Segmentation Datasets

Common Objects in COntext - Coco Dataset

PASCAL Visual Object Classes (PASCAL VOC)

The Cityscapes Dataset

The Cambridge-driving Labeled Video Database - CamVid

Image Segmentation Frameworks

Final Thoughts

Document Classification: 7 pragmatic approaches for small datasets

Text Classification 101

Example text classification dataset

Text data preparation

Text Representation

TfidfVectorizer

Word2vec

FastText

GloVe ( Global vectors for word representation)

Elmo, BERT, and others.

Text Classification

Comparison

Final Thoughts

8 Creators and Core Contributors Talk About Their Model Training Libraries From PyTorch Ecosystem

Skorch

Catalyst

Fastai

PyTorch Ignite

“make the common things easy and the hard things possible”.

Library structure