<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rohit Gupta</title>
    <description>The latest articles on DEV Community by Rohit Gupta (@rohitgupta24).</description>
    <link>https://dev.to/rohitgupta24</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F807251%2Ffefbfdd1-944f-4834-93bd-21ca7a80f2c3.jpg</url>
      <title>DEV Community: Rohit Gupta</title>
      <link>https://dev.to/rohitgupta24</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohitgupta24"/>
    <language>en</language>
    <item>
      <title>EfficientNet for Beginners</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Wed, 02 Mar 2022 10:46:44 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/efficientnet-for-beginners-24b3</link>
      <guid>https://dev.to/rohitgupta24/efficientnet-for-beginners-24b3</guid>
      <description>&lt;p&gt;A very brief introduction of EfficientNet for beginners without much technical details.&lt;/p&gt;

&lt;p&gt;Considering the problems faced by older networks, Google released a paper in 2019 that introduced a new family of CNNs: EfficientNet. These CNNs not only provide better accuracy but also improve the efficiency of the models, reducing the parameters and FLOPS (floating point operations) manifold in comparison to state-of-the-art models.&lt;/p&gt;

&lt;p&gt;What is new here? In EfficientNet, we perform scaling along three dimensions: 1. depth, 2. width, 3. resolution.&lt;/p&gt;

&lt;p&gt;1. Depth Scaling: keep increasing the depth of the network. More layers generally mean a more powerful network and hence better results, but more layers also bring the exploding/vanishing gradient problem. ResNet resolves that issue, but ResNet is computationally expensive.&lt;/p&gt;

&lt;p&gt;2. Resolution Scaling: low-resolution images are often blurry, while a high-resolution image has more pixels and therefore more information. The network can learn more complex features and fine-grained patterns, so learning improves and accuracy increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JMfC3XPU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly6n9zyxyi6plzbjmk7x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JMfC3XPU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly6n9zyxyi6plzbjmk7x.jpg" alt="Image description" width="474" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. Width Scaling: increasing the number of feature maps (channels). To capture each and every feature of the image, and to extract more fine-grained features, we need more feature maps.&lt;/p&gt;

&lt;p&gt;If the input image is bigger (higher resolution), then it contains more complex features and fine-grained patterns.&lt;/p&gt;

&lt;p&gt;Why Depth Scaling? Once we have done Resolution Scaling, there is more information in the input image that needs to be processed, so we need more layers.&lt;/p&gt;

&lt;p&gt;But how much Depth Scaling do we need for a particular increment in image resolution? How many layers are enough?&lt;/p&gt;

&lt;p&gt;Also, from the paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1. Scaling up any dimension of network width, depth or resolution improves accuracy, but the accuracy gain diminishes for bigger models.&lt;/p&gt;

&lt;p&gt;2. In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth and resolution during scaling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The main contributions of this paper are: (a) designing a simple mobile-size baseline architecture, EfficientNet-B0, and (b) providing an effective compound scaling method for increasing the model size to achieve maximum accuracy gains.&lt;/p&gt;

&lt;p&gt;How do we scale in a balanced way? Compound Scaling helps in choosing the right values so that the accuracy gains do not diminish. For Compound Scaling, we need a baseline model (EfficientNet-B0). Before continuing, we should know that EfficientNet-B0 was developed by &lt;a href="https://en.wikipedia.org/wiki/Neural_architecture_search"&gt;Neural Architecture Search&lt;/a&gt; (NAS), which automates the design of neural networks; networks designed by NAS are on par with, and often outperform, hand-designed architectures.&lt;/p&gt;

&lt;p&gt;EfficientNet-B0 achieves 77.3% accuracy on ImageNet with only 5.3M parameters and 0.39B FLOPS (ResNet-50 provides 76% accuracy with 26M parameters and 4.1B FLOPS).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Yw_SoEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am18sc12pavf61l9dhis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Yw_SoEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am18sc12pavf61l9dhis.jpg" alt="Image description" width="880" height="694"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JEfM2zdr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tb2xnjequym3b78exy6v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JEfM2zdr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tb2xnjequym3b78exy6v.jpg" alt="Image description" width="880" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On EfficientNet-B0 we then apply Compound Scaling to scale the model up. It is done in the following way:&lt;/p&gt;

&lt;p&gt;STEP 1: we first fix φ = 1, assuming twice more resources available, and do a small grid search of α, β, γ (using equations 1 and 2 given in the paper). In particular, we find the best values for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint&lt;br&gt;
α · β² · γ² ≈ 2&lt;/p&gt;

&lt;p&gt;STEP 2: we then fix α, β, γ as constants and scale up the baseline network with different φ using Equation 3, to obtain EfficientNet-B1 to B7. Notably, it is possible to achieve even better performance by searching for α, β, γ directly around a large model, but the search cost becomes prohibitively expensive on larger models. The method solves this issue by doing the search only once on the small baseline network (step 1), and then using the same scaling coefficients for all other models (step 2).&lt;/p&gt;

&lt;p&gt;α = 1.2, β = 1.1, γ = 1.15 simply means that if resolution is enhanced by 15%, then depth should be increased by 20% and width by 10%.&lt;/p&gt;
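&lt;p&gt;As a quick sanity check, the scaling rule can be sketched in a few lines of Python (the coefficient values are the ones quoted from the paper; the helper function name is my own):&lt;/p&gt;

```python
# Sketch of the compound-scaling rule (assumed values from the paper:
# alpha=1.2, beta=1.1, gamma=1.15, found by grid search at phi=1).
# Depth scales by alpha**phi, width by beta**phi, resolution by gamma**phi,
# and total FLOPS grow roughly by (alpha * beta**2 * gamma**2)**phi.

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth_mult = alpha ** phi        # multiply the number of layers
    width_mult = beta ** phi         # multiply the number of channels
    res_mult = gamma ** phi          # multiply the input resolution
    flops_mult = (alpha * beta ** 2 * gamma ** 2) ** phi  # approx. 2**phi
    return depth_mult, width_mult, res_mult, flops_mult

d, w, r, f = compound_scale(phi=1)
print(round(d, 2), round(w, 2), round(r, 2), round(f, 2))  # 1.2 1.1 1.15 1.92
```

&lt;p&gt;Note that the FLOPS multiplier comes out to about 1.92 ≈ 2, matching the constraint above.&lt;/p&gt;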

&lt;p&gt;&lt;u&gt;Results&lt;/u&gt;: this technique allowed the authors to produce models with higher accuracy than existing ConvNets, and with a monumental reduction in overall FLOPS and model size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1905.11946.pdf"&gt;Original Paper&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html"&gt;Google Blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all folks.&lt;/p&gt;

&lt;p&gt;If you have any doubt ask me in the comments section and I'll try to answer as soon as possible.&lt;br&gt;
If you love the article follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Coding and Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>neuralnetwork</category>
      <category>python</category>
    </item>
    <item>
      <title>ConvNext : A ConvNet for the 2020s (Part I)</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Mon, 28 Feb 2022 13:10:44 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/convnext-a-convnet-for-the-2020s-part-i-i43</link>
      <guid>https://dev.to/rohitgupta24/convnext-a-convnet-for-the-2020s-part-i-i43</guid>
      <description>&lt;p&gt;The goal of the paper is to modernize the ResNet and bring back the glory to CNNs ;)&lt;br&gt;
In other words, they tried to apply the concepts of Transformers to ResNet like archtitecture and make them better. They individually applied ideas and showcased how much of an improvement it shows.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kk2HJ117--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dicydgyvhytymzo9inwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kk2HJ117--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dicydgyvhytymzo9inwn.png" alt="Image description" width="562" height="680"&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;They compared the impact on 2 things : &lt;strong&gt;Accuracy&lt;/strong&gt; and &lt;strong&gt;Computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Finally, they reached the conclusion that the following changes enhance the results:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Large kernel size (7×7)&lt;br&gt;
- Replace ReLU with GELU&lt;br&gt;
- Fewer normalization layers&lt;br&gt;
- Substitute BatchNorm with LayerNorm&lt;br&gt;
- Inverted bottleneck block&lt;br&gt;
- Grouped convolutions to reduce computation&lt;br&gt;
- Add a "patchify" layer (to split an image into a sequence of patches)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The image below compares ResNet and ViT with ConvNeXt. The diameter shows the computational power needed: the bigger the circle, the more computationally expensive the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g7o7OLwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rt8kunoulq41i1usb134.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g7o7OLwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rt8kunoulq41i1usb134.png" alt="Image description" width="576" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If augmentations are applied to the ViT model, then the comparison goes like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MqicUpSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e6zgfmq35na71og58bk9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MqicUpSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e6zgfmq35na71og58bk9.jpg" alt="Image description" width="880" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another quote from the paper is also very good:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The full dominance of ConvNets in computer vision was not a coincidence: in many application scenarios, a “sliding window” strategy is intrinsic to visual processing, particularly when working with high-resolution images. ConvNets have several built-in inductive biases that make them well suited to a wide variety of computer vision applications. The most important one is translation equivariance, which is a desirable property for tasks like object detection. ConvNets are also inherently efficient due to the fact that when used in a sliding-window manner, the computations are shared.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Translational equivariance (or just equivariance) is a very important property of convolutional neural networks: the position of an object in the image need not be fixed for the CNN to detect it. Concretely, if the input is shifted, the output feature map shifts correspondingly.&lt;br&gt;
Translational equivariance is achieved in CNNs through weight sharing. Since the same weights slide across the whole image, an object will be detected irrespective of its position. This property is very useful for applications such as image classification and object detection, where there may be multiple occurrences of an object or the object might be in motion.&lt;/p&gt;
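&lt;p&gt;A tiny NumPy sketch (my own illustration, not from the paper) shows equivariance directly: shifting the input before convolving gives the same result as convolving and then shifting the output, as long as the signal stays away from the zero-padded borders:&lt;/p&gt;

```python
import numpy as np

# Translation-equivariance check on a 1-D signal: "shift then convolve"
# equals "convolve then shift" (borders are zero, so no edge effects here).
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, 2.0, 3.0])

out = np.convolve(signal, kernel)               # convolve, then shift
shifted_out = np.roll(out, 1)

shifted_signal = np.roll(signal, 1)             # shift, then convolve
out_of_shifted = np.convolve(shifted_signal, kernel)

print(np.array_equal(shifted_out, out_of_shifted))  # True
```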

&lt;p&gt;For more information on Translational Equivariance : &lt;a href="https://towardsdatascience.com/translational-invariance-vs-translational-equivariance-f9fbc8fca63a"&gt;Follow this Article&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Details of Paper&lt;/strong&gt;:&lt;br&gt;
ResNet-50 is trained like the Transformers, i.e., with 1. more epochs, 2. image augmentation, and 3. AdamW.&lt;br&gt;
The researchers used a training recipe close to DeiT's and Swin Transformer's. Training is extended from the original 90 epochs for ResNets to 300 epochs. They use the AdamW optimizer, data augmentation techniques such as Mixup, CutMix, RandAugment and Random Erasing, and regularization schemes including Stochastic Depth and Label Smoothing.&lt;/p&gt;

&lt;p&gt;Adding a Patchify Layer: the researchers replaced the ResNet-style stem cell with a patchify layer implemented as a 4×4, stride-4 convolution. Accuracy changed from 79.4% to 79.5%, which suggests that the stem cell in a ResNet may be substituted with a simpler “patchify” layer as in ViT.&lt;/p&gt;
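&lt;p&gt;To see why this is just a "patchify" operation, here is a rough NumPy sketch (shapes chosen purely for illustration; the actual stem is a learned convolution): a 4×4 convolution with stride 4 is equivalent to cutting the image into non-overlapping 4×4 patches and applying one shared linear projection to each patch:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))        # toy H x W x C image
proj = rng.standard_normal((4 * 4 * 3, 96))   # shared linear projection

h, w, c = image.shape
# cut into non-overlapping 4x4 patches: (H/4, 4, W/4, 4, C)
patches = image.reshape(h // 4, 4, w // 4, 4, c)
# group the two patch-grid axes together, then flatten each patch
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 4 * 4 * 3)
# one linear projection per patch, like the stride-4 conv would compute
tokens = patches @ proj
print(patches.shape, tokens.shape)            # (4, 48) (4, 96)
```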

&lt;p&gt;ResNeXt-ify : The use of depthwise convolution effectively reduces the network FLOPs and, as expected, the accuracy. Following the strategy proposed in ResNeXt, we increase the network width to the same number of channels as Swin-T’s (from 64 to 96).This brings the network performance to 80.5% with increased FLOPs.&lt;/p&gt;

&lt;p&gt;More to come soon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2201.03545"&gt;Official Paper Link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=xuY20JgrVII"&gt;Awesome Video Explanation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all folks.&lt;/p&gt;

&lt;p&gt;If you have any doubt ask me in the comments section and I'll try to answer as soon as possible.&lt;br&gt;
If you love the article follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Coding and Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
      <category>cnn</category>
    </item>
    <item>
      <title>Cross Validation for Beginners</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Sat, 26 Feb 2022 20:44:01 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/cross-validation-5gm3</link>
      <guid>https://dev.to/rohitgupta24/cross-validation-5gm3</guid>
      <description>&lt;p&gt;While attempting to solve a ML problem, we do a train_test split. If this split is done randomly than it might be possible that some dataset might be completely present in test set and absent from training set or vice versa. This reduces the accuracy of model. So Cross Validation comes into picture.&lt;br&gt;
Cross-validation is a step in the process of building a machine learning model which helps us ensure that our models fit the data accurately and also ensures that we do not overfit.Cross-validation is dividing training data into a few parts. We train the model on some of these parts and test on the remaining parts.&lt;/p&gt;
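&lt;p&gt;The "unlucky random split" problem can be sketched with scikit-learn's train_test_split, whose stratify argument keeps the class ratio the same on both sides (the dataset below is synthetic, just for illustration):&lt;/p&gt;

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Skewed toy dataset: 90% of samples are class 0, 10% are class 1.
# A purely random split can leave the rare class out of one side;
# stratify=y preserves the 90/10 ratio in both train and test sets.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(Counter(y_tr), Counter(y_te))   # class ratio preserved on both sides
```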

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Types Of Cross Validation&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;i. &lt;strong&gt;Leave One Out CV&lt;/strong&gt; :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split the dataset into a training set and a test set, using all but one observation as the training set.&lt;/li&gt;
&lt;li&gt;Note that we only leave one observation “out” of the training set. This is where the method gets the name “leave-one-out” cross-validation.&lt;/li&gt;
&lt;li&gt;Use the left-out observation as the test set.&lt;/li&gt;
&lt;li&gt;In the next experiment, leave out a different observation and take the rest of the data as the training input.&lt;/li&gt;
&lt;li&gt;Repeat the Process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Cons&lt;/u&gt;: computationally expensive, and it results in low bias (with high variance).&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Low Bias&lt;/u&gt;: the model fits the training and test folds very well, but when we try it on genuinely new data the accuracy drops and the error rate goes up (i.e., the performance estimate has high variance).&lt;/p&gt;
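&lt;p&gt;The steps above can be sketched with scikit-learn's LeaveOneOut (the tiny four-sample dataset is purely illustrative):&lt;/p&gt;

```python
from sklearn.model_selection import LeaveOneOut

# Leave-one-out CV: each "experiment" uses one observation as the test set
# and the rest as training data, so n samples yield n train/test splits.
X = [[1], [2], [3], [4]]
loo = LeaveOneOut()
splits = list(loo.split(X))
print(len(splits))                    # 4 splits for 4 samples
for train_idx, test_idx in splits:
    print(list(train_idx), list(test_idx))
```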

&lt;p&gt;ii. &lt;strong&gt;K-Fold CV&lt;/strong&gt;: we have some data and a value k. For example, with 1000 samples and k = 5, the first 200 samples (1000/5 = 200) form the test data in the first experiment; in the second experiment the next 200 are the test data, and the process is iterated 5 times.&lt;br&gt;
Across the 5 iterations we get 5 accuracies, which are typically averaged to give a more reliable estimate of model performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rohitgupta24/tidbits/blob/main/cross_validation.py"&gt;Full Code&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called train.csv
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the kfold class from model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;iii. &lt;strong&gt;Stratified CV&lt;/strong&gt;: if we have a skewed classification dataset with 90% positive samples and only 10% negative samples, we don't use random k-fold cross-validation: simple k-fold on such a dataset can produce folds containing only negative samples. In these cases we prefer stratified k-fold cross-validation, which keeps the ratio of labels in each fold constant, so every fold has the same 90% positive and 10% negative samples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called train.csv,
    # with a label column named "target" (adjust to your dataset)
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # fetch the targets: stratification needs the labels
    y = df.target.values
    # initiate the StratifiedKFold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df, y=y)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;iv. &lt;strong&gt;Time Series CV&lt;/strong&gt;: the method used for cross-validating a time-series model is cross-validation on a rolling basis: start with a small subset of data for training, forecast the later data points, and then check the accuracy of the forecasts. The same forecasted data points are then included as part of the next training dataset, and subsequent data points are forecasted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m9Pn48OI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvxnnejhqnrxfii5u7qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m9Pn48OI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvxnnejhqnrxfii5u7qs.png" alt="Image description" width="602" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rohitgupta24/tidbits/blob/main/cross_validation.py"&gt;Full Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all folks.&lt;/p&gt;

&lt;p&gt;If you have any doubt ask me in the comments section and I'll try to answer as soon as possible.&lt;br&gt;
If you love the article follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Coding and Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Positional Embeddings</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Fri, 25 Feb 2022 07:14:21 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/positional-embeddings-3m89</link>
      <guid>https://dev.to/rohitgupta24/positional-embeddings-3m89</guid>
      <description>&lt;p&gt;&lt;strong&gt;Positional Embeddings&lt;/strong&gt; always looked like a different thing to me, so this post is all about explaining the same in &lt;u&gt;plain english.&lt;/u&gt;.&lt;br&gt;
We all hear and read this word ("Positional Embeddings") wherever Transformer Neural Network comes up and now as Transformers are everywhere from Natural Language Processing to Image Classification(after &lt;a href="https://dev.to/rohitgupta24/vision-transformer-an-image-is-worth-1616-words-3o2g"&gt;ViT&lt;/a&gt;), it becomes more important to understand them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are Positional Embeddings or Positional Encodings?&lt;/strong&gt;&lt;br&gt;
Let's take an example: consider the input "King and Queen". If we change the order to "Queen and King", then the meaning of the input may change. The same happens if the input is a sequence of 16×16 image patches (as in &lt;a href="https://dev.to/rohitgupta24/vision-transformer-an-image-is-worth-1616-words-3o2g"&gt;ViT&lt;/a&gt;): if the order of the patches changes, everything changes.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m1ewFKve--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/11j9u57l66m8v6agp0wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m1ewFKve--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/11j9u57l66m8v6agp0wf.png" alt="Input in right sequence" width="880" height="345"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I-EtjuPD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iiw4tikrj0vi4dmiy13x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I-EtjuPD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iiw4tikrj0vi4dmiy13x.png" alt="Input in Distorted Sequence" width="880" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, the Transformer doesn't process the input sequentially; the input is processed in parallel.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pg9Sm4R2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/isr0bti9hmmabvycwhio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pg9Sm4R2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/isr0bti9hmmabvycwhio.png" alt="Image description" width="880" height="586"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4pyPrrBd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79hrtjsbg2lf889ajdmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4pyPrrBd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/79hrtjsbg2lf889ajdmh.png" alt="Image description" width="880" height="542"&gt;&lt;/a&gt; For each element it combines the information from the other element through self-attention, but each element does this aggregation on its own independently of what other elements do.&lt;br&gt;
The Transformer model doesn't model the sequence of input anywhere explicitly. So to know the exact sequence of input &lt;strong&gt;Positional Embeddings&lt;/strong&gt; comes into picture. They works as the hints to Transformers and tells the model about the sequence of inputs.&lt;/p&gt;

&lt;p&gt;These embeddings are added to the initial vector representations of the input.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jj0HUQKS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ii5j01e0jr37c2p8taxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jj0HUQKS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ii5j01e0jr37c2p8taxx.png" alt="Image description" width="880" height="250"&gt;&lt;/a&gt; Also, every position has the same identifier irrespective of what exactly the input is.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AUpn9oCq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivgz6z9d14wkginb4kku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AUpn9oCq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivgz6z9d14wkginb4kku.png" alt="Image description" width="880" height="317"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TdO4s0oz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mycbcluij0o78ghmgeyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TdO4s0oz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mycbcluij0o78ghmgeyb.png" alt="Image description" width="880" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no notion of word order (1st word, 2nd word, ...) in the original architecture. All words of the input sequence are fed to the network with no special order or position; in contrast, in an RNN architecture the n-th word is fed at step n, and in a ConvNet it is fed to specific input indices. Therefore the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds the vital position information.&lt;/p&gt;

&lt;p&gt;The specific choice of (sin, cos) pair helps the model in learning patterns that rely on relative positions. &lt;/p&gt;
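&lt;p&gt;A minimal NumPy sketch of that (sin, cos) encoding (the function name and shapes are my own; even embedding dimensions get sin, odd dimensions get cos):&lt;/p&gt;

```python
import numpy as np

# Sinusoidal positional encoding:
#   PE(pos, 2i)   = sin(pos / 10000**(2i/d_model))
#   PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))
# Nearby positions get similar vectors, which helps the model pick up
# patterns that depend on relative positions.
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)                   # the 2i values
    angles = positions / (10000 ** (dims / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sin
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cos
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)    # (50, 16)
```

&lt;p&gt;Each row of this matrix is added to the embedding of the token at that position.&lt;/p&gt;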

&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt; &lt;a href="http://jalammar.github.io/illustrated-transformer/"&gt;Article by Jay Alammar&lt;/a&gt; explains the paper with excellent visualizations.&lt;br&gt;
His positional-encoding example calculates PE(·) the same way, with the only difference that it puts sin in the first half of the embedding dimensions (as opposed to the even indices) and cos in the second half (as opposed to the odd indices). This difference does not matter, since the subsequent vector operations are invariant to a permutation of dimensions.&lt;/p&gt;

&lt;p&gt;This article is inspired by &lt;a href="https://www.youtube.com/watch?v=1biZfFLPRSY"&gt;this Youtube Video from AI Coffee Break with Letitia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all folks.&lt;/p&gt;

&lt;p&gt;If you have any doubt ask me in the comments section and I'll try to answer as soon as possible.&lt;br&gt;
If you love the article follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>transformer</category>
      <category>neuralnetwork</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Vision Transformer : An image is worth 16*16 words</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Thu, 24 Feb 2022 12:34:13 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/vision-transformer-an-image-is-worth-1616-words-3o2g</link>
      <guid>https://dev.to/rohitgupta24/vision-transformer-an-image-is-worth-1616-words-3o2g</guid>
      <description>&lt;p&gt;In computer vision, however, convolutional architectures remain dominant. Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention, some replacing the convolutions entirely. The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns.&lt;br&gt;
Inspired by the Transformer scaling successes in NLP,in this research paper a standard Transformer was applied directly to images, with the fewest possible modifications  &lt;a href="https://arxiv.org/abs/2010.11929" rel="noopener noreferrer"&gt;An Image is worth 16 x 16 Words&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks&lt;br&gt;
(ImageNet, CIFAR-100,VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In particular, the best model reaches an accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem with CNNs&lt;/strong&gt;:&lt;br&gt;
CNNs use kernels to aggregate very local information in each layer, which is then passed to the next layer, where kernels again aggregate local information. Hence a CNN starts by looking very locally. The Vision Transformer resolves this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the Transformer resolve it?&lt;/strong&gt;&lt;br&gt;
It considers a very large field of view from the very beginning, overcoming the limitation of CNNs, which look very narrowly at first. Also, there is no decoder; instead there is an extra linear layer for the final classification, called the MLP head.&lt;br&gt;
The Transformer looks at the data by taking an input image and splitting it into patches of 16*16 pixels.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw50s5gx34856ha2mmyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw50s5gx34856ha2mmyx.png" alt="Input Image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl2041831cziahqchvai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl2041831cziahqchvai.png" alt="Input images splitted into 16*16 patches"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the patches are treated as simple tokens: they are flattened, and lower-dimensional linear embeddings are produced.&lt;br&gt;
Then we add positional embeddings to the vectors, and this sequence is fed into a standard Transformer encoder.&lt;br&gt;
The model is pretrained with image labels (fully supervised, on a huge dataset).&lt;br&gt;
Finally, the network is fine-tuned on the downstream dataset.&lt;/p&gt;
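&lt;p&gt;The patch-splitting and embedding step above can be sketched in a few lines of NumPy. This is a toy illustration with random weights: the patch size P, hidden size D, and projection matrix W are my own placeholders, and the real ViT learns the projection (and also prepends a class token).&lt;/p&gt;

```python
import numpy as np

# A 224*224 RGB image split into 16*16 patches gives (224/16)**2 = 196 tokens.
image = np.random.rand(224, 224, 3)
P, D = 16, 768                      # patch size and hidden (embedding) size

# Split into non-overlapping P*P patches, then flatten each patch to a vector.
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)
print(patches.shape)                # (196, 768): 196 tokens of dim 16*16*3

# Linear projection to D-dimensional patch embeddings (learned in the real model).
W = np.random.rand(P * P * 3, D)
embeddings = patches @ W            # (196, D), ready for positional embeddings
```

After this, the sequence of 196 embeddings is handled exactly like a sequence of word embeddings in NLP.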

&lt;p&gt;When solving an NLP problem, inputs (such as sentences) are first converted into numeric indices (by creating a vocabulary dictionary from the words present in the training data) and then fed into the Transformer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Positional Embedding&lt;/strong&gt;: Positional encoding is a re-representation of the value of a word together with its position in a sentence (given that being at the beginning is not the same as being at the end or in the middle).&lt;br&gt;
But you have to take into account that sentences can be of any length, so saying '"X" is the third word in the sentence' does not make sense across sentences of different lengths: 3rd in a 3-word sentence is completely different from 3rd in a 20-word sentence.&lt;br&gt;
What a positional encoder does is use the cyclic nature of the sin(x) and cos(x) functions to encode information about the position of a word in a sentence.&lt;br&gt;
&lt;a href="https://datascience.stackexchange.com/questions/51065/what-is-the-positional-encoding-in-the-transformer-model" rel="noopener noreferrer"&gt;Source : stackexchange.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In all&lt;/strong&gt;, three variants of &lt;strong&gt;ViT&lt;/strong&gt; were proposed: &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59sit7bukykinkfdp2g1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59sit7bukykinkfdp2g1.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Hidden size D is the embedding size, which is kept fixed throughout the layers. &lt;br&gt;
Transformers are better in general because they can be scaled up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful6gzzvyuq28ueiumsed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful6gzzvyuq28ueiumsed.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems which still need to be resolved&lt;/strong&gt;&lt;br&gt;
Transformers are unfocused in the initial epochs and only become focused enough to make the right predictions after some training; they are therefore more data-hungry than CNNs.&lt;br&gt;
Transformers find very original and unexpected ways to look into the data (input images), as there is no element in the architecture to tell the model how to do that exactly. CNNs, however, are focused on a local view from the beginning by their convolutions.&lt;/p&gt;

&lt;p&gt;Transformers lack the inductive biases of Convolutional Neural Networks (CNNs), such as translation invariance and a locally restricted receptive field. Invariance means that you can recognize an entity (i.e. an object) in an image even when its appearance or position varies. Translation in computer vision means that each image pixel has been moved by a fixed amount in a particular direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The key takeaway of this work is the formulation of an image classification problem as a sequential problem, using image patches as tokens and processing them with a Transformer. That sounds good and simple, but it needs massive data and very high computational power. When trained on datasets with more than 14M images, ViT can approach or beat state-of-the-art CNNs.&lt;/p&gt;

&lt;p&gt;Further Reading : &lt;a href="https://arxiv.org/abs/2010.11929" rel="noopener noreferrer"&gt;Official Paper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all folks.&lt;/p&gt;

&lt;p&gt;If you have any doubts, ask me in the comments section and I'll try to answer as soon as possible.&lt;br&gt;
If you love the article, follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota" rel="noopener noreferrer"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24" rel="noopener noreferrer"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>neuralnetworks</category>
      <category>computervision</category>
    </item>
    <item>
      <title>ResNet For Beginners</title>
      <dc:creator>Rohit Gupta</dc:creator>
      <pubDate>Sat, 19 Feb 2022 07:51:00 +0000</pubDate>
      <link>https://dev.to/rohitgupta24/resnet-for-begginers-2b6e</link>
      <guid>https://dev.to/rohitgupta24/resnet-for-begginers-2b6e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;u&gt;What was the Problem&lt;/u&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When AlexNet won the ImageNet 2012 competition, everyone started using more layers in deep neural networks to reduce the error rate. This works for a small number of layers, but as we increase the number of layers, a common problem arises called the vanishing/exploding gradient: the gradient becomes 0 or too large. Thus, as we increase the number of layers, the training error decreases only up to a limit and then starts increasing.&lt;/p&gt;

&lt;p&gt;In theory, however, the performance of a neural network should increase as the number of layers increases.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KNdfYnyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6b46oi6f74k0tjor73n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KNdfYnyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6b46oi6f74k0tjor73n.png" alt="graph of training error vs number of layers" width="880" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;u&gt;How ResNet solved the Problem&lt;/u&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ResNet used the concept of "Skip Connections" and obviously it worked ;)&lt;br&gt;
For example, consider there are two neural networks as shown in image.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LdDRSJV9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/redsum33cabiv6j4rapq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LdDRSJV9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/redsum33cabiv6j4rapq.png" alt="Image description" width="696" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first neural network, the output is &lt;strong&gt;f(x)&lt;/strong&gt;, and there is no vanishing or exploding gradient problem since the number of layers is small. &lt;br&gt;
We know that when we increase the number of layers, the output changes, and this output will not be efficient due to the exploding and vanishing gradients.&lt;br&gt;
Now, in the second neural network (which is deeper, as shown in the image), we have added some more layers, and due to these layers our output changes from &lt;strong&gt;f(x)&lt;/strong&gt; to &lt;strong&gt;y = b + f(x)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;In the second output, if somehow we were able to make b = 0, then &lt;br&gt;
                      y = 0 + f(x)&lt;/code&gt;&lt;br&gt;
In this way, we would be able to make the neural network deeper and thus get a more efficient model; and since we get the same output as the shallow network, there is no effect of the vanishing or exploding gradient.&lt;/p&gt;

&lt;p&gt;In other words, if the input of the extra layers (which we added to convert the shallow network into a deeper one) equals their output, then we will get the desired results.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Main idea behind a Residual Network: make b = 0, so that y = f(x) even for deeper networks&lt;br&gt;
In shallow network: y = f(x)&lt;br&gt;
In deeper network:&lt;br&gt;
   If operations of extra layers = b,&lt;br&gt;
   then:&lt;br&gt;
          y = f(x) + b&lt;br&gt;
          y = f(x) + 0        [b = 0]&lt;br&gt;
   Input of extra layers = Output of extra layers (b = 0)&lt;/code&gt;&lt;/p&gt;
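&lt;p&gt;The skip-connection idea can be sketched as a toy NumPy snippet (my own illustration, not the actual ResNet code). It uses the standard residual formulation y = F(x) + x, where F stands for the extra layers' operations, i.e. the b of the explanation above.&lt;/p&gt;

```python
import numpy as np

def residual_block(x, F):
    """y = F(x) + x: the skip connection carries x past the extra layers.

    If the extra layers learn F(x) = 0 (the 'b = 0' case above), the block
    reduces to the identity, so the deeper network can never do worse than
    the shallow one.
    """
    return F(x) + x

x = np.array([1.0, 2.0, 3.0])

# Extra layers that have learned to output zero: the block acts as the identity.
assert np.allclose(residual_block(x, lambda v: np.zeros_like(v)), x)
```

The gradient also flows through the `+ x` path untouched, which is why very deep ResNets can still be trained.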

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;u&gt;ResNet Architecture&lt;/u&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QJDVuWQZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6kng3crckj851ezlb3pr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QJDVuWQZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6kng3crckj851ezlb3pr.png" alt="Image description" width="880" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;u&gt;How to read ResNet-50&lt;/u&gt;&lt;/em&gt;: A convolution is performed on the input image with 64 filters of size 7*7, a stride of 2, and padding of 3.&lt;br&gt;
Then a max-pooling layer is applied with a filter size of 3*3 and a stride of 2, followed by the other operations shown in the image.&lt;/p&gt;

&lt;p&gt;Dotted line shows the Convolution Block in Skip Connections.&lt;/p&gt;

&lt;p&gt;There are two types of blocks in ResNet:&lt;br&gt;
a. Identity Block: where input size == output size. Hence there is no convolution layer in the skip connection.&lt;/p&gt;

&lt;p&gt;Example : input image = 56*56*3  --&amp;gt; output image = 56*56*3 &lt;/p&gt;

&lt;p&gt;b. Convolution Block: where the input size != the output size. Hence there is a convolution layer in the skip connection.&lt;/p&gt;

&lt;p&gt;Example : input image = 56*56*3  --&amp;gt; output image = 28*28*128&lt;/p&gt;

&lt;p&gt;To make the input size == output size, we can use one of two operations: padding or a 1*1 convolution.&lt;/p&gt;

&lt;p&gt;The output size of a convolution is floor((n + 2p - f) / s) + 1. For the 1*1 convolution in the skip connection, with padding p = 0 and stride s = 2 on a 56*56 input, this gives floor((56 + 0 - 1) / 2) + 1 = 28, so we get an image of size 28*28.&lt;br&gt;
And that's how, after performing the 1*1 convolution, input size == output size.&lt;/p&gt;
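&lt;p&gt;The formula is easy to verify with a small helper function (a hypothetical utility written for this article, not part of any library):&lt;/p&gt;

```python
def conv_output_size(n, f, p, s):
    """Output spatial size of a convolution: floor((n + 2p - f) / s) + 1.

    n: input size, f: filter size, p: padding, s: stride.
    """
    return (n + 2 * p - f) // s + 1

# 1*1 convolution in the skip connection: 56*56 input, stride 2, no padding -> 28*28
print(conv_output_size(56, f=1, p=0, s=2))   # 28

# The first 7*7, stride-2, padding-3 convolution of ResNet-50 on a 224*224 input
print(conv_output_size(224, f=7, p=3, s=2))  # 112
```

The same formula explains every spatial-size change in the architecture table above.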

&lt;p&gt;What's Next :&lt;br&gt;
 Coding ResNet&lt;br&gt;
 ResNext : &lt;a href="https://arxiv.org/abs/1611.05431v2"&gt;https://arxiv.org/abs/1611.05431v2&lt;/a&gt;&lt;br&gt;
 SeResNext : &lt;a href="https://arxiv.org/abs/1709.01507v4"&gt;https://arxiv.org/abs/1709.01507v4&lt;/a&gt;&lt;br&gt;
 Original ResNet Paper : &lt;a href="https://arxiv.org/pdf/1512.03385v1.pdf"&gt;https://arxiv.org/pdf/1512.03385v1.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's all folks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have any doubts, ask me in the comments section and I'll try to answer as soon as possible.&lt;/p&gt;

&lt;p&gt;If you love the article, follow me on Twitter: &lt;a href="https://twitter.com/guptarohit_kota"&gt;https://twitter.com/guptarohit_kota&lt;/a&gt;&lt;br&gt;
If you are the Linkedin type, let's connect: &lt;a href="http://www.linkedin.com/in/rohitgupta24"&gt;www.linkedin.com/in/rohitgupta24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have an awesome day ahead 😀!&lt;/p&gt;

</description>
      <category>resnet</category>
      <category>deeplearning</category>
      <category>resnext</category>
      <category>neuralnetwork</category>
    </item>
  </channel>
</rss>
