<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Seri Lee</title>
    <description>The latest articles on DEV Community by Seri Lee (@sally20921).</description>
    <link>https://dev.to/sally20921</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F685201%2F47fb77cf-7b33-47b9-b939-15c315afcab6.png</url>
      <title>DEV Community: Seri Lee</title>
      <link>https://dev.to/sally20921</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sally20921"/>
    <language>en</language>
    <item>
      <title>Advanced Object Detection </title>
      <dc:creator>Seri Lee</dc:creator>
      <pubDate>Tue, 24 Aug 2021 09:04:32 +0000</pubDate>
      <link>https://dev.to/sally20921/advanced-object-detection-25db</link>
      <guid>https://dev.to/sally20921/advanced-object-detection-25db</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is originally from the book "Computer Vision with PyTorch"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous chapter, we learned about the R-CNN and Fast R-CNN techniques, which leverage region proposals to generate predictions of the locations of objects in an image along with the classes corresponding to those objects. Furthermore, we learned about the bottleneck in inference speed, which happens because of having two different models-one for region proposal generation and another for object detection. In this chapter, we will learn about different modern techniques, such as Faster R-CNN, YOLO, and Single-Shot Detector (SSD), that overcome slow inference time by employing a single model to make predictions for both the class of objects and the bounding box in a single shot. We will start by learning about anchor boxes and then proceed to learn how each of the techniques works and how to implement them to detect objects in an image. &lt;/p&gt;

&lt;h2&gt;
  
  
  Components of modern object detection algorithms
&lt;/h2&gt;

&lt;p&gt;The drawback of the R-CNN and Fast R-CNN techniques is that they have two disjoint networks-one to identify the regions that likely contain an object and the other to make corrections to the bounding box where an object is identified. Furthermore, both models require as many forward propagations as there are region proposals. Modern object detection algorithms focus heavily on training a single neural network and have the capability to detect all objects in one forward pass. In the subsequent sections, we will learn about the various components of a typical modern object detection algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anchor Boxes &lt;/li&gt;
&lt;li&gt;Region Proposal Network (RPN)&lt;/li&gt;
&lt;li&gt;Region of Interest (RoI) Pooling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Anchor boxes
&lt;/h3&gt;

&lt;p&gt;So far, we have region proposals coming from the selective search method. Anchor boxes come in as a handy replacement for selective search-we will learn how they replace selective search-based region proposals in this section.&lt;/p&gt;

&lt;p&gt;Typically, a majority of objects have a similar shape-for example, in a majority of cases, a bounding box corresponding to an image of a person will have a greater height than width, and a bounding box corresponding to the image of a truck will have a greater width than height. Thus, we will have a decent idea of the height and width of the objects present in an image even before training the model (by inspecting the ground truths of bounding boxes corresponding to objects of various classes). &lt;/p&gt;

&lt;p&gt;Furthermore, in some images, the objects of interest might be scaled-resulting in a much smaller or much greater height and width than average-while still maintaining the aspect ratio (that is, &lt;em&gt;height/width&lt;/em&gt;). &lt;/p&gt;

&lt;p&gt;Once we have a decent idea of the aspect ratio and the height and width of objects (which can be obtained from ground truth values in the dataset) present in our images, we define the anchor boxes with heights and widths representing the majority of objects' bounding boxes within our dataset. &lt;/p&gt;

&lt;p&gt;Typically, this is obtained by employing K-means clustering on top of the ground truth bounding boxes of objects present in images. &lt;/p&gt;
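&lt;p&gt;As a rough, minimal sketch of this step (the function name and the plain Euclidean distance are my own simplifications; real pipelines often cluster with a 1 - IoU distance instead), clustering the ground truth (width, height) pairs might look like this:&lt;/p&gt;

```python
import numpy as np

def anchor_dims_by_kmeans(boxes_wh, k=3, iters=10, seed=0):
    """Cluster ground truth (width, height) pairs to pick k anchor shapes."""
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the centers from k randomly chosen ground truth boxes
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to its nearest center, then recompute the centers
        d = ((boxes_wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = boxes_wh[labels == j].mean(0)
    return centers
```

&lt;p&gt;Each resulting center is one anchor box's (width, height).&lt;/p&gt;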

&lt;p&gt;Now that we understand how anchor boxes' heights and widths are obtained, we will learn about how to leverage them in the process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slide each anchor box over an image from top left to bottom right&lt;/li&gt;
&lt;li&gt;An anchor box that has a high Intersection over Union (IoU) with an object is labeled as containing an object, and the others are labeled 0.

&lt;ul&gt;
&lt;li&gt;We can refine this with two IoU thresholds: if the IoU with a ground truth box is greater than an upper threshold, the object class is 1; if it is less than a lower threshold, the object class is 0; and it is unknown (ignored) otherwise. Once we obtain the ground truths defined this way, we can build a model that predicts the location of an object along with the offset corresponding to the anchor box needed to match it with the ground truth. Let's understand how anchor boxes are represented in the following image:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--auFvw_sI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rftwf3b69fy5yjirk95s.png" alt="Screen Shot 2021-08-22 at 3.58.37 AM"&gt; In the preceding image, we have two anchor boxes, one that has a greater height than width and the other with a greater width than height, to correspond to the objects (classes) in the image-a person and a car. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We slide the two anchor boxes over the image and note the locations where the IoU of the anchor box with the ground truth is the highest and denote that this particular location contains an object while the rest of the locations do not contain an object.&lt;/p&gt;
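&lt;p&gt;The IoU measure used above can be computed as follows for two boxes in (x1, y1, x2, y2) format (a minimal sketch; the function name is my own):&lt;/p&gt;

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2): intersection area over union area."""
    # corners of the intersection rectangle
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

&lt;p&gt;An IoU of 1 means the anchor box coincides exactly with the ground truth; an IoU of 0 means they do not overlap at all.&lt;/p&gt;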

&lt;p&gt;In addition to the preceding two anchor boxes, we would also create anchor boxes with varying scales so that we accommodate the different scales at which an object can be presented within an image. An example of how the different scales of anchor boxes look follows: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RvlcMunG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3n7o147s5cxiwrlcnp1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RvlcMunG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3n7o147s5cxiwrlcnp1g.png" alt="Screen Shot 2021-08-22 at 4.02.33 AM"&gt;&lt;/a&gt; Note that all the anchor boxes have the same center but different aspect ratios or scales.&lt;/p&gt;

&lt;p&gt;Now that we understand anchor boxes, in the next section, we will learn about the RPN, which leverages anchor boxes to come up with predictions of regions that are likely to contain an object. &lt;/p&gt;

&lt;h3&gt;
  
  
  Region Proposal Network
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where we have a &lt;em&gt;224x224x3&lt;/em&gt; image. Furthermore, let's say that the anchor box is of shape &lt;em&gt;8x8&lt;/em&gt; for this example. If we have a stride of 8 pixels, we fetch &lt;em&gt;224/8 = 28&lt;/em&gt; crops of a picture for every row-essentially &lt;em&gt;28 x 28 = 784&lt;/em&gt; crops from a picture. We then take each of these crops and pass it through a Region Proposal Network (RPN) model that indicates whether the crop contains an object. Essentially, an RPN suggests the likelihood of a crop containing an object. &lt;/p&gt;

&lt;p&gt;Let's compare the output of &lt;code&gt;selectivesearch&lt;/code&gt; and the output of an RPN. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;selectivesearch&lt;/code&gt; gives region candidates based on a set of computations on top of pixel values, while an RPN generates region candidates based on the anchor boxes and the strides with which they are slid over the image. Once we obtain the region candidates using either of these two methods, we identify the candidates that are most likely to contain an object.&lt;/p&gt;

&lt;p&gt;While region proposal generation based on &lt;code&gt;selectivesearch&lt;/code&gt; is done outside of the neural network, we can build an RPN that is part of the object detection network. Using an RPN, we no longer have to perform unnecessary computations to calculate region proposals outside of the network. This way, we have a single model to identify regions, identify the classes of objects in the image, and identify their corresponding bounding box locations. &lt;/p&gt;

&lt;p&gt;Next, we will learn how an RPN identifies whether a region candidate (a crop obtained after sliding an anchor box) contains an object or not. In our training data, we have the ground truth bounding boxes corresponding to objects. We take each region candidate and compare it with the ground truth bounding boxes of objects in an image to check the IoU between the region candidate and each ground truth bounding box. If the IoU is greater than a certain threshold (say, 0.5), the region candidate contains an object; if the IoU is less than another threshold (say, 0.1), the region candidate does not contain an object; and all the candidates with an IoU between the two thresholds (0.1-0.5) are ignored while training. &lt;/p&gt;
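&lt;p&gt;A minimal sketch of this dual-threshold labeling scheme (the 0.5 and 0.1 values match the thresholds mentioned above; using -1 for ignored candidates is my own convention for illustration):&lt;/p&gt;

```python
import numpy as np

def label_candidates(ious, pos_thresh=0.5, neg_thresh=0.1):
    """Label each region candidate from its best IoU with any ground truth box:
    1 = contains an object, 0 = background, -1 = ignored during training."""
    ious = np.asarray(ious)
    labels = np.full(len(ious), -1)      # default: ignored
    labels[ious >= pos_thresh] = 1       # clearly an object
    labels[ious < neg_thresh] = 0        # clearly background
    return labels
```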

&lt;p&gt;Once we train a model to predict if the region candidate contains an object, we then perform non-max suppression, as multiple overlapping regions can contain an object.&lt;/p&gt;

&lt;p&gt;In summary, an RPN trains a model to enable it to identify region proposals with a high likelihood of containing an object by performing the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slide anchor boxes of different aspect ratios and sizes across the image to fetch crops of an image.&lt;/li&gt;
&lt;li&gt;Calculate the IoU between the ground truth bounding boxes of objects in the image and the crops obtained in the previous step. &lt;/li&gt;
&lt;li&gt;Prepare the training dataset in such a way that crops with an IoU greater than a threshold contain an object and crops with an IoU less than a threshold do not contain an object. &lt;/li&gt;
&lt;li&gt;Train the model to identify the regions that contain an object.&lt;/li&gt;
&lt;li&gt;Perform non-max suppression to identify the region candidate that has the highest probability of containing an object and eliminate other region candidates that have a high overlap with it. &lt;/li&gt;
&lt;/ol&gt;
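&lt;p&gt;Step 5 above, non-max suppression, can be sketched as a greedy loop: keep the highest-scoring candidate, discard every remaining candidate that overlaps it heavily, and repeat. (This is a minimal illustration; in practice, libraries such as torchvision provide an optimized &lt;code&gt;torchvision.ops.nms&lt;/code&gt;.)&lt;/p&gt;

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression over (x1, y1, x2, y2) boxes.
    Returns the indices of the boxes to keep, highest score first."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]       # candidates, best score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box with all remaining candidates
        xa = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        ya = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xb = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yb = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xb - xa, 0, None) * np.clip(yb - ya, 0, None)
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area[i] + area[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep
```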

&lt;h3&gt;
  
  
  Classification and regression
&lt;/h3&gt;

&lt;p&gt;So far, we have learned about the following steps in order to identify objects and apply offsets to bounding boxes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the regions that contain objects.&lt;/li&gt;
&lt;li&gt;Ensure that all the feature maps of regions, irrespective of the regions' shape, are exactly the same using Region of Interest (RoI) pooling.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two issues with these steps are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The region proposals do not fit tightly around the object (IoU&amp;gt;0.5 is the threshold we had in the RPN).&lt;/li&gt;
&lt;li&gt;We identified whether the region contains an object or not, but not the class of the object located in the region.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We address these two issues in this section, where we take the uniformly shaped feature map obtained previously and pass it through a network. We expect the network to predict the class of the object contained within the region and also the offsets corresponding to the region to ensure that the bounding box is as tight as possible around the object in the image. &lt;/p&gt;

&lt;p&gt;Let's understand this through the following diagram: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BjZN5UUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oohv6q89y595m641fyh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BjZN5UUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oohv6q89y595m641fyh2.png" alt="Screen Shot 2021-08-22 at 4.29.18 AM"&gt;&lt;/a&gt; In the preceding diagram, we are taking the output of RoI pooling as input (the &lt;em&gt;7x7x5x12&lt;/em&gt; shape), flattening it, and connecting to a dense layer before predicting two aspects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Class of object in the region&lt;/li&gt;
&lt;li&gt;Amount of offset to be done on the predicted bounding boxes of the region to maximize the IoU with the ground truth&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence, if there are 20 classes in the data, the output of the neural network contains a total of 25 outputs-21 classes (including the background class) and the 4 offsets to be applied to the height, width, and two center coordinates of the bounding box. &lt;/p&gt;
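&lt;p&gt;To make this concrete, a minimal PyTorch sketch of such a head follows. The 7x7x512 RoI-pooled input shape and the 256-unit dense layer are assumptions for illustration, not the exact architecture from the diagram:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of a classification + regression head: a shared dense trunk
    feeding two branches, one for (num_classes + background) class scores
    and one for the 4 bounding box offsets."""
    def __init__(self, num_classes=20, in_features=7 * 7 * 512):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(in_features, 256), nn.ReLU())
        self.cls = nn.Linear(256, num_classes + 1)  # 21 outputs incl. background
        self.reg = nn.Linear(256, 4)                # offsets for center x, y, w, h

    def forward(self, roi_feats):
        h = self.fc(roi_feats)
        return self.cls(h), self.reg(h)
```

&lt;p&gt;For 20 classes this produces the 25 outputs described above: 21 class scores and 4 offsets per region.&lt;/p&gt;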

&lt;p&gt;Now that we have learned the different components of an object detection pipeline, let's summarize it with the following diagram:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ju9CEaXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/es6wcqd4296guqk98skq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ju9CEaXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/es6wcqd4296guqk98skq.png" alt="Screen Shot 2021-08-22 at 5.37.51 AM"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Working details of YOLO
&lt;/h2&gt;

&lt;p&gt;You Only Look Once (YOLO) and its variants are among the most prominent object detection algorithms. In this section, we will understand at a high level how YOLO works and the potential limitations of R-CNN-based object detection frameworks that YOLO overcomes. &lt;/p&gt;

&lt;p&gt;First, let's learn about the possible limitations of R-CNN-based detection algorithms. In Faster R-CNN, we slide over the image using anchor boxes and identify the regions that are likely to contain an object, and then we make the bounding box corrections. However, the fully connected layers receive only the detected region's RoI pooling output as input. For regions that do not fully encompass the object (where the object extends beyond the boundaries of the region proposal's bounding box), the network has to guess the real boundaries of the object, as it has seen only the region proposal and not the full image. &lt;/p&gt;

&lt;p&gt;YOLO comes in handy in such scenarios, as it looks at the whole image while predicting the bounding boxes. &lt;/p&gt;

&lt;p&gt;Furthermore, Faster R-CNN is still slow, as we have two networks: the RPN and the final network that predicts classes and bounding boxes around objects.&lt;/p&gt;

&lt;p&gt;Here, we will understand how YOLO overcomes the limitations of Faster R-CNN, both by looking at the whole image at once as well as by having a single network to make predictions. &lt;/p&gt;

&lt;p&gt;We will look at how data is prepared for YOLO through the following example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create ground truth to train a model for a given image:

&lt;ul&gt;
&lt;li&gt;Let's consider an image with the given ground truth of bounding boxes in red:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--efS0_JAC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5asytnz6gvw0yl85m4z6.png" alt="Screen Shot 2021-08-24 at 5.58.42 PM"&gt;
&lt;/li&gt;
&lt;li&gt;Divide the image into &lt;em&gt;NxN&lt;/em&gt; grid cells-for now, let's say &lt;em&gt;N=3&lt;/em&gt;:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WsjRiVMS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w25i680el9u9lo40a16.png" alt="Screen Shot 2021-08-24 at 5.59.19 PM"&gt;
&lt;/li&gt;
&lt;li&gt;Identify those grid cells that contain the center of at least one ground truth bounding box. In our case, they are cells &lt;em&gt;b1&lt;/em&gt; and &lt;em&gt;b3&lt;/em&gt; of our &lt;em&gt;3x3&lt;/em&gt; grid image. &lt;/li&gt;
&lt;li&gt;The cell(s) containing the mid-point of a ground truth bounding box is/are responsible for predicting the bounding box of the object. Let's create the ground truth corresponding to each cell. &lt;/li&gt;
&lt;li&gt;The output ground truth corresponding to each cell is as follows: 
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Mcl2ZYq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v2cizjupac9w1xyfdqbz.png" alt="Screen Shot 2021-08-24 at 6.02.05 PM"&gt;
Here, &lt;em&gt;pc&lt;/em&gt; (the objectness score) is the probability of the cell containing an object.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's understand how to calculate &lt;em&gt;bx&lt;/em&gt;, &lt;em&gt;by&lt;/em&gt;, &lt;em&gt;bw&lt;/em&gt; and &lt;em&gt;bh&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;First, we consider the grid cell (let's consider the &lt;em&gt;b1&lt;/em&gt; grid cell) as our universe, and normalize it to a scale between 0 and 1, as follows:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Bp99BZy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3uwelfmvggabpqil0ik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Bp99BZy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3uwelfmvggabpqil0ik.png" alt="Screen Shot 2021-08-24 at 6.03.32 PM"&gt;&lt;/a&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKzsv1j2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekybpo128jsf8zpirtfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKzsv1j2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekybpo128jsf8zpirtfj.png" alt="Screen Shot 2021-08-26 at 3.17.10 AM"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;bx&lt;/em&gt; and &lt;em&gt;by&lt;/em&gt; are the locations of the mid-point of the ground truth bounding box with respect to the grid cell, as defined previously. In our case, &lt;em&gt;bx = 0.5&lt;/em&gt;, as the mid-point of the ground truth is at a distance of &lt;em&gt;0.5&lt;/em&gt; units from the origin. Similarly, &lt;em&gt;by = 0.5&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So far, we have calculated the location of the ground truth bounding box's center within its grid cell. Now, let's understand how &lt;em&gt;bw&lt;/em&gt; and &lt;em&gt;bh&lt;/em&gt; are calculated. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;bw&lt;/em&gt; is the ratio of the width of the bounding box with respect to the width of the grid cell.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;bh&lt;/em&gt; is the ratio of the height of the bounding box with respect to the height of the grid cell.&lt;/p&gt;
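&lt;p&gt;The &lt;em&gt;bx&lt;/em&gt;, &lt;em&gt;by&lt;/em&gt;, &lt;em&gt;bw&lt;/em&gt;, and &lt;em&gt;bh&lt;/em&gt; calculations above can be sketched as follows (the function name and the square-image assumption are my own, for illustration):&lt;/p&gt;

```python
def yolo_cell_target(box, img_size, n=3):
    """Encode a ground truth (x1, y1, x2, y2) box on an n x n grid.
    Returns the (row, col) of the responsible cell and (bx, by, bw, bh):
    bx, by locate the box center inside that cell on a 0-1 scale, and
    bw, bh are the box width/height as ratios of the cell width/height."""
    x1, y1, x2, y2 = box
    cell = img_size / n
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2     # ground truth mid-point
    col, row = int(cx // cell), int(cy // cell)  # responsible cell
    bx, by = (cx - col * cell) / cell, (cy - row * cell) / cell
    bw, bh = (x2 - x1) / cell, (y2 - y1) / cell
    return (row, col), (bx, by, bw, bh)
```

&lt;p&gt;For a box whose mid-point sits exactly at the center of its cell, this returns &lt;em&gt;bx = by = 0.5&lt;/em&gt;, matching the example above.&lt;/p&gt;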

&lt;p&gt;Next, we will predict the class corresponding to the grid cell. If we have three classes, we will predict the probability of the cell containing an object among any of the three classes. Note that we do not need a background class here, as &lt;em&gt;pc&lt;/em&gt; corresponds to whether the grid cell contains an object. &lt;/p&gt;

&lt;p&gt;Now that we understand how to represent the output layer of each cell, let's understand how we construct the output of our &lt;em&gt;3x3&lt;/em&gt; grid cells. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Let's consider the output of the grid cell &lt;em&gt;a3&lt;/em&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YXlLm_Xq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e3dytbitx3ba58kv8lmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YXlLm_Xq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e3dytbitx3ba58kv8lmw.png" alt="Screen Shot 2021-08-26 at 3.25.03 AM"&gt;&lt;/a&gt;&lt;br&gt;
The output of cell &lt;em&gt;a3&lt;/em&gt; is as shown in the preceding screenshot. As the grid cell does not contain the center of any ground truth bounding box of an object, the first output (&lt;em&gt;pc&lt;/em&gt;, the objectness score) is 0 and the remaining values do not matter. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let's consider the output corresponding to grid cell &lt;em&gt;b1&lt;/em&gt;:&lt;br&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W4KpPvje--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4741u5yijrdrannravh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W4KpPvje--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4741u5yijrdrannravh6.png" alt="Screen Shot 2021-08-26 at 3.26.40 AM"&gt;&lt;/a&gt;&lt;br&gt;
The preceding output is as shown because the grid cell contains an object: the &lt;em&gt;bx&lt;/em&gt;, &lt;em&gt;by&lt;/em&gt;, &lt;em&gt;bw&lt;/em&gt;, and &lt;em&gt;bh&lt;/em&gt; values are obtained in the same way as we went through earlier, and the class is &lt;em&gt;car&lt;/em&gt;, resulting in &lt;em&gt;c2&lt;/em&gt; being 1 while &lt;em&gt;c1&lt;/em&gt; and &lt;em&gt;c3&lt;/em&gt; are 0.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that for each cell, we fetch 8 outputs. Hence, for a &lt;em&gt;3x3&lt;/em&gt; grid of cells, we fetch &lt;em&gt;3x3x8&lt;/em&gt; outputs. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Define a model where the input is an image and the output is &lt;em&gt;3x3x8&lt;/em&gt; with the ground truth being as defined in the previous step.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qVSfWH9e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2k1wxr7ofzlbtf1bmxb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qVSfWH9e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2k1wxr7ofzlbtf1bmxb9.png" alt="Screen Shot 2021-08-26 at 3.29.25 AM"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the ground truth by considering the anchor boxes. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So far, we have been building for a scenario where the expectation is that there is only one object within a grid cell. However, in reality, there can be scenarios where there are multiple objects within the same grid cell. This would result in creating ground truths that are incorrect. Let's understand this phenomenon through the following example image:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkUf7-GN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a8mlsj8l1bqa7vlhir1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkUf7-GN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a8mlsj8l1bqa7vlhir1.png" alt="Screen Shot 2021-08-26 at 3.33.55 AM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the preceding example, the mid-points of the ground truth bounding boxes for both the car and the person fall in the same cell-cell &lt;em&gt;b1&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One way to avoid such a scenario is by having a grid that has more rows and columns-for example, a &lt;em&gt;19x19&lt;/em&gt; grid. However, there can still be a scenario where an increase in the number of grid cells does not help. Anchor boxes come in handy in such a scenario. Let's say we have two anchor boxes-one that has a greater height than width (corresponding to the person) and another that has a greater width than height (corresponding to the car):&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z8CV5Eg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbctd62w9kcvvv7n8xkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z8CV5Eg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbctd62w9kcvvv7n8xkp.png" alt="Screen Shot 2021-08-26 at 3.36.18 AM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typically, the anchor boxes have the grid cell center as their centers. The output for each cell in a scenario where we have two anchor boxes is represented as a concatenation of the outputs expected for the two anchor boxes:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N-lVFybJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/him66hxrdlj5xkw6v8nm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N-lVFybJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/him66hxrdlj5xkw6v8nm.png" alt="Screen Shot 2021-08-26 at 3.37.12 AM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;em&gt;bx&lt;/em&gt;, &lt;em&gt;by&lt;/em&gt;, &lt;em&gt;bw&lt;/em&gt;, and &lt;em&gt;bh&lt;/em&gt; represent the offset from the anchor box (which is now the universe, as seen in the image, instead of the grid cell). &lt;/p&gt;

&lt;p&gt;From the preceding screenshot, we see that we have an output that is &lt;em&gt;3x3x16&lt;/em&gt;, as we have two anchor boxes. In general, the expected output is of the shape &lt;em&gt;NxNx(5+num_classes)xnum_anchor_boxes&lt;/em&gt;, where &lt;em&gt;NxN&lt;/em&gt; is the number of cells in the grid, &lt;em&gt;num_classes&lt;/em&gt; is the number of classes in the dataset, &lt;em&gt;num_anchor_boxes&lt;/em&gt; is the number of anchor boxes, and 5 accounts for the objectness score and the four bounding box values.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now we define the loss function to train the model. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When calculating the loss associated with the model, we need to ensure that we do not calculate the regression loss and classification loss when the objectness score is less than a certain threshold (this corresponds to the cells that do not contain an object).  &lt;/p&gt;

&lt;p&gt;Next, if the cell contains an object, we need to ensure that the classification across different classes is as accurate as possible. &lt;/p&gt;

&lt;p&gt;Finally, if the cell contains an object, the bounding box offsets should be as close to expected as possible. However, since the offsets of width and height can be much higher when compared to the offsets of the center (as the offsets of the center range between 0 and 1, while the offsets of width and height need not), we give a lower weightage to the width and height offsets by taking their square roots.   &lt;/p&gt;

&lt;p&gt;Calculate the loss of localization and classification as follows: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b7zUGAHH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ikn7tga28mgamfzxv0he.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b7zUGAHH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ikn7tga28mgamfzxv0he.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ThSuKuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5aupv42v3xsrarb97abp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ThSuKuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5aupv42v3xsrarb97abp.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we observe the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;lambda_coordinate&lt;/em&gt; is the weightage associated with the regression loss. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;object_ij&lt;/em&gt; indicates whether anchor box &lt;em&gt;j&lt;/em&gt; of cell &lt;em&gt;i&lt;/em&gt; contains an object. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;hat_p_i(c)&lt;/em&gt; corresponds to the predicted class probability, and &lt;em&gt;C_ij&lt;/em&gt; represents the objectness score. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall loss is a sum of classification and regression loss values. &lt;/p&gt;
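&lt;p&gt;The localization term discussed above can be sketched for a single cell that contains an object. This is a minimal illustration, not the full YOLO loss (it omits the objectness and classification terms shown in the preceding equations); the function name &lt;em&gt;yolo_box_loss&lt;/em&gt; is hypothetical, and &lt;em&gt;lambda_coord=5&lt;/em&gt; is the weightage used in the original YOLO paper:&lt;/p&gt;

```python
import math

def yolo_box_loss(pred, target, lambda_coord=5.0):
    """Localization loss for one cell that contains an object.

    pred/target are (x, y, w, h): x, y are the center offsets within the
    cell (between 0 and 1); w, h are the box width and height.
    """
    (x, y, w, h), (tx, ty, tw, th) = pred, target
    # squared error on the center offsets
    loss = (x - tx) ** 2 + (y - ty) ** 2
    # square roots give a lower weightage to the width/height offsets
    loss += (math.sqrt(w) - math.sqrt(tw)) ** 2 + (math.sqrt(h) - math.sqrt(th)) ** 2
    return lambda_coord * loss
```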

&lt;h2&gt;
  
  
  Working details of SSD
&lt;/h2&gt;

&lt;p&gt;So far, we have seen a scenario where we made predictions after gradually convolving and pooling the output from the previous layer. However, we know that different layers have different receptive fields with respect to the original image. For example, the initial layers have a smaller receptive field when compared to the final layers, which have a larger receptive field. Here, we will learn how SSD leverages this phenomenon to come up with predictions of bounding boxes for images. &lt;/p&gt;

&lt;p&gt;The way SSD overcomes the issue of detecting objects at different scales is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We leverage the pre-trained VGG network and extend it with a few additional layers until we obtain a &lt;em&gt;1x1&lt;/em&gt; block. &lt;/li&gt;
&lt;li&gt;Instead of leveraging only the final layer for bounding box and class predictions, we will leverage all of the last few layers to make class and bounding box predictions. &lt;/li&gt;
&lt;li&gt;In place of anchor boxes, we will come up with default boxes that have a specific set of scales and aspect ratios.&lt;/li&gt;
&lt;li&gt;Each of the default boxes should predict a class and a bounding box offset, just as anchor boxes are expected to predict classes and offsets in YOLO. &lt;/li&gt;
&lt;/ul&gt;
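&lt;p&gt;The idea of making predictions from multiple layers can be sketched in PyTorch as follows. This is a simplified illustration rather than the exact SSD implementation: the class name &lt;em&gt;MultiScaleHeads&lt;/em&gt;, the channel sizes, and &lt;em&gt;k&lt;/em&gt; (the number of default boxes per cell) are assumptions made for the sketch:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """Class + box predictions from feature maps of different resolutions."""
    def __init__(self, in_channels, n_classes, k=4):
        super().__init__()
        self.n_classes, self.k = n_classes, k
        # one 3x3 conv head per source layer, predicting k default boxes
        # per cell, each with n_classes scores and 4 offsets
        self.heads = nn.ModuleList(
            nn.Conv2d(c, k * (n_classes + 4), kernel_size=3, padding=1)
            for c in in_channels
        )

    def forward(self, feature_maps):
        outputs = []
        for feat, head in zip(feature_maps, self.heads):
            pred = head(feat)          # (B, k*(n_classes+4), H, W)
            B, _, H, W = pred.shape
            outputs.append(
                pred.permute(0, 2, 3, 1).reshape(B, H * W * self.k, self.n_classes + 4)
            )
        return torch.cat(outputs, dim=1)   # one row per default box
```

&lt;p&gt;Note how coarser feature maps contribute fewer default boxes, which is what lets the later layers specialize in larger objects.&lt;/p&gt;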

&lt;p&gt;Now that we understand the main ways in which SSD differs from YOLO (default boxes in SSD replace anchor boxes in YOLO, and SSD makes predictions from multiple feature layers instead of only the final layer), let's learn about the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The network architecture of SSD&lt;/li&gt;
&lt;li&gt;How to leverage different layers for bounding box and class predictions &lt;/li&gt;
&lt;li&gt;How to assign scale and aspect ratios for default boxes in different layers&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Basics of Object Detection (Part 2)</title>
      <dc:creator>Seri Lee</dc:creator>
      <pubDate>Tue, 17 Aug 2021 21:40:00 +0000</pubDate>
      <link>https://dev.to/sally20921/basics-of-object-detection-part-2-5325</link>
      <guid>https://dev.to/sally20921/basics-of-object-detection-part-2-5325</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is originally from the book "Modern Computer Vision with PyTorch"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will hold off on building a model until the forthcoming sections, as training a model is more involved and we would have to learn a few more components first. In the next section, we will learn about non-max suppression, which helps shortlist from the different possible predicted bounding boxes around an object when inferring on a new image using the trained model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-max suppression
&lt;/h2&gt;

&lt;p&gt;Imagine a scenario where multiple region proposals are generated and significantly overlap one another. Essentially, all the predicted bounding box coordinates (offsets to region proposals) significantly overlap one another. For example, let's consider the following image, where multiple region proposals are generated for the person in the image:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2jMCxxL_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpf3s955obq2vn1hm3u4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2jMCxxL_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpf3s955obq2vn1hm3u4.png" alt="Screen Shot 2021-08-16 at 3.26.02 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the preceding image, I ask you to identify, among the many region proposals, the box that we will consider as the one containing an object, and the boxes that we will discard. Non-max suppression comes in handy in such a scenario. Let's unpack the term "Non-max suppression".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-max&lt;/strong&gt; refers to the boxes that do not contain the highest probability of containing an object, and &lt;strong&gt;suppression&lt;/strong&gt; refers to us discarding those boxes that do not contain the highest probability of containing an object. In non-max suppression, we identify the bounding box that has the highest probability and discard all the other bounding boxes that have an IoU greater than a certain threshold with the box containing the highest probability of containing an object. &lt;/p&gt;

&lt;p&gt;In PyTorch, non-max suppression is performed using the &lt;em&gt;nms&lt;/em&gt; function in the &lt;em&gt;torchvision.ops&lt;/em&gt; module. The &lt;em&gt;nms&lt;/em&gt; function takes the bounding box coordinates, the confidence of the object in the bounding box, and the threshold of IoU across bounding boxes, to identify the bounding boxes to be retained. You will be leveraging the &lt;em&gt;nms&lt;/em&gt; function when predicting object classes and bounding boxes of objects in a new image in both the &lt;em&gt;Training R-CNN-based custom object detectors&lt;/em&gt; and &lt;em&gt;Training Fast R-CNN-based custom object detectors&lt;/em&gt; sections. &lt;/p&gt;
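&lt;p&gt;The logic behind &lt;em&gt;nms&lt;/em&gt; can be sketched in a few lines of pure Python. This is an illustrative re-implementation, not the optimized &lt;em&gt;torchvision&lt;/em&gt; version; in practice you would call &lt;em&gt;torchvision.ops.nms&lt;/em&gt; directly:&lt;/p&gt;

```python
def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold):
    # keep the highest-scoring box, discard boxes overlapping it beyond
    # the threshold, and repeat with the remaining boxes
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

&lt;p&gt;For example, for two heavily overlapping boxes and one disjoint box, only the higher-scoring overlapping box and the disjoint box survive.&lt;/p&gt;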
&lt;h2&gt;
  
  
  Mean average precision
&lt;/h2&gt;

&lt;p&gt;So far, we have looked at getting an output that comprises a bounding box around each object within the image and the class corresponding to the object within the bounding box. Now comes the next question: How do we quantify the accuracy of the predictions coming from our model?&lt;/p&gt;

&lt;p&gt;mAP comes to the rescue in such a scenario. Before we try to understand mAP, let's first understand precision, then average precision, and finally, mAP: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Typically we calculate precision as&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4FZiTsuJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezcr7ic7tyfprxjh36g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4FZiTsuJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezcr7ic7tyfprxjh36g4.png" alt="Screen Shot 2021-08-16 at 5.20.32 PM"&gt;&lt;/a&gt;&lt;br&gt;
A true positive refers to the bounding boxes that predicted the correct class of objects and that have an IoU with the ground truth that is greater than a certain threshold. A false positive refers to the bounding boxes that predicted the class incorrectly or have an overlap that is less than the defined threshold with the ground truth. Furthermore, if multiple bounding boxes are identified for the same ground truth bounding box, only one box can count as a true positive, and every other one counts as a false positive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average precision is the average of precision values calculated at various IoU thresholds. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mAP is the average of precision values calculated at various IoU threshold values across all the classes of objects present within the dataset. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
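&lt;p&gt;As a minimal illustration of the precision computation above (the &lt;em&gt;matches&lt;/em&gt; representation, one (predicted class, matched ground truth class, IoU) triple per predicted box, is an assumption made for this sketch):&lt;/p&gt;

```python
def precision(matches, iou_threshold=0.5):
    """matches: one (predicted_class, true_class, iou) triple per prediction."""
    true_positives = sum(
        1 for pred, true, iou in matches
        if pred == true and iou >= iou_threshold
    )
    # every prediction that is not a true positive counts as a false positive
    return true_positives / len(matches)
```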

&lt;p&gt;So far, we have learned about preparing a training dataset for our model, performing non-max suppression on the model's predictions, and calculating its accuracies. In the following sections, we will learn about training a model (R-CNN-based and Fast R-CNN-based) to detect objects in new images.&lt;/p&gt;
&lt;h2&gt;
  
  
  Training R-CNN-based custom object detectors
&lt;/h2&gt;

&lt;p&gt;R-CNN stands for Region-based Convolutional Neural Network. Region-based within R-CNN refers to the region proposals that are used to identify objects within an image. Note that R-CNN assists in identifying both the objects present in the image and their locations within the image. &lt;/p&gt;

&lt;p&gt;In the following sections, we will learn about the working details of R-CNN before training it on our custom dataset. &lt;/p&gt;
&lt;h3&gt;
  
  
  Working details of R-CNN
&lt;/h3&gt;

&lt;p&gt;Let's get an idea of R-CNN-based object detection at a high level using the following diagram:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8DfsV_vJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5p9v429m2a71ue72mmwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8DfsV_vJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5p9v429m2a71ue72mmwh.png" alt="Screen Shot 2021-08-16 at 5.34.26 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We perform the following steps when leveraging the R-CNN technique for object detection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract region proposals from an image: ensure that we extract a high number of proposals to not miss out on any potential object within the image. &lt;/li&gt;
&lt;li&gt;Resize (warp) all the extracted regions to get images of the same size.&lt;/li&gt;
&lt;li&gt;Pass the resized region proposals through a network: typically we pass the resized region proposals through a pretrained model such as VGG16 or ResNet50 and extract the features in a fully connected layer. &lt;/li&gt;
&lt;li&gt;Create data for model training, where the input is features extracted by passing the region proposals through a pretrained model, and the outputs are the class corresponding to each region proposal and the offset of the region proposal from the ground truth corresponding to the image. If a region proposal has an IoU greater than a certain threshold with the object, we prepare training data in such a way that the region is responsible for predicting the class of object it is overlapping with and also the offset of region proposal with the ground truth bounding box that contains the object of interest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A sample result of creating a bounding box offset and a ground truth class for a region proposal is as follows:&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--725ip7ov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4yxfk45vadn3525yaw06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--725ip7ov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4yxfk45vadn3525yaw06.png" alt="Screen Shot 2021-08-17 at 9.59.48 AM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the preceding image, o (in red) represents the center of the region proposal (dotted bounding box) and x represents the center of the ground truth bounding box (solid bounding box) corresponding to the cat class. We calculate the offset between the region proposal bounding box and the ground truth bounding box as the difference between the center coordinates of the two bounding boxes ($dx, dy$) and the difference between the height and width of the bounding boxes ($dw, dh$).  &lt;/p&gt;
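&lt;p&gt;The offset computation described above can be sketched as follows. &lt;em&gt;bbox_offsets&lt;/em&gt; is a hypothetical helper name; boxes are assumed to be in (x1, y1, x2, y2) format, and the offsets are plain differences, following the simplified formulation used in this chapter rather than the normalized offsets of the original paper:&lt;/p&gt;

```python
def bbox_offsets(proposal, ground_truth):
    # both boxes are (x1, y1, x2, y2)
    pcx, pcy = (proposal[0] + proposal[2]) / 2, (proposal[1] + proposal[3]) / 2
    gcx, gcy = (ground_truth[0] + ground_truth[2]) / 2, (ground_truth[1] + ground_truth[3]) / 2
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gw, gh = ground_truth[2] - ground_truth[0], ground_truth[3] - ground_truth[1]
    # (dx, dy): shift of the center; (dw, dh): change in width/height
    return gcx - pcx, gcy - pcy, gw - pw, gh - ph
```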

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;Connect two output heads, one predicting the class of the region proposal and the other predicting the offsets of the region proposal from the ground truth bounding box, in order to extract a tight bounding box around the object. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Train the model after writing a custom loss function that minimizes both the object classification error and the bounding box offset error. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that the loss function that we will minimize differs from the loss function that is optimized in the original paper. We are doing this to reduce the complexity associated with building R-CNN and Fast R-CNN from scratch. Once the reader is familiar with how the model works and can build a model using the following code, we highly encourage them to implement the original paper from scratch.&lt;/p&gt;

&lt;p&gt;In the next section, we will learn about fetching datasets and creating data for training. In the section after that, we will learn about designing the model and training it before predicting the class of objects present and their bounding boxes in a new image.&lt;/p&gt;
&lt;h3&gt;
  
  
  Implementing R-CNN for object detection on a custom dataset
&lt;/h3&gt;

&lt;p&gt;Implementing R-CNN involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;downloading the dataset&lt;/li&gt;
&lt;li&gt;preparing the dataset&lt;/li&gt;
&lt;li&gt;defining the region proposals extraction and IoU calculation functions&lt;/li&gt;
&lt;li&gt;creating input data for the model, resizing the region proposals, passing them through a pretrained model to fetch the fully connected layer values&lt;/li&gt;
&lt;li&gt;labelling each region proposal with a class or background label, defining the offset of the region proposal from the ground truth if the region proposal corresponds to an object and not background&lt;/li&gt;
&lt;li&gt;defining and training the model&lt;/li&gt;
&lt;li&gt;predicting on new images&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Downloading the dataset
&lt;/h4&gt;

&lt;p&gt;We will download the data from the &lt;a href="https://storage.googleapis.com/openimages/v5/test-annotations-bbox.csv"&gt;Google Open Images v6 dataset&lt;/a&gt;. In code, we will work on only those images that contain a bus or a truck to ensure that we can train the model within memory limits (as you will shortly notice, there are memory issues associated with using &lt;em&gt;selectivesearch&lt;/em&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; selectivesearch torch_snippets
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.kaggle
&lt;span class="nb"&gt;mv &lt;/span&gt;kaggle.json ~/.kaggle
&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.kaggle
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 /root/.kaggle/kaggle.json
kaggle datasets download &lt;span class="nt"&gt;-d&lt;/span&gt; sixhky/open-images-bus-trucks/
unzip &lt;span class="nt"&gt;-qq&lt;/span&gt; open-images-bus-trucks.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;By now, we have defined all the functions necessary to prepare data and initialize data loaders. In the next section, we will fetch region proposals (input regions to the model) and the ground truth of the bounding box offset along with the class of object (expected output).&lt;/p&gt;

&lt;h3&gt;
  
  
  Fetching region proposals and the ground truth of offset
&lt;/h3&gt;

&lt;p&gt;In this section, we will learn about creating the input and output values for our model. The input constitutes the candidates extracted using the &lt;em&gt;selectivesearch&lt;/em&gt; method, and the output constitutes the class corresponding to each candidate and, if the candidate contains an object, its offset with respect to the ground truth bounding box it overlaps the most. &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  R-CNN network architecture
&lt;/h3&gt;

&lt;p&gt;In this section, we will learn about building a model that can predict both the class of region proposal and the offset corresponding to it in order to draw a tight bounding box around the object in the image. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define a VGG backbone.&lt;/li&gt;
&lt;li&gt;Fetch the features post passing the normalized crop through a pretrained model.&lt;/li&gt;
&lt;li&gt;Attach a linear layer with sigmoid activation to the VGG backbone to predict the class corresponding to the region proposal. &lt;/li&gt;
&lt;li&gt;Attach an additional linear layer to predict the four bounding box offsets. &lt;/li&gt;
&lt;li&gt;Define the loss calculations for each of the two outputs (one to predict class and the other to predict the four bounding box offsets)&lt;/li&gt;
&lt;li&gt;Train the model that predicts both the class of region proposals and the four bounding box offsets.&lt;/li&gt;
&lt;/ol&gt;
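&lt;p&gt;The two-output-head design above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: a tiny stand-in backbone replaces the pretrained VGG (which would be loaded from &lt;em&gt;torchvision&lt;/em&gt; and frozen in practice), and a plain linear class head with cross-entropy loss stands in for the sigmoid-activated head mentioned in step 3:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class RCNN(nn.Module):
    def __init__(self, n_classes, feat_dim=512):
        super().__init__()
        # tiny stand-in for the pretrained VGG backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.class_head = nn.Linear(feat_dim, n_classes)   # class scores
        self.bbox_head = nn.Linear(feat_dim, 4)            # four offsets
        # one loss per output head; their sum is minimized during training
        self.class_loss = nn.CrossEntropyLoss()
        self.bbox_loss = nn.L1Loss()

    def forward(self, crops):
        feats = self.backbone(crops)
        return self.class_head(feats), self.bbox_head(feats)
```

&lt;p&gt;A forward pass over a batch of resized region-proposal crops returns one class score vector and one 4-value offset vector per crop.&lt;/p&gt;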

&lt;h3&gt;
  
  
  Predict on a new image
&lt;/h3&gt;

&lt;p&gt;In this section, we will leverage the model trained so far to predict and draw bounding boxes around objects and the corresponding class of object within the predicted bounding box in new images.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract region proposals from the new image.&lt;/li&gt;
&lt;li&gt;Resize and normalize each crop.&lt;/li&gt;
&lt;li&gt;Feed-forward the preprocessed crops to make predictions of class and the offsets.&lt;/li&gt;
&lt;li&gt;Perform non-max suppression to fetch only those boxes that have the highest confidence of containing an object.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Basics of Object Detection (Part 1)</title>
      <dc:creator>Seri Lee</dc:creator>
      <pubDate>Mon, 16 Aug 2021 06:20:38 +0000</pubDate>
      <link>https://dev.to/sally20921/basics-of-object-detection-part-1-1i52</link>
      <guid>https://dev.to/sally20921/basics-of-object-detection-part-1-1i52</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is originally from the book "Modern Computer Vision with PyTorch"&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine a scenario where we are leveraging computer vision for a self-driving car. It is not only necessary to detect whether the image of a road contains the images of vehicles, a sidewalk, and pedestrians, but it is also important to identify &lt;em&gt;where&lt;/em&gt; those objects are located. Various techniques of object detection that we will study in this article will come in handy in such a scenario.&lt;/p&gt;

&lt;p&gt;With the rise of autonomous cars, facial detection, smart video surveillance, and people-counting solutions, fast and accurate object detection systems are in great demand. These systems involve not only classifying the objects in an image, but also locating each of the objects by drawing appropriate bounding boxes around them. This (drawing bounding boxes and classification) makes object detection a harder task than its traditional computer vision predecessor, image classification. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JhcLAK3U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qgc1cav8ch27bdqxo9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JhcLAK3U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qgc1cav8ch27bdqxo9n.png" alt="Screen Shot 2021-08-16 at 2.10.18 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand what the output of object detection looks like, let's go through the preceding diagram. In it, we can see that, while typical object classification merely mentions the class of object present in the image, object localization draws a bounding box around the objects present in the image. Object detection, on the other hand, involves drawing bounding boxes around the individual objects in the image, along with identifying the class of object within each bounding box when multiple objects are present in the image. &lt;/p&gt;

&lt;p&gt;Training a typical object detection model involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating ground truth data that contains labels of the bounding box and class corresponding to various objects present in the image.&lt;/li&gt;
&lt;li&gt;Coming up with mechanisms that scan through the image to identify regions (region proposals) that are likely to contain objects. In this article, we will learn about leveraging region proposals generated by a method named &lt;em&gt;selective search&lt;/em&gt;. Also, we will learn about leveraging anchor boxes to identify regions containing objects. Moreover, we will learn about leveraging positional embeddings in transformers to aid in identifying the regions containing an object.&lt;/li&gt;
&lt;li&gt;Creating the target class variable by using the IoU metric. &lt;/li&gt;
&lt;li&gt;Creating the target bounding box offset variable to make corrections to the location of region proposal coming in the second step.&lt;/li&gt;
&lt;li&gt;Building a model that can predict the class of object along with the target bounding box offset corresponding to the region proposal. &lt;/li&gt;
&lt;li&gt;Measuring the accuracy of object detection using mean Average Precision (mAP).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Creating a bounding box ground truth for training
&lt;/h2&gt;

&lt;p&gt;We have learned that object detection gives us an output where a bounding box surrounds each object of interest in an image. To build an algorithm that detects such bounding boxes, we have to create input-output combinations, where the input is the image and the output is the bounding boxes surrounding the objects in the given image, along with the classes corresponding to the objects. &lt;/p&gt;

&lt;p&gt;To train a model that provides the bounding box, we need the image, and also the corresponding bounding box coordinates of all the objects in an image. In this section, we will learn about one way to create the training dataset, where the image is the input and the corresponding bounding boxes and classes of objects are stored in an XML file as output. We will use the &lt;em&gt;ybat&lt;/em&gt; tool to annotate the bounding boxes and the corresponding classes.&lt;/p&gt;

&lt;p&gt;Let's go through installing and using &lt;em&gt;ybat&lt;/em&gt; to create (annotate) bounding boxes around objects in an image. Furthermore, we will inspect the XML files that contain the annotated class and bounding box information. &lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the image annotation tool
&lt;/h3&gt;

&lt;p&gt;Let's start by downloading &lt;em&gt;ybat-master.zip&lt;/em&gt; from the following &lt;a href="https://github.com/drainingsun/ybat"&gt;GitHub repository&lt;/a&gt; and unzipping it. Post unzipping, store it in a folder of your choice. Open &lt;em&gt;ybat.html&lt;/em&gt; using a browser of your choice and you will see an empty page. The following screenshot shows a sample of what the folder looks like and how to open the &lt;em&gt;ybat.html&lt;/em&gt; file. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--esScc7EH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tjwg8i3fj893vvv9xz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--esScc7EH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tjwg8i3fj893vvv9xz5.png" alt="Screen Shot 2021-08-16 at 2.25.01 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we start creating the ground truth corresponding to an image, let's specify all the possible classes that we want to label across images and store them in the &lt;em&gt;classes.txt&lt;/em&gt; file as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qxdna4if--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o8nryuirlg2m9dvrm28u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qxdna4if--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o8nryuirlg2m9dvrm28u.png" alt="Screen Shot 2021-08-16 at 2.25.58 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's prepare the ground truth corresponding to an image. This involves drawing a bounding box around each object and assigning labels/classes to the objects present in the image in the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upload all the images you want to annotate.&lt;/li&gt;
&lt;li&gt;Upload the &lt;em&gt;classes.txt&lt;/em&gt; file.&lt;/li&gt;
&lt;li&gt;Label each image by first selecting the filename and then drawing a crosshair around each object you want to label. Before drawing a crosshair, ensure you select the correct class in the classes region. &lt;/li&gt;
&lt;li&gt;Save the data dump in the desired format. Each format was independently developed by a different research team, and all are equally valid. Based on their popularity and convenience, every implementation prefers a different format. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, when we download the PASCAL VOC format, it downloads a zip of XML files. A snapshot of an XML file after drawing a rectangular bounding box is as follows: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LegDokCp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nggc8hy4aapsip1dodbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LegDokCp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nggc8hy4aapsip1dodbh.png" alt="Screen Shot 2021-08-16 at 2.33.05 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the preceding screenshot, note that the &lt;em&gt;bndbox&lt;/em&gt; field contains the coordinates of the minimum and maximum values of the &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; coordinates corresponding to the object of interest in the image. We should also be able to extract the classes corresponding to the objects in the image using the &lt;em&gt;name&lt;/em&gt; field.&lt;/p&gt;

&lt;p&gt;Now that we understand how to create a ground truth of objects (class labels and bounding box) present in an image, in the following sections, we will dive into the building blocks of recognizing objects in an image. First, we will talk about region proposals that help in highlighting the portions of the image that are most likely to contain an object. &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding region proposals
&lt;/h2&gt;

&lt;p&gt;Imagine a hypothetical scenario where the image of interest contains a person and sky in the background. Furthermore, for this scenario, let's assume that there is little change in pixel intensity of the background and that there is considerable change in pixel intensity of the foreground. &lt;/p&gt;

&lt;p&gt;Just from the preceding description itself, we can conclude that there are two primary regions here-one is of the person and the other is of the sky. Furthermore, within the region of the image of a person, the pixels corresponding to hair will have a different intensity to the pixels corresponding to the face, establishing that there can be multiple sub-regions within a region. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Region proposal&lt;/strong&gt; is a technique that helps in identifying islands of regions where the pixels are similar to one another. &lt;/p&gt;

&lt;p&gt;Generating region proposals comes in handy for object detection, where we have to identify the locations of objects present in the image. Furthermore, since a proposal is generated for each region, region proposals aid in object localization, where the task is to identify a bounding box that fits exactly around the object in the image. We will learn how region proposals assist in object localization and detection in the later section &lt;em&gt;Training R-CNN based custom object detectors&lt;/em&gt;, but let's first understand how to generate region proposals from an image. &lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging Selective Search to generate region proposals
&lt;/h3&gt;

&lt;p&gt;Selective Search is a region proposal algorithm used for object localization, where it generates proposals of regions that are likely to be grouped together based on their pixel intensities. Selective Search groups pixels based on hierarchical grouping of similar pixels, which, in turn, leverages the color, texture, size, and shape compatibility of content within an image. &lt;/p&gt;

&lt;p&gt;Initially, Selective Search over-segments an image by grouping pixels based on the preceding attributes. Next, it iterates through these over-segmented groups and groups them based on similarity. At each iteration, it combines smaller regions to form a larger region. &lt;/p&gt;

&lt;p&gt;Let's understand the &lt;em&gt;selective search&lt;/em&gt; process through the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## dependencies &lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;selectivesearch
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch_snippets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torch_snippets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;selectivesearch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;skimage.segmentation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;felzenszwalb&lt;/span&gt;

&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Hemanvi.jpeg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## extract the felzenszwalb segments (which are obtained based on the color, texture, size and shape compatibility of content within an image) from the image
&lt;/span&gt;&lt;span class="n"&gt;segments_fz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;felzenszwalb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## scale represents the number of clusters that can be formed within the segments of the image. The higher the value of scale, the greater the detail of the original image that is preserved.
&lt;/span&gt;
&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segments_fz&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Original Image'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Image post &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;felzenszwalb segmentation'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The preceding code results in the following output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---pvaZRKs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t722n9ve56fny0b72fs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---pvaZRKs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t722n9ve56fny0b72fs9.png" alt="Screen Shot 2021-08-16 at 2.51.38 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the preceding output, note that pixels that belong to the same group have similar pixel values. &lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing Selective Search to generate region proposals
&lt;/h3&gt;

&lt;p&gt;In this section, we will define the &lt;em&gt;extract_candidates&lt;/em&gt; function using &lt;em&gt;selectivesearch&lt;/em&gt; so that it can be leveraged in the subsequent sections on training R-CNN and Fast R-CNN-based custom object detectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torch_snippets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;selectivesearch&lt;/span&gt; 

&lt;span class="c1"&gt;# define the function that takes an image as the input parameter
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="c1"&gt;# fetch the candidate regions within the image using the selective_search method available in the selectivesearch package
&lt;/span&gt; &lt;span class="n"&gt;img_lbl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;selectivesearch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectivesearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="c1"&gt;# calculate the image area and initialize a list (candidates) that we will use to store the candidates that pass a defined threshold
&lt;/span&gt; &lt;span class="n"&gt;img_area&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

 &lt;span class="c1"&gt;# fetch only those candidates (regions) that are over 5% of the total image area and less than or equal to 100% of the image area and return them
&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rect'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;continue&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'size'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;img_area&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;continue&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'size'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;img_area&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
   &lt;span class="k"&gt;continue&lt;/span&gt;
  &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rect'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rect'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt; 

&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Hemanvi.jpeg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bbs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The preceding code generates the following output:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NQdYPFBV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nl7cz9ccrvtsnu4cco0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NQdYPFBV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nl7cz9ccrvtsnu4cco0o.png" alt="Screen Shot 2021-08-16 at 3.01.11 PM"&gt;&lt;/a&gt;&lt;br&gt;
The grid in the preceding diagram represents the candidate regions (region proposals) coming from the &lt;em&gt;selective_search&lt;/em&gt; method.&lt;/p&gt;

&lt;p&gt;Now that we understand region proposal generation, one question remains unanswered: how do we leverage region proposals for object detection and localization?&lt;/p&gt;

&lt;p&gt;A region proposal that has a high intersection with the location (ground truth) of an object in the image of interest is labeled as the one that contains the object, and a region proposal with a low intersection is labeled as background.&lt;/p&gt;
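As a sketch of how this labeling might look in code (the 0.5 threshold and all the names here are illustrative assumptions, not from the book):

```python
def iou(boxA, boxB):
    # boxes are (x1, y1, x2, y2); union = areaA + areaB - overlap
    x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    overlap = max(0, x2 - x1) * max(0, y2 - y1)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    return overlap / (areaA + areaB - overlap + 1e-5)

def label_proposals(proposals, ground_truth, threshold=0.5):
    # a proposal with a high IoU against the ground truth is labeled as
    # containing the object; everything else is treated as background
    return ['object' if iou(p, ground_truth) >= threshold else 'background'
            for p in proposals]
```

A proposal almost coinciding with the ground truth would be labeled 'object', while a far-away proposal would be labeled 'background'.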

&lt;p&gt;In the next section, we will learn about how to calculate the intersection of a region proposal candidate with a ground truth bounding box in our journey to understand the various techniques that form the backbone of building an object detection model.&lt;/p&gt;
&lt;h3&gt;
  
  
  Understanding IoU
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where we have come up with a prediction of a bounding box for an object. How do we measure the accuracy of our prediction? This is where the concept of &lt;strong&gt;Intersection over Union (IoU)&lt;/strong&gt; comes in handy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Intersection&lt;/em&gt; measures how much the predicted and actual bounding boxes overlap, while &lt;em&gt;Union&lt;/em&gt; measures the overall space the two boxes span together. IoU is the ratio of the overlapping region between the two bounding boxes to the combined region of both bounding boxes. &lt;/p&gt;

&lt;p&gt;This can be represented in a diagram as follows: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P-UbSa2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d0d0zn6d8j85okg7jf5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P-UbSa2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d0d0zn6d8j85okg7jf5p.png" alt="Screen Shot 2021-08-16 at 3.07.25 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the preceding diagram of two bounding boxes (rectangles), let's consider the left bounding box as the ground truth and the right bounding box as the predicted location of the object. IoU as a metric is the ratio of the overlapping region to the combined region of the two bounding boxes. &lt;/p&gt;

&lt;p&gt;In the following diagram, you can observe the variation in the IoU metric as the overlap between bounding boxes varies:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XfJB292B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ynbi6rt8yj18j1hm59ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XfJB292B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ynbi6rt8yj18j1hm59ba.png" alt="Screen Shot 2021-08-16 at 3.09.26 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the preceding diagram, we can see that as the overlap decreases, IoU decreases and, in the final one, where there is no overlap, the IoU metric is 0.&lt;/p&gt;

&lt;p&gt;Now that we have an intuition for measuring IoU, let's implement it in code and create a function to calculate IoU, as we will leverage it in the sections on training R-CNN and training Fast R-CNN.&lt;/p&gt;

&lt;p&gt;Let's define a function that takes two bounding boxes as input and returns IoU as the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# specify the get_iou function that takes boxA and boxB as inputs where boxA and boxB are two different bounding boxes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_iou&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="c1"&gt;# we define the epsilon parameter to address the rare scenario when the union between the two boxes is 0, resulting in a division by zero error. 
&lt;/span&gt; &lt;span class="c1"&gt;# note that in each of the bounding boxes, there will be four values corresponding to the four corners of the bounding box
&lt;/span&gt;
 &lt;span class="c1"&gt;# calculate the coordinates of the intersection box
&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

 &lt;span class="c1"&gt;# note that x1 is storing the maximum value of the left-most x-value between the two bounding boxes. y1 is storing the topmost y-value and x2 and y2 are storing the right-most x-value and bottom-most y-value, respectively, corresponding to the intersection part.
&lt;/span&gt;
 &lt;span class="c1"&gt;# calculate the width and height corresponding to the intersection area (overlapping region):
&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="c1"&gt;# calculate the area of overlap
&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
 &lt;span class="n"&gt;area_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;

 &lt;span class="c1"&gt;# if we specify that if the width or height corresponding to the overlapping region is less than 0, the area of intersection is 0. Otherwise, we calculate the area of overlap (intersection) similar to the way a rectangular's area is calculated. 
&lt;/span&gt;
 &lt;span class="c1"&gt;# calculate the combined area corresponding to the two bounding boxes
&lt;/span&gt; &lt;span class="n"&gt;area_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;boxA&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;area_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;boxB&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="n"&gt;area_combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;area_a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;area_b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;area_overlap&lt;/span&gt;

 &lt;span class="n"&gt;iou&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;area_overlap&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area_combined&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;iou&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
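As a quick, self-contained sanity check of the IoU definition, consider two 10x10 boxes that overlap in a 5x5 region: the intersection is 25 and the union is 100 + 100 - 25 = 175, so IoU works out to roughly 0.143:

```python
# Standalone numeric check of the IoU definition:
# two 10x10 boxes, offset by 5 pixels in both x and y.
boxA = [0, 0, 10, 10]
boxB = [5, 5, 15, 15]

x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
area_overlap = max(0, x2 - x1) * max(0, y2 - y1)    # 5 * 5 = 25

area_a = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])  # 100
area_b = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])  # 100
area_combined = area_a + area_b - area_overlap      # union = 175

iou = area_overlap / (area_combined + 1e-5)
print(round(iou, 3))  # 0.143
```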



</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Padding in Neural Network</title>
      <dc:creator>Seri Lee</dc:creator>
      <pubDate>Fri, 13 Aug 2021 05:50:15 +0000</pubDate>
      <link>https://dev.to/sally20921/padding-in-neural-network-o8m</link>
      <guid>https://dev.to/sally20921/padding-in-neural-network-o8m</guid>
      <description>&lt;p&gt;&lt;em&gt;this post is originally from &lt;a href="https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/"&gt;https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is padding and why do we need it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bPWzBbR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfk5meh3lavt1ie1awk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bPWzBbR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfk5meh3lavt1ie1awk9.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
What you see on the left is an RGB input image-width

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, height 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;HH&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and three channels. Hence, this layer is likely the first layer in your model; in any other scenario, you'd have feature maps as the input to your layer.&lt;/p&gt;

&lt;p&gt;Now, what is a feature map? That's the yellow block in the image. It's a collection of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 two-dimensional maps that each represent a particular feature that the model has spotted within the image. This is why convolutional layers are known as feature extractors.&lt;/p&gt;

&lt;p&gt;Now, this is very nice-but how do we get from input (whether image or feature map) to a feature map? This is through &lt;em&gt;kernels&lt;/em&gt; or &lt;em&gt;filters&lt;/em&gt;, actually. These filters-you configure some number 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 per convolutional layer-"slide" (strictly, convolve) over your input data, and have the same number of "channel" dimensions as your input data, but much smaller widths and heights. For example, in the scenario above, a filter may be 3x3 pixels wide and high, but it always has 3 channels, as our input has 3 channels, too. &lt;/p&gt;

&lt;p&gt;Now, when they slide over the input-from left to right horizontally, then moving down vertically after a row has been captured-they perform element-wise multiplications between what's currently under investigation within the input data and the weights present within the filter. These weights play the same role as the weights of a "classic" neural network, but are structured in a different way. Hence, optimizing a ConvNet involves computing a loss value for the model and subsequently using an optimizer to change the weights. &lt;/p&gt;
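To make the sliding and the element-wise multiplications concrete, here is a minimal pure-Python sketch of a single-channel convolution (frameworks implement this with optimized tensor operations; the function name is our own):

```python
def conv2d_valid(image, kernel):
    # slide the kernel over the image (left-to-right, then top-to-bottom),
    # multiplying it element-wise with the patch underneath and summing
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# a 4x4 "image" convolved with a 2x2 kernel yields a 3x3 feature map
image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_valid(image, kernel)
```

Note that the 4x4 input already shrinks to a 3x3 output here, which is exactly the effect discussed next.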

&lt;p&gt;Through these weights, as you may guess, the model learns to detect the presence of particular features-which once again, are represented by the feature maps. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conv layers might induce spatial hierarchy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---qiXh0Bt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mbrlzyacs7ao7lhdggk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---qiXh0Bt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mbrlzyacs7ao7lhdggk.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the width and/or height of your kernels is above 1, you'll see the width and height of the feature map (being the output) getting smaller. This occurs because the kernel slides over the input and computes element-wise multiplications, but is too large to inspect the "edges" of the input. This is illustrated in the image, where the "red" position is impossible for the kernel to take, while the "green" ones are part of the path of the convolution operation. &lt;/p&gt;

&lt;p&gt;As it cannot capture the edges, it won't be able to effectively "end" at the final position of your row, resulting in a smaller output width and/or height. &lt;/p&gt;
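The shrinkage follows the standard output-size formula, out = floor((n - k + 2p) / s) + 1, for input size n, kernel size k, padding p, and stride s; a small sketch:

```python
def conv_output_size(n, k, p=0, s=1):
    # standard formula: floor((n - k + 2p) / s) + 1
    return (n - k + 2 * p) // s + 1

# a 3x3 kernel without padding shrinks a 28-pixel input to 26 pixels
print(conv_output_size(28, 3))        # 26
# with 1 pixel of padding on each side, the input size is preserved
print(conv_output_size(28, 3, p=1))   # 28
```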

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oKqjNCMU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oisqs844ibs8iylbk241.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oKqjNCMU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oisqs844ibs8iylbk241.png" alt="image"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We call this a spatial hierarchy. Indeed, convolutional layers may cause a "hierarchy"-like flow of data through the model. Here, you have a schematic representation of a substantial hierarchy and a less substantial one-which is often considered to be less efficient. &lt;/p&gt;

&lt;h2&gt;
  
  
  Padding avoids the loss of spatial dimensions
&lt;/h2&gt;

&lt;p&gt;Sometimes, however, you need to apply filters of a fixed size, but you don't want to lose width and/or height dimensions in your feature maps. For example, this is the case when you're training an autoencoder. You need the output images to be of the same size as the input, yet need an activation function like Sigmoid in order to generate them.&lt;/p&gt;

&lt;p&gt;If you do so with a Conv layer, this would be problematic, as you'd reduce the size of your feature maps-and hence would produce outputs unequal in size to your inputs. &lt;/p&gt;

&lt;p&gt;That's not what we want when we create an autoencoder. We want the output at the original size, and the original size only.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ruJUPSk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pn9zlb45b6j56w1rsv96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ruJUPSk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pn9zlb45b6j56w1rsv96.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Padding helps you solve this problem. Applying it effectively adds space around your input data or your feature map-or more precisely, "extra rows and columns". &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5zE4GoKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tqzb6ie1gjl20qrfvb87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5zE4GoKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tqzb6ie1gjl20qrfvb87.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The consequences of this are rather pleasant, as we can see in the example above. Adding "extra space" now allows us to capture the position we previously couldn't capture, and allows us to detect features at the edges of your input. &lt;/p&gt;

&lt;h2&gt;
  
  
  Types of padding
&lt;/h2&gt;

&lt;p&gt;Now, unfortunately, padding is not a binary option-i.e. it cannot simply be turned on and off. Rather, you can choose which padding you use. &lt;/p&gt;

&lt;h3&gt;
  
  
  Valid padding/no padding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--56tTcBga--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aft3yq6pezdainqmyhs2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--56tTcBga--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aft3yq6pezdainqmyhs2.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Valid padding simply means "no padding". This equals the scenario where capturing edges only is not possible.  &lt;/p&gt;

&lt;p&gt;It may seem strange to you that frameworks include an option for valid padding/no padding, as you could simply omit the padding as well. However, this is not strange at all: if you specify some &lt;em&gt;padding&lt;/em&gt; attribute, there must be a default value. &lt;/p&gt;

&lt;h3&gt;
  
  
  Same padding/zero padding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YmWgKj0V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46n68s58zhyee1kzb5r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YmWgKj0V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46n68s58zhyee1kzb5r0.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another option would be "same padding" also known as "zero padding". Here, the padding ensures that the output has the same shape as the input data. It is achieved by adding "zeros" at the edge of your layer output, e.g. the white space on the right of the image. &lt;/p&gt;

&lt;h3&gt;
  
  
  Causal padding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3vf2Unp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/57qoj8s07hdqb18ce4sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3vf2Unp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/57qoj8s07hdqb18ce4sh.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose that you have a time series dataset, where two inputs together determine an output, in a causal fashion. It's possible to create a model that can handle this by means of a &lt;em&gt;Conv1D&lt;/em&gt; layer with a kernel size of 2-the learnt kernel will be able to map the inputs to the outputs successfully. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WQrr5Cfd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fd87ac807qhr2elca845.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WQrr5Cfd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fd87ac807qhr2elca845.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what about the first two targets? Although they are valid targets, the inputs are incomplete-that is, there is insufficient input data available to successfully use them in the training process. For the second target, one input-visible in gray-is missing (whereas the other is actually there), while for the first target, both are missing. &lt;/p&gt;

&lt;p&gt;For the first target, there is no real hope for success (as we don't have any input at all and hence do not know which values produce the target values), but for the second, we have a partial picture: we've got half the inputs that produce the target. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vlXItmae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/12c17lpuk1010470tw0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vlXItmae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/12c17lpuk1010470tw0d.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Causal padding on the &lt;em&gt;Conv1D&lt;/em&gt; layer allows you to include the partial information in your training process. By padding your input dataset with zeros at the front, a causal mapping to the first, missed-out targets can be made. While the first target will be useless for training, the second can now be used based on the partial information that we have. &lt;/p&gt;
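A minimal sketch of the idea (the names here are our own; Keras exposes this behavior as padding='causal' on Conv1D): pad k - 1 zeros at the front, so each output depends only on current and past inputs while the output length matches the input length:

```python
def causal_conv1d(x, kernel):
    # pad (k - 1) zeros at the FRONT only, then slide the kernel;
    # output[t] therefore depends only on x[..t], never on the future
    k = len(kernel)
    padded = [0] * (k - 1) + list(x)
    return [sum(padded[t + i] * kernel[i] for i in range(k))
            for t in range(len(x))]

y = causal_conv1d([1, 2, 3, 4], [0.5, 0.5])  # k=2, so one zero at the front
# y[0] is computed from (0, 1) alone: the first target uses only the partial
# information that is available, exactly as described above
```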

&lt;h3&gt;
  
  
  Reflection padding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ENjk_7PR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/16zxbk5x7jegfof422bk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ENjk_7PR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/16zxbk5x7jegfof422bk.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another type of padding is "reflection padding". As you can see, it pads with the "reflection" or "mirror" of the values just inside the edge of the shape to be padded. &lt;/p&gt;

&lt;p&gt;For example, look at the first row of the yellow box in the image above: moving to the right, you reach a 1 at the edge. Now you need to fill the padding element directly to the right of it. What do you find when you move back in the opposite direction, away from the edge? Indeed, a 5. Hence, your first padding value is a 5. Move one step further inward and you find a 3, so the next padding value after the 5 is a 3. And so on. In the opposite direction, you get the mirrored effect: with a 3 at the edge, you'll once again find a 5 (as it's the center value), but the second padding value will be a 1, and so on. &lt;/p&gt;
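&lt;p&gt;The same worked example can be reproduced with NumPy's &lt;code&gt;np.pad&lt;/code&gt; in &lt;code&gt;reflect&lt;/code&gt; mode. This is a one-row sketch of the behavior, with the row values taken from the example above, not the full 2D case:&lt;/p&gt;

```python
import numpy as np

# One row of the example: a 3 at the left edge, a 5 in the center,
# and a 1 at the right edge.
row = np.array([3, 5, 1])

# Reflection padding mirrors the values around the edge element,
# excluding the edge element itself.
padded = np.pad(row, pad_width=2, mode="reflect")
print(padded)  # [1 5 3 5 1 5 3]
```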

&lt;p&gt;Reflection padding often seems to improve the empirical performance of your model. Possibly, this occurs because "zero" based padding (i.e. "same" padding) and "constant" based padding alter the distribution of your dataset. This becomes clear when we actually visualize the padding as it is applied:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---aFm-cLa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwx4fxqmdol3abxug82p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---aFm-cLa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwx4fxqmdol3abxug82p.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication Padding/Symmetric Padding
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZepxDJqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7k6ri6ibzgn5pd1mv1gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZepxDJqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7k6ri6ibzgn5pd1mv1gi.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Replication padding looks like reflection padding, but is slightly different: rather than mirroring around the edge value, you mirror the values including the edge value itself. Like this: you're at the first row again, at the right edge, where you find a 1. What is the next value? Simple: you copy the row, mirror it starting at the edge, and add the result as padding values. As you can see, since we only pad 2 elements in width, the padding contains the 1 and the 5, while the 3 falls outside of it. &lt;/p&gt;

&lt;p&gt;As with reflection padding, replication padding attempts to reduce the impact of "zero" and "constant" padding on the quality of your data by re-using "plausible" values that already lie along the borders of the input. &lt;/p&gt;
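&lt;p&gt;In NumPy terms, the "mirror including the edge" behavior described above corresponds to &lt;code&gt;mode='symmetric'&lt;/code&gt;. Note that PyTorch's &lt;em&gt;ReplicationPad&lt;/em&gt; layers instead repeat the edge value itself, which NumPy calls &lt;code&gt;mode='edge'&lt;/code&gt;. A sketch with the same hypothetical row as before:&lt;/p&gt;

```python
import numpy as np

row = np.array([3, 5, 1])

# Symmetric padding: mirror the values, including the edge value.
sym = np.pad(row, pad_width=2, mode="symmetric")
print(sym)  # [5 3 3 5 1 1 5]

# Edge (replication) padding: repeat the edge value itself.
rep = np.pad(row, pad_width=2, mode="edge")
print(rep)  # [3 3 3 5 1 1 1]
```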

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ATJ9QkBk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sswemg4r58mbai40ilo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ATJ9QkBk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sswemg4r58mbai40ilo2.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Which padding to use when?
&lt;/h2&gt;

&lt;p&gt;There are no hard criteria that prescribe when to use which type of padding. Rather, it's important to understand that padding matters pretty much all the time, because it allows you to preserve information that is present at the borders of your input data, and present there only.  &lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
