Coursera's Deep Learning Specialization: Object Detection & Localization (Week 3)

#ai #computervision #unet #yolo

Overview

This week I learned the higher-level tasks in computer vision (CV) and how they differ, such as object detection and localization. I also learned the different use cases for plain Object Classification, for Object Localization followed by Object Classification, and for Landmark Detection. The topics I implemented in code myself are shown below along with their theory, with papers on prerequisite topics linked as needed.

Practice/Projects

Object Localization/Classification & Landmark Detection

Object Localization is showing where an object is in an input image, typically with a bounding box. A bounding box is defined as a vector that typically consists of its midpoint's coordinates, its width, and its height, normally denoted by
[bx, by, bw, bh]
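
To make the notation concrete, here is a minimal sketch of converting a midpoint-style box into corner coordinates, which is the form most overlap calculations use. The function name `midpoint_to_corners` is my own for illustration, and I am assuming all four values are expressed in the same units (for example, as fractions of the image size).

```python
def midpoint_to_corners(bx, by, bw, bh):
    """Convert a (midpoint x, midpoint y, width, height) box to corners.

    Returns (x1, y1, x2, y2), where (x1, y1) is the top-left corner and
    (x2, y2) is the bottom-right corner.
    """
    x1 = bx - bw / 2
    y1 = by - bh / 2
    x2 = bx + bw / 2
    y2 = by + bh / 2
    return x1, y1, x2, y2


# A box centered in the image, a quarter of the image wide and half as tall
print(midpoint_to_corners(0.5, 0.5, 0.25, 0.5))
```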

Object Classification is assigning a class label either to the localized object or simply to the image as a whole. Examples of the two situations are shown below.

Landmark Detection is detecting where predetermined landmarks are in the picture. A clear example is the features (landmarks) on a face, as in the image below.

Landmark Detection can allow us to conduct facial recognition and verification (which I cover and put into practice next week).

YOLO Algorithm (You Only Look Once)

YOLO is a CV object detection & localization algorithm that operates by dividing the input image into a grid and directly predicting bounding boxes and class probabilities within each grid cell. It then typically uses Non-Max Suppression to get rid of incorrect or low-confidence bounding boxes that overlap boxes that were already predicted correctly. An example of the grid placement, the bounding boxes, and the result of non-max suppression is shown in the photo below.
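
As a small illustration of the grid idea, the sketch below finds which grid cell contains an object's midpoint, since in YOLO that cell is the one responsible for predicting the object. The function name `responsible_cell` and the 3x3 grid are assumptions made for this example, and the midpoint is assumed to be normalized to the range [0, 1] relative to the whole image.

```python
def responsible_cell(bx, by, grid_size):
    """Return the (row, col) of the grid cell that contains the box midpoint.

    bx and by are normalized to [0, 1] relative to the whole image.
    A midpoint exactly on the right or bottom edge is clamped into the last cell.
    """
    col = min(int(bx * grid_size), grid_size - 1)
    row = min(int(by * grid_size), grid_size - 1)
    return row, col


# Midpoint in the horizontal middle, near the bottom of the image, on a 3x3 grid
print(responsible_cell(0.5, 0.9, 3))  # (2, 1)
```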

YOLO achieves impressive accuracy and speed simultaneously, which is why it was revolutionary in computer vision research.

U-Net (Semantic Segmentation)

In a slightly different category is semantic segmentation, which is classifying which object each individual pixel in an image belongs to. One of the most popular models for this is U-Net, whose architecture is shown below, followed by my explanation of it.

During the first half of the model, you apply the typical CNN layers, shrinking the height and width while expanding the number of channels. Then we use transpose convolutional layers to expand the height and width back to the original dimensions while shrinking the number of channels, so that the predicted segmentation can be mapped back onto the original image. The arrows across the diagram show that we concatenate the outputs from the respective layers in the first half of the model into the second half of the model using skip connections. An example output of this model is shown below, mapped onto the original image to show which parts of the image belong to which pre-classified object.
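
The shrink-then-expand pattern can be traced with plain arithmetic. The sketch below is not a model, only a shape calculator under some assumptions of mine: each encoder stage halves the height and width and doubles the channels, each decoder stage does the reverse, and the name `unet_shapes`, the depth of 4, and the 96 x 128 input with 32 starting filters are illustrative.

```python
def unet_shapes(height, width, n_filters, depth=4):
    """Trace (height, width, channels) through a U-Net-style encoder and decoder."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    h, w, c = height, width, n_filters
    encoder = []
    for _ in range(depth):
        # Feature map kept for the skip connection, before downsampling
        encoder.append((h, w, c))
        # Pooling halves height and width; the next stage doubles the channels
        h, w, c = h // 2, w // 2, c * 2
    bottleneck = (h, w, c)

    decoder = []
    for _ in range(depth):
        # Transpose convolution doubles height and width and halves the channels
        h, w, c = h * 2, w * 2, c // 2
        decoder.append((h, w, c))
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes(96, 128, 32)
print(encoder)     # [(96, 128, 32), (48, 64, 64), (24, 32, 128), (12, 16, 256)]
print(bottleneck)  # (6, 8, 512)
print(decoder)     # [(12, 16, 256), (24, 32, 128), (48, 64, 64), (96, 128, 32)]
```

Each decoder shape matches the encoder shape at the same level, which is what makes concatenating the skip connections possible.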

Practice

Car Detection YOLO

I implemented the YOLO algorithm to detect cars. In doing so, I:

-Applied a filter by threshold to delete bounding boxes with very low confidence.

-Calculated the IoU (Intersection over Union) of the predicted bounding boxes and the target bounding boxes to train the model.

-Implemented Non-Max Suppression to delete overlapping bounding boxes, ensuring each detected item had only one bounding box.
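
The three steps above can be sketched in plain Python as follows. This is a simplified, class-agnostic illustration rather than the code from my assignment: the function names, the corner-format boxes `(x1, y1, x2, y2)`, and the threshold values of 0.6 and 0.5 are all assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get an intersection of 0
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept after score filtering and NMS."""
    # Step 1: drop boxes whose confidence is below the score threshold
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: consider the most confident boxes first
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: discard remaining boxes that overlap the kept box too much
        candidates = [
            i for i in candidates if iou(boxes[best], boxes[i]) < iou_threshold
        ]
    return kept


boxes = [(0, 0, 2, 2), (0, 0, 2, 3), (5, 5, 7, 7), (0, 0, 1, 1)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In the usage example, the second box is suppressed because it overlaps the first with an IoU of about 0.67, and the last box is removed by the score filter before NMS even runs.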

Image Segmentation using U-Net

I recreated the components of the U-Net architecture using TensorFlow and then put them together using the Functional API. The function putting all the model components together is shown below, along with an image showing the image input to the model, its true mask (model target), and its predicted mask (model output).

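import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Input

# conv_block and upsampling_block are helper functions defined elsewhere.
# conv_block returns (next_layer, skip_connection), hence the [0] and [1] indexing below.
def unet_model(input_size=(96, 128, 3), n_filters=32, n_classes=23):

    inputs = Input(input_size)

    #Encoding Blocks
    cblock1 = conv_block(inputs, n_filters)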
    cblock2 = conv_block(cblock1[0], 2 * n_filters)
    cblock3 = conv_block(cblock2[0], 4 * n_filters)
    cblock4 = conv_block(cblock3[0], 8 * n_filters, dropout_prob = 0.3) 
    cblock5 = conv_block(cblock4[0], 16 * n_filters, dropout_prob = 0.3, max_pooling=None) 

    #Decoding Blocks
    ublock6 = upsampling_block(cblock5[0], cblock4[1], 8 * n_filters)
    ublock7 = upsampling_block(ublock6, cblock3[1],  4 * n_filters)
    ublock8 = upsampling_block(ublock7, cblock2[1],  2 * n_filters)
    ublock9 = upsampling_block(ublock8, cblock1[1],  n_filters)

    conv9 = Conv2D(n_filters,
                 3,
                 activation='relu',
                 padding='same', 
                 kernel_initializer='he_normal')(ublock9)
    conv10 = Conv2D(n_classes, 1, padding='same')(conv9)

    #Tensorflow's Functional API
    model = tf.keras.Model(inputs=inputs, outputs=conv10)

    return model
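
Because the final `Conv2D` layer outputs one score per class for every pixel, the predicted mask is obtained by picking the highest-scoring class at each pixel. The sketch below shows that step on nested Python lists so it runs without TensorFlow; the name `logits_to_mask` and the tiny 2x2 example with 3 classes are my own illustration. In TensorFlow the same step is typically written as `tf.argmax(prediction, axis=-1)`.

```python
def logits_to_mask(logits):
    """Convert per-pixel class scores [H][W][n_classes] into a class-index mask [H][W]."""
    # For each pixel, take the index of the largest score (first one wins on ties)
    return [[pixel.index(max(pixel)) for pixel in row] for row in logits]


# A 2x2 "image" with 3 class scores per pixel
logits = [
    [[0.1, 2.0, -1.0], [3.0, 0.2, 0.5]],
    [[0.0, 0.0, 1.5], [0.9, 1.1, 1.0]],
]
print(logits_to_mask(logits))  # [[1, 0], [2, 1]]
```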