DEV Community: Andrey Germanov

A simple way to extract all detected objects from an image and save them as separate files using YOLOv8.2 and OpenCV

Andrey Germanov — Sun, 18 Aug 2024 13:45:51 +0000

Introduction
Sample image
Detect objects using YOLOv8
    More about different YOLOv8 models
    Run the model to detect objects
Parse detection results
    Extract objects with background
    Extract objects without background
Conclusion

Introduction

In this tutorial, I will show how to detect all objects in an image using a neural network, extract them, and save them to separate files.

This is a common task, and there are many ways to do it. In this article, I will show a very simple one, using the YOLOv8 neural network and OpenCV.

This tutorial covers only this topic, so if you want to dive deeper into the YOLOv8 neural network and computer vision, read the previous articles from my YOLOv8 series.

All code in this article is written in Python, so I assume that you are able to develop in Python. I also use Jupyter Notebook, but it's not required: you can use any IDE or text editor to write and run the code.

Sample image

During this tutorial, we will detect and extract objects from an image using the YOLOv8 neural network and OpenCV. As an example, we will use the following image from a Wikipedia page:

Source: https://en.wikipedia.org/wiki/Vehicular_cycling

We will detect all people and cars in this image, then extract them and save them to separate image files. I will show how to save the extracted objects both with and without the background, so you can choose whichever option fits your needs.

Detect objects using YOLOv8

So, let's get started. The first thing you need to do is install the YOLOv8.2 package, if you do not have it yet. To do this, run the following in your Jupyter notebook:



%pip install ultralytics

Then import the YOLOv8 API:



from ultralytics import YOLO

After it's done, let's load the YOLOv8 neural network model:



model = YOLO("yolov8m-seg.pt")

This line of code downloads the yolov8m-seg.pt neural network model, if it is not already on disk, and loads it into the model variable.

More about different YOLOv8 models

In this tutorial, we will use one of the pretrained YOLOv8 models, which can detect 80 common object classes. YOLOv8 models come in three types and five sizes:

Classification	Detection	Segmentation	Size
yolov8n-cls.pt	yolov8n.pt	yolov8n-seg.pt	Nano
yolov8s-cls.pt	yolov8s.pt	yolov8s-seg.pt	Small
yolov8m-cls.pt	yolov8m.pt	yolov8m-seg.pt	Medium
yolov8l-cls.pt	yolov8l.pt	yolov8l-seg.pt	Large
yolov8x-cls.pt	yolov8x.pt	yolov8x-seg.pt	Huge

The bigger the model, the better the results, but the slower it runs.

The three types are classification, object detection, and instance segmentation. Classification models only predict the class of the whole image, so they can't be used for our task. Object detection models find the bounding boxes of objects. You can use them to get the x1,y1,x2,y2 coordinates of each object and use these coordinates to extract the object with its background. Finally, segmentation models detect not only the bounding boxes of objects but also their exact shapes (bounding polygons). Using a bounding polygon, you can extract an object without its background.

In the code above, I loaded the medium-sized segmentation model, yolov8m-seg.pt, which can be used to extract objects both with and without the background.
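The naming pattern in the table can be sketched in a few lines. The helper below is hypothetical (it is not part of the ultralytics package); it only builds the file name that you would pass to YOLO(...):

```python
# Hypothetical helper: builds a pretrained weight file name from the table above.
SIZES = {"nano": "n", "small": "s", "medium": "m", "large": "l", "huge": "x"}
TASK_SUFFIX = {"classification": "-cls", "detection": "", "segmentation": "-seg"}


def weight_file_name(task: str, size: str) -> str:
    """Return a name such as 'yolov8m-seg.pt' for the given task and size."""
    return f"yolov8{SIZES[size]}{TASK_SUFFIX[task]}.pt"


print(weight_file_name("segmentation", "medium"))  # yolov8m-seg.pt
print(weight_file_name("detection", "nano"))       # yolov8n.pt
```

For example, weight_file_name("segmentation", "medium") gives the name of the model used in this tutorial.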

To detect specific object classes that do not exist in the pretrained models, you can create and train your own model, save it to a .pt file, and load it. Read the first part of my YOLOv8 series to learn how to do this.

Run the model to detect objects

To detect objects in images, pass a list of image file names to the model object. It returns a list of results, one for each image:



results = model(["road.jpg"])

This code assumes that the sample image is saved to the road.jpg file. If you pass a single image in the list, the results list will contain a single element. If you pass more images, it will contain one element per image.

Parse detection results

Now, let's get the detection results for the first image (road.jpg):



result = results[0]

The result is an object of the ultralytics.engine.results.Results class, which contains various information about the objects detected in the image.

The Ultralytics documentation for this class describes all of its methods and properties, but here we need only a few of them:

result.boxes.xyxy - array of bounding boxes for all objects, detected on the image.
result.masks.xy - array of bounding polygons for all objects, detected on the image.

For example, result.boxes.xyxy[0] contains the [x1,y1,x2,y2] coordinates of the first object detected in the image:



print(result.boxes.xyxy[0])



tensor([2251.1409, 1117.8158, 3216.7141, 1744.1128], device='cuda:0')

and the bounding polygon for the same object:



print(result.masks.xy[0])



[[       2500        1125]
 [     2493.8      1131.2]
 [     2481.2      1131.2]
 [       2475      1137.5]
 [     2468.8      1137.5]
 [     2462.5      1143.8]
 [     2456.2      1143.8]
 [     2418.8      1181.2]
 [     2418.8      1187.5]
 [     2381.2        1225]
 [     2381.2      1231.2]
 [       2350      1262.5]
 [       2350      1268.8]
 [     2337.5      1281.2]
 [     2337.5      1287.5]
 [       2325        1300]
 [       2325      1306.2]
 [     2306.2        1325]
 [     2306.2      1331.2]
 [       2300      1337.5]
 [       2300      1343.8]
 [     2287.5      1356.2]
 [     2287.5      1362.5]
 [     2281.2      1368.8]
 [     2281.2      1387.5]
 [       2275      1393.8]
 [       2275        1700]
 [     2281.2      1706.2]
 [     2281.2      1712.5]
 [     2287.5      1718.8]
 [     2356.2      1718.8]
 [     2368.8      1706.2]
 [     2368.8        1700]
 [     2381.2      1687.5]
 [     2381.2      1681.2]
 [     2393.8      1668.8]
 [     2393.8      1662.5]
 [     2412.5      1643.8]
 [     2456.2      1643.8]
 [     2462.5        1650]
 [     2468.8        1650]
 [     2481.2      1662.5]
 [     2562.5      1662.5]
 [     2568.8      1656.2]
 [       2575      1656.2]
 [     2581.2        1650]
 [     2712.5        1650]
 [     2718.8      1656.2]
 [     2737.5      1656.2]
 [     2743.8      1662.5]
 [     2768.8      1662.5]
 [       2775      1668.8]
 [     2831.2      1668.8]
 [     2837.5        1675]
 [     2868.8        1675]
 [       2875      1681.2]
 [     2887.5      1681.2]
 [       2900      1693.8]
 [     2906.2      1693.8]
 [     2912.5        1700]
 [     2918.8        1700]
 [     2931.2      1712.5]
 [     2931.2      1718.8]
 [     2937.5      1718.8]
 [       2950      1731.2]
 [     2956.2      1731.2]
 [     2962.5      1737.5]
 [     3018.8      1737.5]
 [     3018.8      1731.2]
 [     3037.5      1712.5]
 [     3037.5      1706.2]
 [     3043.8        1700]
 [     3043.8      1681.2]
 [       3050        1675]
 [       3050      1668.8]
 [     3056.2      1662.5]
 [     3062.5      1662.5]
 [     3068.8      1668.8]
 [     3081.2      1668.8]
 [       3100      1687.5]
 [     3106.2      1687.5]
 [     3112.5      1693.8]
 [       3175      1693.8]
 [     3181.2      1687.5]
 [     3187.5      1687.5]
 [     3193.8      1681.2]
 [     3193.8      1662.5]
 [       3200      1656.2]
 [       3200      1562.5]
 [     3193.8      1556.2]
 [     3193.8        1500]
 [     3187.5      1493.8]
 [     3187.5      1468.8]
 [     3181.2      1462.5]
 [     3181.2      1437.5]
 [       3175      1431.2]
 [       3175      1418.8]
 [     3168.8      1412.5]
 [     3168.8        1400]
 [     3143.8        1375]
 [     3143.8      1368.8]
 [       3125        1350]
 [       3125      1343.8]
 [     3112.5      1331.2]
 [     3112.5        1325]
 [       3100      1312.5]
 [       3100      1306.2]
 [     3087.5      1293.8]
 [     3087.5      1287.5]
 [       3075        1275]
 [       3075      1268.8]
 [     3068.8      1262.5]
 [     3068.8      1256.2]
 [       3050      1237.5]
 [       3050      1231.2]
 [     3006.2      1187.5]
 [     3006.2      1181.2]
 [       3000        1175]
 [       3000      1168.8]
 [     2993.8      1162.5]
 [     2993.8      1156.2]
 [     2987.5        1150]
 [     2987.5      1143.8]
 [       2975      1143.8]
 [     2968.8      1137.5]
 [       2950      1137.5]
 [     2943.8      1131.2]
 [     2868.8      1131.2]
 [     2862.5        1125]]

This is a list of [x,y] coordinates for all points in the polygon.
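To see how the polygon relates to the bounding box, here is a small sketch that uses only a few extreme points copied from the output above (not the full polygon) and checks that they fall inside the box printed earlier. For this object they do; treat it as an illustration rather than a guarantee for every detection.

```python
import numpy as np

# A few extreme points copied from the polygon printed above
polygon = np.array([
    [2500.0, 1125.0],   # one of the topmost points
    [2275.0, 1393.8],   # one of the leftmost points
    [3018.8, 1737.5],   # one of the bottommost points
    [3200.0, 1656.2],   # one of the rightmost points
], dtype=np.float32)

# Bounding box printed above for the same object: [x1, y1, x2, y2]
box = np.array([2251.1409, 1117.8158, 3216.7141, 1744.1128], dtype=np.float32)

# The tightest rectangle around the polygon points
poly_x1, poly_y1 = polygon.min(axis=0)
poly_x2, poly_y2 = polygon.max(axis=0)

# True if that rectangle lies inside the bounding box
inside = bool(box[0] <= poly_x1 and box[1] <= poly_y1 and
              poly_x2 <= box[2] and poly_y2 <= box[3])
print(poly_x1, poly_y1, poly_x2, poly_y2, inside)
```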

You can see below the bounding box and the bounding polygon for the first detected object:

Bounding box	Bounding polygon

As you might guess, to extract the object with the background you can use the bounding box, but to extract the object without the background, you need the bounding polygon.

To extract all objects and save them to separate files, you need to run the code for each detected object in a loop.

In the next sections, I will show how to do both.

Extract objects with background

First, I will show how to crop a single object using the coordinates of its bounding box. Then, we will write a loop to extract all detected objects.

In the previous section, we got the bounding box of the first detected object as result.boxes.xyxy[0]. It contains an [x1,y1,x2,y2] array of coordinates. However, this is a PyTorch tensor with float32 values, while array indices must be integers. Let's convert the tensor to integer coordinates:



x1,y1,x2,y2 = result.boxes.xyxy[0].cpu().numpy().astype(int)
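Note that astype(int) drops the fractional part instead of rounding. The sketch below shows the difference on the values printed earlier. A plain NumPy array stands in for the tensor here, so it runs without a model, and rounding with np.rint is only an alternative you may prefer, not what the code in this article does.

```python
import numpy as np

# Values copied from the tensor printed earlier; a plain NumPy array stands in
# for result.boxes.xyxy[0].cpu().numpy()
box = np.array([2251.1409, 1117.8158, 3216.7141, 1744.1128], dtype=np.float32)

truncated = box.astype(int)          # drops the fractional part
rounded = np.rint(box).astype(int)   # rounds to the nearest integer

print(truncated)  # [2251 1117 3216 1744]
print(rounded)    # [2251 1118 3217 1744]
```

The difference is at most one pixel per side, which rarely matters for cropping.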

Now, you need to load the image and crop it using the coordinates above.

I will use the OpenCV library for this. Make sure that it's installed on your system, or install it:



%pip install opencv-python

Then import it and load the image:



import cv2

img = cv2.imread("road.jpg")

An OpenCV image is a regular NumPy array. You can check its shape:



print(img.shape)



(604, 800, 3)

The first dimension is the number of rows (the height of the image), the second dimension is the number of columns (the width of the image), and the third dimension is the number of color channels, which is 3 for standard color images. Note that OpenCV stores the channels in BGR order, not RGB.
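Because rows come first, any slice of the array takes y before x. Here is a minimal sketch on a synthetic array of the same shape. It stands in for the loaded image, and the box coordinates are made up for illustration:

```python
import numpy as np

# Synthetic stand-in for cv2.imread("road.jpg"): 604 rows, 800 columns, 3 channels
img = np.zeros((604, 800, 3), dtype=np.uint8)

height, width, channels = img.shape

# Hypothetical box, given as x1, y1, x2, y2
x1, y1, x2, y2 = 100, 50, 300, 200

# Rows (y) come first, columns (x) second
crop = img[y1:y2, x1:x2, :]

print(height, width, channels)  # 604 800 3
print(crop.shape)               # (150, 200, 3)
```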

Now, it's easy to crop the part of this array, using x1,y1,x2,y2 coordinates that we have:



img[y1:y2,x1:x2,:]

This way you get only the rows from y1 to y2 and the columns from x1 to x2, i.e. only the object that you need. Let's save this cropped image to a new file:



cv2.imwrite("image1.png",img[y1:y2,x1:x2,:])

That's all. After running this code, you should see the new file image1.png. If you open it, you should see the cropped object with background:

Now, you can write a loop to extract and save all detected objects:



for idx,box in enumerate(result.boxes.xyxy):
    x1,y1,x2,y2 = box.cpu().numpy().astype(int)
    cv2.imwrite(f"image{idx}.png", img[y1:y2,x1:x2,:])

After running this, you'll see the files image0.png, image1.png, and so on, one for each object detected in the image.
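If you want to be defensive about coordinates that fall outside the image, you can clamp them before slicing. The function below is a hypothetical sketch, not part of the original solution: it assumes boxes are given as [x1, y1, x2, y2] and uses a synthetic image and made-up boxes so that it runs without a model.

```python
import numpy as np


def crop_boxes(img, boxes):
    """Return a list of crops, one per [x1, y1, x2, y2] box.

    Coordinates are clamped to the image bounds and empty crops are skipped.
    """
    height, width = img.shape[:2]
    crops = []
    for box in boxes:
        x1, y1, x2, y2 = np.asarray(box).astype(int)
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 > x1 and y2 > y1:
            crops.append(img[y1:y2, x1:x2, :])
    return crops


# Synthetic image and made-up boxes for illustration
img = np.zeros((604, 800, 3), dtype=np.uint8)
boxes = [[10.7, 20.2, 110.9, 220.5],   # fully inside the image
         [-5.0, 590.0, 50.0, 650.0],   # partly outside: gets clamped
         [900.0, 10.0, 950.0, 60.0]]   # fully outside: skipped
crops = crop_boxes(img, boxes)
print([c.shape for c in crops])
```

Without clamping, a negative coordinate would be interpreted by NumPy as an index counted from the end of the array, which usually produces an empty or wrong crop.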

This is the full solution:



from ultralytics import YOLO
import cv2

model = YOLO("yolov8m-seg.pt")
results = model(["road.jpg"])

result = results[0]

img = cv2.imread("road.jpg")
for idx,box in enumerate(result.boxes.xyxy):
    x1,y1,x2,y2 = box.cpu().numpy().astype(int)
    cv2.imwrite(f"image{idx}.png", img[y1:y2,x1:x2,:])

Extract objects without background

First, to make parts of an image transparent, we need to add a transparency (alpha) channel to the input image. JPG images do not have this channel, so add it with OpenCV using this line of code:



img = cv2.cvtColor(img,cv2.COLOR_BGR2BGRA)

It's easy to cut a rectangular area out of an image, as you saw in the previous section, but I did not find a Python library that can crop a part of an image using a custom polygon. That is why we will go another way. First, we will make all pixels of the input image that are outside the bounding polygon transparent. Then, we will cut the object from this transparent image using the bounding box, as we did in the previous section.

To implement the first part (making the image transparent), we need to create a binary mask and apply it to the image. A binary mask is a black and white image in which white pixels are treated as belonging to the object and black pixels are treated as transparent. For example, the binary mask for the first object looks like this:

OpenCV has a function that applies a binary mask to an image. This operation makes all pixels of the image transparent, except those that are white in the binary mask.

But first, we need to create a binary mask for OpenCV: a black image with a white bounding polygon.

As I said above, an OpenCV image is a NumPy array, so to create a black binary image, you need to create a NumPy array of the same size as the original image, filled with 0.



import numpy as np

mask = np.zeros_like(img,dtype=np.int32)

This code creates an array of the same size as the original road.jpg image, filled with 0. The items of this array must be integers. Now, you need to draw a white bounding polygon on it, to make it look like the binary mask in the previous image.

The bounding polygon of the first object is located in result.masks.xy[0]. Its items are of type float32, but fillPoly requires int32 points. To convert the polygon to the correct type, use the following code:



polygon = result.masks.xy[0].astype(np.int32)

Now, the fillPoly function of the OpenCV library can be used to draw a white polygon on the black mask:



cv2.fillPoly(mask,[polygon],color=(255, 255, 255))

Finally, let's make everything except the object transparent, using the OpenCV bitwise AND operation:



img = cv2.bitwise_and(img, img, mask=mask[:,:,0].astype('uint8'))

It applies a bitwise AND operation to each pixel of the img image using the mask. All pixels that are 0 in the mask become 0 in every channel, including the alpha channel, so they become transparent.
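The effect of the masked AND can be reproduced with NumPy alone. The following sketch uses a tiny synthetic BGRA image and a made-up square mask; it illustrates what happens to the pixels and is not a replacement for the OpenCV call:

```python
import numpy as np

# Tiny synthetic BGRA image: every pixel white and opaque (alpha = 255)
img = np.full((4, 4, 4), 255, dtype=np.uint8)

# Single-channel mask: 255 inside a 2x2 "object", 0 elsewhere
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 255

# Keep pixels where the mask is non-zero, zero out all four channels elsewhere
masked = np.where(mask[:, :, None] > 0, img, 0).astype(np.uint8)

print(masked[0, 0])  # [0 0 0 0]
print(masked[1, 1])  # [255 255 255 255]
```

The pixel outside the mask ends up with alpha 0, which is what makes it transparent when the image is saved as PNG.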

As a result of this operation, you will have the following image:

Finally, you can crop the object from it, using the bounding box coordinates, as in the previous section:



x1,y1,x2,y2 = result.boxes.xyxy[0].cpu().numpy().astype(int)
cv2.imwrite("image1.png",img[y1:y2,x1:x2,:])

After running this, the image1.png file will contain the object without the background:

Now, let's extract all detected objects from the image. To do this, we need to repeat all the code from this section for each detected object in a loop:



for idx,polygon in enumerate(result.masks.xy):
    polygon = polygon.astype(np.int32)
    img = cv2.imread("road.jpg")
    img = cv2.cvtColor(img,cv2.COLOR_BGR2BGRA)
    mask = np.zeros_like(img,dtype=np.int32)
    cv2.fillPoly(mask,[polygon],color=(255, 255, 255))
    x1,y1,x2,y2 = result.boxes.xyxy[idx].cpu().numpy().astype(int)
    img = cv2.bitwise_and(img, img, mask=mask[:,:,0].astype('uint8'))
    cv2.imwrite(f"image{idx}.png", img[y1:y2,x1:x2,:])

After running this code, you should see the image files image0.png, image1.png, image2.png and so on. Each image will have a transparent background.

Here is the whole solution that extracts all objects from the image with a transparent background using YOLOv8.2 and OpenCV and saves them to files:



from ultralytics import YOLO
import cv2
import numpy as np

model = YOLO("yolov8m-seg.pt")
results = model(["road.jpg"])

result = results[0]

for idx,polygon in enumerate(result.masks.xy):
    polygon = polygon.astype(np.int32)
    img = cv2.imread("road.jpg")
    img = cv2.cvtColor(img,cv2.COLOR_BGR2BGRA)
    mask = np.zeros_like(img,dtype=np.int32)
    cv2.fillPoly(mask,[polygon],color=(255, 255, 255))
    x1,y1,x2,y2 = result.boxes.xyxy[idx].cpu().numpy().astype(int)
    img = cv2.bitwise_and(img, img, mask=mask[:,:,0].astype('uint8'))
    cv2.imwrite(f"image{idx}.png", img[y1:y2,x1:x2,:])

Conclusion

In this article, I showed a simple way to extract and save all detected objects from an image using YOLOv8 and OpenCV in less than 20 lines of code. I did not go into many details in this post. If you want to know more about computer vision and YOLOv8, you are welcome to read the previous articles from my YOLOv8 series. You can find links to them either at the beginning or at the end of this article.

You can follow me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

Teeth caries detection using YOLOv8 neural network

Andrey Germanov — Thu, 22 Feb 2024 04:41:01 +0000

Introduction
Prepare the dataset
    The source dataset format
    The YOLOv8 dataset format
    Convert the dataset
        Create the YOLOv8 dataset folder structure
        Generate the data.yaml file
        Copy images from source to destination datasets
        Convert annotations
Train the caries detector model
Detect caries on custom image
Create a web-service to detect caries
Conclusion

Introduction

This is the fifth part of my YOLOv8 series, and it's time to use the theory learned before to solve a real-world task. Let's stop detecting cats and dogs using pretrained models and start doing something really helpful.

Computer vision is widely used in medicine, so in this article, we are going to train and use a model to detect caries in photos of teeth. I will explain the whole process, from collecting and preparing the training data to creating a web service that uses the trained custom YOLOv8 model to detect and show caries in photos of teeth online.

I assume that you have read the first article of the YOLOv8 series, which explains the theory required to prepare the data for YOLOv8 models, train them, and run detection.

All code examples in this article are in Python, so I assume that you will use Python and Jupyter Notebook to run the code.

If all this is fine with you, let's dive into object detection in medicine.

Prepare the dataset

To start any machine learning task, you need to find data and make a dataset from it. This dataset must be compatible with the algorithm that you will use for training. This is the most complicated and time-consuming part of the whole process. If you can find a ready-to-use quality dataset, you are lucky. If not, the only way to go is to collect images and annotate them manually.

However, even if you find a suitable dataset, it's highly likely that it will not be compatible with your model. This is what happened to me: I found a great dental dataset, but initially it was absolutely incompatible with the YOLOv8 object detection model.

The dataset that I got is DentalAI, which you can download from Dataset Ninja: https://datasetninja.com/dentalai. It has 2495 images with great annotations for four classes: Tooth, Caries, Cavity, and Crack. Perhaps I do not need all these classes, but I can filter them later.

However, you can't use this dataset with YOLOv8 object detection models as is, because of two problems:

1) It is annotated for instance segmentation, not for object detection.
2) It is stored in the Supervisely format.

It needs to be converted to the YOLOv8 format, and there are two options for this:

1) Use one of the online converters between Supervisely and YOLOv8, like this one from Roboflow: https://roboflow.com/convert/supervisely-json-to-yolov8-pytorch-txt
2) Create a script to convert it locally.

The first option looks easy, but it requires uploading the whole 1.2 GB dataset to Roboflow. Moreover, it will convert the dataset for instance segmentation, but I need object detection.

YOLOv8 can do instance segmentation too, but it works slower, and if it's not required for the task, then why consume more resources than needed?

Understanding the details of the data is very important for machine learning, so I will not rely on automatic converters and will go the second way: I will write my own script to convert the dataset to the YOLOv8 object detection format. To do this, we need to understand the source data format.

The source dataset format

If you download and extract the DentalAI dataset, you'll see the following folder structure:

This is a typical "Supervisely" dataset. It is split into training (train) and validation (valid) sets. It also has a test set, which can be used to verify the prediction quality after training is finished. Each of these folders includes the images in the img folder and the annotations for each image in the ann folder.

There is also the meta.json file in the root of the dataset that describes the classes of objects, which this dataset contains:



{
    "classes": [
        {
            "title": "Caries",
            "shape": "polygon",
            "color": "#FF0000",
            "geometry_config": {},
            "id": 4668092,
            "hotkey": ""
        },
        {
            "title": "Cavity",
            "shape": "polygon",
            "color": "#0F8A53",
            "geometry_config": {},
            "id": 4668093,
            "hotkey": ""
        },
        {
            "title": "Crack",
            "shape": "polygon",
            "color": "#0011FF",
            "geometry_config": {},
            "id": 4668094,
            "hotkey": ""
        },
        {
            "title": "Tooth",
            "shape": "polygon",
            "color": "#00FFFF",
            "geometry_config": {},
            "id": 4668095,
            "hotkey": ""
        }
    ],
    "tags": [],
    "projectType": "images"
}

Here you can see that there are four object classes: Caries, Cavity, Crack and Tooth. Usually, the caries, cavity and crack objects are located inside the tooth objects.

If you open any annotation file from one of the ann folders, you'll find that it is a JSON file that describes the image to which it belongs.



{
    "description": "",
    "tags": [],
    "size": {
        "height": 631,
        "width": 931
    },
    "objects": [
        {
            "id": 44634240,
            "classId": 4668095,
            "description": "",
            "geometryType": "polygon",
            "labelerLogin": "inbox@datasetninja.com",
            "createdAt": "2023-09-15T14:09:26.528Z",
            "updatedAt": "2023-09-15T14:09:26.528Z",
            "tags": [],
            "classTitle": "Tooth",
            "points": {
                "exterior": [
                    [
                        409,
                        230
                    ],
                    [
                        379,
                        235
                    ],
                    [
                        358,
                        243
                    ],
                    [
                        334,
                        238
                    ],
                    [
                        313,
                        246
                    ],
                    [
                        290,
                        262
                    ],
                    [
                        269,
                        282
                    ],
                    [
                        256,
                        305
                    ],
                    [
                        253,
                        321
                    ],
                    [
                        231,
                        326
                    ],
                    [
                        221,
                        325
                    ],
                    [
                        222,
                        334
                    ],
                    [
                        217,
                        347
                    ],
                    [
                        213,
                        364
                    ],
                    [
                        207,
                        386
                    ],
                    [
                        202,
                        395
                    ],
                    [
                        207,
                        419
                    ],
                    [
                        210,
                        453
                    ],
                    [
                        243,
                        480
                    ],
                    [
                        281,
                        493
                    ],
                    [
                        327,
                        502
                    ],
                    [
                        347,
                        502
                    ],
                    [
                        370,
                        500
                    ],
                    [
                        411,
                        492
                    ],
                    [
                        431,
                        473
                    ],
                    [
                        437,
                        450
                    ],
                    [
                        449,
                        417
                    ],
                    [
                        445,
                        396
                    ],
                    [
                        442,
                        387
                    ],
                    [
                        445,
                        359
                    ],
                    [
                        447,
                        341
                    ],
                    [
                        447,
                        316
                    ],
                    [
                        455,
                        304
                    ],
                    [
                        457,
                        274
                    ],
                    [
                        454,
                        256
                    ],
                    [
                        444,
                        235
                    ],
                    [
                        433,
                        231
                    ]
                ],
                "interior": []
            }
        },
        {
            "id": 44634239,
            "classId": 4668092,
            "description": "",
            "geometryType": "polygon",
            "labelerLogin": "inbox@datasetninja.com",
            "createdAt": "2023-09-15T14:09:26.528Z",
            "updatedAt": "2023-09-15T14:09:26.528Z",
            "tags": [],
            "classTitle": "Caries",
            "points": {
                "exterior": [
                    [
                        747,
                        318
                    ],
                    [
                        737,
                        332
                    ],
                    [
                        739,
                        345
                    ],
                    [
                        747,
                        352
                    ],
                    [
                        756,
                        353
                    ],
                    [
                        767,
                        347
                    ],
                    [
                        782,
                        342
                    ],
                    [
                        793,
                        344
                    ],
                    [
                        801,
                        346
                    ],
                    [
                        810,
                        347
                    ],
                    [
                        817,
                        346
                    ],
                    [
                        824,
                        341
                    ],
                    [
                        826,
                        333
                    ],
                    [
                        822,
                        325
                    ],
                    [
                        815,
                        322
                    ],
                    [
                        804,
                        316
                    ],
                    [
                        790,
                        315
                    ],
                    [
                        774,
                        312
                    ],
                    [
                        764,
                        313
                    ],
                    [
                        758,
                        313
                    ]
                ],
                "interior": []
            }
        },
        .
        .
        .
    ]
}

This listing shows the beginning of the annotation file 3_jpg.rf.d6209280da26bdf012e1aabd7e5a8d5b.jpg.json from the train dataset. This annotation belongs to the image named 3_jpg.rf.d6209280da26bdf012e1aabd7e5a8d5b.jpg in the corresponding img folder of the training dataset. Here is this image:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8pwtj8hr1brr9b86sdi.jpg)

Two nodes of this JSON are important for us:

the size, which contains the width and height of the image
the objects array, which contains the annotation of each object that should be detected in this image.

This dataset is prepared for instance segmentation, which is why each object is described by a polygon that defines the border of its segmentation mask, and by the class of the object. The class can be found in the classTitle attribute, and the polygon is defined by the points attribute. The points attribute has the exterior key, which contains the actual array of point coordinates: the first coordinate is x and the second is y. The code listing shows 2 objects with the classes "Tooth" and "Caries" and their polygons.
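As a quick illustration, here is how these fields can be read with the json package. The annotation below is a shortened copy of the listing (only three points per object are kept), embedded as a string so that the sketch runs without the dataset:

```python
import json

# Shortened annotation in the same layout as the listing above
annotation_text = """
{
    "size": {"height": 631, "width": 931},
    "objects": [
        {"classTitle": "Tooth",
         "points": {"exterior": [[409, 230], [379, 235], [358, 243]], "interior": []}},
        {"classTitle": "Caries",
         "points": {"exterior": [[747, 318], [737, 332], [739, 345]], "interior": []}}
    ]
}
"""
annotation = json.loads(annotation_text)

width = annotation["size"]["width"]
height = annotation["size"]["height"]

# Class title and number of polygon points for each object
summary = [(obj["classTitle"], len(obj["points"]["exterior"]))
           for obj in annotation["objects"]]

print(width, height)  # 931 631
print(summary)        # [('Tooth', 3), ('Caries', 3)]
```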

There are more objects defined in this file, as you can see for yourself:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uts37h9zrmalx0izcfs9.png)

One Cavity, four Tooth and seven Caries objects are annotated for this image.

So, to recap, the key data that should be converted to the YOLOv8 format is the meta.json file and the annotation files. In particular, from each annotation file you need to get the image width and height and, most importantly, the polygon and class of each object. All this should be transformed to the YOLOv8 dataset format.
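To make the goal concrete: a YOLOv8 object detection label is one text line per object in the form class x_center y_center width height, with all four values normalized to the 0..1 range by the image size. The sketch below converts the "Caries" polygon from the listing into such a line. The function name is hypothetical, and the class index 0 assumes that classes are numbered in the order in which they appear in meta.json.

```python
def polygon_to_yolo_box(points, image_width, image_height):
    """Convert [[x, y], ...] pixel points to normalized
    (x_center, y_center, width, height)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return ((x_min + x_max) / 2 / image_width,
            (y_min + y_max) / 2 / image_height,
            (x_max - x_min) / image_width,
            (y_max - y_min) / image_height)


# Exterior points of the "Caries" object from the listing above
caries = [[747, 318], [737, 332], [739, 345], [747, 352], [756, 353],
          [767, 347], [782, 342], [793, 344], [801, 346], [810, 347],
          [817, 346], [824, 341], [826, 333], [822, 325], [815, 322],
          [804, 316], [790, 315], [774, 312], [764, 313], [758, 313]]

# Assumed class numbering: the order of the classes in meta.json
classes = {"Caries": 0, "Cavity": 1, "Crack": 2, "Tooth": 3}

# Image size taken from the "size" node of the same annotation
box = polygon_to_yolo_box(caries, image_width=931, image_height=631)
line = f'{classes["Caries"]} ' + " ".join(f"{value:.6f}" for value in box)
print(line)
```

The bounding box here is simply the smallest rectangle that contains all polygon points, which is how a segmentation annotation can be reduced to a detection annotation.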

The YOLOv8 dataset format

If you look at the first article of the YOLOv8 series, you'll see that the folder structure of the YOLOv8 standard dataset is very similar to the source dataset:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07ybe3cmbljeliguzjr2.png)

There are train, test and val datasets, and each of them contains folders for images and annotations. The labels folder of the YOLOv8 dataset corresponds to the ann folder of the source dataset, and the images folder of the YOLOv8 dataset corresponds to the img folder of the source dataset.

Also, there is a data.yaml file that describes the data. It has the following format:



train: ../train/images
val: ../val/images
test: ../test/images

nc: <number-of-classes>
names: [<class_names>]

Convert the dataset

These are the steps that should be done to convert the source dataset to the YOLOv8 dataset:

Create the YOLOv8 dataset folder structure
Generate the data.yaml file using data from meta.json file
Copy all images from train/img, valid/img and test/img folders of the source dataset to the train/images, val/images and test/images of the destination dataset
Generate annotation files for images in the train/labels, val/labels and test/labels of the YOLOv8 dataset using the data from the annotation files in the train/ann, valid/ann and test/ann folders of the source dataset.

Let's do this step by step.

Create the YOLOv8 dataset folder structure

To start writing code you can use the Jupyter notebook, or any IDE, but I recommend the former.



import shutil
from os import path
import os
import json

# Define the folder locations of source and destination datasets
SRC_DIR = "dataset"
DEST_DIR = "yolo_dataset"

# Create folder structure of destination dataset
os.makedirs(path.join(DEST_DIR, "train", "images"), exist_ok=True)
os.makedirs(path.join(DEST_DIR, "train", "labels"), exist_ok=True)
os.makedirs(path.join(DEST_DIR, "val", "images"), exist_ok=True)
os.makedirs(path.join(DEST_DIR, "val", "labels"), exist_ok=True)
os.makedirs(path.join(DEST_DIR, "test", "images"), exist_ok=True)
os.makedirs(path.join(DEST_DIR, "test", "labels"), exist_ok=True)

This code imports the required libraries first. Then it defines the folders of the source and destination datasets. In this sample, it is assumed that the source dataset is located in the dataset folder, so create this folder and copy the dataset to it.

Then, the script creates all folders in the destination's dataset root, e.g. yolo_dataset/train/images, yolo_dataset/train/labels, yolo_dataset/val/images etc.

The first step is done. If you run this, you'll see the empty folder structure for the destination dataset inside the yolo_dataset folder.

Now let's fill this structure step by step.

Generate the `data.yaml` file

The data.yaml file defines the paths to the dataset's images and lists object classes this dataset contains. Most of the content can be hard-coded, but the data about classes you need to get from the meta.json file of the source dataset. We will use the json package to parse JSON:



# Load the classes of this dataset from the source meta.json file
with open(path.join(SRC_DIR,"meta.json")) as fp:
    meta = json.load(fp)
# Map each class title to its index
classes = {}
for (index, entry) in enumerate(meta["classes"]):
    classes[entry["title"]] = index

This code opens the meta.json file and parses it to a JSON object. Then it reads all items of the "classes" element (see the sample of the meta.json file above) and creates a dictionary that maps each class name to its index. Finally, the classes variable will contain the following:



{'Caries': 0, 'Cavity': 1, 'Crack': 2, 'Tooth': 3}

This map will be required later, when we create the annotation files, because in the YOLOv8 annotations you need to specify class codes, not names.

So, now everything is ready to generate the data.yaml file with this information:



# Create the "data.yaml" file in destination dataset
# with classes, that this dataset will contain
with open(path.join(DEST_DIR,"data.yaml"),"w") as fp:
    fp.write("train: ../train/images\n")
    fp.write("val: ../val/images\n")
    fp.write("test: ../test/images\n")
    fp.write("\n")
    fp.write("nc: {}\n".format(len(classes)))
    fp.write("names: ['{}']".format("','".join(classes.keys())))

This code does the following:

creates a new data.yaml file in the yolo_dataset folder
adds paths to the train, val and test image folders
adds the number of classes in the nc: line
adds the array of classes in the names: line, in the same order as they are defined in the dictionary.

After you run this, you'll see the yolo_dataset/data.yaml file with the following content:



train: ../train/images
val: ../val/images
test: ../test/images

nc: 4
names: ['Caries','Cavity','Crack','Tooth']

The last line is the most important for the next section. It defines a list of classes and their order is important, because the index of the item in the names array is the ID of the class: 0 - Caries, 1 - Cavity, 2 - Crack, 3 - Tooth.
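Since the class ID is simply the position of the name in the names list, it can help to build that list explicitly from the ID values instead of relying on dictionary order. A minimal sketch, assuming the classes dictionary shown above (the helper name names_in_id_order is made up for illustration):

```python
def names_in_id_order(classes):
    """Return class names sorted by their numeric ID."""
    return [name for name, _ in sorted(classes.items(), key=lambda item: item[1])]

# The mapping produced from meta.json in this article
classes = {"Caries": 0, "Cavity": 1, "Crack": 2, "Tooth": 3}
names = names_in_id_order(classes)

# Every name must sit at the index equal to its class ID
assert all(names[class_id] == name for name, class_id in classes.items())
print("names: {}".format(names))
```

If the assertion fails for your own dataset, the IDs in the annotation files would not match the names line of data.yaml.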

Copy images from source to destination datasets

Let's map source folders to destination folders for convenience:



dirs_map = {"train": "train", "valid": "val", "test":"test"}

Now, you can just copy all images as is, using this code:



# Copy images and transform annotations
for (src_dir, dest_dir) in dirs_map.items():
    # Copy all images from source to destination dataset
    shutil.copytree(path.join(SRC_DIR,src_dir,"img"),path.join(DEST_DIR,dest_dir,"images"),dirs_exist_ok=True)

And that is it. The shutil.copytree function copies all files from the img subfolder of each source dataset to the images subfolder of the corresponding destination dataset. Nothing needs to be changed in the images.

In the same loop, we should convert the annotations for these images, but things are not as simple as with the images. Let's figure it out.

Convert annotations

If you remember the YOLOv8 dataset format, described in the previous article, the annotation file for an image is a text file with the same name as the image, but with the .txt extension. It contains one line of the following form for each object that should be detected on the image:



{class_id} {x_center} {y_center} {width} {height}
{class_id} {x_center} {y_center} {width} {height}
.
.
.

So, the YOLOv8 dataset for object detection requires bounding boxes for each object on the image:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/us7jjshd343ncga2p52s.png)

Furthermore, these bounding boxes must be normalized, i.e. divided by the dimensions of the image. In other words:

x_center = (box_x_left+box_width/2)/image_width
y_center = (box_y_top+box_height/2)/image_height
width = box_width/image_width
height = box_height/image_height

However, the source dataset is annotated for the segmentation, and the source annotation JSON file defines the bounding polygons for each object, like this one:

You need to convert the polygon to the bounding box:

First, you need to determine the coordinates of top left and bottom right corners of bounding box, using the coordinates of the polygon points
Then, calculate width and height of the bounding box and coordinates of center
Normalize x_center, y_center, width and height using image size (see the formulas above).
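Before walking through the loop, here is the whole conversion condensed into one function. This is only a sketch: the function name polygon_to_yolo_box and the sample triangle are made up for illustration, and it assumes each point is an [x, y] pair in pixels, as in the source annotation files.

```python
def polygon_to_yolo_box(points, img_width, img_height):
    """Convert polygon points [[x, y], ...] to a normalized YOLOv8 box
    (x_center, y_center, width, height)."""
    xs = [point[0] for point in points]
    ys = [point[1] for point in points]
    # Corners of the bounding box around all polygon points
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Box size and center in pixels, then normalized by the image size
    width = right - left
    height = bottom - top
    x_center = (left + width / 2) / img_width
    y_center = (top + height / 2) / img_height
    return x_center, y_center, width / img_width, height / img_height

# Hypothetical triangle on a 200x100 image
box = polygon_to_yolo_box([[50, 20], [150, 20], [100, 80]], 200, 100)
print(box)  # (0.5, 0.5, 0.5, 0.6)
```

Using min and max over the coordinates avoids the need for sentinel start values such as 999999.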

Let's start doing that.



for (src_dir, dest_dir) in dirs_map.items():
    # Copy all images from source to destination dataset
    shutil.copytree(path.join(SRC_DIR,src_dir,"img"),path.join(DEST_DIR,dest_dir,"images"),dirs_exist_ok=True)
    # Go over each annotation file, transform annotations to YOLOv8 format
    # and write to the destination dataset
    for file in os.listdir(path.join(SRC_DIR,src_dir,"ann")):
        ann = json.load(open(path.join(SRC_DIR,src_dir,"ann",file),"r"))
        # get width and height of the image
        img_width = ann["size"]["width"]
        img_height = ann["size"]["height"]

After copying all images, this code starts processing the annotations of the same dataset, located in the "ann" subfolder. It loads each file as a JSON object with the ann name and retrieves the width and height of the image to the img_width and img_height variables.

Then, it prepares a name for the corresponding destination annotation file. The name of this file should be the same as the name of the image, but with the .txt extension. To do this, you have to replace the extension of the annotation file with .txt and create a file with this name in the corresponding place of the destination dataset:



file_name = file.replace(".jpg.json",".txt")
fp = open(path.join(DEST_DIR,dest_dir,"labels",file_name),"w")

In the first line we replaced the double .jpg.json extension, which is not needed for the YOLOv8 dataset, with .txt. Then we created a file with this name in the appropriate folder of the destination dataset (either yolo_dataset/train/labels, yolo_dataset/val/labels or yolo_dataset/test/labels).

Now, it's time to process objects and their polygons. The object annotations are collected in the ann["objects"] node:



for obj in ann["objects"]:
    .
    .
    .

If you open that JSON file, you'll see that the class name is located in the classTitle field of the obj, and the polygon points are in the points.exterior field. There is also the interior item inside the points field, but in this dataset it's always empty.

First, let's get the class ID of the object, by the classTitle:



class_id = classes[obj["classTitle"]]

Here we used the dictionary, that maps class titles to their IDs, created on the previous step for the data.yaml file.

Then, you need to go through points.exterior and calculate the bounding box for all these points:



    top = 999999
    left = 999999
    bottom = 0
    right = 0          
    for point in obj["points"]["exterior"]:
        # Determine the top left and right bottom corners of bounding box
        if point[0]<left:
            left = point[0]
        if point[0]>right:
            right = point[0]
        if point[1]<top:
            top = point[1]
        if point[1]>bottom:
            bottom = point[1]

The bounding box is defined by its top left and bottom right corners, so we need to calculate these coordinates. To do this, we go through the points. Each point is an array of two elements: [x,y]. The minimal x and y give the top left corner, and the maximal x and y give the bottom right corner.

Then, knowing the bounding box coordinates, img_width and img_height, you can use the formulas defined above to calculate x_center, y_center, width and height, which are required for the YOLOv8 annotation file:



    width = right - left
    height = bottom - top
    x_center = (left+width/2)/img_width
    y_center =(top+height/2)/img_height
    width /= img_width
    height /= img_height

Finally, all that is left to do is to write the calculated values along with the class ID to the destination annotation file:



    fp.write("{} {} {} {} {}\n".format(class_id,x_center,y_center,width,height))
fp.close()

So, this way, the code writes one line per object to the YOLOv8 annotation file. For example, for the sample image provided above, this code will create the following annotation in the destination dataset:



3 0.35392051557465093 0.5800316957210776 0.2738990332975295 0.43106180665610144
0 0.8394199785177229 0.5269413629160064 0.09559613319011816 0.06497622820919176
0 0.6949516648764769 0.49128367670364503 0.02577873254564984 0.05071315372424723
0 0.34371643394199786 0.694136291600634 0.05370569280343716 0.028526148969889066
0 0.4317937701396348 0.6870047543581617 0.04081632653061224 0.03328050713153724
0 0.27926960257787325 0.6624405705229794 0.010741138560687433 0.028526148969889066
3 0.9624060150375939 0.5649762282091918 0.07303974221267455 0.35657686212361334
1 0.6267454350161117 0.554675118858954 0.21160042964554243 0.3359746434231379
0 0.23952738990332975 0.5499207606973059 0.017185821697099892 0.04120443740095087
0 0.3694951664876477 0.5253565768621236 0.19978517722878625 0.27099841521394613
3 0.8469387755102041 0.5126782884310618 0.1858216970998926 0.3787638668779715
3 0.6117078410311493 0.5301109350237718 0.27175080558539205 0.44532488114104596

As expected there are 4 objects of the "Tooth" class (class_id=3), 7 objects of the "Caries" class (class_id=0) and a single Cavity (class_id=1).
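These counts can be checked with a short script. The sketch below assumes the label text is available as a string (here the twelve lines listed above are pasted in) and uses a made-up helper name, count_classes:

```python
from collections import Counter

def count_classes(label_text, names):
    """Count objects per class name in the text of a YOLOv8 label file."""
    counts = Counter()
    for line in label_text.splitlines():
        if line.strip():
            # The first value of each line is the class ID
            class_id = int(line.split()[0])
            counts[names[class_id]] += 1
    return dict(counts)

names = ["Caries", "Cavity", "Crack", "Tooth"]
label_text = """\
3 0.35392051557465093 0.5800316957210776 0.2738990332975295 0.43106180665610144
0 0.8394199785177229 0.5269413629160064 0.09559613319011816 0.06497622820919176
0 0.6949516648764769 0.49128367670364503 0.02577873254564984 0.05071315372424723
0 0.34371643394199786 0.694136291600634 0.05370569280343716 0.028526148969889066
0 0.4317937701396348 0.6870047543581617 0.04081632653061224 0.03328050713153724
0 0.27926960257787325 0.6624405705229794 0.010741138560687433 0.028526148969889066
3 0.9624060150375939 0.5649762282091918 0.07303974221267455 0.35657686212361334
1 0.6267454350161117 0.554675118858954 0.21160042964554243 0.3359746434231379
0 0.23952738990332975 0.5499207606973059 0.017185821697099892 0.04120443740095087
0 0.3694951664876477 0.5253565768621236 0.19978517722878625 0.27099841521394613
3 0.8469387755102041 0.5126782884310618 0.1858216970998926 0.3787638668779715
3 0.6117078410311493 0.5301109350237718 0.27175080558539205 0.44532488114104596
"""
counts = count_classes(label_text, names)
print(counts)  # {'Tooth': 4, 'Caries': 7, 'Cavity': 1}
```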

This is a full loop, that copies all images and converts all annotations from train, valid and test folders of the source dataset to the train, val and test folders of the destination dataset:



for (src_dir, dest_dir) in dirs_map.items():
    # Copy all images from source to destination dataset
    shutil.copytree(path.join(SRC_DIR,src_dir,"img"),path.join(DEST_DIR,dest_dir,"images"),dirs_exist_ok=True)
    # Go over each annotation file, transform annotations to YOLOv8 format
    # and write to the destination dataset
    for file in os.listdir(path.join(SRC_DIR,src_dir,"ann")):
        ann = json.load(open(path.join(SRC_DIR,src_dir,"ann",file),"r"))
        # get width and height of the image
        img_width = ann["size"]["width"]
        img_height = ann["size"]["height"]
        # Create the annotation file in the destination dataset
        file_name = file.replace(".jpg.json",".txt")
        fp = open(path.join(DEST_DIR,dest_dir,"labels",file_name),"w")
        # Calculate bounding boxes for each object, defined in this annotation file
        for obj in ann["objects"]:
            # Get a class code for this bounding box
            class_id = classes[obj["classTitle"]]
            top = 999999
            left = 999999
            bottom = 0
            right = 0
            for point in obj["points"]["exterior"]:
                # Determine the top left and right bottom corners of bounding box
                if point[0]<left:
                    left = point[0]
                if point[0]>right:
                    right = point[0]
                if point[1]<top:
                    top = point[1]
                if point[1]>bottom:
                    bottom = point[1]
            # Calculate bounding box in YOLOv8 format with normalization
            # (x_center,y_center,width,height), once per object
            width = right - left
            height = bottom - top
            x_center = (left+width/2)/img_width
            y_center =(top+height/2)/img_height
            width /= img_width
            height /= img_height
            # Write bounding box to the annotation file to destination dataset
            fp.write("{} {} {} {} {}\n".format(class_id,x_center,y_center,width,height))
        fp.close()

That is it for data preparation. The yolo_dataset folder will contain the standard YOLOv8 object detection dataset. You can load one of YOLOv8 models and use the .train method to train the model using this dataset.
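Before training, it may be worth checking that every copied image received a label file. The following is a sketch rather than part of the converter: the helper name find_unlabeled_images is made up, and the demo runs on a throwaway folder with invented file names so that it works without the real dataset.

```python
import tempfile
from pathlib import Path

def find_unlabeled_images(split_dir):
    """Return names of files in <split_dir>/images that have no matching
    .txt file in <split_dir>/labels."""
    split_dir = Path(split_dir)
    label_stems = {p.stem for p in (split_dir / "labels").glob("*.txt")}
    return sorted(
        p.name
        for p in (split_dir / "images").iterdir()
        if p.is_file() and p.stem not in label_stems
    )

# Demo on a temporary folder that mimics one split of the dataset
with tempfile.TemporaryDirectory() as tmp:
    split = Path(tmp) / "train"
    (split / "images").mkdir(parents=True)
    (split / "labels").mkdir()
    (split / "images" / "a.jpg").touch()
    (split / "images" / "b.jpg").touch()
    (split / "labels" / "a.txt").touch()
    missing = find_unlabeled_images(split)

print(missing)  # ['b.jpg']
```

For the converted dataset, you would call it with yolo_dataset/train, yolo_dataset/val and yolo_dataset/test and expect an empty list each time.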

Train the caries detector model

The YOLOv8 training process is fully described in the appropriate section of the first article about YOLOv8. I will just recap the code that is required to run the training process:



from ultralytics import YOLO
import os

# Load the medium YOLOv8 pretrained model
model = YOLO("yolov8m.pt")

# Train the model on transformed dataset with 30 epochs.
model.train(data=os.path.join(os.getcwd(),"yolo_dataset","data.yaml"),model="yolov8m.pt",epochs=30)

This code runs the training and, after it finishes 30 epochs, stores the best model in the runs/detect/train/weights/best.pt file (on a first run; Ultralytics numbers the folders of later runs, for example train2). You can copy it from there to your project folder.

After that, you can run detection using your own images.

Detect caries on custom image

After the model is trained, let's start using it. The object detection process is also described in detail in the first article, so here I will only show the code with brief comments. This code assumes that you write and run it in the Jupyter Notebook.

I've got this image, that we are going to use to detect caries on these teeth:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h130kwwtdp7wplqvu86o.jpeg) Source: [Wikipedia](https://en.wikipedia.org/wiki/Tooth_decay)

I put it to the caries.jpg file and the following code will run YOLOv8 object detection, using the trained model:



from ultralytics import YOLO
model = YOLO("best.pt")
results = model("caries.jpg")

Next code gets detected bounding boxes for the first image:



boxes = results[0].boxes
print(len(boxes))

It detected 6 boxes. Let's print their coordinates, classes and probabilities:



for box in boxes:
    class_name = results[0].names[box.cls.numpy()[0]]
    probability = box.conf.numpy()[0]
    [x1,y1,x2,y2] = [int(v) for v in box.xyxy.numpy()[0]]
    print(f"{x1},{y1},{x2},{y2} - {class_name} - {probability}")



38,72,98,130 - Tooth - 0.930024266242981
0,91,43,161 - Tooth - 0.8993273377418518
176,30,226,78 - Tooth - 0.8704051375389099
104,53,180,117 - Tooth - 0.8602994680404663
115,77,161,114 - Caries - 0.8227962255477905
230,17,287,60 - Tooth - 0.4153428375720978

It detected 5 Tooth and 1 Caries. But we do not need teeth, we need only caries. So, let's filter this list and save it:



caries_boxes = []
for box in boxes:
    class_name = results[0].names[box.cls.numpy()[0]]
    if class_name != "Caries":
        continue
    caries_boxes.append([int(v) for v in box.xyxy.numpy()[0]])
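The same filter can be combined with a minimum probability, since weak detections like the last Tooth above (about 0.42) are often not wanted. This is an optional sketch on plain tuples copied from the printed output; the helper name filter_detections and the 0.5 threshold are assumptions, not part of the article's code.

```python
# Detections copied from the output above: (x1, y1, x2, y2, class_name, probability)
detections = [
    (38, 72, 98, 130, "Tooth", 0.930024266242981),
    (0, 91, 43, 161, "Tooth", 0.8993273377418518),
    (176, 30, 226, 78, "Tooth", 0.8704051375389099),
    (104, 53, 180, 117, "Tooth", 0.8602994680404663),
    (115, 77, 161, 114, "Caries", 0.8227962255477905),
    (230, 17, 287, 60, "Tooth", 0.4153428375720978),
]

def filter_detections(detections, class_name, min_probability=0.5):
    """Keep boxes of one class whose probability reaches the threshold."""
    return [
        d[:4]
        for d in detections
        if d[4] == class_name and d[5] >= min_probability
    ]

caries = filter_detections(detections, "Caries")
teeth = filter_detections(detections, "Tooth")
print(caries)      # [(115, 77, 161, 114)]
print(len(teeth))  # 4
```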

Now, let's visualize these detections, using the Pillow image package:



from PIL import Image,ImageDraw
img = Image.open("caries.jpg")
draw = ImageDraw.Draw(img)
for box in caries_boxes:
    draw.rectangle(box,None,"#00FF00",3)
img

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfxgjohas0pe05k4ags3.png)

Great, as an additional exercise, try this code using more images, like this one:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ibtdkdkvnhoqoi9uib8b.jpg)

You should receive the following detections for caries:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/asrhhdt0sr4xcyv4l5ah.jpg)

The model trained for 30 epochs works pretty well. It detects even small areas of caries.

Create a web-service to detect caries

In the last part of the article about YOLOv8 object detection, we created a web service that uses a custom YOLOv8 model to detect objects and display their bounding boxes. You can reuse this code. Just replace the model with the best.pt file, trained here for teeth detection, and you are ready to go!

This is how it could look:

I modified the code a little bit to detect only caries and to not show the class labels, because they are not required for single class detections.

Conclusion

In this article, we did a great job: using the knowledge from previous articles, we wrote a dataset converter from Supervisely to YOLOv8 format. You can use this as a base to write converters for image data from other sources.

Then we trained a custom model for a real business task: to detect caries and other tooth diseases. We trained it for just 30 epochs, but it showed decent results on images that it has never seen before. Of course, to get better accuracy, you can train the model again using more epochs and tuning other YOLOv8 hyperparameters.

Finally, we created a web-service, that uses this model to detect and show caries on teeth online.

This web service is implemented in Python, but you can easily convert it to other programming languages. Read the second part of the YOLOv8 series to discover how to do this.

You can find all source code for this article in this repository, but I recommend writing it from scratch, by reading the article.

I hope you enjoyed this reading and will find this useful for your own work.

Have a good one, and until next time!

Export Segment Anything neural network to ONNX: the missing parts

Andrey Germanov — Wed, 15 Nov 2023 14:47:40 +0000

Introduction
What is the problem?
Diving to the SAM model structure
Export SAM to ONNX - the right way
    Export the image encoder
    Export the mask decoder
Produce image segmentation masks using ONNX
    Preprocess input image
    Generate embeddings from input image
    Encode the prompt
    Run the mask decoder
    Post-process and visualize segmentation mask
Conclusion

Introduction

Hello all!

In this article, I am going to talk about Segment Anything - the neural network for instance segmentation, which can be used to segment any object from an image without knowing its type. However, this is not a tutorial on how to use it, because that is already described in the official repository and in other articles like this one. Here I will explain how to solve a problem with it that is not described anywhere - the problem with the export to ONNX function.

What is the problem?

If you try to export the Segment Anything model to ONNX and then deploy it to production, using the guide in the official notebook, you'll see that the exported ONNX model is not enough on its own: you still need the Segment Anything package with PyTorch to prepare embeddings from the input image, and you still need a function from this package to encode the prompt.

When I experienced this for the first time, I asked myself: "Why should I export the model to ONNX if I still need to use the original PyTorch model?".

One of the main benefits of ONNX is the ability to run the model in environments without Python and PyTorch. However, according to official documentation, I can't do that with Segment Anything. Even with ONNX I need to install the whole PyTorch environment on my production server or device.

I was not alone with this problem: a lot of people asked for a solution in forums or in the project's GitHub, but there were no clear answers. Finally, I decided to dive into the Segment Anything source code myself and fill this gap.

In this article, I am going to show how to export a complete SAM model and how to segment the image using only ONNX model and without other heavy dependencies.

Diving to the SAM model structure

Before going to ONNX, let's understand the SAM model structure by using its official API.

Segment Anything has a transformer neural network architecture and contains the following parts: image encoder, prompt encoder and mask decoder.

This picture from SAM official paper shows the segmentation mask inference process. Now let's see the code, that uses the official API, that implements this flow.

All code examples in this article use the following image, named cat_dog.jpg that you can download here:



from segment_anything import sam_model_registry, SamPredictor
import numpy as np
import cv2

# 1. Load the image
img = cv2.imread("cat_dog.jpg")
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)

# 2. Load the Segment anything model
sam = sam_model_registry["vit_b"](checkpoint="./sam_vit_b_01ec64.pth")

# 3. Put the model to the SamPredictor helper object
predictor = SamPredictor(sam)

# 4. Encode the image to embeddings.
predictor.set_image(img)

# 5. Prepare the prompt
input_point = np.array([[321,230]])
input_label = np.array([1])

# 6. Decode masks (predict returns masks, their quality scores and low resolution logits)
masks, scores, logits = predictor.predict(input_point, input_label)

Here is a breakdown of this flow:

First it loads the image as a Numpy array of HWC shape (Height, Width, Channels) using OpenCV. You can do this using any other library like Pillow as well.
Then it loads the SAM model to the sam variable. The sam is an object of the Sam class, defined in the sam.py file. This class contains both the image encoder and the mask decoder parts. If you open this file and look at the __init__ constructor, you'll find that the encoder is initialized in the image_encoder property and the decoder is initialized in the mask_decoder property. Both of them are standard PyTorch neural network modules.
Then, the code initializes the helper SamPredictor object, which is used as a wrapper for the created Sam model. It contains helper methods to prepare the input image, encode the image to embeddings, encode the prompt and pass both of them to the mask_decoder to get segmentation masks.
The most important line of the whole code is predictor.set_image(img). This method is used to preprocess the input image and run the SAM encoder network with it. Under the hood, it runs the following line with the preprocessed image: predictor.features = sam.image_encoder(input_image). This line passes the image through the encoder neural network to get embeddings and saves them to the features property of the SamPredictor object. The official export to ONNX function does not export this neural network, so you still need to run this even if you use the exported ONNX model.
Then, you define the point on the image that will be used as a prompt to decode the segmentation mask, and a label for this point: 1 means that the point belongs to the object that you want to extract, 0 means that the point does not belong to that object.
Finally, you execute the predictor.predict(input_point, input_label) method. At this moment, the predictor encodes the prompt and passes both the image embeddings, saved in the features property, and the encoded prompt to the mask decoder, which is the sam.mask_decoder neural network. The output tensor of the decoder is then post-processed, and the method returns the resulting masks.

This is how the official API works. Segment Anything is actually two neural networks, image_encoder and mask_decoder, that are executed separately, one after the other. It runs the sam.image_encoder network first to encode the image to embeddings, and then it runs the sam.mask_decoder network to decode the embeddings to masks, using the prompt. The prompt is also encoded, using the prompt encoder, but in many cases the prompt can be encoded without a neural network. However, when you export the sam model to ONNX, it exports only the mask_decoder, and you still need to use the official API to prepare the image embeddings for the exported ONNX model and to encode the prompt.

Fortunately, the image_encoder is an ordinary PyTorch neural network module that you can export to ONNX yourself using the standard PyTorch feature, described here. The prompt also can be encoded using only Numpy. I will fill these gaps for you in the next sections.

Export SAM to ONNX - the right way

To use the Segment Anything network independently of PyTorch and/or Python, you need to export two models to ONNX: the image encoder and the mask decoder. The official documentation shows how to export only the mask decoder. In this tutorial, I will show you how to export and use both parts without depending on PyTorch and the official SAM API.

Export the image encoder

To export any PyTorch model to ONNX you need to know the shape of the input tensor or tensors that this model requires. The image encoder model, used in Segment Anything, is a modified encoder part of the ViT transformer neural network. It is defined in the ImageEncoderViT class in the image_encoder.py file. By analyzing the source code of this file, it's easy to understand that this neural network module requires an input tensor of the shape (1,3,1024,1024), which is a batch with one RGB image of 1024x1024 size. So, to pass a single image to the image encoder, you need to encode it to a float tensor of this shape.

This is a full code to export the image encoder to ONNX. I assume that you'll run it in Jupyter Notebook:



!pip install git+https://github.com/facebookresearch/segment-anything.git
!pip install onnx
!pip install torch

from segment_anything import sam_model_registry
import torch

# Download SAM model checkpoint
!pip install wget
!python -m wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

# Load SAM model
sam = sam_model_registry["vit_b"](checkpoint="./sam_vit_b_01ec64.pth")

# Export images encoder from SAM model to ONNX
torch.onnx.export(
    f="vit_b_encoder.onnx",
    model=sam.image_encoder,
    args=torch.randn(1, 3, 1024, 1024),
    input_names=["images"],
    output_names=["embeddings"],
    export_params=True
)

This code installs and imports all required packages first. Perhaps you already have all of them, but I included these lines in case you do not.
Then it downloads model weights and loads the sam model with them. I used the smallest Vit-B version, but you can replace it with 'Vit-L' or 'Vit-H' and download appropriate weights from here.
Finally, the standard torch.onnx.export function is used to export the sam.image_encoder to the vit_b_encoder.onnx file. The resulting ONNX model has a single input, named images, which accepts input tensors of (1,3,1024,1024) shape. Also, it has a single output, named embeddings, that will contain embeddings for the provided input image.

Great! After running this you'll have vit_b_encoder.onnx file. The biggest part of export work is done!

Export the mask decoder

In this section I can only repeat the code that is already written in the official notebook. I modified it a little bit for consistency:



!pip3 install git+https://github.com/facebookresearch/segment-anything.git
!pip3 install onnx
!pip3 install torch

from segment_anything import sam_model_registry
from segment_anything.utils.onnx import SamOnnxModel
import torch

# Download SAM model checkpoint
!pip install wget
!python -m wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

# Load SAM model
sam = sam_model_registry["vit_b"](checkpoint="./sam_vit_b_01ec64.pth")

# Export masks decoder from SAM model to ONNX
onnx_model = SamOnnxModel(sam, return_single_mask=True)
embed_dim = sam.prompt_encoder.embed_dim
embed_size = sam.prompt_encoder.image_embedding_size
mask_input_size = [4 * x for x in embed_size]
dummy_inputs = {
    "image_embeddings": torch.randn(1, embed_dim, *embed_size, dtype=torch.float),
    "point_coords": torch.randint(low=0, high=1024, size=(1, 5, 2), dtype=torch.float),
    "point_labels": torch.randint(low=0, high=4, size=(1, 5), dtype=torch.float),
    "mask_input": torch.randn(1, 1, *mask_input_size, dtype=torch.float),
    "has_mask_input": torch.tensor([1], dtype=torch.float),
    "orig_im_size": torch.tensor([1500, 2250], dtype=torch.float),
}
output_names = ["masks", "iou_predictions", "low_res_masks"]
torch.onnx.export(
    f="vit_b_decoder.onnx",
    model=onnx_model,
    args=tuple(dummy_inputs.values()),
    input_names=list(dummy_inputs.keys()),
    output_names=output_names,
    dynamic_axes={
        "point_coords": {1: "num_points"},
        "point_labels": {1: "num_points"}
    },
    export_params=True,
    opset_version=17,
    do_constant_folding=True
)

This code installs and imports all required packages first. Perhaps you already have all of them, but I included these lines in case you do not.
Then it downloads model weights and loads the sam model with them. I used the smallest Vit-B version, but you can replace it with 'Vit-L' or 'Vit-H' and download appropriate weights from here.
Finally, it uses the standard torch.onnx.export function to export the mask decoder, wrapped in the SamOnnxModel helper, to the vit_b_decoder.onnx file. The resulting ONNX model has six inputs. The most important of them are image_embeddings, which receives the output of the vit_b_encoder.onnx model, and point_coords and point_labels, which receive the encoded prompt. Also, the decoder model requires orig_im_size, the original input image size as an array of two items, [height, width], to correctly scale the resulting masks.

Wonderful! Now you have all parts in a puzzle:

vit_b_encoder.onnx - to create the image embeddings
vit_b_decoder.onnx - to decode segmentation masks using the embeddings and the prompts.

For your convenience, I put all ONNX export code to the sam_onnx_export.ipynb notebook in the article's repository.

However, using these models without the official API is a little bit complicated, because you need to preprocess the input image and encode the prompt on your own. There is no documentation about these points. I will show how to do this in the next section.

Produce image segmentation masks using ONNX

To get segmentation masks for objects of interest in your image using the ONNX models exported above, you need to do the following:

Preprocess the input image
Pass the preprocessed image to the vit_b_encoder.onnx model to generate image embeddings
Create a prompt and encode it
Pass the image embeddings and prompt to the vit_b_decoder.onnx model and receive segmentation mask
Post-process the mask and optionally visualize it

In the next sections, I am going to implement these steps one by one. I assume that you will follow my code using Jupyter Notebook and that you have the vit_b_encoder.onnx and vit_b_decoder.onnx files in the folder with your notebook. Also, in examples I will use the cat_dog.jpg image, which you can download at the beginning of this article and place in the same folder.

Preprocess input image

As mentioned above, the encoder model requires the input tensor of the (1,3,1024,1024) size. Therefore, you need to correctly resize the input image to 1024x1024 preserving the aspect ratio, convert it to tensor of numbers and normalize this tensor.

Let's load the image first, you will use the Pillow package for this:



!pip install Pillow

from PIL import Image
img = Image.open("cat_dog.jpg")
img = img.convert("RGB")
# Save the original size, it will be needed later
orig_width, orig_height = img.size
print(img.size)



(612, 415)

This code loaded the image, converted it to RGB and saved the original size, which you will need later.

Then, you need to resize this image preserving aspect ratio using 1024 as a long side. It means, that you need to set long side to 1024 and then, set short side to maintain aspect ratio. The following code can be used for this:



resized_width, resized_height = img.size

if orig_width > orig_height:
    resized_width = 1024
    resized_height = int(1024 / orig_width * orig_height)
else:
    resized_height = 1024
    resized_width = int(1024 / orig_height * orig_width)

img = img.resize((resized_width, resized_height), Image.Resampling.BILINEAR)
print(img.size)



(1024, 694)

So, this code determines which of the sides is the longest and, based on this, calculates the new size of the shortest side. In this case the longest side is the width and the shortest is the height. The image is scaled to (1024,694), and the new size is saved to the resized_width and resized_height variables.

Then, you need to convert it to a tensor. NumPy allows doing this in a single line:



!pip install numpy
import numpy as np
input_tensor = np.array(img)
input_tensor.shape



(694, 1024, 3)

The input_tensor contains the colors of the image pixels. For each pixel, the last axis holds three color components: red, green and blue. Each component is in the range from 0 to 255. However, the Segment Anything model requires normalized numbers. To normalize a number, you need to subtract the mean color from it and then divide the result by the standard deviation. There are different ways to calculate the mean and the standard deviation, but the Segment Anything package provides precalculated means and deviations for each color component. You need to initialize them:



mean = np.array([123.675, 116.28, 103.53])
std = np.array([58.395, 57.12, 57.375])

So, now you need to subtract 123.675 from each red color component and then divide it by 58.395. Similarly, you need to subtract 116.28 from each green component and divide it by 57.12, and so on for blue. You can do all this in a single line of code using NumPy:



input_tensor = (input_tensor - mean) / std

Now you have a normalized input tensor, but it has an incorrect shape: (694, 1024, 3). You need to change it to the (1,color_channels,height,width) form. In this case it should be (1, 3, 694, 1024):



input_tensor = input_tensor.transpose(2,0,1)[None,:,:,:].astype(np.float32)
input_tensor.shape



(1, 3, 694, 1024)

The final step is to transform it to (1, 3, 1024, 1024). To do this, you need to pad the short side with zeros:



if resized_height < resized_width:
    input_tensor = np.pad(input_tensor,((0,0),(0,0),(0,1024-resized_height),(0,0)))
else:
    input_tensor = np.pad(input_tensor,((0,0),(0,0),(0,0),(0,1024-resized_width)))

input_tensor.shape



(1, 3, 1024, 1024)

The np.pad function receives the input tensor that needs to be padded with zeros and then, for each axis, how many zeros to add before and after the existing values. In this case, you need to add 1024-resized_height rows of zeros to the end of the height axis. If the shortest side were the width, the same would have to be done for the last axis.

That's it, now you have the correct input_tensor for the image encoder model.

Generate embeddings from input image

The first thing you need to do is import the onnxruntime library and load the vit_b_encoder.onnx model with it:



!pip install onnxruntime
import onnxruntime as ort
encoder = ort.InferenceSession("vit_b_encoder.onnx")

Then, run the model with the input_tensor as input images to generate embeddings:



outputs = encoder.run(None, {"images": input_tensor})
embeddings = outputs[0]
embeddings.shape



(1, 256, 64, 64)

If you remember, when you exported the image encoder to ONNX, you specified that this model should have a single input named "images" and a single output named "embeddings". Here, you've passed the input_tensor as the "images" input. The run method of an ONNX model returns outputs as an array, even if there is only one output. That is why the embeddings are located in the first item of this array.

Great, now you have embeddings. This is the first input that you will need for the mask decoder model. The next input is the prompt, which you also need to prepare.

Encode the prompt

The prompt helps to find the segmentation mask of the required object correctly. The prompt can be a single point of the image that belongs to the object, a bounding box around this object, or several points. Segment Anything encodes all these options in a similar way. Let's start with a single point:



input_point = np.array([[321,230]])
input_label = np.array([1])

In this code, you defined a point with x=321 and y=230. Also, you defined a label for this point, which is 1. This label means that the point belongs to the object. Using this definition, the mask decoder will try to find the segmentation mask for the object that contains this point. However, you need to encode this point to the format that the mask decoder requires. Use the next lines of code for this:



from copy import deepcopy

onnx_coord = np.concatenate([input_point, np.array([[0.0, 0.0]])], axis=0)[None, :, :]
onnx_label = np.concatenate([input_label, np.array([-1])])[None, :].astype(np.float32)

coords = deepcopy(onnx_coord).astype(float)
coords[..., 0] = coords[..., 0] * (resized_width / orig_width)
coords[..., 1] = coords[..., 1] * (resized_height / orig_height)

onnx_coord = coords.astype("float32")
onnx_coord



array([[[537.098 , 384.6265],
        [  0.    ,   0.    ]]], dtype=float32)

The SAM mask decoder requires the input point to be scaled to the coordinates of the resized image and converted to a tensor of floats. Here I used the orig_width, orig_height, resized_width and resized_height of the image to scale the coordinates. Because this prompt has no bounding box, the code also appends a padding point (0,0) with the label -1.

I won't give a detailed explanation of each line of this code, because I just reused it from the transform.apply_coords function of the SAM source code with a few modifications to make it simpler. It's just a requirement of the mask decoder model.

If you need to send a bounding box as a prompt, you can use similar code:



input_box = np.array([132, 157, 256, 325]).reshape(2,2)
input_labels = np.array([2,3])

onnx_coord = input_box[None, :, :]
onnx_label = input_labels[None, :].astype(np.float32)

coords = deepcopy(onnx_coord).astype(float)
coords[..., 0] = coords[..., 0] * (resized_width / orig_width)
coords[..., 1] = coords[..., 1] * (resized_height / orig_height)

onnx_coord = coords.astype("float32")
onnx_coord



array([[[220.86275, 262.5494 ],
        [428.33987, 543.49396]]], dtype=float32)

This code encodes a prompt to get the mask for the object located inside the box with the top left corner at x=132,y=157 and the bottom right corner at x=256,y=325. The labels 2 and 3 mark these two points as the top left and bottom right corners of the box.

If you want to encode a prompt that contains both a bounding box and a point, you can use the following code:



input_box = np.array([132, 157, 256, 325]).reshape(2,2)
box_labels = np.array([2,3])
input_point = np.array([[140, 160]])
input_label = np.array([0])

onnx_coord = np.concatenate([input_point, input_box], axis=0)[None, :, :]
onnx_label = np.concatenate([input_label, box_labels], axis=0)[None, :].astype(np.float32)

coords = deepcopy(onnx_coord).astype(float)
coords[..., 0] = coords[..., 0] * (resized_width / orig_width)
coords[..., 1] = coords[..., 1] * (resized_height / orig_height)

onnx_coord = coords.astype("float32")
onnx_coord

This code includes both input_box and input_point and the labels for them. Notice that input_label here contains 0, which means that the point (140,160) does not belong to the object that you want to extract. This prompt will guide the model to segment the object that is located inside the (132,157,256,325) box but does not include the (140,160) point.
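Since the three variants differ only in which points and labels are concatenated, they can be sketched as one helper. The encode_prompt function and its parameter names are hypothetical and not part of the SAM package. It follows the examples above: a padding point with the label -1 is appended only when no box is given.

```python
import numpy as np

def encode_prompt(orig_size, resized_size, points=None, point_labels=None, box=None):
    """Build (onnx_coord, onnx_label) for a point prompt, a box prompt, or both.

    orig_size and resized_size are (width, height) tuples.
    """
    if points is None and box is None:
        raise ValueError("Provide at least one point or a box")

    coords, labels = [], []
    if points is not None:
        coords.append(np.asarray(points, dtype=np.float32).reshape(-1, 2))
        labels.append(np.asarray(point_labels, dtype=np.float32))
    if box is not None:
        # Box corners use the labels 2 (top left) and 3 (bottom right)
        coords.append(np.asarray(box, dtype=np.float32).reshape(2, 2))
        labels.append(np.array([2, 3], dtype=np.float32))
    else:
        # Without a box, a padding point with the label -1 is appended
        coords.append(np.zeros((1, 2), dtype=np.float32))
        labels.append(np.array([-1], dtype=np.float32))

    onnx_coord = np.concatenate(coords, axis=0)[None, :, :]
    onnx_label = np.concatenate(labels, axis=0)[None, :]

    # Scale from original image coordinates to resized image coordinates
    onnx_coord[..., 0] *= resized_size[0] / orig_size[0]
    onnx_coord[..., 1] *= resized_size[1] / orig_size[1]
    return onnx_coord, onnx_label

# The single point prompt from above
onnx_coord, onnx_label = encode_prompt((612, 415), (1024, 694),
                                       points=[[321, 230]], point_labels=[1])
print(onnx_coord, onnx_label)
```

Both returned arrays are float32, so they can be passed to the decoder in the same way as the arrays built step by step above.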

You can construct very specific prompts to get desired results (just like with ChatGPT ;) ).

So, now you have correctly encoded onnx_coord and onnx_label to pass to the mask decoder. Let's do this right now.

Run the mask decoder

Now that you have the embeddings, onnx_coord and onnx_label, nothing can stop you from running the mask decoder model to get the segmentation mask.

Let's load the model first:



decoder = ort.InferenceSession("vit_b_decoder.onnx")

and pass all encoded data to it:



onnx_mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)
onnx_has_mask_input = np.zeros(1, dtype=np.float32)

outputs = decoder.run(None,{
    "image_embeddings": embeddings,
    "point_coords": onnx_coord,
    "point_labels": onnx_label,
    "mask_input": onnx_mask_input,
    "has_mask_input": onnx_has_mask_input,
    "orig_im_size": np.array([orig_height, orig_width], dtype=np.float32)
})
masks = outputs[0]
masks.shape



(1, 1, 415, 612)

This code runs the model with encoded image_embeddings, point_coords and point_labels. Also, I provided dummy masks to mask_input and has_mask_input and original image size to the orig_im_size parameter.

The model returns 3 outputs, and the array of segmentation masks is the first of them. For the input image it returned a tensor of the (1, 1, 415, 612) shape, which contains a single-channel segmentation mask of the original image size.

The only step left is to post-process it.

Post-process and visualize segmentation mask

The segmentation mask is an array of pixels. However, each pixel contains not a color but a number. If this number is greater than 0, then the pixel belongs to the object, otherwise it does not. So, to convert it to real pixel colors you can run the following code:



mask = masks[0][0]
mask = (mask > 0).astype('uint8')*255

This code extracts the pixel matrix from the mask (415x612) and converts all positive values to True and all other values to False. Then it converts the result to 8-bit integers, so that all True values become 1 and all False values become 0. Finally, it multiplies the matrix by 255 to turn all object pixels white. Now you have a single-channel black and white image that can be easily visualized by many image libraries. For example, you can visualize it using Pillow:



img = Image.fromarray(mask,'L')
img
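The same black and white mask can also be used to cut the object out of the original photo. Here is a minimal sketch, assuming that the mask has the same size as the image. The cut_out helper is a hypothetical name, and a small synthetic image and mask stand in for cat_dog.jpg and the decoder output.

```python
import numpy as np
from PIL import Image

def cut_out(image, mask):
    """Return an RGBA copy of `image` where pixels outside `mask` are transparent.

    `mask` is a 2D uint8 array with 255 for object pixels and 0 elsewhere,
    with the same height and width as the image.
    """
    result = image.convert("RGBA")
    # Use the mask as the alpha channel: 255 is opaque, 0 is transparent
    result.putalpha(Image.fromarray(mask))
    return result

# Synthetic stand-ins for the photo and for the post-processed mask
image = Image.new("RGB", (4, 3), (200, 100, 50))
mask = np.zeros((3, 4), dtype=np.uint8)
mask[1, 2] = 255

cutout = cut_out(image, mask)
print(cutout.getpixel((2, 1)), cutout.getpixel((0, 0)))
```

To keep the transparency, save the result in a format with an alpha channel, such as PNG.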

Hooray! Now you can do Segment Anything image segmentation using only ONNX.

This is the end of our journey. You can find all source code of this section in the sam_onnx_inference.ipynb notebook in the repository.

Conclusion

In this article, I showed how to fill the gap in the official implementation of the Segment Anything Model's ONNX export function. Then I showed you how to do prompt-based image segmentation using the exported ONNX models.

You can find all the source code in this repository: https://github.com/AndreyGermanov/sam_onnx_full_export.

Here I used only Python, but now, with the complete ONNX models, you can do much more. You can run the Segment Anything model in any programming language supported by the ONNX runtime. If you know how to pre-process the input and post-process the output, you can integrate this model into most production systems, written in any programming language. For example, you can embed it in software written in C/C++, Go or Rust, or in websites written in JavaScript.

Thank you and until next time!

Follow me on LinkedIn, Twitter, and Facebook to know first about new articles like this one and other software development news.

How to implement instance segmentation using YOLOv8 neural network

Andrey Germanov — Tue, 04 Jul 2023 15:31:32 +0000

Introduction
Getting started with YOLOv8 segmentation
Train the YOLOv8 model for image segmentation
Using YOLOv8 segmentation model in production
    Export the YOLOv8 segmentation model to ONNX
    Load the model using ONNX
    Prepare the input
    Run the model
    Process the output
        Join bounding boxes and masks
        Parse the combined output
        Process segmentation masks
        Calculate bounding polygons
    Draw bounding polygons on the image
Create a segmentation web application
    Create a backend
    Create a frontend
Conclusion

Introduction

This is the fourth part of my YOLOv8 series. In previous articles, I described how to use YOLOv8 to detect objects in images and videos using different programming languages. However, YOLOv8 can also be used to detect objects more precisely, using instance segmentation.

The result of object detection is a list of bounding boxes around all detected objects. The result of the instance segmentation is a segmentation mask of each detected object. The segmentation mask is a black-white image on which all pixels that belong to the object are white, and all other pixels are black, as displayed on the next image:

Once you have this mask, which is, in fact, a 2D array with values of either 0 for background pixels or 255 for object pixels, you can apply it to the image to draw only the pixels that are white on the mask. This way you can, for example, remove the background from the image:

and set new background for objects:

The same way, you can run instance segmentation for each frame in video and remove background from the whole video.
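Applying a mask like this takes a single NumPy call. Below is a minimal sketch with tiny synthetic arrays instead of real photos. The replace_background name is my own, and the image, the mask and the new background are assumed to have the same height and width.

```python
import numpy as np

def replace_background(image, mask, background):
    """Keep object pixels from `image` and take all other pixels from `background`.

    image, background: (H, W, 3) uint8 arrays
    mask: (H, W) array with 255 for object pixels and 0 for background pixels
    """
    keep = (mask > 0)[:, :, None]  # add a channel axis so it broadcasts over RGB
    return np.where(keep, image, background)

image = np.full((2, 2, 3), 200, dtype=np.uint8)    # stand-in for the photo
background = np.zeros((2, 2, 3), dtype=np.uint8)   # stand-in for the new background
mask = np.array([[255, 0], [0, 255]], dtype=np.uint8)

result = replace_background(image, mask, background)
print(result[:, :, 0])
```

Passing an all-black or all-white background array removes the background instead of replacing it.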

Furthermore, having these bit masks of detected objects, you can calculate their contours.

The calculated contour is a polygon, which is an array of point coordinates. This polygon can be used to identify the detected objects on images in real world applications much more precisely than just using the bounding boxes.

Instance segmentation is used in self-driving cars, medical imaging, aerial crop monitoring, and more.

For example, the well known ChromaCam application and its analogues use image segmentation to detect your shape and boundaries and replace or blur background around you while you stream a video from a web camera.

In this article, I will guide you through implementing instance segmentation for images using YOLOv8. First, we will use the default Ultralytics API, where most of the internal work is automated, with a pretrained model shipped with YOLOv8 that detects 80 object classes from the COCO dataset. Then I will give an overview of how to prepare data and train the model on it to detect custom object classes, which can be required for specific business tasks. After using the high-level API, we will dive into the internals: we will prepare the input for the YOLOv8 model and parse its output by hand. Finally, we will create a web application in which you can upload an image, pass it through the YOLOv8 model and display the contours of all objects detected in it.

The YOLOv8 segmentation models are based on the object detection models that I covered in previous articles. When you do instance segmentation, the model automatically does object detection and returns the results of both object detection and instance segmentation. That is why it is crucially important to read my previous articles of this series, at least the first and second parts, because I will reuse a large amount of code from these posts.

Getting started with YOLOv8 segmentation

Let's just start coding. You need an environment where you can run Python code. I recommend using Jupyter Notebook. In the next sections, all code samples and outputs assume that you run them in Jupyter Notebook.

Ensure that the Ultralytics package is installed by running the following command in the notebook:



!pip install ultralytics

Then, import the YOLO models factory object:



from ultralytics import YOLO

Then, you need to instantiate the model that will be used for predictions. I will use a pretrained medium-sized model, shipped with YOLOv8, that can detect 80 object classes. In contrast with object detection models, the segmentation model names have the -seg suffix. So, if you need to load a medium-sized model for segmentation, you need to specify the yolov8m-seg.pt file.



model = YOLO("yolov8m-seg.pt")

The segmentation models are available in all the same sizes as the detection models: yolov8n-seg.pt, yolov8s-seg.pt, yolov8m-seg.pt, yolov8l-seg.pt and yolov8x-seg.pt.

For all examples, I will use the image with a cat and a dog, which is named cat_dog.jpg and located in the current folder with the notebook:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ut0gbf0zrz83ojt9ue79.png)

Let's run the model to get segmentation results for this image:



results = model.predict("cat_dog.jpg")

The predict method for segmentation model works the same as for object detection model. It returns an array of results for each image, specified in the method call. In this case, this array contains a single item. Let's get it:



result = results[0]

Furthermore, the segmentation model runs object detection as well. Consequently, the returned result is almost the same as it was when we ran object detection in the first article of this series. In addition, it has the masks property, which is an array of detected object segmentation masks.

Let's see how many masks were detected:



masks = result.masks
len(masks)

As expected, it segmented 2 objects: a dog and a cat. Let's get the first of these masks:



mask1 = masks[0]

Each mask is an object that has a set of properties. We will use two of them:

data - the segmentation mask of the object, which is a black and white image matrix, in which 0 elements are black pixels and 1 elements are white pixels.
xy - the polygon of object, which is an array of points.

There are other properties as well. You can learn about all of them in the official documentation.

The data property is wrapped in a PyTorch tensor, but I think it is more common to work with it as a NumPy array. Let's extract the mask and the polygon:



mask = mask1.data[0].numpy()
polygon = mask1.xy[0]

Let's display the mask of the first object. We will use the Pillow Image object for it. It has the fromarray method to load images from NumPy matrices. Ensure that Pillow is installed:



!pip install pillow

and create an image from the mask:



from PIL import Image
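# The mask contains 0 and 1 values: scale them to 0 and 255 and convert to 8-bit integers to get a grayscale image
mask_img = Image.fromarray((mask * 255).astype("uint8"))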
mask_img

It should display the following image:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4soatfikiwcnol6rzkey.png)

So, you see that this is a segmentation mask of the dog.

Now, let's work with polygon object. Display it to see the array of points:



polygon



array([[     280.18,      96.575],
       [      275.4,      101.36],
       [      275.4,      102.31],
       [     274.44,      103.27],
       [     274.44,      105.18],
       [     273.49,      106.14],
       [     273.49,      107.09],
       [     272.53,      108.05],
       [     272.53,      111.87],
       [     271.57,      112.83],
       [     271.57,      117.61],
       [     272.53,      118.57],
       [     272.53,      152.04]
...
       [     302.17,      112.83],
       [     302.17,      110.92],
       [     301.22,      109.96],
       [     301.22,      108.05],
       [     298.35,      105.18],
       [     298.35,      104.22],
       [     297.39,      103.27],
       [     297.39,      102.31],
       [     292.61,      97.531],
       [     291.66,      97.531],
       [      290.7,      96.575]], dtype=float32)

Each point is a list with coordinates [x,y]. You can do whatever you want with it. For example, you can draw it on top of the cat_dog image:



from PIL import ImageDraw

img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img)
draw.polygon(polygon,outline=(0,255,0), width=5)
img

This code imports the ImageDraw module from Pillow, which is used to draw on top of images. Then, it opens the cat_dog.jpg image and initializes the draw object with it. Then it draws the polygon on the image using the polygon points. The outline argument specifies the line color (green) and the width argument specifies the line width. Finally, you should see the image with the outlined dog:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sk098kdmmy3qhd33pz12.png)

Now, let's do the same for the second mask:



mask2 = masks[1]
mask = mask2.data[0].numpy()
polygon = mask2.xy[0]
# Scale the 0 and 1 mask values to 0 and 255 and convert to 8-bit integers
mask_img = Image.fromarray((mask * 255).astype("uint8"))
mask_img

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qrn4v8mo6vx8l3yqp163.png)



draw.polygon(polygon,outline=(0,255,0), width=5)
img

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnar6zy1ofpvgbu1zod1.png)

For convenience, you can see the whole coding session of this chapter as a video.

Using the models pretrained on well-known objects is ok to start, but in practice, you may need a solution to segment specific objects for a concrete business problem.

For example, someone may need to detect specific products on supermarket shelves or discover brain tumors on x-rays. It's highly likely that this information is not available in public datasets, and there are no free models that know about everything.

So, you have to teach your own model to detect these types of objects. To do that, you need to create a database of annotated images for your problem and train the model on these images.

Train the YOLOv8 model for image segmentation

To train the model, you need to prepare annotated images and split them into training and validation datasets. The training set is used to teach the model, and the validation set is used to test the results of training and measure the quality of the trained model. You can put 80% of the images in the training set and 20% in the validation set.

These are the steps that you need to follow to create each of the datasets:

Decide and encode classes of objects you want to teach your model to detect. For example, if you want to detect only cats and dogs, then you can state that "0" is cat and "1" is dog.
Create a folder for your dataset and two subfolders in it: "images" and "labels".
Put the images to the "images" subfolder. The more images you collect, the better for training.
For each image, create an annotation text file in the "labels" subfolder. Annotation text files should have the same names as the image files, with the ".txt" extension. In the annotation file, add a record for each object that exists in the corresponding image, in the following format:



{object_class_id} {polygon}

object_class_id is the label of the object class, for example 0 for a cat or 1 for a dog.
polygon is the coordinates of the bounding polygon for this object in the following format: x1 y1 x2 y2 ...

Actually, this is the most time-consuming manual work in a machine learning process: to measure coordinates of bounding polygons for all objects and add them to annotation files. Moreover, coordinates should be normalized to fit in a range from 0 to 1. To calculate them, you need to use the following formulas:

x = x/image_width
y = y/image_height

So, if you have a point with (100,100) coordinate and the image size is (612,415), then, use the following to calculate this point for annotation:

x = 100/612 = 0.163398693
y = 100/415 = 0.240963855
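If you already have polygons in pixel coordinates, the normalization and formatting can be scripted. This is a minimal sketch. The to_annotation_line helper is hypothetical, and rounding to 5 decimal places is my assumption based on the precision of the sample annotation file in this article.

```python
def to_annotation_line(class_id, polygon, image_width, image_height):
    """Format one object as '{object_class_id} {polygon}' with normalized coordinates."""
    values = []
    for x, y in polygon:
        values.append(f"{x / image_width:.5f}")
        values.append(f"{y / image_height:.5f}")
    return f"{class_id} " + " ".join(values)

# A dog (class 1) outlined by two points on a 612x415 image, just to show the format
line = to_annotation_line(1, [(100, 100), (306, 415)], 612, 415)
print(line)  # 1 0.16340 0.24096 0.50000 1.00000
```

A real polygon needs at least three points. The two points here only demonstrate the arithmetic.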

This way, you need to set up polygons for all objects on each image. For example, if you have an image with the following cat and dog polygons:

then you need to create the following annotation file for it:



1 0.45781 0.23271 0.45 0.24423 0.45 0.24654 0.44844 0.24884 0.44844 0.25345 0.44687 0.25575 0.44687 0.25806 0.44531 0.26036 0.44531 0.26958 0.44375 0.27188 0.44375 0.2834 0.44531 0.28571 0.44531 0.36636 0.44375 0.36866 0.44375 0.38018 0.44219 0.38248 0.44219 0.3894 0.44062 0.3917 0.44062 0.42857 0.43906 0.43087 0.43906 0.45622 0.44062 0.45852 0.44062 0.48157 0.43906 0.48387 0.43906 0.49309 0.4375 0.49539 0.4375 0.50461 0.43594 0.50691 0.43594 0.51843 0.43437 0.52074 0.43437 0.52765 0.43281 0.52995 0.43281 0.54148 0.43125 0.54378 0.43125 0.58295 0.43281 0.58526 0.43281 0.58986 0.43437 0.59217 0.43437 0.59447 0.4375 0.59908 0.4375 0.60139 0.44219 0.6083 0.44219 0.6106 0.44375 0.61291 0.44375 0.61521 0.44531 0.61752 0.44531 0.61982 0.45156 0.62904 0.45156 0.63134 0.46875 0.65669 0.47031 0.65669 0.47344 0.6613 0.47344 0.6636 0.475 0.6659 0.475 0.67512 0.47344 0.67742 0.47344 0.69816 0.475 0.70047 0.475 0.71199 0.47656 0.71429 0.47656 0.7166 0.48437 0.72812 0.4875 0.72812 0.48906 0.72581 0.49062 0.72581 0.49375 0.7212 0.49375 0.7166 0.49844 0.70968 0.49844 0.70738 0.50156 0.70277 0.50312 0.70277 0.50469 0.70047 0.50937 0.70047 0.51406 0.70738 0.51406 0.70968 0.51562 0.71199 0.51562 0.71429 0.51719 0.7166 0.51562 0.7189 0.51562 0.72351 0.51719 0.72581 0.51875 0.72581 0.52031 0.72812 0.52969 0.72812 0.53281 0.72351 0.54531 0.72351 0.54687 0.72581 0.55156 0.72581 0.55625 0.73273 0.55781 0.73273 0.55937 0.73503 0.56094 0.73273 0.56875 0.73273 0.57031 0.73503 0.575 0.73503 0.57656 0.73273 0.57812 0.73503 0.58281 0.73503 0.58437 0.73733 0.59375 0.73733 0.59531 0.73964 0.6 0.73964 0.60156 0.74194 0.625 0.74194 0.62656 0.73964 0.63281 0.73964 0.63437 0.73733 0.63906 0.73733 0.64062 0.73503 0.64219 0.73503 0.64375 0.73273 0.64531 0.73273 0.64687 0.73042 0.64844 0.73042 0.65 0.72812 0.65156 0.72812 0.65312 0.72581 0.65469 0.72581 0.65625 0.72351 0.65781 0.72351 0.65937 0.7212 0.6625 0.7212 0.66406 0.7189 0.66719 0.7189 0.66875 0.7166 0.67187 0.7166 0.67344 0.71429 0.67969 0.71429 0.68125 0.71199 0.6875 0.71199 0.68906 0.70968 0.70156 0.70968 0.70312 0.71199 0.70469 0.71199 0.70781 0.7166 0.71094 0.7166 0.7125 0.7189 0.71562 0.7189 0.71719 0.7212 0.71875 0.7212 0.72031 0.72351 0.72656 0.72351 0.72812 0.72581 0.73437 0.72581 0.73594 0.72351 0.7375 0.72351 0.74219 0.7166 0.74219 0.71429 0.74375 0.71199 0.74375 0.70738 0.74531 0.70508 0.74531 0.70047 0.74687 0.69816 0.74687 0.69125 0.74844 0.68895 0.74844 0.67742 0.75 0.67512 0.75 0.65208 0.75156 0.64977 0.75156 0.64056 0.75 0.63825 0.75 0.62904 0.74844 0.62673 0.74844 0.61752 0.74687 0.61521 0.74687 0.6106 0.74531 0.6083 0.74531 0.60599 0.74375 0.60369 0.74375 0.60139 0.74219 0.59908 0.74219 0.59447 0.74062 0.59217 0.74062 0.58986 0.7375 0.58526 0.7375 0.58065 0.73594 0.57834 0.73594 0.57373 0.73281 0.56913 0.73281 0.56682 0.72969 0.56221 0.72969 0.55991 0.72812 0.55761 0.72812 0.5553 0.725 0.55069 0.725 0.54839 0.72031 0.54148 0.72031 0.53917 0.71562 0.53226 0.71562 0.52995 0.71406 0.52995 0.70156 0.51152 0.7 0.51152 0.69844 0.50922 0.69687 0.50922 0.69531 0.50691 0.69219 0.50691 0.69062 0.50461 0.67969 0.50461 0.67812 0.5023 0.67031 0.5023 0.66875 0.50461 0.66094 0.50461 0.65937 0.50691 0.65156 0.50691 0.65 0.50922 0.625 0.50922 0.62344 0.50691 0.62031 0.50691 0.61875 0.50461 0.61875 0.5023 0.61562 0.4977 0.61562 0.49539 0.6125 0.49078 0.61094 0.49078 0.60312 0.47926 0.60312 0.47696 0.6 0.47235 0.6 0.46774 0.59844 0.46544 0.59844 0.46083 0.59687 0.45852 0.59687 0.45392 0.59531 0.45161 0.59531 0.44931 0.59375 0.447 0.59375 0.44009 0.59219 0.43779 0.59219 0.42396 0.59062 0.42166 0.59062 0.40092 0.58906 0.39861 0.58906 0.39631 0.5875 0.39401 0.5875 0.38709 0.58594 0.38479 0.58594 0.29723 0.5875 0.29492 0.5875 0.26036 0.58594 0.25806 0.58594 0.25114 0.58437 0.24884 0.58437 0.24654 0.57656 0.23502 0.57344 0.23502 0.57187 0.23271 0.57031 0.23271 0.56875 0.23502 0.56406 0.23502 0.55156 0.25345 0.55156 0.25575 0.55 0.25806 0.55 0.26036 0.54844 0.26267 0.54844 0.26497 0.54531 0.26958 0.54531 0.27188 0.54375 0.27419 0.54375 0.27649 0.54219 0.2788 0.54219 0.2834 0.5375 0.29032 0.5375 0.29262 0.53437 0.29723 0.53125 0.29723 0.52969 0.29953 0.51562 0.29953 0.51406 0.29723 0.50937 0.29723 0.50781 0.29492 0.50625 0.29492 0.49844 0.2834 0.49844 0.2811 0.49687 0.2788 0.49687 0.27649 0.49375 0.27188 0.49375 0.26727 0.49219 0.26497 0.49219 0.26036 0.4875 0.25345 0.4875 0.25114 0.48594 0.24884 0.48594 0.24654 0.47812 0.23502 0.47656 0.23502 0.475 0.23271
0 0.25 0.41705 0.24844 0.41935 0.24687 0.41935 0.24531 0.42166 0.24531 0.42857 0.24375 0.43087 0.24375 0.46774 0.24219 0.47005 0.24219 0.48157 0.24062 0.48387 0.24062 0.48848 0.23906 0.49078 0.23906 0.4977 0.2375 0.5 0.2375 0.50922 0.23594 0.51152 0.23594 0.52074 0.23437 0.52304 0.23437 0.52995 0.23281 0.53226 0.23281 0.54378 0.23125 0.54608 0.23125 0.63825 0.23281 0.64056 0.23281 0.65208 0.23437 0.65438 0.23437 0.65669 0.23594 0.65899 0.23594 0.6636 0.2375 0.6659 0.2375 0.67512 0.23906 0.67742 0.23906 0.68434 0.24062 0.68664 0.24062 0.69125 0.24219 0.69355 0.24219 0.70047 0.24375 0.70277 0.24375 0.70968 0.24531 0.71199 0.24531 0.7166 0.25 0.72351 0.25 0.72581 0.25156 0.72812 0.25625 0.72812 0.25781 0.73042 0.2875 0.73042 0.28906 0.73273 0.29219 0.73273 0.29375 0.73503 0.29687 0.73503 0.29844 0.73733 0.30312 0.73733 0.30469 0.73964 0.30781 0.73964 0.30937 0.74194 0.3125 0.74194 0.31406 0.74425 0.31562 0.74425 0.31875 0.74886 0.32031 0.74886 0.32187 0.75116 0.32344 0.75116 0.325 0.75346 0.3375 0.75346 0.33906 0.75116 0.35312 0.75116 0.35469 0.75346 0.36094 0.75346 0.3625 0.75577 0.37031 0.75577 0.37187 0.75807 0.37812 0.75807 0.37969 0.75577 0.38594 0.75577 0.3875 0.75346 0.40781 0.75346 0.4125 0.74655 0.4125 0.74425 0.41562 0.73964 0.41562 0.7166 0.41406 0.71429 0.41406 0.71199 0.40937 0.70508 0.40937 0.70277 0.40469 0.69586 0.40469 0.68895 0.40312 0.68664 0.40312 0.67742 0.40156 0.67512 0.40156 0.66821 0.4 0.6659 0.4 0.6106 0.39844 0.6083 0.39844 0.59447 0.39687 0.59217 0.39687 0.58756 0.39531 0.58526 0.39531 0.58295 0.39375 0.58065 0.39375 0.57604 0.39219 0.57373 0.39219 0.57143 0.39062 0.56913 0.39062 0.56682 0.38906 0.56452 0.38906 0.56221 0.3875 0.55991 0.3875 0.55761 0.38594 0.5553 0.38594 0.55069 0.38437 0.54839 0.38437 0.54608 0.38125 0.54148 0.38125 0.53917 0.37812 0.53456 0.37812 0.53226 0.36875 0.51843 0.36719 0.51843 0.36406 0.51383 0.3625 0.51383 0.35937 0.50922 0.35781 0.50922 0.35625 0.50691 0.35312 0.50691 0.35 0.5023 0.34844 0.5023 0.34687 0.5 0.34375 0.5 0.34219 0.4977 0.34062 0.4977 0.33125 0.48387 0.33125 0.47926 0.32969 0.47696 0.32969 0.46083 0.33125 0.45852 0.33125 0.447 0.33281 0.4447 0.33281 0.42396 0.33437 0.42166 0.33281 0.41935 0.32969 0.41935 0.32812 0.41705 0.32656 0.41705 0.325 0.41935 0.32187 0.41935 0.31875 0.42396 0.31719 0.42396 0.30937 0.43548 0.30781 0.43548 0.30625 0.43779 0.30469 0.43779 0.30312 0.44009 0.29687 0.44009 0.29531 0.44239 0.29219 0.44239 0.29062 0.44009 0.27656 0.44009 0.27344 0.43548 0.27187 0.43548 0.26719 0.42857 0.26719 0.42627 0.26562 0.42396 0.26406 0.42396 0.2625 0.42166 0.2625 0.41935 0.26094 0.41935 0.25937 0.41705

The first line defines the normalized polygon for the dog (class_id=1) and the second line defines the normalized polygon for the cat (class_id=0).

After adding and annotating all images, the dataset is ready. You need to create two datasets, one for training and one for validation, and place them in different folders. The final folder structure can look like this:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhkgcydewkiv5f8em465.png)

Here the training dataset is located in the "train" folder and the validation dataset is located in the "val" folder.
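The 80/20 split itself is easy to script. Here is a minimal sketch that only decides which file names go where. The split_dataset helper, the fixed seed and the img_N.jpg names are illustrative assumptions, and copying the images and their label files into the "train" and "val" folders is left out.

```python
import random

def split_dataset(image_names, val_fraction=0.2, seed=0):
    """Shuffle image file names and split them into (train, val) lists."""
    names = sorted(image_names)
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(names)
    val_count = max(1, round(len(names) * val_fraction))
    return names[val_count:], names[:val_count]

train, val = split_dataset([f"img_{i}.jpg" for i in range(10)])
print(len(train), len(val))  # 8 2
```

Remember that each annotation file has to go to the same dataset as its image.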

Finally, you need to create a dataset descriptor YAML-file, that points to created datasets and describes the object classes in them. This is a sample of this file for the data, created above:



train: ../train/images
val: ../val/images

nc: 2
names: ['cat','dog']

In the first two lines, you need to specify the paths to the images of the training and validation datasets. The paths can be either relative to the current folder or absolute. Then, the nc line specifies the number of classes that exist in these datasets, and names is an array of class names in the correct order. The indexes of these items are the numbers that you used when you annotated the images, and these indexes will be returned by the model when it detects objects using the predict method. So, if you used "0" for cats, then "cat" should be the first item in the names array.
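One way to keep nc and names consistent with the class ids is to generate the descriptor from a single ordered list of class names. This is a small sketch that builds the same text as the sample above using plain string formatting. The build_descriptor function is a hypothetical helper, not part of the Ultralytics package.

```python
def build_descriptor(train_images, val_images, class_names):
    """Return the text of a dataset descriptor in the format shown above."""
    names = ",".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_images}\n"
        f"val: {val_images}\n"
        "\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )

# Index 0 is "cat" and index 1 is "dog", matching the ids used in the annotations
text = build_descriptor("../train/images", "../val/images", ["cat", "dog"])
print(text)
```

The returned text can then be written to a .yaml file next to your datasets.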

This YAML file should be passed to the train method of the model to start a training process.

To make this process easier, there are many programs for visually annotating images for machine learning. You can ask a search engine something like "software to annotate images for machine learning" to get a list of them. There are also many online tools that can do all this work. One of the great online tools for this is Roboflow Annotate. Using this service, you just need to upload your images, draw polygons on them, and set a class for each polygon. The tool will then automatically create annotation files, split your data into training and validation datasets and create a YAML descriptor file, after which you can export and download the annotated data as a ZIP file.

In the next video, I show how to use the Roboflow to create the "cats and dogs" micro-dataset.

This example has just two images, but for real-life problems the dataset should be much bigger. To train a good model, you should have hundreds of annotated images.

Also, when you prepare the image dataset, try to make it balanced. It should have a roughly equal number of objects of each class, e.g. an equal number of dogs and cats. Otherwise, the model trained on it could predict one class better than another.
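
One way to check the balance is to count the annotated objects per class. The sketch below assumes that every non-empty line of an annotation file starts with the class index, as in the YOLO text format; count_classes is a hypothetical helper, and the annotation strings are made up for the example.

```python
from collections import Counter

def count_classes(label_texts, names):
    """Hypothetical helper: count annotated objects per class.

    label_texts is a list of strings, each holding the content of one
    annotation file; every non-empty line is assumed to start with the
    class index."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            parts = line.split()
            if parts:
                counts[names[int(parts[0])]] += 1
    return counts

# Two made-up annotation files: three cats and one dog in total
labels = [
    "0 0.1 0.1 0.4 0.1 0.4 0.5\n1 0.5 0.5 0.9 0.5 0.9 0.9",
    "0 0.2 0.2 0.3 0.2 0.3 0.6\n0 0.6 0.1 0.8 0.1 0.8 0.3",
]
counts = count_classes(labels, ["cat", "dog"])
print(counts)
```

If one class dominates the counts, it is a hint to add more images of the other classes.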

After the data is ready, copy it to the folder with your Python code.

The training process for image segmentation is exactly the same as for object detection. You can read about this in the appropriate section of the first article of this series.

After you have the trained model in the best.pt file, you can use it to predict segmentation masks and polygons of your specific object classes.

Using YOLOv8 segmentation model in production

For all this work, we used the Ultralytics high-level APIs, provided with the YOLOv8 package by default. These APIs are based on the PyTorch framework, which is used to create most neural networks today. It's quite convenient on the one hand, but the dependence on these high-level APIs has a negative effect as well. If you need to run an app created this way in production, you should install the whole environment there, including Python, PyTorch and many other dependencies. To run this on a clean new server, you'll need to download and install more than 1 GB of third-party libraries! This is definitely not the way to go. Also, what if you do not have Python in your production environment? What if all your other code is written in another programming language, and you do not plan to use Python? Or what if you want to run the model on a mobile phone running Android or iOS?

Export the YOLOv8 segmentation model to ONNX

Run the following code to export the YOLOv8 segmentation model to ONNX:



model = YOLO("yolov8m-seg.pt")
model.export(format="onnx")

This code should create the yolov8m-seg.onnx file, which is an ONNX version of middle-sized YOLOv8 segmentation model. Let's discover how to make predictions using the ONNX API, instead of Ultralytics.

Load the model using ONNX

Now that you have a model, let's use ONNX to work with it. Install the ONNX Runtime library for Python by running the following command in your Jupyter notebook:



!pip install onnxruntime

and import it:



import onnxruntime as ort

We set the ort alias for it. The ort module is the root of the ONNX Runtime API. The main object of this API is InferenceSession, which is used to instantiate a model to run predictions on it. Model instantiation works very similarly to what we did before with Ultralytics:



model = ort.InferenceSession("yolov8m-seg.onnx")

Here we loaded the model, but from the ".onnx" file instead of ".pt". And now it's ready to run.

Before you continue reading, recall the appropriate section of the article where we used the YOLOv8 ONNX model for object detection. The image segmentation model does object detection as well, so many things are similar, and some thorough descriptions will be omitted here. We will go through the same steps as there: prepare the input, run the model and process the output to finally get segmentation masks and bounding polygons for all detected objects.

Prepare the input

Let's see, which inputs this model expects to receive and which outputs it will produce.



inputs = model.get_inputs()
len(inputs)

This code showed that the model expects to get a single input. Now let's display the information about this input:



input = inputs[0]

print("Name: ",input.name)
print("Shape: ",input.shape)
print("Type: ",input.type)



Name:  images
Shape:  [1, 3, 640, 640]
Type:  tensor(float)

As you see, the input is the same as for the object detection model. It's a 3-channel image of 640x640 pixels. Let's prepare it the same way, using the Pillow module to load the image:



from PIL import Image
import numpy as np

img = Image.open("cat_dog.jpg")

# save the original image size for later use
img_width, img_height = img.size
# convert the image to RGB
img = img.convert("RGB")
# resize to 640x640
img = img.resize((640,640))

# convert the image to tensor 
# of [1,3,640,640] as required for 
# the model input
input = np.array(img)
input = input.transpose(2,0,1)
input = input.reshape(1,3,640,640).astype('float32')
input = input/255.0

So, we have the input, which can be passed through the model.
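
To see what the transpose and reshape steps above do, here is a small NumPy-only sketch that applies the same conversion to a tiny made-up image. The to_tensor name is an assumption introduced for illustration, and the batch dimension is added with np.newaxis instead of reshape, which gives the same result.

```python
import numpy as np

def to_tensor(pixels):
    """Convert an (H, W, 3) array of 0..255 values to a (1, 3, H, W) float32 tensor."""
    chw = pixels.transpose(2, 0, 1)           # (H, W, 3) -> (3, H, W)
    batch = chw[np.newaxis, ...]              # add the batch dimension
    return batch.astype("float32") / 255.0    # scale to the 0..1 range

# A tiny 2x2 "image": one red pixel, the others are black
pixels = np.zeros((2, 2, 3), dtype="uint8")
pixels[0, 0] = [255, 0, 0]
tensor = to_tensor(pixels)
print(tensor.shape)
```

After the conversion, tensor[0, 0] is the whole red channel, tensor[0, 1] is the green one and tensor[0, 2] is the blue one.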

Run the model

Let's discover, which outputs the model will produce before running it:



outputs = model.get_outputs()
len(outputs)

Here you can see something different. This model produces two outputs instead of a single one, as it was for object detection. Let's print information about each of these outputs:



for output in outputs:
    print("Name: ", output.name)
    print("Shape: ", output.shape)
    print("Type: ", output.type)
    print("---")



Name:  output0
Shape:  [1, 116, 8400]
Type:  tensor(float)
---
Name:  output1
Shape:  [1, 32, 160, 160]
Type:  tensor(float)
---

As previously said, the segmentation model outputs both object detection bounding boxes and segmentation masks.

output0 - contains detected bounding boxes and object classes, the same as for object detection
output1 - contains segmentation masks for detected objects. There are only raw masks and no polygons.

Let's run the model to receive these outputs:



outputs = model.run(None, {"images":input})
len(outputs)

Now you have these two outputs and it's time to process them.

Process the output

The model returned 2 outputs. Let's define them for convenience and show their shapes:



output0 = outputs[0]
output1 = outputs[1]
print("Output0:",output0.shape,"Output1:",output1.shape)



Output0: (1, 116, 8400) Output1: (1, 32, 160, 160)

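The output0 tensor is almost the same as the one we got for object detection in the previous article. As before, remove the batch dimension from both outputs and transpose output0, so that each row describes a single detected object: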



output0 = output0[0].transpose()
output1 = output1[0]
print("Output0:",output0.shape,"Output1:",output1.shape)



Output0: (8400, 116) Output1: (32, 160, 160)

The output0 contains 8400 detected objects (most of them are garbage); however, each of them has 116 parameters instead of 84. This output includes the same 84 parameters as in the object detection model, plus 32 additional parameters that are used to build the segmentation mask of the object. That is why you need to split this output into two parts:



boxes = output0[:,0:84]
masks = output0[:,84:]
print("Boxes:",boxes.shape,"Masks:",masks.shape)



Boxes: (8400, 84) Masks: (8400, 32)

Keep in mind that if you use a custom-trained model, the number of classes may differ from 80. In that case, use (number of classes)+4 instead of 84 to split the output into boxes and masks.
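
The split rule for a custom model can be sketched as a small function. The split_output name and the dummy two-class output are assumptions made for this example only.

```python
import numpy as np

def split_output(output0, num_classes):
    """Split rows of the transposed first output into the box part
    (4 coordinates + class scores) and the 32 mask parameters."""
    split_at = 4 + num_classes
    return output0[:, :split_at], output0[:, split_at:]

# Dummy output of a model trained for 2 classes: 4 + 2 + 32 = 38 columns
dummy = np.zeros((8400, 38), dtype="float32")
boxes_part, masks_part = split_output(dummy, num_classes=2)
print(boxes_part.shape, masks_part.shape)
```

For the standard 80-class model, split_output(output0, 80) gives the same two matrices as the slicing shown above.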

Join bounding boxes and masks

The boxes matrix you can process the same way as for object detection.

The data in the masks matrix is not enough to produce segmentation masks. You need to join it with the second output somehow. Let's print the masks and the output1 shapes close to each other to see how to join them:



print(masks.shape,output1.shape)



(8400, 32) (32, 160, 160)

Do you see that the number of columns of the first matrix is the same as the number of rows in the second one? It means that you can multiply these matrices to join them together. After joining them, you get something like (8400,160,160). These are segmentation masks for all detected boxes. Each segmentation mask has a size of 160x160.

However, to do the matrix multiplication, both operands must be two-dimensional, so you need to flatten the last two dimensions of output1:



output1 = output1.reshape(32,160*160)
print(masks.shape,output1.shape)



(8400, 32) (32, 25600)

Later you will need to reshape 25600 to 160,160.

Now it's clear that you can do matrix multiplication. The Python @ operator, which NumPy arrays support, is used for this:



masks = masks @ output1
print(masks.shape)



(8400, 25600)
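
To get a feeling for what this multiplication does, here is a toy sketch with much smaller, randomly generated matrices (5 detections, 3 parameters and 4x4 masks instead of 8400, 32 and 160x160). It illustrates that the mask of each detection is a weighted sum of the masks from the second output. All names and sizes here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(5, 3))        # 5 detections, 3 mask parameters each
protos = rng.normal(size=(3, 4, 4))     # 3 raw masks of 4x4 pixels

# Same steps as above: flatten the raw masks, multiply, then restore 2D masks
flat = coeffs @ protos.reshape(3, 4 * 4)      # shape (5, 16)
toy_masks = flat.reshape(5, 4, 4)             # shape (5, 4, 4)

# The mask of detection 0 is a weighted sum of the raw masks
manual = sum(coeffs[0, k] * protos[k] for k in range(3))
print(np.allclose(toy_masks[0], manual))
```

The same reasoning applies to the real shapes: each of the 8400 rows combines the 32 masks of output1 into one 160x160 mask.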

So, now you have 8400 detected boxes and segmentation masks for them:



print(boxes.shape,masks.shape)



(8400, 84) (8400, 25600)

Now we will connect them together. Let's add 25600 columns from the second matrix to the first one:



boxes = np.hstack((boxes,masks))
print(boxes.shape)



(8400, 25684)

The hstack function connects two 2D NumPy arrays horizontally by appending columns from the second array to the right of the first array. Finally, for each detected object, we have the following columns:

0-4 - x_center, y_center, width and height of bounding box
4-84 - Object class probabilities for all 80 classes, that this YOLOv8 model can detect
84-25684 - Pixels of segmentation mask as a single row. Actually, the segmentation mask is a 160x160 matrix, but we just flattened it.
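
Note that the full matrix has 8400 x 25684 values, which takes more than 800 MB of memory as float32, although most rows are dropped later by the probability filter. A possible optimization, sketched below under the assumption that you filter by the same 0.5 threshold, is to select confident rows first and multiply only them. The masks_for_confident helper is hypothetical and is demonstrated on small random data.

```python
import numpy as np

def masks_for_confident(output0, protos, num_classes=80, threshold=0.5):
    """Hypothetical helper: compute masks only for rows whose best class
    score passes the threshold.

    output0 has shape (N, 4 + num_classes + 32), protos has shape (32, H, W)."""
    scores = output0[:, 4:4 + num_classes].max(axis=1)
    keep = output0[scores >= threshold]                   # only confident rows
    coeffs = keep[:, 4 + num_classes:]
    flat = coeffs @ protos.reshape(protos.shape[0], -1)   # (kept, H*W)
    return np.hstack((keep[:, :4 + num_classes], flat))

# Toy data: 6 candidate rows, 2 classes, 32 mask parameters, 4x4 raw masks
rng = np.random.default_rng(1)
toy_output0 = rng.random((6, 4 + 2 + 32)).astype("float32")
toy_output0[:, 4:6] *= 0.4          # push all scores below the threshold...
toy_output0[2, 5] = 0.9             # ...except for one row
toy_protos = rng.normal(size=(32, 4, 4)).astype("float32")
rows = masks_for_confident(toy_output0, toy_protos, num_classes=2)
print(rows.shape)
```

The returned rows have the same column layout as described in the list above, so the rest of the parsing code would not need to change.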

Parse the combined output

Now nothing can stop you from parsing this matrix. For everything except segmentation mask, you can reuse the same code from the appropriate section of previous article. Let's copy-paste required functions and definitions from there:



yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

def intersection(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    # clamp to 0, so that boxes that do not overlap have zero intersection
    return max(0,x2-x1)*max(0,y2-y1)

def union(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)

def iou(box1,box2):
    return intersection(box1,box2)/union(box1,box2)

Now let's traverse the array of detected objects, parse and filter it:



# stub of mask parsing function
def get_mask(row,box):
    mask = row.reshape(160,160)
    return mask

# parse and filter detected objects
objects = []
for row in boxes:
    prob = row[4:84].max()
    if prob < 0.5:
        continue
    xc,yc,w,h = row[:4]
    class_id = row[4:84].argmax()
    x1 = (xc-w/2)/640*img_width
    y1 = (yc-h/2)/640*img_height
    x2 = (xc+w/2)/640*img_width
    y2 = (yc+h/2)/640*img_height
    label = yolo_classes[class_id]
    mask = get_mask(row[84:25684],(x1,y1,x2,y2))
    objects.append([x1,y1,x2,y2,label,prob,mask])

# apply non-maximum suppression to filter duplicated
# boxes
objects.sort(key=lambda x: x[5], reverse=True)
result = []
while len(objects)>0:
    result.append(objects[0])
    objects = [object for object in objects if iou(object,objects[0])<0.7]

print(len(result))

If you have read that article, then everything except get_mask should be clear here. Finally, it prints 2, the number of detected objects, as it should be for our image.
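
If you want to reuse the filtering step, the non-maximum suppression loop can be wrapped into a function. This is a minimal self-contained sketch with made-up boxes: the non_max_suppression name is an assumption, and the compact iou here clamps negative overlaps to zero, so boxes that do not intersect get an IoU of 0.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix = max(0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    iy = max(0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = ix * iy
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def non_max_suppression(objects, iou_threshold=0.7):
    """Keep the most probable box of every group of overlapping boxes.

    Each object is a list [x1, y1, x2, y2, label, prob, ...]."""
    remaining = sorted(objects, key=lambda obj: obj[5], reverse=True)
    kept = []
    while remaining:
        best = remaining[0]
        kept.append(best)
        remaining = [obj for obj in remaining if iou(obj, best) < iou_threshold]
    return kept

candidates = [
    [10, 10, 110, 110, "cat", 0.9],
    [12, 12, 112, 112, "cat", 0.8],    # almost the same box as the first one
    [200, 200, 300, 300, "dog", 0.7],  # does not overlap the others
]
kept = non_max_suppression(candidates)
print([obj[4] for obj in kept])
```

The second "cat" box overlaps the first one with an IoU of about 0.92, so it is removed as a duplicate, while the "dog" box is kept.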

Process segmentation masks

At the current stage, the get_mask function just reshapes the mask to a 160x160 pixel matrix, but it's not a complete mask parsing code. To write correct code in this function, let's get a mask from the result and explore what it is:



mask = result[0][6]
print(mask)



[[    -1.6534     -3.0664     -3.1947 ...     -7.4953     -6.7599     -5.0995]
 [     -2.258     -4.9611     -5.2562 ...     -9.0785     -8.8663     -7.7189]
 [    -2.6319     -5.4276     -5.5777 ...     -8.7668     -8.7555     -8.0009]
 ...
 [    -3.0467     -4.2886     -4.1702 ...     -5.1899     -5.0957     -4.4152]
 [    -3.0335     -4.4031     -4.4115 ...     -5.3815     -5.2315     -4.4424]
 [    -2.7506     -4.1725     -4.3196 ...      -4.692     -4.4167     -3.5068]]

This code printed the mask of the first detected object. The value for each pixel is a weird number. Actually, each of these numbers should be the probability that the pixel belongs to the object. If the probability is low, then the pixel is background and should be black; otherwise it should be white. But this is the raw output from the neural network, and these values should be converted to probabilities. We will use the sigmoid function for this:

sigmoid(z) = 1 / (1 + e^(-z))

This function returns only values in the range from 0 to 1 for any input argument z. This is how you can define it in Python:



def sigmoid(z):
    return 1 / (1 + np.exp(-z))

and apply it to the mask



mask = sigmoid(mask)
print(mask)



[[    0.16065    0.044514    0.039364 ...  0.00055536    0.001158   0.0060625]
 [    0.09466   0.0069568   0.0051882 ...  0.00011408  0.00014104  0.00044417]
 [   0.067116   0.0043745   0.0037671 ...   0.0001558  0.00015757  0.00033503]
 ...
 [   0.045361    0.013539    0.015214 ...   0.0055419   0.0060856    0.011948]
 [   0.045937    0.012091    0.011991 ...   0.0045797   0.0053169    0.011631]
 [   0.060052     0.01518     0.01313 ...   0.0090851     0.01193     0.02912]]

Now you have probabilities for each pixel, and it's time to convert the probabilities to real colors. Let's say that if the probability is less than or equal to 0.5, the pixel will be black (color 0), and if the probability is greater, it will be white (color 255).



mask = (mask > 0.5).astype('uint8')*255

(mask > 0.5) creates a new boolean matrix of the same size, in which all items that are greater than 0.5 are True and all other items are False
.astype('uint8') converts these values to integers: False to 0 and True to 1
*255 is used to set 255 for all ones to make these pixels white.
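
As a side note, since sigmoid(0) is exactly 0.5 and the function is increasing, the test "probability greater than 0.5" is equivalent to "raw value greater than 0" (apart from floating point rounding for values extremely close to zero). The following sketch, with made-up raw values, compares both ways of thresholding:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

raw = np.array([[-7.5, -0.1, 0.0], [0.1, 2.3, 8.0]])

via_sigmoid = (sigmoid(raw) > 0.5).astype("uint8") * 255
via_sign = (raw > 0).astype("uint8") * 255   # same result without computing exp

print(np.array_equal(via_sigmoid, via_sign))
```

The article keeps the sigmoid step because it makes the meaning of the values explicit, but the shortcut can save some computation on large masks.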

Now, this is a real 160x160 image array, in which all pixels that belong to object are white and all pixels that belong to background are black. You can create the Pillow image from this array and display it:



img_mask = Image.fromarray(mask,"L")
img_mask

However, this mask covers the area of the whole image, but we need only the part of it that belongs to the first detected object. Let's crop it.

Get the coordinates of the object's bounding box first:



x1,y1,x2,y2 = result[0][:4]

Then, keep in mind that the size of the mask is 160x160, but the coordinates of the bounding box are calculated for the real image size, so you need to scale them to the coordinates of the mask:



mask_x1 = round(x1/img_width*160)
mask_y1 = round(y1/img_height*160)
mask_x2 = round(x2/img_width*160)
mask_y2 = round(y2/img_height*160)

then crop the mask using them, convert to image again and display it:



mask = mask[mask_y1:mask_y2,mask_x1:mask_x2]

img_mask = Image.fromarray(mask,"L")
img_mask

The last step is to scale this mask to the size of the bounding box of this object.



img_mask = img_mask.resize((round(x2-x1),round(y2-y1)))
img_mask

and convert this final image back to the mask array:



mask = np.array(img_mask)
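
The coordinate scaling step used above can be sketched as a separate helper. As an addition that is not in the original code, this version also clamps the result to the 0..160 range, because a predicted box can slightly leave the image, and a negative index would make NumPy slicing count from the end of the array. The box_to_mask_coords name and the 612x415 image size are assumptions for the example.

```python
def box_to_mask_coords(box, img_width, img_height, mask_size=160):
    """Hypothetical helper: scale (x1, y1, x2, y2) from image pixels to mask
    pixels and clamp the result to the 0..mask_size range."""
    x1, y1, x2, y2 = box

    def scale(value, full):
        return min(mask_size, max(0, round(value / full * mask_size)))

    return (scale(x1, img_width), scale(y1, img_height),
            scale(x2, img_width), scale(y2, img_height))

# A box that slightly leaves a 612x415 image on the left and at the bottom
coords = box_to_mask_coords((-6.0, 100.0, 306.0, 420.0), 612, 415)
print(coords)
```

The returned tuple can be used directly to crop the mask as mask[y1:y2, x1:x2].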

Now, you can copy all the mask processing code to the get_mask function:



def get_mask(row,box):    
    mask = row.reshape(160,160)
    mask = sigmoid(mask)
    mask = (mask > 0.5).astype('uint8')*255
    x1,y1,x2,y2 = box
    mask_x1 = round(x1/img_width*160)
    mask_y1 = round(y1/img_height*160)
    mask_x2 = round(x2/img_width*160)
    mask_y2 = round(y2/img_height*160)
    mask = mask[mask_y1:mask_y2,mask_x1:mask_x2]
    img_mask = Image.fromarray(mask,"L")
    img_mask = img_mask.resize((round(x2-x1),round(y2-y1)))
    mask = np.array(img_mask)
    return mask

and restart the YOLOv8 model output parsing process to get correct masks for all detected objects.

It's almost finished, but as you remember, Ultralytics API calculates polygons for these masks and we are going to do the same.

Calculate bounding polygons

The segmentation mask is a binary image. There are different algorithms that can calculate contours for binary image. If there are just two colors, then it's not difficult to do.

One of these algorithms is "Topological Structural Analysis of Digitized Binary Images by Border Following", which is widely used to calculate vector polygons for images. Here is a paper that describes it. But do not worry, we will not implement this from scratch. It's already implemented for most programming languages. In particular, you can find the Python implementation in the OpenCV library, in the findContours function. Ensure that OpenCV is installed:



!pip install opencv-python

Then let's define a function, that will get the bounding polygon from the mask, using the findContours function:



import cv2
def get_polygon(mask):
    contours = cv2.findContours(mask,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)
    polygon = [[contour[0][0],contour[0][1]] for contour in contours[0][0]]
    return polygon

In the first line we use the findContours function to get all contours for the mask and in the second line we convert the output of this function (which is a little bit ugly) to an array of [x,y] coordinates.

Now you can add the polygon to the YOLOv8 output parsing loop. Here is a complete parsing code:



def get_mask(row,box):    
    mask = row.reshape(160,160)
    mask = sigmoid(mask)
    mask = (mask > 0.5).astype('uint8')*255
    x1,y1,x2,y2 = box
    mask_x1 = round(x1/img_width*160)
    mask_y1 = round(y1/img_height*160)
    mask_x2 = round(x2/img_width*160)
    mask_y2 = round(y2/img_height*160)
    mask = mask[mask_y1:mask_y2,mask_x1:mask_x2]
    img_mask = Image.fromarray(mask,"L")
    img_mask = img_mask.resize((round(x2-x1),round(y2-y1)))
    mask = np.array(img_mask)
    return mask

def get_polygon(mask):
    contours = cv2.findContours(mask,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)
    polygon = [[contour[0][0],contour[0][1]] for contour in contours[0][0]]
    return polygon

objects = []
for row in boxes:
    prob = row[4:84].max()
    if prob < 0.5:
        continue
    xc,yc,w,h = row[:4]
    class_id = row[4:84].argmax()
    x1 = (xc-w/2)/640*img_width
    y1 = (yc-h/2)/640*img_height
    x2 = (xc+w/2)/640*img_width
    y2 = (yc+h/2)/640*img_height
    label = yolo_classes[class_id]
    mask = get_mask(row[84:25684],(x1,y1,x2,y2))
    polygon = get_polygon(mask)
    objects.append([x1,y1,x2,y2,label,prob,mask,polygon])

objects.sort(key=lambda x: x[5], reverse=True)
result = []
while len(objects)>0:
    result.append(objects[0])
    objects = [object for object in objects if iou(object,objects[0])<0.7]

Draw bounding polygons on the image

Finally, to demonstrate the results of the whole process, we will draw bounding polygons of detected objects on the image.



from PIL import ImageDraw

img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img, "RGBA")

for object in result:
    [x1,y1,x2,y2,label,prob,mask,polygon] = object
    polygon = [(int(x1+point[0]),int(y1+point[1])) for point in polygon]
    draw.polygon(polygon,fill=(0,255,0,125))
img

We loaded the image and created the ImageDraw object for it.
Then we looped through the detected objects.
The polygons are calculated assuming that the top left corner is (0,0), but we need to draw them starting from the top left corner of the object. That is why we transformed the polygon of each object by adding the coordinates of the object's top left corner (x1,y1) to each point.
Then we drew a filled semi-transparent polygon on each object.
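
The coordinate shift described above can be tried in isolation. In this small sketch the translate_polygon name and the sample points are assumptions used only for illustration:

```python
def translate_polygon(polygon, x1, y1):
    """Shift polygon points from box-local coordinates to image coordinates."""
    return [(int(x1 + px), int(y1 + py)) for px, py in polygon]

local = [[0, 0], [10, 0], [10, 5]]          # points relative to the box corner
shifted = translate_polygon(local, 100.6, 50.2)
print(shifted)
```

Note that int() truncates the fractional part, which is enough for drawing purposes.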

If everything went fine, the final image should look like this:

For convenience, watch this video. It is a recording of the whole coding session of this chapter:

Create a segmentation web application

It's interesting to do research in Jupyter Notebook, but now it's time for real practice. We are going to integrate the code written in Jupyter Notebook above into the object detection web application developed in the previous article. This application will detect not only the bounding boxes of objects, but also their contours, and will draw them on top of the image.

The application will look and work as shown in the next video.

Most of the code will be reused from the previous project, which you can find on GitHub.

It's time to stop working in Jupyter Notebook and use some IDE to work with Python web applications.

Here I show how to modify only the web application created in Python. However, as homework, using the same idea, you can integrate instance segmentation into the projects created in the other languages mentioned in that article, like Julia, Node.js, JavaScript, Go and Rust, and in any other programming language that supports the ONNX runtime.

Create a backend

I assume that you will reuse the backend from the object_detector.py file.

You need to make the following changes in it:

Import OpenCV:



import cv2

Then, copy the YOLOv8 segmentation model that we used, yolov8m-seg.onnx, to the project folder and change the run_model function to load it and return 2 outputs:



def run_model(input):
    model = ort.InferenceSession("yolov8m-seg.onnx")
    outputs = model.run(None, {"images":input})
    return outputs

This function runs the model and returns both outputs from it.
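
Note that run_model creates a new InferenceSession on every call. If this becomes a bottleneck, one possible approach is to create the session once and reuse it. The sketch below shows the idea with a hypothetical get_session helper and a stand-in factory, so that it can run without the model file. In the real backend you would pass ort.InferenceSession as the factory.

```python
_sessions = {}

def get_session(path, factory):
    """Hypothetical helper: create the session for a model file once and reuse it.

    factory is any callable that builds a session from a path,
    for example ort.InferenceSession."""
    if path not in _sessions:
        _sessions[path] = factory(path)
    return _sessions[path]

# Demonstration with a stand-in factory that records how often it is called
calls = []

def fake_factory(path):
    calls.append(path)
    return {"model": path}

first = get_session("yolov8m-seg.onnx", fake_factory)
second = get_session("yolov8m-seg.onnx", fake_factory)
print(len(calls), first is second)
```

With this approach, the model file is read only on the first request, and all later requests reuse the same session object.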

Then, change the process_output function to receive these outputs and parse them:



def process_output(outputs, img_width, img_height):
    output0 = outputs[0].astype("float")
    output1 = outputs[1].astype("float")
    output0 = output0[0].transpose()
    output1 = output1[0]
    boxes = output0[:, 0:84]
    masks = output0[:, 84:]
    output1 = output1.reshape(32, 160 * 160)
    masks = masks @ output1
    boxes = np.hstack((boxes, masks))

    objects = []
    for row in boxes:
        prob = row[4:84].max()
        if prob < 0.5:
            continue
        class_id = row[4:84].argmax()
        label = yolo_classes[class_id]
        xc, yc, w, h = row[:4]
        x1 = (xc - w/2) / 640 * img_width
        y1 = (yc - h/2) / 640 * img_height
        x2 = (xc + w/2) / 640 * img_width
        y2 = (yc + h/2) / 640 * img_height
        mask = get_mask(row[84:25684], (x1, y1, x2, y2), img_width, img_height)
        polygon = get_polygon(mask)
        objects.append([x1, y1, x2, y2, label, prob, polygon])

    objects.sort(key=lambda x: x[5], reverse=True)
    result = []
    while len(objects) > 0:
        result.append(objects[0])
        objects = [object for object in objects if iou(object, objects[0]) < 0.5]
    return result

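This code is copied from the previous section with small changes: the get_mask function also accepts the img_width and img_height arguments, the mask itself is not added to the result (only the polygon), and the IoU threshold for non-maximum suppression is 0.5 instead of 0.7.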

Then copy/paste other helper functions to parse masks and polygons:



def get_mask(row,box, img_width, img_height):
    mask = row.reshape(160,160)
    mask = sigmoid(mask)
    mask = (mask > 0.5).astype('uint8')*255
    x1,y1,x2,y2 = box
    mask_x1 = round(x1/img_width*160)
    mask_y1 = round(y1/img_height*160)
    mask_x2 = round(x2/img_width*160)
    mask_y2 = round(y2/img_height*160)
    mask = mask[mask_y1:mask_y2,mask_x1:mask_x2]
    img_mask = Image.fromarray(mask,"L")
    img_mask = img_mask.resize((round(x2-x1),round(y2-y1)))
    mask = np.array(img_mask)
    return mask


def get_polygon(mask):
    contours = cv2.findContours(mask,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)
    polygon = [[int(contour[0][0]), int(contour[0][1])] for contour in contours[0][0]]
    return polygon


def sigmoid(z):
    return 1 / (1 + np.exp(-z))

That's all that should be changed in the backend. The "/detect" endpoint returns an array of detected objects. Each object includes bounding polygon as a last item.
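
One detail worth keeping in mind when returning these objects as JSON: the standard json module does not accept NumPy types such as np.float32 or np.int32, which is presumably why the code above converts the outputs with astype("float") and the polygon points with int(). If you change the parsing code and run into serialization errors, a generic conversion like the hypothetical to_jsonable helper sketched here may help. The sample values are made up.

```python
import json
import numpy as np

def to_jsonable(value):
    """Hypothetical helper: recursively convert NumPy scalars and arrays
    to plain Python values."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value

detected = [np.float32(10.5), np.float32(20.0), np.float64(110.25), np.float64(220.0),
            "cat", np.float32(0.5), [[np.int32(1), np.int32(2)], [3, 4]]]
payload = json.dumps(to_jsonable(detected))
print(payload)
```

The helper leaves plain Python values untouched, so it is safe to apply it to the whole result before sending the response.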

Now, let's modify frontend to draw this polygon on top of the image.

Create a frontend

In the frontend you only need to change the draw_image_and_boxes function in the index.html file to draw the polygons, received from backend on top of the image:



      function draw_image_and_boxes(file,boxes) {
          const img = new Image()
          img.src = URL.createObjectURL(file);
          img.onload = () => {
              const canvas = document.querySelector("canvas");
              canvas.width = img.width;
              canvas.height = img.height;
              const ctx = canvas.getContext("2d");
              ctx.drawImage(img,0,0);
              ctx.strokeStyle = "#00FF00";
              ctx.lineWidth = 3;
              ctx.font = "18px serif";
              boxes.forEach(([x1,y1,x2,y2,label,_,polygon]) => {
                  ctx.fillStyle = "rgba(0,255,0,0.5)";
                  ctx.beginPath();
                  polygon.forEach(([x,y]) => {
                    ctx.lineTo(x+x1,y+y1);
                  });
                  ctx.closePath();
                  ctx.fill();
                  ctx.strokeRect(x1,y1,x2-x1,y2-y1);
                  ctx.fillStyle = "#00ff00";
                  const width = ctx.measureText(label).width;
                  ctx.fillRect(x1,y1,width+10,25);
                  ctx.fillStyle = "#000000";
                  ctx.fillText(label, x1, y1+18);
              });
          }
      }

Notice that in a loop, we use the polygon variable to draw a filled path on top of the image before rectangle and label. We use semi-transparent green color to fill this path (rgba(0,255,0,0.5)).

That's all! Now you can run the app with the following command:



python object_detector.py

Open http://localhost:8080 in a web browser, then use the interface to upload an image. If all the code is written correctly, you'll see not only bounding boxes, but also segmentation masks of detected objects.

Conclusion

In this article, I have explained the instance segmentation machine learning task and showed how to implement it, using the YOLOv8 neural network model. We covered both high level Ultralytics API and low level ONNX API. Finally, we created a web application to detect bounding polygons of detected objects on images.

You can find the source code of this web application in this repository:

https://github.com/AndreyGermanov/yolov8_segmentation_python

In addition, this repository contains Jupyter Notebooks with instance segmentation code, both for Ultralytics and for ONNX.

If you compare the produced masks from Ultralytics API and from ONNX API, you'll probably find that ONNX example returns masks with lower quality. This is because we used only basic input image processing, before passing it to the neural network. The Ultralytics .predict function implements more image preprocessing and postprocessing steps. As an additional practice, you can open and learn the source code of the "predict" function to understand, which additional filters and transformations it applies. Then you can try to implement them on your own for ONNX inference and see the difference.

If you know the data preprocessing and postprocessing algorithm described in this article, you can do YOLOv8 segmentation not only in Python, but in any other language that supports ONNX. For example, this is a web application written in Rust that implements both object detection and instance segmentation: https://github.com/AndreyGermanov/yolov8_onnx_rust_segmentation

Obviously, instance segmentation can be used for more practical tasks than this demo app. One of the most well-known use cases is removing the background around a person in web camera video, as it is implemented in the ChromaCam application.

I created a small Flask web application that demonstrates how it works by using YOLOv8 image segmentation. Get source code here: https://github.com/AndreyGermanov/yolov8_chromacam

Thank you and until next time!

Follow me on LinkedIn, Twitter, and Facebook to know first about new articles like this one and other software development news.

Have fun coding and never stop learning!

How to detect objects in videos in a web browser using YOLOv8 neural network and JavaScript

Andrey Germanov — Wed, 31 May 2023 15:40:42 +0000

Introduction
Adding a video component to a web page
Capture video frames for object detection
Detect objects in video
    Prepare the input
    Run the model
    Process the output
    Draw bounding boxes
Running several tasks in parallel in JavaScript
Running the model in a background thread
Conclusion

Introduction

This is the third part of the YOLOv8 series. In the previous parts, I guided you through all YOLOv8 essentials, including data preparation, neural network training and running object detection on images. Finally, we created a web service that detects objects on images using different programming languages.

Now it's time to move one step forward. If you know how to detect objects in images, then nothing stops you from detecting objects in videos, because the video is an array of images with background sound. You only need to know how to capture each frame as an image and then pass it through the object detection neural network using the same code, that we wrote in the previous article. This is what I am going to show in this tutorial.

In the next sections, we will create a web application that will detect objects in a video, loaded to a web browser. It will display the bounding boxes of detected objects in real time. The final app will look and work as shown in the next video.

Make sure that you have read and tried all previous articles of this series, especially the How to detect objects on images using JavaScript section, because I will reuse the algorithms and source code of the project developed there.

After you refresh your knowledge of how to use the YOLOv8 neural network to detect objects on images in a web browser, you will be ready to continue reading the sections below.

Adding a video component to a web page

Let's start a project now. Create a new folder and add the index.html file to it with the following content:

index.html



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Object detector</title>    
</head>
<body>
<video src="sample.mp4" controls></video>
</body>
</html>

We use a <video> element to display the video on a web page. This element can display video from various sources, including files, web cameras or remote media streams that come from WebRTC. In this article we will use video from a file, but the object detection code will work for any other video source supported by the <video> component. I used sample.mp4, which is a nice recording with two cats. You can download it from here, or use any other MP4 video for testing. Place the video file in the same folder as the index.html.

The video element has many attributes. We used src to specify the source video file and the controls attribute to display the control bar with play and other buttons. You can find the full list of the video tag options here.

When you open this web page, you'll see the following:

You can see that it displays the video and a bottom panel, that can be used to control the video: play/pause, change audio volume, display in a full screen mode and so on.

Also, you can manage this component from JavaScript code. To get access to the video element from your code, you need to get a link to the video object:



const video = document.querySelector("video");

Then you can use the video object to programmatically control the video. This variable is an instance of the HTMLVideoElement object that implements the HTMLMediaElement interface. This object contains a set of properties and methods to control the video element. Also, it provides access to the video lifecycle events. You can bind event handlers to react to many different events, in particular:

loadeddata - fired when the video is loaded and the first frame is displayed
play - fired when the video starts playing
pause - fired when the video is paused

You can use these events to capture video frames. Before capturing the frames, you need to know the dimensions of the video: the width and height. Let's get them right after the video is loaded.

Create a JavaScript file named object_detector.js and include it in the index.html:

index.html



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Object detector</title>    
</head>
<body>
<video src="sample.mp4" controls></video>
<script src="object_detector.js" defer></script>
</body>
</html>

and add the following to the new file:

object_detector.js



const video = document.querySelector("video");

video.addEventListener("loadeddata", () => {
    console.log(video.videoWidth, video.videoHeight);
})

In this code snippet, you set up an event listener for the loadeddata event of the video. As soon as the video file is loaded into the video element, the dimensions of the video become available, and you print videoWidth and videoHeight to the console.

If you used the sample.mp4 video file, you should see the following size in the console.



960 540

If it works, then everything is ready to capture the video frames.

Capture video frames for object detection

As you may have read in the previous article, to detect objects in an image, you need to convert the image to an array of normalized pixel colors. To do that, we drew the image on an HTML5 canvas using the drawImage method, and then used the getImageData method of the canvas context to get access to the pixels and their color components.

The great thing about the drawImage method is that you can use it to draw a video frame on the canvas the same way you used it to draw an image.

Let's see how it works. Add a <canvas> element to the index.html page:

index.html



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Object detector</title>
</head>
<body>
<video src="sample.mp4" controls></video>
<br/>
<canvas></canvas>
<script src="object_detector.js" defer></script>
</body>
</html>

The video element starts playing the video when the user presses the "Play" button, or when the developer calls the play() method on the video object. That is why, to start capturing the video, you need to implement a play event listener. Replace the content of the object_detector.js file with the following:

object_detector.js



const video = document.querySelector("video");

video.addEventListener("play", () => {
    const canvas = document.querySelector("canvas");
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const context = canvas.getContext("2d");
    context.drawImage(video,0,0);
});

In this code, when the video starts playing:

The "play" event listener is triggered.
In the event handler, we set the canvas width and height to the actual width and height of the video.
Next, the code obtains the 2d drawing context of the canvas.
Then, using the drawImage method, we draw the current video frame on the canvas.

Open the index.html page in a web browser and press the "Play" button. After this, you should see the following:

Here you see the video on the top and the canvas with the captured frame below it. The canvas shows only the first frame, because you captured the frame only once, when the video started. To capture each frame, you need to call drawImage repeatedly while the video is playing. You can use the setInterval function to run the specified code repeatedly. Let's draw the current frame of the video every 30 milliseconds:

object_detector.js



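let interval;
video.addEventListener("play", () => {
    const canvas = document.querySelector("canvas");
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const context = canvas.getContext("2d");
    // Assign to the outer variable (no "const" here), so that
    // the "pause" handler can later pass it to clearInterval
    interval = setInterval(() => {
        context.drawImage(video,0,0);
    },30)
});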

In this code we draw the current frame of the video while it is playing. But we should stop this process when the video stops playing, because there is no point in redrawing the canvas if the video is paused or has ended. To do this, I saved the created interval to the interval variable, which can be used later in the clearInterval function.

To catch the moment when the video stops playing, you need to handle the pause event. Add the following to your code to stop capturing frames when the video stops playing:

object_detector.js



video.addEventListener("pause", () => {
    clearInterval(interval);
});

After this is done, you can reload your page. If everything is done correctly, when you press the "Play" button, you'll see that the video and the canvas are synchronized.

The code in the setInterval function will capture and draw each frame on the canvas while the video is playing. If you press the "Pause" button, or the video ends, the pause event handler will clear the interval and stop the frame capturing loop.
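
One fragile spot in this approach is that every "play" event creates a new interval, so if the handler ever runs twice without a "pause" in between, the earlier timer is never cleared. As a minimal sketch, not part of the code of this article, the start and stop logic can be wrapped in a small helper. The createFrameLoop name and its methods are hypothetical, and drawFrame stands for any function that draws the current frame:

```javascript
// Hypothetical helper: wraps setInterval/clearInterval so that the
// frame loop can be started and stopped safely any number of times.
function createFrameLoop(drawFrame, periodMs = 30) {
    let timer = null;
    return {
        start() {
            // Ignore repeated starts, so only one timer ever exists
            if (timer !== null) return;
            timer = setInterval(drawFrame, periodMs);
        },
        stop() {
            if (timer === null) return;
            clearInterval(timer);
            timer = null;
        },
        isRunning() {
            return timer !== null;
        }
    };
}

// Usage sketch with a stand-in drawing function
const loop = createFrameLoop(() => { /* context.drawImage(video,0,0) */ });
loop.start();
loop.start();   // no second timer is created
loop.stop();    // the single timer is cleared
```

On the page, start() would be called from the "play" handler and stop() from the "pause" handler.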

We do not need to display the same video two times on a web page, so we will customize our player. Let's hide the original video player and leave only the canvas.

index.html



<video controls style="display:none" src="sample.mp4"></video>

However, if we hide the video player, we lose access to the "Play" and "Pause" buttons. Fortunately, this is not a big problem, because you can control the video object programmatically. It has play and pause methods to control the playback. We will add our own "Play" and "Pause" buttons below the canvas, and this is how the new UI will look:

index.html



<video controls style="display:none" src="sample.mp4"></video><br/>
<canvas></canvas><br/>
<button id="play">Play</button>&nbsp;
<button id="pause">Pause</button>

Now add click event handlers for the created buttons to object_detector.js:

object_detector.js



const playBtn = document.getElementById("play");
const pauseBtn = document.getElementById("pause");
playBtn.addEventListener("click", () => {
    video.play();
});
pauseBtn.addEventListener("click", () => {
    video.pause();
});

Refresh the page after those changes to see the result:

You should be able to start playback by pressing the "Play" button and stop it by pressing the "Pause" button.

Here is the full JavaScript code at the current stage:

object_detector.js



const video = document.querySelector("video");
let interval
video.addEventListener("play", () => {
    const canvas = document.querySelector("canvas");
    // Match the canvas size to the video frame size
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const context = canvas.getContext("2d");
    interval = setInterval(() => {
        context.drawImage(video,0,0);
    },30)
});

video.addEventListener("pause", () => {
    clearInterval(interval);
});

const playBtn = document.getElementById("play");
const pauseBtn = document.getElementById("pause");
playBtn.addEventListener("click", () => {
    video.play();
});
pauseBtn.addEventListener("click", () => {
    video.pause();
});

Now you have a custom video player and full control over each frame of the video. You can, for example, draw whatever you want on the canvas on top of any video frame using the HTML5 Canvas context API. In the sections below, we will pass each frame to the YOLOv8 neural network to detect all objects in it and draw bounding boxes around them. We will use the same code that we wrote in the previous article, when we developed the JavaScript object detection web service, to prepare the input, run the model, process the output and draw bounding boxes around detected objects.

Detect objects in video

To detect objects in video, you need to detect objects in each frame of the video. You have already converted each frame to an image and displayed it on the HTML5 canvas. Everything is ready to reuse the code that we wrote in the previous article to detect objects in an image. For each video frame you need to:

Prepare the input from the image on the canvas
Run the model with this input
Process the output
Display bounding boxes of detected objects on top of each frame

Prepare the input

Let's create a prepare_input function that will prepare the input for the neural network model. This function will receive the canvas with the displayed frame as an image and will do the following with it:

Create a temporary canvas and resize it to 640x640, which is required for the YOLOv8 model
Copy the source image (canvas) to this temporary canvas
Get the array of pixel color components using the getImageData method of the HTML5 canvas context
Collect the red, green and blue color components of each pixel into separate arrays
Concatenate these arrays into a single one in which reds go first, greens go next and blues go last
Return this array

Let's implement this function:

object_detector.js



function prepare_input(img) {  
    const canvas = document.createElement("canvas");
    canvas.width = 640;
    canvas.height = 640;
    const context = canvas.getContext("2d");
    context.drawImage(img, 0, 0, 640, 640);

    const data = context.getImageData(0,0,640,640).data;
    const red = [], green = [], blue = [];
    for (let index=0;index<data.length;index+=4) {
        red.push(data[index]/255);
        green.push(data[index+1]/255);
        blue.push(data[index+2]/255);
    }
    return [...red, ...green, ...blue];
}

In the first part of the function, we created an invisible 640x640 canvas and drew the input image on it, resizing it to 640x640.

Then, we got access to the canvas pixel data, collected the color components into the red, green and blue arrays, joined them together and returned the result. This process is shown in the next image.

Also, we normalized each color component value by dividing it by 255.

This function is very similar to the prepare_input function created in the previous article. The only difference is that we do not need to create an HTML element for the image here, because the image already exists on the input canvas.
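
To make the layout of the resulting array concrete, here is a small sketch that performs the same conversion on a two-pixel example. The rgba_to_planar name is made up for this illustration, and writing into a preallocated Float32Array instead of pushing to three arrays is a variation for demonstration, not the function used in this article:

```javascript
// Converts interleaved RGBA bytes (as returned by getImageData().data)
// to a planar Float32Array: all reds, then all greens, then all blues.
function rgba_to_planar(data) {
    const pixels = data.length / 4;
    const result = new Float32Array(pixels * 3);
    for (let i = 0; i < pixels; i++) {
        result[i] = data[i * 4] / 255;                   // red plane
        result[pixels + i] = data[i * 4 + 1] / 255;      // green plane
        result[2 * pixels + i] = data[i * 4 + 2] / 255;  // blue plane
        // data[i * 4 + 3] is the alpha channel, which is skipped
    }
    return result;
}

// Two pixels: pure red and pure blue, both fully opaque
const sample = new Uint8ClampedArray([255, 0, 0, 255,   0, 0, 255, 255]);
console.log(Array.from(rgba_to_planar(sample)));
// prints [ 1, 0, 0, 0, 0, 1 ]
```

The first two values are the reds of both pixels, the next two are the greens and the last two are the blues, which is the order that the model input expects.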

When this function is ready, you can pass each frame to it and receive the array to use as an input for the YOLOv8 model. Add the call to this function to the setInterval loop:

object_detector.js



interval = setInterval(() => {
    context.drawImage(video,0,0);
    const input = prepare_input(canvas);
},30)

Here, right after drawing the frame on the canvas, you pass the canvas with the image on it to the prepare_input function, which returns an array of red, green and blue color components of all pixels of this frame. This array will be used as the input for the YOLOv8 neural network model.

Run the model

When the input is ready, it's time to pass it to the neural network. We will not create a backend for this; everything will work in the frontend. We will use the JavaScript version of the ONNX runtime library to run the model predictions right in the browser. Include the ONNX runtime JavaScript library in the index.html file to load it.

index.html



<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>

Then, you need to get the YOLOv8 model and convert it to ONNX format. Do this as explained in this section of the previous article. Copy the exported .onnx file to the same folder as index.html.

Then, let's write a run_model function that will instantiate a model from the .onnx file, pass the input prepared in the section above to the model and return the raw predictions:

object_detector.js



async function run_model(input) {
    const model = await ort.InferenceSession.create("yolov8n.onnx");
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}

This code is just copied from the appropriate section of the previous article. Read it again to refresh your knowledge of how it works.

Here I used the yolov8n.onnx model, which is a tiny version of the YOLOv8 model pretrained on the COCO dataset. You can use any other pretrained or custom model here.
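
Note that run_model, as written, creates a new inference session on every call. A possible optimization, assuming that one session can be reused for many runs, is to create it once and keep the resulting promise. The sketch below shows only the caching pattern; create_cached_loader and fake_load are hypothetical names, and fake_load stands in for a call like ort.InferenceSession.create("yolov8n.onnx"):

```javascript
// Returns a function that calls "load" only once and then keeps
// returning the same promise on all subsequent calls.
function create_cached_loader(load) {
    let promise = null;
    return () => {
        if (promise === null) {
            promise = load();
        }
        return promise;
    };
}

// Stand-in for the real model loading call (an assumption for this demo)
let load_count = 0;
const fake_load = async () => {
    load_count++;
    return { name: "fake session" };
};

const get_session = create_cached_loader(fake_load);
get_session();
get_session();
console.log(load_count); // prints 1
```

Inside run_model, the first line could then become const model = await get_session(); with the loader wrapping the real session creation call.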

Finally, call this function in your setInterval loop to detect objects on each frame:

object_detector.js



interval = setInterval(async() => {
    context.drawImage(video,0,0);
    const input = prepare_input(canvas);
    const output = await run_model(input)
},30)

Notice that I added the async keyword to the function inside setInterval and the await keyword to the run_model call, because run_model is an async function that requires some time to finish execution.

To make it work, you need to serve index.html from an HTTP server, for example a local web server run from VS Code, because the run_model function downloads the yolov8n.onnx file to the browser using HTTP.

Now, it's time to convert the raw YOLOv8 model output to bounding boxes of detected objects.

Process the output

You can just copy the process_output function from the appropriate section of the previous article.

object_detector.js



function process_output(output, img_width, img_height) {
    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(80).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }

    boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
    const result = [];
    while (boxes.length>0) {
        result.push(boxes[0]);
        boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
    }
    return result;
}

This code is written for the YOLOv8 model pretrained on the COCO dataset with 80 object classes. If you use a custom model with a different number of classes, replace "80" in the [...Array(80).keys()] line with the number of classes that your model detects.
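
The indexing in process_output may look cryptic, so here is a small sketch of how one candidate is read from the flat output array. It assumes the layout used above: (4 + number of classes) rows of 8400 columns flattened row by row, so the value in row r and column i is at position 8400 * r + i. The read_candidate function and the tiny array with 3 columns are invented for this illustration:

```javascript
// Reads candidate number "index" from a flat array that stores
// (4 + num_classes) rows of "cols" values each, row after row.
function read_candidate(output, index, num_classes, cols) {
    const at = (row) => output[cols * row + index];
    let class_id = 0, prob = 0;
    for (let c = 0; c < num_classes; c++) {
        if (at(4 + c) > prob) {
            class_id = c;
            prob = at(4 + c);
        }
    }
    return { xc: at(0), yc: at(1), w: at(2), h: at(3), class_id, prob };
}

// Toy output: 2 classes and 3 candidates instead of 80 and 8400
const toy = [
    100, 200, 300,   // row 0: xc
    110, 210, 310,   // row 1: yc
     20,  40,  60,   // row 2: w
     10,  30,  50,   // row 3: h
    0.1, 0.9, 0.2,   // row 4: probability of class 0
    0.7, 0.3, 0.1    // row 5: probability of class 1
];
console.log(read_candidate(toy, 1, 2, 3));
// prints { xc: 200, yc: 210, w: 40, h: 30, class_id: 0, prob: 0.9 }
```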

Also, copy the helper functions that implement the "Intersection over union" algorithm and the array of COCO object class labels:

object_detector.js



function iou(box1,box2) {
    return intersection(box1,box2)/union(box1,box2);
}

function union(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    const box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)
}

function intersection(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const x1 = Math.max(box1_x1,box2_x1);
    const y1 = Math.max(box1_y1,box2_y1);
    const x2 = Math.min(box1_x2,box2_x2);
    const y2 = Math.min(box1_y2,box2_y2);
    // Clamp to zero, so that boxes that do not overlap
    // produce an empty intersection instead of a bogus area
    return Math.max(0,x2-x1)*Math.max(0,y2-y1)
}

const yolo_classes = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
    'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
    'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
    'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
];

Here I used the labels array for the model pretrained on the COCO dataset. If you use a different model, the labels should be different.
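
To get a feeling for the values that "Intersection over union" produces, here is a short worked example. It is a sketch with its own function names (overlap_area, box_area, box_iou), chosen so that they do not clash with the helpers above. This variant clamps the overlap to zero for boxes that do not intersect; without the clamp, two negative side lengths would multiply to a positive area:

```javascript
// Area of the overlap of two [x1,y1,x2,y2] boxes (0 if they do not overlap)
function overlap_area(a, b) {
    const w = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
    const h = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
    return Math.max(0, w) * Math.max(0, h);
}

function box_area(box) {
    return (box[2] - box[0]) * (box[3] - box[1]);
}

// Intersection over union: 1 for identical boxes, 0 for disjoint ones
function box_iou(a, b) {
    const inter = overlap_area(a, b);
    return inter / (box_area(a) + box_area(b) - inter);
}

const a = [0, 0, 10, 10];
console.log(box_iou(a, [0, 0, 10, 10]));    // prints 1
console.log(box_iou(a, [5, 0, 15, 10]));    // 50 / 150, about 0.33
console.log(box_iou(a, [20, 20, 30, 30]));  // prints 0
```

With the 0.7 threshold used in process_output, only the first pair would be treated as the same object.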

Finally, call this function for each video frame in the setInterval loop:

object_detector.js



interval = setInterval(async() => {
    context.drawImage(video,0,0);
    const input = prepare_input(canvas);
    const output = await run_model(input);
    const boxes = process_output(output, canvas.width, canvas.height);
},30)

The process_output function receives the raw model output and the dimensions of the canvas, which it uses to scale the bounding boxes to the original image size. (Remember that the model works with 640x640 images.)

As a result, the boxes array contains the bounding boxes for each detected object in the format [x1,y1,x2,y2,label,prob].
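
As a quick check of the scaling, here is the same arithmetic applied to a single box. The scale_box function is a hypothetical extract of the formulas used in process_output, and the numbers are chosen for the 960x540 sample video:

```javascript
// Converts a box from model coordinates (center, width, height on a
// 640x640 image) to corner coordinates on an image of the given size.
function scale_box(xc, yc, w, h, img_width, img_height) {
    const x1 = (xc - w / 2) / 640 * img_width;
    const y1 = (yc - h / 2) / 640 * img_height;
    const x2 = (xc + w / 2) / 640 * img_width;
    const y2 = (yc + h / 2) / 640 * img_height;
    return [x1, y1, x2, y2];
}

// A 320x320 box in the center of the model input, mapped to a 960x540 frame
console.log(scale_box(320, 320, 320, 320, 960, 540));
// prints [ 240, 135, 720, 405 ]
```

The box is square in model coordinates, but becomes wider than tall on the frame, because the frame was squeezed to 640x640 before detection.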

All that is left to do is to draw these boxes on top of the image on the canvas.

Draw bounding boxes

Now you need to write a function that uses the HTML5 canvas context API to draw a rectangle with an object class label for each bounding box. You can reuse the draw_image_and_boxes function that we wrote in each project in the previous article. This is how the original function looks:



function draw_image_and_boxes(file,boxes) {
    const img = new Image()
    img.src = URL.createObjectURL(file);
    img.onload = () => {
        const canvas = document.querySelector("canvas");
        canvas.width = img.width;
        canvas.height = img.height;
        const ctx = canvas.getContext("2d");
        ctx.drawImage(img,0,0);
        ctx.strokeStyle = "#00FF00";
        ctx.lineWidth = 3;
        ctx.font = "18px serif";
        boxes.forEach(([x1,y1,x2,y2,label]) => {
            ctx.strokeRect(x1,y1,x2-x1,y2-y1);
            ctx.fillStyle = "#00ff00";
            const width = ctx.measureText(label).width;
            ctx.fillRect(x1,y1,width+10,25);
            ctx.fillStyle = "#000000";
            ctx.fillText(label, x1, y1+18);
        });
    }
}

However, you can simplify it, because in this case you do not need to load the image from a file and display it on the canvas: the image is already there. You just need to pass the canvas to this function and draw boxes on it. Also, rename the function to draw_boxes, because it no longer draws the image. This is how you can modify it:

object_detector.js



function draw_boxes(canvas,boxes) {
    const ctx = canvas.getContext("2d");
    ctx.strokeStyle = "#00FF00";
    ctx.lineWidth = 3;
    ctx.font = "18px serif";
    boxes.forEach(([x1,y1,x2,y2,label]) => {
        ctx.strokeRect(x1,y1,x2-x1,y2-y1);
        ctx.fillStyle = "#00ff00";
        const width = ctx.measureText(label).width;
        ctx.fillRect(x1,y1,width+10,25);
        ctx.fillStyle = "#000000";
        ctx.fillText(label, x1, y1+18);
    });
}

The function receives the canvas with the current frame and the boxes array of objects detected in it.
The function sets up the fill, stroke and font styles.
Then it traverses the boxes array. It draws a green bounding rectangle around each detected object and the class label. To display the class label, it uses black text on a green background.

Now you can call this function for each frame in the setInterval loop this way:

object_detector.js



interval = setInterval(async() => {
    context.drawImage(video,0,0);
    const input = prepare_input(canvas);
    const output = await run_model(input);
    const boxes = process_output(output, canvas.width, canvas.height);
    draw_boxes(canvas,boxes)
},30)

However, the code written this way will not work correctly. The draw_boxes call is the last line of the loop body, so right after it the next iteration starts and overwrites the displayed boxes with the context.drawImage(video,0,0) call. So, the boxes disappear almost immediately after they are drawn. You need to call drawImage first and draw_boxes right after it, but the current code does this in the opposite order. We will use the following trick to fix it:

object_detector.js



let interval
let boxes = [];
video.addEventListener("play", async() => {
    const canvas = document.querySelector("canvas");
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const context = canvas.getContext("2d");
    interval = setInterval(async() => {
        context.drawImage(video,0,0);
        draw_boxes(canvas, boxes);
        const input = prepare_input(canvas);
        const output = await run_model(input);
        boxes = process_output(output, canvas.width, canvas.height);
    },30)
});

In this code snippet, I declared boxes as a global variable before the "play" event handler. It is an empty array by default. This way, you can run the draw_boxes function right after drawing the video frame on the canvas with the drawImage function. On the first iteration it will draw nothing on top of the image, but then it will run the model and overwrite the boxes array with the detected objects. Then it will draw the bounding boxes of these objects at the beginning of the next iteration. Assuming that an iteration runs every 30 milliseconds, the difference between the previous and current frames won't be significant.
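
The effect of this trick is that the boxes drawn on a frame always come from an earlier frame. The following simulation shows this lag using plain values instead of a real canvas and a synchronous stand-in instead of the model. The simulate_loop and fake_detect names are hypothetical:

```javascript
// Simulates the loop: each frame is drawn together with the boxes
// found on the previous frame, then detection runs on the current one.
function simulate_loop(frames, detect) {
    let boxes = [];
    const drawn = [];
    for (const frame of frames) {
        drawn.push({ frame, boxes });  // drawImage + draw_boxes
        boxes = detect(frame);         // result is used on the next frame
    }
    return drawn;
}

// Stand-in detector that just labels the frame it has seen
const fake_detect = (frame) => ["box from " + frame];

console.log(simulate_loop(["f1", "f2", "f3"], fake_detect));
// f1 is drawn with no boxes, f2 with the box from f1, f3 with the box from f2
```

In the real code the model call is asynchronous, so the lag can be longer than one frame when inference takes more than one interval.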

Finally, if everything is implemented correctly, you will see the video with bounding boxes around detected objects.

When you run this, you may experience annoying delays in the video. The machine learning model inference in the run_model function is a CPU intensive operation that can take more than 30 milliseconds. That is why it interrupts the video. The delay duration depends on your CPU power. Fortunately, there is a way to fix it, which we will cover below.

Running several tasks in parallel in JavaScript

JavaScript is single threaded by default. It has a main thread, sometimes called the UI thread, and all your code runs in it. However, it's not a good practice to block the UI with CPU intensive tasks, like machine learning model execution. You should move CPU intensive tasks to separate threads so that they do not block the user interface.

A common way to create threads in JavaScript is the WebWorkers API. Using this API, you can create a Worker object and pass a JavaScript file to it, like in this code:



const worker = new Worker("worker.js");

The worker object will run the worker.js file in a separate thread. All code inside this file will run in parallel with the user interface.

The worker threads spawned this way do not have access to the web page elements or to any code defined in the page. The same is true for the main thread: it does not have access to the content of the worker thread. To communicate between threads, the WebWorkers API uses messages. You can send a message with data to a thread and listen for messages from it.

The worker thread can do the same: it can send messages to the main thread and listen for messages from the main thread. The communication defined this way is asynchronous.

For example, to send a message to the worker thread that you created before, you should run:



worker.postMessage(data)

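The data argument can be any JavaScript value that the structured clone algorithm can copy, such as numbers, strings, arrays, typed arrays and plain objects.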
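
It is worth keeping in mind that the data is copied, not shared: the receiver gets its own copy of the value, and values that cannot be copied, such as functions, are rejected. As a small illustration, assuming an environment that provides the global structuredClone function (modern browsers and Node.js 17 or later), the same copying can be tried without any worker:

```javascript
// postMessage copies its argument with the structured clone algorithm.
// structuredClone applies the same algorithm directly.
const input = { frame: 1, pixels: new Float32Array([0.1, 0.2, 0.3]) };
const copy = structuredClone(input);

copy.pixels[0] = 0.9;            // changing the copy...
console.log(input.pixels[0]);    // ...does not affect the original
console.log(copy !== input);     // prints true

// Values that cannot be cloned, such as functions, are rejected
let rejected = false;
try {
    structuredClone({ callback: () => {} });
} catch (error) {
    rejected = true;             // a DataCloneError is thrown
}
console.log(rejected);           // prints true
```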

To listen for messages from the worker thread, you need to define the onmessage event handler:



worker.onmessage = (event) => {
    console.log(event.data);
};

When a message comes from the worker, it triggers the function and passes the incoming message in the event argument. The event.data property contains the data that the worker thread sent using the postMessage function.

You can read more theory about WebWorkers in the documentation, and then we will move to practice.

Let's solve the problem with video delays. The most time and resource consuming function that we have is run_model. So, we will move it to a new worker thread. Then, we will send the input to this thread and receive the output from it. It will work in the background for each frame while the video is playing.

Running the model in a background thread

Let's create a worker.js file and move the code that is required to run the model to this file:

worker.js



importScripts("https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js");

async function run_model(input) {
    const model = await ort.InferenceSession.create("./yolov8n.onnx");
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}

The first line imports the ONNX runtime JavaScript API library, because, as said before, the worker thread does not have access to the web page or to anything imported in it. The importScripts function is used to import external scripts into the worker thread.

The JavaScript ONNX API library imported here contains only high level JavaScript functions, but not the ONNX runtime library itself. The ONNX runtime library for JavaScript is a WebAssembly build of the original ONNX runtime, which is written in C++. When you import the ort.min.js file and open a web page with it, it checks if the real ONNX library exists in the project folder, and if not, it automatically downloads the ort-wasm-simd.wasm file to your web browser. I experienced a problem with it: when run from a web worker, it does not download this file. I think the best quick fix for this is to manually download the ort-wasm-simd.wasm file from the repository and put it in the project folder.

Following this, I copied the run_model function from object_detector.js.

Now we need to send input to this script from the main UI thread, where all other code works. To do this, we need to create a new worker in object_detector.js. You can do this at the beginning:

object_detector.js



const worker = new Worker("worker.js")

Then, instead of the run_model call, you need to post a message with the input to this worker.

object_detector.js



interval = setInterval(() => {
        context.drawImage(video,0,0, canvas.width, canvas.height);
        draw_boxes(canvas, boxes);
        const input = prepare_input(canvas);
        worker.postMessage(input);
//        boxes = process_output(output, canvas.width, canvas.height);
    },30)

Here I sent the input to the worker using the postMessage function and commented out the code after it, because we should run it only after the worker processes the input and returns the output. You can just remove this line, because it will be used later in another function, where we will process the messages from the worker thread.

Let's return to the worker now. It should receive the input that you sent. To receive messages, you need to define the onmessage handler. Let's add it to worker.js:

worker.js



onmessage = async(event) => {
    const input = event.data;
    const output = await run_model(input);
    postMessage(output);
}

This is how the event handler for messages from the main thread should be implemented in the worker thread. The handler is defined as an async function. When the message comes, it extracts the data from the message to the input variable. Then it calls the run_model with this input. Finally, it sends the output from the model to the main thread as a message, using the postMessage function.

When the model has returned the output and the worker has sent it to the main thread as a message, the main thread should receive this message and process the output from the model. To do this, you need to define the onmessage handler for the worker in object_detector.js.

object_detector.js



worker.onmessage = (event) => {
    const output = event.data;
    const canvas = document.querySelector("canvas");
    boxes =  process_output(output, canvas.width, canvas.height);
};

Here, when the output from the model comes from the worker thread, you process it using the process_output function and save the result to the boxes global variable. So, the new boxes will be available to draw.

Almost done, but one more important thing remains. The message flow between the main and worker threads is asynchronous, so the main thread will not wait until run_model in the worker thread finishes and will continue sending new frames to the worker thread every 30 milliseconds. It can result in a huge request queue, especially if the user has a slow CPU. I recommend not sending new requests to the worker thread while it is still working on the current one. This can be implemented the following way:

object_detector.js



let busy = false;
interval = setInterval(() => {
    context.drawImage(video,0,0);
    draw_boxes(canvas, boxes);
    const input = prepare_input(canvas);
    if (!busy) {
        worker.postMessage(input);
        busy = true;
    }
},30)

worker.onmessage = (event) => {
    const output = event.data;
    const canvas = document.querySelector("canvas");
    boxes =  process_output(output, canvas.width, canvas.height);
    busy = false;
};

Here I defined a busy variable that acts as a semaphore. When the main thread sends a message to the worker, it sets the busy variable to true to signal that message processing has started. Then, all subsequent requests are ignored until the previous one is processed and returned. At that moment, the busy variable is reset to false.
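
The behavior of this flag can be checked in isolation with a short simulation. It uses a hypothetical fake_worker object that only records what it receives, instead of a real Worker, and made-up send_frame and on_result functions that mirror the two places where the flag changes:

```javascript
// Stand-in for a real Worker: it only records posted messages
const fake_worker = {
    received: [],
    postMessage(data) { this.received.push(data); }
};

let busy = false;

// Called on every tick of the frame loop
function send_frame(input) {
    if (busy) return false;     // the worker is still processing, skip
    fake_worker.postMessage(input);
    busy = true;
    return true;
}

// Called when the worker reports that it has finished
function on_result() {
    busy = false;
}

send_frame("frame 1");   // sent
send_frame("frame 2");   // skipped, the worker is busy
on_result();             // the worker finished "frame 1"
send_frame("frame 3");   // sent
console.log(fake_worker.received);  // prints [ 'frame 1', 'frame 3' ]
```

Skipped frames are simply dropped, which is acceptable here because a newer frame arrives 30 milliseconds later.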

The process that we defined will work in parallel with the main video playing loop. Here is the full source of object_detector.js:

object_detector.js



const video = document.querySelector("video");

const worker = new Worker("worker.js");
let boxes = [];
let interval
let busy = false;
video.addEventListener("play", () => {
    const canvas = document.querySelector("canvas");
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const context = canvas.getContext("2d");
    interval = setInterval(() => {
        context.drawImage(video,0,0);
        draw_boxes(canvas, boxes);
        const input = prepare_input(canvas);
        if (!busy) {
            worker.postMessage(input);
            busy = true;
        }
    },30)
});

worker.onmessage = (event) => {
    const output = event.data;
    const canvas = document.querySelector("canvas");
    boxes =  process_output(output, canvas.width, canvas.height);
    busy = false;
};

video.addEventListener("pause", () => {
    clearInterval(interval);
});

const playBtn = document.getElementById("play");
const pauseBtn = document.getElementById("pause");
playBtn.addEventListener("click", () => {
    video.play();
});
pauseBtn.addEventListener("click", () => {
    video.pause();
});

function prepare_input(img) {
    const canvas = document.createElement("canvas");
    canvas.width = 640;
    canvas.height = 640;
    const context = canvas.getContext("2d");
    context.drawImage(img, 0, 0, 640, 640);
    const data = context.getImageData(0,0,640,640).data;
    const red = [], green = [], blue = [];
    for (let index=0;index<data.length;index+=4) {
        red.push(data[index]/255);
        green.push(data[index+1]/255);
        blue.push(data[index+2]/255);
    }
    return [...red, ...green, ...blue];
}

function process_output(output, img_width, img_height) {
    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(yolo_classes.length).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }
    boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
    const result = [];
    while (boxes.length>0) {
        result.push(boxes[0]);
        boxes = boxes.filter(box => iou(boxes[0],box)<0.7 || boxes[0][4] !== box[4]);
    }
    return result;
}

function iou(box1,box2) {
    return intersection(box1,box2)/union(box1,box2);
}

function union(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    const box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)
}

function intersection(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const x1 = Math.max(box1_x1,box2_x1);
    const y1 = Math.max(box1_y1,box2_y1);
    const x2 = Math.min(box1_x2,box2_x2);
    const y2 = Math.min(box1_y2,box2_y2);
    // Clamp to zero, so that boxes that do not overlap
    // produce an empty intersection instead of a bogus area
    return Math.max(0,x2-x1)*Math.max(0,y2-y1)
}

function draw_boxes(canvas,boxes) {
    const ctx = canvas.getContext("2d");
    ctx.strokeStyle = "#00FF00";
    ctx.lineWidth = 3;
    ctx.font = "18px serif";
    boxes.forEach(([x1,y1,x2,y2,label]) => {
        ctx.strokeRect(x1,y1,x2-x1,y2-y1);
        ctx.fillStyle = "#00ff00";
        const width = ctx.measureText(label).width;
        ctx.fillRect(x1,y1,width+10,25);
        ctx.fillStyle = "#000000";
        ctx.fillText(label, x1, y1+18);
    });
}

const yolo_classes = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
    'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
    'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
    'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
];

And this is the worker thread code:

worker.js



importScripts("https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js");

onmessage = async(event) => {
    const input = event.data;
    const output = await run_model(input);
    postMessage(output);
}

async function run_model(input) {
    const model = await ort.InferenceSession.create("./yolov8n.onnx");
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}

Also, you can remove the ONNX runtime library import from the index.html, because it's imported in the worker.js. This is the final index.html file:

index.html



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Object detector</title>
</head>
<body>
<video controls style="display:none" src="sample.mp4"></video>
<br/>
<canvas></canvas><br/>
<button id="play">Play</button>&nbsp;
<button id="pause">Pause</button>
<script src="object_detector.js" defer></script>
</body>
</html>

If you run the index.html file now in a web server, you should see the following result.

Conclusion

In this article I showed how to detect objects in video using the YOLOv8 neural network right in a web browser, without any backend. We used the <video> HTML element to load the video. Then we used the HTML5 Canvas to capture each individual frame and convert it to an input tensor for the YOLOv8 model. Finally, we sent this input tensor to the model and received the array of detected objects.

In addition, we discovered how to run several tasks in parallel in JavaScript using web workers. This way we moved the machine learning model execution code to a background thread, so that this CPU-intensive task does not block the user interface.

You can find the full source code of this article in this repository.

Using the algorithm explained in this post, you can detect objects not only in video files but also in other video sources, for example in a video from a web camera. All you need to change in this project is to set a web camera as the source for the <video> element. Everything else stays the same. This is just a few lines of code. You can read how to connect the webcam to the video element in this article.

The project created in this article is not a complete production-ready solution. There is a lot to improve here. For example, you can increase the speed and accuracy by using object tracking algorithms, which work faster than running the neural network on each frame. Instead of running the neural network on each frame to detect the same objects, you can run it only on the first frame to get the initial object positions and then use object tracking algorithms to track the detected bounding boxes on the subsequent frames. Read more about object tracking methods here. I will write more about this in the next articles.

Follow me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

How to create YOLOv8-based object detection web service using Python, Julia, Node.js, JavaScript, Go and Rust

Andrey Germanov — Sat, 13 May 2023 14:42:37 +0000

Introduction
YOLOv8 deployment options
Export YOLOv8 model to ONNX
Explore object detection on image using ONNX
    Prepare the input
    Run the model
    Process the output
        Intersection over Union
        Non-maximum Suppression
Create a web service on Python
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Julia
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Node.js
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on JavaScript
    Setup the project
    Prepare the input
    Run the model and process the output
Create a web service on Go
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Rust
    Setup the project
    Prepare the input
    Run the model
    Process the output
Conclusion

Introduction

This is the second part of my article about the YOLOv8 neural network. In the previous article I provided a practical introduction to this model and its common API. Then I showed how to create a web service that detects objects in images using Python and the official YOLOv8 library based on PyTorch.

In this article, I am going to show how to work with the YOLOv8 model at a low level, without PyTorch and the official API. It opens a lot of new opportunities for deployment. Using the concepts and examples of this post, you will be able to create AI-powered object detection services that use ten times fewer resources, and you will be able to create these services not only in Python, but in most other programming languages. In particular, I will show how to create the YOLOv8-powered web service in Julia, Node.js, JavaScript, Go and Rust.

As a base, we will use the web service developed in the previous article, which is available in this repository. We will just rewrite the backend of this web service in different languages. That is why you should read the first article before continuing with this one.

YOLOv8 deployment options

The YOLOv8 neural network was initially created using the PyTorch framework and is distributed as a set of ".pt" files. We used the Ultralytics API to train these models or make predictions with them. To run them, you need an environment with Python and PyTorch.

PyTorch is a great framework to design, train and evaluate neural network models. In addition, it has tools to prepare or even generate datasets to train the models, and many other great utilities. However, we do not need all this in production. If we talk about YOLOv8, then all you need in production is to run the model with an input image and receive the resulting bounding boxes. However, YOLOv8 is implemented in Python. Does it mean that all programmers who want to use this great object detector must become Python programmers? Does it mean that they must rewrite their applications in Python or integrate them with Python code? Fortunately not. The Ultralytics API has a great export function to convert any YOLOv8 model to a format that can be used by external applications.

The following formats are supported at the moment:

Format	`format` Argument
TorchScript	`torchscript`
ONNX	`onnx`
OpenVINO	`openvino`
TensorRT	`engine`
CoreML	`coreml`
TF SavedModel	`saved_model`
TF GraphDef	`pb`
TF Lite	`tflite`
TF Edge TPU	`edgetpu`
TF.js	`tfjs`
PaddlePaddle	`paddle`

For example, CoreML is a neural network format that can be used in iOS applications that run on the iPhone.

Using the links in this table, you can read an overview of each of these formats.

The most interesting of them for us today is ONNX. ONNX is an open format for neural network models, and ONNX Runtime, created by Microsoft, is a lightweight library that can run models in this format on a wide range of platforms and programming languages. It is not a framework, just a shared library with a C API. It's just 16 MB in size for Linux, but it has interface bindings for most programming languages, including Python, PHP, JavaScript, Node.js, C++, Go and Rust. It has a simple API, and if you wrote the ONNX code to run a model in one programming language, it will not be difficult to rewrite it in another, as we will see today.

To follow the sections starting from this one, you need to have Python and Jupyter Notebook installed.

Export YOLOv8 model to ONNX

First, let's load the YOLOv8 model and export it to the ONNX format to make it usable. Run the Jupyter notebook and execute the following code in it.



from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.export(format="onnx")

In the code above, you loaded the medium-sized YOLOv8 model for object detection and exported it to the ONNX format. This model is pretrained on the COCO dataset and can detect 80 object classes.

After running this code, you should see the exported model in a file with the same name and the .onnx extension. In this case, you will see the yolov8m.onnx file in the folder where you ran this code.

Before writing a web service based on ONNX, let's explore how this library works in a Jupyter notebook to understand the main concepts.

Explore object detection on image using ONNX

Now that you have a model, let's use ONNX to work with it. For simplicity, we will start with Python, because we already have a Python web application that uses the PyTorch and Ultralytics APIs. So, it will be easier to move it to ONNX.

Install the ONNX runtime library for Python by running the following command in your Jupyter notebook:



!pip install onnxruntime

and import it:



import onnxruntime as ort

We set the ort alias for it. Remember this abbreviation, because in other programming languages you will often see ort instead of ONNX Runtime.

The ort module is the root of the ONNX API. The main object of this API is the InferenceSession, which is used to instantiate a model and run predictions on it. Model instantiation is very similar to what we did before with Ultralytics:



model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])

Here we loaded the model, but from the ".onnx" file instead of ".pt". And now it's ready to run.

And this is the moment where the similarities between Ultralytics and ONNX end. If you remember, with Ultralytics you just ran outputs = model.predict("image_file") and received the result. The smart predict method did the following for you automatically:

1. Read the image from file
2. Convert it to the format of the YOLOv8 neural network input layer
3. Pass it through the model
4. Receive the raw model output
5. Parse the raw model output
6. Return structured information about detected objects and their bounding boxes

The ONNX session object has a similar method, run, but it implements only steps 3 and 4. Everything else is up to you, because ONNX does not know that this is a YOLOv8 model. It does not know which input this neural network expects and what the raw output of this model means. This is a universal API for any kind of neural network; it knows nothing about concrete use cases like object detection in images.

In terms of ONNX, the neural network is a black box that receives a multidimensional array of float numbers as an input and transforms it into another multidimensional array of numbers. It does not know which numbers should be in the input or what the numbers in the output mean. So, what can we do with it?

Fortunately, things are not so bad, and there is something we can explore. The shapes of the input and output layers of a neural network are fixed: they are defined when the neural network is created, and information about them is stored in the model.

The ONNX session object has a helpful method get_inputs() to get the information about inputs that this model expects to receive and the get_outputs() to get the information about the outputs, that the model returns after processing the inputs.

Let's get the inputs first:



inputs = model.get_inputs()
len(inputs)

Here we got the array of inputs and displayed the length of this array. The result is obvious: the network expects to get a single input. Let's get it:



input = inputs[0]

The input object has three fields: name, type and shape. Let's get these values for our YOLOv8 model:



print("Name:",input.name)
print("Type:",input.type)
print("Shape:",input.shape)

And this is the output that you will get:



Name: images
Type: tensor(float)
Shape: [1, 3, 640, 640]

This is what we can discover from this:

The name of the expected input is images, which is obvious: the YOLOv8 model receives images as an input.
The type of the input is a tensor of float numbers. A tensor can have many definitions, but from the practical point of view that matters to us now, it is a multidimensional array of float numbers. So, we can deduce that we need to convert our image to a multidimensional array of float numbers.
The shape shows the dimensions of this tensor. Here, you see that this array should be four-dimensional. It should hold a single image (1) that contains 3 matrices of 640x640 float numbers. What numbers should be in these matrices? The color components. As you probably know, each color pixel has Red, Green and Blue components, and each color component can have a value from 0 to 255. You can also deduce that the image must have a 640x640 size. Finally, there should be 3 matrices: one 640x640 matrix that contains the red component of each pixel, one for green and one for blue.
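
To make the shape more tangible, here is a small sketch that counts how many numbers one input tensor holds and shows where a given color component would sit if the tensor were stored as one flat array. The flat_index helper is hypothetical, and the row-major layout is an assumption (it is the default order in NumPy).

```python
from math import prod

# Shape reported by get_inputs() for this model: batch, channels, height, width
shape = [1, 3, 640, 640]

# Total number of float values that the model expects in one input tensor
total_values = prod(shape)

def flat_index(channel, y, x, height=640, width=640):
    """Position of one value if the tensor is stored as a flat row-major array.

    channel: 0 = red, 1 = green, 2 = blue (the order used in this article).
    """
    return channel * height * width + y * width + x

print(total_values)         # 1228800
print(flat_index(0, 0, 0))  # 0: red component of the top-left pixel
print(flat_index(1, 0, 0))  # 409600: the green matrix starts after 640*640 reds
```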

Now you have enough observations to understand what needs to be done in the code to prepare the input data.

Prepare the input

We need to load an image, resize it to 640x640, extract the Red, Green and Blue components of each pixel and construct 3 matrices of intensities of the appropriate colors.

Let's do it using the Pillow Python package that we already used before. Ensure that it's installed:



!pip install pillow

As an example, we will use the cat_dog.jpg image that we used in the previous article:

Let's load and resize it:



from PIL import Image

img = Image.open("cat_dog.jpg")
img_width, img_height = img.size
img = img.resize((640,640))

First, you imported the Image class from the Pillow library. Then you created the img object from the cat_dog.jpg file. Then you saved the original size of the image to the img_width and img_height variables, which will be needed later. Finally, you resized it, providing the new size as a (640,640) tuple.

Now we need to extract each color component of each pixel and construct 3 matrices from them. But there is one thing that can lead to inconsistencies later. A pixel can have four color channels: Red, Green, Blue and Alpha. The alpha channel describes the transparency of a pixel. We do not need the Alpha channel for YOLOv8 predictions. Let's remove it:



img = img.convert("RGB")

An image with an Alpha channel has the "RGBA" color model. With this line, you converted it to "RGB". This way, you removed the alpha channel if the image had one.

Now it's time to create 3 matrices of color channel values. We could do this manually, but Python has great interoperability between libraries. The NumPy library, which is usually used to work with multidimensional arrays, can load the Pillow image object as an array as simply as this:



import numpy as np

input = np.array(img)

Here, you imported NumPy and just loaded the image to the input NumPy array. Let's see the shape of this array now:



input.shape



(640, 640, 3)

Almost fine, but the dimensions go in the wrong order. We need to put 3 at the beginning. The transpose function can reorder the dimensions of a NumPy array:



input = input.transpose(2,0,1)
input.shape



(3, 640, 640)

The numbering of dimensions starts from 0. So, we had 0=640, 1=640, 2=3. Then, using the transpose function, we moved dimension number 2 to the first place and received the shape (3,640,640).
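
If it is not obvious what moving the channel dimension to the front does to the data, the following plain-Python sketch regroups a tiny 2x2 image the same way. The hwc_to_chw helper and the pixel values are made up for illustration; in the real code NumPy's transpose does this work.

```python
# A tiny 2x2 "image" in (height, width, channel) order, like np.array(img) returns.
# Each pixel is an [R, G, B] triple.
hwc = [
    [[10, 20, 30], [11, 21, 31]],
    [[12, 22, 32], [13, 23, 33]],
]

def hwc_to_chw(pixels):
    """Regroup (height, width, channel) data into (channel, height, width)."""
    channels = len(pixels[0][0])
    return [
        [[pixel[c] for pixel in row] for row in pixels]
        for c in range(channels)
    ]

chw = hwc_to_chw(hwc)
print(chw[0])  # red matrix:   [[10, 11], [12, 13]]
print(chw[1])  # green matrix: [[20, 21], [22, 23]]
print(chw[2])  # blue matrix:  [[30, 31], [32, 33]]
```

Each of the three resulting matrices holds one color component for every pixel, which is the layout that the model input expects.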

But we need to add one more dimension to the beginning to make it (1,3,640,640). The reshape function can do this:



input = input.reshape(1,3,640,640)

Now we have the correct input shape, but if you look at the contents of this array, for example at the red component of the first pixel:



input[0,0,0,0]

you'll probably see an integer:



71

but float numbers are required. Moreover, as a rule, the numbers for machine learning must be scaled, e.g. to a range from 0 to 1. Knowing that a color value can be in a range from 0 to 255, we can scale all pixels to the 0-1 range by dividing them by 255.0. NumPy allows doing this in a single line of code:



input = input/255.0

input[0,0,0,0]



0.2784313725490196

In the code above, you divided all numbers in array and displayed the first of them: the red color component intensity for the first pixel. So, this is how the input data should look.

Run the model

Now, before running the prediction, let's see which output the YOLOv8 model should return. As said above, this can be done using the get_outputs() method of the ONNX session object. The return value of this method has the same type as the value of get_inputs(), because, as I said before, the only work of a neural network is to transform one array of numbers provided as an input into another array of numbers. So, let's see the form of the output of the pretrained YOLOv8 model:



outputs = model.get_outputs()
output = outputs[0]
print("Name:",output.name)
print("Type:",output.type)
print("Shape:",output.shape)



Name: output0
Type: tensor(float)
Shape: [1, 84, 8400]

ONNX is a universal platform to run neural networks of any kind. That is why it assumes that a network can have many inputs and many outputs, and it works with an array of inputs and an array of outputs, even if these arrays have only a single item. YOLOv8 has a single output, which is the first item of the outputs object.

Here you see that the output has the name output0, that it is also a tensor of float numbers, and that its shape is [1,84,8400], which means that this is a single 84x8400 matrix nested in a single array. In practice, it means that the YOLOv8 network returns 8400 bounding boxes and each bounding box has 84 parameters. It's a little bit ugly that each bounding box is a column here rather than a row. It's a technical requirement of the neural network algorithm. I think it would be better to transpose it to 8400x84, so that it is clear that there are 8400 rows that match detected objects and that each row is a bounding box with 84 parameters.
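
To see what "each bounding box is a column" means in practice, here is a toy sketch in plain Python. The 6x3 matrix and its values are invented for illustration (the real output is 84x8400, and the meaning of the rows is explained later in this article); zip(*raw) plays the role of the transpose.

```python
# A toy "raw output" with 6 rows (4 box values + 2 class probabilities)
# and 3 columns (3 candidate boxes). The real one is 84 x 8400.
raw = [
    [100.0, 200.0, 300.0],   # x_center of each box
    [110.0, 210.0, 310.0],   # y_center
    [ 50.0,  60.0,  70.0],   # width
    [ 40.0,  45.0,  55.0],   # height
    [  0.9,   0.1,   0.2],   # probability of class 0
    [  0.05,  0.8,   0.1],   # probability of class 1
]

# zip(*raw) walks the columns, so every column becomes a row
boxes = [list(column) for column in zip(*raw)]

print(len(boxes), len(boxes[0]))  # 3 6
print(boxes[1])                   # [200.0, 210.0, 60.0, 45.0, 0.1, 0.8]
```

After the transposition, everything that describes one box sits in one row, which is much easier to loop over.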

We will discuss why there are so many parameters for a single bounding box later. First, we should run the model to get the data for this output. We have everything for this now.

To run a prediction with the YOLOv8 model, we need to execute the run method, which has the following signature:



model.run(output_names,inputs)

output_names - the array of names of the outputs that you want to receive. For the YOLOv8 model, it will be an array with a single item.
inputs - the dictionary of inputs that you pass to the network, in the format {name:tensor}, where name is the name of the input and tensor is the image data array that we prepared before.

To run the prediction for the data that you prepared, you can run the following:



outputs = model.run(["output0"], {"images":input})
len(outputs)

As you saw earlier, the only output of this model has the name output0, and the name of the only input is images. You prepared the data tensor for the input in the input variable.

If everything went well, it will display that the length of the received outputs array is 1, which means that you have only a single output. However, if you receive an error saying that the input must be in float format, convert it to float32 using the following line:



input = input.astype(np.float32)

and then run again.

Now we are close to the most interesting part of the work: processing the output.

Process the output

There is only a single output, so we can extract it from outputs:



output = outputs[0]
output.shape



(1, 84, 8400)

So, as you see, it returned the output of correct shape. As the first dimension has only single item, we can just get it:



output = output[0]
output.shape



(84, 8400)

We turned it into a matrix with 84 rows and 8400 columns. As I said before, it has a transposed form, which is not very convenient to work with, so let's transpose it:



output = output.transpose()
output.shape



(8400, 84)

Now it's clearer: 8400 rows with 84 parameters each. 8400 is the maximum number of bounding boxes that the YOLOv8 model can detect, and it returns 8400 rows for any image regardless of how many objects are really detected on it, because the output shape of the neural network is fixed and defined during the neural network design. It can't be variable. So, it returns 8400 rows every time, but most of these rows contain just garbage. How do we detect which of these rows have meaningful data and which of them are garbage? To do that, we need to explore the 84 parameters that each of these rows has.

The first 4 elements are the coordinates of the bounding box, and all others are the probabilities of all object classes that this model can detect. The pretrained model that you use in this tutorial can detect 80 object classes, which is why each bounding box has 84 parameters: 4+80. If you use another model that, for example, was trained to detect 3 object classes, then it will have 7 parameters in a row, because of 4+3.

For example, let's display row number 0:
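
The arithmetic behind both numbers can be sketched in a few lines. The row_length and candidate_boxes helpers are hypothetical. The second one relies on an assumption that is not stated in this article: that the detection head makes one prediction per cell on three grids with strides 8, 16 and 32, which for a 640x640 input gives 80x80 + 40x40 + 20x20 cells.

```python
def row_length(num_classes):
    """4 box values (x_center, y_center, width, height) plus one probability per class."""
    return 4 + num_classes

def candidate_boxes(input_size=640, strides=(8, 16, 32)):
    """Number of candidate boxes, assuming one prediction per grid cell
    on three grids with the given strides (an assumption about the model head)."""
    return sum((input_size // stride) ** 2 for stride in strides)

print(row_length(80))     # 84
print(row_length(3))      # 7
print(candidate_boxes())  # 8400
```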



row = output[0]
print(row)



[     5.1182      8.9662      13.247      19.459  2.5034e-06  2.0862e-07  5.6624e-07  1.1921e-07  2.0862e-07  1.1921e-07  1.7881e-07  1.4901e-07  1.1921e-07  2.6822e-07  1.7881e-07  1.1921e-07  1.7881e-07  4.1723e-07  5.6624e-07  2.0862e-07  1.7881e-07  2.3842e-07  3.8743e-07  3.2783e-07  1.4901e-07  8.9407e-08
  3.8743e-07  2.9802e-07  2.6822e-07  2.6822e-07  2.3842e-07  2.0862e-07  5.9605e-08  2.0862e-07  1.4901e-07  1.1921e-07  4.7684e-07  2.6822e-07  1.7881e-07  1.1921e-07  8.9407e-08  1.4901e-07  1.7881e-07  2.6822e-07  8.9407e-08  2.6822e-07  3.8743e-07  1.4901e-07  2.0862e-07  4.1723e-07  1.9372e-06  6.5565e-07
  2.6822e-07  5.3644e-07  1.2815e-06  3.5763e-07  2.0862e-07  2.3842e-07  4.1723e-07  2.6822e-07  8.3447e-07  8.9407e-08  4.1723e-07  1.4901e-07  3.5763e-07  2.0862e-07  1.1921e-07  5.9605e-08  5.9605e-08  1.1921e-07  1.4901e-07  1.4901e-07  1.7881e-07  5.9605e-08  8.9407e-08  2.3842e-07  1.4901e-07  2.0862e-07
  2.9802e-07  1.7881e-07  1.1921e-07  2.3842e-07  1.1921e-07  1.1921e-07]

Here you see that this row represents a bounding box with coordinates [5.1182, 8.9662, 13.247, 19.459]. These values are coordinates of a center of this bounding box, the width and the height:

x_center = 5.1182
y_center = 8.9662
width = 13.247
height = 19.459

Let's slice out these variables from the row:



xc,yc,w,h = row[:4]

All other values are the probabilities that the detected object belongs to each of 80 classes. So, assuming that the array numbering starts from 0, the item number 4 contains the probability that the object belongs to class 0 (2.5034e-06), item number 5 contains the probability that the object belongs to class 1 (2.0862e-07) etc.

Now let's remove all garbage and parse this row to the format that we got in the previous article: [x1,y1,x2,y2,class_label,probability].

To calculate coordinates of bounding box corners you can use the following formulas:



x1 = xc-w/2
y1 = yc-h/2
x2 = xc+w/2
y2 = yc+h/2

But there is a very important reminder: do you remember that we scaled the image to 640x640 in the beginning? It means that these coordinates are returned on the assumption that the image has this size. To get the coordinates of this bounding box for the original image, we need to scale them in proportion to the dimensions of the original image. We saved the original width and height to the img_width and img_height variables, and to scale the corners of the bounding box, we need to modify the formulas:



x1 = (xc - w/2) / 640 * img_width
y1 = (yc - h/2) / 640 * img_height
x2 = (xc + w/2) / 640 * img_width
y2 = (yc + h/2) / 640 * img_height
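
As a quick sanity check of these formulas, here is a self-contained sketch with round numbers. The to_corners function and the 1280x720 image size are made up for illustration; the formulas themselves are the ones shown above.

```python
def to_corners(xc, yc, w, h, img_width, img_height, model_size=640):
    """Convert a center/size box in model coordinates to corner
    coordinates in the original image."""
    x1 = (xc - w / 2) / model_size * img_width
    y1 = (yc - h / 2) / model_size * img_height
    x2 = (xc + w / 2) / model_size * img_width
    y2 = (yc + h / 2) / model_size * img_height
    return x1, y1, x2, y2

# A box that covers the right half of the 640x640 model input,
# mapped back to a hypothetical 1280x720 original image
print(to_corners(480, 320, 320, 640, img_width=1280, img_height=720))
# (640.0, 0.0, 1280.0, 720.0)
```

The box that filled the right half of the resized image also fills the right half of the original one, so the scaling behaves as expected.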

Then you need to find the object class with the maximum probability. You could do this in a loop, iterating over items 4 to 83 of this array and selecting the index of the item with the maximum probability value, but NumPy has convenient methods for this:



prob = row[4:].max()
class_id = row[4:].argmax()

print(prob, class_id)



2.503395e-06 0

The first line returns the maximum value of subarray from 4 until the end of the row. The second line returns the index of the element with this maximum value. So, here you see that the first probability has a maximum value, and it means that this bounding box belongs to class 0.
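
For comparison, the loop-based alternative mentioned above could look like the sketch below. The best_class function and the shortened sample row are made up for illustration; the NumPy one-liners remain the more convenient option.

```python
def best_class(row):
    """Return (class_id, probability) for the most probable class in a row.

    The first 4 items of the row are box coordinates, so the scan starts at 4.
    """
    class_id, prob = 0, row[4]
    for index in range(5, len(row)):
        if row[index] > prob:
            class_id, prob = index - 4, row[index]
    return class_id, prob

# A shortened row with 3 classes instead of 80
row = [5.1, 8.9, 13.2, 19.4, 0.02, 0.91, 0.07]
print(best_class(row))  # (1, 0.91)
```

Note the index - 4 shift: the position in the row has to be converted back to a class ID, which row[4:].argmax() does implicitly by working on the slice.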

To replace the class ID with a class label, you should have an array of classes that the model can predict. In the case of this model, these are the 80 classes from the COCO dataset. Here they are:



yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

If you use another custom-trained model, you can get this array from the YAML file that was used for training. You can read about the YAML files used to train YOLOv8 models in my previous article.

Then you can just get a class label by ID:



label = yolo_classes[class_id]

This is how you should parse each row of YOLOv8 model output.

However, this probability is too low, because 2.503395e-06 = 2.503395 / 1000000 = 0.000002503. So, this bounding box is perhaps just garbage that should be filtered out. I recommend filtering out all bounding boxes with a probability less than 0.5.

Let's write all the row parsing code above as a function, to parse any row this way:



def parse_row(row):
    xc,yc,w,h = row[:4]
    x1 = (xc-w/2)/640*img_width
    y1 = (yc-h/2)/640*img_height
    x2 = (xc+w/2)/640*img_width
    y2 = (yc+h/2)/640*img_height
    prob = row[4:].max()
    class_id = row[4:].argmax()
    label = yolo_classes[class_id]
    return [x1,y1,x2,y2,label,prob]

Now you can write code that parses all rows from the output and filters out the ones with low probability:



boxes = [row for row in [parse_row(row) for row in output] if row[5]>0.5]
len(boxes)

Here I used Python list comprehensions. The internal list:



[parse_row(row) for row in output]

is used to parse each row and return an array of parsed rows in the format [x1,y1,x2,y2,class_label,prob].

Then the external list is used to filter out all of these rows whose probability is not greater than 0.5:



[row for row in [((parsed_rows))] if row[5]>0.5]
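
If nested list comprehensions are hard to read, the filtering part can also be written as an explicit loop. This is only an equivalent sketch: the filter_boxes function and the sample rows are made up for illustration.

```python
def filter_boxes(parsed_rows, threshold=0.5):
    """Keep only the parsed rows whose probability (item 5) is above the threshold."""
    boxes = []
    for row in parsed_rows:
        if row[5] > threshold:
            boxes.append(row)
    return boxes

# Hypothetical parsed rows in the [x1, y1, x2, y2, label, prob] format
parsed_rows = [
    [261.3, 95.5, 461.2, 313.4, "dog", 0.92],
    [-1.5, -0.8, 11.7, 18.7, "person", 0.0000025],
    [139.6, 169.4, 255.1, 314.7, "cat", 0.90],
]
print(len(filter_boxes(parsed_rows)))  # 2
```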

After this, len(boxes) shows that only 20 boxes are left after filtering. This is much closer to the expected result than 8400, but it's still too many, because we have an image with only one cat and one dog. Curious what else was detected? Let's show this data:



[261.28302669525146, 95.53291285037994, 461.15666942596437, 313.4492515325546, 'dog', 0.9220365]
[261.16701192855834, 95.61400711536407, 460.9202187538147, 314.0579136610031, 'dog', 0.92195505]
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446]
[260.7873046875, 95.70514416694641, 461.4101188659668, 313.7423722743988, 'dog', 0.9269207]
[139.5556526184082, 169.4101345539093, 255.12585411071777, 314.7275745868683, 'cat', 0.8986903]
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
[139.68495998382568, 169.5753903388977, 255.12413234710692, 315.06962299346924, 'cat', 0.88975877]
[261.1445414543152, 95.70124578475952, 461.0543995857239, 313.6095304489136, 'dog', 0.926944]
[260.9405124664307, 95.77976751327515, 460.99450263977053, 313.57664155960083, 'dog', 0.9247296]
[260.49400663375854, 95.79500484466553, 461.3895306587219, 313.5762457847595, 'dog', 0.9034922]
[139.59658827781678, 169.2822597026825, 255.2673086643219, 314.9018738269806, 'cat', 0.88215613]
[139.46405625343323, 169.3733571767807, 255.28112654685975, 314.9132820367813, 'cat', 0.8780577]
[139.633131980896, 169.65343713760376, 255.49261894226075, 314.88970375061035, 'cat', 0.8653987]
[261.18754177093507, 95.68838310241699, 461.0297842025757, 313.1688747406006, 'dog', 0.9215225]
[260.8274451255798, 95.74608707427979, 461.32597131729125, 313.3906273841858, 'dog', 0.9093932]
[260.5131794929504, 95.89693665504456, 461.3481791496277, 313.24405217170715, 'dog', 0.8848127]
[139.4986301422119, 169.38371658325195, 255.34583129882813, 314.9019331932068, 'cat', 0.836439]
[139.55282192230223, 169.58951950073242, 255.61378440856933, 314.92880630493164, 'cat', 0.87574947]
[139.65414333343506, 169.62119138240814, 255.79856758117677, 315.1192432641983, 'cat', 0.8512477]
[139.86577434539797, 169.38782274723053, 255.5904968261719, 314.77193105220795, 'cat', 0.8271704]

All these boxes have high probability and their coordinates overlap each other. Let's draw these boxes on the image to see why.

The PIL package has the ImageDraw module, which allows drawing rectangles or other figures on top of images. Let's load the image again and create a drawing object for it:



from PIL import ImageDraw
img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img)

and draw each bounding box on the image using the created draw object in a loop:



for box in boxes:
    x1,y1,x2,y2,label,prob = box
    draw.rectangle((x1,y1,x2,y2),None,"#00ff00")

img

This code draws the green rectangles for each bounding box and displays the resulting image, which will look like this:

It draws all these 20 boxes on top of each other, so they look like just 2 boxes. As a human, you can see that all these 20 boxes belong to the same 2 objects. However, the neural network is not a human, and it thinks that it found 20 different cats and dogs that overlap each other, because it's theoretically possible that different objects on the image can overlap each other. Perhaps it sounds crazy, but this is how it works.

It's up to you to select which of these boxes should stay and which should be filtered out. How can you do this? You could select the box with the highest probability for the dog and the box with the highest probability for the cat and remove all others. However, this is not a general solution, because you can have images with several dogs and several cats at the same time. You need a general-purpose algorithm that removes all boxes that closely overlap each other. Fortunately, this algorithm already exists, and it's called non-maximum suppression. These are the steps that you should implement to make it work:

1. Create an empty resulting array that will contain a list of boxes that you want to keep.
2. Start a loop.
3. From the source boxes array, select the box with the highest probability and move it to the resulting array.
4. Compare the selected box with each other box from the source array and remove all of them that overlap the selected one too much.
5. If the source array contains more boxes, move to step 2 and repeat.

After the loop finishes, the source boxes array will be empty, and the resulting array will contain only different boxes. Now let's understand how to implement step 4: how to compare two boxes and find out that they overlap each other too much. To find it, we will use another algorithm: "Intersection over Union", or IoU. This algorithm is actually a formula:

IoU = Area of Intersection / Area of Union

The idea of this algorithm is:

Calculate the area of intersection of two boxes.
Calculate the area of their union.
Divide first by second.

The closer the result is to 1, the more the two boxes overlap. You can see this visually: the closer the area of intersection of two boxes is to the area of their union, the more they look like the same box. In the left example below the formula, the boxes overlap, but not too much, and the IoU in this case could be about 0.3. These two boxes can definitely be treated as different objects, even though they overlap. In the second example, the area of intersection is much closer to the area of the union, so the IoU is perhaps about 0.8 here. It is highly likely that one of these boxes should be removed. Finally, the boxes in the right example cover almost the same area, and definitely only one of them should stay.
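Written out, the formula and a small worked example look like this. The two boxes are made-up values chosen for easy arithmetic, not taken from the sample image: box A has corners (0,0) and (10,10), box B has corners (5,0) and (15,10).

```latex
\mathrm{IoU}(A, B) = \frac{\mathrm{area}(A \cap B)}{\mathrm{area}(A \cup B)}
                   = \frac{\mathrm{area}(A \cap B)}{\mathrm{area}(A) + \mathrm{area}(B) - \mathrm{area}(A \cap B)}

% Worked example: A = (0, 0, 10, 10), B = (5, 0, 15, 10)
\mathrm{area}(A \cap B) = (10 - 5)(10 - 0) = 50
\mathrm{area}(A \cup B) = 100 + 100 - 50 = 150
\mathrm{IoU}(A, B) = 50 / 150 \approx 0.33
```

A value of about 0.33 matches the "overlap, but not too much" case described above.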

Now let's implement both IoU and Non-Maximum suppression in code.

Intersection over union

1 Calculate the area of intersection



def intersection(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
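    return max(0, x2-x1) * max(0, y2-y1)

Here, we calculate the area of the intersection rectangle using its width (x2-x1) and height (y2-y1). If the boxes do not overlap, the width or the height is negative, so both are clamped to zero. Without the clamp, two negative values would multiply to a positive "area" for boxes that do not touch at all.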

2 Calculate the area of union



def union(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)

3 Divide first by second



def iou(box1,box2):
    return intersection(box1,box2)/union(box1,box2)
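A quick sanity check of these three steps on hand-made boxes can look like the sketch below. The box values are invented for illustration, the helper name area is my own, and this version of intersection clamps a negative width or height to zero so that disjoint boxes give an IoU of 0.

```python
def intersection(box1, box2):
    # Overlap rectangle; width and height are clamped at 0 for disjoint boxes.
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def area(box):
    # Hypothetical helper: area of a single [x1, y1, x2, y2] box.
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(box1, box2):
    inter = intersection(box1, box2)
    return inter / (area(box1) + area(box2) - inter)

same = iou([0, 0, 10, 10], [0, 0, 10, 10])        # identical boxes
partial = iou([0, 0, 10, 10], [5, 0, 15, 10])     # half of each box overlaps
disjoint = iou([0, 0, 10, 10], [20, 20, 30, 30])  # no overlap at all
print(same, round(partial, 2), disjoint)  # 1.0 0.33 0.0
```

Identical boxes give 1, disjoint boxes give 0, and everything else falls in between.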

Non-maximum suppression

So, we have an array of boxes in the boxes variable, and we need to keep only distinct items in it, using the iou function as the criterion of difference. Let's say that if the IoU of two boxes is less than 0.7, then both should stay. Otherwise, the one with the lower probability should be removed. Let's implement it:



boxes.sort(key=lambda x: x[5], reverse=True)
result = []
while len(boxes)>0:
    result.append(boxes[0])
    boxes = [box for box in boxes if iou(box,boxes[0])<0.7]

For convenience, in the first line, we sorted all boxes by probability in reverse order to move the boxes with the highest probabilities to the top.

Then the code defines the array for the resulting boxes. In a loop, it puts the first box (the one with the highest probability) into the resulting array, and on the next line it overwrites the boxes array with only those boxes whose IoU with the selected box is less than 0.7. The selected box itself is dropped too, because its IoU with itself is 1.

It continues doing that in a loop until the boxes contains no items.

After running it, you can print the result array:



print(result)



[
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446],
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
]

Now it has just 2 items, as it should. Non-maximum suppression did its magic and selected the boxes with the highest probabilities for the cat and the dog.
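If you want to try the algorithm in isolation, the same loop can be wrapped in a function and run on toy data. This is a minimal sketch: the function name non_max_suppression, the iou_threshold parameter and the candidate boxes are my own illustration, and the iou helper is a compact variant that clamps the intersection at zero.

```python
def iou(box1, box2):
    # Boxes are [x1, y1, x2, y2, label, prob]; only the corners are used here.
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def non_max_suppression(boxes, iou_threshold=0.7):
    # Work on a sorted copy so the caller's list is left untouched.
    boxes = sorted(boxes, key=lambda box: box[5], reverse=True)
    result = []
    while boxes:
        best = boxes[0]
        result.append(best)
        # Keep only boxes that differ enough from the selected one.
        # `best` has IoU 1 with itself, so it is dropped here as well.
        boxes = [box for box in boxes if iou(box, best) < iou_threshold]
    return result

candidates = [
    [0, 1, 10, 11, "dog", 0.80],    # near-duplicate of the next dog box
    [0, 0, 10, 10, "dog", 0.90],
    [50, 50, 60, 60, "cat", 0.70],  # far away from both dog boxes
]
kept = non_max_suppression(candidates)
print([(box[4], box[5]) for box in kept])  # [('dog', 0.9), ('cat', 0.7)]
```

The two dog boxes have an IoU of 90/110, about 0.82, so only the more probable one survives, while the distant cat box is untouched.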

So, finally, you did it! Do you see how much code you had to write instead of the single model.predict() line of the Ultralytics API? However, now you know how it really works, and knowing these algorithms makes you independent of the PyTorch environment. Now you can create applications that use YOLOv8 models in any programming language supported by ONNX, and I will show you how to do this.

In the next sections we will refactor the object detection web service, written in the previous article, to use ONNX instead of PyTorch. We will rewrite it in Python, Julia, Node.js, JavaScript, Go and Rust.

The first section with Python defines the project structure, the functions, and their relations, and then we will rewrite all these functions in other programming languages without changing the structure of the project.

The Python section is recommended for everyone; after it, you can move on to the sections for your chosen language. Using the defined project structure and algorithms, you will be able to write the web service in any other language that supports ONNX.

I assume that you are familiar with the languages that you choose and have all the required IDEs and tools to write, compile and run that code. I will focus only on ONNX and the algorithms described above, and will not teach you programming in these languages. Furthermore, I will not dive into their standard libraries. However, I will provide links to the API docs of all external packages and frameworks that we will use, and you should either know the APIs of these libraries or be able to learn them using that documentation.

Create a web service on Python

Setup the project

We will use the project, created in the previous article as a base. You can get it from this repository.

Create a new folder and copy the following files to it from the project above:

index.html - frontend
object_detector.py - backend
requirements.txt - list of external dependencies

Also copy the ONNX model yolov8m.onnx that you exported at the beginning of the article.

Then, open the requirements.txt file and replace the ultralytics dependency with onnxruntime. Also, add the numpy package to the list. It will be used to convert the image to an array. In the end, the list of dependencies should look like this:

onnxruntime
flask
waitress
pillow
numpy

Ensure that all these packages are installed: you can install them one by one using pip, or, better, install all of them at once:



pip install -r requirements.txt

We will not change the frontend, so index.html will stay the same. The only file that we will change is object_detector.py, where we will rewrite the object detection code that previously used the Ultralytics API so that it uses the ONNX runtime.

Let's make a few changes to the structure of this file:



import onnxruntime as ort
from flask import request, Flask, jsonify
from waitress import serve
from PIL import Image
import numpy as np
import json

app = Flask(__name__)


def main():
    serve(app, host='0.0.0.0', port=8080)


@app.route("/")
def root():
    with open("index.html") as file:
        return file.read()


@app.route("/detect", methods=["POST"])
def detect():
    buf = request.files["image_file"]
    boxes = detect_objects_on_image(buf.stream)
    return jsonify(boxes)


def detect_objects_on_image(buf):
    model = YOLO("best.pt")
    results = model.predict(buf)
    result = results[0]
    output = []
    for box in result.boxes:
        x1, y1, x2, y2 = [
            round(x) for x in box.xyxy[0].tolist()
        ]
        class_id = box.cls[0].item()
        prob = round(box.conf[0].item(), 2)
        output.append([
            x1, y1, x2, y2, result.names[class_id], prob
        ])
    return output


main()

If you compare this listing with the original object_detector.py, you'll see that I removed the import of the ultralytics package and added the line that imports the ONNX runtime: import onnxruntime as ort. Also, I've imported numpy as np.

Then, I moved the code that runs the web server into the main function, defined it at the beginning, and call main() on the last line.

We will not change the routes, so the root and detect functions will remain the same. We will rewrite only detect_objects_on_image to use the ONNX runtime instead of Ultralytics; in the listing above it still contains the old YOLO-based body, which no longer works without the ultralytics import. The new implementation will be more complex, but you already know everything you need if you followed the previous sections of this article.

We will split the detect_objects_on_image function into three parts:

Prepare the input
Run the model
Process the output

We will put each phase into a separate function that detect_objects_on_image will call. Replace the content of this function with the following:



def detect_objects_on_image(buf):
    input, img_width, img_height = prepare_input(buf)
    output = run_model(input)
    return process_output(output,img_width,img_height)

def prepare_input(buf):
    pass

def run_model(input):
    pass

def process_output(output,img_width,img_height):
    pass

In the first line, the prepare_input function receives the uploaded file content, converts it to the input array and returns it. In addition, it returns the original dimensions of the image, img_width and img_height, which will be used later to scale the detected bounding boxes.
Then, the run_model function receives the input and runs the ONNX session with it. It returns the output, which is an array with the (1,84,8400) shape.
Finally, the output is passed to the process_output function, along with the original image size (img_width, img_height). This function should return the array of bounding boxes. Each item of this array has the following format: [x1,y1,x2,y2,class_label,prob].

Let's write these functions one by one.

Prepare the input

The prepare_input function uses the code that you have written in the Prepare the input section. This is how it looks:



def prepare_input(buf):
    img = Image.open(buf)
    img_width, img_height = img.size
    img = img.resize((640, 640))
    img = img.convert("RGB")
    input = np.array(img)
    input = input.transpose(2, 0, 1)
    input = input.reshape(1, 3, 640, 640) / 255.0
    return input.astype(np.float32), img_width, img_height

This code loads the image and saves its size to the img_width and img_height variables.
Then it resizes it, removes the transparency by converting to RGB, and converts it to a tensor of pixels by loading it as an np.array().
Then it transposes and reshapes the array to convert it from the (640,640,3) shape to the (1,3,640,640) shape, and divides all values by 255.0 to scale them to the range expected by the ONNX model.
Finally, it returns the input array converted to the "Float32" data type along with the original img_width and img_height. It's important to convert to np.float32 here, because dividing by 255.0 makes NumPy produce a float64 (double) array by default, but the ONNX model requires Float32.
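To see the shape and data type changes without a real image file, here is a small sketch that replaces the loaded image with a synthetic array. It assumes only that NumPy is installed; the variable names and the orange test pixel are made up for illustration.

```python
import numpy as np

# Stand-in for np.array(img) after resize + RGB conversion: height x width x channels.
pixels = np.zeros((640, 640, 3), dtype=np.uint8)
pixels[0, 0] = [255, 128, 0]  # make the top-left pixel orange

chw = pixels.transpose(2, 0, 1)               # (640, 640, 3) -> (3, 640, 640)
scaled = chw.reshape(1, 3, 640, 640) / 255.0  # add the batch axis, scale to 0..1
tensor = scaled.astype(np.float32)            # the type the ONNX model expects

print(scaled.dtype, tensor.dtype, tensor.shape)
print(tensor[0, :, 0, 0])  # R, G, B of the top-left pixel, one value per channel plane
```

The first print shows that the division yields float64 and that only the explicit astype call produces the float32 tensor with the (1, 3, 640, 640) shape.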

Run the model

In this function you can reuse the code, that we wrote in the Run the model section.



def run_model(input):
    model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])
    outputs = model.run(["output0"], {"images":input})
    return outputs[0]

First, you load the model from the yolov8m.onnx file and then use the run method to process the input and return the outputs. Finally, it returns the first output which is an array of (1,84,8400) shape.

Now, it's time to process and convert this output to the array of bounding boxes.

Process the output

The code to process the output will include the functions from the Process the output section to filter out all overlapping boxes using the "Intersection over Union" algorithm. Also, it will use the array of YOLO classes to obtain the labels for each detected object. This code you can just copy/paste from the appropriate places:



def iou(box1,box2):
    return intersection(box1,box2)/union(box1,box2)

def union(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)

def intersection(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    return max(0, x2-x1) * max(0, y2-y1)

yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

These are the iou function and its dependencies, which calculate the intersection over union coefficient. Also, there is an array of the YOLO classes that the model can detect.

Now, having all that, you can implement the process_output function:



def process_output(output, img_width, img_height):
    output = output[0].astype(float)
    output = output.transpose()

    boxes = []
    for row in output:
        prob = row[4:].max()
        if prob < 0.5:
            continue
        class_id = row[4:].argmax()
        label = yolo_classes[class_id]
        xc, yc, w, h = row[:4]
        x1 = (xc - w/2) / 640 * img_width
        y1 = (yc - h/2) / 640 * img_height
        x2 = (xc + w/2) / 640 * img_width
        y2 = (yc + h/2) / 640 * img_height
        boxes.append([x1, y1, x2, y2, label, prob])

    boxes.sort(key=lambda x: x[5], reverse=True)
    result = []
    while len(boxes) > 0:
        result.append(boxes[0])
        boxes = [box for box in boxes if iou(box, boxes[0]) < 0.7]
    return result

The first two lines convert the output shape from (1,84,8400) to (8400,84), which is 8400 rows with 84 columns. They also convert the array values from np.float32 to the float data type. This is required to serialize the result to JSON at the end.
The first loop goes through the rows. For each row, it calculates the probability of the prediction and skips the row if the probability is less than 0.5.
For rows that pass the probability check, it determines the class_id of the detected object and the text label of this class, using the yolo_classes array.
Then it calculates the corner coordinates of the bounding box from the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
Then it appends the calculated bounding box to the boxes array.
The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It removes all boxes that overlap the box with the highest probability too much, using the iou function to measure the overlap.
Finally, all boxes that passed the filter are returned as the result array.
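The coordinate conversion is the part that is easiest to get wrong, so here is a worked example in isolation. It is only a sketch: the helper name to_corner_box, the MODEL_SIZE constant and the sample numbers are invented for illustration.

```python
MODEL_SIZE = 640  # side of the square model input used throughout this article

def to_corner_box(xc, yc, w, h, img_width, img_height):
    # Hypothetical helper: center/size in model pixels -> corners in image pixels.
    x1 = (xc - w / 2) / MODEL_SIZE * img_width
    y1 = (yc - h / 2) / MODEL_SIZE * img_height
    x2 = (xc + w / 2) / MODEL_SIZE * img_width
    y2 = (yc + h / 2) / MODEL_SIZE * img_height
    return [x1, y1, x2, y2]

# A box centered in the 640x640 input that covers half of each side,
# mapped back to a 1280x720 original image.
box = to_corner_box(320, 320, 320, 320, img_width=1280, img_height=720)
print(box)  # [320.0, 180.0, 960.0, 540.0]
```

The box still covers the middle half of the picture in both directions, only now in the coordinates of the original 1280x720 image.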

That is it for Python implementation.

If everything is implemented without mistakes, you can run this web service this way:



python object_detector.py

Then open http://localhost:8080 in a web browser, and it should work exactly the same as the original service, implemented using the PyTorch version of the YOLOv8 model.

The ONNX runtime is a low-level library, so it requires much more code to make the model work. However, a solution built this way is better to deploy in production, because it requires 10 times less hard disk space.

You can find the whole project with comments in this GitHub repository.

The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing or exception handling. These tasks depend on real use cases, and it's up to you how to implement them in your projects.

We used only a small subset of the ONNX runtime Python API required for basic operations. The full reference is available here.

If you followed this guide step by step and implemented this web service in Python, then by this moment you know the basic workflow of the ONNX runtime and are ready to try implementing it in other languages.

In the sections below, we will implement the same project with the same functions in other programming languages. If you are curious, you can read all the following sections, or move directly to the language that interests you the most.

Create a web service on Julia

Julia is a modern programming language well suited for data science and machine learning. It combines simple syntax with superfast runtime performance. Sometimes it's stated as a future of machine learning and the most natural replacement for Python in this field.

Julia has good libraries for machine learning and deep learning. You can read my articles that introduce these libraries to create and run classical machine learning models and neural networks.

Furthermore, with bindings to the ONNX runtime library, you can use any machine learning model created in Python, including neural networks created in PyTorch and TensorFlow. YOLOv8 is no exception, and you can run these models, exported to ONNX format, in Julia.

Below, we will implement the same object detection project on Julia.

Setup the project

Enter the Julia REPL by running the following command:



julia

In the REPL, switch to pkg mode by pressing the ] key and then, enter this command:



generate object_detector

This command will create a folder object_detector and will generate the new project in it.

Enter the shell mode by pressing the ; key and move to the project folder by running the following command:



cd object_detector

Return to the pkg mode: if the REPL is still in shell mode, press Backspace on an empty prompt to leave it, and then press the ] key. Then execute this command to activate the project:



activate .

Then you need to install dependencies that will be used. They are ONNX runtime, the Images package and the Genie web framework.



add ONNXRunTime
add Images
add Genie

ONNXRunTime - the Julia bindings for the ONNX runtime library.
Images - this is the Julia Images package, which we will use to read images and convert them to pixel color arrays.
Genie - this is a web framework for Julia, similar to Flask in Python.

Then you can exit the Julia REPL by pressing Ctrl+D.

Open the project folder to see what is there:

src - the folder with Julia source code
Project.toml - the project properties file
Manifest.toml - the project package cache file

Also, it already generated the template source code file object_detector.jl in the src folder. In this file we will do all the work. However, before we start, copy the index.html and the yolov8m.onnx files from Python project to this project root. The frontend will be the same.

After you've done that, open the src/object_detector.jl, erase all content from it and add the following boilerplate code:



using Images, ONNXRunTime, Genie, Genie.Router, Genie.Requests, Genie.Renderer.Json

function main()    
    route("/") do 
        String(read("index.html"))
    end 

    route("/detect", method=POST) do
        buf = IOBuffer(filespayload()["image_file"].data)
        json(detect_objects_on_image(buf))
    end

    up(8080, host="0.0.0.0", async=false)
end

function detect_objects_on_image(buf)
    input, img_width, img_height = prepare_input(buf)
    output = run_model(input)
    return process_output(output, img_width,img_height)
end

function prepare_input(buf)
end

function run_model(input)
end

function process_output(output, img_width, img_height)
end

main()

This is a template of the whole application. You can compare this with the Python project and see that it has almost the same structure.

First you import the dependencies, including ONNX Runtime, the Genie web framework and the Images library.
Then, in the main function, you create two endpoints: one for the main index.html page and one for /detect, which receives the image file and passes it to the detect_objects_on_image function. Then you start the web server on port 8080, which serves these two endpoints.
The detect_objects_on_image function has exactly the same content as the Python one. It prepares the input from the image, passes it through the model, processes the model output and returns the array of bounding boxes.
Then, the processed output is returned to the client as JSON.

In the next sections we will implement prepare_input, run_model and process_output functions one by one.

Prepare the input



function prepare_input(buf)
    img = load(buf)
    img_height, img_width = size(img)
    img = imresize(img,(640,640))
    img = RGB.(img)
    input = channelview(img)
    input = reshape(input,1,3,640,640)
    return Float32.(input), img_width, img_height    
end

This code loads the image and saves its size to the img_width and img_height variables. Note that in Julia the size function returns the height first.
Then it resizes it, removes the transparency by converting to RGB, and converts it to a tensor of pixels using the channelview function.
Then it reshapes the array from the (3,640,640) shape returned by channelview to the (1,3,640,640) shape required by the ONNX model. No division by 255 is needed here, because the Images package already stores color components as numbers in the range from 0 to 1.
Finally, it returns the input array converted to the "Float32" data type along with the original img_width and img_height.

Run the model



function run_model(input)
    model = load_inference("yolov8m.onnx")
    outputs = model(Dict("images" => input))
    return outputs["output0"]
end

This code is almost the same as the corresponding Python code.

First, you load the model from the yolov8m.onnx file and then run this model to process the input and return the outputs. Finally, it returns the first output which is an array of (1,84,8400) shape.

Now, it's time to process and convert this output to the array of bounding boxes.

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Julia. Include them in your code below the process_output function:



function iou(box1,box2)
    return intersection(box1,box2) / union(box1,box2)
end

function union(box1,box2)
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)
end

function intersection(box1,box2)
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    return max(0, x2-x1) * max(0, y2-y1)
end

Also, include the array of YOLOv8 class labels, which will be used to convert class IDs to text labels:



yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

Now, it's time to write the process_output function:



function process_output(output, img_width, img_height)
    output = output[1,:,:]
    output = transpose(output)

    boxes = []
    for row in eachrow(output)        
        prob = maximum(row[5:end])
        if prob < 0.5
            continue
        end
        class_id = Int(argmax(row[5:end]))
        label = yolo_classes[class_id]
        xc,yc,w,h = row[1:4]
        x1 = (xc-w/2)/640*img_width
        y1 = (yc-h/2)/640*img_height
        x2 = (xc+w/2)/640*img_width
        y2 = (yc+h/2)/640*img_height
        push!(boxes,[x1,y1,x2,y2,label,prob])
    end

    boxes = sort(boxes, by = item -> item[6], rev=true)
    result = []
    while length(boxes)>0
        push!(result,boxes[1])
        boxes = filter(box -> iou(box,boxes[1])<0.7,boxes)
    end
    return result
end

Like the Python version, it consists of three parts.

In the first two lines, it converts the output array from the (1,84,8400) shape to (8400,84).
The first loop goes through the rows. For each row, it calculates the probability of the prediction and skips the row if the probability is less than 0.5.
For rows that pass the probability check, it determines the class_id of the detected object and the text label of this class, using the yolo_classes array.
Then it calculates the corner coordinates of the bounding box from the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
Then it appends the calculated bounding box to the boxes array.
The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It removes all boxes that overlap the box with the highest probability too much, using the iou function to measure the overlap.
Finally, all boxes that passed the filter are returned as the result array.

That is it for the Julia implementation.

If everything is implemented without mistakes, you can run this web service from the project folder using the following command:



julia src/object_detector.jl

Then open http://localhost:8080 in a web browser, and it should work exactly the same as the Python version.

We used only a small subset of the ONNX runtime Julia API required for basic operations. The full reference is available here.

You can find the source code of the Julia project in this repository.

Create a web service on Node.js

Node.js needs no introduction. It is the most used platform for developing server-side JavaScript applications, including backends for web services. Obviously, it would be great to be able to use neural networks in it. Fortunately, the ONNX runtime for Node.js opens the door to all machine learning models trained in PyTorch, TensorFlow and other frameworks. YOLOv8 is no exception. In this section, I will show how to rewrite our object detection web service in Node.js, using the ONNX runtime.

Setup the project

Create a new folder for the project, for example object_detector, open it and run:



npm init

to create a new Node.js project. After answering all the questions about the project, install the required dependencies:



npm i --save onnxruntime-node
npm i --save express
npm i --save multer
npm i --save sharp

onnxruntime-node - The Node.js library for ONNX Runtime
express - Express.js web framework
multer - Middleware for Express.js to handle file uploads
sharp - An image processing library

We are not going to change the frontend, so you can copy the index.html file from the previous project as is to the folder of this project. Also, copy the model file yolov8m.onnx.

Create an object_detector.js file in which you will write the whole backend. Add the following boilerplate code to it:



const ort = require("onnxruntime-node");
const express = require('express');
const multer = require("multer");
const sharp = require("sharp");
const fs = require("fs");

function main() {
    const app = express();
    const upload = multer();

    app.get("/", (req,res) => {
        res.end(fs.readFileSync("index.html", "utf8"))
    })

    app.post('/detect', upload.single('image_file'), async function (req, res) {
        const boxes = await detect_objects_on_image(req.file.buffer);
        res.json(boxes);
    });

    app.listen(8080, () => {
        console.log('Server is listening on port 8080')
    });
}

async function detect_objects_on_image(buf) {
    const [input,img_width,img_height] = await prepare_input(buf);
    const output = await run_model(input);
    return process_output(output,img_width,img_height);
}

async function prepare_input(buf) {

}

async function run_model(input) {

}

async function process_output(output, img_width, img_height) {

}

main()

In the first block of require lines, you import all required external modules: ort for the ONNX runtime, express for the web framework, multer to support file uploads in Express, sharp to load the uploaded file as an image and convert it to an array of pixel colors, and fs to read static files.
In the main function, it creates a new Express web application in the app variable and instantiates the multer upload middleware for it.
Then it defines two routes: the root route, which reads and returns the content of the index.html file, and the /detect route, which receives the uploaded file, passes it to the detect_objects_on_image function and returns the bounding boxes of the detected objects to the client.
The detect_objects_on_image function looks almost the same as in the Python and Julia projects: first it converts the uploaded file to an array of numbers, passes it to the model, processes the output and returns the array of detected objects.
Then the function stubs for all these actions are defined.
Finally, the main() function is called to start the web server on port 8080.

The project is ready, and it's time to implement the prepare_input, run_model and process_output functions one by one.

Prepare the input

We will use the Sharp library to load the image as an array of pixel colors. However, JavaScript does not have packages like NumPy that support multidimensional arrays. All arrays in JavaScript are flat. We can make an "array of arrays", but it's not a true multidimensional array with a shape. For example, we can't make an array with the shape (3,640,640), which means an array of 3 matrices: the first one for reds, the second one for greens and the third one for blues. Instead, the ONNX runtime for JavaScript requires a flat array with 3*640*640=1228800 elements, in which the reds go first, the greens go next and the blues go at the end. This is the result that the prepare_input function should return. Now let's do it step by step.

First, let's do the same actions with image as we did in other languages:



async function prepare_input(buf) {
    const img = sharp(buf);
    const md = await img.metadata();
    const [img_width,img_height] = [md.width, md.height];
    const pixels = await img.removeAlpha()
        .resize({width:640,height:640,fit:'fill'})
        .raw()
        .toBuffer();

It loads the file as an image using sharp.
It saves the original image dimensions to img_width and img_height.
On the next line, it uses a chain of operations to:
remove the transparency channel,
resize the image to 640x640,
return the image as a raw buffer of pixels.

Sharp also can't return a matrix of pixels, because there are no matrices in JavaScript. That is why you now have the pixels array, which is a one-dimensional array of image pixels. Each pixel consists of 3 numbers: R, G, B. There are no rows and columns; the pixels just go one after another. To convert it to the required format, you need to split it into 3 arrays (an array of reds, an array of greens and an array of blues) and then concatenate these 3 arrays into one, in which the reds go first, the greens go next and the blues go at the end.

The next image shows what you need to do with the pixels array and return from the function:

The first step is to create 3 arrays for reds, greens and blues:



const red = [], green = [], blue = [];

Then, traverse the pixels array and collect numbers to appropriate arrays:



for (let index=0; index<pixels.length; index+=3) {
    red.push(pixels[index]/255.0);
    green.push(pixels[index+1]/255.0);
    blue.push(pixels[index+2]/255.0);
}

This loop jumps from pixel to pixel with step=3. On each iteration, index points to the red component of the current pixel, index+1 points to the green component and index+2 points to the blue one. As you can see, we divide the components by 255.0 to scale them and put them into the appropriate arrays.

The only thing left to do after this is to concatenate these arrays in the correct order and return the result along with img_width and img_height.

Here is a full code of the prepare_input function:



async function prepare_input(buf) {
    const img = sharp(buf);
    const md = await img.metadata();
    const [img_width,img_height] = [md.width, md.height];
    const pixels = await img.removeAlpha()
        .resize({width:640,height:640,fit:'fill'})
        .raw()
        .toBuffer();

    const red = [], green = [], blue = [];
    for (let index=0; index<pixels.length; index+=3) {
        red.push(pixels[index]/255.0);
        green.push(pixels[index+1]/255.0);
        blue.push(pixels[index+2]/255.0);
    }

    const input = [...red, ...green, ...blue];
    return [input, img_width, img_height];
}

There are probably less resource-consuming ways to convert the pixels array to the required form without temporary arrays (you can try your own options), but I wanted this implementation to be logical and simple.
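As one possible option, here is a minimal sketch that writes the scaled values straight into a preallocated Float32Array instead of three temporary arrays. The toPlanarFloat32 name and its channels parameter are my assumptions for illustration, they are not part of the original project:

```javascript
// Hypothetical helper (not part of the article's code): converts interleaved
// pixel bytes (RGBRGB... or RGBARGBA...) into one planar Float32Array laid out
// as [all reds, all greens, all blues], scaled to the 0..1 range.
function toPlanarFloat32(pixels, channels = 3) {
    const pixelCount = pixels.length / channels;
    const input = new Float32Array(pixelCount * 3);
    for (let i = 0; i < pixelCount; i++) {
        const offset = i * channels;
        input[i] = pixels[offset] / 255.0;                       // red plane
        input[pixelCount + i] = pixels[offset + 1] / 255.0;      // green plane
        input[2 * pixelCount + i] = pixels[offset + 2] / 255.0;  // blue plane
    }
    return input;
}

// Two RGB pixels: pure red, then pure blue.
const planar = toPlanarFloat32(Uint8Array.from([255, 0, 0, 0, 0, 255]));
console.log(Array.from(planar)); // [ 1, 0, 0, 0, 0, 1 ]
```

With channels set to 4, the same loop skips the alpha byte of RGBA data. Whether this is noticeably faster for a 640x640 image is something you would need to measure yourself.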

Now, let's run this input through the YOLOv8 model using the ONNX runtime.

Run the model

The code of the run_model function follows:



async function run_model(input) {
    const model = await ort.InferenceSession.create("yolov8m.onnx");
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}

On the first line, we load the model from the yolov8m.onnx file.
On the second line, we prepare the input array. The ONNX Runtime requires converting it to an internal ort.Tensor object. The constructor of this object requires the flat array of numbers, converted to Float32, and the shape that this array should have, which is, as usual, [1,3,640,640].
On the third line, we run the model with the constructed tensor and receive the outputs.
Finally, we return the data of the first output. In the JavaScript version, we have to specify the name of this output instead of its index. The name of the YOLOv8 output, as you have seen in the beginning of this article, is output0.

As a result, the function returns an array with the (1,84,8400) shape, which you can think of as an 84x8400 matrix. However, JavaScript does not support matrices, which is why the output comes as a one-dimensional array. The numbers in it are ordered as in an 84x8400 matrix, but stored as a flat array of 705600 items. So you can't transpose it, and you can't traverse it by rows in a loop, because you have to specify the absolute position of each item. But do not worry, in the next section we will learn how to deal with it.

Process the output

The process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to JavaScript. Add them to your code below the process_output function:



function iou(box1,box2) {
    return intersection(box1,box2)/union(box1,box2);
}

function union(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1);
    const box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1);
    return box1_area + box2_area - intersection(box1,box2);
}

function intersection(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const x1 = Math.max(box1_x1,box2_x1);
    const y1 = Math.max(box1_y1,box2_y1);
    const x2 = Math.min(box1_x2,box2_x2);
    const y2 = Math.min(box1_y2,box2_y2);
    // Clamp to zero: for boxes that do not overlap, both differences can be
    // negative and their product would be a bogus positive area.
    return Math.max(0,x2-x1)*Math.max(0,y2-y1);
}
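To get a feeling for the numbers that IoU produces, here is a small self-contained sketch with the same [x1, y1, x2, y2] box layout. The function names are hypothetical, and the sketch assumes that a negative overlap is clamped to zero, so that boxes which do not touch get an IoU of 0:

```javascript
// Same box layout as in the article: [x1, y1, x2, y2].
// Assumption of this sketch: a negative overlap on either axis is clamped
// to zero, so boxes that do not touch always get an intersection of 0.
function intersectionArea(a, b) {
    const w = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
    const h = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
    return Math.max(0, w) * Math.max(0, h);
}

function boxArea(box) {
    return (box[2] - box[0]) * (box[3] - box[1]);
}

function iouOf(a, b) {
    const inter = intersectionArea(a, b);
    return inter / (boxArea(a) + boxArea(b) - inter);
}

console.log(iouOf([0, 0, 10, 10], [5, 5, 15, 15]));   // 25 / 175, about 0.143
console.log(iouOf([0, 0, 10, 10], [20, 20, 30, 30])); // 0 (disjoint boxes)
console.log(iouOf([0, 0, 10, 10], [0, 0, 10, 10]));   // 1 (identical boxes)
```

So the value goes from 0 for boxes that have nothing in common to 1 for boxes that match exactly.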

Also, you will need to find the YOLO class label by ID, so add the yolo_classes array to your code:



const yolo_classes = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
    'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
    'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
    'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
];

Now let's implement the process_output function. As mentioned above, the function receives the output as a flat array that is ordered as an 84x8400 matrix. When working in Python, we had NumPy to transpose it to 8400x84 and then traverse it row by row in a loop. Here, we can't transform it this way, so we need to traverse it by columns.



let boxes = [];
for (let index=0;index<8400;index++) {

}

Moreover, you do not have row and column indexes, only absolute indexes. You can only virtually reshape this flat array into an 84x8400 matrix in your head and use this representation to calculate the absolute indexes from those "virtual rows" and "virtual columns".

Let's display how the output array looks to clarify this:

Here we virtually reshaped the output array of 705600 items into an 84x8400 matrix. It has 8400 columns with indexes from 0 to 8399 and 84 rows with indexes from 0 to 83. The absolute indexes of the items are written inside the cells. Each detected object is represented by a column of this matrix. The first 4 rows of each column, with indexes from 0 to 3, are the coordinates of the bounding box of that object: x_center, y_center, width and height. The cells in the other 80 rows, from 4 to 83, contain the probabilities that the object belongs to each of the 80 YOLO classes.

I drew this table to understand how to calculate the absolute index of any item in it from its row and column indexes. For example, how do you calculate the index of the first greyed item, which stands in row 2 and column 2 and is the bounding box width of the third detected object? If you think about it a little more, you will find that you need to multiply the row index by the length of a row (8400) and add the column index: 8400*2+2=16802. Now, let's calculate the index of the item below it, which is the height of the same object: 8400*3+2=25202. Bingo! Matched again! Finally, let's check the bottom grey box, which is the probability that object 8399 belongs to class 79 (toothbrush): 8400*83+8398=705598. Great, so you have a formula to calculate the absolute index: 8400*row_index+column_index.
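The formula is easy to check in code. This is just a sketch; the flatIndex helper and the two constants are names that I introduce here for illustration:

```javascript
const NUM_ROWS = 84;    // 4 box values + 80 class probabilities
const NUM_COLS = 8400;  // candidate detections

// Hypothetical helper: absolute index of the item at (row, col)
// in the row-major flat array.
function flatIndex(row, col) {
    return NUM_COLS * row + col;
}

console.log(flatIndex(2, 2));      // 16802
console.log(flatIndex(3, 2));      // 25202
console.log(flatIndex(83, 8398));  // 705598
console.log(flatIndex(NUM_ROWS - 1, NUM_COLS - 1)); // 705599, the last of 705600 items
```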

Let's return to our empty loop. Assuming that the index loop counter is the index of the current column and that the coordinates of the bounding box are located in rows 0-3 of that column, we can extract them this way:



let boxes = [];
for (let index=0;index<8400;index++) {
    const xc = output[8400*0+index];
    const yc = output[8400*1+index];
    const w = output[8400*2+index];
    const h = output[8400*3+index];
}

Then you can calculate the corners of the bounding box and scale them to the size of the original image:



const x1 = (xc-w/2)/640*img_width;
const y1 = (yc-h/2)/640*img_height;
const x2 = (xc+w/2)/640*img_width;
const y2 = (yc+h/2)/640*img_height;
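To see what these formulas do with real numbers, here is a small sketch that wraps them in a function. The toCorners name is hypothetical, and the box and image sizes are made up for the example:

```javascript
// Hypothetical helper wrapping the four formulas above.
// xc, yc, w, h are in the 640x640 model space; the result is in
// the pixel space of the original image.
function toCorners(xc, yc, w, h, img_width, img_height) {
    const x1 = (xc - w / 2) / 640 * img_width;
    const y1 = (yc - h / 2) / 640 * img_height;
    const x2 = (xc + w / 2) / 640 * img_width;
    const y2 = (yc + h / 2) / 640 * img_height;
    return [x1, y1, x2, y2];
}

// A 64x128 box centered in the model input, mapped to a 1280x960 image.
console.log(toCorners(320, 320, 64, 128, 1280, 960));
// approximately [576, 384, 704, 576], up to floating point rounding
```

The box is first converted from center and size to two corners, and then each coordinate is stretched from the 640x640 model space back to the original image size.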

Now, similarly, you need to get the probabilities of the object, which go in rows from 4 to 83, find the biggest of them and its index, and save these values to the prob and class_id variables. You can write a nested loop that traverses rows from 4 to 83 and saves the highest value and its index:



let class_id = 0, prob = 0;
for (let row=4;row<84;row++) {
    if (output[8400*row+index]>prob) {
        prob = output[8400*row+index];
        class_id = row - 4;
    }
}

It works fine, but I prefer to rewrite it in a functional style:



const [class_id,prob] = [...Array(80).keys()]
    .map(col => [col, output[8400*(col+4)+index]])
    .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);

The first line, [...Array(80).keys()], generates a range array with numbers from 0 to 79.
Then, the map function builds an array of [class_id, probability] pairs, one for each class.
The reduce function reduces this array to a single item that contains the maximum probability and its class id.
Finally, this item is destructured to the class_id and prob variables.
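If you want to convince yourself that both variants give the same result, you can try them on a tiny stand-in for the model output. In this sketch the "matrix" has only 3 columns and 2 classes, and all numbers in it are made up:

```javascript
// A tiny stand-in for the model output: 6 virtual rows x 3 virtual columns
// (4 box rows + 2 class rows), stored row by row in a flat array.
const COLS = 3, CLASSES = 2;
const output = [
    10, 20, 30,     // row 0: x_center
    10, 20, 30,     // row 1: y_center
    4, 4, 4,        // row 2: width
    4, 4, 4,        // row 3: height
    0.9, 0.2, 0.1,  // row 4: probability of class 0
    0.05, 0.7, 0.3, // row 5: probability of class 1
];

// Loop variant: scan the class rows of one column and keep the maximum.
function bestClassLoop(index) {
    let class_id = 0, prob = 0;
    for (let row = 4; row < 4 + CLASSES; row++) {
        if (output[COLS * row + index] > prob) {
            prob = output[COLS * row + index];
            class_id = row - 4;
        }
    }
    return [class_id, prob];
}

// Functional variant: build [class_id, probability] pairs and reduce them.
function bestClassFunctional(index) {
    return [...Array(CLASSES).keys()]
        .map(id => [id, output[COLS * (id + 4) + index]])
        .reduce((accum, item) => item[1] > accum[1] ? item : accum, [0, 0]);
}

for (let index = 0; index < COLS; index++) {
    console.log(index, bestClassLoop(index), bestClassFunctional(index));
}
// 0 [ 0, 0.9 ] [ 0, 0.9 ]
// 1 [ 1, 0.7 ] [ 1, 0.7 ]
// 2 [ 1, 0.3 ] [ 1, 0.3 ]
```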

Then, having the maximum probability and class_id, you can either skip the object if the probability is less than 0.5, or find the label of its class.

Here is the final code that processes the output and collects bounding boxes to the boxes array:



    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(80).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }

The last step is to filter the boxes array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code is close to the Python implementation:



boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
const result = [];
while (boxes.length>0) {
    result.push(boxes[0]);
    boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
}

We sort the boxes by probability in descending order, to put the boxes with the highest probability at the top.
In a loop, we put the box with the highest probability to result.
Then we filter out the selected box itself and all boxes that overlap it too much (all boxes that have an IoU of 0.7 or more with this box).
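Here is how these steps behave on a few made-up boxes. This is a self-contained sketch: the nonMaxSuppression name and the compact iou helper (which clamps negative overlap to zero) are my assumptions, and the boxes use the same [x1, y1, x2, y2, label, prob] layout as above:

```javascript
// Compact IoU for [x1, y1, x2, y2, ...] boxes; negative overlap is clamped to 0.
function iou(a, b) {
    const w = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
    const h = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
    const inter = w * h;
    const areaA = (a[2] - a[0]) * (a[3] - a[1]);
    const areaB = (b[2] - b[0]) * (b[3] - b[1]);
    return inter / (areaA + areaB - inter);
}

function nonMaxSuppression(boxes, iouThreshold = 0.7) {
    // Copy before sorting so the caller's array is left untouched.
    let remaining = [...boxes].sort((b1, b2) => b2[5] - b1[5]);
    const result = [];
    while (remaining.length > 0) {
        const best = remaining[0];
        result.push(best);
        // Drops `best` itself (IoU of 1) and every box overlapping it too much.
        remaining = remaining.filter(box => iou(best, box) < iouThreshold);
    }
    return result;
}

const kept = nonMaxSuppression([
    [100, 100, 200, 200, "car", 0.80],
    [102, 98, 205, 203, "car", 0.92],   // nearly the same box, higher score
    [400, 300, 450, 420, "person", 0.75],
]);
console.log(kept.map(box => [box[4], box[5]]));
// [ [ 'car', 0.92 ], [ 'person', 0.75 ] ]
```

The two "car" boxes have an IoU of about 0.89, so only the one with the higher probability survives, while the "person" box does not overlap anything and is kept.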

That's all! For convenience, here is the full code of the process_output function:



function process_output(output, img_width, img_height) {
    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(80).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }

    boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
    const result = [];
    while (boxes.length>0) {
        result.push(boxes[0]);
        boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
    }
    return result;
}

If you prefer to work with this output in a more convenient "Pythonic" way, there is the NumJS library that emulates NumPy in JavaScript. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and traverse the detected objects row by row.

However, working with the one-dimensional array as with a matrix, as described in this section, is the most efficient option, because we get all the values we need without additional array transformations. I think that installing an additional external dependency is overkill for this case.
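If you still want the row-by-row view but without a dependency, the transposition itself is only a few lines of plain JavaScript. This is only a sketch with a hypothetical columnsAsRows helper, and it pays the price mentioned above: it allocates a new array for every detection:

```javascript
// Hypothetical helper: turns a flat row-major (rows x cols) array into an
// array of `cols` arrays, each holding the `rows` values of one column.
function columnsAsRows(flat, rows, cols) {
    const result = [];
    for (let col = 0; col < cols; col++) {
        const item = new Array(rows);
        for (let row = 0; row < rows; row++) {
            item[row] = flat[cols * row + col];
        }
        result.push(item);
    }
    return result;
}

// A 3x2 "matrix" [[1, 2], [3, 4], [5, 6]] stored flat.
console.log(columnsAsRows([1, 2, 3, 4, 5, 6], 3, 2));
// [ [ 1, 3, 5 ], [ 2, 4, 6 ] ]
```

For the real output you would call it with rows=84 and cols=8400 and get 8400 arrays of 84 numbers each.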

That is it for Node.js implementation. If you wrote everything correctly, then you can start this web service by running the following command:



node object_detector.js

and open http://localhost:8080 in a web browser.

We used only a small subset of the ONNX runtime JavaScript API, required for basic operations. The full reference is available here.

You can find the source code of the Node.js object detector web service in this repository.

Create a web service on JavaScript

Did you know that you can write all the code for the object detector right in the HTML page? Using the ONNX library for JavaScript, you can process the image right in the frontend, without sending it to any server. Furthermore, you can reuse most of the code that we wrote for Node.js, because the underlying ONNX runtime API is the same.

Setup the project

You can reuse the frontend from Node.js project. Create a new folder and copy the index.html and yolov8m.onnx files to it.

Then, open the index.html and add the JavaScript library for ONNX runtime to the head section of the HTML:



<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>

This library exposes the ort global variable, which is the root of the ONNX runtime API. You can use it to instantiate and run models the same way as we used the ort variable in the Node.js project.

By the time you read this, the URL of the library may have changed, so check the official documentation for installation instructions.

This is the index.html file that you should have at the beginning:



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>YOLOv8 Object Detection</title>
    <style>
      canvas {
          display:block;
          border: 1px solid black;
          margin-top:10px;
      }
    </style>
    <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
</head>
<body>
    <input id="uploadInput" type="file"/>
    <canvas></canvas>
    <script>

       const input = document.getElementById("uploadInput");
       input.addEventListener("change",async(event) => {
           const data = new FormData();
           data.append("image_file",event.target.files[0],"image_file");
           const response = await fetch("/detect",{
               method:"post",
               body:data
           });
           const boxes = await response.json();
           draw_image_and_boxes(event.target.files[0],boxes);
       })

      function draw_image_and_boxes(file,boxes) {
          const img = new Image()
          img.src = URL.createObjectURL(file);
          img.onload = () => {
              const canvas = document.querySelector("canvas");
              canvas.width = img.width;
              canvas.height = img.height;
              const ctx = canvas.getContext("2d");
              ctx.drawImage(img,0,0);
              ctx.strokeStyle = "#00FF00";
              ctx.lineWidth = 3;
              ctx.font = "18px serif";
              boxes.forEach(([x1,y1,x2,y2,label]) => {
                  ctx.strokeRect(x1,y1,x2-x1,y2-y1);
                  ctx.fillStyle = "#00ff00";
                  const width = ctx.measureText(label).width;
                  ctx.fillRect(x1,y1,width+10,25);
                  ctx.fillStyle = "#000000";
                  ctx.fillText(label, x1, y1+18);
              });
          }
      }
    </script>
</body>
</html>

To run the ONNX runtime in a browser, you need to serve the content of this folder from a web server. For example, you can use the VS Code embedded web server to run index.html.

Once it works, let's load the image and prepare an input array from it.

Prepare the input

The user loads the image by selecting an image file in the upload file field. This process is implemented in the change event listener:



input.addEventListener("change",async(event) => {
    const data = new FormData();
    data.append("image_file",event.target.files[0],"image_file");
    const response = await fetch("/detect",{
        method:"post",
        body:data
    });
    const boxes = await response.json();
    draw_image_and_boxes(event.target.files[0],boxes);
})

This code uses fetch to post the file from event.target.files[0] to the backend. The backend then returns the array of bounding boxes, which is decoded to the boxes array.

However, in this version, we will not have a backend to upload the image to. We will write all the code here, in the index.html file, including detect_objects_on_image and all other functions. So you need to remove this fetch call and just pass the file to the detect_objects_on_image function:



input.addEventListener("change",async(event) => {
    const boxes = await detect_objects_on_image(event.target.files[0]);
    draw_image_and_boxes(event.target.files[0],boxes);
})

Then, define the detect_objects_on_image function, which is the same as in Node.js example:



async function detect_objects_on_image(buf) {
    const [input,img_width,img_height] = await prepare_input(buf);
    const output = await run_model(input);
    return process_output(output,img_width,img_height);
}

The only difference here is that buf is a File object that the user selected in the upload file field. You need to load this file as an image in the browser and convert it to an array of pixels. The most common way to read image pixels in HTML and JavaScript is the HTML5 canvas element. It provides the image as a flat array of pixel colors, almost the same as the Sharp library did in the Node.js version. We will do this work in the prepare_input function:



async function prepare_input(buf) {
    const img = new Image();
    img.src = URL.createObjectURL(buf);
    img.onload = () => {
        const [img_width,img_height] = [img.width, img.height]
        const canvas = document.createElement("canvas");
        canvas.width = 640;
        canvas.height = 640;
        const context = canvas.getContext("2d");
        context.drawImage(img,0,0,640,640);
        const imgData = context.getImageData(0,0,640,640);
        const pixels = imgData.data;
    }
}

The HTML5 canvas element can draw HTML images, which is why we need to load the file into an Image() object first.
Then, before drawing it on the canvas, we need to ensure that the image is loaded. That is why we write all the following code in the onload event handler of the image object, which is executed only after the image is loaded.
We save the original image size to img_width and img_height.
Then we create a canvas object and set its size to 640x640, because this is the size required by the YOLOv8 model.
Then we get the 2D drawing context of the created canvas to draw the image on it. The drawImage method can draw and resize at the same time, which is why we set the size of the image on the canvas to 640x640.
Then getImageData() is used to get the ImageData object with the image pixels.
The only property of the ImageData object that we need is data, which contains the array of pixels.

Now you have the pixels array, which is a one-dimensional array of image pixels. Each pixel consists of 4 numbers that define the color components: R, G, B, A where R=red, G=green, B=blue and A=transparency (alpha channel). There are no rows and columns in this array; the pixels just go one after another. To convert it to the required format, you need to split it into 3 arrays: an array of reds, an array of greens and an array of blues, and then concatenate these 3 arrays into one, in which the reds go first, the greens next and the blues at the end.

The next image shows what you need to do with the pixels array and return from the function:

The first step is to create 3 arrays for reds, greens and blues:



const red = [], green = [], blue = [];

Then, traverse the pixels array and collect numbers to appropriate arrays:



for (let index=0; index<pixels.length; index+=4) {
    red.push(pixels[index]/255.0);
    green.push(pixels[index+1]/255.0);
    blue.push(pixels[index+2]/255.0);
}

This loop jumps from pixel to pixel with step=4. On each iteration, pixels[index] is the red component of the current pixel, pixels[index+1] is the green component and pixels[index+2] is the blue one. The fourth component (alpha) is skipped in this loop. As you see, we divide the components by 255.0 to scale them and put them into the appropriate arrays.

The only thing left to do after this is to concatenate these arrays in the correct order and return the result along with img_width and img_height. But we can't simply add a return statement here, because all this code is inside an inner function, the onload event handler, so return would exit only this handler and not the prepare_input function.

To handle this, we wrap the code of the prepare_input function in a Promise and return it. Then, inside the event handler, we call resolve([input, img_width, img_height]) to resolve that promise with the results, which the caller receives through await.

Here is the full code of the prepare_input function:



async function prepare_input(buf) {
    return new Promise(resolve => {
        const img = new Image();
        img.src = URL.createObjectURL(buf);
        img.onload = () => {
            const [img_width,img_height] = [img.width, img.height]
            const canvas = document.createElement("canvas");
            canvas.width = 640;
            canvas.height = 640;
            const context = canvas.getContext("2d");
            context.drawImage(img,0,0,640,640);
            const imgData = context.getImageData(0,0,640,640);
            const pixels = imgData.data;

            const red = [], green = [], blue = [];
            for (let index=0; index<pixels.length; index+=4) {
                red.push(pixels[index]/255.0);
                green.push(pixels[index+1]/255.0);
                blue.push(pixels[index+2]/255.0);
            }
            const input = [...red, ...green, ...blue];
            resolve([input, img_width, img_height])
        }
    })
}
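The trick of wrapping an event handler in a Promise can be seen in isolation in the next sketch. It runs without a browser: fakeLoad is a hypothetical stand-in for an image with load and error callbacks, and loadSize is also a name that I made up. The sketch additionally calls reject on failure, which the function above does not do; in a browser, the onerror handler of the image would be the natural place for that:

```javascript
// Stand-in for a callback-style loader such as Image with onload/onerror.
// `fakeLoad` is hypothetical and exists only to make the sketch runnable.
function fakeLoad(source, onload, onerror) {
    setTimeout(() => {
        if (source) {
            onload({ width: 1280, height: 960 });
        } else {
            onerror(new Error("nothing to load"));
        }
    }, 0);
}

function loadSize(source) {
    return new Promise((resolve, reject) => {
        fakeLoad(
            source,
            img => resolve([img.width, img.height]), // becomes the awaited value
            err => reject(err)                        // surfaces as a thrown error
        );
    });
}

loadSize("photo.jpg").then(size => console.log(size));
loadSize("").catch(err => console.log("failed:", err.message));
```

Inside an async function, the first call could be written as const [w, h] = await loadSize("photo.jpg"), the same way as detect_objects_on_image awaits prepare_input.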

Run the model and process the output

This prepare_input function returns the input in exactly the same format as in the Node.js version. That is why all the other code, including the run_model, process_output, iou, intersection and union functions, can be copied as is from the Node.js project.

After it's done, the JavaScript web service is finished!

Now you can use any web server to serve the index.html file and try this wonderful feature: running neural network models right in a web browser.

We used only a small subset of the ONNX runtime JavaScript API, required for basic operations. The full reference is available here.

You can find the source code of the JavaScript object detector web service in this repository.

Create a web service on Go

Go is the first statically typed and compiled programming language in our journey. From my point of view, the greatest thing about Go is how you deploy the apps written in it. You can compile all your code and its dependencies to a single binary executable, then just copy this file to a production server and run it. That is the whole deployment process in Go. You do not need to install any third-party dependencies to run Go programs, which is why Go applications are usually compact and convenient to update. Also, Go is usually faster than Python and JavaScript. It would definitely be great to be able to deploy neural networks this way. Fortunately, there are several ONNX runtime bindings that will help us achieve this goal.

Setup the project

Create a new folder, enter it and run:



go mod init object_detector

This command will initialize the object_detector project in the current folder.

Install required external modules:



go get github.com/yalue/onnxruntime_go
go get github.com/nfnt/resize

github.com/yalue/onnxruntime_go - ONNX runtime library bindings for Go
github.com/nfnt/resize - the library to resize images. (Perhaps you can find a more modern library, but I used this one because it works properly)

Another thing for which I respect Go is that everything else we need, including the web framework and image processing functions, already exists in the standard library.

The ONNX module for Go provides the API, but does not contain the Microsoft ONNX runtime library itself. Instead, it has a function to specify the path where this library is located. Here you have two options: install the Microsoft ONNX runtime library to a well-known system path, or download the version for your operating system and put it in the project folder. For this project, I will go the second way, to make the project self-contained and independent of the operating system setup.

Go to the Releases page: https://github.com/microsoft/onnxruntime/releases and download the archive for your operating system. After that, extract the files from the archive and copy all files from the lib subfolder to the project folder.

We are not going to change the frontend, so just copy the index.html file from one of the previous projects to the current folder. Also, copy the yolov8m.onnx model file.

By convention, the main file of a Go project is named main.go. So, create this file and put the following boilerplate code in it:



package main

import (
    "encoding/json"
    "github.com/nfnt/resize"
    ort "github.com/yalue/onnxruntime_go"
    "image"
    _ "image/gif"
    _ "image/jpeg"
    _ "image/png"
    "io"
    "math"
    "net/http"
    "os"
    "sort"
)

func main() {
    server := http.Server{
        Addr: "0.0.0.0:8080",
    }
    http.HandleFunc("/", index)
    http.HandleFunc("/detect", detect)
    server.ListenAndServe()
}

func index(w http.ResponseWriter, _ *http.Request) {
    file, _ := os.Open("index.html")
    buf, _ := io.ReadAll(file)
    w.Write(buf)
}

func detect(w http.ResponseWriter, r *http.Request) {
    r.ParseMultipartForm(0)
    file, _, _ := r.FormFile("image_file")
    boxes := detect_objects_on_image(file)
    buf, _ := json.Marshal(&boxes)
    w.Write(buf)
}

func detect_objects_on_image(buf io.Reader) [][]interface{} {
    input, img_width, img_height := prepare_input(buf)
    output := run_model(input)
    return process_output(output, img_width, img_height)
}

func prepare_input(buf io.Reader) ([]float32, int64, int64) {

}

func run_model(input []float32) []float32 {

}

func process_output(output []float32, img_width, img_height int64) [][]interface{} {

}

First, we import the required packages. Most of them come from the Go standard library:

encoding/json - to encode bounding boxes to JSON before sending the response
github.com/nfnt/resize - to resize the image to 640x640
ort "github.com/yalue/onnxruntime_go" - ONNX runtime library. We import it under the ort alias
image, image/gif, image/jpeg, image/png - the image library and the decoders for images of different formats
io - to read data from local files
math - for the Max and Min functions
net/http - to create and run a web server
os - to open local files
sort - to sort bounding boxes

Then, the main function defines two HTTP endpoints, index and detect, that are handled by the corresponding functions, and starts the web server on port 8080.

The index endpoint just returns the content of the index.html file.

The detect endpoint receives the uploaded image file and sends it to the detect_objects_on_image function, which passes it through the YOLOv8 model. Then it receives the array of bounding boxes, encodes them to JSON and returns this JSON to the frontend.

The detect_objects_on_image function is the same as in the previous projects. The only difference is the type of the value that it returns, which is [][]interface{}. The function should return an array of bounding boxes. Each bounding box is an array of 6 items (x1, y1, x2, y2, label, probability). These items have different types. However, Go, as a strongly typed programming language, does not allow an array with items of different types. But it has a special type, interface{}, which can hold a value of any type. It is a common trick in Go to declare a variable with the interface{} type if it can hold values of different types. That is why, to have an array of items of different types, you need to create an array of interfaces: []interface{}. Consequently, a bounding box is an array of interfaces and the array of bounding boxes is an array of interface arrays: [][]interface{}.

Then come the stubs of the prepare_input, run_model and process_output functions. In the next sections, we will implement them one by one.

Prepare the input

To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert it to a tensor of (3,640,640) shape, where the first item is an array of red components of the image pixels, the second is an array of greens and the last is an array of blues. Furthermore, the ONNX library for Go requires you to provide this tensor as a flat array, i.e. to concatenate these three arrays one after another, as displayed on the next image.

So, let's load and resize the image first:



func prepare_input(buf io.Reader) ([]float32, int64, int64) {
    img, _, _ := image.Decode(buf)
    size := img.Bounds().Size()
    img_width, img_height := int64(size.X), int64(size.Y)
    img = resize.Resize(640, 640, img, resize.Lanczos3)

This code:

loaded the image,
saved the size of original image to img_width, img_height variables
resized it to 640x640 pixels

Then you need to collect the colors of the pixels into separate arrays, which you should define first:



    red := []float32{}
    green := []float32{}
    blue := []float32{}

Then you need to extract the pixels and their colors from the image. To do that, the img object has the .At(x,y) method, which returns the color of the pixel at the specified point of the image. The color object returned by this method has the .RGBA() method, which returns four values: R, G, B and A. Each of them is a 16-bit number in the range from 0 to 65535, which is why the code below divides them by 257 to bring them to the usual 0-255 range before scaling. You only need R, G and B.

Now, you have everything to traverse the image and collect pixel colors to created arrays:



for y := 0; y < 640; y++ {
    for x := 0; x < 640; x++ {
        r, g, b, _ := img.At(x, y).RGBA()
        red = append(red, float32(r/257)/255.0)
        green = append(green, float32(g/257)/255.0)
        blue = append(blue, float32(b/257)/255.0)
    }
}

This code traverses all rows and columns of the image.
It gets the color components of each pixel and assigns them to the r, g and b variables.
Then it scales these components and appends them to the appropriate arrays.
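The scaling arithmetic may look strange at first, so here is the same calculation as a small sketch. It is written in JavaScript, to match the earlier examples, and uses a hypothetical scaleChannel function. It relies on the fact that the RGBA() method in Go reports each channel as a 16-bit value:

```javascript
// Go's RGBA() reports each channel as a 16-bit value in 0..65535.
// Integer-dividing by 257 (65535 / 255) maps it back to 0..255,
// and dividing by 255 then gives the 0..1 range the model expects.
function scaleChannel(value16) {
    return Math.floor(value16 / 257) / 255.0;
}

console.log(scaleChannel(0));      // 0
console.log(scaleChannel(65535));  // 1
console.log(scaleChannel(32896));  // 128 / 255, about 0.502
```

Math.floor plays the role of the integer division that Go performs in r/257 before the value is converted to float32.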

Finally, you need to concatenate these arrays into a single one in the correct order:



input := append(red, green...)
input = append(input, blue...)

So, the input variable contains the input required by the ONNX runtime. Here is the full code of this function, which returns the input and the size of the original image, which will be used later to process the output of the model.



func prepare_input(buf io.Reader) ([]float32, int64, int64) {
    img, _, _ := image.Decode(buf)
    size := img.Bounds().Size()
    img_width, img_height := int64(size.X), int64(size.Y)
    img = resize.Resize(640, 640, img, resize.Lanczos3)
    red := []float32{}
    green := []float32{}
    blue := []float32{}
    for y := 0; y < 640; y++ {
        for x := 0; x < 640; x++ {
            r, g, b, _ := img.At(x, y).RGBA()
            red = append(red, float32(r/257)/255.0)
            green = append(green, float32(g/257)/255.0)
            blue = append(blue, float32(b/257)/255.0)
        }
    }
    input := append(red, green...)
    input = append(input, blue...)
    return input, img_width, img_height
}

Now, let's run it through the model.

Run the model

The run_model function does the same as in the Python example, but it is quite wordy because of Go language specifics:



func run_model(input []float32) []float32 {
    ort.SetSharedLibraryPath("./libonnxruntime.so")
    _ = ort.InitializeEnvironment()

    inputShape := ort.NewShape(1, 3, 640, 640)
    inputTensor, _ := ort.NewTensor(inputShape, input)

    outputShape := ort.NewShape(1, 84, 8400)
    outputTensor, _ := ort.NewEmptyTensor[float32](outputShape)

    model, _ := ort.NewSession[float32]("./yolov8m.onnx",
        []string{"images"}, []string{"output0"},
        []*ort.Tensor[float32]{inputTensor},[]*ort.Tensor[float32]{outputTensor})

    _ = model.Run()
    return outputTensor.GetData()
}

As written in the setup section, the Go ONNX library needs to know where the ONNX runtime library is located. Use ort.SetSharedLibraryPath() to specify the location of the main file of the ONNX runtime library, and then initialize the environment with this library. If you downloaded it manually, as suggested earlier, then just specify the name of the file. For Linux, the file name is libonnxruntime.so, for macOS it is libonnxruntime.dylib, and for Windows it is onnxruntime.dll. I work on Linux, so in this example I use the Linux library.
Then, the library requires converting the input to its internal tensor format with the (1,3,640,640) shape.
Then, the library also requires creating an empty structure for the output tensor and specifying its shape. The Go ONNX library does not return the output; it writes it to a variable that is defined in advance. Here, we defined the outputTensor variable as a tensor with the (1,84,8400) shape that will receive the data from the model.
Then we create a model using the NewSession function, which receives arrays of input and output names and arrays of input and output tensors.
Then we run this model, which processes the input and writes the output to the outputTensor variable.
The outputTensor.GetData() method returns the output data as a flat array of float numbers.

As a result, the function returns the output with the (1,84,8400) shape, which you can think of as an 84x8400 matrix. However, it is returned as a one-dimensional array: the numbers are ordered row by row, as in an 84x8400 matrix, but stored as a flat array of 705600 items. So, you can't transpose it or traverse it by rows in a loop directly; instead, you have to calculate the absolute position of each item. But do not worry, in the next section we will learn how to deal with it.
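
The position arithmetic can be sketched in Python on a tiny matrix. This is only an illustration: the 3x4 size stands in for 84x8400, and the at helper is a hypothetical name, not part of any library.

```python
ROWS, COLS = 3, 4  # stand-ins for 84 and 8400

# A small matrix and the same data stored row by row in a flat list
matrix = [[10, 11, 12, 13],
          [20, 21, 22, 23],
          [30, 31, 32, 33]]
flat = [value for row in matrix for value in row]

def at(flat, row, col, cols=COLS):
    """Read item (row, col) of a row-major matrix stored as a flat list."""
    return flat[row * cols + col]

# All values of one detection are spread over the rows of a single column
column_2 = [at(flat, row, 2) for row in range(ROWS)]
print(column_2)
```

With the real output, the same formula reads row r of detection index as output[r*8400+index].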

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Go. Include them in your code below the process_output function:



func iou(box1, box2 []interface{}) float64 {
    return intersection(box1, box2) / union(box1, box2)
}

func union(box1, box2 []interface{}) float64 {
    box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
    box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
    box1_area := (box1_x2 - box1_x1) * (box1_y2 - box1_y1)
    box2_area := (box2_x2 - box2_x1) * (box2_y2 - box2_y1)
    return box1_area + box2_area - intersection(box1, box2)
}

func intersection(box1, box2 []interface{}) float64 {
    box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
    box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
    x1 := math.Max(box1_x1, box2_x1)
    y1 := math.Max(box1_y1, box2_y1)
    x2 := math.Min(box1_x2, box2_x2)
    y2 := math.Min(box1_y2, box2_y2)
    // Clamp to zero so that boxes that do not overlap give an empty intersection
    return math.Max(0, x2-x1) * math.Max(0, y2-y1)
}
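
As a quick sanity check of the formula, here is a worked example in Python with hand-picked boxes. It is a sketch, not a port of the code above: the area helper is added for brevity, and the intersection is clamped to zero so that boxes that do not overlap give an IoU of 0.

```python
def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero: disjoint boxes must not produce a positive area
    return max(0, x2 - x1) * max(0, y2 - y1)

def iou(a, b):
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter)

box_a = (0, 0, 10, 10)
box_b = (5, 5, 15, 15)    # overlaps box_a in a 5x5 square
box_c = (20, 20, 30, 30)  # does not touch box_a

print(iou(box_a, box_b))  # 25 / (100 + 100 - 25)
print(iou(box_a, box_c))  # no overlap
print(iou(box_a, box_a))  # identical boxes
```

Without the clamp, box_a and box_c would give (10 - 20) * (10 - 20) = 100 as the intersection, which is why the sketch guards against negative widths and heights.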

Also, you will need to find the YOLO class label by ID, so add the yolo_classes array to your code:



var yolo_classes = []string{
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush",
}



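Now let's write the process_output function. Each of the 8400 columns of the output describes one detected object, so start with an array for the collected bounding boxes and a loop over all column indexes:

boxes := [][]interface{}{}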
for index := 0; index < 8400; index++ {

}

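To clarify, here is how the flat output array is laid out: the first 8400 items are the x centers of all detections, the next 8400 are the y centers, then come 8400 widths and 8400 heights, followed by 80 blocks of 8400 class probabilities. So, for the detection with a given index, you can read the bounding box values like this: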



boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
    xc := output[index]
    yc := output[8400+index]
    w := output[2*8400+index]
    h := output[3*8400+index]
}

Then you can calculate the corners of the bounding box and scale them to the size of the original image:



    x1 := (xc - w/2) / 640 * float32(img_width)
    y1 := (yc - h/2) / 640 * float32(img_height)
    x2 := (xc + w/2) / 640 * float32(img_width)
    y2 := (yc + h/2) / 640 * float32(img_height)



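Next, find the object class with the highest probability. The probabilities of the 80 classes are stored in rows 4 to 83, so for each class you read the item at position 8400*(col+4)+index:

class_id, prob := 0, float32(0.0)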
for col := 0; col < 80; col++ {
    if output[8400*(col+4)+index] > prob {
        prob = output[8400*(col+4)+index]
        class_id = col
    }
}

Then, having the maximum probability and the class_id, you can either skip the object if the probability is less than 0.5, or find the label of its class.

Here is the final code that processes the detections and collects the bounding boxes in the boxes array:



boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
    class_id, prob := 0, float32(0.0)
    for col := 0; col < 80; col++ {
        if output[8400*(col+4)+index] > prob {
            prob = output[8400*(col+4)+index]
            class_id = col
        }
    }
    if prob < 0.5 {
        continue
    }
    label := yolo_classes[class_id]
    xc := output[index]
    yc := output[8400+index]
    w := output[2*8400+index]
    h := output[3*8400+index]
    x1 := (xc - w/2) / 640 * float32(img_width)
    y1 := (yc - h/2) / 640 * float32(img_height)
    x2 := (xc + w/2) / 640 * float32(img_width)
    y2 := (yc + h/2) / 640 * float32(img_height)
    boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
}

The last step is to filter the boxes array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code does the same as the Python implementation, but looks slightly different because of the Go language specifics:



sort.Slice(boxes, func(i, j int) bool {
    // Descending order: the box with the highest probability comes first
    return boxes[i][5].(float32) > boxes[j][5].(float32)
})
result := [][]interface{}{}
for len(boxes) > 0 {
    result = append(result, boxes[0])
    tmp := [][]interface{}{}
    for _, box := range boxes {
        if iou(boxes[0], box) < 0.7 {
            tmp = append(tmp, box)
        }
    }
    boxes = tmp
}

First we sort the boxes by probability in descending order to put the boxes with the highest probability at the top.
In a loop, we put the box with the highest probability into the result array.
Then we create a temporary tmp array and, in the inner loop over all boxes, put into it only the boxes that do not overlap the selected one too much (that have IoU<0.7).
Then we overwrite the boxes array with the tmp array. This way, we filter out all overlapping boxes from the boxes array.
If some boxes remain after filtering, the loop continues until the boxes array becomes empty.

Finally, the result variable contains all bounding boxes that should be returned.
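
The steps above can be sketched in Python on three hand-made boxes. This is a minimal illustration under stated assumptions: the nms name and the sample boxes are hypothetical, boxes are (x1, y1, x2, y2, label, prob) tuples, and the selected box is dropped explicitly with boxes[1:] instead of relying on its IoU with itself.

```python
def iou(a, b):
    # Overlap width and height, clamped to zero for disjoint boxes
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_threshold=0.7):
    # Highest probability first
    boxes = sorted(boxes, key=lambda box: box[5], reverse=True)
    result = []
    while boxes:
        best = boxes[0]
        result.append(best)
        # Keep only the remaining boxes that do not overlap the selected one too much
        boxes = [box for box in boxes[1:] if iou(best, box) < iou_threshold]
    return result

detections = [
    (12, 12, 112, 112, "car", 0.8),      # near duplicate of the next box
    (10, 10, 110, 110, "car", 0.9),
    (200, 200, 260, 300, "person", 0.6),
]
for box in nms(detections):
    print(box[4], box[5])
```

The two car boxes have an IoU of about 0.92, so only the one with the higher probability survives, while the person box does not overlap them and is kept.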

That's all! For convenience, here is the full code of the process_output function:



func process_output(output []float32, img_width, img_height int64) [][]interface{} {
    boxes := [][]interface{}{}
    for index := 0; index < 8400; index++ {
        class_id, prob := 0, float32(0.0)
        for col := 0; col < 80; col++ {
            if output[8400*(col+4)+index] > prob {
                prob = output[8400*(col+4)+index]
                class_id = col
            }
        }
        if prob < 0.5 {
            continue
        }
        label := yolo_classes[class_id]
        xc := output[index]
        yc := output[8400+index]
        w := output[2*8400+index]
        h := output[3*8400+index]
        x1 := (xc - w/2) / 640 * float32(img_width)
        y1 := (yc - h/2) / 640 * float32(img_height)
        x2 := (xc + w/2) / 640 * float32(img_width)
        y2 := (yc + h/2) / 640 * float32(img_height)
        boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
    }

    sort.Slice(boxes, func(i, j int) bool {
        // Descending order: the box with the highest probability comes first
        return boxes[i][5].(float32) > boxes[j][5].(float32)
    })
    result := [][]interface{}{}
    for len(boxes) > 0 {
        result = append(result, boxes[0])
        tmp := [][]interface{}{}
        for _, box := range boxes {
            if iou(boxes[0], box) < 0.7 {
                tmp = append(tmp, box)
            }
        }
        boxes = tmp
    }
    return result
}

If you would like to work with this output in a more convenient "Pythonic" way, there is the Gorgonia Tensor library that emulates features of NumPy in Go. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and then traverse the detected objects by row.
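
The reshape-then-transpose idea itself can be sketched in plain Python. This does not use Gorgonia or any other tensor library; the reshape and transpose helpers are hypothetical, and the 3x4 size stands in for 84x8400.

```python
def reshape(flat, rows, cols):
    """Split a flat row-major list into `rows` lists of `cols` items."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

# A flat output with 3 rows of 4 items (stand-in for 84 rows of 8400 items)
flat = [10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33]

by_detection = transpose(reshape(flat, 3, 4))
for detection in by_detection:
    print(detection)  # one detection per row
```

After the transposition, each row holds all values of one detection, so the loop no longer needs any position arithmetic.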

That is it for the Go implementation. If you wrote everything correctly, then you can start this web service by running the following command:



go run main.go

and open http://localhost:8080 in a web browser.

The code that we developed here is intended only to demonstrate how to load and run the YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include any details except working with ONNX. It does not include any resource management or error handling. These tasks depend on real use cases, and it's up to you how to implement them in your projects.

The full reference of the Go library for the ONNX runtime is available here.

You can find the source code of the Go object detector web service in this repository.

Create a web service on Rust

This article would not be complete without an example in a low-level language: a high-performance, efficient language in which developers manage memory themselves and do not rely on a garbage collector. I was deciding between C++ and Rust. Finally, I decided to ask people and created the following poll in the LinkedIn group:

Regardless of the results, I also analyzed the comments and understood that, most likely, people answered a different question from the one I asked. I did not ask "Which of these programming languages do you know?", "Which of them do you like?" or "Which of them is the most popular?". Instead, I asked: "Which is better to learn TODAY to create NEW high performance server applications?".

Finally, I got only one valuable comment:

It was the only comment that received some likes, and I completely agree with that text.

Finally, the choice was made! We are going to create an object detection web service in Rust - the safest low-level programming language today.

Setup the project

Enter the command to create a new Rust project:



cargo new object_detector

This will create an object_detector folder with a project template in it.

Go to this folder and open the Cargo.toml file in it.

Add the following packages to the dependencies section:



[dependencies]
image = "0.24.6"
ndarray = "0.15.6"
ort = "1.14.6"
serde = "1.0.84"
serde_derive = "1.0.84"
serde_json = "1.0.36"
rocket = "=0.5.0-rc.3"

image - library for image processing.
ndarray - multidimensional array support library.
ort - ONNX runtime library.
serde,serde_derive,serde_json - Serialization library to serialize data to JSON.
rocket - Web framework.

Create a Rocket.toml file which will contain configuration for the Rocket web server and add the following lines to it:



[global]
address = "0.0.0.0"
port = 8080

We are not going to change the frontend, so copy the index.html file to the project. Also, copy the yolov8m.onnx model.

Before continuing, ensure that the ONNX runtime is installed on your operating system, because the library that is integrated into the Rust package may not work correctly. To install it, you can download the archive for your operating system from here, extract it and copy the contents of the "lib" subfolder to the system libraries path of your operating system.

The main project file, main.rs, is already generated and located in the src subfolder. Open this file and add the following boilerplate code to it:



use std::{sync::Arc, path::Path, vec};
use image::{GenericImageView, imageops::FilterType};
use ndarray::{Array, IxDyn, s, Axis};
use ort::{Environment,SessionBuilder,tensor::InputTensor};
use rocket::{response::content,fs::TempFile,form::Form};
#[macro_use] extern crate rocket;

#[rocket::main]
async fn main() {
    rocket::build()
        .mount("/", routes![index])
        .mount("/detect", routes![detect])
        .launch().await.unwrap();
}

#[get("/")]
fn index() -> content::RawHtml<String> {
    return content::RawHtml(std::fs::read_to_string("index.html").unwrap());
}

#[post("/", data = "<file>")]
fn detect(file: Form<TempFile<'_>>) -> String {
    let buf = std::fs::read(file.path().unwrap_or(Path::new(""))).unwrap_or(vec![]);
    let boxes = detect_objects_on_image(buf);
    return serde_json::to_string(&boxes).unwrap_or_default()
}

fn detect_objects_on_image(buf: Vec<u8>) -> Vec<(f32,f32,f32,f32,&'static str,f32)> {
    let (input,img_width,img_height) = prepare_input(buf);
    let output = run_model(input);
    return process_output(output, img_width, img_height);    
}

fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {

}

fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {

}

fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {

}

The first block imports the required modules:

image - to process images
ndarray - to work with tensors
ort - ONNX runtime library
rocket - Rocket Web framework
std - some objects from Rust standard library

Then, in the main function we start the Rocket web server and attach index and detect routes to it.

The index function serves the root of the service, it just returns the content of the index.html file as HTML.

The detect function serves the /detect endpoint. It receives the uploaded file, passes it to the detect_objects_on_image, receives the array of bounding boxes, serializes them to JSON and returns this JSON string to the frontend.

The detect_objects_on_image function implements the same actions as the Python version. It converts the image to a multidimensional array of numbers, passes it to the ONNX runtime and processes the output. Finally, it returns the array of bounding boxes, where each bounding box is a tuple of (x1,y1,x2,y2,label,prob). Rust is a strongly typed language, so we have to specify the types of all items in this tuple. That is why it returns Vec<(f32,f32,f32,f32,&'static str,f32)>, which is a vector of bounding box tuples.

Then we define stubs for prepare_input, run_model and process_output functions, that will be implemented one by one in the following sections.

Prepare the input

To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert to a tensor of (1,3,640,640) shape which is an array of single image represented as 3 640x640 matrices. The first item is an array of red components of image pixels, the second item is an array of greens, and the last item is an array of blues. We will use the ndarray library to construct this tensor and fill it with pixel color values. But first we need to load the image, and resize it to 640x640:



let img = image::load_from_memory(&buf).unwrap();
let (img_width, img_height) = (img.width(), img.height());
let img = img.resize_exact(640, 640, FilterType::CatmullRom);

In the first line, the image is loaded from the uploaded file buffer.
Next, we save the original image width and height for future use.
Finally, we resize the image to 640x640.

Then, let's construct the input array of required shape:



let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();

This line creates a new 4-dimensional tensor filled with zeros.

Now, you need to get access to the image pixels and their color components. The img object has a pixels() method, which returns an iterator over the image pixels. You can use it to get access to each pixel in a loop:



for pixel in img.pixels() {
}

Each pixel is a tuple of three items that we need:

pixel.0 - the x coordinate of the pixel
pixel.1 - the y coordinate of the pixel
pixel.2 - the color of the pixel, an object that wraps an array of 4 items [r,g,b,a], available as pixel.2.0

Having this, you can fill the tensor input in a loop:



for pixel in img.pixels() {
    let x = pixel.0 as usize;
    let y = pixel.1 as usize;
    let [r,g,b,_] = pixel.2.0;
    input[[0, 0, y, x]] = (r as f32) / 255.0;
    input[[0, 1, y, x]] = (g as f32) / 255.0;
    input[[0, 2, y, x]] = (b as f32) / 255.0;
};

First, we extract x and y variables and convert them to the type that can be used as a tensor index
Then we destructure color to r, g and b variables.
Finally, we put these pixel color components to appropriate cells of the tensor. Notice that the y goes first and the x goes next. This is because in matrices, the first dimension is a row and the second is a column.
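
The channel-first layout can be illustrated with a tiny Python sketch. The 2x2 image below is made up, the pixels are given as (x, y, (r, g, b, a)) tuples to mirror what the pixel iterator yields, and nested lists stand in for the tensor.

```python
# A hypothetical 2x2 image: red, green in the top row; blue, white in the bottom row
pixels = [
    (0, 0, (255, 0, 0, 255)), (1, 0, (0, 255, 0, 255)),
    (0, 1, (0, 0, 255, 255)), (1, 1, (255, 255, 255, 255)),
]
width, height = 2, 2

# tensor[channel][y][x] is the (3, H, W) part of the (1, 3, H, W) input
tensor = [[[0.0] * width for _ in range(height)] for _ in range(3)]
for x, y, (r, g, b, _alpha) in pixels:
    # The row (y) is the first index and the column (x) is the second one
    tensor[0][y][x] = r / 255.0
    tensor[1][y][x] = g / 255.0
    tensor[2][y][x] = b / 255.0

print(tensor[0])  # red plane
print(tensor[1])  # green plane
print(tensor[2])  # blue plane
```

Each plane is a separate height x width matrix, which is why the colors of one pixel end up in three different places of the tensor.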

So, now you have the input prepared for the neural network. You need to return it from the function along with img_width and img_height. Here is the full source of the prepare_input function:



fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {
    let img = image::load_from_memory(&buf).unwrap();
    let (img_width, img_height) = (img.width(), img.height());
    let img = img.resize_exact(640, 640, FilterType::CatmullRom);
    let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();
    for pixel in img.pixels() {
        let x = pixel.0 as usize;
        let y = pixel.1 as usize;
        let [r,g,b,_] = pixel.2.0;
        input[[0, 0, y, x]] = (r as f32) / 255.0;
        input[[0, 1, y, x]] = (g as f32) / 255.0;
        input[[0, 2, y, x]] = (b as f32) / 255.0;
    };
    return (input, img_width, img_height);
}

Now, it's time to pass this input through the YOLOv8 model.

Run the model

The run_model function is used to pass the input tensor through the model and return the output tensor. This is its source code:



fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {
    let input = InputTensor::FloatTensor(input);
    let env = Arc::new(Environment::builder().with_name("YOLOv8").build().unwrap());
    let model = SessionBuilder::new(&env).unwrap().with_model_from_file("yolov8m.onnx").unwrap();
    let outputs = model.run([input]).unwrap();
    let output = outputs.get(0).unwrap().try_extract::<f32>().unwrap().view().t().into_owned();
    return output;
}

First it converts the input to the internal ONNX runtime tensor format.
Then it creates the environment and instantiates the ONNX model in it from the yolov8m.onnx file.
Then it runs the model with the input tensor and receives the array of outputs.
Finally, it extracts the first output, transposes it with t() and returns it.

The returned output is an ndarray tensor, so we can traverse it in a loop. Let's process it.

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Rust. Include them in your code below the process_output function:



fn iou(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    return intersection(box1, box2) / union(box1, box2);
}

fn union(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
    let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
    let box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1);
    let box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1);
    return box1_area + box2_area - intersection(box1, box2);
}

fn intersection(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
    let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
    let x1 = box1_x1.max(box2_x1);
    let y1 = box1_y1.max(box2_y1);
    let x2 = box1_x2.min(box2_x2);
    let y2 = box1_y2.min(box2_y2);
    // Clamp to zero so that boxes that do not overlap give an empty intersection
    return (x2-x1).max(0.0)*(y2-y1).max(0.0);
}

Also, we will need to get labels for detected objects, so include this array of COCO class labels:



const YOLO_CLASSES:[&str;80] = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
];

Now let's start writing the process_output function.

Let's define an array to which you will put collected bounding boxes:



let mut boxes = Vec::new();

The output that run_model returns is a tensor with the shape [8400,84,1] instead of the (1,84,8400) shape that you saw in other programming languages, because run_model transposed it with t(), which reverses the order of the axes. So, it's already ordered by rows, but has an extra dimension at the end. Let's remove it:



let output = output.slice(s![..,..,0]);

This line extracted the (8400,84) matrix from this tensor, and we can traverse it by first axis, e.g. by rows:



for row in output.axis_iter(Axis(0)) {
}

Here, the row is a one-dimensional ndarray object that represents a row with 84 float numbers. It will be more convenient to convert it to a plain vector, so let's do it:



for row in output.axis_iter(Axis(0)) {
    let row:Vec<_> = row.iter().map(|x| *x).collect();
}

The first 4 items of this array contain bounding box coordinates, and we can convert and scale them to x1,y1,x2,y2 now:



let xc = row[0]/640.0*(img_width as f32);
let yc = row[1]/640.0*(img_height as f32);
let w = row[2]/640.0*(img_width as f32);
let h = row[3]/640.0*(img_height as f32);
let x1 = xc - w/2.0;
let x2 = xc + w/2.0;
let y1 = yc - h/2.0;
let y2 = yc + h/2.0;
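
Here is the same conversion as a worked Python example with made-up numbers. The to_corners name and the sample values are hypothetical; the function takes a box in 640x640 model coordinates and returns its corners in the coordinates of the original image.

```python
def to_corners(xc, yc, w, h, img_width, img_height, model_size=640):
    """Convert (center, size) in model coordinates to corners in image coordinates."""
    scale_x = img_width / model_size
    scale_y = img_height / model_size
    x1 = (xc - w / 2) * scale_x
    y1 = (yc - h / 2) * scale_y
    x2 = (xc + w / 2) * scale_x
    y2 = (yc + h / 2) * scale_y
    return x1, y1, x2, y2

# A 64x128 box centered in the 640x640 input, mapped to a 1280x720 image
print(to_corners(320, 320, 64, 128, 1280, 720))
```

Scaling the center and the size first, as the code above does, or scaling the corners afterwards, as this sketch does, gives the same result because the scaling is linear.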

Then, all items from 4 to 83 are probabilities that this bounding box belongs to each of 80 object classes. You need to find maximum of these items and the index of this item, which can be used as an ID of object class. You can do this in a loop:



let mut class_id = 0;
let mut prob:f32 = 0.0;
for index in 4..row.len() {
    if row[index]>prob {
        prob = row[index];
        class_id = index-4;
    }
}
let label = YOLO_CLASSES[class_id];

Here we determined the maximum probability, the class_id of object with maximum probability and the label of object of this class.

It works fine, but I'd rather implement it in a functional way instead of a loop:



let (class_id, prob) = row.iter().skip(4).enumerate()
    .map(|(index,value)| (index,*value))
    .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
let label = YOLO_CLASSES[class_id];

This code gets an iterator over the row that skips the first 4 items, so it starts from the class probabilities.
Then it maps these items to tuples of (class_id, prob).
Then it reduces this sequence of tuples to the single tuple with the maximum prob.
The resulting tuple is then destructured into the class_id and prob variables.
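
For comparison, the same selection can be written in Python in one line. This is only a sketch: the row below is made up and has 4 class probabilities instead of 80.

```python
# xc, yc, w, h followed by the probabilities of 4 hypothetical classes
row = [320.0, 240.0, 50.0, 80.0, 0.01, 0.05, 0.92, 0.02]

# enumerate() numbers the probabilities from 0, so the index is the class ID
class_id, prob = max(enumerate(row[4:]), key=lambda item: item[1])
print(class_id, prob)
```

As in the reduce version, the first item wins when several classes share the same maximum probability.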

Finally, you can skip the row if the prob < 0.5 or collect all values to a bounding box and push this bounding box to the boxes array.

Here is all the code that we have now, with the operations in the correct order:



let mut boxes = Vec::new();
let output = output.slice(s![..,..,0]);
for row in output.axis_iter(Axis(0)) {
    let row:Vec<_> = row.iter().map(|x| *x).collect();
    let (class_id, prob) = row.iter().skip(4).enumerate()
        .map(|(index,value)| (index,*value))
        .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
    if prob < 0.5 {
        continue
    }
    let label = YOLO_CLASSES[class_id];
    let xc = row[0]/640.0*(img_width as f32);
    let yc = row[1]/640.0*(img_height as f32);
    let w = row[2]/640.0*(img_width as f32);
    let h = row[3]/640.0*(img_height as f32);
    let x1 = xc - w/2.0;
    let x2 = xc + w/2.0;
    let y1 = yc - h/2.0;
    let y2 = yc + h/2.0;
    boxes.push((x1,y1,x2,y2,label,prob));
}

P.S. Actually, it's possible to implement all of this in a functional way instead of a loop. You can do it as homework.

Finally, you need to filter the boxes array to exclude the boxes, that overlap each other, using the Intersection over union. The filtered boxes should be collected to the result array:



let mut result = Vec::new();
boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
while boxes.len()>0 {
    result.push(boxes[0]);
    boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
}

First, we sort the boxes by probability in descending order to put the boxes with the highest probability at the top.
Then, in a loop, we put the first box, which has the highest probability, into the resulting array.
Then, we overwrite the boxes array using a filter that keeps only those boxes whose IoU with the selected box is less than 0.7.
If the boxes array still contains elements after filtering, the loop continues.

Finally, after the loop, the boxes array will be empty and the result will contain bounding boxes of all different detected objects.

The result array should be returned by this function. Here is the whole code:



fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {
    let mut boxes = Vec::new();
    let output = output.slice(s![..,..,0]);
    for row in output.axis_iter(Axis(0)) {
        let row:Vec<_> = row.iter().map(|x| *x).collect();
        let (class_id, prob) = row.iter().skip(4).enumerate()
            .map(|(index,value)| (index,*value))
            .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
        if prob < 0.5 {
            continue
        }
        let label = YOLO_CLASSES[class_id];
        let xc = row[0]/640.0*(img_width as f32);
        let yc = row[1]/640.0*(img_height as f32);
        let w = row[2]/640.0*(img_width as f32);
        let h = row[3]/640.0*(img_height as f32);
        let x1 = xc - w/2.0;
        let x2 = xc + w/2.0;
        let y1 = yc - h/2.0;
        let y2 = yc + h/2.0;
        boxes.push((x1,y1,x2,y2,label,prob));
    }

    boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
    let mut result = Vec::new();
    while boxes.len()>0 {
        result.push(boxes[0]);
        boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
    }
    return result;
}

That is it for the Rust web service. If everything is written correctly, you can start the web service by running the following command in the project folder:



cargo run

and open http://localhost:8080 in a web browser.

The code that we developed here is oversimplified. It's intended only to demonstrate how to load and run the YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include any details except working with ONNX. It does not include any resource management or error handling. These tasks depend on real use cases, and it's up to you how to implement them in your projects.

The full reference of the Rust library for the ONNX runtime is available here.

You can find the source code of the Rust object detector web service in this repository.

Conclusion

In this article I showed that even though the YOLOv8 neural network is created in Python, you can use it from other programming languages, because it can be exported to the universal ONNX format.

We explored the foundational algorithms used to prepare the input and process the output of an ONNX model, which are the same for all programming languages that have interfaces for the ONNX runtime.

After covering the main concepts, I showed how to create an object detection web service based on the ONNX runtime using Python, Julia, Node.js, JavaScript, Go and Rust. Each language has some differences, but in general, the whole workflow follows the same algorithm.

You can apply this experience to any other neural networks created using PyTorch or TensorFlow (which covers most of the neural networks that exist today), because each framework can export its models to ONNX.

There are ONNX runtime interfaces for other programming languages like Java, C# or C++ and for other platforms, including mobile phones. You can find the list of official bindings here.

Also, there are unofficial bindings for other languages, like PHP. It's a great way to integrate neural networks into WordPress websites.

I believe that it won't be difficult to rewrite the projects that we created here in those other languages, if you know those languages, of course.

In the next article, I will show how to detect objects in a video in a web browser in real time. Follow me to be the first to know when I publish it.

You can find me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

How to detect objects on images using the YOLOv8 neural network

Andrey Germanov — Mon, 24 Apr 2023 05:59:32 +0000

Introduction
Problems YOLOv8 Can Solve
Getting started with YOLOv8
How to prepare data to train the YOLOv8 model
How to train the YOLOv8 model
How to create an object detection web service
How to create a frontend
How to create a backend
Conclusion

Introduction

Object detection is a computer vision task that involves identifying and locating objects in images or videos. It is an important part of many applications, such as self-driving cars, robotics, and video surveillance.

Over the years, many methods and algorithms have been developed to find objects in images and their positions. The best quality in performing these tasks comes from using convolutional neural networks.

One of the most popular neural networks for this task is YOLO, created in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their famous research paper "You Only Look Once: Unified, Real-Time Object Detection".

Since that time, there have been quite a few versions of YOLO. Recent releases can do even more than object detection. The newest release is YOLOv8, which we are going to use in this tutorial.

Here, I will show you the main features of this network for object detection. First, we will use a pre-trained model to detect common object classes like cats and dogs. Then, I will show how to train your own model to detect specific object types that you select, and how to prepare the data for this process. Finally, we will create a web application to detect objects on images right in a web browser using the custom trained model.

To follow this tutorial, you should be familiar with Python and have a basic understanding of machine learning, neural networks, and their application in object detection. You can watch this short video course to familiarize yourself with all required machine learning theory.

Once you've refreshed the theory, let's get started with the practice!

Problems YOLOv8 Can Solve

You can use the YOLOv8 network to solve classification, object detection, and image segmentation problems. All these methods detect objects in images or in videos in different ways, as you can see in the image below:

Classification	Detection	Segmentation

The neural network that is created and trained for image classification determines the class of the object in the image and returns its name and the probability of this prediction. For example, for the left image, it returned that this is a "cat" and that the confidence level of this prediction is 92% (0.92).

The neural network for object detection, in addition to the object type and probability, returns the coordinates of the object on the image: x, y, width and height, as shown on the second image. Furthermore, object detection neural networks can detect several objects on the image and their bounding boxes.

Finally, in addition to object types and bounding boxes, the neural network trained for image segmentation detects the shapes of the objects, as shown on the right image.

There are many different neural network architectures developed for these tasks, and in the past you had to use a separate network for each of them. Fortunately, things changed after YOLO was created. Now you can use a single platform for all these problems.

In this article, we will explore object detection using YOLOv8. I will guide you through creating a web application that uses it to detect traffic lights and road signs in images. In the next articles, I will cover other features, including image segmentation.

In the next sections, we will go through all the steps required to create an object detector. By the end, you will have a complete AI-powered web application.

Getting started with YOLOv8

Technically speaking, YOLOv8 is a group of convolutional neural network models, created and trained using the PyTorch framework.

In addition, the YOLOv8 package provides a single Python API to work with all of them using the same methods. That is why, to use it, you need an environment to run Python code. I highly recommend using the Jupyter Notebook.

After ensuring that you have Python and Jupyter installed on your computer, run the notebook and install the YOLOv8 package in it by running the following command:



!pip install ultralytics

The ultralytics package has the YOLO class, which is used to create neural network models.

To get access to it, import it into your Python code:



from ultralytics import YOLO

Now everything is ready to create the neural network model:



model = YOLO("yolov8m.pt")

As I wrote before, YOLOv8 is a group of neural network models. These models were created and trained using PyTorch and exported to files with the .pt extension. There are three types of models, and five models of different sizes for each type:

Classification	Detection	Segmentation	Kind
yolov8n-cls.pt	yolov8n.pt	yolov8n-seg.pt	Nano
yolov8s-cls.pt	yolov8s.pt	yolov8s-seg.pt	Small
yolov8m-cls.pt	yolov8m.pt	yolov8m-seg.pt	Medium
yolov8l-cls.pt	yolov8l.pt	yolov8l-seg.pt	Large
yolov8x-cls.pt	yolov8x.pt	yolov8x-seg.pt	Huge

The bigger the model you choose, the better prediction quality you can achieve, but the slower it will work. This tutorial covers object detection, which is why the previous code snippet uses "yolov8m.pt", a medium-sized model for object detection.

When you run this code for the first time, it will download the yolov8m.pt file from the Ultralytics server to the current folder and then construct the model object. Now you can train this model, detect objects with it and export it for use in production. For all these tasks, it has convenient methods:

train({path to dataset descriptor file}) - used to train the model on a dataset of images.
predict({image}) - used to make a prediction for the specified image, e.g. to detect the bounding boxes of all objects that the model can find in this image.
export({format}) - used to export the model from the default PyTorch format to the specified one.
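The naming scheme in the table above is regular, so a file name can be built from a size and a task. Here is a minimal sketch, assuming a hypothetical helper named weights_file (it is not part of the ultralytics package):

```python
# Hypothetical helper for illustration; not part of the ultralytics package.
SIZES = {"nano": "n", "small": "s", "medium": "m", "large": "l", "huge": "x"}
SUFFIXES = {"detection": "", "classification": "-cls", "segmentation": "-seg"}

def weights_file(size, task="detection"):
    """Build a pretrained weights file name from the table, e.g. 'yolov8m.pt'."""
    return f"yolov8{SIZES[size]}{SUFFIXES[task]}.pt"

print(weights_file("medium"))                # yolov8m.pt
print(weights_file("nano", "segmentation"))  # yolov8n-seg.pt
```

The returned string can then be passed to the YOLO class in the same way as the literal file name.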

All YOLOv8 models for object detection ship already pretrained on the COCO dataset, which is a huge collection of images with objects of 80 types. So, if you do not have specific needs, you can just run a model as is, without additional training. For example, you can download this image as "cat_dog.jpg":

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wr8bm7gga15xp9gfz7yz.jpg)

and run predict to detect all objects on it:



results = model.predict("cat_dog.jpg")

The predict method accepts many different input types, including a path to a single image, an array of paths to images, the Image object of the well-known PIL Python library and others.

After running the input through the model, predict returns an array of results, one for each input image. As we provided only a single image, it returns an array with a single item, which you can extract this way:



result = results[0]

The result contains the detected objects and convenient properties to work with them. The most important one is the boxes array, with information about the bounding boxes detected in the image. You can determine how many objects were detected by running the len function:



len(result.boxes)

When I ran this, I got "2", which means that there are two boxes detected, perhaps one for the dog and one for the cat.

Then, you can analyze each box either in a loop, or manually. Let's get the first one:



box = result.boxes[0]

The box object contains the properties of the bounding box, including:

xyxy - the coordinates of the box as an array [x1,y1,x2,y2]
cls - the ID of object type
conf - the confidence level of the model about this object. If it's very low, like < 0.5, then you can just ignore the box.
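Ignoring low-confidence boxes can be sketched in plain Python. The example below assumes the box properties have already been unpacked into tuples of (class ID, confidence, coordinates). The first two rows mirror the cat_dog.jpg example in this article, and the third one is a made-up weak detection added for illustration:

```python
# Each detection: (class_id, confidence, [x1, y1, x2, y2])
detections = [
    (16, 0.95, [261, 94, 461, 313]),
    (15, 0.92, [140, 170, 256, 316]),
    (56, 0.31, [10, 20, 60, 90]),  # hypothetical low-confidence box
]

def keep_confident(items, threshold=0.5):
    """Return only the detections whose confidence reaches the threshold."""
    return [d for d in items if d[1] >= threshold]

confident = keep_confident(detections)
print(len(confident))  # 2
```

The 0.5 threshold is only the example value mentioned above; a suitable value depends on your task.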

Let's print information about the detected box:



print("Object type:", box.cls)
print("Coordinates:", box.xyxy)
print("Probability:", box.conf)

For the first box, you will receive the following information:



Object type: tensor([16.])
Coordinates: tensor([[261.1901,  94.3429, 460.5649, 312.9910]])
Probability: tensor([0.9528])

As written above, YOLOv8 contains PyTorch models. The outputs from PyTorch models are encoded as arrays of PyTorch Tensor objects, so you need to extract the first item from each of these arrays:



print("Object type:",box.cls[0])
print("Coordinates:",box.xyxy[0])
print("Probability:",box.conf[0])



Object type: tensor(16.)
Coordinates: tensor([261.1901,  94.3429, 460.5649, 312.9910])
Probability: tensor(0.9528)

Now you see the data as Tensor objects. To unpack the actual values from a Tensor, use the .tolist() method for a tensor with an array inside and the .item() method for a tensor with a scalar value. Let's extract the data to appropriate variables:



cords = box.xyxy[0].tolist()
class_id = box.cls[0].item()
conf = box.conf[0].item()
print("Object type:", class_id)
print("Coordinates:", cords)
print("Probability:", conf)



Object type: 16.0
Coordinates: [261.1900634765625, 94.3428955078125, 460.5649108886719, 312.9909973144531]
Probability: 0.9528293609619141

Now you see the actual data. The coordinates can be rounded to integers, and the probability can be rounded to two digits after the decimal point.

The object type is 16 here. What does it mean? All objects that the neural network can detect have numeric IDs. In the case of a pretrained YOLOv8 model, there are 80 object types with IDs from 0 to 79. The COCO object classes are well known and easy to find on the Internet. In addition, the YOLOv8 result object contains the convenient names property to get these classes:



print(result.names)



{0: 'person',
 1: 'bicycle',
 2: 'car',
 3: 'motorcycle',
 4: 'airplane',
 5: 'bus',
 6: 'train',
 7: 'truck',
 8: 'boat',
 9: 'traffic light',
 10: 'fire hydrant',
 11: 'stop sign',
 12: 'parking meter',
 13: 'bench',
 14: 'bird',
 15: 'cat',
 16: 'dog',
 17: 'horse',
 18: 'sheep',
 19: 'cow',
 20: 'elephant',
 21: 'bear',
 22: 'zebra',
 23: 'giraffe',
 24: 'backpack',
 25: 'umbrella',
 26: 'handbag',
 27: 'tie',
 28: 'suitcase',
 29: 'frisbee',
 30: 'skis',
 31: 'snowboard',
 32: 'sports ball',
 33: 'kite',
 34: 'baseball bat',
 35: 'baseball glove',
 36: 'skateboard',
 37: 'surfboard',
 38: 'tennis racket',
 39: 'bottle',
 40: 'wine glass',
 41: 'cup',
 42: 'fork',
 43: 'knife',
 44: 'spoon',
 45: 'bowl',
 46: 'banana',
 47: 'apple',
 48: 'sandwich',
 49: 'orange',
 50: 'broccoli',
 51: 'carrot',
 52: 'hot dog',
 53: 'pizza',
 54: 'donut',
 55: 'cake',
 56: 'chair',
 57: 'couch',
 58: 'potted plant',
 59: 'bed',
 60: 'dining table',
 61: 'toilet',
 62: 'tv',
 63: 'laptop',
 64: 'mouse',
 65: 'remote',
 66: 'keyboard',
 67: 'cell phone',
 68: 'microwave',
 69: 'oven',
 70: 'toaster',
 71: 'sink',
 72: 'refrigerator',
 73: 'book',
 74: 'clock',
 75: 'vase',
 76: 'scissors',
 77: 'teddy bear',
 78: 'hair drier',
 79: 'toothbrush'}

Here it is: everything that this model can detect. Now you can see that 16 is "dog", so this bounding box is the bounding box of a detected dog. Let's modify the output to show the results in a more readable way:



cords = box.xyxy[0].tolist()
cords = [round(x) for x in cords]
class_id = result.names[box.cls[0].item()]
conf = round(box.conf[0].item(), 2)
print("Object type:", class_id)
print("Coordinates:", cords)
print("Probability:", conf)

In this code, I rounded all coordinates using a Python list comprehension. Then I got the name of the detected object class by ID using the result.names dictionary, and also rounded the confidence. Finally, you should get the following output:



Object type: dog
Coordinates: [261, 94, 461, 313]
Probability: 0.95

This data is good enough to show in the user interface. Let's now write code to get this information for all detected boxes in a loop:



for box in result.boxes:
  class_id = result.names[box.cls[0].item()]
  cords = box.xyxy[0].tolist()
  cords = [round(x) for x in cords]
  conf = round(box.conf[0].item(), 2)
  print("Object type:", class_id)
  print("Coordinates:", cords)
  print("Probability:", conf)
  print("---")

This code will do the same for each box and will output the following:



Object type: dog
Coordinates: [261, 94, 461, 313]
Probability: 0.95
---
Object type: cat
Coordinates: [140, 170, 256, 316]
Probability: 0.92
---

This way you can play with other images and see everything that a COCO-trained model can detect in them.

Also, if you like, you can rewrite the same code in a functional style, using list comprehensions:



def print_box(box):
    class_id, cords, conf = box
    print("Object type:", class_id)
    print("Coordinates:", cords)
    print("Probability:", conf)
    print("---")

[
    print_box([
        result.names[box.cls[0].item()],
        [round(x) for x in box.xyxy[0].tolist()],
        round(box.conf[0].item(), 2)
    ]) for box in result.boxes
]
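Once the boxes are unpacked into plain Python values, they are easy to summarize. Here is a small sketch that counts the detected objects per class. It assumes that the detections are already converted to [name, coordinates, confidence] lists, like the ones passed to print_box above:

```python
from collections import Counter

# Detections from the cat_dog.jpg example, as plain Python values
detections = [
    ["dog", [261, 94, 461, 313], 0.95],
    ["cat", [140, 170, 256, 316], 0.92],
]

# Count how many objects of each class were found
counts = Counter(name for name, _, _ in detections)
print(dict(counts))
```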

This video shows the whole coding session of this chapter in Jupyter Notebook, assuming that it's installed.

Using the models pretrained on well-known objects is ok to start, but in practice, you may need a solution to detect specific objects for a concrete business problem.

So, you have to teach your own model to detect these types of objects. To do that, you need to create a database of annotated images for your problem and train the model on these images.

How to prepare data to train the YOLOv8 model

These are the steps that you need to follow to create each of the datasets:

Decide and encode classes of objects you want to teach your model to detect. For example, if you want to detect only cats and dogs, then you can state that "0" is cat and "1" is dog.
Create a folder for your dataset and two subfolders in it: "images" and "labels".
Put the images to the "images" subfolder. The more images you collect, the better for training.
For each image, create an annotation text file in the "labels" subfolder. Annotation text files should have the same names as the image files and the ".txt" extension. In the annotation file, add a record for each object that exists in the corresponding image, in the following format:



{object_class_id} {x_center} {y_center} {width} {height}

Actually, this is the most time-consuming manual work in a machine learning process: to measure bounding boxes for all objects and add them to annotation files. Moreover, coordinates should be normalized to fit in a range from 0 to 1. To calculate them, you need to use the following formulas:

x_center = (box_x_left+box_width/2)/image_width
y_center = (box_y_top+box_height/2)/image_height
width = box_width/image_width
height = box_height/image_height
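These formulas translate directly into a small function. This is a minimal sketch; the function name to_yolo and the sample box are made up for illustration:

```python
def to_yolo(box_x_left, box_y_top, box_width, box_height,
            image_width, image_height):
    """Convert a pixel bounding box to normalized YOLO coordinates."""
    x_center = (box_x_left + box_width / 2) / image_width
    y_center = (box_y_top + box_height / 2) / image_height
    width = box_width / image_width
    height = box_height / image_height
    return x_center, y_center, width, height

# Hypothetical 100x50 box with its top-left corner at (50, 100)
# on a 200x400 image
print(to_yolo(50, 100, 100, 50, 200, 400))  # (0.5, 0.3125, 0.5, 0.125)
```

All four returned values fall in the range from 0 to 1, as the annotation format requires.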

For example, if you want to add the "cat_dog.jpg" image that we used before to the dataset, you need to copy it to the "images" folder and then measure and collect the following data about the image and its bounding boxes:

Image:

image_width = 612
image_height = 415

Objects:

Parameter	Dog	Cat
box_x_left	261	140
box_y_top	94	170
box_width	200	116
box_height	219	146

Then, create the "cat_dog.txt" file in the "labels" folder and, using the formulas above, calculate the coordinates:

Dog (class id=1):

x_center = (261+200/2)/612 = 0.589869281
y_center = (94+219/2)/415 = 0.490361446
width = 200/612 = 0.326797386
height = 219/415 = 0.527710843

Cat (class id=0):

x_center = (140+116/2)/612 = 0.323529412
y_center = (170+146/2)/415 = 0.585542169
width = 116/612 = 0.189542484
height = 146/415 = 0.351807229

and add the following lines to the file:



1 0.589869281 0.490361446 0.326797386 0.527710843
0 0.323529412 0.585542169 0.189542484 0.351807229

The first line contains a bounding box for the dog (class id=1), the second line contains a bounding box for the cat (class id=0). Of course, you can have the image with many dogs and many cats at the same time, and you can add bounding boxes for all of them.
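To double-check an annotation file, you can convert its lines back to pixels and compare them with your measurements. Here is a sketch of this reverse calculation; the label_to_pixels name is an assumption made for this example:

```python
def label_to_pixels(line, image_width, image_height):
    """Convert one annotation line back to a pixel box:
    (class_id, [box_x_left, box_y_top, box_width, box_height])."""
    class_id, x_center, y_center, width, height = line.split()
    box_width = float(width) * image_width
    box_height = float(height) * image_height
    box_x_left = float(x_center) * image_width - box_width / 2
    box_y_top = float(y_center) * image_height - box_height / 2
    return int(class_id), [round(box_x_left), round(box_y_top),
                           round(box_width), round(box_height)]

line = "1 0.589869281 0.490361446 0.326797386 0.527710843"
print(label_to_pixels(line, 612, 415))  # (1, [261, 94, 200, 219])
```

The result matches the dog box measured on the 612x415 image above.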

After adding and annotating all images, the dataset is ready. You need to create two such datasets, one for training and one for validation, and place them in different folders. The final folder structure can look like this:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7obu30iswcnm9hb8sk93.png)

Here the training dataset is located in the "train" folder and the validation dataset is located in the "val" folder.

Finally, you need to create a dataset descriptor YAML file that points to the created datasets and describes the object classes in them. Here is a sample of this file for the data created above:
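If you have a single pool of annotated images, you can divide it between the two folders randomly. The sketch below assumes an 80/20 split and made-up file names. The article does not prescribe a ratio, so treat both as assumptions:

```python
import random

def split_dataset(image_names, val_fraction=0.2, seed=42):
    """Shuffle the image names and split them into (train, val) lists."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)  # fixed seed keeps the split repeatable
    val_count = max(1, int(len(names) * val_fraction))
    return names[val_count:], names[:val_count]

images = [f"img_{i:03d}.jpg" for i in range(10)]  # hypothetical file names
train, val = split_dataset(images)
print(len(train), len(val))  # 8 2
```

Remember to move each annotation text file together with its image, so that both end up in the same dataset.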



train: ../train/images
val: ../val/images

nc: 2
names: ['cat','dog']

This YAML file should be passed to the train method of the model to start a training process.
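The descriptor is simple enough to generate from code. This is a minimal sketch that builds the same text with plain string formatting; the dataset_yaml function name is an assumption made for this example:

```python
def dataset_yaml(train_dir, val_dir, class_names):
    """Build the text of a dataset descriptor file."""
    names = ",".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        "\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )

text = dataset_yaml("../train/images", "../val/images", ["cat", "dog"])
print(text)
```

The order of class_names matters: the position of each name is its class ID, so 'cat' is 0 and 'dog' is 1 here.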

To make this process easier, there are a lot of programs that let you annotate images for machine learning visually. You can ask a search engine something like "software to annotate images for machine learning" to get a list of them. There are also many online tools that can do all this work. One of the great online tools for this is Roboflow Annotate. With this service, you just upload your images, draw bounding boxes on them, and set the class for each bounding box. The tool then automatically creates the annotation files, splits your data into training and validation datasets, and creates a YAML descriptor file. After that, you can export and download the annotated data as a ZIP file.

In the next video, I show how to use the Roboflow to create the "cats and dogs" micro-dataset.

For real life problems, that database should be much bigger. To train a good model, you should have hundreds or thousands of annotated images.

Also, when preparing the image database, try to make it balanced. It should have an equal number of objects of each class, e.g. an equal number of dogs and cats. Otherwise, the model trained on it could predict one class better than another.

After the data is ready, copy it to the folder with the Python code that you will use for training, and return to your Jupyter Notebook to start the training process.
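One way to check the balance is to count the class IDs in the annotation files. The sketch below works on the file contents as strings. The first one is the cat_dog.txt file from above, and the second one is a made-up annotation with a single dog:

```python
from collections import Counter

def count_classes(label_texts):
    """Count the objects of each class ID across annotation file contents."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    return counts

label_texts = [
    "1 0.589869281 0.490361446 0.326797386 0.527710843\n"
    "0 0.323529412 0.585542169 0.189542484 0.351807229\n",
    "1 0.5 0.5 0.2 0.2\n",  # hypothetical second image
]
counts = count_classes(label_texts)
print(dict(counts))
```

Here class 1 (dog) appears twice and class 0 (cat) once, so this tiny sample is not balanced.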

How to train the YOLOv8 model

After the data is ready, you need to pass it through the model. To make it more interesting, we will not use this small "cats and dogs" dataset. We will use another custom dataset for training, which contains traffic lights and road signs. This is a free dataset that I got from Roboflow Universe: https://universe.roboflow.com/roboflow-100/road-signs-6ih4y. Press "Download Dataset" and select "YOLOv8" as the format.

If it is not available on Roboflow when you read these lines, you can get it from my Google Drive. This dataset can be used to teach YOLOv8 to detect different objects on roads, as displayed in the next screenshot.

You can open the downloaded zip file and ensure that it is structured according to the rules described above. You can find the dataset descriptor file data.yaml in the archive as well.

If you downloaded the archive from the Roboflow, it will contain the additional "test" dataset, which is not used by the training process. You can use the images from it for additional testing on your own after training.

Extract the archive to the folder with your Python code and execute the train method to start a training loop:



model.train(data="data.yaml", epochs=30)

The data option is the only required one. You have to pass the YAML descriptor file to it. The epochs option specifies the number of training cycles (100 by default). There are other options that can affect the process and the quality of the trained model.

Each training cycle consists of two phases: a training phase and a validation phase.

In the training phase, the train method does the following:

Extracts the random batch of images from the training dataset (the number of images in the batch can be specified using the batch option).
Passes these images through the model and receives the resulting bounding boxes of all detected objects and their classes.
Passes the result to the loss function, which compares the received output with the correct result from the annotation files for these images. The loss function calculates the amount of error.
Passes the result of the loss function to the optimizer, which adjusts the model weights based on the amount of error, in the direction that reduces the error in the next cycle. By default, the SGD (Stochastic Gradient Descent) optimizer is used, but you can try others, like Adam, to see the difference.

In the validation phase, the train method does the following:
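The same sequence of steps can be shown on a toy problem. The sketch below is not the YOLOv8 training code. It trains a model with a single weight on made-up data, using mean squared error as the loss function and plain SGD as the optimizer. The learning rate and batch size are assumed values:

```python
import random

# Toy dataset: the "annotations" follow y = 3 * x
data = [(0.1 * i, 3.0 * 0.1 * i) for i in range(1, 21)]
w = 0.0        # the only model weight, untrained at the start
lr = 0.05      # learning rate (assumed value)
batch_size = 4
rng = random.Random(0)
epoch_losses = []

for epoch in range(30):
    rng.shuffle(data)  # random batches in every epoch
    total = 0.0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # 1. Pass the batch through the model
        preds = [w * x for x, _ in batch]
        # 2. Loss function: compare the output with the correct answers
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
        # 3. Optimizer: move the weight in the direction that reduces the error
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        w -= lr * grad
        total += loss
    epoch_losses.append(total / (len(data) / batch_size))

print(round(w, 3), epoch_losses[0] > epoch_losses[-1])
```

The weight moves toward 3 and the average loss per epoch shrinks. This is the same trend you should expect from the loss values that the train method reports.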

Extracts the images from the validation dataset.
Passes them through the model and receives the detected bounding boxes for these images.
Compares the received result with true values for these images from annotation text files.
Calculates the precision of the model based on the difference between actual and expected results.
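A common way to compare a detected box with an annotated one is Intersection over Union (IoU), and the mAP50-95 metric reported during validation is built on IoU thresholds. The sketch below is an illustration of the measure only, not the Ultralytics implementation. Boxes are in the [x1, y1, x2, y2] format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Size of the overlapping rectangle (zero if the boxes do not overlap)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou([261, 94, 461, 313], [261, 94, 461, 313]))  # 1.0
print(iou([0, 0, 10, 10], [20, 20, 30, 30]))          # 0.0
```

A value of 1.0 means that the boxes match exactly, and 0.0 means that they do not overlap at all.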

The progress and results of each phase for each epoch are displayed on the screen. This way you can see how the model learns and improves from epoch to epoch.

When you run the train code, you will see output similar to this during the training loop:

For each epoch, it shows a summary for both the training and validation phases: lines 1 and 2 show the results of the training phase, and lines 3 and 4 show the results of the validation phase.

The training phase includes calculation of the amount of error in a loss function, so, the most valuable metrics here are box_loss and cls_loss.

box_loss shows the amount of error in detected bounding boxes.
cls_loss shows the amount of error in detected object classes.

Why is the loss split into several metrics? Because the model could correctly detect the bounding box around the object, but incorrectly detect the object class in this box. For example, in my practice, it detected a dog as a horse, but the dimensions of the object were detected correctly.

If the model really learns something from the data, you should see these values decrease from epoch to epoch. In the previous screenshot, box_loss decreases: 0.7751, 0.7473, 0.742, and cls_loss decreases too: 0.702, 0.6422, 0.6211.

In the validation phase, it calculates the quality of the model after training, using the images from the validation dataset. The most valuable quality metric is mAP50-95, which is a Mean Average Precision. If the model learns and improves, the precision should grow from epoch to epoch. In the previous screenshot, it slowly grows: 0.788, 0.788, 0.791.

If you did not get acceptable precision after the last epoch, you can increase the number of epochs and run the training again. You can also tune other parameters like batch, lr0 and lrf, or change the optimizer. There are no clear rules for what to do here, although there are enough recommendations to fill a book. In short, you need to experiment and compare the results.

In addition to these metrics, the train method writes a lot of statistics to disk while it works. When training starts, it creates the runs/detect/train subfolder in the current folder, and after each epoch it writes different log files to it.

Furthermore, it exports the trained model after each epoch to the runs/detect/train/weights/last.pt file and the model with the highest precision to the runs/detect/train/weights/best.pt file. So, after training has finished, you can take the best.pt file to use in production.
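If you copy the metrics from the training output, these trends are easy to check in code. Here is a small sketch that uses the values quoted above; the helper names are made up for this example:

```python
# Values quoted from the training output above
box_loss = [0.7751, 0.7473, 0.742]
cls_loss = [0.702, 0.6422, 0.6211]
map50_95 = [0.788, 0.788, 0.791]

def is_decreasing(values):
    """True if every value is lower than the previous one."""
    return all(b < a for a, b in zip(values, values[1:]))

def is_non_decreasing(values):
    """True if no value is lower than the previous one."""
    return all(b >= a for a, b in zip(values, values[1:]))

print(is_decreasing(box_loss), is_decreasing(cls_loss), is_non_decreasing(map50_95))
# True True True
```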

Watch this video to see how the training process works. I used Google Colab, which is a cloud version of Jupyter Notebook, to get access to hardware with a more powerful GPU and speed up the training process. The video shows how to train the model for 5 epochs and download the final best.pt model. In real-world problems, you need to run many more epochs and be prepared to wait hours or maybe days until training finishes.

After training has finished, it's time to run the trained model in production. In the next section, we will create a web service to detect objects in images online, in a web browser.

How to create an object detection web service

This is the moment when we finish experimenting with the model in the Jupyter Notebook. You need to write the next code as a separate project, using any Python IDE, like VS Code or PyCharm💚.

The web service that we are going to create will have a web page with a file input field and an HTML5 canvas element. When the user selects an image file using the input field, the interface will send it to the backend. Then, the backend will pass the image through the model that we created and trained, and return the array of detected bounding boxes to the web page. When the frontend receives it, it will draw the image on the canvas element and the detected bounding boxes on top of it. The service will look and work as demonstrated in this video:

In the video, I used the model trained for 30 epochs, and it still does not detect some traffic lights. You can try to train it more to get better results. However, the best way to improve the quality of machine learning is to add more data. So, as additional practice, you can import the dataset folder into Roboflow, add and annotate more images, and then use the updated data to continue training the model.

How to create a frontend

To start with, create a folder for a new Python project and an index.html file in it for the frontend web page. Here is the content of this file:



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>YOLOv8 Object Detection</title>
    <style>
        canvas {
            display:block;
            border: 1px solid black;
            margin-top:10px;
        }
    </style>
</head>
<body>
    <input id="uploadInput" type="file"/>
    <canvas></canvas>
    <script>
       /**
       * File input "change" event handler: uploads the selected
       * image file to the backend, receives an array of
       * detected objects and draws them on top of the image
       */
       const input = document.getElementById("uploadInput");
       input.addEventListener("change",async(event) => {
           const file = event.target.files[0];
           const data = new FormData();
           data.append("image_file",file,"image_file");
           const response = await fetch("/detect",{
               method:"post",
               body:data
           });
           const boxes = await response.json();
           draw_image_and_boxes(file,boxes);
       })

       /**
       * Function draws the image from provided file
       * and bounding boxes of detected objects on
       * top of the image
       * @param file Uploaded file object
       * @param boxes Array of bounding boxes in format
         [[x1,y1,x2,y2,object_type,probability],...]
       */
       function draw_image_and_boxes(file,boxes) {
          const img = new Image()
          img.src = URL.createObjectURL(file);
          img.onload = () => {
              const canvas = document.querySelector("canvas");
              canvas.width = img.width;
              canvas.height = img.height;
              const ctx = canvas.getContext("2d");
              ctx.drawImage(img,0,0);
              ctx.strokeStyle = "#00FF00";
              ctx.lineWidth = 3;
              ctx.font = "18px serif";
              boxes.forEach(([x1,y1,x2,y2,label]) => {
                  ctx.strokeRect(x1,y1,x2-x1,y2-y1);
                  ctx.fillStyle = "#00ff00";
                  const width = ctx.measureText(label).width;
                  ctx.fillRect(x1,y1,width+10,25);
                  ctx.fillStyle = "#000000";
                  ctx.fillText(label,x1,y1+18);
              });
          }
       }
  </script>  
</body>
</html>

The HTML part is very small and consists only of the file input field with the "uploadInput" ID and the canvas element below it. Then, in the JavaScript part, we define a "change" event handler for the input field. When the user selects an image file, the handler uses fetch to make a POST request to the /detect backend endpoint (which we will create later) and sends the image file to it.

The backend should detect objects in this image and return a response with a boxes array as JSON. This response is then decoded and passed to the "draw_image_and_boxes" function, along with the image file itself.

The "draw_image_and_boxes" function loads the image from the file and, as soon as it is loaded, draws it on the canvas. Then, it draws each bounding box with its class label on top of the image.

So, now let's create a backend with /detect endpoint for it.

How to create a backend

We will create the backend using Flask. Flask has its own internal web server but, as stated by the Flask developers, it is not reliable enough for production, so we will use the Waitress web server to run the Flask app.

Also, we will use the Pillow library to read an uploaded binary file as an image. Ensure that all packages are installed on your system before continuing:



pip3 install flask
pip3 install waitress
pip3 install pillow

The backend will be in a single file. Let's name it object_detector.py:



from ultralytics import YOLO
from flask import request, Flask, jsonify
from waitress import serve
from PIL import Image

app = Flask(__name__)

@app.route("/")
def root():
    """
    Site main page handler function.
    :return: Content of index.html file
    """
    with open("index.html") as file:
        return file.read()


@app.route("/detect", methods=["POST"])
def detect():
    """
        Handler of /detect POST endpoint
        Receives uploaded file with a name "image_file", 
        passes it through YOLOv8 object detection 
        network and returns an array of bounding boxes.
        :return: a JSON array of objects bounding 
        boxes in format 
        [[x1,y1,x2,y2,object_type,probability],..]
    """
    buf = request.files["image_file"]
    boxes = detect_objects_on_image(Image.open(buf.stream))
    return jsonify(boxes)    


def detect_objects_on_image(buf):
    """
    Function receives an image,
    passes it through YOLOv8 neural network
    and returns an array of detected objects
    and their bounding boxes
    :param buf: Input image (a PIL Image object)
    :return: Array of bounding boxes in format 
    [[x1,y1,x2,y2,object_type,probability],..]
    """
    model = YOLO("best.pt")
    results = model.predict(buf)
    result = results[0]
    output = []
    for box in result.boxes:
        x1, y1, x2, y2 = [
          round(x) for x in box.xyxy[0].tolist()
        ]
        class_id = box.cls[0].item()
        prob = round(box.conf[0].item(), 2)
        output.append([
          x1, y1, x2, y2, result.names[class_id], prob
        ])
    return output

serve(app, host='0.0.0.0', port=8080)

First, we import the required libraries:

ultralytics for the YOLOv8 model.
flask to create a Flask web application, to receive requests from the frontend and to send responses back to it. jsonify is also imported to convert the result to JSON.
waitress to run a web server and serve the Flask web app in it.
PIL to load an uploaded file as an Image object, which the YOLOv8 predict method accepts as input.

Then, we define two routes:

/ that serves as the root of the web service. It just returns the content of the "index.html" file.
/detect that responds to image upload requests from the frontend. It converts the raw file to a Pillow Image object and then passes this image to the detect_objects_on_image function.

The detect_objects_on_image function creates a model object based on the best.pt model that we trained in the previous section. Ensure that this file exists in the folder where you write the code.

Then it calls the predict method for the image, which returns the detected bounding boxes. For each box, it extracts the coordinates, class name and probability in the same way as we did at the beginning of the tutorial, and adds this info to the output array. Finally, the function returns the array of detected object coordinates and their classes.

After this, the array is encoded to JSON and returned to the frontend.

Finally, the last line of code starts the web server on port 8080, which serves the Flask application stored in the app variable.

To run the service, execute the following command:



python3 object_detector.py

If the code is written without mistakes and all dependencies are installed, you can open http://localhost:8080 in a web browser. It should show the index.html page. When you select any image file, the service will process it and display bounding boxes around all detected objects (or just display the image if nothing is detected in it).

The web service we just created is universal. You can use it with any YOLOv8 model. Now it detects traffic lights and road signs using the best.pt model we created. However, you can change it to use another model, like the yolov8m.pt model used earlier to detect cats, dogs and all the other object classes that pretrained YOLOv8 models can detect.

Conclusion

In this tutorial, I guided you through the process of creating an AI-powered web application that uses YOLOv8, a state-of-the-art convolutional neural network for object detection. We covered such steps as creating models, using pretrained models and preparing data to train custom models, and finally created a web application with a frontend and a backend that uses the custom-trained YOLOv8 model to detect traffic lights and road signs.

You can find the source code of this app in this GitHub repository: https://github.com/AndreyGermanov/yolov8_pytorch_python

For all of this work, we used the Ultralytics high-level APIs, provided with the YOLOv8 package by default. These APIs are based on the PyTorch framework, which is used to create most neural networks today. This is quite convenient on the one hand, but dependence on these high-level APIs has a negative effect as well. If you need to run this web app in production, you have to install the whole environment there, including Python, PyTorch and many other dependencies. To run this on a clean new server, you'll need to download and install more than 1 GB of third-party libraries! This is definitely not the way to go. Also, what if you do not have Python in your production environment? What if all your other code is written in another programming language, and you do not plan to use Python? Or what if you want to run the model on a mobile phone with Android or iOS?

Using the Ultralytics packages is great for experimenting, training and preparing models for production. However, in production itself, you should get rid of these high-level APIs and load and use the model directly. To do this, you need to understand how the YOLOv8 neural network works under the hood and write more code to provide input to the model and to process its output. As a reward, you will be able to make your apps tiny and fast, and you will not need to have PyTorch installed to run them. Furthermore, you will be able to run your models even without Python, using many other programming languages, including Julia, C++ and Node.js on the backend, or even without a backend at all. You can run YOLOv8 models right in a browser, using only JavaScript on the frontend. Want to know how? This will be covered in the next article of my YOLOv8 series. Follow me to be the first to know when it is published.

You can find me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

Deep Learning with Julia – How to Build and Train a Model using a Neural Network

Andrey Germanov — Thu, 09 Mar 2023 10:46:27 +0000

Introduction
What should you know in advance
Handwritten digits recognition workflow
How to collect initial image data
How to work with images in Julia
    How to load and view the image
    How to implement basic image transformations
    How to convert the image to numeric matrix
How to prepare the image data for machine learning
How to create a machine learning model
    Neural network basics
    How to create the neural network with Flux
How to train the model
How to evaluate the accuracy of the trained model
How to create and train the convolutional neural network
How to export trained model to a file
How to create a frontend
How to create a backend
Intro to advanced convolutional networks
Conclusion

Introduction

Julia is a general purpose programming language well suited for numerical analysis and computational science. Some consider it the future of machine learning and the most natural replacement for Python in this field.

In the previous post "Machine learning with Julia – How to Build and Deploy a Trained AI Model as a Web Service" I introduced the basic machine learning features of Julia and explained why it's so good for this.

In this article, I want to move one step forward and explore deep learning features of Julia to show how you can use it to solve computer vision tasks using neural networks.

Computer vision is one of the most impressive areas of artificial intelligence. It includes such interesting tasks as image classification, text recognition, object detection and image segmentation. Neural networks showed the best performance in solving computer vision problems.

In this tutorial, I will guide you through the process of building and training a neural network to recognize handwritten digits using Julia. I will also explain how to create a website that will use the trained model to read handwritten phone numbers.

What should you know in advance

This tutorial assumes that you have basic Julia knowledge, which you can get by reading my previous article. That article also includes instructions on how to install Julia and integrate it with Jupyter notebook, which will be used to write most of the code.

The "Handwritten digit recognition using deep learning" problem and the theory that stands behind it are well known. That is why I will cover them only briefly. There are many good resources that explain how neural networks are used to solve image classification tasks. Personally, I recommend watching this video and reading the first chapter of this great online book.

The goal of this tutorial is only to show you how to implement the theory, explained in those resources, using Julia.

Handwritten digits recognition workflow

To build a machine learning model we will use the Flux.jl framework which is a pure Julia implementation of most well-known neural network types including feed forward, convolutional and recurrent networks.

Recognizing handwritten numbers is a supervised machine learning task of image classification. To implement it, you need to have a labeled dataset of handwritten digits and use it to train the machine learning model.

This is how the ML workflow looks:

Collect the images of handwritten digits for recognition.
Prepare a labeled dataset for machine learning by cleaning and labeling the data.
Create a machine learning model to recognize handwritten digits.
Train the model using training dataset.
Evaluate the accuracy of the trained model by feeding it with data from a testing dataset.
After achieving good accuracy, export the model to a file to use in applications.

How to collect initial image data

The first step of any machine learning task is to collect the data that will be used for training. Usually this is the bigger part of the whole process.

How do you collect handwritten digits for this? Well, for example, you can ask all your friends in social networks to write down digits from 0 to 9 and save them to images. They also can ask their friends to do the same and finally send all these images to you.

The more data you collect, the better for machine learning.

Then, you could create folders with names from "0" to "9" and arrange these images within them. Also, you need to convert the images to the same format: convert to grayscale and resize them. All images should have the same size and color format.

Finally, you'll have a labeled collection of handwritten digits that are ready to work with.

Fortunately, you do not need to do all this manual work, because it has already been done. The database of handwritten digits called MNIST, published in 1998 and built from samples collected by the National Institute of Standards and Technology, is available to download from Kaggle or from many other places. For example, you can download and extract the MNIST archive using this link.

This database is already split into testing and training data in appropriate folders. Each of these folders contains images of handwritten digits, classified to folders from "0" to "9". There are 60000 images in the training folder and 10000 images in the testing folder:

Each file is a 28x28 grayscale image. We will use the content of the training folder to prepare the dataset for training the neural network model. Then we will use the content of the testing folder to validate the accuracy of the trained model. Before doing that, we need to convert this raw data to datasets.

In order to continue, run the Jupyter notebook and create a new notebook in it, selecting "Julia" as a language. Then, copy the training and testing folders with images to the folder in which you created the notebook.

How to work with images in Julia

An image is not a natural data format for machine learning models. The models understand only numbers. That is why, to prepare the images for machine learning, you need to load them and convert to numbers.

To work with images in Julia, we will use the Julia Images library. Using this library, you can load the image, convert it to matrix of pixels, and apply different transformations that can be required before pushing it to ML. The transformations include resizing, converting from color to black and white, inverting, cropping, and more.

To start working with these functions, you need to install the Images package and import it to your notebook:



using Pkg
Pkg.add("Images")
using Images

How to load and view the image

You can use the load function to load the image. Let's load the first digit from our training dataset. If this file exists, it should load it to the img variable and display the image itself:



img = load("training/0/1.png")

This is a loaded digit. Let's see the shape of the img variable:



size(img)

(28,28)

As you see, the img variable is a 2D array, or matrix, of image pixels. The first dimension of the array is the number of rows and the second dimension is the number of columns. That is why the height of the image is the first value and the width is the second value.

Let's see the type of this variable now:



typeof(img)

Matrix{Gray{N0f8}} (alias for Array{Gray{Normed{UInt8, 8}}, 2})

It shows that this is a matrix of "Gray" objects. The Gray type defines a gray pixel. It means that the image that we loaded does not have color information.

The Gray data type defines the pixel by a single value – the intensity of gray color in a range between 0 and 1. So, the 0 is completely black and the 1 is completely white.

You can change a color of any pixel using the following code:



img[5,5] = Gray(0.5)

This sets the specified pixel (which was previously black) to medium gray.

If you load the full color image and request its type, it will show something like this:

Matrix{RGB{N0f8}} (alias for Array{RGB{Normed{UInt8, 8}}, 2})

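In this case, each pixel has the RGB type, which is defined by 3 values: intensity of Red, intensity of Green and intensity of Blue. Note that size(img) still returns (28,28) for a colored image, because each element of the matrix is a whole RGB pixel. But if you split the pixels into color channels and run size(channelview(img)), you will see a 3D array, like this: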

(3,28,28)

where the first dimension is a number of color channels, the second dimension is a height and the third dimension is a width.

In other words, this color image consists of three matrices of 28x28 size. Each of them contains intensities of the appropriate color.

To set the color of any pixel in this image, you need to specify intensities of 3 channels in the RGB type constructor:



img[5,5] = RGB(1,0.5,0)

This code sets the pixel color to orange.

How to implement basic image transformations

Because the image is an array, you can use the array syntax to get access to any part of the image or even to individual pixels.

For example, you can run this to extract the first 10 rows and 20 columns of this image and write them to the new image:



img2 = img[1:10,1:20]

You can crop the image by 5 pixels from all sides:



# keep rows and columns 6 to 23: 5 pixels are removed from each side of the 28x28 image
img3 = img[6:23,6:23]

You can apply different filters to the image by applying the specified function to each element of the matrix, using the Julia broadcasting feature via "dot" syntax.

For example, this code applies the Gray function to each pixel of the image. This approach can be used to convert images from colored to grayscale:



img4 = Gray.(img)

Similarly, you can convert gray images to colored:



img5 = RGB.(img)

You can apply custom functions to each pixel. For example, if you apply the next anonymous function to the gray image this way:



img6 = (x-> Gray(1)-x.val).(img)

it will invert the image colors by subtracting the color value of each pixel from 1. If the img has a white digit on a black background, then the img6 will have a black digit on a white background:

Finally, to resize the image, you can use the imresize function. For example, to resize the img to 50x50 pixels, you can use the following code:



img7 = imresize(img,(50,50))

We will use only the features described above to prepare the images for handwritten digit recognition. But the Images module has many more interesting and fun things. Watch this video to see some of them. Also, you can find a lot of interesting information in this book.

How to convert the image to numeric matrix

The last image preprocessing step is converting the pixels to numbers, because objects of type Gray() or RGB() are not suitable as an input for the machine learning model.

You can do this in two steps. First, you need to apply the channelview function to the image to get the matrix view of the image object, and then, convert the result to float numbers. So, if you run this command:



data = Float32.(channelview(img))

you will get the matrix, where each value is a float number that represents an intensity of the corresponding pixel. This data is ready to go to the neural network.
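
To make this conversion more concrete: an 8-bit gray pixel is stored as an integer from 0 to 255, and its normalized intensity is that integer divided by 255. The sketch below shows the same idea on a tiny 2x3 image. It is written in plain Python only for illustration, and to_intensities is a hypothetical helper, not a part of the Images package.

```python
def to_intensities(raw_rows):
    """Convert rows of 8-bit gray values (0..255) to floats in the range 0..1."""
    return [[value / 255 for value in row] for row in raw_rows]


# A made-up 2x3 "image" of raw byte values
raw = [
    [0, 128, 255],
    [64, 0, 32],
]
data = to_intensities(raw)
print(data[0])
```

Black pixels become 0.0 and white pixels become 1.0. Everything else lands in between, which is the kind of input a neural network expects.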

How to prepare the image data for machine learning

As I wrote in a previous article, the training dataset should consist of data from the feature matrix and from the labels vector. Both should contain only numbers.

Let's go back to our image collections in the training and testing folders. The labels are the names of the subfolders where the images are located. They are already numbers. The features of an image are the pixels. Each pixel is defined by its color intensity.

So, to create a dataset that is ready for training from the images folder, you need to read all files from all subfolders, convert them to matrices of float numbers, and put them in the array.



path = "training"
X = []
y = []
for label in readdir(path)
    for file in readdir("$path/$label")
        img = load("$path/$label/$file")
        data = reshape(Float32.(channelview(img)),28,28,1)
        if length(X) == 0
            X = data
        else
            X = cat(X,data,dims=3)
        end
        push!(y,parse(Float32,label))
    end
end

Ensure that the "training" and the "testing" folders with the MNIST images exist in the current folder before running this program. It will take a while to execute this code, because it will load 60000 images and will convert them to matrices.

In the outer loop, it reads the contents of the "training" folder. There are subfolders with names from 0 to 9 that will be used as labels.

Then, in the inner loop, it reads all image files of each of these subfolders using the load function from the Images package.

Next, it converts each image to the matrix of color intensities and places it in the data variable. After that, it appends this matrix to X.

Finally, it appends the name of the subfolder (which is an actual digit) to the labels vector y.

This way, you will have a dataset with feature matrix in X and labels vector in y. Let's refactor this code to a function to be able to reuse it to convert any folder with images, classified this way, to the dataset.



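using Images

# Build a dataset from a folder that contains one subfolder per label ("0".."9")
function createDataset(path)
    X = []          # becomes a 28x28xN array of pixel intensities
    y = Float32[]   # labels, one per image
    for label in readdir(path)
        for file in readdir("$path/$label")
            img = load("$path/$label/$file")
            # convert pixels to Float32 and add a third dimension for stacking
            data = reshape(Float32.(channelview(img)),28,28,1)
            if length(X) == 0
                X = data
            else
                # append the image along the third dimension
                X = cat(X,data,dims=3)
            end
            push!(y,parse(Float32,label))
        end
    end
    return X,y
end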

Using this function, you can now easily create both training and testing datasets:



x_train, y_train = createDataset("training")
x_test, y_test = createDataset("testing")

How to create a machine learning model

We will use a neural network to create a model and train it using the training data. To work with neural networks we will use the Flux.jl framework which allows you to create and train neural networks of various types, including feed forward, convolutional, and recurrent.

For handwritten image classification, we will implement both the Feed Forward and the Convolutional networks and compare their accuracy. If you need to, you can review the basics of neural networks by watching this video. Now is the best time to watch this before you continue reading.

Neural network basics

A neural network is a chain of layers. Each layer has a defined number of neurons with inputs and outputs.

To convert input to output for each layer, the neurons use the activation function, defined for this layer. Features of the image are the inputs of the first layer, and the classification results are the outputs of the last layer.

The best way to understand all this is to visualize some neural network architecture. Let's see the following basic neural net of 3 layers:

Source: http://neuralnetworksanddeeplearning.com/chap1.html

In this picture, the input layer contains 784 neurons that should receive the features of each image. As you remember, the training dataset consists of 28x28 images, which is 784 pixels. This is how this neural network works:

The color value of each pixel goes to each neuron of the input layer.
Each neuron of the input layer sends its value to each neuron of the hidden layer.
Each neuron of the hidden layer has a weight coefficient for each input. Initially, these coefficients are random numbers. So, each neuron of the hidden layer receives input values from the previous layer, multiplies each input by the appropriate weight, sums these products, and applies the activation function to that sum.
Each neuron of the hidden layer sends the resulting value to each neuron of the output layer, which has 10 neurons.
The output layer does exactly the same with its inputs as the previous layer, and each of its neurons ends up with a resulting value.
This value is treated as a probability of the appropriate digit, for example the first neuron should contain the probability that the input image is "0", the second neuron should contain the probability that the image is "1", and so on.

Then, the application should look at which of these 10 neurons has the highest value and make the appropriate prediction.
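
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Flux code: the dense helper is hypothetical, the network is reduced to 3 inputs, 2 hidden neurons and 2 outputs, and all weights are made-up numbers.

```python
import math


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes activation(w . x + b)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


# Toy network: 3 inputs -> 2 hidden neurons -> 2 output neurons.
# All weights below are made-up numbers, chosen only for illustration.
pixels = [0.0, 0.5, 1.0]
hidden = dense(pixels, [[0.2, -0.4, 0.6], [-0.5, 0.1, 0.3]], [0.0, 0.1], relu)
scores = dense(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], sigmoid)

# The prediction is the index of the output neuron with the highest value.
prediction = max(range(len(scores)), key=lambda i: scores[i])
print(hidden, scores, prediction)
```

The real network does the same thing, only with 784 inputs, more neurons and weights that are learned during training instead of written by hand.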

How to create the neural network with Flux

Let's create this neural network using Flux. If you haven't installed and imported it yet, do this in your notebook:



using Pkg
Pkg.add("Flux")
using Flux

As you have seen, the neural network is a chain of layers with different parameters. So, Flux has a Chain function that you use to construct neural networks. Let's construct that network:



model = Chain(
    Flux.flatten,
    Dense(784=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

The Chain receives an array of functions as arguments. Each function defines a layer and its parameters. Each of these functions receives some inputs, then after the appropriate actions returns the outputs and forwards them as inputs to the next function in the chain.

So, this is how the defined neural network works:

The input image, which is a 28x28 array of pixel color intensities, comes to the Flux.flatten function. This function just converts this 28x28 matrix to a vector with 784 elements. This way we constructed the input for the first Dense layer.
Then, the first Dense layer passes these 784 values to its 15 neurons. Each neuron multiplies the values by its weights, sums these products, and applies the relu activation function to the sum. The resulting 15 values are forwarded to the 10 neurons of the next layer.
Next, the second Dense layer also multiplies its 15 inputs by the weight coefficients, sums them, and applies the sigmoid activation function to squash each sum into the range between 0 and 1.
The final softmax function actually doesn't build a new layer, but it just converts values that accumulated in the 10 neurons of the output layer to correct probabilities to properly show the probability distribution. Applying this function ensures that the sum of all 10 probabilities is equal to 1. The array of these probabilities will be returned by the model as a result.
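
To see what softmax does with the numbers, here is a small sketch in plain Python (used only for illustration, the scores are made up). It exponentiates each value and divides it by the sum of all exponents, so the results are positive and add up to 1.

```python
import math


def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is the same.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.1, 0.9, 0.3]  # made-up outputs of three neurons
probs = softmax(scores)
print(probs, sum(probs))
```

The order of the values does not change: the neuron with the highest score still gets the highest probability.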

You can call the model which you just created as a function by passing an image matrix as an input argument.

You can run the model to predict the digit for the first image from the training dataset using the following code:



predict = model(Flux.unsqueeze(x_train[:,:,1],dims=3))

We use the unsqueeze function here to convert the image of the (28,28) shape to the batch of images of (28,28,1) shape. The model function receives data in batches. In this case, it receives a single image of 28x28 size. Then it passes it through a chain of layers and returns the array of probabilities.

As you can see, neuron number 2 has the highest probability (0.12457416), which means that the model predicted the digit "1". However, if you check the real answer in the labels vector:



y_train[1]

you will see "0", so the prediction is incorrect. This is because this model is untrained and just uses random weights to calculate the output for each layer. You need to train it to adjust these weights and calculate more accurate probabilities.

How to train the model

Flux.jl has different approaches to training a model. The most obvious one is the Flux.train! function. The function runs the following training process:

The function receives the training dataset as an argument, including the features matrix and the labels vector.
The function runs the model for each row of the training dataset and receives the resulting probabilities array.
The function compares these probabilities with the true values from the labels vector and calculates the amount of error (about this later).
Using information about the error, the function adjusts the weights and bias for each neuron on each layer.

Usually you need to run this training process many times in a loop. On each iteration it will adjust the weights for each neuron, decreasing the error value more and more.

This visualization shows how the training process in a loop works for a single neuron on a single layer. For the whole network it works similarly.

Source: https://7-hiddenlayers.com/wp-content/uploads/2020/06/NEURONS-IN-NUERAL-NETWORK.gif
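
The same loop can be sketched for one neuron with a single weight. This is a simplified illustration in plain Python, not what Flux does internally: it uses squared error and plain gradient descent with a hand-picked learning rate, and the training data is made up.

```python
# Training data for a single neuron with one weight: the target rule is y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0          # start from an arbitrary value
learning_rate = 0.05  # step size, chosen by hand for this toy example


def mean_squared_error(w):
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


errors = []
for epoch in range(20):
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
    # Move the weight a small step against the gradient.
    weight -= learning_rate * gradient
    errors.append(mean_squared_error(weight))

print(weight, errors[0], errors[-1])
```

On each iteration the error gets smaller and the weight moves closer to 2, which is the value that fits the data.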

This is the syntax of the train! function:



Flux.train!(loss_function, model, data, optimizer)

Let's break this down:

loss_function – as I described before, during the training process, the train function measures the amount of error. To do this, it uses the loss_function, which you should define and provide here.

This function receives the model, the row of the training data, and the truth label. Based on these arguments, the loss function should make a prediction by passing the row of data through the model, comparing this prediction with the truth label, calculating the difference between them, and returning the amount of error as a float number.

Different algorithms exist to calculate the amount of error for different machine learning problem types. For classification problems we will use cross entropy.

model – the neural network model to train.
data – the training data that includes both x_train and y_train assembled to a single array of tuples. You can do this simply by using the Flux.DataLoader function, which we will use below.
optimizer – as described above, after measuring the amount of error, the function adjusts the weights to decrease the error. The weights are not adjusted randomly, but by the optimizer that defines the algorithm. You use it to adjust the weights in the correct direction.

Most of the weight adjustment algorithms are based on Gradient Descent. In particular, we will use the ADAM optimizer, which is very common today.
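
Before moving to the Flux code, here is how cross entropy measures the error for a single image. The sketch is in plain Python for illustration only: one_hot and cross_entropy are hypothetical helpers that mimic the idea, and the probabilities are made up.

```python
import math


def one_hot(label, classes):
    """Encode a label as a vector with 1.0 at the label's position and 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]


def cross_entropy(predicted, target):
    """Cross entropy between predicted probabilities and a one-hot target."""
    # Only the probability assigned to the true class contributes to the sum.
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)


classes = list(range(10))
target = one_hot(0, classes)     # the true digit is "0"

confident = [0.91] + [0.01] * 9  # the model is fairly sure it is "0"
unsure = [0.1] * 10              # the model has no idea

print(cross_entropy(confident, target), cross_entropy(unsure, target))
```

The more probability the model gives to the correct digit, the lower the error. This is the number that the optimizer tries to decrease.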

Let's connect all these parts together in the following code:



# Assemble the training data
data = Flux.DataLoader((x_train,y_train), shuffle=true)

# Initialize the ADAM optimizer with default settings
optimizer = Flux.setup(Adam(), model)

# Define the loss function that uses the cross-entropy to 
# measure the error by comparing model predictions of data 
# row "x" with true data label in the "y"
function loss(model, x, y)
    return Flux.crossentropy(model(x),Flux.onehotbatch(y,0:9))
end

# Train the model 10 times in a loop
for epoch in 1:10
    Flux.train!(loss, model, data, optimizer)
end

For each row of data, Flux.train! calls the loss function, and the loss function runs the model. Using cross entropy, it calculates the difference between the predictions and the true values of this row. This difference is returned as an error, and then the optimizer is used to adjust the weights of the model neurons based on this error value and the loss function. On each iteration, the error value should go down.

Finally, after running the training process, you can check how it predicts the digit for the first image using the trained model:



predict = model(Flux.unsqueeze(x_train[:,:,1],dims=3))

When I did that, I received the following probabilities:

The first one, related to "0", is the highest and this is definitely true. You can try to check other images, like image number 100 or 200. But it doesn't make much sense to measure model quality this way, because this is training data that the model has already seen. Only the testing data should be used to measure the accuracy of the model.

How to evaluate the accuracy of the trained model

We have the testing dataset in the x_test features matrix and in the y_test labels vector. We will run the model for each row of this data and measure the accuracy: the number of correct predictions divided by the number of all predictions.

Let's create a function for this:



function accuracy()
    correct = 0
    for index in 1:length(y_test)
        probs = model(Flux.unsqueeze(x_test[:,:,index],dims=3))
        predicted_digit = argmax(probs)[1]-1
        if predicted_digit == y_test[index]
            correct +=1
        end
    end
    return correct/length(y_test)
end

The function goes over all items of the testing dataset. For each item it runs the model and receives the probs array. Then, it writes an index of the highest probability using the argmax function to the predicted_digit variable. Next it compares the predicted digit with the truth value from y_test labels vector and increases the number of correct predictions if they match. The function returns the quotient of the number of correct predictions and the total number of rows.
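
The same calculation on a tiny made-up example, in plain Python for illustration (the function below is a hypothetical stand-in, not the Julia function above):

```python
def accuracy(probabilities, labels):
    """Share of rows whose most probable class equals the true label."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        # Index of the largest probability. Python lists are 0-based,
        # so the index is already the class (no "-1" as in the Julia code).
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Made-up model outputs for four images and three classes.
outputs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
truth = [0, 1, 2, 1]
print(accuracy(outputs, truth))
```

Three of the four predictions match the labels, so the accuracy here is 0.75.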

Now you can run this function to see the accuracy. For example, when I ran this, I received 0.9455, which is about 94.6%.

However, it's better to place this function call inside the training loop, right after the Flux.train! line to see how the accuracy changes after each training iteration.



for epoch in 1:10
    Flux.train!(loss, model, data, optimizer)
    println(accuracy())
end

Then run the training again. You should receive output similar to this:

It shows that accuracy was going up until the 6th iteration. After that, it started to go down, which could be a sign that the model started to overfit.

To increase the prediction quality, you can either add more data to the training dataset or change the model architecture.

For example, you can add more Dense layers, increase the number of neurons on the hidden layer, or change activation functions from relu to sigmoid or vice versa.

When I increased the number of neurons from 15 to 42 on the hidden layer and then removed the sigmoid activation from the output layer, I achieved about 97% accuracy. But when I added one more hidden layer before the output, the accuracy dropped to 90%.

So, building the neural net architecture is like art – you need to try different options a lot of times and finally select the one that works the best.

Regardless of the options I chose, I could never achieve more than 97%. Also, when I finally tried to use this network architecture in production with real handwritten digits from users, the prediction quality was poor. Very often it could not recognize the 7 digit properly, and it recognized 1 as 4 and 6 as 5.

This is because using the feed forward neural network, in which we just put all 784 pixels of the image as an input without any filters, is not the best approach.

For most machine learning tasks with images, convolutional neural networks are the better option. We will create and try one in the next section.

How to create and train the convolutional neural network

The most important step during the machine learning process is data preprocessing. If input features are processed properly, then the prediction accuracy will be better.

To increase the model quality, you need to remove noise from data, or features that are not relevant for the value that you need to predict.

Also, oftentimes you need to create new features from existing ones that could be more relevant to the result.

For example, for the Titanic machine learning problem, you can remove such features as "Passenger ID" and "Passenger name", because they can't help to predict whether the passenger might survive or not.

Also, if you have a task to predict the price of a flat and have input data with fields of room areas like "Area 1", "Area 2" and so on, you can create a new field "Total Flat Area" and write the sum of all room areas to it.

Perhaps this new feature that you generated is more relevant than others for the model, so you can remove the fields from which you generated that new column.

Using these techniques, you generalize the data by keeping and creating the features that are important and by removing others that can only confuse the machine learning model.

When working with tabular data, you can use your own experience or statistical methods to find which features to generate or remove from input data. But when working with images, things are not as clear as with strings or numbers.

For example, the model for the handwritten digits recognition task receives the 784 pixel colors in a single row as an input. They have an equal value from a human point of view, and it's unknown which of them are more important and which of them are less.

To help you in this, you can use convolutional neural networks to preprocess this kind of data. They help you do the feature engineering automatically.

You build a convolutional neural network from two types of layers:

Convolution layers used to generate new features from input image pixels.
Pooling layers used to generalize features using some rules and this way reduce their quantity.

By combining these two types of layers in the chain, you can preprocess the input image matrix to receive a reduced number of the most valuable features. Then, you can train the network using these generated features as input data in the same way as you did before.

I think it's difficult to describe CNNs better than it's done in this video, so I highly recommend watching it (or at least the first 15 minutes) before continuing. It clearly explains the theoretical aspects of all steps that you will do below.

So, let's review the neural network that you have now:



model = Chain(
    Flux.flatten,
    Dense(784=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

The only data preprocessing step here is Flux.flatten, which receives the image of 28x28 pixels and returns it joined to a single row of 784 numbers. We need to add some convolution layers before Flux.flatten to give our network the ability to generate better features than just raw pixels.

To create the convolution layer, the Flux.jl has the Conv function with the following main parameters:



Conv(filter,in=>out,activation_function)

filter defines dimensions of the kernel matrix that will be applied to each pixel of the input matrix to create a feature from it. For example, the value (3,3) defines the 3x3 kernel matrix. This is how the convolution using this kernel matrix works to generate the features for an image of 6x6 size:

Source: https://en.m.wikipedia.org/wiki/File:2D_Convolution_Animation.gif

in is the number of input image channels. For our input data, gray images have a single channel. For other layers, the number of in channels of current layer must be equal to the out channels of previous layer.
out is the number of output channels after applying the convolution. In other words, it's the number of features that will be generated for each pixel.
activation_function is the function that will be applied to each feature after convolution and before sending to the next layer of the network, the same as we did before for Dense layers.
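
To show what applying a kernel means in practice, here is a minimal sketch of a single-channel convolution in plain Python (for illustration only). It assumes no padding and a step of one pixel, applies the kernel as is without flipping it, and uses a made-up kernel. convolve2d is a hypothetical helper, not a Flux function.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 6x6 image whose left half is dark (0) and right half is bright (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# A made-up 3x3 kernel that responds to vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
features = convolve2d(image, kernel)
print(len(features), len(features[0]))
print(features[0])
```

The generated features are high only where the dark and bright halves meet, so this kernel works as an edge detector. Note also that without padding the 6x6 image turns into a 4x4 feature matrix, because the 3x3 kernel cannot be centered on the border pixels.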

For example, if you add the following Conv layer on top of the others to the Chain:



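model = Chain(
    # pad=SamePad() pads the borders so the output stays 28x28;
    # without padding, a 5x5 kernel would shrink it to 24x24
    Conv((5,5),1=>6,relu,pad=SamePad()),
    Flux.flatten,
    Dense(4704=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)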

this network will get a single channel image of the following shape: (28,28,1). It will produce 6 matrices from this image by applying different convolution kernels of 5x5 to the input data.

The output of this layer will be the image of the following shape: (28,28,6). In other words, this convolution layer will generate 28*28*6 = 4704 features from 784 input pixels for our network.

But if you have more features, it does not mean that they are all good. Perhaps you need to generalize them and leave only the most valuable ones. This is why the pooling layers are created.

In Flux.jl, the pooling layer can be defined using the MaxPool function. It receives the pooling window dimensions as an argument.

For example, if you create the following MaxPool layer:



MaxPool((2,2))

Source: https://nico-curti.github.io/NumPyNet/NumPyNet/layers/maxpool_layer.html

it will apply the 2x2 window to the input image. As you can see, for each window it selects the maximum value and adds it to the output. This way it reduces the input data by leaving only maximums in it. That is why it's called the MAX pool layer.
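
Here is the same operation as a minimal sketch in plain Python (for illustration only). max_pool is a hypothetical helper that assumes non-overlapping windows, and the input matrix is made up.

```python
def max_pool(matrix, size=2):
    """Split the matrix into size x size windows and keep the maximum of each."""
    pooled = []
    for r in range(0, len(matrix) - size + 1, size):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


features = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
print(max_pool(features))
```

The 4x4 input becomes a 2x2 output that contains 4, 2, 2 and 7, the maximums of the four windows.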

Let's add the MaxPool layer to the chain:



model = Chain(
    # pad=SamePad() keeps the convolution output at 28x28
    Conv((5,5),1=>6,relu,pad=SamePad()),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(1176=>15,relu),
    Dense(15=>10,sigmoid),
    softmax
)

So, the MaxPool receives the (28,28,6) sized image from the convolution layer, applies the 2x2 max pool window to it, and outputs (14,14,6) image. After this, the 14*14*6=1176 generalized features are forwarded to the network layers below.

The main question is how to know which number of convolution and max pool layers to add, and which parameters to set for each of them to achieve good prediction accuracy.

Well, the first way is to try different options. But to build a good neural network architecture this way could take days, months, or even years.

Fortunately, for many machine learning tasks, it has already been done by other people. You can find suitable architectures for most of your problems, including the model for the handwritten digit recognition.

The best-known architecture for this task was created by Yann LeCun, and it's named LeNet. You can find a full description and implementations of this model for different ML platforms here. It was created exactly for the digit images from the MNIST dataset. It's relatively old, but still used in many ATMs to recognize digits for processing deposits.

This is how this architecture looks:

Just like the network we created, this one consists of a convolutional part and a feed forward part. The convolutional net part consists of 2 Conv and 2 MaxPool layers. The feed forward neural network part consists of 3 dense layers.

You can create this network using Flux.jl this way:



model = Chain(
    Conv((5,5),1 => 6, relu),
    MaxPool((2,2)),
    Conv((5,5),6 => 16, relu),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(256=>120,relu),
    Dense(120=>84, relu),
    Dense(84=>10, sigmoid),
    softmax
)

After applying 2 convolutions and pooling to the input image matrix, the Flux.flatten layer receives the 4x4x16 image and converts it to 4*4*16=256 generalized features. Then they go through 3 dense layers to finally calculate probabilities for 10 digits.
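You can trace these sizes by hand. The following Python sketch assumes unpadded 5x5 convolutions with stride 1 and non-overlapping 2x2 pooling, which is what the layer definitions above use by default. The helper names conv and pool are mine:

```python
def conv(size, kernel):
    """Side length after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool(size, window):
    """Side length after non-overlapping max pooling."""
    return size // window


size = 28
size = conv(size, 5)  # 24
size = pool(size, 2)  # 12
size = conv(size, 5)  # 8
size = pool(size, 2)  # 4

features = size * size * 16  # 16 channels after the second Conv layer
print(size, features)        # 4 256
```

The result matches the 256 inputs of the first Dense layer.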

Before training this model using the data from x_train, you need to reshape it a little bit. The convolution layer expects to get the data in the following 4-dimensional shape (width,height,channels,length), but the x_train has the following shape: (28,28,60000) which is 60000 images of 28x28.

To make it compatible, you need to reshape it to (28, 28, 1, 60000). You can do this using the following code:



x_train = reshape(x_train, 28, 28, 1, :)

You'll need to do the same with x_test:



x_test = reshape(x_test, 28, 28, 1, :)

To run this model, you also need to pass a 4 dimensional image structure to the model function. For example, to make a prediction for the first image, you can run this:



model(Flux.unsqueeze(x_test[:,:,:,1],dims=4))
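This call returns 10 probabilities, one per digit. To turn them into a prediction, you pick the position of the largest one. Here is a Python sketch of that step with made-up probabilities. Python indexes from 0, so the index is the digit itself, while Julia indexes from 1, which is why the Julia code in this article subtracts 1 from the argmax result:

```python
def predicted_digit(probs):
    """Index of the largest probability; with 0-based indexing it is the digit."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Made-up probabilities for digits 0..9
probs = [0.01, 0.02, 0.05, 0.70, 0.02, 0.05, 0.05, 0.04, 0.03, 0.03]
print(predicted_digit(probs))  # 3
```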

Then you can train the model the same way as you did before.

This is the whole code to define and train the convolutional network:



# Create a LeNet model
model = Chain(
    Conv((5,5),1 => 6, relu),
    MaxPool((2,2)),
    Conv((5,5),6 => 16, relu),
    MaxPool((2,2)),
    Flux.flatten,
    Dense(256=>120,relu),
    Dense(120=>84, relu),
    Dense(84=>10, sigmoid),
    softmax
)

# Function to measure the model accuracy
function accuracy()
    correct = 0
    for index in 1:length(y_test)
        probs = model(Flux.unsqueeze(x_test[:,:,:,index],dims=4))
        predicted_digit = argmax(probs)[1]-1
        if predicted_digit == y_test[index]
            correct +=1
        end
    end
    return correct/length(y_test)
end

# Reshape the data
x_train = reshape(x_train, 28, 28, 1, :)
x_test = reshape(x_test, 28, 28, 1, :)

# Assemble the training data
train_data = Flux.DataLoader((x_train,y_train), shuffle=true)

# Initialize the ADAM optimizer with default settings
optimizer = Flux.setup(Adam(), model)

# Define the loss function that uses the cross-entropy to 
# measure the error by comparing model predictions of 
# data row "x" with true data from label "y"
function loss(model, x, y)
    return Flux.crossentropy(model(x),Flux.onehotbatch(y,0:9))
end

# Train model 10 times in a loop
for epoch in 1:10
    Flux.train!(loss, model, train_data, optimizer)
    println(accuracy())
end

After running this code, I received about 99% accuracy, which is close to ideal:

Now it's time to save this model to a file and move it to production.

How to export trained model to a file

Flux.jl models can be saved to BSON files. You need to import the BSON package and use the @save macro command to export the model object:



using BSON
BSON.@save "digits.bson" model

This will save the model to the digits.bson file into the current folder.

This is the end of your work in the Jupyter notebook. We'll implement the following code as a new application.

How to create a frontend

The application which you are going to create will allow a user to write their phone number and recognize it using the model that you created and trained before. The frontend page will look like this:

Using this interface, the user can draw the digits of a phone number in the boxes using the mouse, then press the "Recognise" button and see the recognised digits in the "Result" input field.

Also, there is a "Switch to eraser" button. When the user presses it, the drawing mode changes to the eraser mode and the user can erase any number in any box.

Let's start building the web application. Create a new folder with any name that you like. Then create an index.html file in it and copy the following code to this file:



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Phones reader</title>
</head>
<body>
    <h1>Draw phone number and recognise it</h1>
    <div class="digits">
        <strong>+</strong>
        <canvas width="50" height="50"></canvas>
        <strong>(</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <strong>)</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <strong>-</strong>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <canvas width="50" height="50"></canvas>
        <div class="buttons">
            <button id="mode">Switch to eraser</button>
        </div>
    </div>
    <div class="result">
        <button id="recognise">Recognise</button>
        <label>Result:</label>
        <input id="result">
    </div>
</body>
<script>
    let mode = "brush";
    // "Switch" button handler. Switches mode from 
    // brush to eraser and back
    document.querySelector("#mode").addEventListener("click",(event) => {
        if (mode === "brush") {
            mode = "eraser";
            event.target.innerHTML = "Switch to brush";
        } else {
            mode = "brush";
            event.target.innerHTML = "Switch to eraser";
        }
    });
    // Digits canvases mouse move handler.
    // If mouse button pressed while user moves the mouse
    // on canvas, it draws circles in cursor position.
    // If mode="brush" then circles are black, otherwise
    // they are white
    document.querySelectorAll("canvas").forEach(item => {
        ctx = item.getContext("2d");  
        ctx.fillStyle="#FFFFFF";
        ctx.fillRect(0,0,50,50);
        item.addEventListener("mousemove", (event) => {
            if (event.buttons) {
                ctx = event.target.getContext("2d");  
                if (mode === "brush") {
                    ctx.fillStyle = "#000000";         
                } else {
                    ctx.fillStyle = "#FFFFFF";         
                }
                ctx.beginPath();               
                ctx.arc(event.offsetX-1,event.offsetY-1,2,0, 2 * Math.PI);
                ctx.fill();   
            }
        })
    })
    // "Recognise" button handler. Captures
    // the content of each digit canvas as a PNG blob,
    // wraps each blob in a File and posts all of
    // them to the backend as one multipart form
    document.querySelector("#recognise").addEventListener("click", async() => {
        data = new FormData();
        canvases = document.querySelectorAll("canvas");
        const getPng = (canvas) => {
            return new Promise(resolve => {
                canvas.toBlob(png => {
                    resolve(png)
                })
            })
        }
        index = 0
        for (let canvas of canvases) {
            const png = await getPng(canvas);
            data.append((++index)+".png",new File([png],index+".png"));
        }
        const response = await fetch("http://localhost:8080/api/recognize", {
            body: data,
            method: "POST"
        })
        document.querySelector("#result").value = await response.text();
    })

</script>
<style>
    body {
        display:flex;
        flex-direction: column;
        justify-content: flex-start;
        align-items: flex-start;
    }
    canvas {
        border-width:1px;
        border-color:black;
        border-style: solid;
        margin-right:5px;
        cursor:crosshair;
    }
    .digits {
        display:flex;
        flex-direction: row;
        align-items: center;
        justify-content: flex-start;
    }
    .digits strong {
        font-size: 72px;
        margin:10px;
    }
    .buttons {
        display:flex;
        flex-direction: column;
        justify-content: flex-start;
        align-items: center;
    }
    button {
        width:100px;
        margin-bottom:5px;
        margin-right:10px;
    }
    .result {
        margin-top:10px;
        display:flex;
        flex-direction: row;
        align-items: flex-start;
        justify-content: flex-start;
    }
    input {
        margin-left:10px;
    }
</style>
</html>

The HTML part of this code contains 11 HTML5 canvas elements that display the boxes where you can draw. Each box has a size of 50x50 pixels and is filled with a white color. Also, the HTML contains "Switch to ..." and "Recognise" buttons and the "Result" input field.

The JavaScript part defines the "mode" global variable, which is equal to "brush" by default. When the user presses the "Switch to ..." button, it changes the mode to the "eraser". If they press it again, it switches back to the "brush".

Next, the JavaScript code defines "mousemove" event handlers for all canvas boxes. If the user presses the left mouse button in the "brush mode" and moves the mouse in the box, it draws black circles in place of the mouse cursor. This way, the user draws the digits. If the mode is "eraser", then it draws white circles. This way, the user can erase the black marks.

Finally, we defined the "Recognise" button click handler. When the user clicks this button, the handler function collects 11 digit images from the canvas elements and converts them to BLOB objects in a PNG image format.

Then it creates a POST request, puts these 11 digit images in it as files with names 1.png, 2.png and so on, and sends them to the /api/recognize endpoint of the backend service on port 8080 of a local host (which we will create in the next section).

The backend should receive these images, recognise digits in them, and return the recognition result as a string. This string will be displayed in the "Result" input field.
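If you want to exercise the backend without a browser, you can build a similar request by hand. The Python sketch below is an assumption about how such a request could look, not a copy of what the browser sends: the helper name build_multipart is mine, the image bytes are placeholders, and the exact headers a browser emits may differ.

```python
import uuid


def build_multipart(files):
    """Encode {field_name: file_bytes} as a multipart/form-data body.

    Returns (body, content_type). Each part uses the same string for the
    form field name and the file name, like the page's JavaScript does.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, payload in files.items():
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"; filename="{name}"\r\n'
             "Content-Type: image/png\r\n\r\n").encode()
        )
        chunks.append(payload)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


# Placeholder bytes stand in for real PNG data in this sketch
files = {f"{i}.png": b"fake-png-bytes" for i in range(1, 12)}
body, content_type = build_multipart(files)
print(content_type.split(";")[0], body.count(b"filename="))  # multipart/form-data 11
```

With real PNG bytes and the backend running, the body could be sent with urllib.request.Request("http://localhost:8080/api/recognize", data=body, headers={"Content-Type": content_type}, method="POST").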

Lastly, I defined some CSS to apply basic styles to this page. You can modify them as you want. Now, let's move to the most interesting part – the digits recognition backend.

How to create a backend

As a modern and mature programming language, Julia has a lot of libraries and frameworks for different tasks. Web frameworks are no exception. We will use the Genie.jl framework, which is similar to Express in Node.js or Flask in Python.

With Genie.jl you can run a basic web service in two lines of code:



using Genie
up(8080, async=false)

It will run a web server on port 8080 of a local host.

Using any text editor, for example VSCode with the Julia extension, create a new Julia file like digits.jl in the same folder with the index.html. This is where you'll write the next bit of code.

This web service will have two endpoints:

/ to display the index.html web page that you created before.
/api/recognize to receive POST requests with the images of digits, recognize them, and return a string with recognized numbers.

As with most other web frameworks, to receive and process HTTP requests Genie.jl uses routes. This application will have two routes:



using Genie, Genie.Router, Genie.Requests

route("/") do 
    return String(read("index.html"))
end

route("/api/recognize", method=POST) do
    result = ""
    # TODO: in a loop, extract each image 
    # from POST request body, send it to 
    # the digit recognition function, 
    # receive recognized digit and add 
    # it to the result
    return result
end

up(8080, async=false)

To work with routes and requests, you need to import two additional subpackages: Genie.Router and Genie.Requests.

The first route just returns the content of the index.html file.

The second route processes the POST requests to the /api/recognize endpoint. This is how you can define it:



using Images
route("/api/recognize", method=POST) do
    result = ""
    files = filespayload();   
    for index in 1:11
        file = files["$index.png"]
        img = load(IOBuffer(file.data))
        result *= recognizeDigit(img)        
    end    
    return result
end

To load each received file as an image, we use the Julia Images library, imported on the first line.

Inside the route, the filespayload() function extracts all files from the received request.

We assume that the request has 11 files and process them in a loop. Each file's data arrives as an array of bytes, but the load function requires an object that implements an IO interface. That is why IOBuffer wraps the array of bytes in a suitable format.

The loaded image is then passed to the recognizeDigit function, which will be written below. It should receive the image, recognize it using the trained model and return the recognized digit as a string. This digit is appended to the result string. Finally, the result with 11 recognized digits is sent to the web page.

Before writing the recognizeDigit function, ensure that the saved model file digits.bson was copied to the folder with your backend code.

Also, it's important to understand that we can't process the input image as is because it has a size of 50x50, and it is a black digit on a white background.

If the model was trained on images with a size of 28x28, it can't be used to recognize images of other sizes.

Also, a model trained on images with white digits on a black background will work poorly for colored images and for images with black digits on a white background.

So, before you send an image to the model for recognition, you need to preprocess it using the following steps:

Convert the images to gray
Invert the colors
Resize them to 28x28
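To show what these three steps do to the pixel values, here is a simplified plain Python sketch. It makes two assumptions that differ from the Julia code in this article: it averages the RGB channels instead of using a weighted grayscale conversion, and it resizes with nearest-neighbour sampling instead of interpolation. The function names are mine:

```python
def to_gray(pixel):
    """Average the RGB channels (each in 0.0..1.0) into one brightness value."""
    r, g, b = pixel
    return (r + g + b) / 3


def preprocess(image, size=28):
    """image: square list of rows of (r, g, b) tuples. Returns size x size grayscale."""
    gray = [[to_gray(p) for p in row] for row in image]
    inverted = [[1.0 - v for v in row] for row in gray]
    n = len(inverted)
    # Nearest-neighbour resize: pick the source pixel each target pixel maps to
    return [[inverted[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


# A 50x50 white canvas with a single black pixel in the top-left corner
canvas = [[(1.0, 1.0, 1.0)] * 50 for _ in range(50)]
canvas[0][0] = (0.0, 0.0, 0.0)

digit = preprocess(canvas)
print(len(digit), len(digit[0]), digit[0][0], digit[27][27])  # 28 28 1.0 0.0
```

After preprocessing, the black mark becomes 1.0 (white) and the white background becomes 0.0 (black), which is the kind of image the model was trained on.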

Now you are ready to implement the digits recognition function:



using Flux, MLUtils, BSON
function recognizeDigit(img)
    # load the model
    BSON.@load "digits.bson" model
    # Convert image to grayscale
    img = Gray.(img)
    # Invert each pixel color
    img = (x->Gray(1)-x.val).(img)
    # resize image to 28x28 pixels
    img = imresize(img,(28,28))
    # Get matrix of image
    digit_data = Float32.(channelview(img))
    # predict the digit (get probabilities)
    probs = model(cat(digit_data,dims=4))
    # return the digit with the largest 
    # probability, converted to a string
    return "$(argmax(probs)[1]-1)"
end

When all this is done, you are almost ready to run the app. Before doing that, ensure that all required packages are installed. Start the Julia REPL in the project folder. Then run the following code line by line to install all packages mentioned in the using lines:



using Pkg
Pkg.add("Genie")
Pkg.add("Images")
Pkg.add("Flux")
Pkg.add("MLUtils")
Pkg.add("BSON")

Then exit the REPL using the exit() command.

Now you can run the app. To do that, either execute the julia digits.jl command from the terminal or press Ctrl+F5 in VSCode.

Then, go to http://localhost:8080 in a web browser, draw the digits, press the "Recognise" button, and in a few moments you will see the recognised number as a text in the "Result" field.

This is how the final app should look and work:

Intro to advanced convolutional networks

The image classification task, where you detect the type of object on the image, is the simplest one for convolutional networks. They can do many more complex computer vision tasks, like object detection and image segmentation.

A network constructed and trained for object detection can determine not only the type of an object in the image, but also the coordinates of this object. Also, object detection can find several objects in the image. Furthermore, since a video is a sequence of images, object detection can be used on videos to track objects in real time, as shown in the following animation:

Image segmentation goes one step further. An image segmentation neural network can detect not only object types and their locations, but also their contours:

Using this feature, you can extract objects from the background or replace the background around them. Such a network is used, for example, in Chroma Cam to replace the background around a person during a Zoom call.

As a practical example of both features, let me introduce the Smart Image Cleaner service. It uses object detection to detect objects on images and image segmentation to remove background around them. This tool can be used by designers or by frontend web developers to preprocess images. This video shows how it works:

You can find the service here: https://icleaner.germanov.dev .

Conclusion

In this tutorial, I demonstrated how to create and train both feed forward and convolutional neural networks using Julia. You also learned how to export and use them in a web application.

In addition, I tried to show that you should not reinvent the wheel when creating neural networks.

When solving real life problems, you should not build neural network architectures from scratch. Most of them have already been created by data scientists and enthusiasts around the world. In practice, you will just reuse them.

You'll just need to find the suitable architecture and either use it as is or change the last few layers to adjust the outputs according to your needs.

For example, you can search this collection where you'll find different models classified by problem types. Even if many of them were not created with Julia, you can create them using Flux.jl after reading their descriptions.

The way we created and trained our neural network is not the best or the only possible one. Perhaps in some points I oversimplified things, because I wanted to explain all this as simply as possible.

But if you've understood the examples here, you can learn and reuse the following more advanced Julia solutions of the handwritten digits recognition task:

You can see the source code of this article including the Jupyter Notebook and the web service in this repository.

Have fun coding and never stop learning!

You can find me on LinkedIn, Twitter, and Facebook to know first about new articles like this one and other software development news.

Machine learning with Julia – How to Build and Deploy a Trained AI Model as a Web Service

Andrey Germanov — Fri, 17 Feb 2023 14:00:02 +0000

Introduction
About you
Why Julia?
Install Julia and Jupyter notebook support
Julia basics
    Linear algebra
    Working with datasets
    Visualizing data
Overview of Titanic machine learning problem
Prepare the training data for machine learning
    Fix missing values
    Fix non-numeric data
    Visual analysis
Train machine learning model
Make predictions and submit them to the Kaggle
Deploy the model to production
    Export the model to a file
    Create the frontend
    Create the backend
Conclusion

Introduction

Julia is a general purpose programming language well suited for numerical analysis and computational science. It is sometimes described as the future of machine learning and the most natural replacement for Python in this field.

This article introduces the Julia language and its ecosystem, and shows how to use it to solve the Titanic machine learning competition and submit the solution to Kaggle. In addition, it shows how to deploy the created machine learning model to production as a web service and create a web interface to send prediction requests to this service from a web browser.

By the end of the article, you will create a simple AI-powered web application that can be used as a template for creating more complex Julia ML solutions.

About you

This is not a book, but only an article. That is why it can't cover everything and assumes that you already have some basic knowledge to get the most from reading it. It is essential that you are familiar with Python machine learning and understand how to train machine learning models using the Numpy, Pandas, SciKit-Learn and Matplotlib Python libraries. Also, I assume that you are familiar with machine learning theory: types of machine learning problems like regression and classification, the concept and process of supervised machine learning (fit/predict and evaluate quality using metrics) and common models used for it, including the Random Forest Classifier and its implementation in the SciKit-Learn Python library. Additionally, it would be great if you have previously participated in Kaggle competitions, because to understand and run all the code in this article you need an account on https://kaggle.com.

There are a lot of books, articles and courses about the topics described above. In this article I only show how to create, train and deploy a basic machine learning model using Julia, without diving into the theoretical aspects of ML and AI.

Why Julia?

For a long time, Python has been known as the standard for data science and machine learning because of its simplicity and great set of libraries and tools. Among others, there are great libraries such as Numpy for linear algebra with vectors and matrices, Pandas for manipulating datasets, Matplotlib for data visualization and Scikit-Learn, which provides a uniform interface to well-known machine learning models. Furthermore, Jupyter Notebooks, which allow you to write and run Python code right in a web browser, provide a comfortable environment for data researchers to design and implement the whole machine learning cycle even if they are not very experienced in programming.

However, all this is good for research in laboratories, but at some point the model needs to go to production, and at that moment things change dramatically. Python was created in the early nineties and was never meant to be fast. Its core was never designed for modern technologies like distributed computing. That is why, to make complex ML tasks production ready, a lot of third party dependencies have to be installed and a lot of tricks have to be applied to the Python code to speed it up. A few companies even rewrite or convert Python machine learning models to faster languages like C++ before deploying them to production.

Julia aims to resolve these problems. This is what its authors wrote about their reasons for creating Julia:

We are greedy: we want more. We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

Source: The Julia blog.

So, from the ML perspective, Julia takes the best from both worlds. It aims to be as fast as C and as simple as Python. In addition, it has replacements for all the libraries that Python data scientists are used to working with:

Purpose	Python	Julia
Linear algebra	Numpy	Built in arrays, LinearAlgebra package
Work with datasets	Pandas	DataFrames.jl
Data visualization	Matplotlib	Plots.jl
Classic Machine learning	SciKit-Learn	MLJ.jl, ScikitLearn.jl, BetaML.jl
Neural Networks	TensorFlow or Pytorch	Flux.jl, BetaML.jl

Read more about why Julia is a great choice for machine learning here.

Furthermore, Julia has a module to support Jupyter Notebooks, so you can write Julia code there the same way as Python code. All this makes Julia ready for machine learning tasks, including Kaggle competitions, in the same environment that you use with Python. Let's install this environment and introduce some Julia ML basics.

Install Julia and Jupyter notebook support

To install Julia, follow this link: https://julialang.org/downloads/, download a Julia package for your operating system and run it. After successful installation, you will be able to run the julia command to enter the Julia REPL environment. Here, you can write and run Julia code. To exit from REPL, enter exit() command.

Also, you can write your code in any text editor and save to files with .jl extension. Then you can run your Julia programs by this command:

julia <filename>.jl

In addition, you can use VSCode to develop in Julia. It has a great extension for this: https://www.julia-vscode.org/.

However, the best option for developing machine learning and data science solutions is Jupyter Notebook, so ensure that it's installed before continuing. Then, install the Jupyter support package for Julia using the REPL:

Enter REPL using the julia command
Import the Pkg module

using Pkg

Install the IJulia package

Pkg.add("IJulia")

Exit the REPL with the exit() command

Then you can run Jupyter and create notebooks with Julia support. For your convenience, the next video shows how to install Julia and integrate it with Jupyter on macOS (assuming that Jupyter itself is already installed).

Sometimes the julia command does not work in the terminal after installation on macOS. You can use the following workaround to fix this: https://discourse.julialang.org/t/how-can-i-be-able-to-use-binary-command-julia-in-mac-osx-terminal/22270

Julia basics

Julia has a simple syntax. If you're familiar with Python, then it will be easy to start writing in Julia. You can read more about basic Julia syntax in this article. Here I will only cover the features that are required for machine learning and that will be used to solve the Titanic Kaggle competition. I will provide useful links to learn more about each of these libraries and modules.

Create a new Jupyter Notebook to enter and run all code samples below.

Linear algebra

Basic linear algebra features are already integrated into the Julia standard library. Each 1D array is a vector, and each 2D array is a matrix that works like a Numpy array by default. You do not need to include any additional packages for it. For example, if you write and run this code:

A = [
    [1 2 3]
    [4 5 6]
    [7 8 9]
]
B = [
    [7 8 9]
    [4 5 6]
    [1 2 3]
]

A*B

it will do a matrix multiplication and will output the following result:

3×3 Matrix{Int64}:
 18   24   30
 54   69   84
 90  114  138
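If you come from Python and want to cross-check this result without Numpy, the same row-by-column product can be sketched in a few lines (the matmul helper is mine, written only for this check):

```python
def matmul(a, b):
    """Row-by-column product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[7, 8, 9], [4, 5, 6], [1, 2, 3]]
print(matmul(A, B))  # [[18, 24, 30], [54, 69, 84], [90, 114, 138]]
```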

For additional features, you can import a LinearAlgebra module.

using LinearAlgebra

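Then, you can use functions such as det, tr or inv to get the determinant, trace or inverse of a matrix. The inverse exists only when the determinant is not zero, so this example uses a non-singular matrix (the matrix A defined earlier is singular, its determinant is zero):

using LinearAlgebra

C = [
    [1 2 3]
    [4 5 6]
    [7 8 10]
]
println("Determinant: ",det(C))
println("Trace: ",tr(C))
println("Inverse: ")
inv(C)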

Find more about linear algebra features in the LinearAlgebra module documentation.

Working with datasets

To work with datasets, you have to install the external DataFrames.jl module. In addition, to load and save datasets as CSV files, you have to add the CSV.jl module.

The Julia package manager is implemented as the Pkg module, so you have to import it and then use the add function to install the required packages. Run this in your Jupyter notebook to install these packages.

using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")

Then, you can import installed modules to your program:

using DataFrames, CSV

The DataFrames module exports the DataFrame data type, which you will use to construct datasets and manipulate data frame objects.

Create a data frame

This is how you can create a data frame with two columns:

df = DataFrame(name=["Julia", "Robert", "Bob","Mary"], 
age=[12,15,45,32])

This code will create and output the following dataset:

Select data from a data frame

To select data from a data frame, you can use the array syntax:

df[<rows>,<columns>]

You should specify the range of rows to select in <rows> and the range of columns to select in <columns>. For example, this selects the first three rows and only the "age" column:

subs = df[1:3,"age"]

It is important to note that array indexing in Julia starts at 1, not at 0 as in most other languages. To select the first three rows and all columns, you can run this:

subs = df[1:3,:]

Also, to select a single column, you can use dot syntax:

person_names = df.name

As you see, each column is a native Julia array (vector).

You can use conditions to specify row ranges. For example, this selects all persons in the dataset who are older than 15 years:

older = df[df.age .>15,:]

Sort data in a data frame

To sort data in a data frame, you can use the sort function. This will sort the dataset by age in ascending order:

sort(df,"age")

and the following code sorts it in descending order:

sort(df,"age",rev=true)

Add columns to a data frame

To add a new column, just use dot syntax:

df.sex = ["female","male","male","female"]

This added the sex column for persons to the data frame.

Remove columns from a data frame

The select function can be used for more complex data extraction from data frames. In particular, it can be used to extract all columns except the specified ones, which is equivalent to removing these columns:

new_df = select(df,Not("sex"))

This code returns a new data frame by selecting all columns from the original except sex.

Group and summarize data in data frame

The groupby and combine functions are used to group data and show summary information for each group. The former groups data by the specified field or fields, and the latter adds summary columns, like the number of rows in each group or the average value of some column in the group. The following code groups data by sex, calculates the number of rows in each group and adds it as a "count" column:

group_df = groupby(df,"sex")
combine(group_df,nrow => "count")

So, the first line of this code creates a GroupedDataFrame object with rows grouped by "sex". The second line creates the "count" column with the number of items in each group. There are 2 females and 2 males in this dataset.

Also, a custom function can be used to calculate summary data. For example, this can be used to add both row counts and average ages for each group:

combine(group_df, 
    nrow => "count", 
    "age" => ((rows) -> sum(rows)/length(rows)) => "Average Age"
)

This code adds the "Average Age" column, which is produced from the values of the "age" column by applying a custom anonymous function that calculates the average of the values in each group.
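If it helps to see the same computation without a data frame library, here is a plain Python sketch of the grouping and both summaries, using the four persons from this section. It only illustrates what is being calculated and is not how DataFrames.jl works internally:

```python
from collections import defaultdict

people = [
    {"name": "Julia", "age": 12, "sex": "female"},
    {"name": "Robert", "age": 15, "sex": "male"},
    {"name": "Bob", "age": 45, "sex": "male"},
    {"name": "Mary", "age": 32, "sex": "female"},
]

# Group the ages by sex, like groupby(df, "sex")
ages_by_sex = defaultdict(list)
for person in people:
    ages_by_sex[person["sex"]].append(person["age"])

# Summarize each group, like combine(...) with nrow and a custom average
summary = {sex: {"count": len(ages), "Average Age": sum(ages) / len(ages)}
           for sex, ages in ages_by_sex.items()}
print(summary)
# {'female': {'count': 2, 'Average Age': 22.0}, 'male': {'count': 2, 'Average Age': 30.0}}
```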

This was just a small fraction of the manipulations that you can do with data using the DataFrames.jl library. Read more about it in the documentation.

Visualizing data

Using Plots.jl, you can create a lot of different graphs to analyze your data, similar to Matplotlib or Seaborn in Python. To use it, you have to install the Plots package to your notebook and import it:

using Pkg
Pkg.add("Plots")
using Plots

Let me provide a few examples of graphs.

Line chart

plot(
    [1,2,3,4,5],
    [3,6,9,15,16],
    title="Basic line chart",label="Line"
)

Scatter plot

plot(
    [1,2,3,4,5],
    [3,6,9,15,16],
    title="Basic scatter plot",
    label="Data",
    seriestype="scatter"
)

Bar chart

The next code generates a bar chart from the df dataset that was created earlier.

plot(
    df.name,
    df.age,
    title="Ages",
    label=nothing,
    seriestype="bar"
)

There is much more that you can do using Plots.jl. Read more about its features in the documentation.

After this short overview of basic data science features of Julia, it's time to create and train the first machine learning model and evaluate its quality on the competition.

Overview of Titanic machine learning problem

The "Titanic - Machine Learning from Disaster" is one of the first educational machine learning problems that you see in books, articles or courses. In this task you are provided with a dataset about Titanic passengers. Each passenger record includes an ID, name, sex, ticket cost, ticket class, cabin number, port of embarkation and number of family members. For the passengers in this dataset, the "Survived" column tells whether they survived or not: the value is 1 if the passenger survived and 0 if not. Formally, this is called a labeled or training dataset. All data columns except "Survived" are called the "features matrix", and the "Survived" column is called the "labels vector".

There is also a second dataset with the same data about other passengers but without the "Survived" column. In other words, this dataset contains only the features matrix, but does not have the labels vector. This is called the testing dataset. The task is to train a machine learning model on the training dataset and use this model to predict the "Survived" column values in the testing dataset or, in other words, predict the "labels vector" of the testing dataset based on its "features matrix".

The Kaggle competition is available here: https://www.kaggle.com/competitions/titanic

Read the description briefly, then open the "Evaluation" section to discover how Kaggle will evaluate the predictions that you submit.

Prepare the training data for machine learning

The "Data" tab on the Kaggle competition page contains training and testing datasets in train.csv and test.csv files, along with descriptions for each data column.

Create a new Jupyter notebook with the Julia backend and download these files to the same folder as your notebook.

Load train.csv into a data frame using the CSV module:

# Add packages
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")

# Import modules
using DataFrames, CSV

# Load training data to data frame
train_df = CSV.read("train.csv", DataFrame)

In case of errors, check that the train.csv file exists in the folder where you run your notebook.

If there are no errors, it will show the first rows of the data:

As you see, this dataset has 891 rows and 12 columns. It contains basic data about each passenger, like "Name", "Sex" and "Age". In addition, we see the "Survived" column, with 0 if the passenger did not survive and 1 if they survived.

Let's see the summary information about this data using the describe function:

describe(train_df)

This summary table shows info about each column. It shows the min, max, mean and median of the data in each of them. The basic goal of data preparation is to transform these columns into the features matrix and the labels vector. The labels vector is ready: this is the "Survived" column with numeric values. All other columns form the features matrix, and not everything is ok with them.

Let's look at the nmissing and eltype for each column. The nmissing shows the number of missing values in the corresponding column, and the eltype shows the type of its values. The matrix should contain only numbers, but there are many columns of the "string" data type. Also, the matrix should not have missing values, but we have some missing values in the Age, Cabin and Embarked columns. Let's fix all this.

Fix missing values

As the previous table shows, the Age, Embarked and Cabin columns contain missing values. Embarked is absent in only 2 rows, so we will not lose much data if we just remove these rows. The DataFrames module has a handy dropmissing function that can be used for this:

train_df = dropmissing(train_df,"Embarked")

This will remove all rows with missing values in the Embarked column.

The Age column contains 177 missing values, and it's not a good idea to remove these rows, because we would lose about 20% of the dataset. So, let's just fill them with something, for example with the median value. The median age is 28, as displayed in the description table. Let's use Julia's replace function to replace missing ages with 28:

train_df.Age = replace(train_df.Age,missing=>28)
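Hard-coding 28 works for this dataset, but the value can also be computed. The following is a minimal sketch on a made-up vector, assuming it stands in for the Age column: it takes the median of the known values with skipmissing and fills the gaps with coalesce. Both functions are part of Julia itself, and median comes from the Statistics standard library.

```julia
using Statistics  # standard library, provides median

# Made-up ages with gaps (assumption: stands in for train_df.Age)
ages = [22.0, missing, 38.0, 26.0, missing, 35.0]

# Median of the known values only
median_age = median(skipmissing(ages))

# Replace every missing value with that median
filled = coalesce.(ages, median_age)

println(median_age)
println(filled)
```

The same two lines should work on a data frame column, because a column is an ordinary vector.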

The Cabin column contains 687 missing values, which is about 77% of the dataset. There is too little data in this column to be useful for predictions. Also, it's difficult to guess which values should be in these rows when more data is missing than exists. So, let's just drop this column using the select function:

train_df = select(train_df, Not("Cabin"))

Finally, all missing data in the dataset has been fixed.

Fix non-numeric data

As said before, all data should be encoded to numbers, but Name, Sex, Ticket and Embarked are strings, and PassengerId is just a row identifier.

The Name and PassengerId values are unique for each passenger, which is why an ML model can't use them to split the data into categories or classify it. So, you can just remove these columns:

train_df = select(train_df,Not(["PassengerId","Name"]));

For the other string columns, it is required to encode all text values to numbers. To do that, we need to discover all unique values of these columns. Let's start with Embarked:

combine(groupby(train_df,"Embarked"),nrow=>"count")

This code grouped the dataset by the Embarked column and showed all possible values and their counts. So, there are only "S", "C" and "Q" values here. It's easy to encode them as S=1, C=2 and Q=3. This can be done with the following replace call:

train_df.Embarked = Int64.(
    replace(train_df.Embarked, 
        "S" => 1, "C" => 2, "Q" => 3
    )
)

Also, this code converted the column from "String" to "Int64" data type.
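The same encoding can also be expressed as a lookup table, which fails loudly if the column contains a value that was not anticipated. This is a minimal sketch on a made-up vector; embarked_codes and encode are illustrative names, not part of any library.

```julia
# Mapping used in this article: S=1, C=2, Q=3
embarked_codes = Dict("S" => 1, "C" => 2, "Q" => 3)

# Hypothetical helper: translate each label to its code.
# Indexing the Dict throws a KeyError for an unknown label.
encode(values, codes) = [codes[v] for v in values]

# Made-up sample values (assumption: stands in for train_df.Embarked)
embarked = ["S", "C", "S", "Q"]
encoded = encode(embarked, embarked_codes)

println(encoded)
```

Keeping the mapping in one place also makes it easier to reuse exactly the same codes later for the testing dataset and for the web form.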

Then, repeat the same for the Sex column:

combine(groupby(train_df,"Sex"),nrow=>"count")

and replace female=1 and male=2.

train_df.Sex = Int64.(
    replace(train_df.Sex, 
        "female" => 1, "male" => 2
    )
)

Now it's time to see the summary info for the Ticket column:

combine(groupby(train_df,"Ticket"),nrow=>"count")

Here we see that it has 680 different categories of tickets, which is about three quarters of the number of rows. However, we need to predict just two categories, either survived or not survived. It's unlikely that this data can help the model make good predictions without additional processing to reduce the number of categories in this column. Although this goes beyond our current basic model, as additional practice you can experiment with this column to improve prediction results, for example by grouping tickets into more general categories and encoding these categories by unique numbers. For now, let's just remove this column:

train_df = select(train_df,Not("Ticket"))
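If you later want to try the exercise mentioned above instead of dropping the column, here is one possible starting point. It is only a sketch under my own assumption that the text before the ticket number is a meaningful category; ticket_prefix is a hypothetical helper and the sample values are just a few illustrative ticket strings.

```julia
# Hypothetical helper: use the text before the last space as the category,
# or "NONE" when the ticket consists of a number only
function ticket_prefix(ticket)
    parts = split(ticket, ' ')
    return length(parts) > 1 ? join(parts[1:end-1], ' ') : "NONE"
end

# A few sample ticket strings (assumption: stands in for train_df.Ticket)
tickets = ["A/5 21171", "PC 17599", "113803", "PC 17601"]
prefixes = ticket_prefix.(tickets)

# Assign a number to every distinct prefix, in order of first appearance
codes = Dict(p => i for (i, p) in enumerate(unique(prefixes)))
encoded = [codes[p] for p in prefixes]

println(prefixes)
println(encoded)
```

Whether such a feature actually improves the accuracy is something you would need to check with cross-validation.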

Now all string data is categorized, and all values are replaced with category numbers. Let's describe the dataset again to ensure that all problems with the data are resolved:

describe(train_df)

You can see that all columns contain only numeric data and there are no missing values in them.

Visual data analysis

Now, the dataset is ready to train a machine learning model on it. Let's visualize this data to find some relations in it.

using Plots

# Group dataset by "Survived" column
survived = combine(groupby(train_df,"Survived"), nrow => "Count")

# Display the data on bar chart
plot(
    survived.Survived, 
    survived.Count, 
    title="Survived Passengers", 
    label=nothing, 
    seriestype="bar", 
    texts=survived.Count
)

# Modify X axis to display text labels 
# instead of numbers
xticks!([0:1:1;],["Not Survived","Survived"])

Here we see that 340 passengers survived. Now let's see how these passengers are distributed by sex.

# Group dataset by Sex column 
# and show only rows where Survived=1
survived_by_sex = combine(
    groupby(
        train_df[train_df.Survived .== 1,:],
        "Sex"), 
    nrow => "Count"
)

# Display the data on bar chart 
plot(
    survived_by_sex.Sex, 
    survived_by_sex.Count, 
    title="Survived Passengers by Sex", 
    label=nothing, 
    seriestype="bar", 
    texts=survived_by_sex.Count
)

# Modify X axis to display text 
# labels instead of numbers
xticks!([1:1:2;],["Female","Male"])

Interestingly, about twice as many females as males survived in the training dataset. Now let's see the distribution of passengers who did not survive by ticket class.

# Group dataset by PClass column 
# and show only rows where Survived=0
death_by_pclass = combine(
    groupby(
        train_df[train_df.Survived .== 0,:],
        "Pclass"), 
    nrow => "Count")

# Display the data on bar chart 
plot(
    death_by_pclass.Pclass, 
    death_by_pclass.Count, 
    title="Dead Passengers by Ticket class", 
    label=nothing, 
    seriestype="bar", 
    texts=death_by_pclass.Count
)

# Modify X axis to display 
# text labels instead of numbers
xticks!([1:1:3;],["First","Second","Third"])

This clearly shows that first and second class passengers had a better chance of surviving than third class ones.

Assuming that data in the training and testing datasets is distributed randomly, it's highly likely that a machine learning model trained on this data will predict that women in first or second class had a much better chance of surviving than others. Let's remember this finding to check this hypothesis at the end of the article, after we train and deploy the ML model.
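The bar charts compare absolute counts, which also depend on how many passengers were in each group. A survival rate is a more direct way to check the hypothesis. Below is a minimal sketch on made-up rows, not the real Titanic data; survival_rate is a hypothetical helper that returns the share of survivors among the rows that match a condition.

```julia
using Statistics  # standard library, provides mean

# Made-up rows (assumption: not the real Titanic data).
# sex: 1 = female, 2 = male, as encoded in this article
rows = [
    (sex=1, pclass=1, survived=1),
    (sex=1, pclass=3, survived=1),
    (sex=1, pclass=3, survived=0),
    (sex=2, pclass=1, survived=0),
    (sex=2, pclass=2, survived=1),
    (sex=2, pclass=3, survived=0),
    (sex=2, pclass=3, survived=0),
]

# Hypothetical helper: share of survivors among the rows where keep(row) is true
survival_rate(rows, keep) = mean(r.survived for r in rows if keep(r))

female_rate = survival_rate(rows, r -> r.sex == 1)
male_rate = survival_rate(rows, r -> r.sex == 2)

println("female: ", female_rate)
println("male: ", male_rate)
```

With the cleaned data frame, the equivalent should be something like combine(groupby(train_df,"Sex"), "Survived" => mean => "Rate"), after importing Statistics.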

Finally, let's see the cleaned training dataset again:

train_df

Now it really looks like a matrix, or, to be more precise, like a system of linear algebraic equations written in matrix form. Data in matrix format is exactly what most machine learning algorithms expect to get as an input. Let's get started.

Train machine learning model

For machine learning, we will use the ScikitLearn.jl library, which replicates the scikit-learn library for Python. It provides an interface for commonly used machine learning models like Logistic Regression, Decision Tree or Random Forest. ScikitLearn.jl is not a single package but a rich ecosystem with many packages, and you need to select which of them to install and import. You can find a list of supported models here. Some of them are built-in Julia models, others are imported from Python. Also, ScikitLearn.jl has a lot of tools to tune the learning process and evaluate results.

For this "Titanic" task, we will use the RandomForestClassifier model from the DecisionTree.jl package. It usually works well for classification problems. Also, we will use cross-validation from the ScikitLearn.CrossValidation module to calculate the accuracy of model predictions. You have to install and import these packages before using them:

Pkg.add("DecisionTree")
Pkg.add("ScikitLearn")
using DecisionTree, ScikitLearn.CrossValidation

Then we will implement the training process. First we need to split the training dataset into the features matrix and the labels vector, then create the RandomForestClassifier model and train it using this data. Finally, we will evaluate the prediction accuracy of this model using the cross_val_score function.

# Put "Survived" column to labels vector
y = train_df[:,"Survived"]
# Put all other columns to features 
# matrix (important to convert to "Matrix" data type)
X = Matrix(train_df[:,Not(["Survived"])])

# Create Random Forest Classifier with 100 trees
model = RandomForestClassifier(n_trees=100)

# Train the model, using features matrix 
# and labels vector
fit!(model,X,y)

# Evaluate the accuracy of predictions 
# using Cross Validation
accuracy = minimum(
    cross_val_score(model, X, y, cv=5)
)

Cross-validation splits the X and y arrays into 5 parts (folds). For each fold, it trains the model on the other four folds and measures the accuracy on the remaining one, then returns the array of accuracies for all folds. Then the minimum function selects the worst accuracy from this array, which means that all others are at least as good as the selected one. In my run, the achieved accuracy was more than 0.78, or 78%, on our training data. It's not bad, but it does not guarantee that the result on the testing dataset will be the same. You can try to improve this value by selecting different models, or by tuning their hyperparameters. For example, you can increase the number of trees (n_trees) from 100 to 1000 or reduce it to 10 and see how it changes the accuracy. After achieving the best result, it's time to use the model for predictions.
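To illustrate what the folds are, here is a simplified sketch of how row indices can be split into 5 contiguous folds. It is not the ScikitLearn.jl implementation, only an illustration of the idea; kfold_indices is a hypothetical helper and the sample size of 10 is arbitrary.

```julia
# Hypothetical helper: split indices 1..n into k contiguous folds.
# Each fold serves once as the validation part, the rest is used for training.
function kfold_indices(n, k)
    folds = Vector{Tuple{Vector{Int},Vector{Int}}}()
    # k+1 evenly spaced boundaries between 0 and n
    bounds = round.(Int, range(0, stop=n, length=k + 1))
    for i in 1:k
        test_idx = collect(bounds[i]+1:bounds[i+1])
        train_idx = setdiff(1:n, test_idx)
        push!(folds, (train_idx, test_idx))
    end
    return folds
end

for (train_idx, test_idx) in kfold_indices(10, 5)
    println("train: ", train_idx, "  validate: ", test_idx)
end
```

Every row is used for validation exactly once. Reporting the mean of the fold scores is also common; the minimum used above is simply a more conservative choice.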

Make predictions and submit them to the Kaggle

Now that the model is ready, it's time to apply it to the data from the test.csv file, which does not have the "Survived" labels. First we need to load it and look at the summary table as we did for the training dataset:

test_df = CSV.read("test.csv",DataFrame)
describe(test_df)

Here you can see the same problems with the data: missing values and string columns. You need to apply exactly the same transformations to this data as you did to the training dataset, except removing rows: Kaggle requires a prediction for each row, so you can only fill missing values, not remove the rows that have them. Fortunately, the Embarked column does not have missing values here, so there is no need to fix it. However, this dataset has a single missing value in the Fare column, which had no missing values in the training set. It's not a big problem: you can just replace this missing value with the median, 14.4542.

But the first thing to do is to save the PassengerId column to a separate variable. It will be required later for the Kaggle submission.

PassengerId = test_df[:,"PassengerId"]

Then, apply all required data fixing:

# Repeat the same transformations as we did for training dataset
test_df = select(test_df,
    Not(
        ["PassengerId","Name","Ticket","Cabin"]
    )
)
test_df.Age = replace(test_df.Age,missing=>28)
test_df.Embarked = replace(
    test_df.Embarked,"S" => 1, "C" => 2, "Q" => 3
)
test_df.Embarked = convert.(Int64,test_df.Embarked)
test_df.Sex = replace(
    test_df.Sex,"female" => 1,"male" => 2
)
test_df.Sex = convert.(Int64,test_df.Sex)

# In addition, replace missing value
# in 'Fare' field with median
test_df.Fare = replace(
    test_df.Fare,
    missing=>14.4542
)

After the testing dataset is clean, you can use the trained model to make predictions:

Survived = predict(model, Matrix(test_df))

This code returns an array of predictions for each row of the testing dataset matrix and saves it to the Survived variable.

Now it's time to submit it to Kaggle. Before doing it, look again at the "Evaluation" tab on the Kaggle Titanic competition page to see the required submission format:

The competition requires a CSV file with two columns: "PassengerId" and "Survived". You already have all this data. Let's create the data frame with these two columns and save it to CSV:

submit_df = DataFrame(PassengerId=PassengerId,Survived=Survived)
CSV.write("submission.csv",submit_df)

The first line of this code constructs the submit_df data frame with the PassengerId column that was saved previously and the Survived column with predictions for each passenger ID. The second line saves this submit_df to the submission.csv file. This is how the content of this file looks:

Finally, go to the Kaggle competition page, press the "Submit Predictions" button, upload the submission.csv file and see your result. When I did this, I received the following:

The prediction accuracy is 0.76555, which is more than 76% and is close to the accuracy that was achieved on the training dataset. Not bad for the first time, but you can keep going: play with the data, try different models, change their hyperparameters, search the Internet for articles and Jupyter notebooks of other people who solved the Titanic competition before. I know that it's possible to achieve up to 98% accuracy using various tricks with models and data.

Deploy the model to production

It's fun to play with machine learning on your computer, but it does not make much difference to the outside world. Usually, customers do not have Jupyter Notebooks and they do not train models. They need simple tools that help them make decisions based on predictions from the data that they have. That is why the only really important thing is how your models work in production. In this section, I will explain how to use Julia to create a web application that loads the machine learning model you trained and makes predictions online in a web browser.

Export the model to a file

First, you need to save the model from the notebook to a file. For this you can use the JLD2.jl module. This module is used to serialize Julia objects to an HDF5-compatible format (which is well known to Python data scientists) and save them to a file.

Install and load the package to the notebook:

Pkg.add("JLD2")
using JLD2

and then save the model variable to the titanic.jld2 file:

save_object("titanic.jld2", model)

The work with the Jupyter Notebook is finished now. All the following code should be written as a separate application. Create a folder for the new application, for example titanic, and copy the titanic.jld2 file to it.

Now you can create a text file titanic.jl which will contain the code of the web application that you will write soon. Use any text editor for this, or VS Code with the Julia extension. Enter the following in titanic.jl:

using JLD2, DecisionTree
model = load_object("titanic.jld2")
survived = predict(model,[1 2 35 0 2 144.5 1])
println(survived)

This code first imports the required modules. As you see, just two modules are required to run the prediction process: JLD2 to load the model object, and DecisionTree to run the predict function for the RandomForestClassifier. Then, the code loads the model from the file and makes predictions for a single row of data. The columns in this row should go in the same order as they were passed from the dataset when training the model: Pclass, Sex, Age, SibSp, Parch, Fare and Embarked. Finally, it prints the array of predictions. In this case, it will print an array with a single item, because only a single row of data was passed to the model for predictions.
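Because the model relies on the position of each column, it's easy to mix up the values when typing the row by hand. One way to reduce this risk is a small helper with named arguments. This is only a sketch; passenger_row is a hypothetical name and is not used in the rest of the article.

```julia
# Hypothetical helper: build a 1x7 features row in the training column order:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
function passenger_row(; pclass, sex, age, sibsp, parch, fare, embarked)
    return Float64[pclass sex age sibsp parch fare embarked]
end

row = passenger_row(pclass=1, sex=2, age=35, sibsp=0, parch=2,
                    fare=144.5, embarked=1)

println(size(row))
println(row)
```

The resulting row has the same shape and values as the literal [1 2 35 0 2 144.5 1] passed to predict above.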

You can run titanic.jl using the julia command:

julia titanic.jl

If everything works, it should print [0] or [1] to the console depending on the prediction result. If you receive errors, then perhaps you need to install the JLD2 and DecisionTree packages using the Julia REPL, as you did in the Jupyter notebook.

Now, let's refactor this code into a function that will receive the row of data and return a survival prediction (either 0 or 1):

using JLD2, DecisionTree

# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
    model = load_object("titanic.jld2")
    survived = predict(model,data)
    return survived[1]
end
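Note that this function reads the model file on every call. For a web service that handles many requests, you may prefer to load the model once and reuse it. The following is a minimal sketch of that pattern; to keep it runnable without JLD2, load_model is a stand-in for load_object("titanic.jld2") that only counts how many times it was called, and get_model is a hypothetical helper.

```julia
# Stand-in for an expensive load such as load_object("titanic.jld2")
# (assumption: replaced by a counter so the sketch runs without JLD2)
const load_count = Ref(0)
function load_model()
    load_count[] += 1
    return "model"
end

# Cache: empty until the first request
const cached_model = Ref{Any}(nothing)

# Hypothetical helper: load the model on first use, then reuse it
function get_model()
    if cached_model[] === nothing
        cached_model[] = load_model()
    end
    return cached_model[]
end

get_model()
get_model()
println(load_count[])
```

In a small script, simply loading the model into a const global variable before starting the server achieves the same effect.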

Create the frontend

The next step is to create a web interface that will be used to collect the data for this function. It will look as displayed in the next screenshot:

With this interface, the user can enter the data about a passenger, then press the "PREDICT" button and discover whether a passenger with this data could survive on the Titanic or not. This is the HTML code of this web page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Titanic</title>
</head>
<body>
    <table>
        <tbody>
            <tr>
                <td>Ticket class</td>
                <td>
                    <select id="pclass">
                        <option value="1">1</option>
                        <option value="2">2</option>
                        <option value="3">3</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Sex</td>
                <td>
                    <select id="sex">
                        <option value="1">Female</option>                        
                        <option value="2">Male</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Age</td>
                <td>
                    <input id="age" type="number"/>
                </td>
            </tr>
            <tr>
                <td># of Siblings/Spouses</td>
                <td>
                    <input id="sibsp" type="number"/>
                </td>
            </tr>
            <tr>
                <td># of Parents/children</td>
                <td>
                    <input id="parch" type="number"/>
                </td>
            </tr>
            <tr>
                <td>Fare</td>
                <td>
                    <input id="fare"/>
                </td>
            </tr>
            <tr>
                <td>Embarked</td>
                <td>
                    <select id="embarked">
                        <option value="1">S</option>
                        <option value="2">C</option>
                        <option value="3">Q</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Survived</td>
                <td id="survived"></td>
            </tr>
            <tr>
                <td colspan="2">
                    <div>
                        <button id="submit" type="button">PREDICT</button>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    <script>
        document.getElementById("survived").innerHTML = "";
        document.getElementById("submit").addEventListener("click",async() => {
            const response = await fetch("http://localhost:8080",{
                method:"POST",
                body: JSON.stringify({
                    "pclass":parseInt(document.getElementById("pclass").value),
                    "sex":parseInt(document.getElementById("sex").value),
                    "age":parseFloat(document.getElementById("age").value),
                    "sibsp":parseInt(document.getElementById("sibsp").value),
                    "parch":parseInt(document.getElementById("parch").value),
                    "fare":parseFloat(document.getElementById("fare").value),
                    "embarked":parseInt(document.getElementById("embarked").value),
                })
            });
            const survivedCode =  parseInt(await response.text());
            document.getElementById("survived").innerHTML = survivedCode ? "YES" : "NO"
        })
    </script>
    <style>
        input,select {
            width:100%;
        }
        td {
            padding:5px;
        }
        td > div {
            text-align: center;
        }
        #survived {
            font-weight: bold;
            color:green;
        }
    </style>
</body>
</html>

Create an index.html file in the same folder and copy this code to it. The HTML part of the file contains a simple form with all data fields. As you see, all values are encoded to the same numbers as we did with the data in the training and testing datasets. Then, the JavaScript part of this code defines the handler of the "PREDICT" button. When the user clicks on it, the script collects all entered data and serializes it as a JSON string. Then it makes an AJAX request to the web service running on port 8080 of localhost (which has not been created yet) and sends this JSON to it. So, the web service should be able to receive HTTP POST requests with a JSON body in the following format:

{
     "pclass": 1,
        "sex": 1,
        "age": 32,
      "sibsp": 5,
      "parch": 6,
       "fare": 123.44,
   "embarked": 1
}

Create the backend

Now it's time to modify the titanic.jl file to make it work as a web server that can display the index.html page, receive a POST request from it, parse the body of this request as JSON, make a prediction based on this JSON data and return this prediction to the web page.

Creating a web server in Julia is as simple as in Python, Go, or Node.js. Using the HTTP.jl package, you can create and run a web server with a single line of code:

using HTTP

# The handler must be defined before the server starts
function handler(req)
    # handle HTTP request
end

HTTP.serve(handler,8080)

The HTTP.serve function runs the web server on the specified port. Each time the web server receives a client request, it calls the specified handler function and passes an HTTP request object to it as the req argument. The function should read this request, process it and return a response for the calling client.

The req.url field contains the URL of the received request, the req.method field contains the request method, like GET or POST, and the req.body field contains the POST body of the request in binary format. The HTTP request object contains much more information, all of which you can find in the HTTP.jl documentation. Our web application will only check the request method. If the received request is a POST request, it will parse req.body as a JSON object and send the data from this object to the isSurvived function to make a prediction and return it to the client browser. For all other request types, it will just return the content of the index.html file, to display the web interface. This is how the whole source of the titanic.jl web service looks:

using JLD2, DecisionTree

# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
    model = load_object("titanic.jld2")
    survived = predict(model,data)
    return survived[1]
end

using HTTP,JSON3

function handle(req)
    if req.method == "POST"
        form = JSON3.read(String(req.body))
        survived = isSurvived([
            form.pclass
            form.sex
            form.age
            form.sibsp
            form.parch
            form.fare
            form.embarked
        ])
        return HTTP.Response(200,"$survived")
    end
    return HTTP.Response(200,read("./index.html"))
end

HTTP.serve(handle, 8080)

Before running it, you need to install the HTTP.jl and JSON3.jl packages by running Pkg.add("HTTP") and Pkg.add("JSON3") in the Julia REPL.

The web service code goes right after the isSurvived function. First, the required modules are imported: HTTP to create a web server and JSON3 to parse JSON from the request body. Then, the handle function is defined. The function checks the method of the received request and, if it equals POST, converts the stringified JSON body of this request to the form object. Then the isSurvived function is called with the fields of this object. It's important to put the array items in the correct order here. Finally, the prediction result is returned to the client using the HTTP.Response function.

For all other request types, the function returns the body of index.html file in the HTTP.Response(200,read("./index.html")) line.

Finally, the HTTP.serve function starts a web server on port 8080 that waits for HTTP requests and handles them using the handle function defined above.
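One thing the handle function does not do is validate the request body. For example, if the user leaves a number field empty, parseFloat returns NaN in the browser and JSON.stringify sends it as null. Below is a minimal sketch of such a check. It assumes that the body has already been converted to a dictionary with string keys, which is not how the code above works with the result of JSON3.read, so treat it as an illustration only; missing_fields is a hypothetical helper.

```julia
# Fields expected in the request body, in the order the model needs them
const required_fields = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]

# Hypothetical helper: list the required fields that are absent or not numeric
function missing_fields(payload::AbstractDict)
    return filter(f -> !(get(payload, f, nothing) isa Real), required_fields)
end

# Made-up request bodies: a complete one and an incomplete one
good = Dict("pclass" => 1, "sex" => 1, "age" => 32, "sibsp" => 5,
            "parch" => 6, "fare" => 123.44, "embarked" => 1)
bad = Dict("pclass" => 1, "sex" => 1, "age" => nothing)

println(missing_fields(good))
println(missing_fields(bad))
```

If the returned list is not empty, the service could answer with an error status such as HTTP.Response(400, ...) instead of calling the model.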

Now you can run this by typing julia titanic.jl in a terminal or by pressing Ctrl+F5 in VS Code. Then you can access the web interface from a web browser at http://localhost:8080 and play with the service: enter data in the form, press the PREDICT button and see either YES or NO on the Survived line depending on the prediction result. You can check the hypothesis that we made from the bar charts: women in first or second class have a better chance of surviving than others.

Conclusion

In this article, I introduced the Julia programming language along with its ecosystem and explained why it's so great for machine learning. I showed how to set up a comfortable development environment and gave a brief overview of the common Julia modules used for data science. Then I guided you through the process of training the machine learning model for the Titanic competition and showed how to make predictions and submit them to the Kaggle platform for scoring. Finally, I showed how to export this model to an external application, and how to create a web service with this model and a web interface to enter data into the form and predict whether a person with this data could survive on the Titanic or not.

For all topics that were explained briefly, I provided links to more thorough documentation. In addition, I would highly recommend reading the Julia Data Science online book and studying the great set of machine learning examples in the Julia Academy Data Science GitHub repository.

See the source code of this article including the Jupyter Notebook and the web service in this repository:

https://github.com/AndreyGermanov/julia_titanic_model

Have fun coding and never stop learning!

Subscribe to the newsletter on my website: https://germanov.dev/#newsletter and follow me on social networks to know first about new articles like this one and other software development news:

LinkedIn: https://www.linkedin.com/in/andrey-germanov-dev/
Twitter: https://twitter.com/GermanovDev
Facebook: https://www.facebook.com/AndreyGermanovDev

Efficient string building in JavaScript

Andrey Germanov — Fri, 25 Nov 2022 06:14:32 +0000

Everything that we see in a browser, except images and videos, is a string. That is why working with strings wisely can dramatically increase the performance of your web applications, both on the frontend and on the backend.

What should you know about strings in programming? The string is a primitive data type that holds an array of characters. Values of primitive data types are immutable, so a string's value cannot be changed after instantiation. This is true for most programming languages including JavaScript. But wait, when you do this:



let hello = "Hello";
hello += " world";
console.log(hello);

It's obvious that you'll see Hello world on the console, which means that the value of the hello variable has changed. How is it possible? How can JavaScript change the value of a string variable and keep it immutable at the same time?

It happens because JavaScript does not add the second string to the first string directly. Instead, it creates a third empty string, then copies the values of both strings to it and finally reassigns the "hello" variable to this third string. In this way, the value of the third string is set only once and the values of the two initial strings stay unchanged to meet the immutability rule. This is how the whole string concatenation process looks:

Do you see any problem here? What can be said about the performance of this operation? It seems that it does up to five times more operations than it should and it uses two times more memory in step 3 to hold the same data.

On the one hand, it's not a big issue if we just want to concatenate two strings, because computers can do millions of operations in a second. However, the problem becomes more serious if we need to build long strings. Let's say that we need to construct a big portion of HTML content from an external data array in a loop. In this case the HTML string can become huge during this process and JavaScript will create a copy of this string on each iteration of the loop.

As an example, let's see the code that builds a huge string in a loop, by appending to the initial string a hundred million times.



let str = "Hello";

console.log("START",new Date().toUTCString());

for (let index=0;index<100000000;index++) {
    str += "!";
}

console.log("END",new Date().toUTCString());
console.log(str.length);

This code appends the "!" symbol to the string a hundred million times. In a real-world example, instead of the '!' symbol it could be real data from an external source that should be displayed later.

Also, this code outputs the current date and time before and after the loop which helps to measure how long it takes. Finally it displays the length of the constructed string.

When I ran this in my Google Chrome browser, it took a while to complete. Finally it displayed the following on the JavaScript console:

As you can see, it took 1 minute 26 seconds and output the correct length of the concatenated string. However, when I ran this on another computer, this code crashed the browser and I saw the following output:

If you remember the basic algorithm of string concatenation described above, it should be clear why this could happen. The default string concatenation algorithm is too inefficient and wastes a lot of memory. In this example, it copies a string that grows from a few characters to a hundred million characters, once on each of a hundred million loop iterations. The amount of memory that can be used for this is hard to even imagine. This means that whether it crashes depends on the amount of available free memory and on how the garbage collector of a particular JavaScript engine erases unused temporary strings.
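
To get a feeling for the numbers, here is a small sketch that counts how many characters are written under the simplified model described above. It is only an illustration: the function names are made up for this example, and the model assumes that every concatenation writes the whole new string from scratch, which real engines may avoid.

```javascript
// Simplified cost model: an assumption for illustration only.
// Real JavaScript engines may optimize concatenation very differently.

// Characters written if every "+=" builds a brand-new string.
function copiedCharsNaive(initialLength, appends) {
    let length = initialLength;
    let copied = 0;
    for (let i = 0; i < appends; i++) {
        length += 1;       // each step appends one character
        copied += length;  // the whole new string is written again
    }
    return copied;
}

// Characters written if all pieces are collected first and copied once.
function copiedCharsOnce(initialLength, appends) {
    return initialLength + appends;
}

console.log(copiedCharsNaive(5, 1000)); // 505500
console.log(copiedCharsOnce(5, 1000));  // 1005
```

Even for a thousand appends, the naive model writes about five hundred times more characters, and this gap grows with the number of iterations.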

The JavaScript string concatenation algorithm we discussed above does not claim to be academically accurate. Various implementations of JavaScript engines may use different string handling optimizations and memory handling mechanisms.

But you should not count on the fact that your code will always run in such engines.

For example, in the latest version of Google Chrome at the time of this writing, string concatenation worked as shown in the screenshots above. So the purpose of this article is to show how to work with strings more efficiently, regardless of how it is implemented by default.

Definitely, we should find a way to do exactly what we need: concatenate two strings using a single operation. Many other programming languages that also use immutable strings, like Java or Go, have a tool for this, such as StringBuilder in Java or strings.Builder in Go. This is a helper object that allows you to construct a string from the elements of an array or from another mutable object. However, JavaScript does not have this built-in feature. Thus, we are here today to return to the beginning and fix this flaw.

You can write the same string in a different way:



let hello = ["Hello"];

This is not a string, but an array that contains a string. Unlike strings, arrays are mutable and you can change them by adding items. It means that if you run this:



hello.push(" world");

Javascript will just mutate the array by appending the " world" item to the end. This will be done in a single operation after which the array will contain the following:



["Hello"," world"]

This way you can append as many strings as you need to this array at a very low cost. Finally, to create the string from it, you can run the join operation on the array:



hello = hello.join("");
console.log(hello);

After this, the "hello" variable will contain the "Hello world" string. Actually, the join operation also creates an empty string and then copies items from the array to it. However, it only happens once, instead of every time when concatenating strings.

This approach dramatically increases string concatenation speed in a loop. Let's change the loop example to use the array instead of string:



let str = ["Hello"];

console.log("START",new Date().toUTCString());

for (let index=0;index<100000000;index++) {
    str.push("!");
}

str = str.join("");

console.log("END",new Date().toUTCString());
console.log(str.length);

After running this on the same browser, I received the following output:

As you can see, the same result was achieved in 8 seconds, which is about 10 times faster than regular string concatenation.

For JavaScript, we constructed the concept of a custom StringBuilder that can only append strings. As homework, you can extend it and add different methods to "append", "insert" or "remove" strings from an array. You could create a class that encapsulates an array variable and contains functions to manipulate strings in this array and construct the string from it when required.
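
One possible starting point for this homework is sketched below. The class name and its methods are my own choice for illustration, not a standard JavaScript API. Note that the indexes in insert and remove refer to positions of pieces in the internal array, not to character positions in the final string.

```javascript
// A minimal StringBuilder sketch. The names are hypothetical.
class StringBuilder {
    constructor(initial = "") {
        // Pieces are stored in a mutable array instead of a string
        this.parts = initial === "" ? [] : [String(initial)];
    }

    // Add a piece to the end
    append(text) {
        this.parts.push(String(text));
        return this;
    }

    // Insert a piece before the piece with the specified index
    insert(index, text) {
        this.parts.splice(index, 0, String(text));
        return this;
    }

    // Remove the piece with the specified index
    remove(index) {
        this.parts.splice(index, 1);
        return this;
    }

    // Build the final string in a single join operation
    toString() {
        return this.parts.join("");
    }
}

const builder = new StringBuilder("Hello");
builder.append(" world").append("!").insert(1, ",").remove(3);
console.log(builder.toString()); // Hello, world
```

Each method returns the builder itself, so calls can be chained, and the string is constructed only when toString is called.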

When adding elements to an array, it is important to keep in mind the existing limits on the number of array elements. If you do not take them into account, you may encounter the "RangeError: Invalid array length" error. You can learn more about the limits here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Errors/Invalid_array_length.

If the number of strings to be added in the loop exceeds these limits, you will have to periodically flush the array into temporary string buffers and then merge these buffers.
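
A rough sketch of such periodic flushing could look like this. The helper name and the maxParts threshold are assumptions made for this example. Choose the threshold based on your data, and keep in mind that the final string has its own maximum length too.

```javascript
// Hypothetical helper: collects pieces in an array and flushes the array
// into a temporary string buffer whenever it reaches maxParts items.
function createChunkedBuilder(maxParts = 100000) {
    let parts = [];
    const buffers = [];

    // Join the collected pieces into one buffer and start a new array
    const flush = () => {
        if (parts.length > 0) {
            buffers.push(parts.join(""));
            parts = [];
        }
    };

    return {
        append(text) {
            parts.push(text);
            if (parts.length >= maxParts) {
                flush();
            }
        },
        // Flush the remaining pieces and merge all buffers
        toString() {
            flush();
            return buffers.join("");
        }
    };
}

const chunked = createChunkedBuilder(3);
for (const piece of ["a", "b", "c", "d", "e", "f", "g"]) {
    chunked.append(piece);
}
console.log(chunked.toString()); // abcdefg
```

The array of pieces never holds more than maxParts items, and the number of buffers stays much smaller than the number of appended strings.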

To help you to work with strings even more efficiently, there are more great string handling algorithms available.

One of the fastest of them is based on a data structure called "Rope". It was invented to efficiently handle operations on huge strings: https://en.wikipedia.org/wiki/Rope_(data_structure). This is more complex than the method discussed above, but you can start by reusing one of the JavaScript implementations of the Rope in your projects:

https://github.com/component/rope
https://github.com/josephg/jumprope

Thus, by changing just three lines of code, you can significantly increase the performance of your data processing pipeline. You can use this method when building strings in a loop from external data streams of variable size in JavaScript. Just add strings to an array one by one and finally join them to a string before output. Other programming languages recommend using internal StringBuilder or StringBuffer objects for string concatenation.

In my practice, I had a client whose website was experiencing slowdowns due to inefficient string handling, which he attempted to resolve by caching content in CloudFlare. He also seriously considered moving to AWS to increase data throughput and resolve these issues. But a code review was enough to fix it.

Good luck and happy coding guys!

Feel free to connect and follow me on social networks where I publish announcements about my upcoming software, articles, similar to this one and other software development news:

LinkedIn: https://www.linkedin.com/in/andrey-germanov-dev/
Facebook: https://web.facebook.com/AndreyGermanovDev
Twitter: https://twitter.com/GermanovDev

My online services website: https://germanov.dev

Frontend vs Backend (To Go or not to Go)

Andrey Germanov — Mon, 21 Nov 2022 08:03:42 +0000

For the last few weeks, I have been trying to speed up JavaScript. It could sound funny and weird, because many others, like Google, have been doing it for years. However, I did not delve into this too deeply. I've just investigated how to speed up my SmartShape JavaScript library (https://www.npmjs.com/package/smart_shape) to load and manipulate large interactive vector shapes in real time on a web page. As an example for testing, I selected the "Natural Earth" database of countries - https://www.naturalearthdata.com/downloads/. It contains the countries' borders at 3 different scales: 1:110m, 1:50m and 1:10m. To make things simple, I exported it to GeoJSON format. So, the task is to load all countries from GeoJSON and convert them to SmartShape objects, then assemble all countries into a single world map and convert this map to an SVG vector graphics file.

The input is an excerpt of the Natural Earth countries database that consists of 684 GeoJSON files. It's about 28MB in total size. Each GeoJSON file contains one or more polygons. Each polygon is defined by points in geographic coordinates (latitude and longitude). The whole database for the experiment consists of 6188 polygons and 658287 points. So the goal is to load all these polygons into a web browser, convert their geographic coordinates to screen coordinates, assemble them into a SmartShape object as a single world map, then scale it to the specified width and height and render an SVG file from this data.
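
To give an idea of what the coordinate conversion step means, here is a minimal sketch. It assumes the simplest equirectangular projection, which is my assumption for this example. It is not necessarily what SmartShape or the Go backend does, and the function names are hypothetical. GeoJSON stores each point as [longitude, latitude].

```javascript
// Convert a geographic point to screen coordinates.
// Assumption: plain equirectangular projection of the whole world
// onto a width x height rectangle.
function toScreen(lon, lat, width, height) {
    const x = ((lon + 180) / 360) * width;  // -180..180 -> 0..width
    const y = ((90 - lat) / 180) * height;  // 90..-90   -> 0..height
    return [x, y];
}

// Convert all points of a polygon given as GeoJSON-style [lon, lat] pairs
function projectPolygon(points, width, height) {
    return points.map(([lon, lat]) => toScreen(lon, lat, width, height));
}

const polygon = [[-180, 90], [0, 0], [180, -90]];
console.log(projectPolygon(polygon, 360, 180));
// [ [ 0, 0 ], [ 180, 90 ], [ 360, 180 ] ]
```

This has to be done for every one of the 658287 points, which is why the speed of such simple arithmetic in a loop matters so much here.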

Actually this job was done successfully a week ago and I shared the resulting SVG file in the previous post on Linkedin https://www.linkedin.com/feed/update/urn:li:activity:6997894438991626240/. The Javascript code did all this loading and conversion work in 6 minutes and 21 seconds.

However, this process should not be done only once. It should work in real time, for example, when the user resizes the world map with the mouse or adds and removes countries. Therefore, this timing is unacceptable, and I tried different optimizations until I finally decided to use Go. This is my favorite programming language for web backends and it performed really well. The Go code completed the same task on the same computer in less than a second (0.82 seconds, to be precise), which is about 465 times faster than JavaScript. Go is wonderful, I always knew that. Still, the idea of doing everything in the frontend, without exchanging data between the backend and the browser, looked very attractive for this task. Unfortunately, in-browser JavaScript is TOOOO SLOOOOWWW for this. So, even if the Internet connection is extremely slow when the generated SVG is sent from the Go backend to the browser, it will still work many times faster than an in-browser solution. I think that it is possible to optimize the JS code to perform slightly better. However, it will never be as fast as Go. Finally, I think that using a single tool for everything is not an option. Each task should be handled with the tool that works best for it.

As a result, the frontend library has now become a full-stack solution. SmartShape now has an additional option for huge shapes: instead of rendering the SVG drawing of a shape by itself, it can request the rendering from an external source (like a backend web service) and then display and manipulate the pre-rendered shape in the web browser. My research also has some practical output. Let me introduce a new online service, which can be used to generate different vector maps and save them to SVG:

https://maps.germanov.dev

The source code of the backend service that is used to generate SVG maps is available here: https://github.com/AndreyGermanov/mapbuilder_backend

Generated SVGs contain only shapes without any styling. But it's easy to open them in any text editor and style them using CSS as needed. In addition, any visual editor that can open SVG files can be used for this.
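
If you prefer to add the styling programmatically, a small sketch like the following could do it. It assumes that the shapes in the file are path elements, which I did not verify for every generated map, and the addSvgStyle helper is hypothetical. It simply inserts a style block right after the opening svg tag.

```javascript
// Hypothetical helper: insert a <style> block after the opening <svg> tag.
// Returns the text unchanged if no <svg> tag is found.
function addSvgStyle(svgText, css) {
    const match = svgText.match(/<svg\b[^>]*>/);
    if (!match) {
        return svgText;
    }
    const insertAt = match.index + match[0].length;
    return svgText.slice(0, insertAt) +
        `<style>${css}</style>` +
        svgText.slice(insertAt);
}

// Tiny stand-in for a generated map
const svg = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0L10 10"/></svg>';
console.log(addSvgStyle(svg, "path { fill: #cde; stroke: #345; }"));
```

The same CSS can of course be typed by hand in a text editor. The helper is only convenient when you generate many maps.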

Also, in the near future I will add a feature to create and design vector maps visually in SmartShape Studio: https://shapes.germanov.dev.


Simple way to add custom context menus to web pages.

Andrey Germanov — Wed, 05 Oct 2022 19:38:27 +0000

In this tutorial, I'll explain how to add context menus to elements of a web page using my newly developed component. This menu can appear when a user right-clicks on an element of a web page or just puts the mouse cursor on that element. This is an example of how the menus, created using this component work in practice.

In the next sections I will show how to set up the menu in basic mode and then customize it.

Install
Create basic menu
React to user actions
Add images to menu items
Styling the menu
Menu API
Conclusion

Install

To add the component to your web page, download the context_menu.umd.cjs script and include it in the page:



<script src="context_menu.umd.cjs"></script>

Also, if you use some JavaScript framework, you can install the component using NPM:



npm install simple_js_menu

and then import the Menus factory object into your JavaScript code:



import {Menus} from "simple_js_menu";

Create basic menu

When you install the component using one of the methods described above, you get a Menus factory object that you can use to create context menus and connect them to objects on your web page. Let's assume that we have some <div> element on a web page and want to have a context menu that appears when the user right-clicks on it. The next example shows how to implement this:



<html>
<head>
    <title>Context Menu Demo</title>
    <script src="https://code.germanov.dev/context_menu/context_menu.umd.cjs"></script>
</head>
<body>
<div id="myDiv">My element</div>

<script>
    // obtain a link to the element 
    const div = document.getElementById("myDiv");
    // define menu items
    const items = [
        {id:"item1", title: "Item 1"},
        {id:"item2", title: "Item 2"},
        {id:"item3", title: "Item 3"}
    ];
    // create menu with specified items
    // and bind it to specified DIV element
    const menu = Menus.create(items,div);
</script>
</body>
</html>

When you open this page, there will be "My element" text inside the DIV. If you right-click on it, you'll see the context menu with three items, as displayed below:

This is how it basically works. To initialize the menu, we executed the following code:



const menu = Menus.create(items,div)

The Menus.create method receives the items and div arguments, constructs a menu and returns the menu object.

The items argument defines an array of menu items. Each item is an object with up to 3 fields: id - unique ID of the menu item, title - the text inside the menu item, image - optional URL of an image which appears on the left side of the menu item.
The div argument defines an HTML element to which this menu connects. It can be any HTML element, not only a div.

Also, the create method has a third optional argument, eventName, which defines the event that should happen on the div to display the menu. By default, eventName equals contextmenu, which shows the menu when the user right-clicks on the div, but you can change it to show the menu when the user hovers over the element, for example. The standard JavaScript mouse events are supported (mouseover, mousedown, mouseup, mousemove, click and so on). For example, this code will show the menu if the user moves the cursor to the area of the element:



const menu = Menus.create(items,div,'mouseover')

So, using Menus.create we created a basic context menu and received a 'menu' object. In the next sections I will show how you can extend features of the menu using that menu object.

React to user actions

Obviously, the next step is to perform some actions when the user selects items from the menu. In most cases, it means that we have to respond to click events. You can, however, react to any events that the user can initiate using the mouse. To react to user actions with the menu, use the menu.on(eventName,handler) method, where eventName is the event to which you want to react. For example, you can add this to the previous code:



    menu.on('click', (event) => {
        switch (event.itemId) {
            case "item1":
                console.log("User clicked first item");
                console.log("Mouse cursor position X:",
                    event.cursorX,"Y:",event.cursorY);
                break;
            case "item2":
                console.log("User clicked second item");
                console.log("Mouse cursor position X:",
                    event.cursorX,"Y:",event.cursorY);
                break;
            case "item3":
                console.log("User clicked third item");
                console.log("Mouse cursor position X:",
                    event.cursorX,"Y:",event.cursorY);
                break;
        }
    })

Then, when you run and click any of the menu items, you'll see similar lines in the Javascript console:



User clicked first item Mouse cursor position X: 78 Y: 17
User clicked second item Mouse cursor position X: 68 Y: 16
User clicked third item Mouse cursor position X: 64 Y: 18
User clicked third item Mouse cursor position X: 13 Y: 16

Let's discuss the on method more. Basically, it works the same as the 'on' method in jQuery. It receives as arguments a JavaScript event name and a callback function that will be called when this event occurs on the menu object. The callback function receives an event argument that can be used to discover the properties of the triggered event. This is a MouseEvent object extended with the following properties:

itemId - ID of menu item that triggered this event
cursorX - the X position of the mouse cursor when the user displays the menu
cursorY - the Y position of the mouse cursor when the user displays the menu
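
When the number of menu items grows, the switch statement can be replaced with a table of handlers keyed by itemId. The sketch below is only one possible way to organize this. The handlers table and the handleMenuClick function are hypothetical names, and the handler relies only on the itemId, cursorX and cursorY properties listed above.

```javascript
// Hypothetical table of handlers, one per menu item ID
const handlers = new Map([
    ["item1", (event) => `first item at ${event.cursorX},${event.cursorY}`],
    ["item2", (event) => `second item at ${event.cursorX},${event.cursorY}`],
    ["item3", (event) => `third item at ${event.cursorX},${event.cursorY}`]
]);

// Find the handler for the clicked item and run it.
// Returns null for items that have no handler.
function handleMenuClick(event) {
    const handler = handlers.get(event.itemId);
    return handler ? handler(event) : null;
}

// In a web page, this function would be connected to the menu like this:
// menu.on('click', (event) => console.log(handleMenuClick(event)));

// Here it is called with a plain object that imitates the event
console.log(handleMenuClick({ itemId: "item1", cursorX: 78, cursorY: 17 }));
// first item at 78,17
```

This way, adding a new menu item requires only one new entry in the table.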

So, this way, using a single method, you can process clicks from all menu items. Furthermore, you can react not only to clicks, but also to other mouse events: mouseover, mouseout, dblclick, mousedown, mouseup, mousemove:



menu.on('mouseover', (event) => {
    // ...
});

menu.on('mouseout', (event) => {
    // ...
});

menu.on('mousedown', (event) => {
    // ...
});

Add images to menu items

Menu items may have images. To add an image to an item, you need to specify the 'image' field for this item when defining the items array. Let's redefine the items array this way:



    const items = [
        {
            id:"item1",
            title: "Batman",
            image:"https://code.germanov.dev/context_menu/assets/batman.svg"
        },
        {
            id:"item2",
            title: "Hacker",
            image:"https://code.germanov.dev/context_menu/assets/hacker.svg"
        },
        {
            id:"item3",
            title: "Santa",
            image:"https://code.germanov.dev/context_menu/assets/santa.svg"
        }
    ];

After you run the previous menu example with menu items defined this way, you will see this:

You need not worry about the size of the images, because they will be automatically resized to the height of the text and aligned properly.

Styling the menu

By default the menu is a gray panel with a gray border. Menu items are displayed as transparent DIVs with black text in the default font and optional images. However, you can redefine the styling by providing your own CSS class to any or all of these parts using the following methods:

menu.setPanelClass(className) - override the CSS class of the menu panel DIV.
menu.setItemClass(className, id) - override the CSS class of menu item DIVs. If the id parameter is specified, override the CSS class only for the item with the specified ID.
menu.setTextClass(className, id) - override the CSS class of the text of menu items. If the id parameter is specified, override the CSS class only for the text of the item with the specified ID.
menu.setImageClass(className, id) - override the CSS class of the images of menu items. If the id parameter is specified, override the CSS class only for the image of the item with the specified ID.

Let's add some CSS classes to redesign the menu:



.panel {
    background-color: #454d55;
    border-width:0;
    color: white;
    box-shadow: 5px 5px #88888855;
    border-radius: 10px;
}
.item {
    border-bottom-style: outset;
    padding:3px;
    border-bottom-width: 1px;
    font-weight:bold;
    font-size:15px;
    text-transform: uppercase;
}
.item:hover {
    background-color: yellow;
    color: black;
}

.last_item {
    padding:3px;
    border-bottom-width: 0;
    font-weight:bold;
    font-size:15px;
    text-transform: uppercase;
}

.last_item:hover {
    background-color: yellow;
    color: black;
}

Then, let's apply these styles to various parts of the menu:



// Set class for menu panel
menu.setPanelClass("panel");
// Set class for all menu items
menu.setItemClass("item");
// Set specific class for last menu item (remove bottom border)
menu.setItemClass("last_item","item3");
// Reset default styles of text to use text color from panel
menu.setTextClass(" ");

If you apply these styles and add this Javascript code to the previous menu, you'll get the following result:

Notice that I also added :hover styles for items, because it's required to show selected item status when the mouse enters the menu item.

Menu API

In addition to the basic features described above, the Menu object has other helpful properties and methods. For example, the panel property contains the generated menu panel HTML node. The .setId(id) method can be used to set the ID of this HTML node. The .addItem(id,title,image) method can be used to add new items to a menu. The .removeItem(id) method removes the menu item with the specified ID.
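
As a small illustration of these two methods, the sketch below replaces one menu item with another. The replaceItem function is a hypothetical helper, and fakeMenu is only a stand-in object that lets the example run outside a browser. In a real page you would pass the object returned by Menus.create instead.

```javascript
// Replace a menu item with another one, using only the
// addItem(id,title,image) and removeItem(id) methods
function replaceItem(menu, oldId, newItem) {
    menu.removeItem(oldId);
    menu.addItem(newItem.id, newItem.title, newItem.image);
}

// Stand-in object, used only so that this sketch runs outside a browser.
// It imitates the two methods and remembers the items in a Map.
const fakeMenu = {
    items: new Map(),
    addItem(id, title, image) {
        this.items.set(id, { title, image });
    },
    removeItem(id) {
        this.items.delete(id);
    }
};

fakeMenu.addItem("item3", "Item 3");
replaceItem(fakeMenu, "item3", { id: "item4", title: "Item 4" });
console.log([...fakeMenu.items.keys()]); // [ 'item4' ]
```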

You can find the full description of all properties and methods of the Menu object in the API documentation.

Conclusion

There are many wonderful UX libraries like MaterialUI, Ant Design or jQuery UI that contain a lot of widgets, including menus. However, if you integrate one of them into your project, it could increase the size of your site by up to several megabytes. Also, you can face integration problems between the CSS styles of your site and the CSS styles of the library. So, if you have a simple project and you only need a context menu, you can use my new component, and I would be happy if you found this tutorial useful.

The idea for creating this menu component came from Aneta Chwała. She is a frontend developer, the founder of Rock JavaScript Facebook group where I have the honor of being a group expert: https://www.facebook.com/groups/251411400345198

Aneta hosts a YouTube channel where you can learn JavaScript and other frontend technologies:

https://www.youtube.com/c/RockJavaScript.

You can find the full source code for this tutorial here:

https://github.com/AndreyGermanov/context_menu/blob/main/tests/demo_prod.html

GitHub repository of this component: https://github.com/AndreyGermanov/context_menu

Context menu component on NPM: https://www.npmjs.com/package/simple_js_menu

Try these context menus in SmartShape Studio: https://shapes.germanov.dev . Here you can use context menus to work with vector shapes, as shown in the animation at the beginning of this article.

Happy coding guys!

A simple way to extract all detected objects from image and save them as separate files using YOLOv8.2 and OpenCV

Table of Contents

Introduction

Sample image

Detect objects using YOLOv8

More about different YOLOv8 models

Run the model to detect objects

Parse detection results

Extract objects with background

Extract objects without background

Conclusion

Teeth caries detection using YOLOv8 neural network

Table of contents

Introduction

Prepare the dataset

The source dataset format

The YOLOv8 dataset format

Convert the dataset

Create the YOLOv8 dataset folder structure

Generate the data.yaml file

Copy images from source to destination datasets

Convert annotations

Train the caries detector model

Detect caries on custom image

Create a web-service to detect caries

Conclusion

Export Segment Anything neural network to ONNX: the missing parts

Table of Contents

Introduction

What is a problem ?

Diving to the SAM model structure

Export SAM to ONNX - the right way

Export the image encoder

Export the mask decoder

Produce image segmentation masks using ONNX

Preprocess input image

Generate embeddings from input image

Encode the prompt

Run the mask decoder

Post-process and visualize segmentation mask

Conclusion

How to implement instance segmentation using YOLOv8 neural network

Table of Contents

Introduction

Getting started with YOLOv8 segmentation

Train the YOLOv8 model for image segmentation

Using YOLOv8 segmentation model in production

Export the YOLOv8 segmentation model to ONNX

Load the model using ONNX

Prepare the input

Run the model

Process the output

Join bounding boxes and masks

Parse the combined output

Process segmentation masks

Calculate bounding polygons

Draw bounding polygons on the image

Create a segmentation web application

Create a backend

Create a frontend

Conclusion

How to detect objects in videos in a web browser using YOLOv8 neural network and JavaScript

Table of Contents

Introduction

Adding a video component to a web page

Capture video frames for object detection

Detect objects in video

Prepare the input

Run the model

Process the output

Draw bounding boxes

Running several tasks in parallel in JavaScript

Running the model in a background thread

Conclusion

How to create YOLOv8-based object detection web service using Python, Julia, Node.js, JavaScript, Go and Rust

Table of contents

Introduction

YOLOv8 deployment options

Export YOLOv8 model to ONNX

Generate the `data.yaml` file