How to create YOLOv8-based object detection web service using Python, Julia, Node.js, JavaScript, Go and Rust

Table of contents

Introduction
YOLOv8 deployment options
Export YOLOv8 model to ONNX
Explore object detection on image using ONNX
    Prepare the input
    Run the model
    Process the output
        Intersection over Union
        Non-maximum Suppression
Create a web service on Python
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Julia
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Node.js
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on JavaScript
    Setup the project
    Prepare the input
    Run the model and process the output
Create a web service on Go
    Setup the project
    Prepare the input
    Run the model
    Process the output
Create a web service on Rust
    Setup the project
    Prepare the input
    Run the model
    Process the output
Conclusion

Introduction

This is the second part of my article about the YOLOv8 neural network. In the previous article, I provided a practical introduction to this model and its common API. Then I showed how to create a web service that detects objects in images using Python and the official YOLOv8 library, which is based on PyTorch.

In this article, I am going to show how to work with the YOLOv8 model at a low level, without PyTorch and the official API. This opens up a lot of new deployment opportunities. Using the concepts and examples of this post, you will be able to create AI-powered object detection services that use ten times fewer resources, and you will be able to create these services not only in Python, but in most other programming languages. In particular, I will show how to create the YOLOv8-powered web service in Julia, Node.js, JavaScript, Go and Rust.

As a base, we will use the web service developed in the previous article, which is available in this repository. We will just rewrite the backend of this web service in different languages. That is why it's required to read the first article before continuing with this one.

YOLOv8 deployment options

The YOLOv8 neural network was initially created using the PyTorch framework and is distributed as a set of ".pt" files. We used the Ultralytics API to train these models or make predictions with them. To run them, you need an environment with Python and PyTorch.

PyTorch is a great framework to design, train and evaluate neural network models. In addition, it has tools to prepare or even generate the datasets used to train the models, and many other great utilities. However, we do not need all this in production. If we talk about YOLOv8, then all that you need in production is to run the model with an input image and receive the resulting bounding boxes. However, YOLOv8 is implemented in Python. Does it mean that all programmers who want to use this great object detector must become Python programmers? Does it mean that they must rewrite their applications in Python or integrate them with Python code? Fortunately not. The Ultralytics API has a great export function to convert any YOLOv8 model to a format that can be used by external applications.

The following formats are supported at the moment:

Format          format argument
TorchScript     torchscript
ONNX            onnx
OpenVINO        openvino
TensorRT        engine
CoreML          coreml
TF SavedModel   saved_model
TF GraphDef     pb
TF Lite         tflite
TF Edge TPU     edgetpu
TF.js           tfjs
PaddlePaddle    paddle

For example, CoreML is a neural network format that can be used in iOS applications running on iPhone.
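A model could be exported to CoreML with the same export call that we will use for ONNX in the next section. This is only a sketch, assuming the Ultralytics package is installed; CoreML export may pull in extra dependencies such as coremltools:

from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.export(format="coreml")  # writes a CoreML model file next to the ".pt" file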

Using the links in this table, you can read an overview of each of these formats.

The most interesting of them for us today is ONNX, a lightweight runtime created by Microsoft that can be used to run neural network models on a wide range of platforms and programming languages. This is not a framework, but just a shared library. It's only about 16 MB in size for Linux, yet it has interface bindings for most programming languages, including Python, PHP, JavaScript, Node.js, C++, Go and Rust. It has a simple API, and if you have written ONNX code to run a model in one programming language, it will not be difficult to rewrite it and use it in another, as we will see today.

To follow the sections starting from this one, you need to have Python and Jupyter Notebook installed.

Export YOLOv8 model to ONNX

First, let's load the YOLOv8 model and export it to ONNX format to make it usable. Run a Jupyter notebook and execute the following code in it.

from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.export(format="onnx")

In the code above, you loaded the middle-sized YOLOv8 model for object detection and exported it to the ONNX format. This model is pretrained on the COCO dataset and can detect 80 object classes.

After running this code, you should see the exported model in a file with the same name and the .onnx extension. In this case, you will see the yolov8m.onnx file in the folder where you ran this code.

Before writing a web service based on ONNX, let's discover how this library works in Jupyter Notebook to understand the main concepts.

Explore object detection on image using ONNX

Now that you have a model, let's use ONNX to work with it. For simplicity, we will start with Python, because we already have a Python web application that uses the PyTorch and Ultralytics APIs. So, it will be easier to move it to ONNX.

Install the ONNX runtime library for Python by running the following command in your Jupyter notebook:

!pip install onnxruntime

and import it:

import onnxruntime as ort

We set the ort alias for it. Remember this abbreviation, because in other programming languages you will often see ort instead of "ONNX runtime".

The ort module is the root of the ONNX API. The main object of this API is the InferenceSession, which is used to instantiate a model and run predictions on it. Model instantiation works very similar to what we did before with Ultralytics:

model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])

Here we loaded the model, but from the ".onnx" file instead of ".pt". And now it's ready to run.

And this is the moment when the similarities between Ultralytics and ONNX end. If you remember, with Ultralytics you just ran outputs = model.predict("image_file") and received the result. The smart predict method did the following for you automatically:

  1. Read the image from file
  2. Convert it to the format of the YOLOv8 neural network input layer
  3. Pass it through the model
  4. Receive the raw model output
  5. Parse the raw model output
  6. Return structured information about detected objects and their bounding boxes

The ONNX session object has a similar method, run, but it implements only steps 3 and 4. Everything else is up to you, because ONNX does not know that this is the YOLOv8 model. It does not know which input this neural network expects to get or what the raw output of this model means. This is a universal API for any kind of neural network; it does not know about concrete use cases like object detection on images.

In terms of ONNX, the neural network is a black box that receives a multidimensional array of float numbers as an input and transforms it to another multidimensional array of numbers. It does not know which numbers should be in the input or what the numbers in the output mean. So, what can we do with it?


Fortunately, things are not that bad, and there is something we can discover. The shapes of the input and output layers of a neural network are fixed; they are defined when the neural network is created, and information about them exists in the model.

The ONNX session object has a helpful method get_inputs() to get information about the inputs that this model expects to receive, and get_outputs() to get information about the outputs that the model returns after processing the inputs.

Let's get the inputs first:

inputs = model.get_inputs();
len(inputs)
1

Here we got the array of inputs and displayed its length. The result is obvious: the network expects to get a single input. Let's get it:

input = inputs[0]

The input object has three fields: name, type and shape. Let's get these values for our YOLOv8 model:

print("Name:",input.name)
print("Type:",input.type)
print("Shape:",input.shape)

And this is the output that you will get:

Name: images
Type: tensor(float)
Shape: [1, 3, 640, 640]

This is what we can discover from this:

  • The name of the expected input is images, which is obvious: the YOLOv8 model receives images as input.
  • The type of the input is a tensor of float numbers. The tensor can have many definitions, but from the practical point of view that matters for us now, it is a multidimensional array of float numbers. So, we can deduce that we need to convert our image to a multidimensional array of float numbers.
  • The shape shows the dimensions of this tensor. Here, you see that this array should be four-dimensional: a single image (1) that contains 3 matrices of 640x640 float numbers. What numbers should be in these matrices? The color components. As you probably know, each color pixel has Red, Green and Blue components, and each component can have a value from 0 to 255. Also, you can deduce that the image must be 640x640 in size. Finally, there should be 3 matrices: one 640x640 matrix that contains the red component of each pixel, one for green and one for blue.

Now you have enough observations to understand what you need to do in the code to prepare the input data.
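Before moving to the libraries used below, it may help to see this layout spelled out in plain Python. This is an illustration-only sketch of building the three 640x640 matrices pixel by pixel (the NumPy-based code in the next section does the same job far more efficiently; cat_dog.jpg is the sample image used below):

from PIL import Image

img = Image.open("cat_dog.jpg").convert("RGB").resize((640, 640))
red, green, blue = [], [], []
for y in range(640):
    red_row, green_row, blue_row = [], [], []
    for x in range(640):
        r, g, b = img.getpixel((x, y))  # color components in the 0-255 range
        red_row.append(r / 255.0)       # scaling to 0-1 is explained below
        green_row.append(g / 255.0)
        blue_row.append(b / 255.0)
    red.append(red_row)
    green.append(green_row)
    blue.append(blue_row)

input_tensor = [[red, green, blue]]     # nested lists with shape (1, 3, 640, 640)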


Prepare the input

We need to load an image, resize it to 640x640, extract the Red, Green and Blue components of each pixel and construct 3 matrices of intensities of the appropriate colors.

Let's do it using the Pillow Python package that we already used before. Ensure that it's installed:

!pip install pillow

As an example, we will use the cat_dog.jpg image that we used in the previous article:

(the cat_dog.jpg image: a photo of a dog and a cat)

Let's load and resize it:

from PIL import Image

img = Image.open("cat_dog.jpg")
img_width, img_height = img.size;
img = img.resize((640,640))

First, you imported the Image module from the Pillow library. Then you created the img object from the cat_dog.jpg file. Then we saved the original size of the image to the img_width and img_height variables, which will be needed later. Finally, we resized it, providing the new size as a (640,640) tuple.

Now we need to extract each color component of each pixel and construct 3 matrices from them. But here we have one thing that can lead to inconsistencies later. Some images have four color channels: Red, Green, Blue and Alpha. The alpha channel describes the transparency of a pixel. We do not need the Alpha channel in the image for YOLOv8 predictions. Let's remove it:

img = img.convert("RGB");

An image with an Alpha channel has the "RGBA" color model. With this line, you converted it to "RGB" and, this way, removed the alpha channel.

Now it's time to create the 3 matrices of color channel values. We could do this manually, but Python has great interoperability between libraries. The NumPy library, which is usually used to work with multidimensional arrays, can load the Pillow image object as an array as simply as this:

import numpy as np

input = np.array(img)

Here, you imported NumPy and just loaded the image to the input NumPy array. Let's see the shape of this array now:

input.shape
(640, 640, 3)

Almost fine, but the dimensions are in the wrong order. We need to put 3 at the beginning. The transpose function can switch the dimensions of a NumPy array:

input = input.transpose(2,0,1)
input.shape
(3,640,640)

The numbering of dimensions starts from 0. So, we had 0=640, 1=640, 2=3. Then, using the transpose function, we moved dimension number 2 to the first place and received the shape (3,640,640).

But we need to add one more dimension to the beginning to make it (1,3,640,640). The reshape function can do this:

input = input.reshape(1,3,640,640)

Now we have the correct input shape, but if you try to see the contents of this array, for example the red component of the first pixel:

input[0,0,0,0]

you'll probably see the integer:

71

but float numbers are required. Moreover, as a rule, the numbers for machine learning must be scaled, e.g. to a range from 0 to 1. Knowing that a color value can be in a range from 0 to 255, we can scale all pixels to the 0-1 range by dividing them by 255.0. NumPy allows doing this in a single line of code:

input = input/255.0

input[0,0,0,0]
0.2784313725490196

In the code above, you divided all numbers in the array and displayed the first of them: the red color component intensity of the first pixel. So, this is how the input data should look.

Run the model

Now, before running the prediction process, let's see which output the YOLOv8 model should return. As said above, this can be done using the get_outputs() method of the ONNX session object. The return value of this method has the same type as the value of get_inputs(), because, as I said before, "the only work of a neural network is to transform one array of numbers provided as an input to another array of numbers". So, let's see the form of the output of the pretrained YOLOv8 model:

outputs = model.get_outputs()
output = outputs[0]
print("Name:",output.name)
print("Type:",output.type)
print("Shape:",output.shape)
Name: output0
Type: tensor(float)
Shape: [1, 84, 8400]

ONNX is a universal platform to run neural networks of any kind. That is why it assumes that a network can have many inputs and many outputs, and it accepts an array of inputs and an array of outputs, even if these arrays have only a single item. YOLOv8 has a single output, which is the first item of the outputs object.

Here you see that the output has the name output0, it also has the form of a tensor of float numbers, and the shape of this output is [1,84,8400], which means that this is a single 84x8400 matrix nested in a single-item array. In practice, it means that the YOLOv8 network returns 8400 bounding boxes, and each bounding box has 84 parameters. It's a little bit ugly that each bounding box is a column here, not a row; this is a technical consequence of how the network is implemented. I think it would be better to transpose it to 8400x84, so it becomes clear that there are 8400 rows that match detected objects and that each row is a bounding box with 84 parameters.

We will discuss why there are so many parameters for a single bounding box later. First, we should run the model to get the data for this output. We have everything for this now.

To run a prediction with the YOLOv8 model, we need to execute the run method, which has the following signature:

model.run(output_names,inputs)
  • output_names - the array of names of the outputs that you want to receive. For the YOLOv8 model, it will be an array with a single item.
  • inputs - the dictionary of inputs that you pass to the network in the format {name: tensor}, where name is the name of the input and tensor is the image data array that we prepared before.

To run the prediction for the data that you prepared, you can run the following:

outputs = model.run(["output0"], {"images":input})
len(outputs)
1

As you saw earlier, the only output of this model has the name output0, and the name of the only input is images. The data tensor for the input is what you prepared in the input variable.

If everything went well, it will display that the length of the received outputs array is 1, which means that you have only a single output. However, if you receive an error that says that the input must be in float format, then convert it to float32 using the following line:

input = input.astype(np.float32)

and then run again.

Now we are close to the most interesting part of the work: processing the output.

Process the output

There is only a single output, so we can extract it from outputs:

output = outputs[0]
output.shape
(1, 84, 8400)

So, as you see, it returned the output of the correct shape. As the first dimension has only a single item, we can just get it:

output = output[0]
output.shape
(84, 8400)

Now it's a matrix with 84 rows and 8400 columns. As I said before, it has a transposed form which is not very convenient to work with, so let's transpose it:

output = output.transpose()
output.shape
(8400, 84)

Now it's clearer: 8400 rows with 84 parameters each. 8400 is the maximum number of bounding boxes that the YOLOv8 model can detect, and it returns 8400 rows for any image regardless of how many objects are really detected in it, because the output of a neural network is fixed and defined during the network design. It can't be variable. So, it returns 8400 rows every time, but most of these rows contain just garbage. How do we detect which of these rows have meaningful data and which of them are garbage? To do that, we need to look at the 84 parameters that each of these rows has.

The first 4 elements are the coordinates of the bounding box, and all others are the probabilities of all object classes that this model can detect. The pretrained model that you use in this tutorial can detect 80 object classes, which is why each bounding box has 84 parameters: 4+80. If you use another model that is, for example, trained to detect 3 object classes, then it will have 7 parameters in a row: 4+3.
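To make this layout concrete, here is a tiny sketch with made-up numbers for a hypothetical 3-class model, where each row would have 4+3=7 values:

row = [320.0, 240.0, 100.0, 80.0,   # xc, yc, w, h of the bounding box
       0.02, 0.91, 0.05]            # probabilities of classes 0, 1 and 2
xc, yc, w, h = row[:4]
class_probs = row[4:]
class_id = class_probs.index(max(class_probs))  # 1 in this example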

For example, let's display row number 0:

row = output[0]
print(row)
[     5.1182      8.9662      13.247      19.459  2.5034e-06  2.0862e-07  5.6624e-07  1.1921e-07  2.0862e-07  1.1921e-07  1.7881e-07  1.4901e-07  1.1921e-07  2.6822e-07  1.7881e-07  1.1921e-07  1.7881e-07  4.1723e-07  5.6624e-07  2.0862e-07  1.7881e-07  2.3842e-07  3.8743e-07  3.2783e-07  1.4901e-07  8.9407e-08
  3.8743e-07  2.9802e-07  2.6822e-07  2.6822e-07  2.3842e-07  2.0862e-07  5.9605e-08  2.0862e-07  1.4901e-07  1.1921e-07  4.7684e-07  2.6822e-07  1.7881e-07  1.1921e-07  8.9407e-08  1.4901e-07  1.7881e-07  2.6822e-07  8.9407e-08  2.6822e-07  3.8743e-07  1.4901e-07  2.0862e-07  4.1723e-07  1.9372e-06  6.5565e-07
  2.6822e-07  5.3644e-07  1.2815e-06  3.5763e-07  2.0862e-07  2.3842e-07  4.1723e-07  2.6822e-07  8.3447e-07  8.9407e-08  4.1723e-07  1.4901e-07  3.5763e-07  2.0862e-07  1.1921e-07  5.9605e-08  5.9605e-08  1.1921e-07  1.4901e-07  1.4901e-07  1.7881e-07  5.9605e-08  8.9407e-08  2.3842e-07  1.4901e-07  2.0862e-07
  2.9802e-07  1.7881e-07  1.1921e-07  2.3842e-07  1.1921e-07  1.1921e-07]

Here you see that this row represents a bounding box with coordinates [5.1182, 8.9662, 13.247, 19.459]. These values are the coordinates of the center of this bounding box, its width and its height:

x_center = 5.1182
y_center = 8.9662
width = 13.247
height = 19.459

Let's slice out these variables from the row:

xc,yc,w,h = row[:4]

All other values are the probabilities that the detected object belongs to each of the 80 classes. So, assuming that array numbering starts from 0, item number 4 contains the probability that the object belongs to class 0 (2.5034e-06), item number 5 contains the probability that the object belongs to class 1 (2.0862e-07), etc.

Now let's remove all the garbage and parse this row to the format that we got in the previous article: [x1,y1,x2,y2,class_label,probability].

To calculate the coordinates of the bounding box corners, you can use the following formulas:

x1 = xc-w/2
y1 = yc-h/2
x2 = xc+w/2
y2 = yc+h/2

but there is a very important reminder: do you remember that we resized the image to 640x640 in the beginning? It means that these coordinates are returned under the assumption that the image has this size. To get the coordinates of this bounding box for the original image, we need to scale them in proportion to the dimensions of the original image. We saved the original width and height in the img_width and img_height variables, so to scale the corners of the bounding box we need to modify the formulas:

x1 = (xc - w/2) / 640 * img_width
y1 = (yc - h/2) / 640 * img_height
x2 = (xc + w/2) / 640 * img_width
y2 = (yc + h/2) / 640 * img_height

Then you need to find the object class with the maximum probability. You could do this in a loop, iterating over items 4 to 83 of this array and selecting the index with the maximum probability value, but NumPy has convenient methods for this:

prob = row[4:].max()
class_id = row[4:].argmax()

print(prob, class_id)
2.503395e-06 0

The first line returns the maximum value of the subarray from item 4 until the end of the row. The second line returns the index of the element with this maximum value. So, here you see that the first probability has the maximum value, which means that this bounding box belongs to class 0.

To replace the class ID with a class label, you need an array of the classes that the model can predict. In the case of this model, these are the 80 classes from the COCO dataset. Here they are:

yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

If you use another, custom-trained model, you can get this array from the YAML file that was used for training. You can read about the YAML files used to train YOLOv8 models in my previous article.
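As a sketch, reading the class names from such a YAML file could look like this (assuming the file is called data.yaml and the pyyaml package is installed; the exact file name and layout depend on how the model was trained):

import yaml

with open("data.yaml") as f:
    names = yaml.safe_load(f)["names"]
# "names" may be a plain list or a mapping like {0: "person", ...}
yolo_classes = list(names.values()) if isinstance(names, dict) else list(names)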

Then you can just get a class label by ID:

label = yolo_classes[class_id]

This is how you should parse each row of YOLOv8 model output.

However, this probability is too low, because 2.503395e-06 = 2.503395 / 1000000 = 0.000002503. So, this bounding box is probably just garbage that should be filtered out. I recommend filtering out all bounding boxes with a probability less than 0.5.

Let's wrap all the row parsing code above in a function, so we can parse any row this way:

def parse_row(row):
    xc,yc,w,h = row[:4]
    x1 = (xc-w/2)/640*img_width
    y1 = (yc-h/2)/640*img_height
    x2 = (xc+w/2)/640*img_width
    y2 = (yc+h/2)/640*img_height
    prob = row[4:].max()
    class_id = row[4:].argmax()
    label = yolo_classes[class_id]
    return [x1,y1,x2,y2,label,prob]

Now you can write code that parses and filters out all rows from the output:

boxes = [row for row in [parse_row(row) for row in output] if row[5]>0.5]
len(boxes)
20

Here I used Python list comprehensions. The inner list comprehension:

[parse_row(row) for row in output]

is used to parse each row and return an array of parsed rows in the format [x1,y1,x2,y2,label,prob],

and then the outer list comprehension filters out the rows whose probability is less than 0.5:

[row for row in [...parsed rows...] if row[5] > 0.5]

After this, len(boxes) shows that only 20 boxes are left after filtering. Much closer to the expected result than 8400, but still too many, because we have an image with only one cat and one dog. Curious what else was detected? Let's show this data:

[261.28302669525146, 95.53291285037994, 461.15666942596437, 313.4492515325546, 'dog', 0.9220365]
[261.16701192855834, 95.61400711536407, 460.9202187538147, 314.0579136610031, 'dog', 0.92195505]
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446]
[260.7873046875, 95.70514416694641, 461.4101188659668, 313.7423722743988, 'dog', 0.9269207]
[139.5556526184082, 169.4101345539093, 255.12585411071777, 314.7275745868683, 'cat', 0.8986903]
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
[139.68495998382568, 169.5753903388977, 255.12413234710692, 315.06962299346924, 'cat', 0.88975877]
[261.1445414543152, 95.70124578475952, 461.0543995857239, 313.6095304489136, 'dog', 0.926944]
[260.9405124664307, 95.77976751327515, 460.99450263977053, 313.57664155960083, 'dog', 0.9247296]
[260.49400663375854, 95.79500484466553, 461.3895306587219, 313.5762457847595, 'dog', 0.9034922]
[139.59658827781678, 169.2822597026825, 255.2673086643219, 314.9018738269806, 'cat', 0.88215613]
[139.46405625343323, 169.3733571767807, 255.28112654685975, 314.9132820367813, 'cat', 0.8780577]
[139.633131980896, 169.65343713760376, 255.49261894226075, 314.88970375061035, 'cat', 0.8653987]
[261.18754177093507, 95.68838310241699, 461.0297842025757, 313.1688747406006, 'dog', 0.9215225]
[260.8274451255798, 95.74608707427979, 461.32597131729125, 313.3906273841858, 'dog', 0.9093932]
[260.5131794929504, 95.89693665504456, 461.3481791496277, 313.24405217170715, 'dog', 0.8848127]
[139.4986301422119, 169.38371658325195, 255.34583129882813, 314.9019331932068, 'cat', 0.836439]
[139.55282192230223, 169.58951950073242, 255.61378440856933, 314.92880630493164, 'cat', 0.87574947]
[139.65414333343506, 169.62119138240814, 255.79856758117677, 315.1192432641983, 'cat', 0.8512477]
[139.86577434539797, 169.38782274723053, 255.5904968261719, 314.77193105220795, 'cat', 0.8271704]

All these boxes have high probabilities, and their coordinates overlap each other. Let's draw these boxes on the image to see why.

The PIL package has the ImageDraw module, which allows drawing rectangles and other figures on top of images. Let's load the image and create a draw object for it:

from PIL import ImageDraw
img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img)

and draw each bounding box on the image using the created draw object in a loop:

for box in boxes:
    x1,y1,x2,y2,label,prob = box
    draw.rectangle((x1,y1,x2,y2),None,"#00ff00")

img

This code draws the green rectangles for each bounding box and displays the resulting image, which will look like this:

(the cat_dog.jpg image with all 20 green bounding boxes drawn on it, overlapping around the dog and the cat)

It draws all 20 boxes on top of each other, so they look like just 2 boxes. As a human, you can see that all these 20 boxes belong to the same 2 objects. However, the neural network is not a human: it thinks that it found 20 different cats and dogs that overlap each other, because it is theoretically possible for different objects in the image to overlap. Perhaps it sounds crazy, but this is how it works.

It's up to you to select which of these boxes should stay and which should be filtered out. How can you do this? On the one hand, you could select the box with the highest probability for the dog and the box with the highest probability for the cat and remove all others. However, that does not work in all cases, because you can have images with several dogs and several cats at the same time. You need a general-purpose algorithm that removes all boxes that overlap each other too much. Fortunately, this algorithm already exists, and it's called Non-maximum suppression. These are the steps that you should implement to make it work:

  1. Create an empty resulting array that will contain a list of boxes that you want to keep.
  2. Start a loop
  3. From source boxes array, select the box with the highest probability and move it to the resulting array.
  4. Compare the selected box with each other box from the source array and remove all of them that overlap the selected one too much.
  5. If the source array contains more boxes, move to step 2 and repeat

After the loop finishes, the source boxes array will be empty, and the resulting array will contain only distinct boxes. Now let's understand how to implement step 4: how to compare two boxes and determine that they overlap each other too much. To do that, we will use another algorithm, "Intersection over Union" or IoU. This algorithm is actually a formula:

IoU = Area of Intersection / Area of Union

The idea of this algorithm is:

  1. Calculate the area of intersection of two boxes.
  2. Calculate the area of their union.
  3. Divide first by second.

The closer the result is to 1, the more the two boxes overlap. You can see this intuitively: the closer the area of intersection of two boxes is to the area of their union, the more they look like the same box. Consider three cases. If two boxes overlap only slightly, the IoU might be about 0.3; these two boxes can definitely be treated as different objects, even though they overlap. If the area of intersection is much closer to the area of the union, the IoU might be about 0.8, and it's highly likely that one of the boxes should be removed. Finally, if two boxes cover almost the same area, the IoU approaches 1, and definitely only one of them should stay.

Now let's implement both IoU and Non-Maximum suppression in code.

Intersection over union

1 Calculate the area of intersection

def intersection(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    return (x2-x1)*(y2-y1) 

Here, we calculate the area of the intersection rectangle using its width (x2-x1) and height (y2-y1).

2 Calculate the area of union

def union(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)

3 Divide first by second

def iou(box1,box2):
    return intersection(box1,box2)/union(box1,box2)
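As a quick sanity check of these three functions, here is a worked example with two made-up boxes:

box_a = [0, 0, 10, 10]               # area 100
box_b = [5, 5, 15, 15]               # area 100
print(intersection(box_a, box_b))    # 25: the overlap is the 5x5 square from (5,5) to (10,10)
print(union(box_a, box_b))           # 175: 100 + 100 - 25
print(iou(box_a, box_b))             # ~0.14: low overlap, both boxes would be kept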

Non-maximum suppression

So, we have an array of boxes in the boxes variable, and we need to leave only distinct items in it, using the iou function we created as a criterion of difference. Let's say that if the IoU of two boxes is less than 0.7, then they both should stay. Otherwise, the one with the lower probability should be removed. Let's implement it:

boxes.sort(key=lambda x: x[5], reverse=True)
result = []
while len(boxes)>0:
    result.append(boxes[0])
    boxes = [box for box in boxes if iou(box,boxes[0])<0.7]

For convenience, in the first line, we sorted all boxes by probability in reverse order to move the boxes with the highest probabilities to the top.

Then the code defines the array for resulting boxes. In a loop, it puts the first box (which is the box with the highest probability) in the resulting array, and on the next line it overwrites the boxes array with only the boxes whose IoU with the selected box is less than 0.7.

It continues doing that in a loop until the boxes array contains no items.

After running it, you can print the result array:

print(result)
[
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446],
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
]

Now it has just 2 items, as it should. Non-maximum suppression did its magic work and kept the best boxes for the cat and the dog, with the highest probabilities.
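If you want to see the final result visually, you can reuse the drawing code from before on the result array. Here is a small sketch (assuming cat_dog.jpg is still in the working folder; the width argument and the default font require a reasonably recent Pillow version):

from PIL import Image, ImageDraw

img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img)
for x1, y1, x2, y2, label, prob in result:
    draw.rectangle((x1, y1, x2, y2), outline="#00ff00", width=2)
    draw.text((x1, y1), f"{label} {prob:.2f}", fill="#00ff00")
img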

So, finally, you did it! Can you see how much code you had to write instead of the single model.predict() line in the Ultralytics API? However, now you know how it really works, and awareness of these algorithms makes you independent of the PyTorch environment. Now you can create applications that use YOLOv8 models in any programming language supported by ONNX, and I will show you how to do this.

In the next sections, we will refactor the object detection web service written in the previous article to use ONNX instead of PyTorch. We will rewrite it in Python, Julia, Node.js, JavaScript, Go and Rust.

The first section with Python defines the project structure, the functions, and their relations, and then we will rewrite all these functions in other programming languages without changing the structure of the project.

The Python section is recommended for everyone; then you can move on to the sections related to your chosen language. Using the defined project structure and algorithms, you will be able to write this web service in any other language that supports ONNX.

I assume that you are familiar with the languages you choose and have all the required IDEs and tools to write, compile and run that code. I will focus only on ONNX and the algorithms described above and will not teach you how to program in these languages. Furthermore, I will not dive into their standard libraries. However, I will provide links to the API docs of all external packages and frameworks that we will use, and you should either know the APIs of these libraries or be able to learn them using that documentation.

Create a web service on Python

Setup the project

We will use the project, created in the previous article as a base. You can get it from this repository.

Create a new folder and copy the following files to it from the project above:

  • index.html - frontend
  • object_detector.py - backend
  • requirements.txt - list of external dependencies

also copy the ONNX model yolov8m.onnx that you exported in the beginning of the article.

Then, open the requirements.txt file and replace the ultralytics dependency with onnxruntime. Also, add the numpy package to the list; it will be used to convert the image to an array. Finally, the list of dependencies should look like this:

onnxruntime
flask
waitress
pillow
numpy

Ensure that all these packages are installed. You can install them one by one using pip, or, better, install them all at once:

pip install -r requirements.txt

We will not change the frontend, so index.html will stay the same. The only file that we will change is object_detector.py, where we will rewrite the object detection code that previously used the Ultralytics API to use the ONNX runtime.

Let's make a few changes to the structure of this file:

import onnxruntime as ort
from flask import request, Flask, jsonify
from waitress import serve
from PIL import Image
import numpy as np
import json

app = Flask(__name__)


def main():
    serve(app, host='0.0.0.0', port=8080)


@app.route("/")
def root():
    with open("index.html") as file:
        return file.read()


@app.route("/detect", methods=["POST"])
def detect():
    buf = request.files["image_file"]
    boxes = detect_objects_on_image(buf.stream)
    return jsonify(boxes)


def detect_objects_on_image(buf):
    model = YOLO("best.pt")
    results = model.predict(buf)
    result = results[0]
    output = []
    for box in result.boxes:
        x1, y1, x2, y2 = [
            round(x) for x in box.xyxy[0].tolist()
        ]
        class_id = box.cls[0].item()
        prob = round(box.conf[0].item(), 2)
        output.append([
            x1, y1, x2, y2, result.names[class_id], prob
        ])
    return output


main()

If you compare this listing with the original object_detector.py, you'll see that I removed the ultralytics package and added the line that imports the ONNX runtime: import onnxruntime as ort. Also, I've imported numpy as np.

Then, I moved the code that runs the web server into the main function and put it at the beginning. Finally, main() is called on the last line.

We will not change the routes inside the main function, so the root and detect functions will remain the same. We will rewrite only detect_objects_on_image to use the ONNX runtime instead of Ultralytics. The implementation will be more complex than now, but you already know everything you need if you followed the previous sections of this article.

We will split the detect_objects_on_image function into three parts:

  • Prepare the input
  • Run the model
  • Process the output

We will put each phase in a separate function that detect_objects_on_image will call. Replace the content of this function with the following:

def detect_objects_on_image(buf):
    input, img_width, img_height = prepare_input(buf)
    output = run_model(input)
    return process_output(output,img_width,img_height)

def prepare_input(buf):
    pass

def run_model(input):
    pass

def process_output(output,img_width,img_height):
    pass
  • In the first line, the prepare_input function receives the uploaded file content, converts it to the input array and returns it. In addition, it returns the original dimensions of the image, img_width and img_height, which will be used later to scale the detected bounding boxes.
  • Then, the run_model function receives the input and runs the ONNX session with it. It returns the output, which is an array with the (1,84,8400) shape.
  • Finally, the output is passed to the process_output function, along with the original image size (img_width, img_height). This function should return the array of bounding boxes. Each item of this array has the following format: [x1,y1,x2,y2,class_label,prob].

Let's write these functions one by one.

Prepare the input

The prepare_input function uses the code that you have written in the Prepare the input section. This is how it looks:

def prepare_input(buf):
    img = Image.open(buf)
    img_width, img_height = img.size
    img = img.resize((640, 640))
    img = img.convert("RGB")
    input = np.array(img)
    input = input.transpose(2, 0, 1)
    input = input.reshape(1, 3, 640, 640) / 255.0
    return input.astype(np.float32), img_width, img_height
  • This code loads the image and saves its size to the img_width and img_height variables.
  • Then it resizes the image, removes the transparency by converting it to RGB, and converts it to a tensor of pixels by loading it as an np.array().
  • Then it transposes and reshapes the array to convert it from the (640,640,3) shape to the (1,3,640,640) shape, and divides all values by 255.0 to scale them and make them compatible with the ONNX model input format.
  • Finally, it returns the input array converted to the Float32 data type, along with the original img_width and img_height. It's important to convert to np.float32 here, because by default Python uses double precision for floating point numbers, but the ONNX model requires Float32.

Run the model

In this function you can reuse the code, that we wrote in the Run the model section.

def run_model(input):
    model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])
    outputs = model.run(["output0"], {"images":input})
    return outputs[0]

First, you load the model from the yolov8m.onnx file and then use the run method to process the input and return the outputs. Finally, it returns the first output which is an array of (1,84,8400) shape.

Now, it's time to process and convert this output to the array of bounding boxes.

Process the output

The code to process the output will include the functions from the Process the output section to filter out all overlapping boxes using the "Intersection over Union" algorithm. Also, it will use the array of YOLO classes to obtain the labels for each detected object. You can just copy/paste this code from the appropriate places:

def iou(box1,box2):
    return intersection(box1,box2)/union(box1,box2)

def union(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)

def intersection(box1,box2):
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    return (x2-x1)*(y2-y1)

yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

These are the iou function and its dependencies, used to calculate the intersection over union coefficient, plus the array of YOLO classes that the model can detect.

Now, having all that, you can implement the process_output function:

def process_output(output, img_width, img_height):
    output = output[0].astype(float)
    output = output.transpose()

    boxes = []
    for row in output:
        prob = row[4:].max()
        if prob < 0.5:
            continue
        class_id = row[4:].argmax()
        label = yolo_classes[class_id]
        xc, yc, w, h = row[:4]
        x1 = (xc - w/2) / 640 * img_width
        y1 = (yc - h/2) / 640 * img_height
        x2 = (xc + w/2) / 640 * img_width
        y2 = (yc + h/2) / 640 * img_height
        boxes.append([x1, y1, x2, y2, label, prob])

    boxes.sort(key=lambda x: x[5], reverse=True)
    result = []
    while len(boxes) > 0:
        result.append(boxes[0])
        boxes = [box for box in boxes if iou(box, boxes[0]) < 0.7]
    return result
  • The first two lines convert the output shape from (1,84,8400) to (8400,84), which is 8400 rows with 84 columns. They also convert the values of the array from np.float32 to the float data type, which is required to serialize the result to JSON later.
  • The first loop goes through the rows. For each row, it calculates the probability of the prediction and skips the row if the probability is less than 0.5.
  • For rows that pass the probability check, it determines the detected object's class_id and the text label of this class, using the yolo_classes array.
  • Then it calculates the corner coordinates of the bounding box using the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
  • Then it appends the calculated bounding box to the boxes array.
  • The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It removes all boxes that overlap the box with the highest probability too much, using the iou function as the overlapping criterion.
  • Finally, all boxes that passed the filter are returned as the result array.

That is it for Python implementation.

If everything is implemented without mistakes, you can run this web service this way:

python object_detector.py

then open http://localhost:8080 in a web browser, and it should work exactly the same as the original service implemented using the PyTorch version of the YOLOv8 model.
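Besides the browser, you can also test the /detect endpoint directly. Here is a minimal sketch using the requests package (an extra dependency, not listed in requirements.txt) and the cat_dog.jpg image from earlier:

import requests

with open("cat_dog.jpg", "rb") as f:
    response = requests.post("http://localhost:8080/detect", files={"image_file": f})
print(response.json())  # a list of [x1, y1, x2, y2, label, probability] items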

The ONNX runtime is a low-level library, so it requires much more code to make the model work; however, the solution built this way is better to deploy in production, because it requires about 10 times less disk space.

You can find the whole project with comments in this GitHub repository.

The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing or exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.

We used only the small subset of the ONNX runtime Python API required for basic operations. The full reference is available here.

If you followed this guide step by step and implemented this web service in Python, then by now you know the foundational algorithm of how the ONNX runtime works in general and are ready to try implementing this in other languages.

In the sections below, we will implement the same project with the same functions in other programming languages. If you are curious, you can read all the following sections, or move directly to the language that interests you the most.

Create a web service on Julia

Julia is a modern programming language well suited for data science and machine learning. It combines simple syntax with superfast runtime performance. It is sometimes described as the future of machine learning and the most natural replacement for Python in this field.

Julia has good libraries for machine learning and deep learning. You can read my articles that introduce these libraries for creating and running classical machine learning models and neural networks.

Furthermore, having a binding to the ONNX runtime library, you can use any machine learning model created using Python, including neural networks created in PyTorch and TensorFlow. YOLOv8 is not an exception, and you can run the models exported to ONNX format in Julia.

Below, we will implement the same object detection project in Julia.

Setup the project

Enter the Julia REPL by running the following command:

julia

In the REPL, switch to pkg mode by pressing the ] key and then, enter this command:

generate object_detector

This command will create a folder object_detector and will generate the new project in it.

Enter the shell mode by pressing the ; key and move to the project folder by running the following command:

cd object_detector

Return to pkg mode by pressing Esc and then the ] key. Then execute this command to activate the project:

activate .

Then you need to install dependencies that will be used. They are ONNX runtime, the Images package and the Genie web framework.

add ONNXRunTime
add Images
add Genie
  • ONNXRunTime - the Julia bindings for the ONNX runtime library.
  • Images - this is the Julia Images package, which we will use to read images and convert them to pixel color arrays.
  • Genie - this is a web framework for Julia, similar to Flask in Python.

Then you can exit the Julia REPL by pressing Ctrl+D.

Open the project folder to see what is there:

  • src - the folder with Julia source code
  • Project.toml - the project properties file
  • Manifest.toml - the project package cache file

It also generated the template source code file object_detector.jl in the src folder. This is where we will do all the work. However, before we start, copy the index.html and yolov8m.onnx files from the Python project to this project's root. The frontend will be the same.

After you've done that, open the src/object_detector.jl, erase all content from it and add the following boilerplate code:

using Images, ONNXRunTime, Genie, Genie.Router, Genie.Requests, Genie.Renderer.Json

function main()    
    route("/") do 
        String(read("index.html"))
    end 

    route("/detect", method=POST) do
        buf = IOBuffer(filespayload()["image_file"].data)
        json(detect_objects_on_image(buf))
    end

    up(8080, host="0.0.0.0", async=false)
end

function detect_objects_on_image(buf)
    input, img_width, img_height = prepare_input(buf)
    output = run_model(input)
    return process_output(output, img_width,img_height)
end

function prepare_input(buf)
end

function run_model(input)
end

function process_output(output, img_width, img_height)
end

main()

This is a template of the whole application. You can compare this with the Python project and see that it has almost the same structure.

  • First you import dependencies, including ONNX Runtime, Genie Web framework and Images library.
  • Then, in the main function, you create two endpoints: one for the main index.html page and one for /detect, which will receive the image file and pass it to the detect_objects_on_image function. Then you start the web server on port 8080, which serves these two endpoints.
  • detect_objects_on_image has exactly the same content as the Python one. It prepares the input from the image, passes it through the model, processes the model output and returns the array of bounding boxes.
  • Then, the processed output is returned to the client as JSON.

In the next sections we will implement prepare_input, run_model and process_output functions one by one.

Prepare the input

function prepare_input(buf)
    img = load(buf)
    img_height, img_width = size(img)
    img = imresize(img,(640,640))
    img = RGB.(img)
    input = channelview(img)
    input = reshape(input,1,3,640,640)
    return Float32.(input), img_width, img_height    
end
  • This code loads the image, saves its size to img_width and img_height variables.
  • Then it resizes it, removes the transparency by converting to RGB, and converts to a tensor of pixels using the channelview function.
  • Then it reshapes the array from the (3,640,640) shape produced by channelview to the (1,3,640,640) shape that is required for the ONNX model.
  • Finally, it returns the input array converted to "Float32" data type along with original img_width and img_height.

Run the model

function run_model(input)
    model = load_inference("yolov8m.onnx")
    outputs = model(Dict("images" => input))
    return outputs["output0"]
end

This code is almost the same as the corresponding Python code.

First, you load the model from the yolov8m.onnx file and then run this model to process the input and return the outputs. Finally, it returns the first output which is an array of (1,84,8400) shape.

Now, it's time to process and convert this output to the array of bounding boxes.

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Julia. Include them in your code below the process_output function:

function iou(box1,box2)
    return intersection(box1,box2) / union(box1,box2)
end

function union(box1,box2)
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
    box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)
end

function intersection(box1,box2)
    box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
    box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
    x1 = max(box1_x1,box2_x1)
    y1 = max(box1_y1,box2_y1)
    x2 = min(box1_x2,box2_x2)
    y2 = min(box1_y2,box2_y2)
    return (x2-x1)*(y2-y1)
end

Also, include the array of YOLOv8 class labels, which will be used to convert class IDs to text labels:

yolo_classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

Now, it's time to write the process_output function:

function process_output(output, img_width, img_height)
    output = output[1,:,:]
    output = transpose(output)

    boxes = []
    for row in eachrow(output)        
        prob = maximum(row[5:end])
        if prob < 0.5
            continue
        end
        class_id = Int(argmax(row[5:end]))
        label = yolo_classes[class_id]
        xc,yc,w,h = row[1:4]
        x1 = (xc-w/2)/640*img_width
        y1 = (yc-h/2)/640*img_height
        x2 = (xc+w/2)/640*img_width
        y2 = (yc+h/2)/640*img_height
        push!(boxes,[x1,y1,x2,y2,label,prob])
    end

    boxes = sort(boxes, by = item -> item[6], rev=true)
    result = []
    while length(boxes)>0
        push!(result,boxes[1])
        boxes = filter(box -> iou(box,boxes[1])<0.7,boxes)
    end
    return result
end

As with the Python version, it consists of three parts.

  • In the first two lines, it converts the output array from the (1,84,8400) shape to (8400,84).
  • The first loop goes through the rows. For each row, it calculates the probability of the prediction and skips the row if the probability is less than 0.5.
  • For rows that pass the probability check, it determines the class_id of the detected object and the text label of this class, using the yolo_classes array.
  • Then it calculates the corner coordinates of the bounding box from the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
  • Then it appends the calculated bounding box to the boxes array.
  • The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It removes all boxes that overlap the box with the highest probability too much, using the iou function as the overlapping criterion.
  • Finally, all boxes that passed the filter are returned as the result array.

That is it for Julia implementation.

If everything implemented without mistakes, you can run this web service from the project folder using the following command:

julia src/object_detector.jl
Enter fullscreen mode Exit fullscreen mode

then open http://localhost:8080 in a web browser, and it should work exactly the same, as Python version.

The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing and exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.

We used only a small subset of the ONNX runtime Julia API required for basic operations. The full reference is available here.

You can find the source code of the Julia project in this repository.

Create a web service on Node.js

Node.js needs no introduction. This is the most used platform to develop server-side JavaScript applications, including backends for web services. Obviously, it would be great to have a way to use neural networks in it. Fortunately, the ONNX runtime for Node.js opens the door to all machine learning models trained with PyTorch, TensorFlow and other frameworks, and YOLOv8 is no exception. In this section, I will show how to rewrite our object detection web service on Node.js, using the ONNX runtime.

Setup the project

Create new folder for the project like object_detector, open it and run:

npm init
Enter fullscreen mode Exit fullscreen mode

to create a new Node.js project. After answering all questions about the project, install the required dependencies:

npm i --save onnxruntime-node
npm i --save express
npm i --save multer
npm i --save sharp
Enter fullscreen mode Exit fullscreen mode
  • onnxruntime-node - The Node.js library for ONNX Runtime
  • express - Express.js web framework
  • multer - Middleware for Express.js to handle file uploads
  • sharp - An image processing library

We are not going to change the frontend, so you can copy the index.html file from the previous project as is to the folder of this project. Also, copy the model file yolov8m.onnx.

Create an object_detector.js file in which you will write the whole backend. Add the following boilerplate code to it:

const ort = require("onnxruntime-node");
const express = require('express');
const multer = require("multer");
const sharp = require("sharp");
const fs = require("fs");

function main() {
    const app = express();
    const upload = multer();

    app.get("/", (req,res) => {
        res.end(fs.readFileSync("index.html", "utf8"))
    })

    app.post('/detect', upload.single('image_file'), async function (req, res) {
        const boxes = await detect_objects_on_image(req.file.buffer);
        res.json(boxes);
    });

    app.listen(8080, () => {
        console.log('Server is listening on port 8080')
    });
}

async function detect_objects_on_image(buf) {
    const [input,img_width,img_height] = await prepare_input(buf);
    const output = await run_model(input);
    return process_output(output,img_width,img_height);
}

async function prepare_input(buf) {

}

async function run_model(input) {

}

async function process_output(output, img_width, img_height) {

}

main()

Enter fullscreen mode Exit fullscreen mode
  • In the first block of require lines, you import all required external modules: ort for the ONNX runtime, express for the web framework, multer to support file uploads in the Express framework, sharp to load the uploaded file as an image and convert it to an array of pixel colors, and fs to read static files.
  • In the main function, it creates a new Express web application in the app variable and instantiates the uploads module for it.
  • Then it defines two routes: the root route, which reads and returns the content of the index.html file, and the /detect route, which is used to get the uploaded file, pass it to the detect_objects_on_image function and return the bounding boxes of detected objects to the client.
  • The detect_objects_on_image function looks almost the same as in the Python and Julia projects: first it converts the uploaded file to an array of numbers, passes it to the model, processes the output and returns the array of detected objects.
  • Then, stubs for all processing functions are defined.
  • Finally, the main() function is called to start a web server on port 8080.

The project is ready, and it's time to implement the prepare_input, run_model and process_output functions one by one.

Prepare the input

We will use the Sharp library to load the image as an array of pixel colors. However, JavaScript does not have packages like NumPy that support multidimensional arrays. All arrays in JavaScript are flat. We can make an "array of arrays", but it's not a true multidimensional array with a shape. For example, we can't make an array with shape (3,640,640), which means an array of 3 matrices: the first one for reds, the second one for greens and the third one for blues. Instead, the ONNX runtime for JavaScript requires a flat array with 3*640*640=1228800 elements in which the reds go at the beginning, the greens go next and the blues go at the end. This is the result that the prepare_input function should return. Now let's do it step by step.

First, let's do the same actions with image as we did in other languages:

async function prepare_input(buf) {
    const img = sharp(buf);
    const md = await img.metadata();
    const [img_width,img_height] = [md.width, md.height];
    const pixels = await img.removeAlpha()
        .resize({width:640,height:640,fit:'fill'})
        .raw()
        .toBuffer();
Enter fullscreen mode Exit fullscreen mode
  • It loads the file as an image using sharp.
  • It saves the original image dimensions to img_width and img_height.
  • On the next line, it uses a chain of operations to:
  • remove the transparency channel,
  • resize the image to 640x640,
  • return the image as a raw buffer of pixel values.

Sharp also can't return a matrix of pixels because there are no matrices in JavaScript. That is why the pixels variable now contains a single-dimension array of image pixels. Each pixel consists of 3 numbers: R, G, B. There are no rows and columns; the pixels just go one after another. To convert this to the required format, you need to split it into 3 arrays: an array of reds, an array of greens and an array of blues, and then concatenate these 3 arrays into one in which the reds go first, the greens go next and the blues go at the end.

The next image shows what you need to do with the pixels array and return from the function:

Image description

The first step is to create 3 arrays for reds, greens and blues:

const red = [], green = [], blue = [];
Enter fullscreen mode Exit fullscreen mode

Then, traverse the pixels array and collect numbers to appropriate arrays:

for (let index=0; index<pixels.length; index+=3) {
    red.push(pixels[index]/255.0);
    green.push(pixels[index+1]/255.0);
    blue.push(pixels[index+2]/255.0);
}
Enter fullscreen mode Exit fullscreen mode

This loop jumps from pixel to pixel with step=3. On each iteration, the value at index is the red component of the current pixel, the value at index+1 is the green component and the value at index+2 is the blue one. As you see, we divide the components by 255.0 to scale them and put them into the appropriate arrays.

The only thing left to do after this is to concatenate these arrays in the correct order and return them along with img_width and img_height.

Here is a full code of the prepare_input function:

async function prepare_input(buf) {
    const img = sharp(buf);
    const md = await img.metadata();
    const [img_width,img_height] = [md.width, md.height];
    const pixels = await img.removeAlpha()
        .resize({width:640,height:640,fit:'fill'})
        .raw()
        .toBuffer();

    const red = [], green = [], blue = [];
    for (let index=0; index<pixels.length; index+=3) {
        red.push(pixels[index]/255.0);
        green.push(pixels[index+1]/255.0);
        blue.push(pixels[index+2]/255.0);
    }

    const input = [...red, ...green, ...blue];
    return [input, img_width, img_height];
}
Enter fullscreen mode Exit fullscreen mode

Perhaps other, less resource-consuming ways exist to convert the pixels array to the required form without temporary arrays (you can try your own options), but I wanted to keep this implementation simple and logical.
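For example, here is a rough sketch of one such option (the pixels_to_input helper is hypothetical and not used in this project): it writes the scaled values directly into a preallocated Float32Array, where each color plane occupies its own 640*640 region.

function pixels_to_input(pixels) {
    // Single-pass conversion: no temporary red/green/blue arrays.
    const plane = 640*640;
    const input = new Float32Array(3*plane);
    for (let i=0; i<plane; i++) {
        input[i] = pixels[3*i]/255.0;             // red plane
        input[plane+i] = pixels[3*i+1]/255.0;     // green plane
        input[2*plane+i] = pixels[3*i+2]/255.0;   // blue plane
    }
    return input;
}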

Now, let's run this input through the YOLOv8 model using the ONNX runtime.

Run the model

The code of the run_model function follows:

async function run_model(input) {
    const model = await ort.InferenceSession.create("yolov8m.onnx");
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}
Enter fullscreen mode Exit fullscreen mode
  • On the first line, we load the model from yolov8m.onnx file.
  • On the second line, we prepare the input array. The ONNX runtime requires converting it to an internal ort.Tensor object. The constructor of this object requires the flat array of numbers, converted to Float32, and the shape that this array should have, which, as usual, is [1,3,640,640].
  • On the third line, we run the model with the constructed tensor and receive outputs.
  • Finally, we return the data of the first output. In the JavaScript version, we need to specify the name of this output instead of its index. The name of the YOLOv8 output, as you have seen in the beginning of this article, is output0.

As a result, the function returns the array with (1,84,8400) shape, or you can think about it as an 84x8400 matrix. However, JavaScript does not support matrices, so it returns the output as a single-dimension array. The numbers in this array are ordered as an 84x8400 matrix, but stored as a flat array of 705600 items. So, you can't transpose it, and you can't traverse it row by row in a loop, because you have to specify the absolute position of each item. But do not worry, in the next section we will learn how to deal with it.

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to JavaScript. Include them in your code below the process_output function:

function iou(box1,box2) {
    return intersection(box1,box2)/union(box1,box2);
}

function union(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
    const box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
    return box1_area + box2_area - intersection(box1,box2)
}

function intersection(box1,box2) {
    const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
    const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
    const x1 = Math.max(box1_x1,box2_x1);
    const y1 = Math.max(box1_y1,box2_y1);
    const x2 = Math.min(box1_x2,box2_x2);
    const y2 = Math.min(box1_y2,box2_y2);
    return (x2-x1)*(y2-y1)
}
Enter fullscreen mode Exit fullscreen mode

Also, you will need to find the YOLO class label by its ID, so add the yolo_classes array to your code:

const yolo_classes = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
    'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
    'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
    'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
];
Enter fullscreen mode Exit fullscreen mode

Now let's implement the process_output function. As mentioned above, the function receives the output as a flat array that is ordered as an 84x8400 matrix. When working in Python, we had NumPy to transform it to 8400x84 and then traverse it row by row in a loop. Here, we can't transform it this way, so we need to traverse it by columns.

let boxes = [];
for (let index=0; index<8400; index++) {

}
Enter fullscreen mode Exit fullscreen mode

Moreover, you do not have row indexes and column indexes, but only absolute indexes. You can only virtually reshape this flat array to an 84x8400 matrix in your head and use this representation to calculate the absolute indexes, using those "virtual rows" and "virtual columns".

Let's display how the output array looks to clarify this:

Image description

Here we virtually reshaped the output array with 705600 items to an 84x8400 matrix. It has 8400 columns with indexes from 0 to 8399 and 84 rows with indexes from 0 to 83. The absolute indexes of items are written inside the cells. Each detected object is represented by a column in this matrix. The first 4 rows of each column, with indexes from 0 to 3, are the coordinates of the bounding box of the appropriate object: x_center, y_center, width and height. The cells in the other 80 rows, from 4 to 83, contain the probabilities that the object belongs to each of the 80 YOLO classes.

I drew this table to understand how to calculate the absolute index of any item in it, knowing the row and column indexes. For example, how do you calculate the index of the first greyed item, which stands in row 2 and column 2 and is the bounding box width of the third detected object? If you think about this a little, you will find that you need to multiply the row index by the length of the row (8400) and add the column index to it. Let's check it: 8400*2+2=16802. Now, let's calculate the index of the item below it, which is the height of the same object: 8400*3+2=25202. Bingo! Matched again! Finally, let's check the bottom gray box, which is the probability that object 8398 belongs to class 79 (toothbrush): 8400*83+8398=705598. Great, so you have a formula to calculate the absolute index: 8400*row_index+column_index.
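To make this formula concrete, here is a tiny sanity check you could run in Node.js (the absIndex helper is just for illustration and is not part of the service):

const absIndex = (row, col) => 8400*row + col;
console.log(absIndex(2, 2));     // 16802 - width of the third detected object
console.log(absIndex(3, 2));     // 25202 - height of the same object
console.log(absIndex(83, 8398)); // 705598 - probability of class 79 for object 8398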

Let's return to our empty loop. Assuming that the index loop counter is the index of the current column and that the coordinates of the bounding box are located in rows 0-3 of that column, we can extract them this way:

let boxes = [];
for (let index=0;index<8400;index++) {
    const xc = output[8400*0+index];
    const yc = output[8400*1+index];
    const w = output[8400*2+index];
    const h = output[8400*3+index];
}
Enter fullscreen mode Exit fullscreen mode

Then you can calculate the corners of the bounding box and scale them to the size of the original image:

const x1 = (xc-w/2)/640*img_width;
const y1 = (yc-h/2)/640*img_height;
const x2 = (xc+w/2)/640*img_width;
const y2 = (yc+h/2)/640*img_height;
Enter fullscreen mode Exit fullscreen mode

Now, similarly, you need to get the probabilities of the object, which go in rows from 4 to 83, find the biggest one and the index of this probability, and save these values to the prob and class_id variables. You can write a nested loop that traverses rows from 4 to 83 and saves the highest value and its index:

let class_id = 0, prob = 0;
for (let col=4;col<84;col++) {
    if (output[8400*col+index]>prob) {
        prob = output[8400*col+index];
        class_id = col - 4;
    }
}
Enter fullscreen mode Exit fullscreen mode

It works fine, but I'd rather rewrite this in a functional way:

const [class_id,prob] = [...Array(80).keys()]
    .map(col => [col, output[8400*(col+4)+index]])
    .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
Enter fullscreen mode Exit fullscreen mode
  • The first line, [...Array(80).keys()], generates a range array with numbers from 0 to 79.
  • Then, the map function constructs an array of probabilities, where each item is a [class_id, probability] pair.
  • The reduce function reduces this array to a single item that contains the maximum probability and its class ID.
  • This item is finally returned and destructured into the class_id and prob variables.

Then, having the maximum probability and class_id, you can either skip the object, if the probability is less than 0.5, or find the label of this class.

Here is a final code, that processes and collects bounding boxes to the boxes array:

    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(80).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }

Enter fullscreen mode Exit fullscreen mode

The last step is to filter the boxes array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code is close to the Python implementation:

boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
const result = [];
while (boxes.length>0) {
    result.push(boxes[0]);
    boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
}
Enter fullscreen mode Exit fullscreen mode
  • We sort the boxes by probability in reverse order to put the boxes with the highest probability to the top
  • In a loop, we put the box with the highest probability to result
  • Then we filter out all boxes that overlap the selected box too much (all boxes that have IoU>0.7 with this box)

That's all! For convenience, here is a full code of the process_output function:

function process_output(output, img_width, img_height) {
    let boxes = [];
    for (let index=0;index<8400;index++) {
        const [class_id,prob] = [...Array(80).keys()]
            .map(col => [col, output[8400*(col+4)+index]])
            .reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
        if (prob < 0.5) {
            continue;
        }
        const label = yolo_classes[class_id];
        const xc = output[index];
        const yc = output[8400+index];
        const w = output[2*8400+index];
        const h = output[3*8400+index];
        const x1 = (xc-w/2)/640*img_width;
        const y1 = (yc-h/2)/640*img_height;
        const x2 = (xc+w/2)/640*img_width;
        const y2 = (yc+h/2)/640*img_height;
        boxes.push([x1,y1,x2,y2,label,prob]);
    }

    boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
    const result = [];
    while (boxes.length>0) {
        result.push(boxes[0]);
        boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
    }
    return result;
}
Enter fullscreen mode Exit fullscreen mode

If you'd like to work with this output in a more convenient "Pythonic" way, there is a NumJS library that emulates NumPy in JavaScript. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and then traverse the detected objects by row.
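Just to illustrate the idea of a row-wise layout without pulling in a dependency, here is a rough plain-JavaScript sketch of such a reshape (the reshape_output helper is hypothetical): it copies the flat output into 8400 rows of 84 values each, so each row corresponds to one detected object.

function reshape_output(output) {
    // Physically reorder the flat 84x8400 output into 8400 rows of 84 values.
    // Note: this copies the data, so the flat-index approach above is still more efficient.
    const rows = [];
    for (let col=0; col<8400; col++) {
        const row = [];
        for (let r=0; r<84; r++) {
            row.push(output[8400*r+col]);
        }
        rows.push(row);
    }
    return rows;
}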

However, the option to work with the single-dimension array as with a matrix, described in this section, is the most efficient, because we get all the values we need without additional array transformations. I think that installing an additional external dependency is overkill for this case.

That is it for the Node.js implementation. If you wrote everything correctly, then you can start this web service by running the following command:

node object_detector.js
Enter fullscreen mode Exit fullscreen mode

and open http://localhost:8080 in a web browser.

The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing and exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.

We used only a small subset of the ONNX runtime JavaScript API required for basic operations. The full reference is available here.

You can find the source code of the Node.js object detector web service in this repository.

Create a web service on JavaScript

Could you ever imagine that you can write all the code for an object detector right in the HTML page? Using the ONNX library for JavaScript, you can process the image right in the frontend, without sending it to any server. Furthermore, you can reuse most of the code that we wrote for Node.js, because the underlying ONNX runtime API is the same.

Setup the project

You can reuse the frontend from Node.js project. Create a new folder and copy the index.html and yolov8m.onnx files to it.

Then, open the index.html and add the JavaScript library for ONNX runtime to the head section of the HTML:

<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
Enter fullscreen mode Exit fullscreen mode

This library exposes the ort global variable, which is the root of the ONNX runtime API. You can use it to instantiate and run models the same way as we used the ort variable in the Node.js project.

Perhaps, by the time you read this, the URL of the library will have changed, so check the official documentation for installation instructions.

This is an index.html file that you should have in the beginning:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>YOLOv8 Object Detection</title>
    <style>
      canvas {
          display:block;
          border: 1px solid black;
          margin-top:10px;
      }
    </style>
    <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
</head>
<body>
    <input id="uploadInput" type="file"/>
    <canvas></canvas>
    <script>

       const input = document.getElementById("uploadInput");
       input.addEventListener("change",async(event) => {
           const data = new FormData();
           data.append("image_file",event.target.files[0],"image_file");
           const response = await fetch("/detect",{
               method:"post",
               body:data
           });
           const boxes = await response.json();
           draw_image_and_boxes(event.target.files[0],boxes);
       })

      function draw_image_and_boxes(file,boxes) {
          const img = new Image()
          img.src = URL.createObjectURL(file);
          img.onload = () => {
              const canvas = document.querySelector("canvas");
              canvas.width = img.width;
              canvas.height = img.height;
              const ctx = canvas.getContext("2d");
              ctx.drawImage(img,0,0);
              ctx.strokeStyle = "#00FF00";
              ctx.lineWidth = 3;
              ctx.font = "18px serif";
              boxes.forEach(([x1,y1,x2,y2,label]) => {
                  ctx.strokeRect(x1,y1,x2-x1,y2-y1);
                  ctx.fillStyle = "#00ff00";
                  const width = ctx.measureText(label).width;
                  ctx.fillRect(x1,y1,width+10,25);
                  ctx.fillStyle = "#000000";
                  ctx.fillText(label, x1, y1+18);
              });
          }
      }
    </script>
</body>
</html>
Enter fullscreen mode Exit fullscreen mode

To run the ONNX runtime in a browser, you need to serve the content of this folder with a web server. For example, you can use the web server embedded in VS Code to serve the index.html file.
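Any static file server will do; assuming you have Python 3 or Node.js installed, one of the following commands, run from the project folder, should also work:

python3 -m http.server 8080
npx http-server -p 8080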

Once it works, let's load the image and prepare an input array from it.

Prepare the input

The user loads the image by selecting the image file in the upload file field. This process is implemented in the change event listener:

input.addEventListener("change",async(event) => {
    const data = new FormData();
           data.append("image_file",event.target.files[0],"image_file");
    const response = await fetch("/detect",{
        method:"post",
        body:data
    });
    const boxes = await response.json();
    draw_image_and_boxes(event.target.files[0],boxes);
})
Enter fullscreen mode Exit fullscreen mode

In this code, fetch is used to post the file from the event.target.files[0] variable to the backend. Then the backend returns the array of bounding boxes, which is decoded to the boxes array.

However, in this version, we will not have a backend to upload the image to. We will write all the code here, in the index.html file, including detect_objects_on_image and all other functions. So you need to remove this fetch call and just pass the file to the detect_objects_on_image function:

input.addEventListener("change",async(event) => {
    const boxes = await detect_objects_on_image(event.target.files[0]);
    draw_image_and_boxes(event.target.files[0],boxes);
})
Enter fullscreen mode Exit fullscreen mode

Then, define the detect_objects_on_image function, which is the same as in the Node.js example:

async function detect_objects_on_image(buf) {
    const [input,img_width,img_height] = await prepare_input(buf);
    const output = await run_model(input);
    return process_output(output,img_width,img_height);
}
Enter fullscreen mode Exit fullscreen mode

The only difference here is that buf is a File object that the user selected in the upload file field. You need to load this file as an image in the browser and convert it to an array of pixels. The most common way to do this in HTML and JavaScript is using the HTML5 canvas object, which gives you the image as a flat array of pixel colors, almost the same as the Sharp library did in the Node.js version. We will do this work in the prepare_input function:

 async function prepare_input(buf) {
      const img = new Image();
      img.src = URL.createObjectURL(buf);
      img.onload = () => {
          const [img_width,img_height] = [img.width, img.height]
          const canvas = document.createElement("canvas");
          canvas.width = 640;
          canvas.height = 640;
          const context = canvas.getContext("2d");
          context.drawImage(img,0,0,640,640);
          const imgData = context.getImageData(0,0,640,640);
          const pixels = imgData.data;
      }
  }
Enter fullscreen mode Exit fullscreen mode
  • The HTML5 canvas element can draw HTML images; that is why we need to load the file into an Image() object first.
  • Then, before drawing it on the canvas, we need to ensure that the image is loaded. That is why we write all the following code in the onload() event handler of the image object, which executes only after the image is loaded.
  • We save the original image size to img_width and img_height.
  • Then we create a canvas object and set its size to 640x640, because this is the size required by the YOLOv8 model.
  • Then we get the 2D drawing context of the created canvas to draw the image on it. The drawImage method allows drawing and resizing at the same time, which is why we set the size of the image on the canvas to 640x640.
  • Then getImageData() is used to get the ImageData object with the image pixels.
  • The only property of the ImageData object we need is data, which contains the array of pixels.

Now you have the pixels variable that contains a single-dimension array of image pixels. Each pixel consists of 4 numbers that define the color components: R, G, B, A, where R=red, G=green, B=blue and A=transparency (alpha channel). There are no rows and columns in this array; the pixels just go one after another. To convert this to the required format, you need to split it into 3 arrays: an array of reds, an array of greens and an array of blues, and then concatenate these 3 arrays into one in which the reds go first, the greens go next and the blues go at the end.

The next image shows what you need to do with the pixels array and return from the function:

Image description

The first step is to create 3 arrays for reds, greens and blues:

const red = [], green = [], blue = [];
Enter fullscreen mode Exit fullscreen mode

Then, traverse the pixels array and collect numbers to appropriate arrays:

for (let index=0; index<pixels.length; index+=4) {
    red.push(pixels[index]/255.0);
    green.push(pixels[index+1]/255.0);
    blue.push(pixels[index+2]/255.0);
}
Enter fullscreen mode Exit fullscreen mode

This loop jumps from pixel to pixel with step=4. On each iteration, the value at index is the red component of the current pixel, the value at index+1 is the green component and the value at index+2 is the blue one. The fourth component of the color is skipped in this loop. As you see, we divide the components by 255.0 to scale them and put them into the appropriate arrays.

The only thing left to do after this is to concatenate these arrays in the correct order and return them along with img_width and img_height. But we can't simply add a return to the prepare_input function here, because we write all this code inside an internal function, the onload event handler, and by writing return we would only return from this handler, not from the prepare_input function.

To handle this issue, we wrap the code of the prepare_input function in a Promise and return it. Then, inside the event handler, we use resolve([input, img_width, img_height]) to resolve that promise with the results that will be returned.

Here is a full code of the prepare_input function:

async function prepare_input(buf) {
    return new Promise(resolve => {
        const img = new Image();
        img.src = URL.createObjectURL(buf);
        img.onload = () => {
            const [img_width,img_height] = [img.width, img.height]
            const canvas = document.createElement("canvas");
            canvas.width = 640;
            canvas.height = 640;
            const context = canvas.getContext("2d");
            context.drawImage(img,0,0,640,640);
            const imgData = context.getImageData(0,0,640,640);
            const pixels = imgData.data;

            const red = [], green = [], blue = [];
            for (let index=0; index<pixels.length; index+=4) {
                red.push(pixels[index]/255.0);
                green.push(pixels[index+1]/255.0);
                blue.push(pixels[index+2]/255.0);
            }
            const input = [...red, ...green, ...blue];
            resolve([input, img_width, img_height])
        }
    })
}
Enter fullscreen mode Exit fullscreen mode

Run the model and process the output

This prepare_input function returns the input in exactly the same format as in the Node.js version. That is why all other code, including the run_model, process_output, iou, intersection and union functions, can be copied as is from the Node.js project.

After it's done, the JavaScript web service is finished!

Now you can use any web server to run the index.html file and try this wonderful feature: running neural network models right in the web browser frontend.

The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing and exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.

We used only a small subset of the ONNX runtime JavaScript API required for basic operations. The full reference is available here.

You can find the source code of the JavaScript object detector web service in this repository.

Create a web service on Go

Go is the first statically typed and compiled programming language in our journey. From my point of view, the greatest thing about Go is how you can deploy the apps written in it. You can compile all your code and its dependencies to a single binary executable, then just copy this file to a production server and run it. This is how the whole deployment process looks in Go. You do not need to install any third-party dependencies to run Go programs, which is why Go applications are usually compact and convenient to update. Also, Go is faster than Python and JavaScript. It would definitely be great to have an opportunity to deploy neural networks this way. Fortunately, several ONNX runtime bindings exist that will help us achieve this goal.

Setup the project

Create a new folder, enter it and run:

go mod init object_detector
Enter fullscreen mode Exit fullscreen mode

This command will initialize the object_detector project in the current folder.

Install required external modules:

go get github.com/yalue/onnxruntime_go
go get github.com/nfnt/resize
Enter fullscreen mode Exit fullscreen mode

The other thing I respect Go for is that most other modules we need, including the web framework and image processing functions, already exist in the standard library.

The ONNX module for Go provides the API, but does not contain the Microsoft ONNX runtime library itself. Instead, it has a function to specify the path in which this library is located. Here you have two options: install the Microsoft ONNX runtime library to a well-known system path, or download the version for your operating system and put it in the project folder. For this project, I will go the second way to make the project self-contained and independent of the operating system setup.

Go to the Releases page: https://github.com/microsoft/onnxruntime/releases and download the archive for your operating system. After it's downloaded, extract the files from the archive and copy all files from the lib subfolder to the project.

We are not going to change the frontend, so just copy the index.html file from one of the previous projects to the current folder. Also, copy the yolov8m.onnx model file.

By convention, the main file of a Go project should be named main.go. So, create this file and put the following boilerplate code in it:

package main

import (
    "encoding/json"
    "github.com/nfnt/resize"
    ort "github.com/yalue/onnxruntime_go"
    "image"
    _ "image/gif"
    _ "image/jpeg"
    _ "image/png"
    "io"
    "math"
    "net/http"
    "os"
    "sort"
)

func main() {
    server := http.Server{
    Addr: "0.0.0.0:8080",
    }
    http.HandleFunc("/", index)
    http.HandleFunc("/detect", detect)
    server.ListenAndServe()
}

func index(w http.ResponseWriter, _ *http.Request) {
    file, _ := os.Open("index.html")
    buf, _ := io.ReadAll(file)
    w.Write(buf)
}

func detect(w http.ResponseWriter, r *http.Request) {
    r.ParseMultipartForm(0)
    file, _, _ := r.FormFile("image_file")
    boxes := detect_objects_on_image(file)
    buf, _ := json.Marshal(&boxes)
    w.Write(buf)
}

func detect_objects_on_image(buf io.Reader) [][]interface{} {
    input, img_width, img_height := prepare_input(buf)
    output := run_model(input)
    return process_output(output, img_width, img_height)
}

func prepare_input(buf io.Reader) ([]float32, int64, int64) {

}

func run_model(input []float32) []float32 {

}

func process_output(output []float32, img_width, img_height int64) [][]interface{} {

}
Enter fullscreen mode Exit fullscreen mode

First, we import the required packages. Most of them come from the Go standard library:

  • encoding/json - to encode bounding boxes to JSON before sending response
  • github.com/nfnt/resize - to resize image to 640x640
  • ort "github.com/yalue/onnxruntime_go" - ONNX runtime library. We import it as ort variable
  • image, image/gif, image/jpeg, image/png - image library and libraries to support images of different formats
  • io - to read data from local files
  • math - for the Max and Min functions
  • net/http - to create and run a web server
  • os - to open local files
  • sort - to sort bounding boxes

Then, the main function defines two HTTP endpoints, / and /detect, that are handled by the index and detect functions, and starts the web server on port 8080.

The index endpoint just returns the content of the index.html file.

The detect endpoint receives the uploaded image file, sends it to the detect_objects_on_image function, which passes it through the YOLOv8 model. Then it receives the array of bounding boxes, encodes them to JSON and returns this JSON to the frontend.

The detect_objects_on_image function is the same as in the previous projects. The only difference is the type of value that it returns, which is [][]interface{}. The detect_objects_on_image should return an array of bounding boxes. Each bounding box is an array of 6 items (x1,y1,x2,y2,label,probability), and these items have different types. However, Go, as a statically typed programming language, does not allow an array with items of different types. But it has a special type, interface{}, which can hold a value of any type. It's a common trick in Go to define a variable using the interface{} type if it can hold values of different types. That is why, to have an array of items of different types, you need to create an array of interfaces: []interface{}. Consequently, the bounding box is an array of interfaces, and the array of bounding boxes is an array of interface arrays: [][]interface{}.

Then there are stubs of prepare_input, run_model and process_output functions defined. In the next sections, we will implement them one by one.

Prepare the input

To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert it to a tensor of (3,640,640) shape, where the first item is an array of red components of the image pixels, the second item is an array of greens and the last one is an array of blues. Furthermore, the ONNX library for Go requires you to provide this tensor as a flat array, i.e. to concatenate these three arrays one after another, as displayed in the next image.

Image description

So, let's load and resize the image first:

func prepare_input(buf io.Reader) ([]float32, int64, int64) {
    img, _, _ := image.Decode(buf)
    size := img.Bounds().Size()
    img_width, img_height := int64(size.X), int64(size.Y)
    img = resize.Resize(640, 640, img, resize.Lanczos3)
Enter fullscreen mode Exit fullscreen mode

This code:

  • loads the image,
  • saves the size of the original image to the img_width and img_height variables,
  • resizes it to 640x640 pixels.

Then you need to collect the pixel colors into separate arrays, which you should define first:

    red := []float32{}
    green := []float32{}
    blue := []float32{}
Enter fullscreen mode Exit fullscreen mode

Then you need to extract the pixels and their colors from the image. To do that, the img object has an .At(x,y) method that can be used to get the color of the pixel at a specified point of the image. The color object returned by this method has an .RGBA() method that returns the color components as four values: R, G, B, A. You need to extract only R, G and B and scale them.

Now, you have everything to traverse the image and collect pixel colors to created arrays:

for y := 0; y < 640; y++ {
    for x := 0; x < 640; x++ {
        r, g, b, _ := img.At(x, y).RGBA()
        red = append(red, float32(r/257)/255.0)
        green = append(green, float32(g/257)/255.0)
        blue = append(blue, float32(b/257)/255.0)
    }
}
Enter fullscreen mode Exit fullscreen mode
  • This code traverses all rows and columns of the image.
  • It extracts the color components of each pixel and destructures them into the r, g and b variables.
  • Then it scales these components and appends them to the appropriate arrays.

Finally, you need to concatenate these arrays to a single one in correct order:

input := append(red, green...)
input = append(input, blue...)
Enter fullscreen mode Exit fullscreen mode

So, the input variable contains the input required for the ONNX runtime. Here is the full code of this function, which returns the input and the size of the original image that will be used later when processing the output of the model.

func prepare_input(buf io.Reader) ([]float32, int64, int64) {
    img, _, _ := image.Decode(buf)
    size := img.Bounds().Size()
    img_width, img_height := int64(size.X), int64(size.Y)
    img = resize.Resize(640, 640, img, resize.Lanczos3)
    red := []float32{}
    green := []float32{}
    blue := []float32{}
    for y := 0; y < 640; y++ {
        for x := 0; x < 640; x++ {
            r, g, b, _ := img.At(x, y).RGBA()
            red = append(red, float32(r/257)/255.0)
            green = append(green, float32(g/257)/255.0)
            blue = append(blue, float32(b/257)/255.0)
        }
    }
    input := append(red, green...)
    input = append(input, blue...)
    return input, img_width, img_height
}
Enter fullscreen mode Exit fullscreen mode

Now, let's run it through the model.

Run the model

The run_model function does the same as in the Python example, but it is quite wordy because of Go language specifics:

func run_model(input []float32) []float32 {
    ort.SetSharedLibraryPath("./libonnxruntime.so")
    _ = ort.InitializeEnvironment()

    inputShape := ort.NewShape(1, 3, 640, 640)
    inputTensor, _ := ort.NewTensor(inputShape, input)

    outputShape := ort.NewShape(1, 84, 8400)
    outputTensor, _ := ort.NewEmptyTensor[float32](outputShape)

    model, _ := ort.NewSession[float32]("./yolov8m.onnx",
        []string{"images"}, []string{"output0"},
        []*ort.Tensor[float32]{inputTensor},[]*ort.Tensor[float32]{outputTensor})

    _ = model.Run()
    return outputTensor.GetData()
}
Enter fullscreen mode Exit fullscreen mode
  • As written in the setup section, the Go ONNX library needs to know where the ONNX runtime library is located. You need to use ort.SetSharedLibraryPath() to specify the location of the main file of the ONNX runtime library and then initialize the environment with this library. If you downloaded it manually, as suggested earlier, then just specify the name of the file. For Linux, the file name will be libonnxruntime.so, for macOS - libonnxruntime.dylib, for Windows - onnxruntime.dll. I work on Linux, so in this example I use the Linux library.
  • Then, the library requires converting the input to its internal tensor format with the (1,3,640,640) shape.
  • Then, the library also requires creating an empty structure for the output tensor and specifying its shape. The Go ONNX library does not return the output, but writes it to a variable that is defined in advance. Here, we defined the outputTensor variable as a tensor with the (1,84,8400) shape that will be used to receive the data from the model.
  • Then we create a model using the NewSession function, which receives both the arrays of input and output names and the arrays of input and output tensors.
  • Then we run this model, which processes the input and writes the output to the outputTensor variable.
  • The outputTensor.GetData() method returns the output data as a flat array of float numbers.

As a result, the function returns the array with (1,84,8400) shape, or you can think about it as an 84x8400 matrix. However, it returns the output as a single-dimension array. The numbers in this array are ordered as an 84x8400 matrix, but stored as a flat array of 705600 items. So, you can't transpose it, and you can't traverse it row by row in a loop, because you have to specify the absolute position of each item. But do not worry, in the next section we will learn how to deal with it.

Process the output

The code of the process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Go. Include them in your code below the process_output function:

func iou(box1, box2 []interface{}) float64 {
    return intersection(box1, box2) / union(box1, box2)
}

func union(box1, box2 []interface{}) float64 {
    box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
    box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
    box1_area := (box1_x2 - box1_x1) * (box1_y2 - box1_y1)
    box2_area := (box2_x2 - box2_x1) * (box2_y2 - box2_y1)
    return box1_area + box2_area - intersection(box1, box2)
}

func intersection(box1, box2 []interface{}) float64 {
    box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
    box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
    x1 := math.Max(box1_x1, box2_x1)
    y1 := math.Max(box1_y1, box2_y1)
    x2 := math.Min(box1_x2, box2_x2)
    y2 := math.Min(box1_y2, box2_y2)
    return (x2 - x1) * (y2 - y1)
}
Enter fullscreen mode Exit fullscreen mode

Also, you will need to find the YOLO class label by its ID, so add the yolo_classes array to your code:

var yolo_classes = []string{
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush",
}
Enter fullscreen mode Exit fullscreen mode

Now let's implement the process_output function. As mentioned above, the function receives the output as a flat array that is ordered as an 84x8400 matrix. When working in Python, we had NumPy to transform it to 8400x84 and then traverse it row by row in a loop. Here, we can't transform it this way, so we need to traverse it by columns.

boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {

}
Enter fullscreen mode Exit fullscreen mode

Moreover, you do not have row indexes and column indexes, but only absolute indexes. You can only virtually reshape this flat array to an 84x8400 matrix in your head and use this representation to calculate the absolute indexes, using those "virtual rows" and "virtual columns".

Let's display how the output array looks to clarify this:

Image description

Here we virtually reshaped the output array with 705600 items to an 84x8400 matrix. It has 8400 columns with indexes from 0 to 8399 and 84 rows with indexes from 0 to 83. The absolute indexes of items are written inside the cells. Each detected object is represented by a column in this matrix. The first 4 rows of each column, with indexes from 0 to 3, are the coordinates of the bounding box of the appropriate object: x_center, y_center, width and height. The cells in the other 80 rows, from 4 to 83, contain the probabilities that the object belongs to each of the 80 YOLO classes.

I drew this table to understand how to calculate the absolute index of any item in it, knowing the row and column indexes. For example, how do you calculate the index of the first greyed item, which stands in row 2 and column 2 and is the bounding box width of the third detected object? If you think about this a little, you will find that you need to multiply the row index by the length of the row (8400) and add the column index to it. Let's check it: 8400*2+2=16802. Now, let's calculate the index of the item below it, which is the height of the same object: 8400*3+2=25202. Bingo! Matched again! Finally, let's check the bottom gray box, which is the probability that object 8398 belongs to class 79 (toothbrush): 8400*83+8398=705598. Great, so you have a formula to calculate the absolute index: 8400*row_index+column_index.

Let's return to our empty loop. Assuming that the index loop counter is the index of the current column and that the coordinates of the bounding box are located in rows 0-3 of that column, we can extract them this way:

boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
    xc := output[index]
    yc := output[8400+index]
    w := output[2*8400+index]
    h := output[3*8400+index]
}
Enter fullscreen mode Exit fullscreen mode

Then you can calculate the corners of the bounding box and scale them to the size of the original image:

    x1 := (xc - w/2) / 640 * float32(img_width)
    y1 := (yc - h/2) / 640 * float32(img_height)
    x2 := (xc + w/2) / 640 * float32(img_width)
    y2 := (yc + h/2) / 640 * float32(img_height)
Enter fullscreen mode Exit fullscreen mode

Now, similarly, you need to get the probabilities of the object, which go in rows from 4 to 83, find the biggest one and the index of this probability, and save these values to the prob and class_id variables. You can write a nested loop that traverses rows from 4 to 83 and saves the highest value and its index:

class_id, prob := 0, float32(0.0)
for col := 0; col < 80; col++ {
    if output[8400*(col+4)+index] > prob {
        prob = output[8400*(col+4)+index]
        class_id = col
    }
}
Enter fullscreen mode Exit fullscreen mode

Then, having the maximum probability and class_id, you can either skip the object, if the probability is less than 0.5, or find the label of this class.

Here is a final code, that processes and collects bounding boxes to the boxes array:

boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
    class_id, prob := 0, float32(0.0)
    for col := 0; col < 80; col++ {
        if output[8400*(col+4)+index] > prob {
            prob = output[8400*(col+4)+index]
            class_id = col
        }
    }
    if prob < 0.5 {
        continue
    }
    label := yolo_classes[class_id]
    xc := output[index]
    yc := output[8400+index]
    w := output[2*8400+index]
    h := output[3*8400+index]
    x1 := (xc - w/2) / 640 * float32(img_width)
    y1 := (yc - h/2) / 640 * float32(img_height)
    x2 := (xc + w/2) / 640 * float32(img_width)
    y2 := (yc + h/2) / 640 * float32(img_height)
    boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
}
Enter fullscreen mode Exit fullscreen mode

The last step is to filter the boxes array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code does the same as the Python implementation, but looks slightly different because of the Go language specifics:

sort.Slice(boxes, func(i, j int) bool {
    return boxes[i][5].(float32) > boxes[j][5].(float32)
})
result := [][]interface{}{}
for len(boxes) > 0 {
    result = append(result, boxes[0])
    tmp := [][]interface{}{}
    for _, box := range boxes {
        if iou(boxes[0], box) < 0.7 {
            tmp = append(tmp, box)
        }
    }
    boxes = tmp
}
Enter fullscreen mode Exit fullscreen mode
  • First we sort the boxes by probability in descending order to put the boxes with the highest probability at the top.
  • In a loop, we put the box with the highest probability into the result array.
  • Then we create a temporary tmp array, and in the inner loop over all boxes we put into it only the boxes that do not overlap the selected one too much (that have IoU<0.7).
  • Then we overwrite the boxes array with the tmp array. This way, we filter out all overlapping boxes from the boxes array.
  • If some boxes still exist after filtering, the loop continues until the boxes array becomes empty.

Finally, the result variable contains all bounding boxes that should be returned.

That's all! For convenience, here is a full code of the process_output function:

func process_output(output []float32, img_width, img_height int64) [][]interface{} {
    boxes := [][]interface{}{}
    for index := 0; index < 8400; index++ {
        class_id, prob := 0, float32(0.0)
        for col := 0; col < 80; col++ {
            if output[8400*(col+4)+index] > prob {
                prob = output[8400*(col+4)+index]
                class_id = col
            }
        }
        if prob < 0.5 {
            continue
        }
        label := yolo_classes[class_id]
        xc := output[index]
        yc := output[8400+index]
        w := output[2*8400+index]
        h := output[3*8400+index]
        x1 := (xc - w/2) / 640 * float32(img_width)
        y1 := (yc - h/2) / 640 * float32(img_height)
        x2 := (xc + w/2) / 640 * float32(img_width)
        y2 := (yc + h/2) / 640 * float32(img_height)
        boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
    }

    sort.Slice(boxes, func(i, j int) bool {
        return boxes[i][5].(float32) > boxes[j][5].(float32)
    })
    result := [][]interface{}{}
    for len(boxes) > 0 {
        result = append(result, boxes[0])
        tmp := [][]interface{}{}
        for _, box := range boxes {
            if iou(boxes[0], box) < 0.7 {
                tmp = append(tmp, box)
            }
        }
        boxes = tmp
    }
    return result
}
Enter fullscreen mode Exit fullscreen mode

If you'd like to work with this output in a more convenient "Pythonic" way, there is the Gorgonia Tensor library that emulates features of NumPy in Go. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and then traverse the detected objects by row.

However, the option to work with the single-dimension array as with a matrix, described in this section, is the most efficient, because we get all the values we need without additional array transformations. I think that installing an additional external dependency is overkill for this case.

That is it for the Go implementation. If you wrote everything correctly, then you can start this web service by running the following command:

go run main.go
Enter fullscreen mode Exit fullscreen mode

and open http://localhost:8080 in a web browser.

The code that we developed here is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include anything except working with ONNX. It does not include any resource management, error processing or exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.

The full reference of the Go library for ONNX runtime is available here.

You can find the source code of the Go object detector web service in this repository.

Create a web service on Rust

This article cannot be complete without an example in a low-level language: a high-performance and efficient language in which developers manage memory themselves instead of relying on a garbage collector. I was thinking which one to choose, C++ or Rust. Finally, I decided to ask people and created the following poll in the LinkedIn group:

Image description

Regardless of the received results, I also analyzed the comments and understood that most likely people answered a different question than the one I asked. I did not ask "Which of these programming languages do you know?", or "Which of them do you like?" or "Which of them is the most popular?". Instead, I asked: "Which is better to learn TODAY to create NEW high performance server applications?".

Finally, I got only one valuable comment:

Image description

It was the only comment that received some likes and I completely agree with that text.

Finally, the choice was made! We are going to create an object detection web service on Rust - the safest low-level programming language today.

Setup the project

Enter the command to create a new Rust project:

cargo new object_detector
Enter fullscreen mode Exit fullscreen mode

This will create an object_detector folder with a project template in it.

Go to this folder and open the Cargo.toml file in it.

Write the following packages to the dependencies section:

[dependencies]
image = "0.24.6"
ndarray = "0.15.6"
ort = "1.14.6"
serde = "1.0.84"
serde_derive = "1.0.84"
serde_json = "1.0.36"
rocket = "=0.5.0-rc.3"
Enter fullscreen mode Exit fullscreen mode

Create a Rocket.toml file which will contain configuration for the Rocket web server and add the following lines to it:

[global]
address = "0.0.0.0"
port = 8080
Enter fullscreen mode Exit fullscreen mode

We are not going to change the frontend, so copy the index.html to the project. Also, copy the yolov8m.onnx model.

Before continuing, ensure that the ONNX runtime is installed on your operating system, because the library integrated into the Rust package may not work correctly. To install it, you can download the archive for your operating system from here, extract it and copy the contents of the "lib" subfolder to the system libraries path of your operating system.

The main project file, main.rs, is already generated and located in the src subfolder. Open this file and add the following boilerplate code to it:

use std::{sync::Arc, path::Path, vec};
use image::{GenericImageView, imageops::FilterType};
use ndarray::{Array, IxDyn, s, Axis};
use ort::{Environment,SessionBuilder,tensor::InputTensor};
use rocket::{response::content,fs::TempFile,form::Form};
#[macro_use] extern crate rocket;

#[rocket::main]
async fn main() {
    rocket::build()
        .mount("/", routes![index])
        .mount("/detect", routes![detect])
        .launch().await.unwrap();
}

#[get("/")]
fn index() -> content::RawHtml<String> {
    return content::RawHtml(std::fs::read_to_string("index.html").unwrap());
}

#[post("/", data = "<file>")]
fn detect(file: Form<TempFile<'_>>) -> String {
    let buf = std::fs::read(file.path().unwrap_or(Path::new(""))).unwrap_or(vec![]);
    let boxes = detect_objects_on_image(buf);
    return serde_json::to_string(&boxes).unwrap_or_default()
}

fn detect_objects_on_image(buf: Vec<u8>) -> Vec<(f32,f32,f32,f32,&'static str,f32)> {
    let (input,img_width,img_height) = prepare_input(buf);
    let output = run_model(input);
    return process_output(output, img_width, img_height);    
}

fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {

}

fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {

}

fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {

}

The first block imports the required modules:

  • image - to process images
  • ndarray - to work with tensors
  • ort - ONNX runtime library
  • rocket - Rocket Web framework
  • std - some objects from Rust standard library

Then, in the main function, we start the Rocket web server and attach the index and detect routes to it.

The index function serves the root of the service; it just returns the content of the index.html file as HTML.

The detect function serves the /detect endpoint. It receives the uploaded file, passes it to detect_objects_on_image, receives the array of bounding boxes, serializes it to JSON and returns this JSON string to the frontend.

The detect_objects_on_image function implements the same actions as the Python version. It converts the image to a multidimensional array of numbers, passes it to the ONNX runtime and processes the output. Finally, it returns the array of bounding boxes, where each bounding box is a tuple of (x1,y1,x2,y2,label,prob). Rust is a strongly typed language, so we have to specify the types of all items in this tuple. That is why the function returns Vec<(f32,f32,f32,f32,&'static str,f32)>, which is a vector of bounding box tuples.
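
As a small optional tweak, which is not part of the original code, a type alias can make this long tuple type easier to read in function signatures:

// Hypothetical readability helper: give the bounding box tuple a name.
type BoundingBox = (f32, f32, f32, f32, &'static str, f32);

fn detect_objects_on_image(buf: Vec<u8>) -> Vec<BoundingBox> {
    let (input, img_width, img_height) = prepare_input(buf);
    let output = run_model(input);
    return process_output(output, img_width, img_height);
}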

Then we define stubs for the prepare_input, run_model and process_output functions, which will be implemented one by one in the following sections.

Prepare the input

To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert it to a tensor of shape (1,3,640,640), which is a single image represented as three 640x640 matrices. The first matrix holds the red components of the image pixels, the second the greens, and the last the blues. We will use the ndarray library to construct this tensor and fill it with pixel color values. But first we need to load the image and resize it to 640x640:

let img = image::load_from_memory(&buf).unwrap();
let (img_width, img_height) = (img.width(), img.height());
let img = img.resize_exact(640, 640, FilterType::CatmullRom);
  • In the first line, the image is loaded from the uploaded file buffer
  • Next, we save the original image width and height for later use
  • Finally, we resize the image to 640x640

Then, let's construct the input array of required shape:

let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();

This line creates a new 4-dimensional tensor filled with zeros.

Now you need access to the image pixels and their color components. The img object has a pixels() method, which returns an iterator over the image pixels. You can use it to access each pixel in a loop:

for pixel in img.pixels() {
}

Each pixel here is a tuple with the items that we need (see the small snippet after this list):

  • x - the first item, the x coordinate of the pixel
  • y - the second item, the y coordinate of the pixel
  • color - the third item, an object that contains an array of 4 values [r,g,b,a]: the color components of the pixel.
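
For instance, a throwaway snippet like this (not part of the service code) prints the first tuple produced by the iterator:

// Inspect the first pixel returned by pixels(): (x, y, color).
if let Some((x, y, color)) = img.pixels().next() {
    println!("pixel at ({}, {}) has RGBA components {:?}", x, y, color.0);
}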

Having this, you can fill the tensor input in a loop:

for pixel in img.pixels() {
    let x = pixel.0 as usize;
    let y = pixel.1 as usize;
    let [r,g,b,_] = pixel.2.0;
    input[[0, 0, y, x]] = (r as f32) / 255.0;
    input[[0, 1, y, x]] = (g as f32) / 255.0;
    input[[0, 2, y, x]] = (b as f32) / 255.0;
};
  • First, we extract the x and y values and convert them to a type that can be used as a tensor index
  • Then we destructure the color into the r, g and b variables.
  • Finally, we put these pixel color components into the appropriate cells of the tensor. Notice that y goes first and x goes second, because in matrices the first dimension is the row and the second is the column.

So, now you have the input prepared for the neural network. You need to return it from the function along with img_width and img_height. Here is the full source of prepare_input:

fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {
    let img = image::load_from_memory(&buf).unwrap();
    let (img_width, img_height) = (img.width(), img.height());
    let img = img.resize_exact(640, 640, FilterType::CatmullRom);
    let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();
    for pixel in img.pixels() {
        let x = pixel.0 as usize;
        let y = pixel.1 as usize;
        let [r,g,b,_] = pixel.2.0;
        input[[0, 0, y, x]] = (r as f32) / 255.0;
        input[[0, 1, y, x]] = (g as f32) / 255.0;
        input[[0, 2, y, x]] = (b as f32) / 255.0;
    };
    return (input, img_width, img_height);
}

Now, it's time to pass this input through the YOLOv8 model.

Run the model

The run_model function passes the input tensor through the model and returns the output tensor. Here is its source code:

fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {
    let input = InputTensor::FloatTensor(input);
    let env = Arc::new(Environment::builder().with_name("YOLOv8").build().unwrap());
    let model = SessionBuilder::new(&env).unwrap().with_model_from_file("yolov8m.onnx").unwrap();
    let outputs = model.run([input]).unwrap();
    let output = outputs.get(0).unwrap().try_extract::<f32>().unwrap().view().t().into_owned();
    return output;
}
  • First, it converts the input to the internal ONNX runtime tensor format
  • Then it creates the environment and instantiates the ONNX model in it from the yolov8m.onnx file.
  • Then it runs the model with the input tensor and receives the array of outputs.
  • Finally, it extracts the first output, transposes it and returns it. (A sketch of how to create the session only once, instead of on every request, follows this list.)
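
Since run_model builds the environment and loads yolov8m.onnx on every request, a natural improvement is to create the session once and reuse it. Below is a rough sketch of one way to do that with the once_cell crate. This is an assumption-laden illustration, not part of the reference code: you would need to add once_cell to Cargo.toml yourself, and you should verify that the Session type is exported at this path and is Send in your version of the ort crate:

use once_cell::sync::Lazy;
use std::sync::Mutex;
use ort::Session;

// Build the ONNX environment and session only once, on first use.
static MODEL: Lazy<Mutex<Session>> = Lazy::new(|| {
    let env = Arc::new(Environment::builder().with_name("YOLOv8").build().unwrap());
    let session = SessionBuilder::new(&env).unwrap()
        .with_model_from_file("yolov8m.onnx").unwrap();
    Mutex::new(session)
});

fn run_model(input: Array<f32, IxDyn>) -> Array<f32, IxDyn> {
    let input = InputTensor::FloatTensor(input);
    // Reuse the cached session instead of rebuilding it for every image.
    let model = MODEL.lock().unwrap();
    let outputs = model.run([input]).unwrap();
    return outputs.get(0).unwrap().try_extract::<f32>().unwrap().view().t().into_owned();
}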

The returned output is an ndarray tensor, so we can traverse it in a loop. Let's process it.

Process the output

The process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Rust. Add them to your code below the process_output function:

fn iou(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    return intersection(box1, box2) / union(box1, box2);
}

fn union(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
    let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
    let box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1);
    let box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1);
    return box1_area + box2_area - intersection(box1, box2);
}

fn intersection(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
    let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
    let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
    let x1 = box1_x1.max(box2_x1);
    let y1 = box1_y1.max(box2_y1);
    let x2 = box1_x2.min(box2_x2);
    let y2 = box1_y2.min(box2_y2);
    return (x2-x1)*(y2-y1);
}
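
To make the math concrete, here is a tiny sanity check that you could temporarily drop at the beginning of the main function; the box values are hypothetical:

// Two 100x100 boxes shifted by 50 pixels horizontally:
// intersection = 50*100 = 5000, union = 10000 + 10000 - 5000 = 15000.
let box1 = (0.0, 0.0, 100.0, 100.0, "person", 0.9);
let box2 = (50.0, 0.0, 150.0, 100.0, "person", 0.8);
println!("IoU = {}", iou(&box1, &box2)); // prints approximately 0.33333334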

Also, we will need labels for the detected objects, so include this array of COCO class labels:

const YOLO_CLASSES:[&str;80] = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
    "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
];

Now let's start writing the process_output function.

Let's define an array into which you will put the collected bounding boxes:

let mut boxes = Vec::new();

The output of the YOLOv8 model is a tensor, and because run_model transposed it, here it has the shape [8400,84,1] instead of the [1,84,8400] you saw in the other languages. It's already ordered by rows, but it has an extra dimension at the end. Let's remove it:

let output = output.slice(s![..,..,0]);

This line extracts the (8400,84) matrix from the tensor, and we can traverse it along the first axis, i.e. by rows:

for row in output.axis_iter(Axis(0)) {
}

Here, row is a one-dimensional ndarray object that represents a row of 84 float numbers. It will be more convenient to convert it to a plain vector, so let's do that:

for row in output.axis_iter(Axis(0)) {
    let row:Vec<_> = row.iter().map(|x| *x).collect();
}

The first 4 items of this array contain the bounding box coordinates, and we can convert and scale them to x1,y1,x2,y2 right away:

let xc = row[0]/640.0*(img_width as f32);
let yc = row[1]/640.0*(img_height as f32);
let w = row[2]/640.0*(img_width as f32);
let h = row[3]/640.0*(img_height as f32);
let x1 = xc - w/2.0;
let x2 = xc + w/2.0;
let y1 = yc - h/2.0;
let y2 = yc + h/2.0;

Then, items 4 to 83 are the probabilities that this bounding box contains an object of each of the 80 object classes. You need to find the maximum of these items and its index, which can be used as the ID of the object class. You can do this in a loop:

let mut class_id = 0;
let mut prob:f32 = 0.0;
for index in 4..row.len() {
    if row[index]>prob {
        prob = row[index];
        class_id = index-4;
    }
}
let label = YOLO_CLASSES[class_id];

Here we determined the maximum probability, the class_id of the object with the maximum probability and the label of the corresponding class.

It works fine, but I would rather implement it in a functional way instead of a loop:

let (class_id, prob) = row.iter().skip(4).enumerate()
    .map(|(index,value)| (index,*value))
    .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
let label = YOLO_CLASSES[class_id];
  • This code gets an iterator over the row elements, starting from the 4th item.
  • Then it maps the items to tuples of (class_id, prob).
  • Then it reduces these tuples to the single one with the maximum prob.
  • Finally, the resulting tuple is destructured into the class_id and prob variables. (An equivalent version that uses the max_by adapter is shown after this list.)
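
For reference, the same search can be expressed with the standard max_by adapter; this is an equivalent alternative, not what the final code below uses:

// Alternative: pick the (class_id, prob) pair with the highest probability.
let (class_id, prob) = row.iter().skip(4).enumerate()
    .map(|(index, value)| (index, *value))
    .max_by(|a, b| a.1.total_cmp(&b.1))
    .unwrap();
let label = YOLO_CLASSES[class_id];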

Finally, you skip the row if prob < 0.5; otherwise, you collect all values into a bounding box and push this bounding box to the boxes array.

Here is all the code that we have so far, with the operations in the correct order:

let mut boxes = Vec::new();
let output = output.slice(s![..,..,0]);
for row in output.axis_iter(Axis(0)) {
    let row:Vec<_> = row.iter().map(|x| *x).collect();
    let (class_id, prob) = row.iter().skip(4).enumerate()
        .map(|(index,value)| (index,*value))
        .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
    if prob < 0.5 {
        continue
    }
    let label = YOLO_CLASSES[class_id];
    let xc = row[0]/640.0*(img_width as f32);
    let yc = row[1]/640.0*(img_height as f32);
    let w = row[2]/640.0*(img_width as f32);
    let h = row[3]/640.0*(img_height as f32);
    let x1 = xc - w/2.0;
    let x2 = xc + w/2.0;
    let y1 = yc - h/2.0;
    let y2 = yc + h/2.0;
    boxes.push((x1,y1,x2,y2,label,prob));
}

P.S. Actually, it's possible to implement all of this in a functional way instead of a loop. You can do it as homework.

Finally, you need to filter the boxes array to exclude the boxes that overlap each other, using Intersection over Union. The filtered boxes should be collected into the result array:

let mut result = Vec::new();
boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
while boxes.len()>0 {
    result.push(boxes[0]);
    boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
}
  • First, we sort the boxes by probability in descending order to put the boxes with the highest probability at the top.
  • Then, in a loop, we push the first box, the one with the highest probability, to the resulting array.
  • Then, we overwrite the boxes array using a filter that keeps only those boxes whose IoU with the selected box is less than 0.7.
  • If, after filtering, the boxes array still contains elements, the loop continues.

Finally, after the loop, the boxes array will be empty and result will contain the bounding boxes of all the distinct detected objects.
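
To see this filtering in action, here is a small hand-made walkthrough that reuses the iou helper defined earlier; the box values are hypothetical, and you could run it in a separate test function:

// Two strongly overlapping "cat" boxes and one distant "dog" box.
let mut boxes = vec![
    (0.0, 0.0, 100.0, 100.0, "cat", 0.9),
    (5.0, 5.0, 105.0, 105.0, "cat", 0.8),   // IoU with the first box is about 0.82
    (300.0, 300.0, 400.0, 400.0, "dog", 0.7),
];
let mut result = Vec::new();
boxes.sort_by(|box1, box2| box2.5.total_cmp(&box1.5));
while boxes.len() > 0 {
    result.push(boxes[0]);
    boxes = boxes.iter().filter(|box1| iou(&boxes[0], box1) < 0.7).map(|x| *x).collect();
}
println!("{:?}", result); // keeps the 0.9 "cat" box and the "dog" box, drops the 0.8 duplicate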

The result array should be returned by this function. Here is the whole code:

fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {
    let mut boxes = Vec::new();
    let output = output.slice(s![..,..,0]);
    for row in output.axis_iter(Axis(0)) {
        let row:Vec<_> = row.iter().map(|x| *x).collect();
        let (class_id, prob) = row.iter().skip(4).enumerate()
            .map(|(index,value)| (index,*value))
            .reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
        if prob < 0.5 {
            continue
        }
        let label = YOLO_CLASSES[class_id];
        let xc = row[0]/640.0*(img_width as f32);
        let yc = row[1]/640.0*(img_height as f32);
        let w = row[2]/640.0*(img_width as f32);
        let h = row[3]/640.0*(img_height as f32);
        let x1 = xc - w/2.0;
        let x2 = xc + w/2.0;
        let y1 = yc - h/2.0;
        let y2 = yc + h/2.0;
        boxes.push((x1,y1,x2,y2,label,prob));
    }

    boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
    let mut result = Vec::new();
    while boxes.len()>0 {
        result.push(boxes[0]);
        boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
    }
    return result;
}

That is it for the Rust web service. If everything is written correctly, you can start the web service by running the following command in the project folder:

cargo run

and open http://localhost:8080 in a web browser.

The code that we developed here is oversimplified. It's intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include any other details except working with ONNX: there is no resource management, error processing or exception handling. These tasks depend on real use cases, and it's up to you to implement them for your projects.

A full reference of the Rust library for the ONNX runtime is available here.

You can find the source code of the Rust object detection web service in this repository.

Conclusion

In this article I showed that even though the YOLOv8 neural network was created in Python, you can use it from other programming languages, because it can be exported to the universal ONNX format.

We explored the foundational algorithms used to prepare the input and process the output of the ONNX model, which are the same for all programming languages that have interfaces to the ONNX runtime.

After covering the main concepts, I showed how to create an object detection web service based on the ONNX runtime using Python, Julia, Node.js, JavaScript, Go and Rust. Each language has its differences, but in general the whole workflow follows the same algorithm.

You can apply this experience to any other neural network created using PyTorch or TensorFlow (which covers most of the neural networks in existence), because each framework can export its models to ONNX.

There are ONNX runtime interfaces for other programming languages like Java, C# or C++ and for other platforms, including mobile phones. You can find the list of official bindings here.

Also, there are unofficial bindings for other languages, like PHP. It's a great way to integrate neural networks into WordPress websites.

I believe that it won't be difficult to rewrite the projects that we created here in those other languages, if you know those languages, of course.

In the next article, I will show how to detect objects in a video in a web browser in real time. Follow me to be the first to know when I publish it.

You can find me on LinkedIn, Twitter, and Facebook to be the first to know about new articles like this one and other software development news.

Have fun coding and never stop learning!

Top comments (21)

Michele Moscaritolo

Impressive and outstanding! You wrote a wonderful, unique article. I'm sure your expertise will help many developers like it did help me.

minor suggestion add the support to the GPU, for example in Julia:
load_inference("yolov8m.onnx", execution_provider=:cuda)

And then you have what I believe is the Bible

Bravo! Standing ovation

Chidsanuphong Pengchai

This is what I'm looking for, tysm!

aajad-p

Great explanation.. thank you sir,
Can you brief about SAM(segment anything model) in ONNX format and can use in any language.

Andrey Germanov • Edited

Yes, in the next article. So, subscribe and stay tuned ;)

YuboLong

I'm currently working on deploying yolov8 to java , your post-process code really helps me lot , thanks !

Antonio

Thank you Andrey for this post. It has been really enlightening and helpful in my task. Keep up with the good work!

Muhammad Zain Ul Haque

I just login here to say this is the best article with too much detail I have ever read in my life on computer vision. Respect for the author.

Andrey Germanov

Thanks a lot!

dean.du

Great work, thank you so much. You have created an amazing article.

camerayuhang

I have never, ever, ever seen such an incredibly amazing tutorial like this. This article is truly unique and wonderful.

fatmaboodai

Hello I tried to use this tutorial but with the classifying model but i'm getting this error :
ERROR:object-detection:Exception on /detect [POST]
Traceback (most recent call last):
File "C:\Users\mega\Desktop\YoloPlayGround\YoloPlayGround\lib\site-packages\flask\app.py", line 1455, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\mega\Desktop\YoloPlayGround\YoloPlayGround\lib\site-packages\flask\app.py", line 869, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\mega\Desktop\YoloPlayGround\YoloPlayGround\lib\site-packages\flask\app.py", line 867, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\mega\Desktop\YoloPlayGround\YoloPlayGround\lib\site-packages\flask\app.py", line 852, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "C:\Users\mega\Desktop\YoloPlayGround\object-detection.py", line 24, in detect

boxes = detect_objects_on_image(buf.stream)
File "C:\Users\mega\Desktop\YoloPlayGround\object-detection.py", line 29, in detect_objects_on_image
output = run_model(input)
File "C:\Users\mega\Desktop\YoloPlayGround\object-detection.py", line 53, in run_model

outputs = model.run(["output0"], {"images":input})
File "C:\Users\mega\Desktop\YoloPlayGround\YoloPlayGround\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: images for the following indices
index: 2 Got: 640 Expected: 224
index: 3 Got: 640 Expected: 224

Andrey Germanov • Edited

However, if you are interested in how to parse the YOLOv8 classification model output, then I can help you.

Here is the code, that you need to rewrite and use for this:

def detect_objects_on_image(buf):
    input = prepare_input(buf)
    output = run_model(input)
    return process_output(output)

def prepare_input(buf):
    img = Image.open(buf)
    img = img.resize((224, 224))
    img = img.convert("RGB")
    input = np.array(img)
    input = input.transpose(2, 0, 1)
    input = input.reshape(1, 3, 224, 224) / 255.0
    return input.astype(np.float32)

def run_model(input):
    model = ort.InferenceSession("yolov8m-cls.onnx", providers=['CPUExecutionProvider'])
    outputs = model.run(["output0"], {"images":input})
    return outputs[0]

def process_output(output):
    class_id = output[0].argmax()
    return imagenet_classes[class_id] # returns only a label of detected class

imagenet_classes = {}  # get ImageNet class labels from here: https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a

But you can use this only in other applications, because the web application in this tutorial requires bounding boxes of detected objects, which are not available in classifying model's output.

fatmaboodai

Thank you so much i really appreciate it

Andrey Germanov • Edited

Hello,

Unfortunately, this tutorial was written for object detection and can't be used for YOLOv8 classifying model, because this model is very different:

  1. It requires an image resized to 224x224 (this is the error, that you've got)
  2. It trained not on COCO dataset with 80 classes, but on ImageNet dataset with 1000 classes (80 class labels, provided in the tutorial, do not match 1000 class labels of ImageNet).
  3. It returns an output of (1, 1000) shape, that contains probabilities of each of 1000 ImageNet classes and does not contain any bounding boxes (output processing function should be different).

Moreover, the web application, that you create by following this tutorial used to draw bounding boxes of detected objects on the image, but YOLOv8 classifying model does not return bounding boxes, it returns only class ids for the whole image. That is why, it will not work for this application.

So, sorry, but it's wrong tutorial for this task.

fatmaboodai

Thank you so much

Chidsanuphong Pengchai

I have a question in pre-processing image, Is padding neccessary or important in this step

Andrey Germanov

When I tried padding with YOLOv8, I didn't see a difference in quality of results.

Chidsanuphong Pengchai • Edited

thank you.

DevanshSahni

Thankyou so much for such a great explanation!

fatmaboodai

Hello,
I was trying to push my project to github after following your tutorial and it’s says that the folder are too big
How were you are to push your project?

Andrey Germanov

Perhaps the model file is too big for GitHub. I've experienced this sometimes. In this case, you need to push without model file.