<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serhii Maksymenko</title>
    <description>The latest articles on DEV Community by Serhii Maksymenko (@datamaxy).</description>
    <link>https://dev.to/datamaxy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353713%2F58f2bfd1-c559-4e1e-89a2-94d19123d045.jpg</url>
      <title>DEV Community: Serhii Maksymenko</title>
      <link>https://dev.to/datamaxy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datamaxy"/>
    <language>en</language>
    <item>
      <title>AI Assisted Real-Time Video Processing</title>
      <dc:creator>Serhii Maksymenko</dc:creator>
      <pubDate>Thu, 11 Mar 2021 13:36:32 +0000</pubDate>
      <link>https://dev.to/mobidev/ai-assisted-real-time-video-processing-5dlo</link>
      <guid>https://dev.to/mobidev/ai-assisted-real-time-video-processing-5dlo</guid>
      <description>&lt;p&gt;This article was written based on our research and expertise of building real-time video processing products, together with creating pipelines for applying Machine Learning and Deep Learning models.&lt;/p&gt;

&lt;p&gt;When it comes to real-time video processing, the data pipeline becomes more complex to handle: we strive to minimize latency in streaming video while also ensuring sufficient accuracy of the implemented models.&lt;/p&gt;

&lt;p&gt;Overall, the livestreaming industry has grown by up to 99% in hours watched since last year, according to &lt;a href="https://www.dailyesports.gg/streaming-platforms-show-massive-growth-since-last-year-except-mixer/"&gt;dailyesports.gg statistics&lt;/a&gt;. This growth is set to change the fan experience, gaming, telemedicine, and more. Moreover, &lt;a href="https://www.grandviewresearch.com/press-release/global-video-streaming-market"&gt;Grand View Research reports&lt;/a&gt; that the video streaming market will be worth USD 184.27 billion by 2027.&lt;/p&gt;

&lt;h2&gt;AI-driven Live Video Processing Use Cases&lt;/h2&gt;

&lt;p&gt;Trained models that can detect certain objects are not a simple thing to create. But when it comes to children in a kindergarten, security is a top priority: such models may help prevent, for example, a kid from running away or slipping out. Another runaway example would be animals leaving the borders of a farm, zoo, or reserve.&lt;/p&gt;

&lt;p&gt;Organizations that store and process facial images for identification and authentication sometimes need to implement security solutions to ensure privacy and meet GDPR data protection requirements. Examples include blurring faces while streaming conferences and meetings via YouTube, CCTV, or private channels, and on security cameras in manufacturing buildings and shopping malls.&lt;/p&gt;

&lt;p&gt;Another area, &lt;a href="https://mobidev.biz/blog/ai-visual-inspection-deep-learning-computer-vision-defect-detection"&gt;AI-based Visual Inspection for Defect Detection&lt;/a&gt;, is being implemented at manufacturing facilities on their way to becoming fully robotic. Computer vision makes it easier to spot manufacturing flaws: integrating deep learning methods into visual inspection technology allows differentiating parts, anomalies, and characters, imitating human visual inspection with a computerized system.&lt;/p&gt;

&lt;h2&gt;How to Speed up Real Time Video Processing?&lt;/h2&gt;

&lt;p&gt;The technical problem we are solving is to blur the faces of video subjects quickly and accurately during a live stream, without quality loss, using Artificial Intelligence.&lt;/p&gt;

&lt;p&gt;In short, video processing can be sketched as a series of consecutive steps: decoding, computation, and encoding. This looks simple at first glance, but criteria such as speed, accuracy, and flexibility complicate the picture. The final solution is supposed to be flexible in terms of input, output, and configuration.&lt;/p&gt;

&lt;p&gt;There are two ways to make processing faster while keeping accuracy at a reasonable level: 1) run work in parallel; 2) speed up the algorithms themselves.&lt;/p&gt;

&lt;p&gt;Basically, there are two approaches to parallelizing the work: file splitting and pipeline architecture.&lt;/p&gt;

&lt;p&gt;The first one, file splitting, makes the algorithms run in parallel, so it may be possible to keep using slower yet more accurate models. It is implemented by splitting the video into parts and processing them in parallel. The splitting is virtual: it generates frame ranges rather than real sub-files. However, this approach is not well suited to real-time processing, because it is difficult to pause, resume, or move the processing to a different position in the timeline.&lt;/p&gt;
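&lt;p&gt;As a rough illustration, virtual file splitting can be sketched in Python: only (start, stop) frame ranges are generated, never real sub-files. The frame counts and the process_chunk function below are hypothetical stand-ins for the actual decode-and-blur work.&lt;/p&gt;

```python
from multiprocessing import Pool

TOTAL_FRAMES = 3000  # assumed length of the source video, in frames
CHUNK = 500          # frames per virtual "sub-file"

def virtual_chunks(total, size):
    # Virtual splitting: only (start, stop) frame ranges are produced;
    # no real sub-files are written to disk.
    return [(start, min(start + size, total)) for start in range(0, total, size)]

def process_chunk(bounds):
    # Stand-in for the real work: seek to `start`, decode frames,
    # run the (slower, more accurate) model, re-encode.
    start, stop = bounds
    return ("processed", start, stop)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, virtual_chunks(TOTAL_FRAMES, CHUNK))
    print(len(results))  # prints 6
```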

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R-49sLgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2h22vckck0imim6z7vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R-49sLgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2h22vckck0imim6z7vz.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second one, pipeline architecture, accelerates the algorithms themselves, or parts of them, with no significant loss of accuracy. Instead of splitting the video, the pipeline approach splits and parallelizes the operations performed during processing. Because of this, the pipeline approach is more flexible.&lt;/p&gt;

&lt;p&gt;Why is the pipeline approach more flexible? One of its benefits is how easily components can be swapped to fit requirements. Decoding can read from a video file, and encoding can write frames into another file.&lt;/p&gt;

&lt;p&gt;Alternatively, the input can be an RTSP stream from an IP camera, and the output can be a WebRTC connection in a browser or mobile application. A unified architecture based on a video stream covers all combinations of input and output formats. The computation step is not necessarily a monolithic operation.&lt;/p&gt;
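&lt;p&gt;A minimal sketch of such a pipeline, assuming Python's multiprocessing module: three stages connected by queues. The stage functions and the integer frame placeholders are illustrative stand-ins for real decoding, computation, and encoding, not the production implementation.&lt;/p&gt;

```python
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the stream

def decode(out_q):
    # Stand-in for a decoder: any input (file, RTSP stream) is reduced
    # here to a uniform sequence of frames.
    for frame_id in range(5):
        out_q.put(frame_id)
    out_q.put(SENTINEL)

def compute(in_q, out_q):
    # Stand-in for the computation stage (e.g. face detection + blurring).
    while True:
        frame = in_q.get()
        if frame is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(("blurred", frame))

def encode(in_q, sink):
    # Stand-in for an encoder: the sink could be a file or a WebRTC track.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        sink.append(item)

if __name__ == "__main__":
    q1, q2 = Queue(), Queue()
    Process(target=decode, args=(q1,)).start()
    Process(target=compute, args=(q1, q2)).start()
    frames = []
    encode(q2, frames)
    print(frames)
```

&lt;p&gt;Swapping the input format then only means replacing the decode stage; the rest of the pipeline is untouched.&lt;/p&gt;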

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jSKpgql8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvaejk9wgliczwon7n2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jSKpgql8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvaejk9wgliczwon7n2d.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How to Implement a Pipeline Approach&lt;/h2&gt;

&lt;p&gt;As part of one of the projects, we had to process video in real time using AI algorithms.&lt;/p&gt;

&lt;p&gt;The pipeline was composed of decoding, face detection, face blurring, and encoding stages. Flexibility was essential in this case because we had to process not only video files but also live streams in different formats. The system achieved a good frame rate of 30-60 FPS depending on the configuration.&lt;/p&gt;

&lt;h3&gt;INTERPOLATION WITH TRACKING&lt;/h3&gt;

&lt;p&gt;We used a centroid-based tracking algorithm because it is easy to apply. When needed, other algorithms such as Deep SORT can be used instead, but they significantly reduce speed when there are many faces in the video. That is why interpolation should be used in addition.&lt;/p&gt;
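&lt;p&gt;A minimal sketch of centroid-based tracking: each detection reuses the ID of the nearest known centroid when it lies within a matching radius, otherwise it gets a fresh ID. The MATCH_RADIUS value and the class shape are assumptions for illustration.&lt;/p&gt;

```python
import math
import operator

MATCH_RADIUS = 50.0  # max pixel distance for reusing an existing track ID (assumption)

class CentroidTracker:
    """Minimal centroid tracker: reuse the ID of the nearest known centroid."""

    def __init__(self):
        self.next_id = 0
        self.tracks = {}  # track ID: (x, y) centroid

    def update(self, centroids):
        assigned = {}
        for (x, y) in centroids:
            if self.tracks:
                best_id = min(
                    self.tracks,
                    key=lambda t: math.dist(self.tracks[t], (x, y)),
                )
                # Reuse the ID only when the nearest known centroid
                # is within the matching radius.
                if operator.lt(math.dist(self.tracks[best_id], (x, y)), MATCH_RADIUS):
                    assigned[best_id] = (x, y)
                    continue
            assigned[self.next_id] = (x, y)
            self.next_id += 1
        self.tracks = assigned
        return assigned
```

&lt;p&gt;On the next frame, a face that moved only a few pixels keeps the same ID, which is what lets interpolation fill in the skipped frames between detections.&lt;/p&gt;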

&lt;p&gt;What is the quality of the interpolated frames? Since some frames are skipped, we wanted to measure the quality of the frames reconstructed by interpolation. Therefore, we calculated the F1 metric and verified that interpolation does not introduce too many false positives and false negatives. The F1 value was around 0.95 for most of the video examples.&lt;/p&gt;
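&lt;p&gt;For reference, the F1 metric combines precision and recall into a single number. A quick sketch with hypothetical detection counts shows how a value around 0.95 arises:&lt;/p&gt;

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts for interpolated detections (hypothetical numbers):
print(round(f1_score(tp=190, fp=10, fn=10), 2))  # prints 0.95
```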

&lt;h3&gt;SHARING MEMORY&lt;/h3&gt;

&lt;p&gt;The next stage of this process is sharing memory. Sending data through a queue between processes in Python is usually quite slow, so a faster way to share frames between processes was needed.&lt;/p&gt;

&lt;p&gt;The PyTorch version of multiprocessing can pass a tensor handle through a queue so that another process simply gets a pointer to the existing GPU memory. In our case, however, another approach was used: a system-level inter-process communication (IPC) mechanism with shared memory based on the POSIX API. The speed of inter-process communication improved dramatically with the help of Python libraries that provide an interface for using this memory.&lt;/p&gt;
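&lt;p&gt;A minimal sketch of POSIX-style shared memory in Python, using the standard multiprocessing.shared_memory module (available since Python 3.8). The frame size and the 4-byte payload below are placeholders; a real pipeline would map decoded frames into the segment so that no copy travels through a queue.&lt;/p&gt;

```python
from multiprocessing import Process
from multiprocessing import shared_memory

FRAME_BYTES = 1920 * 1080 * 3  # one raw BGR frame (assumed resolution)

def producer(name):
    # Attach to the existing segment and write a "frame" into it in place.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:4] = bytes([1, 2, 3, 4])  # stand-in for decoded pixel data
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=FRAME_BYTES)
    p = Process(target=producer, args=(shm.name,))
    p.start()
    p.join()
    # The consumer reads the same physical memory: no copy through a queue.
    print(bytes(shm.buf[:4]))
    shm.close()
    shm.unlink()
```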

&lt;h3&gt;MULTIPLE WORKERS OR MULTIPROCESSING&lt;/h3&gt;

&lt;p&gt;Finally, several workers can be added to a pipeline component to reduce processing time. We applied this at the face detection stage, and it can be done for any heavy operation that does not need ordered input. Whether it helps depends on the operations performed inside the pipeline: if face detection is already comparatively fast, FPS may even drop after adding more detection workers.&lt;/p&gt;

&lt;p&gt;The time required for managing one more process can exceed the time gained by adding it. Also, neural networks used by multiple workers will compute tensors in a serial CUDA stream unless a separate stream is created for each network, which may be tricky to implement.&lt;/p&gt;

&lt;p&gt;Because of their concurrent nature, multiple workers cannot guarantee that the output order matches the input sequence. Therefore, additional effort is required to restore the order in later pipeline stages, for example while encoding. Skipping frames may cause the same problem.&lt;/p&gt;
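&lt;p&gt;Restoring the order can be sketched as a small buffering step, assuming each worker tags its result with the input frame index:&lt;/p&gt;

```python
def reorder(results):
    """Restore frame order from workers that may finish out of sequence.

    `results` yields (frame_index, frame) pairs; frames are emitted
    strictly in index order, buffering any that arrive early.
    """
    buffer = {}
    next_index = 0
    for index, frame in results:
        buffer[index] = frame
        while next_index in buffer:
            yield buffer.pop(next_index)
            next_index += 1

out = list(reorder([(1, "b"), (0, "a"), (3, "d"), (2, "c")]))
print(out)  # prints ['a', 'b', 'c', 'd']
```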

&lt;p&gt;With this setup, detection time was reduced almost by half, given 2 workers running a model with a MobileNetV2 backbone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M_pxovCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q52zg7rafjpj3wd87pvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M_pxovCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q52zg7rafjpj3wd87pvf.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How to Develop an AI-Based Live Video Processing System&lt;/h2&gt;

&lt;p&gt;How complex is it to apply AI to live video streams? In the basic scenario, implementation consists of several stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjusting a pre-trained neural network (or training one) to perform the tasks needed&lt;/li&gt;
&lt;li&gt;Setting up cloud infrastructure to process the video and scale up to a certain point&lt;/li&gt;
&lt;li&gt;Building a software layer to pack the process and implement user scenarios (mobile applications, web and admin panels, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a product like this, using a pre-trained neural network and some simple application layers, building an MVP takes 3-4 months. However, the details are crucial, and each product is unique in terms of scope and timeline.&lt;/p&gt;

&lt;p&gt;We strongly suggest our clients start with a Proof of Concept to explore the main and/or most complicated flow. Spending a few weeks on exploring the best approaches and gaining early results often secures the further development flow and brings confidence to both the client and the engineering team.&lt;/p&gt;

&lt;p&gt;Written by Serhii Maksymenko, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/ai-computer-vision-real-time-video-processing"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/QQp9prhHNbQ"&gt;Image Credit&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to build face recognition app</title>
      <dc:creator>Serhii Maksymenko</dc:creator>
      <pubDate>Sat, 21 Mar 2020 21:33:31 +0000</pubDate>
      <link>https://dev.to/mobidev/how-to-build-face-recognition-app-4hb1</link>
      <guid>https://dev.to/mobidev/how-to-build-face-recognition-app-4hb1</guid>
      <description>&lt;p&gt;&lt;a href="https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer"&gt;Machine Learning&lt;/a&gt; (ML) is playing a key role in a wide range of critical applications. &lt;/p&gt;

&lt;p&gt;Recently, I was tasked to design a biometric identification system integrated with real-time camera streaming. The task had several constraints that required innovative solutions. For instance, the system workflow had to not only detect faces but also recognize them near-instantly in order to expedite further action.&lt;/p&gt;

&lt;p&gt;A camera application triggers the detection and recognition workflow. The application (local console app for Ubuntu and Raspbian) written in Golang is installed on a device that is connected to the camera.&lt;/p&gt;

&lt;p&gt;We used a JSON config file with Local Camera ID and Camera Reader type for the first app launch configuration.&lt;/p&gt;
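&lt;p&gt;A sketch of what such a first-launch configuration might look like, loaded here in Python for illustration (the production app is a Golang console application). The key names below are hypothetical, not the actual keys used in the project:&lt;/p&gt;

```python
import json

# Hypothetical first-launch configuration; actual key names may differ.
CONFIG_TEXT = """
{
    "local_camera_id": 0,
    "camera_reader_type": "usb"
}
"""

config = json.loads(CONFIG_TEXT)
print(config["local_camera_id"], config["camera_reader_type"])  # prints: 0 usb
```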

&lt;p&gt;For face detection we experimented with several models and discovered that the Caffe face tracking and TensorFlow object detection models provided the best detection outcomes. In addition, both models are available through the OpenCV library.&lt;/p&gt;

&lt;p&gt;For face detection we don’t use Dlib. Instead, we have a separate API that calculates feature vectors with dlib behind the scenes and compares them with references. It is not called for every video frame: we assume that if the bounding box didn’t move too fast, it is the same person.&lt;/p&gt;

&lt;p&gt;When a face is captured, the image is cropped and transmitted via HTTP form data request to the backend. The image is saved to a local file system using a backend API. A record is created and saved to Detection Log along with a personID.&lt;/p&gt;

&lt;p&gt;A background worker at the app’s backend finds any record with a “classified=false” tag. Using Dlib, the worker calculates a 128-dimensional descriptor vector of face features. Each feature vector is run against multiple reference images in an existing database: the application finds a match by computing the Euclidean distance between the feature vector of the live-stream image and the feature vectors of each person’s existing records in the database.&lt;/p&gt;
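&lt;p&gt;The matching step can be sketched as a nearest-neighbour search over descriptor vectors. For brevity the vectors below are 3-dimensional and the names are illustrative; real dlib descriptors have 128 dimensions.&lt;/p&gt;

```python
import math

def best_match(descriptor, references):
    """Return (person_id, distance) of the closest reference descriptor.

    `references` maps person IDs to stored descriptor vectors.
    """
    person_id = min(references, key=lambda pid: math.dist(references[pid], descriptor))
    return person_id, math.dist(references[person_id], descriptor)

# Toy 3-dimensional descriptors (real dlib descriptors are 128-dimensional).
refs = {"person_1": [0.10, 0.20, 0.30], "person_2": [0.90, 0.80, 0.70]}
pid, dist = best_match([0.12, 0.20, 0.31], refs)
print(pid)  # prints person_1
```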

&lt;p&gt;The code uses properly defined point indices in Dlib and represents each point index with a facial feature.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dl0rI1Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/430i767b7z7bgpowk3s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dl0rI1Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/430i767b7z7bgpowk3s7.png" alt="Example of code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get started, you can use the dlib wrapper: &lt;a href="https://github.com/ageitgey/face_recognition"&gt;https://github.com/ageitgey/face_recognition&lt;/a&gt;&lt;br&gt;
There is an example how to compare faces: &lt;a href="https://github.com/ageitgey/face_recognition/blob/master/examples/web_service_example.py"&gt;https://github.com/ageitgey/face_recognition/blob/master/examples/web_service_example.py&lt;/a&gt;&lt;br&gt;
A worker decides whether an image shows a known or an unknown individual based on the Euclidean distance between facial feature vectors. If the distance to a known person's reference is less than 0.6, the background worker sets that personID, marks the record as classified, and enters it into the Detection Log. If the distance is greater than 0.6, a new record is created, marked as unknown, and a new personID is entered into the Detection Log.&lt;/p&gt;
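&lt;p&gt;The 0.6 threshold decision can be sketched as follows; the record shape and IDs are illustrative, and operator.lt is simply the function form of the less-than comparison:&lt;/p&gt;

```python
import operator

MATCH_THRESHOLD = 0.6  # Euclidean distance cut-off commonly used with dlib descriptors

def classify(distance, matched_person_id, next_person_id):
    """Decide whether a detection belongs to a known person or a new one."""
    if operator.lt(distance, MATCH_THRESHOLD):
        # Known individual: log the detection under the existing personID.
        return {"person_id": matched_person_id, "classified": True}
    # Unknown individual: create a new record with a fresh personID.
    return {"person_id": next_person_id, "classified": False}

print(classify(0.35, "p17", "p99"))  # known person p17
print(classify(0.82, "p17", "p99"))  # unknown, new record p99
```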

&lt;p&gt;Unidentified person images are sent as notifications to the corresponding manager. We chose to implement chatbot messenger notifications and found that a simple alert chatbot could be implemented within 2-5 days.&lt;/p&gt;

&lt;p&gt;We created two chatbots, one with Microsoft Bot Framework and the other with Python-based Errbot. Once chatbots are in place, it is possible for security personnel or others to manually grant remote access to unknown individuals on a case-by-case basis.&lt;/p&gt;

&lt;p&gt;Captured images and their corresponding records are managed using an Admin Panel that acts as a portal to a database of stored photos and IDs. Both the Admin Panel and the database are prepared and loaded into the biometric identification system before its implementation. Additionally, unidentified images and IDs can be added to the existing database through the Admin Portal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jK__Sgn7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b0iaxwls2ax1fxswyysx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jK__Sgn7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b0iaxwls2ax1fxswyysx.png" alt="Face detection and recognition system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTES:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is necessary to note that the app’s backend requires Golang and MongoDB collections to store employee data, while API requests follow a RESTful API. Users can test the system on regular workstations prior to implementation.&lt;/p&gt;

&lt;p&gt;As unidentified images and IDs are added, the database will grow. Our use case employed a 200-entry database. Since the app works in real time and recognition is near-instant, the need to scale becomes evident. If organizations need to add cameras or create databases with 10,000 or more entries, real-time analysis and recognition speed could lag. To solve this issue we used parallelization: with a load balancer and several web workers handling simultaneous tasks, the system can chunk the entire database, which allows for quick match searches and swift results.&lt;/p&gt;

&lt;p&gt;Anti-spoofing measures must be highly adaptable to bad actors that might gain entry using false facial images. Our team has put in place enhanced security measures and &lt;a href="https://mobidev.biz/blog/face-anti-spoofing-prevent-fake-biometric-detection?utm_source=devto&amp;amp;utm_medium=devto&amp;amp;utm_campaign=antisp"&gt;anti-spoofing features&lt;/a&gt; to counteract fraudulent attempts at access.&lt;/p&gt;
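&lt;p&gt;The chunked search can be sketched with a worker pool: each worker scans its slice of the database for the nearest descriptor, and the best match overall wins. The names, vector sizes, and worker count below are illustrative.&lt;/p&gt;

```python
import math
from multiprocessing import Pool

def nearest_in_chunk(args):
    # Find the closest reference descriptor within one database chunk.
    probe, chunk = args
    pid = min(chunk, key=lambda p: math.dist(chunk[p], probe))
    return pid, math.dist(chunk[pid], probe)

def parallel_search(probe, database, workers=4):
    # Split the database into chunks, search each chunk in its own worker,
    # then keep the best match overall.
    ids = list(database)
    step = max(1, math.ceil(len(ids) / workers))
    chunks = [
        {pid: database[pid] for pid in ids[i:i + step]}
        for i in range(0, len(ids), step)
    ]
    with Pool(workers) as pool:
        results = pool.map(nearest_in_chunk, [(probe, c) for c in chunks])
    return min(results, key=lambda r: r[1])

if __name__ == "__main__":
    db = {"p%d" % i: [float(i), 0.0] for i in range(10)}
    print(parallel_search([3.1, 0.0], db, workers=2)[0])  # prints p3
```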

&lt;p&gt;While this case study is focused on facial recognition, the underlying technology can be used for a range of objects. &lt;a href="https://mobidev.biz/blog/custom-face-detection-recognition-software-development?utm_source=devto&amp;amp;utm_medium=devto&amp;amp;utm_campaign=face"&gt;Object recognition models can be trained&lt;/a&gt; to identify any other object once a dataset has been created.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
