Laurent Picard for Google Cloud

Posted on Jun 25, 2020 • Edited on Oct 8, 2021 • Originally published at Medium

🎞️ Video object tracking as a service 🐍

#python #serverless #machinelearning #googlecloud

👋 Hello!

In this article, following up on the previous one ("Video summary as a service"), you'll see the following:

how to track objects present in a video,
with an automated processing pipeline,
in less than 300 lines of Python code.

Here is an example of an auto-generated object summary for the video :

🛠️ Tools

A few tools will do:

Storage space for videos and results
A serverless solution to run the code
A machine learning model to analyze videos
A library to extract frames from videos
A library to render the objects

🧱 Architecture

Here is a possible architecture using 3 Google Cloud services (Cloud Storage, Cloud Functions, and the Video Intelligence API):

The processing pipeline follows these steps:

You upload a video
The upload event automatically triggers the tracking function
The function sends a request to the Video Intelligence API
The Video Intelligence API analyzes the video and uploads the results (annotations)
The upload event triggers the rendering function
The function downloads both annotation and video files
The function renders and uploads the objects
You know which objects are present in your video!

🐍 Python libraries

Video Intelligence API

To analyze videos
https://pypi.org/project/google-cloud-videointelligence

Cloud Storage

To manage downloads and uploads
https://pypi.org/project/google-cloud-storage

OpenCV

To extract video frames
OpenCV offers a headless version (without GUI features, ideal for a service)
https://pypi.org/project/opencv-python-headless

Pillow

To render and annotate object images
Pillow is a very popular imaging library, both extensive and easy to use
https://pypi.org/project/Pillow

🧠 Video analysis

Video Intelligence API

The Video Intelligence API is a pre-trained machine learning model that can analyze videos. One of its multiple features is detecting and tracking objects. For the 1st Cloud Function, here is a possible core function calling annotate_video() with the OBJECT_TRACKING feature:

from google.cloud import storage, videointelligence

def launch_object_tracking(video_uri: str, annot_bucket: str):
    """ Detect and track video objects (asynchronously)

    Results will be stored in <annot_uri> with this naming convention:
    - video_uri: gs://video_bucket/path/to/video.ext
    - annot_uri: gs://annot_bucket/video_bucket/path/to/video.ext.json
    """
    print(f"Launching object tracking for <{video_uri}>...")
    features = [videointelligence.Feature.OBJECT_TRACKING]
    video_blob = storage.Blob.from_string(video_uri)
    video_bucket = video_blob.bucket.name
    path_to_video = video_blob.name
    annot_uri = f"gs://{annot_bucket}/{video_bucket}/{path_to_video}.json"
    request = dict(features=features, input_uri=video_uri, output_uri=annot_uri)

    video_client = videointelligence.VideoIntelligenceServiceClient()
    video_client.annotate_video(request)

Cloud Function entry point

import os

ANNOTATION_BUCKET = os.getenv("ANNOTATION_BUCKET", "")
assert ANNOTATION_BUCKET, "Undefined ANNOTATION_BUCKET environment variable"

def gcf_track_objects(data, context):
    """Cloud Function triggered by a new Cloud Storage object"""
    video_bucket = data["bucket"]
    path_to_video = data["name"]
    video_uri = f"gs://{video_bucket}/{path_to_video}"
    launch_object_tracking(video_uri, ANNOTATION_BUCKET)

Notes:

This function will be called when a video is uploaded to the bucket defined as a trigger.

Using an environment variable makes the code more portable and lets you deploy the exact same code with different trigger and output buckets.

🎨 Object rendering

Code structure

It's interesting to split the code into 2 main classes:

StorageHelper for managing local files and cloud storage objects
VideoProcessor for all graphical processings

Here is a possible core function for the 2nd Cloud Function:

class VideoProcessor:
    @staticmethod
    def render_objects(annot_uri: str, output_bucket: str):
        """Render objects from video annotations"""
        with StorageHelper(annot_uri, output_bucket) as storage:
            with VideoProcessor(storage) as video_proc:
                print(f"Objects to render: {video_proc.object_count}")
                video_proc.render_object_summary()

Cloud Function entry point

def gcf_render_objects(data, context):
    """Cloud Function triggered by a new Cloud Storage object"""
    annotation_bucket = data["bucket"]
    path_to_annotation = data["name"]
    annot_uri = f"gs://{annotation_bucket}/{path_to_annotation}"
    VideoProcessor.render_objects(annot_uri, OBJECT_BUCKET)

Note: This function will be called when an annotation file is uploaded to the bucket defined as a trigger.

Frame rendering

OpenCV and Pillow easily let you extract video frames and compose over them:

class VideoProcessor:
    def get_frame_with_overlay(
        self, obj: vi.ObjectTrackingAnnotation, frame: vi.ObjectTrackingFrame
    ) -> PilImage:
        def get_video_frame() -> PilImage:
            pos_ms = frame.time_offset.total_seconds() * 1000
            self.video.set(cv.CAP_PROP_POS_MSEC, pos_ms)
            _, cv_frame = self.video.read()
            image = Image.fromarray(cv.cvtColor(cv_frame, cv.COLOR_BGR2RGB))
            image.thumbnail(self.cell_size)  # Makes it smaller if needed
            return image

        def add_caption(image: PilImage) -> PilImage:
            ...

        def add_bounding_box(image: PilImage) -> PilImage:
            ...
            draw.rectangle(r, outline=BBOX_COLOR, width=BBOX_WIDTH_PX)
            return image

        return add_bounding_box(add_caption(get_video_frame()))

Note: It would probably be possible to only use OpenCV but I found it more productive developing with Pillow (code is more readable and intuitive).

🔎 Results

Here are the main objects found in the video :

Notes:

The machine learning model has correctly identified different wildlife species: those are "true positives". It has also incorrectly identified our planet as "packaged goods": this is a "false positive". Machine learning models keep learning by being trained with new samples so, with time, their precision keeps increasing (resulting in less false positives).

The current code filters out objects detected with a confidence below 70% or with less than 10 frames. Lower the thresholds to get more results.

🍒 Cherry on Py 🐍

Now, the icing on the cake (or the "cherry on the pie" as we say in French), you can enrich the architecture to add new possibilities:

Trigger the processing for videos from any bucket (including external public buckets)
Generate individual object animations (in parallel to object summaries)

Architecture (v2)

A - Video object tracking can also be triggered manually with an HTTP GET request
B - The same rendering code is deployed in 2 sibling functions, differentiated with an environment variable
C - Object summaries and animations are generated in parallel

Cloud Function HTTP entry point

def gcf_track_objects_http(request):
    """Cloud Function triggered by an HTTP GET request"""
    if request.method != "GET":
        return ("Please use a GET request", 403)
    if not request.args or "video_uri" not in request.args:
        return ('Please specify a "video_uri" parameter', 400)
    video_uri = request.args["video_uri"]
    launch_object_tracking(video_uri, ANNOTATION_BUCKET)
    return f"Launched object tracking for <{video_uri}>"

Note: This is the same code as gcf_track_objects() with the video URI parameter specified by the caller through a GET request.

🎉 Results

Here are some auto-generated trackings for the video :

The left elephant (a big object ;) is detected:
The right elephant is perfectly isolated too:
The veterinarian is correctly identified:
The animal he's feeding too:

Moving objects or static objects in moving shots are tracked too, as in :

A building in a moving shot:
Neighbor buildings are tracked too:
Persons in a moving shot:
A surfer crossing the shot:

Here are some others for the video :

A butterfly (easy?):
An insect, in larval stage, climbing a moving twig:
An ape in a tree far away (hard?):
A monkey jumping from the top of a tree (harder?):
Now, a trap! If we can be fooled, current machine learning state of the art can too:

🚀 Source code and deployment

The Python source code is less than 300 lines.
You can deploy this architecture in less than 8 minutes.
See "Deploying from scratch".

🖖 See you

Do you want more, do you have questions? I'd love to read your feedback. You can also follow me on Twitter.

⏳ Updates

2021-10-08: Updated with latest library versions + Python 3.7 → 3.9

Top comments (2)

amananandrai • Jun 25 '20

@picardparis Awesome post would love to see more such posts in the future. Also, I would suggest making a post on Machine Learning on Android devices.

AiSara • Jun 26 '20

Wow such a nice post here ! Really interesting.