
BMF: Frame extraction acceleration - video similarity search with Pinecone

TL;DR: This is a tutorial on how to create a video similarity search with BMF and Pinecone from scratch. View this project's code on GitHub and test it out in a notebook like Colab.

So you might have seen in my last blog post that I showed you how to accelerate video frame extraction using GPUs and the Babit Multimedia Framework (BMF). In this blog we are going to improve upon our video frame extractor and create a video similarity search (reverse video search) utilizing different RAG (Retrieval-Augmented Generation) concepts with Pinecone, the vector database that will help us build knowledgeable AI. Pinecone is designed to perform vector searches effectively, and you'll see throughout this blog how we extract vectors from videos to make our search work like a charm. With Pinecone, you can quickly find items in a dataset that are most similar to a query vector, making it handy for tasks like recommendation engines, similar item search, or even detecting duplicate content. It's particularly well suited for machine learning applications where you deal with high-dimensional data and need fast, accurate similarity search capabilities.
Reverse video search works like reverse image search: you use a video as the query to find other videos that are alike. While handling videos is generally more complex and the accuracy might not be as good as with other models, the use of AI for video tasks is growing. Reverse video search is really good at finding related videos and can make other video applications better.
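To make "most similar to a query vector" concrete, here is a tiny, self-contained sketch of the underlying idea using cosine similarity on toy NumPy vectors; Pinecone performs the same kind of comparison, just at scale and over an index:

import numpy as np

# A toy "database" of three stored vectors and one query vector (4 dimensions for readability)
stored = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.85, 0.05, 0.1, 0.3],
])
query = np.array([0.88, 0.08, 0.05, 0.25])

def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_sim(query, v) for v in stored]
print(scores)                  # higher score = more similar
print(int(np.argmax(scores)))  # index of the closest stored vector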
So why would you want to create a video similarity search app?

Here are some reasons:

  1. Content Discovery: It enables users to find videos that are visually or contextually similar to what they're interested in, enhancing content discoverability on platforms like streaming services or stock footage libraries.
  2. Recommendation Systems: Enhances recommendation engines by suggesting content that is similar to a user's viewing history, thus improving user engagement and retention.
  3. Duplicate or Near-duplicate Detection: Helps in identifying copies or slight variations of the same video, which is useful for copyright enforcement or content management.
  4. Categorization and Tagging: Assists in automatically categorizing and tagging videos based on content, which can simplify content management and improve searchability.
  5. User-generated Content Moderation: Useful in moderating platforms where vector similarity can help identify potentially problematic content by comparing new uploads with known flagged videos.
  6. Video Analysis: In fields like surveillance, sports, or medical imaging, it can help in analyzing and identifying specific moments or objects in video sequences.

Oh yeah, and of course a similarity search like what we'll do in this blog! I took inspiration from the Milvus reverse video search notebook and decided to recreate it using technologies I prefer.
The Babit Multimedia Framework brings forth all the great things we know and love about FFmpeg and amplifies them with multi-language support and GPU acceleration capabilities.
Now you might be familiar with other frame extraction methods using OpenCV, FFmpeg, or GStreamer. These are all great options. However, I'm choosing to use BMF for a few reasons:

  • Multi-language support: BMF supports the use of Python, Go, and C++.
  • Full compatibility with FFmpeg: BMF is fully compatible with FFmpeg's processing capabilities, such as demuxing, decoding, filtering, encoding, and muxing, and its results are consistent with FFmpeg's indicators like pts, duration, bitrate, and fps. It satisfies the need to quickly integrate FFmpeg capabilities into projects.
  • Enhanced support for NVIDIA GPUs to create enterprise-ready, GPU-accelerated video pipelines:
    • NVENC/NVDEC/GPU filters work out-of-box by inheriting abilities from FFmpeg.
    • High performance frame processing is enabled by integration of CV-CUDA and customized CUDA kernels.
    • AI inferencing can be easily integrated into video pipelines using TensorRT.
    • Data moving between CPU and GPU can be done by a simple call.

Alright, so that's more than just a few reasons, but you get the point! Now let's build a video similarity search.

### The Architecture

[Architecture diagrams]

Requirements:

  • Python 3.9-3.10
  • pinecone-client
  • BabitMF-GPU
  • torch
  • torchvision>=0.12.0
  • python-dotenv
  • av

Grab a video. I had a short video stored on a GitHub repo. You can use a video stored on your system or elsewhere. BMF can handle any video format (FFmpeg compatibility!).

### Inserting the dataset into a Pinecone index

Let's start with inserting videos from our dataset into our Pinecone index. We do this so that our vector database has knowledge of the videos we will be comparing to the end user's video. This is a necessary starting point for our application.
First, I'm going to create an account on Pinecone and create my first index using Pinecone serverless. Pinecone is a fully managed vector database. You can use the CLI or the dashboard when you log in. Here's how to set it up: https://docs.pinecone.io/guides/getting-started/quickstart.
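If you'd rather create the serverless index from Python than from the dashboard, here's a minimal sketch; the index name, cloud, and region below are placeholders you should swap for your own, and the dimension of 512 matches the output of ResNet18 with its final fully connected layer removed, which is the model we'll use later:

import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# 512-dimensional vectors, cosine similarity, serverless spec (cloud/region are assumptions)
pc.create_index(
    name="video-embeddings",   # hypothetical index name
    dimension=512,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)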

git clone https://github.com/Joshalphonse/Bmf-Huggingface.git

Install BMF with GPU capabilities

!pip install -qU \
  pinecone-client \
  BabitMF-GPU \
  torch \
  "torchvision>=0.12.0" \
  python-dotenv \
  av

Download this video dataset or use your own.
The data is organized as follows:

  • train: candidate videos, 20 classes, 10 videos per class (200 in total)
  • test: query videos, same 20 classes as train data, 1 video per class (20 in total)
  • reverse_video_search.csv: a csv file containing an id, path, and label for each video in train data
! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip -O
! unzip -q -o reverse_video_search.zip

Put the file paths in a DataFrame and convert them to a list

import pandas as pd

df = pd.read_csv('./reverse_video_search.csv', nrows=3)  # load the CSV rows into a DataFrame
video_paths = df['path'].tolist()  # convert the 'path' column to a Python list

print(video_paths)  # check the video paths

Make sure to import all of the necessary packages. Then create environment variables to manage your configurations. They will make your life a lot easier.
Afterwards, load the CSV file from the dataset folder. I'm also limiting the list to 3 rows just to speed things up for demo purposes.
We'll also load the pretrained ResNet model, because in the next steps we will use it to generate the vector embeddings.
Lastly, in this code snippet, we configure a preprocessing pipeline for images using PyTorch's transforms module, which is often used in deep learning to prepare data before feeding it into a neural network.

import os
import cv2
import av
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import torch
import torchvision.transforms as transforms
import torchvision.models as models

load_dotenv()  # pull the Pinecone credentials from the .env file into os.environ

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENVIRONMENT = os.environ["PINECONE_ENVIRONMENT"]
PINECONE_DATABASE = os.environ["PINECONE_DATABASE"]

# Replace 'your_pinecone_api_key' with your actual Pinecone API key or use environment variables like I am here
pc = Pinecone(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pc.Index(PINECONE_DATABASE)

#load the CSV file
csv_file = './reverse_video_search.csv'
df = pd.read_csv(csv_file, nrows=3)
video_paths = df['path'].tolist()
print(video_paths)  # check the video paths

#load a pretrained ResNet model
model = models.resnet18(pretrained = True)
model.eval()

# remove the last fully connected layer
model = torch.nn.Sequential(*list(model.children())[:-1])

# Define the preprocessing transforms
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

Since we have our dataset and our files ready to go, we will iterate over each video path and generate an embedding. I'm also using the av package to handle the video file, so we can open it and do the extraction.
We then iterate over the frames of the video, preprocessing each frame (using the preprocess pipeline we defined above) and generating an embedding for the frame using the pre-trained ResNet model. These frame embeddings are stored in a list.
Once all the frame embeddings have been collected, we calculate the average of the embeddings to get a single embedding that represents the entire video.
Now all we have to do is use the Pinecone package to upsert (insert or update) the average video embedding to a Pinecone index, under the namespace 'video_embeddings'. The video path is used as the unique identifier for the embedding.

# Iterate over each video path and generate embeddings
for video_path in video_paths:
    # Open the video file
    video = av.open(video_path)

    # Get the first video stream
    video_stream = next(s for s in video.streams if s.type == 'video')

    # Initialize variables for storing embeddings
    embeddings = []

    # Iterate over the video frames
    for frame in video.decode(video=0):
        # Convert the frame to a numpy array
        img = frame.to_ndarray(format='rgb24')

        # Preprocess the frame
        img = preprocess(img)
        img = img.unsqueeze(0)  # Add batch dimension

        # Generate embeddings using the ResNet model
        with torch.no_grad():
            embedding = model(img)
            embedding = embedding.squeeze().numpy()

        # Append the embedding to the list
        embeddings.append(embedding)

    # Convert the list of embeddings to a numpy array
    embeddings = np.array(embeddings)

    # Calculate the average embedding for the video
    avg_embedding = np.mean(embeddings, axis=0)
    print(avg_embedding)

    # Upsert the embedding to Pinecone
    index.upsert(
        vectors=[
            (video_path, avg_embedding.tolist())
        ],
        namespace='video_embeddings'
    )

    print(f"Upserted embedding for video: {video_path}")

Now you can use either the Pinecone CLI or the dashboard to view the data we just upserted into your index. Check out the picture below.

[Screenshot: the Pinecone index showing the upserted video embeddings]
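If you'd rather sanity check from code instead of the dashboard, here's a quick sketch against the same index and namespace we upserted into:

# How many vectors landed in the index, broken down by namespace?
print(index.describe_index_stats())

# Fetch one of the vectors we just upserted by its ID (the video path)
print(index.fetch(ids=[video_paths[0]], namespace='video_embeddings'))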

### Searching for a Similar Video

Install ffmpeg and related libraries. For this demo, we don't have to do this step, because ffmpeg libraries are already installed in the Google Colab environment.

sudo apt install ffmpeg

List the ffmpeg libraries. It is expected that related libraries such as libavcodec and libavformat are installed. The output should look like what's shown below:

[Output: list of installed FFmpeg libraries]

sudo apt install libdw1

dpkg -l | grep -i ffmpeg

ffmpeg -version

Install the following package to show the BMF C++ logs in the Colab console; otherwise only Python logs are printed. This step is not necessary if you're not in a Colab or IPython notebook environment.

pip install wurlitzer
%load_ext wurlitzer

Now import all of the dependencies listed below. The beginning of our process is the same as the data upsert above: use your Pinecone credentials that we stored in a .env file and load the pretrained ResNet18 model.
The difference here is that we are finally using BMF for frame extraction.

import os
import glob
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from pinecone import Pinecone
from dotenv import load_dotenv
import bmf
import cv2
from IPython import display
from PIL import Image

load_dotenv()  # pull the Pinecone credentials from the .env file into os.environ

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENVIRONMENT = os.environ["PINECONE_ENVIRONMENT"]
PINECONE_DATABASE = os.environ["PINECONE_DATABASE"]

# Replace 'your_pinecone_api_key' with your actual Pinecone API key or use environment variables like I am here
pc = Pinecone(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pc.Index(PINECONE_DATABASE)

model = models.resnet18(pretrained=True)
model.eval()
# remove the last fully connected layer
model = torch.nn.Sequential(*list(model.children())[:-1])

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

input_video_path = "/content/linedancing.mp4"
output_path = "./extracted-images/simple_%03d.jpg"

graph = bmf.graph({'dump_graph':1})
video = graph.decode({
    "input_path": input_video_path,
}).fps(2)

try:
    (
        bmf.encode(
            video['video'],
            None,
            {
                "output_path": output_path,
                "format": "image2",
                "video_params": {"codec": "jpg"},
            }
        ).run()
    )
    print("Frame extraction completed successfully.")
except Exception as e:
    print(f"Error during frame extraction: {str(e)}")

Next, we will load the extracted query frames and generate the embeddings for our video that we will compare to the ones stored in our Pinecone index.

Let me break it down for you:

  • Load the extracted query frames: I used the glob module to find all the file paths of the extracted query frames, which are stored in the query_frame_paths variable. These are individual frames extracted from the original video.
  • Generate embeddings for each query frame: We then iterate over each query frame path, load the image using cv2.imread, preprocess it (using the same preprocess pipeline as before), and generate an embedding for the frame using the pre-trained model.
  • Store the embeddings: The generated embeddings for each frame are stored in the query_embeddings list.
  • Calculate the average embedding: Once all the frame embeddings have been collected, we calculate the average of the embeddings to get a single embedding that represents the entire set of query frames. By generating an average embedding for the query frames, we are able to capture the overall visual content of the query, which is a main component of how our similarity search will work.
# Load the extracted query frames and generate embeddings
query_frame_paths = glob.glob(output_path.replace("%03d", "*"))
query_embeddings = []

for frame_path in query_frame_paths:
    frame = cv2.imread(frame_path)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    frame = preprocess(frame)
    frame = frame.unsqueeze(0)  # Add batch dimension

    with torch.no_grad():
        embedding = model(frame)
        embedding = embedding.squeeze().numpy()

    query_embeddings.append(embedding)

query_embeddings = np.array(query_embeddings)
avg_query_embedding = np.mean(query_embeddings, axis=0)

Lastly, let's perform our similarity search with Pinecone.
The query method from Pinecone will be used to search for the most similar vectors to the avg_query_embedding we created. The top_k parameter is set to 5, which means the code will retrieve the 5 closest matching vectors to the query (choose whatever number you'd like depending on how many items were upserted into your database). The include_metadata parameter is set to True so that any metadata attached to the matching vectors is returned as well; since we used the video path as each vector's ID, we can read the matching paths straight from the results.
This step is really straightforward. Pinecone has great documentation and a really easy to use package.

# Perform similarity search using Pinecone
num_results = 5  # Number of similar videos to retrieve
results = index.query(
    vector=avg_query_embedding.tolist(),
    top_k=num_results,
    include_metadata=True,
    namespace='video_embeddings'
)

# Print the most similar video paths
for match in results['matches']:
    video_path = match['id']
    print(f"Similar video: {video_path}")
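Each match also carries a similarity score, so if you want to see how close every result is, here's a small variation on the loop above (nothing new beyond Pinecone's standard response fields):

# Print the similar video paths along with their similarity scores
for match in results['matches']:
    print(f"Similar video: {match['id']} (score: {match['score']:.4f})")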

And our result is....

[Output: the most similar video paths returned by the query]

### Bonus

Since I'm using a notebook and I don't want to use up a ton of memory, I also converted all the videos to GIFs to make them easier to view. So here is some bonus code for ya!

# Scratch directory for the generated GIFs (any writable path works)
tmp_dir = './tmp_gifs'
os.makedirs(tmp_dir, exist_ok=True)

def video_to_gif(video_path):
    gif_path = os.path.join(tmp_dir, video_path.split('/')[-1][:-4] + '.gif')
    frames = []
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(Image.fromarray(frame))
    cap.release()
    frames[0].save(fp=gif_path, format='GIF', append_images=frames[1:], save_all=True, loop=0)
    return gif_path

# Display the input video as a GIF
html = 'Query video "{}": <br/>'.format(input_video_path.split('/')[-1])
query_gif = video_to_gif(input_video_path)
html_line = '<img src="{}"> <br/>'.format(query_gif)
html += html_line
html += 'Top {} search results: <br/>'.format(num_results)

# Display the similar videos as GIFs
for path in [match['id'] for match in results['matches']]:
    gif_path = video_to_gif(path)
    html_line = '<img src="{}" style="display:inline;margin:1px"/>'.format(gif_path)
    html += html_line

display.HTML(html)

### You can do it too

What I've shown you is a niche use case for BMF. Video frame extraction has a lot of use cases outside of our example. There are a ton of features in this framework, especially when it comes to building video processing pipelines. Make sure you check out the BMF documentation and try out some other example apps on the quick experience page for more.
