Vardhanam Daga

Using Qdrant’s Discovery API for Video Search

How Video Search Works

In this blog, we are going to look at how semantic search works for a video database: instead of matching keywords, the search engine works with the meaning of each video's description, which makes querying far more natural. But first, let's look at how most video search works today.

1. Metadata-Based Search: This is one of the most common methods. Videos are tagged with metadata such as titles, descriptions, tags, and other relevant information. When you search, the system looks for keywords in this metadata to find matching videos. This search is limited to the keywords contained in the video's metadata and misses the semantic intent behind them.

2. Transcript or Closed Caption Search: With the advent of better speech recognition technology, many platforms have started using automated transcripts or closed captions for videos. This allows for text-based searching within the video content. When you search for a phrase, the system checks the transcript for matches and returns videos containing those words.

3. Thumbnail-Based Search: Some search algorithms analyze the thumbnails of videos to determine their content and relevance to a search query. This method, however, is less precise than the others.

4. Content-Based Video Retrieval (CBVR): This is a more advanced approach where the search algorithm analyzes the actual content of the video (like objects, scenes, actions, faces, etc.) using image and video processing techniques. This method can be resource-intensive and is not yet widely implemented in most commercial video search platforms.

5. User Interaction Data: Search algorithms also take into account user interaction data like views, likes, comments, and watch history to determine the relevance and ranking of videos in search results.

6. Contextual and Semantic Analysis: Some advanced search systems also perform contextual and semantic analysis of the video description and user query to understand the intent behind a search and the context of the content in the videos.

In this article, we are going to delve deeper into this last point, i.e., semantic analysis of video descriptions.

What Is Semantic Analysis?

Semantic analysis involves understanding the themes, narratives, and concepts presented in the video description, beyond just the visible keywords. For example, a video description might have the words ‘a person running’ but, semantically, it could be about ‘fitness’, ‘perseverance’, or even a ‘sports brand advertisement’.

Let's consider a more nuanced example of semantic video search:

Suppose our search query is: 'Easy dinner for busy weekdays'.

A Traditional Keyword-Based Search Approach:

  • Would focus on keywords like 'dinner,' 'busy,' and 'weekdays.'

  • Might return generic dinner recipes or videos titled with these specific keywords.

On the other hand, a Semantic Video Search approach would:

  • Understand the Query Semantics: Recognize that the user is looking for a meal that is quick and easy to prepare and suits a busy weekday schedule. The system might also consider semantically related concepts such as 'meal prep ideas,' 'quick healthy dinners,' or 'family-friendly quick recipes,' expanding the search scope to include relevant content that doesn't necessarily match the exact query terms.

The system would then search for videos whose descriptions align with the concepts of 'easy' and 'quick' preparation, even if those exact words never appear in the video's title or description. For example, it might prioritize videos with concepts such as '30 minutes or less,' 'simple ingredients,' or 'one-pot meals.'
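
To make this concrete, here is a minimal sketch (with made-up descriptions) of how a sentence-embedding model, such as the one introduced in the next section, scores semantic closeness to this query:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

query = "Easy dinner for busy weekdays"
candidates = [
    "30-minute one-pot pasta with simple pantry ingredients",   # no shared keywords, same intent
    "A documentary on the history of formal dinner parties",    # shares the keyword 'dinner', different intent
]

# Encode the query and the candidate descriptions into vectors
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Cosine similarity: higher means semantically closer
scores = util.cos_sim(query_emb, candidate_embs)[0]
for text, score in zip(candidates, scores):
    print(f"{float(score):.3f}  {text}")

We would expect the first description to score higher even though it shares no keywords with the query, which is exactly the kind of match that keyword search misses.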

Sentence Transformers

To generate embeddings for our video descriptions, we'll use the 'all-MiniLM-L6-v2' Sentence Transformers model from Hugging Face. This model maps each description to a 384-dimensional vector and is well suited for tasks like semantic search.
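
As a quick, optional sanity check, you can encode a sample description and confirm the 384-dimensional output:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode a sample video description and inspect the embedding size
embedding = model.encode("a tiger runs across a grassy field")
print(embedding.shape)  # (384,)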

Building a Semantic Video Search Engine with Qdrant

In this article, we'll use a collection of GIFs (which are, essentially, short, low-resolution video snippets) and use the sentence-transformers model to generate embeddings of their descriptions. Then we'll store these embeddings in Qdrant, a vector database designed for efficient storage, search, and retrieval of vector embeddings.

An interesting feature of Qdrant is its Discovery API. This API is particularly notable for refining search parameters to achieve greater precision, focusing on the concept of 'context.' This context refers to a collection of positive-negative pairs that define zones within a space, dividing the space into positive or negative segments. The search algorithm then prioritizes points based on their inclusion within positive zones or their avoidance of negative zones.

The Discovery API offers two main modes of operation:

  1. Discovery Search: This utilizes a target point to find the most relevant points in the collection, but only within preferred areas. It's essentially a controlled search operation.

[Figure: Discovery Search, from Qdrant's documentation]

  2. Context Search: Similar to discovery search but without a target point. Instead, it uses 'context' to navigate the Hierarchical Navigable Small World (HNSW) graph towards preferred zones. This mode is expected to yield diverse results, not centered around one point, and is suitable for more exploratory navigation of the vector space. A minimal sketch of both call patterns follows below.

[Figure: Context Search, from Qdrant's documentation]
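
To make the two call shapes concrete, here is a rough sketch using qdrant-client; the cluster URL, API key, and collection name are placeholders, and a complete, working discovery search is run later in this article.

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

# Placeholders: substitute your own cluster URL, API key, and collection name
client = QdrantClient(url="https://<your-cluster>.cloud.qdrant.io:6333", api_key="<your-api-key>")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# A context pair: prefer points near 'tiger', avoid points near 'cats'
pair = {"positive": model.encode("tiger").tolist(),
        "negative": model.encode("cats").tolist()}

# 1. Discovery search: a target point plus the context pair(s)
hits = client.discover(
    collection_name="gif_collection",
    target=model.encode("animals").tolist(),
    context=[pair],
    limit=5,
)

# 2. Context search: context pair(s) only, no target; results spread across
#    the positive zones instead of converging on a single point
hits = client.discover(
    collection_name="gif_collection",
    context=[pair],
    limit=5,
)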

Setting Up the Environment and Code

We’ll use data from the raingo/TGIF-Release GitHub repository. It’s a TSV file of GIF URLs and their text descriptions. Download it with the following command in your Colab notebook.

!wget https://github.com/raingo/TGIF-Release/raw/master/data/tgif-v1.0.tsv

Next, install all the dependencies.

!pip install requests pillow transformers qdrant-client sentence-transformers accelerate tqdm

Next, launch a cluster on Qdrant Cloud.

Retain the API key of the cluster. In this cluster, we’ll create a collection for storing the vector embeddings (also known as points).

Load the sentence-transformers model and create a Qdrant client.

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http import models

print("[INFO] Loading the model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

client = QdrantClient(
    url="https://xxxxxx-xxxxx-xxxxx-xxxx-xxxxxxxxx.us-east.aws.cloud.qdrant.io:6333",
    api_key="<your-api-key>",
)
print("[INFO] Client created...")
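
As an optional sanity check, you can confirm that the client reaches the cluster by listing its collections:

# Quick connectivity check: list the collections in the cluster
print(client.get_collections())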

We’ll now read the TSV file to collect the video descriptions. We’ll also build a list of payloads, each containing the URL and description of a video, plus a list of integer IDs for the points. The embeddings themselves are generated in the next step.

import csv

# Replace 'your_file.tsv' with the path to your TSV file
file_path = '/content/tgif-v1.0.tsv'

# Lists to store the data
descriptions = []
payload = []


# Reading the TSV file
with open(file_path, 'r', encoding='utf-8') as file:
    tsv_reader = csv.reader(file, delimiter='\t')


    # Iterate through each row in the TSV file
    for row in tsv_reader:
        if len(row) >= 2:  # Checking if the row has at least two elements (URL and description)
            url, description = row[0], row[1]
            descriptions.append(description)
            payload.append({"url": url, "description": description})


idx = list(range(1, len(descriptions) + 1))
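
Optionally, it's worth a quick look at what was parsed before embedding it:

# Inspect the parsed data before embedding it
print(f"Loaded {len(descriptions)} descriptions")
print(payload[0])  # first URL/description pair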

Create embeddings of the description.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(descriptions).tolist()

Create a Qdrant collection with the name gif_collection.

print("[INFO] Creating qdrant data collection...")
client.create_collection(
    collection_name="gif_collection",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
Enter fullscreen mode Exit fullscreen mode
)
Enter fullscreen mode Exit fullscreen mode

Next, we upload the records to our collection. We’ll only upload the first 2500 records for convenience’s sake.

#uploading the records to client
print("[INFO] Uploading data records to data collection...")
client.upsert(
    collection_name="gif_collection",
    points=models.Batch(
        ids=idx[:2500],
        payloads=payload[:2500],
        vectors=embeddings[:2500],
    ),
)
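
As an optional check, you can confirm the upload by counting the points now stored in the collection:

# Verify that the points were stored (should report count=2500)
print(client.count(collection_name="gif_collection", exact=True))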

Now we have our Vector DB ready, and we can use the Discovery API to query it.

Let’s set 'animals' as our target, with 'tiger' as the positive context. We don’t want any cat videos, so we’ll set 'cats' as the negative context:

target = 'animals'
positive = 'tiger'
negative = 'cats'
emb_target = model.encode(target).tolist()
emb_positive = model.encode(positive).tolist()
emb_negative = model.encode(negative).tolist()

hits = client.discover(
    collection_name="gif_collection",
    target=emb_target,
    context=[{'positive': emb_positive, 'negative': emb_negative}],
    limit=5,
)

Let’s see what the top 5 results look like:

hits
[ScoredPoint(id=53, version=0, score=1.6944957, payload={'description': 'the tiger is running really fast in the garden', 'url': 'https://38.media.tumblr.com/8c14aa571911985d6d774762f8159452/tumblr_n4gk7efcMw1spy7ono1_400.gif'}, vector=None, shard_key=None),
 ScoredPoint(id=1120, version=0, score=1.6891048, payload={'description': 'a giant tiger creature jumps through the air.', 'url': 'https://38.media.tumblr.com/be596acc2860e691e578c2a64d044d94/tumblr_nq0mhrqyAf1u4fhlko1_250.gif'}, vector=None, shard_key=None),
 ScoredPoint(id=1269, version=0, score=1.6566299, payload={'description': 'a young tiger cub nestles down to sleep next to a large teddy bear.', 'url': 'https://38.media.tumblr.com/b9762b528bbd4a61f4b1c6ac5352b92f/tumblr_nqi6joMEOR1uv0y2ro1_250.gif'}, vector=None, shard_key=None),
 ScoredPoint(id=4, version=0, score=1.6384017, payload={'description': 'an animal comes close to another in the jungle', 'url': 'https://38.media.tumblr.com/9f659499c8754e40cf3f7ac21d08dae6/tumblr_nqlr0rn8ox1r2r0koo1_400.gif'}, vector=None, shard_key=None),
 ScoredPoint(id=2333, version=0, score=1.6347136, payload={'description': 'a view from car where a lion is looking in the tries to get in.', 'url': 'https://33.media.tumblr.com/2aeb5012cde835e61f2e35cd49960971/tumblr_nnums3GbD71ssgyoro1_400.gif'}, vector=None, shard_key=None)]

We get three video descriptions containing the word ‘tiger’, followed by one containing the word ‘animal’, and finally a video of a ‘lion’. This is what we would expect: the positive context ‘tiger’ pulls the results towards tiger videos, the target ‘animals’ keeps them within animal content in general, and the negative context ‘cats’ keeps cat videos out of the results entirely. The ‘lion’ video appears because ‘lion’ is semantically close to ‘tiger’.
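
To read the results more comfortably, you can loop over the returned ScoredPoint objects and print just the score, description, and URL of each hit:

# Compact view of each hit: score, description, GIF URL
for hit in hits:
    print(f"{hit.score:.4f}  {hit.payload['description']}  ->  {hit.payload['url']}")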

Conclusion

Building a semantic video search engine with sentence-transformers and Qdrant demonstrates the capability of AI to understand and categorize video descriptions. Traditional keyword-based search methods rely on metadata, transcripts, and thumbnails, which may not capture the full context or semantic meaning of a video. With semantic search, the system comprehends the deeper themes and narratives behind the descriptions, making search results more relevant and precise.

The Discovery API of Qdrant, with its unique ability to refine search parameters through context, further enhances the precision of these search results.

References

https://qdrant.tech/documentation/concepts/explore/
