Build your very own image search engine
Know more about me & reach out to me on LinkedIn or X
Do you ever wonder how Google Photos and Apple Photos are able to understand images?
Or, how do they allow you to search for images based on what you "type"?
Or, how Google's very own image search works?
Well, I cracked it!
More Than Just an Introduction
In this new mind-boggling project, I was able to mimic this very ability of such powerful platforms right on my local system.
Creativity and imagination go hand-in-hand. We should always indulge in imaginative thought experiments that spark creativity, and this is one such thought experiment that has been teasing me for quite some time. I am happy to share that I have succeeded to some extent in satisfying my intellectual thirst through the help of Qdrant & OpenAI's open-sourced model.
In this article, I'll walk through the creation of a semantic image search engine using OpenAI's open-sourced CLIP model coupled with the sheer might of Qdrant's vector database.
This project is divided into the following sections:
- Environment Setup
- Data Pre-processing & Populating Vector Database
- Embedding Feature-Vector-Driven Semantic Search Over Vector Database for Active Image Retrieval
Environment Setup
I always love to organize my projects with a proper structure, which makes them easier to review later on. Similarly, I believe you also prefer to keep your projects straightforward and manageable.
Pro Tip:
I prefer to divide my AI projects this way:
model_type/
|----project_title/
|----demo/
|----recorded_demo.mp4
|----stable_build/
|----exp_<experiment_number>/
|----data/
|----raw/
|----processed_training_data/
|----model/
|----metrics/
|----classification_reports/
|----performance_scores/
|----README.md
The first step in preparing the environment for this project is to pull the Qdrant Docker image and then run it on your local Docker daemon. (Don't forget to launch the Docker application first.)
- Pull the Qdrant container image from Docker Hub, then run the container using the following command, which will host the application at localhost:6333.
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
NOTE: If you are running on Windows, kindly replace $(pwd) with your local path.
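If you want to confirm that the container is up before moving on, one quick option is to ping Qdrant's REST endpoint from Python. This is a minimal check, assuming the default port mapping above; the requests library is not part of this project's dependency list, so install it separately if you want to try it.
# Minimal sanity check, assuming Qdrant is running on the default port 6333
import requests

resp = requests.get("http://localhost:6333")  # the REST root returns basic service info
print(resp.status_code, resp.json())          # expect a 200 status and a JSON body with the version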
- Next comes the most important step of all: "Use an Environment". You need an isolated environment when performing experiments, or else you will surely fall into a black hole like Matthew McConaughey in the film "Interstellar".
So, let's create a Python environment (I used Conda) and install the following basic dependencies necessary to run the AI model.
conda create -n qdrant python -y
conda activate qdrant
pip install qdrant-client sentence-transformers accelerate tqdm datasets gradio
Now that we are all set, let's begin the show!
Data Pre-Processing and Populating Vector Database
For this project, I've used the 'arampacha/rsicd' dataset, a collection of diverse satellite images from Hugging Face. We leverage the datasets library from Hugging Face to load the training split of the dataset.
import datasets
print("[INFO] Loading dataset...")
ds = datasets.load_dataset('arampacha/rsicd', split='train')
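If you want a quick look at what each sample contains before moving on (the 'image' and 'captions' fields are the ones used later), a small inspection snippet like the one below should do; the exact image sizes and captions will of course depend on the dataset itself.
# Optional: peek at one sample to see the fields we rely on later
sample = ds[0]
print(ds)                      # dataset summary: features and number of rows
print(sample["image"].size)    # a PIL image; its size depends on the dataset
print(sample["captions"][:2])  # the first couple of human-written captions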
Now comes the AI.
I have browsed through a pile of models to find the one that best fits my need for generating feature-focused embeddings from satellite images, as well as creating text embeddings that can be utilized later for semantic search.
I settled on OpenAI's CLIP model, specifically 'openai/clip-vit-base-patch32'. This model is tailored for zero-shot image classification and yields a (1,512)-dimensional feature embedding for each image. And it doesn't stop there. Being pre-trained on images and their corresponding captions, it aligns both textual and visual contexts within the same embedding space. This implies that whether you input text or an image, you will receive a (1,512)-dimensional embedding tensor.
The elegance of the CLIP model lies in its ability to map both image data and textual data to the same embedding space as illustrated in the above image.
If the input query is textual, we can use the tokenizer to tokenize it and create token_ids. Subsequently, we can generate an embeddings tensor using the get_text_features method from the model class. This process will result in an embedding feature tensor with the shape (1,512).
If the input query is an image, we can use the processor to process and convert the image into a format suitable for the model. Following this, we can generate an image embedding tensor with the shape (1, 512) using the get_image_features method from the model class.
Hence, it functions as a versatile model capable of generating either image embeddings or text embeddings depending on our specific use case. The key advantage is the consistent dimensionality of both embedding types, whether text or image. Pre-trained to understand the interconnected feature distribution between an image and its captions, the model stands as the optimal choice for text-to-image or image-to-image searches.
OpenAI's comment on the CLIP model:
"If the task of a dataset is classifying photos of dogs vs cats, we check for each image whether a CLIP model predicts the text description 'a photo of a dog' or 'a photo of a cat' is more likely to be paired with it."
from transformers import AutoTokenizer, AutoProcessor, AutoModelForZeroShotImageClassification
print("[INFO] Loading the model...")
model_name = "openai/clip-vit-base-patch32"
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForZeroShotImageClassification.from_pretrained(model_name)
Here, we have a tokenizer that is used to tokenize text and a processor that prepares images in a format the model can consume.
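To make the shared-embedding-space idea concrete, here is a minimal sketch (assuming the tokenizer, processor, and model loaded above, plus the dataset from the previous section) that embeds a sample text query and a sample image, checks that both come out with shape (1, 512), and compares them with cosine similarity:
import torch

sample = ds[0]

# Embed a text query -> shape (1, 512)
text_inputs = tokenizer("a satellite photo of an airport", return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

# Embed an image -> shape (1, 512)
image_inputs = processor(images=sample["image"], return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)

print(text_emb.shape, img_emb.shape)  # both should be torch.Size([1, 512])

# Because both vectors live in the same space, cosine similarity between them is meaningful
print(torch.nn.functional.cosine_similarity(text_emb, img_emb).item())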
After loading the model, you will need to instantiate a Qdrant client connected to the local Docker container running the Qdrant application. Then, create a Qdrant data collection that will host the vectorized data. We set the vector size to 512 since the embedding feature tensor output by the model has shape (1,512).
from qdrant_client import QdrantClient
from qdrant_client.http import models
client = QdrantClient("localhost", port=6333)
print("[INFO] Client created...")
print("[INFO] Creating qdrant data collection...")
client.create_collection(
collection_name="satellite_img_db",
vectors_config=models.VectorParams(size=512, distance=models.Distance.COSINE),
)
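As an optional check, you can ask the client for the collection's details to confirm it was created with the expected configuration. (Note that create_collection raises an error if a collection with the same name already exists, so delete it first when re-running the script.)
# Optional: confirm the collection exists and inspect its vector configuration
info = client.get_collection(collection_name="satellite_img_db")
print(info.status)                 # expect a "green" status for a healthy collection
print(info.config.params.vectors)  # should report size=512 and Cosine distance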
Populate the VectorDB by processing each image in the dataset, extracting its features using the CLIP model, and uploading the resulting embeddings to Qdrant's "satellite_img_db" data collection.
Note: If you observe closely, I am not only saving the image embeddings but also storing the image pixel values and image size in the vector payload. I will use this information later to reconstruct the image for display on the Gradio app. To better understand the flow of the experiment, do check out the "Data Flow" illustration that I made in the following section.
from tqdm import tqdm
import numpy as np
print("[INFO] Creating a data collection...")
records = []
for idx, sample in tqdm(enumerate(ds), total=len(ds)):
    processed_img = processor(text=None, images=sample['image'], return_tensors="pt")['pixel_values']
    img_embds = model.get_image_features(processed_img).detach().numpy().tolist()[0]
    img_px = list(sample['image'].getdata())
    img_size = sample['image'].size
    records.append(models.Record(id=idx, vector=img_embds, payload={"pixel_lst": img_px, "img_size": img_size, "captions": sample['captions']}))
#uploading the records to client
print("[INFO] Uploading data records to data collection...")
# It's better to upload the data to the VectorDB in chunks
batch_size = 30
for i in range(0, len(records), batch_size):
    client.upload_records(
        collection_name="satellite_img_db",
        records=records[i:i + batch_size],
    )
    print(f"finished {min(i + batch_size, len(records))}")
print("[INFO] Successfully uploaded data records to data collection!")
Embedding Feature-Vector-Driven Semantic Search Over Vector Database for Active Image Retrieval
Now that we have our data ready and chilling in Qdrant's VectorDB, let's build an app to interact with it and retrieve information through Qdrant's semantic search functionality.
I will be using Gradio to build a quick functional application with a beautiful UI. Why? Because it comes with a prebuilt UI bundle that is easy to set up and great for quick demos. Coding with it is a breeze. Just visit Hugging Face Spaces and you will understand what I mean.
To put it in simple terms: all this application needs to do is take a text input from the user, vectorize the text by generating text embeddings with the get_text_features method of the model class, and then, using the vectorized text as a query, perform semantic search over the VectorDB via the search method of Qdrant's client class.
import gradio as gr
from PIL import Image

def process_text(text):
    inp = tokenizer(text, return_tensors="pt")
    text_embeddings = model.get_text_features(**inp).detach().numpy().tolist()[0]
    hits = client.search(
        collection_name="satellite_img_db",
        query_vector=text_embeddings,
        limit=1,
    )
    for hit in hits:
        img_size = tuple(hit.payload['img_size'])
        pixel_lst = hit.payload['pixel_lst']
        new_image = Image.new("RGB", img_size)
        new_image.putdata(list(map(lambda x: tuple(x), pixel_lst)))
        return new_image
iface = gr.Interface(
    title="Semantic Search Over Satellite Images Using Qdrant Vector Database",
    description="by Niranjan Akella",
    fn=process_text,
    inputs=gr.Textbox(label="Input prompt"),
    outputs=gr.Image(type="pil", label="Satellite Image"),
)
iface.launch()
Note: The complete code is shared at the end along with a link to the GitHub gist.
You can run the Gradio application directly from the terminal using the Python runtime: python3 app.py
Scope
The scope of this experiment doesn't end here. In this project, I built a text-to-image search engine, but it is also possible to build an image-to-image search engine using the processor of the CLIP model, as sketched below. I highly recommend experimenting with that, and do reach out to me on LinkedIn or X to discuss it further.
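For reference, here is a rough sketch of what that image-to-image variant could look like, reusing the processor, model, and client from the sections above; the query image path is just a placeholder.
from PIL import Image

def search_by_image(query_image, top_k=1):
    # Embed the query image with the same CLIP model and use that vector to search the collection
    pixel_values = processor(images=query_image, return_tensors="pt")["pixel_values"]
    query_vector = model.get_image_features(pixel_values).detach().numpy().tolist()[0]
    return client.search(
        collection_name="satellite_img_db",
        query_vector=query_vector,
        limit=top_k,
    )

# Example usage (the retrieved payload can be reconstructed into a PIL image exactly as in process_text above)
# hits = search_by_image(Image.open("my_query_image.png"))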
Conclusion
In this project, I successfully combined the power of OpenAI's CLIP model for image embeddings with Qdrant's semantic search functionality over its vector database, mimicking a very popular Google Photos/Apple Photos feature. I demonstrated the power of AI coupled with a powerful VectorDB like Qdrant, along with a working Gradio demo that provides a user-friendly interface for semantic image search based on textual queries. This article can serve as a guide for building your own image search engine with AI + Qdrant's VectorDB, combining advanced open-sourced AI models and a scalable vector database.
Here's the code