Matt Fergoda

AI-Powered Image Search with CLIP, pgvector, and FastAPI

TL;DR

I built an image search engine that lets users find images in a database by describing their content in natural language.

A GIF of someone searching for yawning dogs, sleeping cats, then a beach at sunset with photos matching the search appearing after each query.

It's:

  • Fast. Images are searched using a vector database inside Postgres. In the demo you can search tens of thousands of images instantly with no fine-tuning and minimal hardware. No GPUs required.
  • Plug-and-play. You can easily configure it to connect to your own AWS S3 bucket.
  • Lightweight. There's no fluff or custom logic, making it easy to use as a microservice or customize to your needs.
  • Containerized. It's quick to run locally and deploy.
  • Well-tested. There are comprehensive unit and integration tests with 98% code coverage.
  • Open source.

Here's a demo where you can search against Unsplash's open source dataset of 25,000 images, and here's a link to the code.

Background

Talk of generative AI is everywhere these days. And for good reason. However, there are large non-generative open source models that predate the release of ChatGPT and have gotten a bit buried in the generative AI hype. These models can be used to build powerful product features too.

Take pre-trained encoder models. These are large transformer neural networks that take in text, image, or audio data and spit back out an array of numbers called an embedding. Arrays of numbers are often called vectors in machine learning, so you'll often see these called vector embeddings.

These models have already learned from thousands or even millions of examples, in such a way that the embeddings reflect the content of the image/text/audio passed in. That means you can infer how similar the content of two inputs is -- texts, images, or audio files -- by comparing their embeddings with nothing more than simple addition and multiplication.

A diagram showing how an image encoder turns image files into vector embeddings you can compare.

Note: in reality encoding models usually output vectors with dimension in the 100s or 1,000s, but 2-D embeddings make it easy to visualize their similarity.
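
To make that concrete, here's a toy comparison of 2-D embeddings. The numbers are invented for illustration; in practice they'd come out of an encoder model:

```python
import numpy as np

# Made-up 2-D embeddings: two images with similar content and one unrelated image.
dog_photo = np.array([0.9, 0.4])
dog_sketch = np.array([0.8, 0.5])
beach_photo = np.array([-0.2, 0.95])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product divided by the two vectors' lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(dog_photo, dog_sketch))   # ~0.99 -> very similar content
print(cosine_similarity(dog_photo, beach_photo))  # ~0.21 -> dissimilar content
```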

Until recently, an image encoder would be used to encode and compare images while text encoders would do the same for text. Then, in early 2021, OpenAI introduced CLIP and broke the barrier between image and text encoders.

CLIP was trained on 400 million publicly available image-text pairs. During training, CLIP ran the text through a text encoder and the images through an image encoder, and learned to predict which image embeddings matched the embeddings of their corresponding text. As a result, an image and a piece of text describing the same content end up with similar embeddings. When we run images or text through the trained model, we can compare the resulting embeddings just like we could with embeddings from the individual encoders. This means we can quickly compare the semantic content of images and text, paving the way for tons of creative AI-powered applications.

A diagram showing how CLIP can generate text or image embeddings that you can compare.
Note: Since these embeddings can be for images or text, you'll sometimes see them called "multimodal" embeddings, like I did here.
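
Here's roughly what that looks like in code with Hugging Face's transformers library. This is just a sketch -- the model checkpoint, image file, and query text are placeholders, not the project's actual code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")

with torch.no_grad():
    # Encode the image.
    image_inputs = processor(images=image, return_tensors="pt")
    image_embedding = model.get_image_features(**image_inputs)

    # Encode a text query into the same embedding space.
    text_inputs = processor(text=["a yawning dog"], return_tensors="pt", padding=True)
    text_embedding = model.get_text_features(**text_inputs)

# Normalize both embeddings, then a dot product gives their cosine similarity.
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
print((image_embedding @ text_embedding.T).item())
```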

Application Logic

Here's the logical flow of a CLIP-powered image search engine:

  • When a user uploads an image, run it through CLIP and store the embedding in a vector database (more on this below).
  • When a user searches for images, run the query text through CLIP and compare its embedding to the image embeddings stored in the vector database. Return the results sorted by similarity.

A diagram illustrating the application logic as described above.

I also added some API routes for getting a single image, uploading an image, and deleting an image.

The Technology Stack

  • HuggingFace's implementation of CLIP. This interface nicely abstracts away some of the more verbose PyTorch patterns, making the code more succinct.
  • AWS S3 for storing and hosting image files.
  • PostgreSQL with the pgvector extension for adding a vector database.
  • SQLAlchemy ORM for abstracting away SQL queries and handling query sanitization.
  • FastAPI.
  • Docker.

A diagram showing how elements of the tech stack connect to each other.

Design Decisions

Here's some discussion of a few of the design decisions I made.

Using the pgvector PostgreSQL extension as a vector database

Vector databases are optimized for storing and searching vectors, making them ideal for storing embeddings. There are several stand-alone vector database offerings, but I went with pgvector. pgvector lets you store embeddings in a vector column that you can treat like just another column in your Postgres table. This is really handy for minimizing both the complexity of the application architecture and the number of requests made to different data stores. It's also compatible with SQLAlchemy, so the code interacting with the vector store can live right alongside the rest of the code for interacting with the data model. And, like everything else in the Postgres ecosystem, it's open source.
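
To give a sense of how little ceremony this involves, here's a minimal sketch of a table with an embedding column using SQLAlchemy and pgvector. The table and column names are illustrative, not necessarily what the project uses:

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy import String
from sqlalchemy.orm import DeclarativeBase, mapped_column

class Base(DeclarativeBase):
    pass

class Image(Base):
    __tablename__ = "images"

    name = mapped_column(String, primary_key=True)
    # One vector column per image; CLIP ViT-B/32 embeddings have 512 dimensions.
    # (The Postgres database needs the extension enabled: CREATE EXTENSION vector;)
    embedding = mapped_column(Vector(512))
```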

Storing normalized image embeddings to avoid extra calculations on retrieval

The similarity between embeddings is calculated using cosine similarity, which divides the dot product of the two vectors by the product of their lengths. If we were to store the raw embeddings in the database, we'd have to calculate those lengths at retrieval time to get the cosine similarity. But if we instead normalize the vectors to length 1 when they're written to the database, a plain dot product at retrieval time gives the same ranking with fewer operations.
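
Here's a sketch of the idea, reusing the illustrative Image model from the previous snippet (the function names and session handling are made up, not the project's code):

```python
import numpy as np
from sqlalchemy import select
from sqlalchemy.orm import Session

def normalize(embedding: np.ndarray) -> np.ndarray:
    """Scale an embedding to length 1 before writing it to the database."""
    return embedding / np.linalg.norm(embedding)

def search_images(session: Session, query_embedding: np.ndarray, limit: int = 20):
    """Return the images most similar to a CLIP embedding of the query text."""
    # pgvector's max_inner_product maps to the <#> operator, which returns the
    # *negative* inner product, so ascending order puts the best matches first.
    # For unit-length vectors this ranking is identical to cosine similarity.
    statement = (
        select(Image)
        .order_by(Image.embedding.max_inner_product(normalize(query_embedding)))
        .limit(limit)
    )
    return session.scalars(statement).all()
```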

Scraping images' EXIF metadata on upload

EXIF data contains key-value pairs about the image: the camera model, location data, whether it was taken in portrait or landscape, etc. This data isn't used in the current iteration of the search engine, but it may be useful in future iterations. Perhaps this could become a hybrid search engine, with some traditional filtering logic on top of the CLIP-powered search.
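
For example, a rough sketch of reading EXIF tags with Pillow might look like this (the file path and sample output are just examples):

```python
from PIL import Image
from PIL.ExifTags import TAGS

def read_exif(path: str) -> dict[str, str]:
    """Return an image's EXIF metadata as human-readable key-value pairs."""
    with Image.open(path) as image:
        exif = image.getexif()
    # Map numeric EXIF tag IDs to their human-readable names.
    return {TAGS.get(tag_id, str(tag_id)): str(value) for tag_id, value in exif.items()}

print(read_exif("dog.jpg"))  # e.g. {'Model': '...', 'Orientation': '1', ...}
```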

POST and DELETE routes are protected

These routes are protected with an Authorization header to avoid abuse of the API if it's public-facing. FastAPI also has security utilities for implementing OAuth2 if needed.
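
As a rough sketch of the idea (the environment variable name and route are illustrative, not necessarily what the project uses):

```python
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def verify_token(authorization: str = Header(...)) -> None:
    """Reject the request unless the Authorization header matches a shared secret."""
    if authorization != os.environ["UPLOAD_TOKEN"]:
        raise HTTPException(status_code=401, detail="Invalid or missing token")

@app.delete("/images/{name}", dependencies=[Depends(verify_token)])
def delete_image(name: str):
    # Delete the file from S3 and its row (embedding included) from Postgres.
    ...
```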

FastAPI as the backend framework

I initially planned to build this with Flask, but thought it would be a good opportunity to try FastAPI. With FastAPI you get data validation, JSON error messages, and Swagger documentation out of the box, which sounded pretty appealing. Overall I found the experience building with FastAPI pretty seamless, and I'll definitely use it again.
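
As a toy illustration of what you get for free (the route and parameters here are made up): FastAPI validates the query parameters, returns JSON error messages for bad requests, and documents the route in Swagger UI without any extra code.

```python
from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/search")
def search(query: str = Query(..., min_length=1), limit: int = Query(20, le=100)):
    # Requests with a missing query or an out-of-range limit are rejected with a
    # JSON error response before this function ever runs.
    return {"query": query, "limit": limit}
```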

The API and database are containerized with Docker

This is to minimize development and deployment environment headaches.

Demo

Here's a live demo with a simple React frontend. It's searching against an S3 bucket containing Unsplash's open source dataset of 25,000 images, plus a few of my own.

Closing

This project showcases one potential application of pre-trained, non-generative AI models and how easily they can integrate with technology you're already building with. Feel free to fork this repo and use it in your own projects.

If you have any questions, comments, suggestions, or ideas for additional features, please reach out!

Are you or someone you know looking for an AI or machine learning engineer?

I'm currently in the market for MLE or backend/full-stack roles with a heavy ML/AI component. I'm an engineer with experience working across the stack who specializes in building and deploying ML models.

LinkedIn
GitHub
Portfolio Site
