<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Fergoda</title>
    <description>The latest articles on DEV Community by Matt Fergoda (@mattfergoda).</description>
    <link>https://dev.to/mattfergoda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1275653%2F0eac54b7-fae7-4526-be0e-1d30f9f5c69a.jpeg</url>
      <title>DEV Community: Matt Fergoda</title>
      <link>https://dev.to/mattfergoda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattfergoda"/>
    <language>en</language>
    <item>
      <title>AI-Powered Image Search with CLIP, pgvector, and FastAPI</title>
      <dc:creator>Matt Fergoda</dc:creator>
      <pubDate>Mon, 12 Feb 2024 18:01:18 +0000</pubDate>
      <link>https://dev.to/mattfergoda/ai-powered-image-search-with-clip-pgvector-and-fast-api-1f1d</link>
      <guid>https://dev.to/mattfergoda/ai-powered-image-search-with-clip-pgvector-and-fast-api-1f1d</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built an image search engine that allows a user to find images in a database based on their content using natural language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanc6p7e0bjptbfqb18oh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanc6p7e0bjptbfqb18oh.gif" alt="A GIF of someone searching for yawning dogs, sleeping cats, then a beach at sunset with photos matching the search appearing after each query."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast.&lt;/strong&gt; Images are searched using a vector database inside Postgres. In the &lt;a href="https://demo.simsearch.mattfergoda.me/" rel="noopener noreferrer"&gt;demo&lt;/a&gt; you can search tens of thousands of images instantly with no fine-tuning and minimal hardware. No GPUs required. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plug-and-play.&lt;/strong&gt; You can easily configure it to connect to your own AWS S3 bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight.&lt;/strong&gt; There's no fluff or custom logic, making it easy to use as a microservice or to customize to your needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerized.&lt;/strong&gt; It's quick to run locally and deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well-tested.&lt;/strong&gt; There are comprehensive unit and integration tests with 98% code coverage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open source.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a &lt;a href="https://demo.simsearch.mattfergoda.me/" rel="noopener noreferrer"&gt;demo&lt;/a&gt; where you can search against &lt;a href="https://unsplash.com/data" rel="noopener noreferrer"&gt;Unsplash's open source dataset of 25,000 images&lt;/a&gt; and here's &lt;a href="https://github.com/mattfergoda/semantic-image-search" rel="noopener noreferrer"&gt;a link to the code&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Talk of generative AI is everywhere these days. And for &lt;a href="https://www.gartner.com/en/topics/generative-ai" rel="noopener noreferrer"&gt;good reason&lt;/a&gt;. However, there are large non-generative open source models that predate the release of ChatGPT and have gotten a bit buried in the generative AI hype. These models can be used to build powerful product features too.&lt;/p&gt;

&lt;p&gt;Take pre-trained encoder models. These are large &lt;a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)" rel="noopener noreferrer"&gt;transformer neural networks&lt;/a&gt; that take in text, image, or audio data and spit back out an array of numbers called an &lt;em&gt;embedding&lt;/em&gt;. Arrays of numbers are often called &lt;em&gt;vectors&lt;/em&gt; in machine learning, so you'll often see these called &lt;em&gt;vector embeddings&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;These models have already learned from thousands or even millions of examples in such a way that their embeddings reflect the content of the image/text/audio passed into them. That means you can infer the similarity of the content of the &lt;em&gt;inputs&lt;/em&gt; -- the text, the image, or the audio file -- from the similarity of their embeddings, using simple addition and multiplication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam2pb1s195qvzzf56rlz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam2pb1s195qvzzf56rlz.png" alt="A diagram showing how an image encoder turns image files into vector embeddings you can compare."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: in reality, encoding models usually output vectors with hundreds or thousands of dimensions, but 2-D embeddings make their similarity easy to visualize.&lt;/em&gt;&lt;/p&gt;
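&lt;p&gt;To make the comparison concrete, here's a minimal sketch in plain Python. The toy 2-D embeddings are made-up values for illustration, not real model output:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (norm(a) * norm(b)). Ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-D embeddings: two dog photos and one beach photo (illustrative values).
dog_1 = [0.9, 0.1]
dog_2 = [0.8, 0.2]
beach = [0.1, 0.9]

print(cosine_similarity(dog_1, dog_2))  # close to 1: similar content
print(cosine_similarity(dog_1, beach))  # much lower: different content
```

&lt;p&gt;The "similar" pair scores near 1 and the dissimilar pair scores much lower, which is exactly the signal a search engine can rank on.&lt;/p&gt;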

&lt;p&gt;Until recently, an image encoder would be used to encode and compare images while text encoders would do the same for text. Then, in early 2021, OpenAI introduced &lt;a href="https://openai.com/research/clip" rel="noopener noreferrer"&gt;CLIP&lt;/a&gt; and broke the barrier between image and text encoders. &lt;/p&gt;

&lt;p&gt;CLIP was trained on 400 million publicly-available image-text pairs. During training, CLIP ran the text through a text encoder and the images through an image encoder and learned to predict which vector embeddings for images match those for the corresponding text based on their content. When we run images or text through the trained model and get their embeddings, we can compare those embeddings like we could with the embeddings from individual encoders. This means we can quickly compare the semantic content of images and text, paving the way for tons of creative AI-powered applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl1t1noqtic4h7ede2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrl1t1noqtic4h7ede2r.png" alt="A diagram showing how CLIP can generate text or image embeddings that you can compare."&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Since these embeddings can be for images or text, you'll sometimes see them called "multimodal" embeddings, like I did here.&lt;/em&gt;&lt;/p&gt;
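&lt;p&gt;Here's a sketch of what getting these embeddings looks like with HuggingFace's transformers library. The checkpoint name is the public &lt;code&gt;openai/clip-vit-base-patch32&lt;/code&gt;; the image path is a placeholder:&lt;/p&gt;

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"

def load_clip():
    # Downloads the pre-trained weights on first call.
    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)
    return model, processor

def embed_text(model, processor, text):
    """Run text through CLIP's text encoder and return its embedding."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)[0]

def embed_image(model, processor, image_path):
    """Run an image through CLIP's image encoder and return its embedding."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    return model.get_image_features(**inputs)[0]

if __name__ == "__main__":
    model, processor = load_clip()
    text_emb = embed_text(model, processor, "a dog yawning")
    image_emb = embed_image(model, processor, "yawning_dog.jpg")  # placeholder path
    # Both embeddings live in the same space, so they can be compared directly.
```

&lt;p&gt;Because the text and image embeddings live in the same vector space, the two can be compared directly with cosine similarity.&lt;/p&gt;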

&lt;h2&gt;
  
  
  Application Logic
&lt;/h2&gt;

&lt;p&gt;Here's the logical flow of a CLIP-powered image search engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a user uploads an image, run it through CLIP and store the embedding in a vector database (more on this below).&lt;/li&gt;
&lt;li&gt;When a user searches for images, run the query text through CLIP and compare its embedding to the image embeddings stored in the vector database. Return the results sorted by similarity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj2l9khn4obr59alp5vx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj2l9khn4obr59alp5vx.png" alt="A diagram illustrating the application logic as described above."&gt;&lt;/a&gt;&lt;/p&gt;
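&lt;p&gt;The two steps above can be sketched in plain Python, with an in-memory list standing in for the vector database and precomputed toy embeddings standing in for CLIP's output:&lt;/p&gt;

```python
def index_image(store, name, embedding):
    """On upload: store the image's embedding alongside its name."""
    store.append((name, embedding))

def search(store, query_embedding, limit=5):
    """On search: rank stored images by dot product with the query embedding.
    Assumes embeddings are pre-normalized, so dot product ranks by cosine similarity."""
    scored = []
    for name, embedding in store:
        score = sum(q * v for q, v in zip(query_embedding, embedding))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]

store = []
index_image(store, "dog.jpg", [1.0, 0.0])
index_image(store, "beach.jpg", [0.0, 1.0])
print(search(store, [0.9, 0.1]))  # dog.jpg ranks first
```

&lt;p&gt;In the real application the store is a Postgres table and the ranking happens inside the database, but the logic is the same.&lt;/p&gt;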

&lt;p&gt;I also added some API routes for getting a single image, uploading an image, and deleting an image.  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Technology Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/transformers/model_doc/clip" rel="noopener noreferrer"&gt;HuggingFace's instance of CLIP&lt;/a&gt;. This CLIP interface nicely abstracts away some of the more verbose Pytorch patterns making the code more succinct.&lt;/li&gt;
&lt;li&gt;AWS S3 for storing and hosting image files.&lt;/li&gt;
&lt;li&gt;PostgreSQL with the pgvector extension for adding a vector database.&lt;/li&gt;
&lt;li&gt;SQLAlchemy ORM for abstracting away SQL queries and handling query sanitization.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;Fast API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Docker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcaj7g3x4et9zfbonzx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcaj7g3x4et9zfbonzx4.png" alt="A diagram showing how elements of the tech stack connect to each other."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;p&gt;Here's a closer look at a few of the design decisions I made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the pgvector PostgreSQL extension as a vector database
&lt;/h3&gt;

&lt;p&gt;Vector databases are optimized for storing and searching vectors, making them ideal for storing embeddings. There are several stand-alone vector database offerings, but I went with pgvector. pgvector lets you store embeddings in a vector column and treat it like just another column in your Postgres table. This is really handy for minimizing both the complexity of the application architecture and the number of requests made to different data stores. It's also compatible with SQLAlchemy, so the code interacting with the vector store can live right alongside the rest of the code for interacting with the data model. And, like everything else in the Postgres ecosystem, it's open source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing &lt;em&gt;normalized&lt;/em&gt; image embeddings to avoid extra calculations on retrieval
&lt;/h3&gt;

&lt;p&gt;The similarity between embeddings is calculated using &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt;, whose formula divides the two vectors' dot product by the product of their lengths. If we stored the raw embeddings in the database, we'd have to compute those lengths at retrieval time to get the cosine similarity. But if we instead normalize the vectors to length 1 when they're written to the database, we can simply do a &lt;a href="https://en.wikipedia.org/wiki/Dot_product" rel="noopener noreferrer"&gt;dot product&lt;/a&gt; comparison at retrieval time, which is fewer operations.&lt;/p&gt;
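&lt;p&gt;Here's a small sketch of the saving in plain Python, with toy 2-D vectors for illustration:&lt;/p&gt;

```python
import math

def normalize(vec):
    """Scale a vector to length 1 before writing it to the database."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

raw_query = [3.0, 4.0]
raw_image = [1.0, 0.0]

# The expensive way: compute both vector lengths at retrieval time.
cosine = dot(raw_query, raw_image) / (
    math.sqrt(dot(raw_query, raw_query)) * math.sqrt(dot(raw_image, raw_image))
)

# The cheap way: normalize once at write time, then a dot product suffices.
fast = dot(normalize(raw_query), normalize(raw_image))

print(cosine, fast)  # identical results
```

&lt;p&gt;Both approaches give the same score; normalizing once at write time just moves the square roots out of the hot retrieval path.&lt;/p&gt;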

&lt;h3&gt;
  
  
  Scraping images' EXIF metadata on upload
&lt;/h3&gt;

&lt;p&gt;EXIF data contains key-value pairs about the image: the camera model, location data, whether it was taken in portrait or landscape, etc. This data isn't used in the current iteration of the search engine, but it may be useful to have in future iterations. Perhaps this could become a hybrid search engine, with some traditional filtering logic on top of the CLIP-powered search.&lt;/p&gt;
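&lt;p&gt;One way to do this scraping is with Pillow, which exposes EXIF tags directly; a sketch (the example tag values shown in the comment are hypothetical):&lt;/p&gt;

```python
from PIL import Image
from PIL.ExifTags import TAGS

def extract_exif(image_file):
    """Return an image's EXIF data as a dict with human-readable tag names."""
    image = Image.open(image_file)
    exif = image.getexif()
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

# extract_exif("photo.jpg") might return something like
# {"Model": "NIKON D750", "Orientation": 1, ...} depending on the camera.
```

&lt;p&gt;Storing these pairs alongside each image at upload time is cheap and leaves the door open for metadata filters later.&lt;/p&gt;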

&lt;h3&gt;
  
  
  POST and DELETE routes are protected
&lt;/h3&gt;

&lt;p&gt;These routes are protected with an Authorization header to prevent abuse of the API if it's public-facing. FastAPI also has &lt;a href="https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/" rel="noopener noreferrer"&gt;security utilities&lt;/a&gt; for implementing OAuth2 if needed.&lt;/p&gt;
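&lt;p&gt;The simplest version of this kind of protection is a shared token compared in constant time. Here's a sketch; the token value and how it's wired into a FastAPI dependency are assumptions on my part, not the project's exact implementation:&lt;/p&gt;

```python
import secrets

EXPECTED_TOKEN = "change-me"  # in practice, read from an environment variable

def is_authorized(authorization_header):
    """Constant-time comparison to avoid leaking the token via timing."""
    if authorization_header is None:
        return False
    return secrets.compare_digest(authorization_header, EXPECTED_TOKEN)

# In FastAPI this would typically live inside a dependency that raises
# HTTPException(status_code=401) when is_authorized(...) returns False.
```

&lt;p&gt;Using &lt;code&gt;secrets.compare_digest&lt;/code&gt; instead of &lt;code&gt;==&lt;/code&gt; keeps the comparison time independent of how many leading characters match.&lt;/p&gt;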

&lt;h3&gt;
  
  
  Fast API as the backend framework
&lt;/h3&gt;

&lt;p&gt;I initially planned to build this with Flask, but thought it would be a good opportunity to try FastAPI. With FastAPI you get data validation, JSON error messages, and Swagger documentation out of the box, which sounded pretty appealing. Overall I found building with FastAPI pretty seamless, and I'll definitely use it again.&lt;/p&gt;

&lt;h3&gt;
  
  
  The API and database are containerized with Docker
&lt;/h3&gt;

&lt;p&gt;This is to minimize development and deployment environment headaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Here's &lt;a href="https://demo.simsearch.mattfergoda.me" rel="noopener noreferrer"&gt;a live demo with a simple React frontend&lt;/a&gt;. It's searching against an S3 bucket containing &lt;a href="https://unsplash.com/data" rel="noopener noreferrer"&gt;Unsplash's open source dataset of 25,000 images&lt;/a&gt;, plus a few of my own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;This project showcases one potential application of pre-trained, non-generative AI models and how easily they can integrate with technology you're already building with. Feel free to fork this repo and use it in your own projects.&lt;/p&gt;

&lt;p&gt;If you have any questions, comments, suggestions, or ideas for additional features, please reach out!&lt;/p&gt;

&lt;h2&gt;
  
  
  Are you or someone you know looking for an AI or machine learning engineer?
&lt;/h2&gt;

&lt;p&gt;I'm currently in the market for MLE roles, or backend / full-stack roles with a heavy ML/AI component. I'm an engineer with experience working across the stack who specializes in building and deploying ML models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/matt-fergoda/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/mattfergoda" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
&lt;a href="https://mattfergoda.me" rel="noopener noreferrer"&gt;Portfolio Site&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>fastapi</category>
      <category>vectordatabase</category>
    </item>
  </channel>
</rss>
