Joel Wekesa

Beyond Keywords: Crafting Semantic Search with FastAPI and AI

So, it's a cozy Netflix and chill night, and you're in the mood for a steamy romance flick. You can't quite remember the title, but you know it had sizzling chemistry between the leads, a passionate beach scene, and a twist that left you breathless. In the past, you'd be stuck scrolling through endless lists or relying on awkward keyword searches. But what if your search could actually understand the sultry essence of what you're craving?
Enter the world of semantic search - a game-changer in how we interact with data. In this blog post, we'll explore the exciting realm of semantic search, powered by a combination of LangChain, Hugging Face embeddings, and pgvector. Together, these cutting-edge technologies can transform a simple movie dataset into a smart, intuitive search experience that grasps context and meaning, not just keywords.
At the heart of semantic search systems are vector databases and embeddings. Vector databases store complex data as high-dimensional vectors, enabling efficient similarity searches. Embeddings are numerical representations of data (like words or documents) that capture semantic relationships in a multi-dimensional space. This allows for understanding context and intent beyond simple keyword matching. 
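To make this concrete, here's a minimal sketch of embeddings in action, using the same sentence-transformers model we'll rely on later in this post (you'd need pip install sentence-transformers to run it):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "A passionate romance set on a sun-drenched beach",
    "Two lovers meet at the seaside and fall hard for each other",
    "A documentary about deep-sea fishing quotas",
]

# Each sentence becomes a 768-dimensional vector
embeddings = model.encode(sentences)

# Semantically similar sentences land close together in vector space,
# even though they share almost no keywords
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low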
Now that we've set the stage for semantic search, it's time to dive into the technical implementation.

Project Overview

We're building a powerful FastAPI-based API that accepts natural language search queries and returns results based on meaning rather than just keywords. FastAPI's speed and automatic API documentation will streamline the development process.

Tech Stack

FastAPI: The lightning-fast, easy-to-use API framework.
LangChain: Orchestrates our language model operations.
Hugging Face: Provides pre-trained embeddings for deep semantic understanding.
pgvector: Powers efficient similarity search in our database.
Docker: Facilitates running a PostgreSQL instance with pgvector.

Dataset

We're using a movies dataset that contains information about different films, such as titles, genres, descriptions, and more. You can find and download the dataset here. Make sure to grab it and have it ready in your working directory.

What to Expect

This API will leverage these tools to return the most relevant results. Imagine typing

"passionate beach rendezvous"

and instantly getting perfect movie recommendations!

We'll walk you through each step - from setting up the FastAPI framework to integrating the semantic search tools that will bring our movie dataset to life.

Setting Up FastAPI

Before we start coding, let's set up our development environment. We'll be using FastAPI as our web framework, and it's essential to isolate our project dependencies in a virtual environment.

1. Create and Activate a Virtual Environment

First, open your terminal and navigate to the directory where you want to set up the project. Then, create a virtual environment:
For Mac/Linux:

python3 -m venv env

For Windows:

python -m venv env

Next, activate the virtual environment:
For Mac/Linux:

source env/bin/activate

For Windows:

.\env\Scripts\activate

Once the virtual environment is activated, your terminal prompt should show (env), indicating that you're working inside the env virtual environment.

2. Install FastAPI

With the virtual environment activated, install FastAPI.

pip install "fastapi[standard]"

3. Verify Installation

To verify that FastAPI is installed correctly, you can run the following command:

pip freeze

You should see fastapi listed in the output.

4. Creating Your FastAPI App

Now, let's create a simple FastAPI app to make sure everything is working.
In your project directory, create a file called main.py and add the following code:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {
        "message": "API is up and running! Happy days"
    }

This is a simple FastAPI app that returns a JSON response when you visit the root URL (/).

5. Running the FastAPI App

To run the app, run this command in your terminal:

fastapi dev main.py --port 9000

This command does two things:

  1. fastapi dev:
    This is used for running the FastAPI app in development mode.

  2. --port 9000:
    Instead of running the app on the default port (8000), it will be hosted on port 9000.

Once the server is up, you can access the app in your browser at http://127.0.0.1:9000/. You'll see the message: {"message": "API is up and running! Happy days"}.
Additionally, FastAPI provides automatic API documentation, which you can explore by visiting the interactive docs at http://127.0.0.1:9000/docs.
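If you prefer the terminal, a quick cURL request confirms the same thing:

curl http://127.0.0.1:9000/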

Now that we have FastAPI up and running, we can move on to integrating our tech stack with LangChain, Hugging Face embeddings, and pgvector for semantic search!

Next Step: Setting Up the Movies Dataset

Now, let's add the movies dataset to our project. This dataset will be the foundation of our semantic search functionality.

1. Copy the Movies JSON File

After downloading the movies dataset, copy the movies.json file into the root of your project directory. It should be at the same level as your main.py file.
Your project structure should now look like this:

/your-project-directory
├── env/           # Your virtual environment
├── main.py        # FastAPI app
└── movies.json    # Movies dataset

2. Verify the Dataset

Let's quickly verify that the dataset is correctly placed and readable. Add the following code to your main.py file:

import json

@app.get("/dataset-info")
async def dataset_info():
    with open('movies.json', 'r') as f:
        movies = json.load(f)
    return {
        "total_movies": len(movies),
        "first_movie": movies[0]
    }

Run your FastAPI app and visit http://127.0.0.1:9000/dataset-info. You should see the total number of movies and details of the first movie in the dataset. 

Next Step: Setting Up the pgvector Database with Docker

To leverage the power of pgvector for our semantic search capabilities, we need to set up a PostgreSQL instance with pgvector. This can be easily done using Docker.

1. Run the Docker Command

Open your terminal and execute the following command:

docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16


Breaking Down the Command:

docker run: This command is used to create and start a new container.
--name pgvector-container: This option names the container pgvector-container for easy reference.
-e POSTGRES_USER=langchain: Sets the PostgreSQL username to langchain.
-e POSTGRES_PASSWORD=langchain: Sets the password for the langchain user.
-e POSTGRES_DB=langchain: Creates a new database named langchain.
-p 6024:5432: Maps port 5432 in the container (the default PostgreSQL port) to port 6024 on your local machine. You will connect to PostgreSQL through port 6024.
-d: Runs the container in detached mode, allowing it to run in the background.
pgvector/pgvector:pg16: Specifies the image to use, in this case the pgvector build compatible with PostgreSQL 16.

2. Verify the Container is Running 

After executing the command, check if the container is running smoothly:

docker ps

You should see pgvector-container listed in the output, indicating that the PostgreSQL instance is ready for use.
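If you want to double-check from inside the container, this psql one-liner (using the credentials we passed to docker run) confirms the database is reachable and that the vector extension ships with the image:

docker exec -it pgvector-container psql -U langchain -d langchain -c "SELECT name, default_version FROM pg_available_extensions WHERE name = 'vector';"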

Next Step: Managing Sensitive Information

To manage our secrets effectively, we'll create a .env file at the root of our project. This file will store sensitive information like our database connection string.

1. Create the .env File

In your project directory, create a file named .env and add the following content:

DB_CONNECTION_STRING="postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
COLLECTION_NAME=movies

Explanation:
DB_CONNECTION_STRING: This variable holds the connection string required to connect to your PostgreSQL database. It specifies the driver (postgresql+psycopg), the username and password (langchain / langchain), the host and port (localhost:6024), and the database name (langchain) - all matching the values we passed to Docker.
COLLECTION_NAME: This variable specifies the name of the collection we'll be using, which is set to movies.

2. Install the python-dotenv Package

To use the environment variables from the .env file in our FastAPI app, we need to install the python-dotenv package. With your virtual environment activated, run the following command:

pip install python-dotenv

This package allows us to load environment variables from the .env file into our application, ensuring sensitive data is not hard-coded in our source files.
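As a quick sanity check, you can confirm the variables load correctly from a Python shell at the project root:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print(os.environ["DB_CONNECTION_STRING"])
print(os.environ["COLLECTION_NAME"])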

Next Step: Setting Up the Search Functionality

To facilitate our semantic search, we'll create a folder to hold our search-related functionality. This will include a script to check for existing movie IDs in our database, ensuring we don't re-add duplicate items.

1. Required Packages

For this section, ensure you have the following packages installed in your virtual environment. Note that sentence-transformers backs the Hugging Face embeddings and jq is required by LangChain's JSONLoader:

pip install sqlalchemy langchain langchain_community langchain_postgres sentence-transformers jq

2. Create the search Folder

At the root of your project, create a new folder named search:

mkdir search

3. Create the read_ids.py File

Inside the search folder, create a file named read_ids.py and paste the following code:

import os
from sqlalchemy import create_engine, text
from dotenv import load_dotenv

load_dotenv()

engine = create_engine(os.environ['DB_CONNECTION_STRING'])

def read_collection_ids():
    with engine.connect() as connection:
        query = text("SELECT id FROM langchain_pg_embedding")
        result = connection.execute(query)

        return [row[0] for row in result]

Explanation of the Code:

Imports:
os: Used to access environment variables.
create_engine and text from sqlalchemy: These are used to create a database connection and execute SQL queries.
load_dotenv: Loads environment variables from the .env file.

Environment Variable Loading:

The load_dotenv() function loads the database connection string defined in the .env file, allowing access to the database.

Creating the Database Engine:

engine = create_engine(os.environ['DB_CONNECTION_STRING']): Initializes the SQLAlchemy engine using the connection string.

Function read_collection_ids():

Connects to the database, executes a query to select all IDs from the langchain_pg_embedding table, and returns a list of IDs. This allows us to check for already existing items in the database, preventing duplicate entries.
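Once the database has been populated (langchain_postgres creates the langchain_pg_embedding table the first time a PGVector store is instantiated), you can try the function from a Python shell at the project root:

from search.read_ids import read_collection_ids

ids = read_collection_ids()
print(f"{len(ids)} embeddings already stored")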

Next Step: Loading the Data and Creating Embeddings

Now that we have the read_ids.py file, the next step is to set up the data loading and embedding pipeline. We will create a new file called setup.py in the search folder. This file will contain the code to load our movie data, create embeddings, and add new entries to the database.

1. Create the setup.py File

Inside the search folder, create a file named setup.py and paste the following code:

import os
import json
import time
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import JSONLoader
from langchain_postgres import PGVector
from dotenv import load_dotenv
from .read_ids import read_collection_ids

load_dotenv()

jq_schema = '.[] | {id: .id, title: .title, overview: .overview, genres: .genres, poster: .poster, release_date: .release_date}'

loader = JSONLoader(
    file_path='./movies.json',
    jq_schema=jq_schema,
    text_content=False,
)

data = loader.load()

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

vector_store = PGVector(
    embeddings=embedding_function,
    collection_name=os.environ["COLLECTION_NAME"],
    connection=os.environ["DB_CONNECTION_STRING"],
    use_jsonb=True,
)

def add_to_db_in_batches(batch_size=100):
    existing_ids = read_collection_ids()

    # Compare IDs as strings, since PGVector stores document IDs as text
    data_ids = [str(json.loads(item.page_content)["id"]) for item in data]

    new_ids = list(set(data_ids) - set(existing_ids))

    if len(new_ids) > 0:
        new_documents = [item for item in data if str(json.loads(item.page_content)["id"]) in new_ids]

        total_products = len(new_documents)
        start_time = time.time()  # Start the timer

        for i in range(0, total_products, batch_size):
            batch_data = new_documents[i:i + batch_size]
            ids = [str(json.loads(item.page_content)["id"]) for item in batch_data]
            vector_store.add_documents(batch_data, ids=ids)
            remaining = total_products - (i + len(batch_data))

            elapsed_time = time.time() - start_time
            batches_processed = (i // batch_size) + 1
            average_time_per_batch = elapsed_time / batches_processed if batches_processed > 0 else 0
            estimated_remaining_batches = (total_products // batch_size) - batches_processed
            estimated_remaining_time = average_time_per_batch * estimated_remaining_batches

            # Format estimated remaining time
            estimated_remaining_time_minutes = estimated_remaining_time // 60
            estimated_remaining_time_seconds = estimated_remaining_time % 60

            print(f'Added movies {i + 1} to {min(i + len(batch_data), total_products)} to the database. '
                f'Remaining: {remaining}. Estimated remaining time: {int(estimated_remaining_time_minutes)} minutes and {int(estimated_remaining_time_seconds)} seconds.')

    else:
        print('No new documents to add; the database is already up to date.')

Explanation of the Code

Let's break down the code:
Imports:
os, json, and time: Standard libraries for environment variable management, JSON handling, and time tracking.
HuggingFaceEmbeddings: Loads pre-trained embeddings from Hugging Face to convert movie data into numerical representations.
JSONLoader: Loads data from a JSON file based on the specified schema.
PGVector: Enables storing and querying of vector embeddings in a PostgreSQL database.
load_dotenv: Loads environment variables from the .env file.
read_collection_ids: A function to read existing IDs from the database, ensuring no duplicates are added.

Loading the Movie Data:
The jq_schema variable defines a JSON query to extract specific fields from the movies.json file.
The JSONLoader instance loads the movie data based on this schema, making it easy to handle the dataset.

Setting Up the Embedding Model:
The HuggingFaceEmbeddings instance is initialized with a pre-trained model that converts movie data into embeddings, which helps in understanding semantic meanings.

Setting Up the PGVector:
The PGVector instance connects to the PostgreSQL database, using the environment variables to retrieve the collection name and connection string.

Function add_to_db_in_batches(batch_size=100):
Reads existing IDs from the database using the read_collection_ids function.
Checks for new IDs that are not already in the database and prepares them for insertion.
The function processes new movie entries in batches (default size of 100). It measures the time taken for each batch and provides estimated remaining time for completion, helping to monitor progress during insertion.
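If you want to populate the database before wiring anything into FastAPI, a one-off run from a Python shell at the project root does the trick. Note that the first run will also download the embedding model from Hugging Face, so expect it to take a while:

from search.setup import add_to_db_in_batches

add_to_db_in_batches(batch_size=100)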

2. Current Folder Structure

After adding the setup.py file, your project structure should now look like this:

/your-project-directory
├── env/               # Your virtual environment
├── main.py            # FastAPI app
├── movies.json        # Movies dataset
├── .env               # Environment variables
└── search/            # Folder for search functionality
    ├── read_ids.py    # Script for reading existing movie IDs
    └── setup.py       # Script for loading and embedding movie data

Next Step: Implementing the Search Endpoint

Now, we will create a search.py file in the search folder. This file will define the search parameters and the function that will power our search endpoint.

1. Create the search.py File

Inside the search folder, create a file named search.py and copy and paste the following code:

import json
from typing import Annotated
from fastapi import Query
from pydantic import BaseModel, Field
from .setup import vector_store

class SearchParams(BaseModel):
    query: str = Field(..., max_length=150)
    k: int = Field(5, ge=5, le=1000)

def get_search_results(params: Annotated[SearchParams, Query()]):

    results = vector_store.similarity_search(
        query=params.query,
        k=params.k
    )

    response = [json.loads(result.page_content) for result in results]

    return response

Explanation of the Code

Let's break down the code into sections for clarity:
Imports:
json: A standard library for handling JSON data.
Annotated: Used for adding type hints for FastAPI query parameters.
Query: A FastAPI utility for defining query parameters in request handlers.
BaseModel, Field: From Pydantic, used to create data validation and settings management classes.
vector_store: An instance of PGVector that connects to our PostgreSQL database to perform similarity searches.

Class SearchParams:
This class defines the parameters for our search query.
query: A required string parameter for the search query with a maximum length of 150 characters.
k: An optional integer parameter specifying how many results to return (default is 5), constrained to be between 5 and 1000.

Function get_search_results(params: Annotated[SearchParams, Query()]):
This function takes the search parameters as input and performs a similarity search on the vector_store.
results: This variable holds the search results from the similarity_search method, which retrieves the top k most similar items based on the query.
The function constructs a response by extracting the page_content from each result and parsing it as JSON.
Finally, it returns a list of the search results as a JSON-compatible Python list.
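Since get_search_results is a plain function, you can also sanity-check it from a Python shell, bypassing HTTP entirely (the query here is just an arbitrary example):

from search.search import SearchParams, get_search_results

results = get_search_results(SearchParams(query="a heist that goes wrong", k=5))
for movie in results:
    print(movie["title"])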

Next Step: Setting Up the FastAPI Application with Search Functionality

In this step, we will integrate the search functionality into our main.py file, allowing users to call the API for searching movies. We will also set up CORS (Cross-Origin Resource Sharing) to enable our API to be accessed from different origins.

Update the main.py File

Modify your main.py file to include the following code:

from typing import Annotated
from fastapi import FastAPI, Query
from fastapi.middleware.cors import CORSMiddleware
from search.search import SearchParams, get_search_results
from search.setup import add_to_db_in_batches
import json

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allow all origins
    allow_credentials=True,
    allow_methods=["*"],  # Allow all methods (GET, POST, etc.)
    allow_headers=["*"],  # Allow all headers
)

@app.get("/")
async def root():
    return {
        "message": "API is up and running! Happy days"
    }

@app.get("/dataset-info")
async def dataset_info():
    with open('movies.json', 'r') as f:
        movies = json.load(f)
    return {
        "total_movies": len(movies),
        "first_movie": movies[0]
    }

@app.get("/search")
async def search(params: Annotated[SearchParams, Query()]):
    return get_search_results(params)

@app.on_event('startup')
def startup_event():
    add_to_db_in_batches()

Explanation of the Code

Let's go through the key components of the main.py file:

Imports:
The code imports several necessary modules and classes, including FastAPI, Query, and CORSMiddleware from FastAPI, as well as SearchParams, get_search_results, and add_to_db_in_batches from our search package.

Creating the FastAPI Instance:
app = FastAPI(): This line creates an instance of the FastAPI application.

CORS Middleware Setup:
The middleware allows requests from any origin (allow_origins=["*"]), enables credential support, and allows all HTTP methods and headers. This is important for ensuring that your API can be accessed from different frontend applications without running into CORS issues.

Root Endpoint:
The root endpoint (GET /) returns a simple message indicating that the API is running. It helps in quickly checking if the API is accessible.

Dataset Info Endpoint:
The /dataset-info endpoint reads the movies.json file and returns the total number of movies and the first movie's details in the dataset. This provides useful metadata about the dataset being used.

Search Endpoint:
The /search endpoint allows users to perform a search by passing query parameters defined by the SearchParams class. It calls the get_search_results function, which performs the similarity search using the vector store.

Startup Event:
The startup_event function is called when the application starts. It invokes the add_to_db_in_batches function to populate the database with movie data in batches, ensuring that the database is ready for searches when the application is running.
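One note on the CORS setup: allow_origins=["*"] is convenient for development, but in production you'd typically lock the API down to known origins. A minimal sketch, with a placeholder domain:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example.com"],  # placeholder: your real frontend origin
    allow_credentials=True,
    allow_methods=["GET"],
    allow_headers=["*"],
)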

Next Step: Using the Search Endpoint

Now that we have set up the search functionality in our FastAPI application, let's explore how to use the /search endpoint.

Making a Search Request

You can interact with the search endpoint using tools like Postman, cURL, or your web browser. To perform a search, enter a query related to movies that feature an AI plot. For example:

A programmer develops an AI that evolves beyond its creators
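With cURL, that request looks like this:

curl -G "http://127.0.0.1:9000/search" \
  --data-urlencode "query=A programmer develops an AI that evolves beyond its creators" \
  --data-urlencode "k=5"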

Expected Response

When you submit the search request, the API will process your query and return a list of results from the database that match your search criteria. You can expect a JSON response structured as follows:

[
    {
        "id": 644,
        "title": "A.I. Artificial Intelligence",
        "overview": "David, a robotic boy-the first of his kind programmed to love-is adopted as a test case by a Cybertronics employee and his wife. Though he gradually becomes their child, a series of unexpected circumstances make this life impossible for David. Without final acceptance by humans or machines, David embarks on a journey to discover where he truly belongs, uncovering a world in which the line between robot and machine is both vast and profoundly thin.",
        "genres": ["Drama", "Science Fiction", "Adventure"],
        "poster": "https://image.tmdb.org/t/p/w500/wnUAcUrMRGPPZUDroLeZhSjLkuu.jpg",
        "release_date": 993772800
    },
    {
        "id": 234157,
        "title": "Mechanical Marvels: Clockwork Dreams",
        "overview": "Documentary presented by Professor Simon Schaffer which charts the amazing and untold story of automata - extraordinary clockwork machines designed hundreds of years ago to mimic and recreate life. The film brings the past to life in vivid detail as we see how and why these masterpieces were built.",
        "genres": ["Documentary"],
        "poster": "https://image.tmdb.org/t/p/w500/dHDCmw9kzjzXzDbpiImrW7k7xHa.jpg",
        "release_date": 1370217600
    },
    {
        "id": 391719,
        "title": "The Secret Rules of Modern Living: Algorithms",
        "overview": "Without us noticing, modern life has been taken over. Algorithms run everything from search engines on the internet to satnavs and credit card data security. Mathematician Professor Marcus du Sautoy demystifies the hidden world of algorithms.",
        "genres": ["Documentary"],
        "poster": "https://image.tmdb.org/t/p/w500/17AEcWyF7zhVtmVOSoMhxIaNchC.jpg",
        "release_date": 1443052800
    },
    {
        "id": 48038,
        "title": "Evolver",
        "overview": "Meet Evolver, the ultimate toy for the Cyberpunk generation, a virtual-reality game brought to fierce, three-dimensional life. Suddenly, some fatal accidents begin to happen, leading to a terrifying missing link.",
        "genres": ["Action", "Horror", "Science Fiction"],
        "poster": "https://image.tmdb.org/t/p/w500/ht71HZit00PcmDR6CSz2qgUqsEa.jpg",
        "release_date": 792374400
    },
    {
        "id": 435878,
        "title": "How to Build a Human",
        "overview": "Gemma Chan, the star of Humans, explores Artificial Intelligence and builds an AI version of herself. Are AI humans just around the corner?",
        "genres": ["Documentary"],
        "poster": "https://image.tmdb.org/t/p/w500/5WYsNXnbrz1mikqmLgnQDquzNmD.jpg",
        "release_date": 1477699200
    }
]

Try Different Queries

Feel free to experiment with various movie-related queries about AI, such as:

Which movies delve into the relationship between humans and AI?

Can you recommend films that highlight the ethical implications of artificial intelligence?

What are some documentaries about AI and its impact on society?

By utilizing the search endpoint, you can discover a wealth of information stored in your movies dataset.
You can access the full code for this project on GitHub at the following link: Advanced Search API.
Additionally, I've created an interactive web app that demonstrates the use of this API. You can find the repository for that here: Advanced Search Web App.

Stay tuned, as I will soon be writing a blog on how to deploy these projects to a hosted service!

In Conclusion

By following these steps, you've built a robust semantic search API using FastAPI and AI technologies that understands user intent beyond mere keywords. You can now explore various applications of this technology across different domains like e-commerce or academic research.
I encourage you to explore the code on GitHub and try out the interactive web app to experience the capabilities of this semantic search API firsthand. Consider integrating this technology into your own applications to enhance user experiences and provide more relevant search results.
Thank you for joining me on this journey of building a smarter movie search!
