So, it's a cozy Netflix-and-chill night, and you're in the mood for a steamy romance flick. You can't quite remember the title, but you know it had sizzling chemistry between the leads, a passionate beach scene, and a twist that left you breathless. In the past, you'd be stuck scrolling through endless lists or relying on awkward keyword searches. But what if your search could actually understand the sultry essence of what you're craving?
Enter the world of semantic search - a game-changer in how we interact with data. In this blog post, we'll explore the exciting realm of semantic search, powered by a combination of LangChain, Hugging Face embeddings, and pgvector. Together, these cutting-edge technologies can transform a simple movie dataset into a smart, intuitive search experience that grasps context and meaning, not just keywords.
At the heart of semantic search systems are vector databases and embeddings. Vector databases store complex data as high-dimensional vectors, enabling efficient similarity searches. Embeddings are numerical representations of data (like words or documents) that capture semantic relationships in a multi-dimensional space. This allows for understanding context and intent beyond simple keyword matching.
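To make this a bit more concrete, here's a tiny sketch (assuming you have the sentence-transformers package installed, which we'll rely on later anyway) of how two phrases with no words in common can still be recognized as close in meaning:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained embedding model (the same family we'll use later in the project).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Two phrases with no keywords in common, but closely related meaning.
a = model.encode("passionate beach rendezvous")
b = model.encode("a steamy seaside love affair")

# A cosine similarity close to 1 means the meanings are close.
print(util.cos_sim(a, b))
Unrelated phrases score much lower, which is exactly the property our movie search will exploit.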
Now that we've set the stage for semantic search, it's time to dive into the technical implementation.
Project Overview
We're building a powerful FastAPI-based API that accepts natural language search queries and returns results based on meaning rather than just keywords. FastAPI's speed and automatic API documentation will streamline the development process.
Tech Stack
FastAPI: The lightning-fast, easy-to-use API framework.
LangChain: Orchestrates our language model operations.
Hugging Face: Provides pre-trained embeddings for deep semantic understanding.
pgvector: Powers efficient similarity search in our database.
Docker: Facilitates running a PostgreSQL instance with pgvector.
Dataset
We're using a movies dataset that contains information about different films, such as titles, genres, descriptions, and more. You can find and download the dataset here. Make sure to grab it and have it ready in your working directory.
What to Expect
This API will leverage these tools to return the most relevant results. Imagine typing "passionate beach rendezvous" and instantly getting perfect movie recommendations!
We'll walk you through each step - from setting up the FastAPI framework to integrating the semantic search tools that will bring our movie dataset to life.
Setting Up FastAPI
Before we start coding, let's set up our development environment. We'll be using FastAPI as our web framework, and it's essential to isolate our project dependencies in a virtual environment.
1. Create and Activate a Virtual Environment
First, open your terminal and navigate to the directory where you want to set up the project. Then, create a virtual environment:
For Mac/Linux:
python3 -m venv env
For Windows
python -m venv env
Next, activate the virtual environment:
For Mac/Linux:
source env/bin/activate
For Windows
.\env\Scripts\activate
Once the virtual environment is activated, your terminal prompt should show (env), indicating that you're working inside the env virtual environment.
2. Install FastAPI
With the virtual environment activated, install FastAPI.
pip install "fastapi[standard]"
3. Verify Installation
To verify that FastAPI is installed correctly, you can run the following command:
pip freeze
You should see fastapi listed in the output.
4. Creating Your FastAPI App
Now, let's create a simple FastAPI app to make sure everything is working.
In your project directory, create a file called main.py and add the following code:
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {
        "message": "API is up and running! Happy days"
    }
This is a simple FastAPI app that returns a JSON response when you visit the root URL (/).
5. Running the FastAPI App
To run the app, run this command in your terminal:
fastapi dev main.py --port 9000
This command does two things:
fastapi dev: Runs the FastAPI app in development mode.
--port 9000: Instead of running the app on the default port (8000), it will be hosted on port 9000.
Once the server is up, you can access the app in your browser at http://127.0.0.1:9000/.
You'll see the message: {"message": "API is up and running! Happy days"}.
Additionally, FastAPI provides automatic API documentation, which you can explore by visiting the interactive docs at http://127.0.0.1:9000/docs.
Now that we have FastAPI up and running, we can move on to integrating our tech stack with LangChain, Hugging Face embeddings, and pgvector for semantic search!
Next Step: Setting Up the Movies Dataset
Now, let's add the movies dataset to our project. This dataset will be the foundation of our semantic search functionality.
1. Copy the Movies JSON File
After downloading the movies dataset, copy the movies.json file into the root of your project directory. It should be at the same level as your main.py file.
Your project structure should now look like this:
/your-project-directory
├── env/ # Your virtual environment
├── main.py # FastAPI app
└── movies.json # Movies dataset
2. Verify the Dataset
Let's quickly verify that the dataset is correctly placed and readable. Add the following code to your main.py file:
import json

@app.get("/dataset-info")
async def dataset_info():
    with open('movies.json', 'r') as f:
        movies = json.load(f)
    return {
        "total_movies": len(movies),
        "first_movie": movies[0]
    }
Run your FastAPI app and visit http://127.0.0.1:9000/dataset-info. You should see the total number of movies and details of the first movie in the dataset.
Next Step: Setting Up the pgvector Database with Docker
To leverage the power of pgvector for our semantic search capabilities, we need to set up a PostgreSQL instance with pgvector. This can be easily done using Docker.
1. Run the Docker Command
Open your terminal and execute the following command:
docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16
Breaking Down the Command:
docker run: This command is used to create and start a new container.
--name pgvector-container: This option names the container pgvector-container for easy reference.
-e POSTGRES_USER=langchain: Sets the PostgreSQL username to langchain.
-e POSTGRES_PASSWORD=langchain: Sets the password for the langchain user.
-e POSTGRES_DB=langchain: Creates a new database named langchain.
-p 6024:5432: Maps port 5432 in the container (the default PostgreSQL port) to port 6024 on your local machine. You will connect to PostgreSQL through port 6024.
-d: Runs the container in detached mode, allowing it to run in the background.
pgvector/pgvector:pg16: Specifies the image to use, in this case the pgvector build compatible with PostgreSQL 16.
2. Verify the Container is Running
After executing the command, check if the container is running smoothly:
docker ps
You should see pgvector-container listed in the output, indicating that the PostgreSQL instance is ready for use.
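If you'd like to go a step further and confirm you can actually talk to the database, one optional check is to open psql inside the container, using the credentials from the docker run command above:
docker exec -it pgvector-container psql -U langchain -d langchain -c "SELECT version();"
This should print the PostgreSQL 16 version string if everything is wired up correctly.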
Next Step: Managing Sensitive Information
To manage our secrets effectively, we'll create a .env file at the root of our project. This file will store sensitive information like our database connection string.
1. Create the .env File
In your project directory, create a file named .env and add the following content:
DB_CONNECTION_STRING="postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
COLLECTION_NAME=movies
Explanation:
DB_CONNECTION_STRING: This variable holds the connection string required to connect to your PostgreSQL database. It specifies the driver (postgresql+psycopg), the username and password (langchain/langchain), the host and port (localhost:6024, the port we mapped in Docker), and the database name (langchain).
COLLECTION_NAME: This variable specifies the name of the collection we'll be using, which is set to movies.
2. Install the python-dotenv Package
To use the environment variables from the .env
file in our FastAPI app, we need to install the python-dotenv
package. With your virtual environment activated, run the following command:
pip install python-dotenv
This package allows us to load environment variables from the .env file into our application, ensuring sensitive data is not hard-coded in our source files.
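If you want a quick sanity check that the variables are being picked up, you can run a small throwaway snippet like this (not part of the final app) from a Python shell in your project root:
from dotenv import load_dotenv
import os

load_dotenv()  # Reads the .env file in the current directory

# Both values should print without raising a KeyError.
print(os.environ["DB_CONNECTION_STRING"])
print(os.environ["COLLECTION_NAME"])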
Next Step: Setting Up the Search Functionality
To facilitate our semantic search, we'll create a folder to hold our search-related functionality. This will include a script to check for existing movie IDs in our database, ensuring we don't re-add duplicate items.
1. Required Packages
For this section, ensure you have the following packages installed in your virtual environment:
pip install sqlalchemy langchain langchain_community langchain_postgres
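Depending on your environment, you may also need two supporting packages that the code in the next steps relies on: jq (used by LangChain's JSONLoader to apply the jq_schema) and sentence-transformers (used by the Hugging Face embeddings):
pip install jq sentence-transformers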
2. Create the search Folder
At the root of your project, create a new folder named search.
mkdir search
3. Create the read_ids.py File
Inside the search folder, create a file named read_ids.py and paste the following code:
import os
from sqlalchemy import create_engine, text
from dotenv import load_dotenv

load_dotenv()

engine = create_engine(os.environ['DB_CONNECTION_STRING'])

def read_collection_ids():
    with engine.connect() as connection:
        query = text("SELECT id FROM langchain_pg_embedding")
        result = connection.execute(query)
        return [row[0] for row in result]
Explanation of the Code:
Imports:
os: Used to access environment variables.
create_engine and text from sqlalchemy: Used to create a database connection and execute SQL queries.
load_dotenv: Loads environment variables from the .env file.
Environment Variable Loading:
The load_dotenv() function loads the database connection string defined in the .env file, allowing access to the database.
Creating the Database Engine:
engine = create_engine(os.environ['DB_CONNECTION_STRING']): Initializes the SQLAlchemy engine using the connection string.
Function read_collection_ids():
Connects to the database, executes a query to select all IDs from the langchain_pg_embedding table, and returns a list of IDs. This allows us to check for already existing items in the database, preventing duplicate entries.
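If you want to try the helper on its own, you can call it from the project root like this. Note that the langchain_pg_embedding table is only created the first time documents are added to the vector store, so on a brand-new database this will raise an error until we've inserted data:
python -c "from search.read_ids import read_collection_ids; print(read_collection_ids())"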
Next Step: Adding the Data Loading and Embedding Setup
Now that we have the read_ids.py file, the next step is to set up the data loading, embeddings, and vector store. We will create a new file called setup.py in the search folder. This file will contain the code to load our movie data, create embeddings, and add new entries to the database.
1. Create the setup.py File
Inside the search folder, create a file named setup.py and paste the following code:
import os
import json
import time
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import JSONLoader
from langchain_postgres import PGVector
from dotenv import load_dotenv
from search.read_ids import read_collection_ids

load_dotenv()

jq_schema = '.[] | {id: .id, title: .title, overview: .overview, genres: .genres, poster: .poster, release_date: .release_date}'

loader = JSONLoader(
    file_path='./movies.json',
    jq_schema=jq_schema,
    text_content=False,
)
data = loader.load()

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

vector_store = PGVector(
    embeddings=embedding_function,
    collection_name=os.environ["COLLECTION_NAME"],
    connection=os.environ["DB_CONNECTION_STRING"],
    use_jsonb=True,
)

def add_to_db_in_batches(batch_size=100):
    existing_ids = read_collection_ids()
    data_ids = [str(json.loads(item.page_content)["id"]) for item in data]
    new_ids = list(set(data_ids) - set(existing_ids))
    if len(new_ids) > 0:
        # Compare IDs as strings so the lookup matches the string IDs built above.
        new_documents = [item for item in data if str(json.loads(item.page_content)["id"]) in new_ids]
        total_products = len(new_documents)
        start_time = time.time()  # Start the timer
        for i in range(0, total_products, batch_size):
            batch_data = new_documents[i:i + batch_size]
            ids = [str(json.loads(item.page_content)["id"]) for item in batch_data]
            vector_store.add_documents(batch_data, ids=ids)
            remaining = total_products - (i + len(batch_data))
            elapsed_time = time.time() - start_time
            batches_processed = (i // batch_size) + 1
            average_time_per_batch = elapsed_time / batches_processed if batches_processed > 0 else 0
            estimated_remaining_batches = (total_products // batch_size) - batches_processed
            estimated_remaining_time = average_time_per_batch * estimated_remaining_batches
            # Format estimated remaining time
            estimated_remaining_time_minutes = estimated_remaining_time // 60
            estimated_remaining_time_seconds = estimated_remaining_time % 60
            print(f'Added products {i + 1} to {min(i + len(batch_data), total_products)} to the database. '
                  f'Remaining: {remaining}. Estimated remaining time: {int(estimated_remaining_time_minutes)} minutes and {int(estimated_remaining_time_seconds)} seconds.')
Explanation of the Code
Let's break down the code:
Imports:
os, json, and time: Standard libraries for environment variable management, JSON handling, and time tracking.
HuggingFaceEmbeddings: Loads pre-trained embeddings from Hugging Face to convert movie data into numerical representations.
JSONLoader: Loads data from a JSON file based on the specified schema.
PGVector: Enables storing and querying of vector embeddings in a PostgreSQL database.
load_dotenv: Loads environment variables from the .env file.
read_collection_ids: A function to read existing IDs from the database, ensuring no duplicates are added.
Loading the Movie Data:
The jq_schema variable defines a JSON query to extract specific fields from the movies.json file.
The JSONLoader instance loads the movie data based on this schema, making it easy to handle the dataset.
Setting Up the Embedding Model:
The HuggingFaceEmbeddings instance is initialized with a pre-trained model that converts movie data into embeddings, which helps in understanding semantic meanings.
Setting Up the PGVector:
The PGVector instance connects to the PostgreSQL database, using the environment variables to retrieve the collection name and connection string.
Function add_to_db_in_batches(batch_size=100):
Reads existing IDs from the database using the read_collection_ids function.
Checks for new IDs that are not already in the database and prepares them for insertion.
The function processes new movie entries in batches (default size of 100). It measures the time taken for each batch and provides an estimated remaining time for completion, helping to monitor progress during insertion.
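In the final app we'll call this function from a FastAPI startup event, but if you prefer to populate the database ahead of time, a one-off call like the sketch below also works (run from the project root with the virtual environment activated; importing search.setup loads the dataset and the embedding model, so the first run can take a little while):
python -c "from search.setup import add_to_db_in_batches; add_to_db_in_batches(batch_size=100)"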
2. Current Folder Structure
After adding the setup.py file, your project structure should now look like this:
/your-project-directory
├── env/ # Your virtual environment
├── main.py # FastAPI app
├── movies.json # Movies dataset
├── .env # Environment variables
└── search/ # Folder for search functionality
├── read_ids.py # Script for reading existing movie IDs
    └── setup.py # Script for adding movie data
Next Step: Implementing the Search Endpoint
Now, we will create a search.py file in the search folder. This file will define an endpoint that allows users to search for movies based on their queries.
1. Create the search.py File
Inside the search folder, create a file named search.py and copy and paste the following code:
import json
from typing import Annotated
from fastapi import Query
from pydantic import BaseModel, Field
from .setup import vector_store

class SearchParams(BaseModel):
    query: str = Field(..., max_length=150)
    k: int = Field(5, ge=5, le=1000)

def get_search_results(params: Annotated[SearchParams, Query()]):
    results = vector_store.similarity_search(
        query=params.query,
        k=params.k
    )
    response = [json.loads(result.page_content) for result in results]
    return response
Explanation of the Code
Let's break down the code into sections for clarity:
Imports:
json: A standard library for handling JSON data.
Annotated: Used for adding type hints for FastAPI query parameters.
Query: A FastAPI utility for defining query parameters in request handlers.
BaseModel, Field: From Pydantic, used to create data validation and settings management classes.
vector_store: An instance of PGVector that connects to our PostgreSQL database to perform similarity searches.
Class SearchParams:
This class defines the parameters for our search query.
query: A required string parameter for the search query with a maximum length of 150 characters.
k: An optional integer parameter specifying how many results to return (default is 5), constrained to be between 5 and 1000.
Function get_search_results(params: Annotated[SearchParams, Query()]):
This function takes the search parameters as input and performs a similarity search on the vector_store.
results: This variable holds the search results from the similarity_search method, which retrieves the top k most similar items based on the query.
The function constructs a response by extracting the page_content from each result and parsing it as JSON.
Finally, it returns a list of the search results as a JSON-compatible Python list.
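Before wiring this into the API, you can sanity-check the function directly from a Python shell; here is a minimal sketch (assuming the database has already been populated, and noting that importing search.search also loads the embedding model via search.setup):
from search.search import SearchParams, get_search_results

# Build the parameters exactly as FastAPI would from the query string.
params = SearchParams(query="passionate beach rendezvous", k=5)
print(get_search_results(params))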
Next Step: Setting Up the FastAPI Application with Search Functionality
In this step, we will integrate the search functionality into our main.py file, allowing users to call the API for searching movies. We will also set up CORS (Cross-Origin Resource Sharing) to enable our API to be accessed from different origins.
Update the main.py File
Modify your main.py file to include the following code:
from typing import Annotated
from fastapi import FastAPI, Query
from fastapi.middleware.cors import CORSMiddleware
from search.search import SearchParams, get_search_results
from search.setup import add_to_db_in_batches
import json

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allow all origins
    allow_credentials=True,
    allow_methods=["*"],  # Allow all methods (GET, POST, etc.)
    allow_headers=["*"],  # Allow all headers
)

@app.get("/")
async def root():
    return {
        "message": "API is up and running! Happy days"
    }

@app.get("/dataset-info")
async def dataset_info():
    with open('movies.json', 'r') as f:
        movies = json.load(f)
    return {
        "total_movies": len(movies),
        "first_movie": movies[0]
    }

@app.get("/search")
async def search(params: Annotated[SearchParams, Query()]):
    return get_search_results(params)

@app.on_event('startup')
def startup_event():
    add_to_db_in_batches()
Explanation of the Code
Let's go through the key components of the main.py file:
Imports:
The code imports several necessary modules and classes, including FastAPI, Query, and CORSMiddleware from FastAPI, as well as the functions for retrieving movies and performing searches.
Creating the FastAPI Instance:
app = FastAPI(): This line creates an instance of the FastAPI application.
CORS Middleware Setup:
The middleware allows requests from any origin (allow_origins=["*"]), enables credential support, and allows all HTTP methods and headers. This is important for ensuring that your API can be accessed from different frontend applications without running into CORS issues.
Root Endpoint:
The root endpoint (GET /) returns a simple message indicating that the API is running. It helps in quickly checking if the API is accessible.
Dataset Info Endpoint:
The /dataset-info endpoint reads the movies.json file and returns the total number of movies and the first movie's details in the dataset. This provides useful metadata about the dataset being used.
Search Endpoint:
The /search endpoint allows users to perform a search by passing query parameters defined by the SearchParams class. It calls the get_search_results function, which performs the similarity search using the vector store.
Startup Event:
The startup_event function is called when the application starts. It invokes the add_to_db_in_batches function to populate the database with movie data in batches, ensuring that the database is ready for searches when the application is running.
Next Step: Using the Search Endpoint
Now that we have set up the search functionality in our FastAPI application, let's explore how to use the /search endpoint.
Making a Search Request
You can interact with the search endpoint using tools like Postman, cURL, or your web browser. To perform a search, enter a query related to movies that feature an AI plot. For example:
A programmer develops an AI that evolves beyond its creators
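For instance, with cURL you can let -G and --data-urlencode handle the URL encoding of the query string for you:
curl -G "http://127.0.0.1:9000/search" \
  --data-urlencode "query=A programmer develops an AI that evolves beyond its creators" \
  --data-urlencode "k=5"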
Expected Response
When you submit the search request, the API will process your query and return a list of results from the database that match your search criteria. You can expect a JSON response structured as follows:
[
{
"id": 644,
"title": "A.I. Artificial Intelligence",
"overview": "David, a robotic boy-the first of his kind programmed to love-is adopted as a test case by a Cybertronics employee and his wife. Though he gradually becomes their child, a series of unexpected circumstances make this life impossible for David. Without final acceptance by humans or machines, David embarks on a journey to discover where he truly belongs, uncovering a world in which the line between robot and machine is both vast and profoundly thin.",
"genres": ["Drama", "Science Fiction", "Adventure"],
"poster": "https://image.tmdb.org/t/p/w500/wnUAcUrMRGPPZUDroLeZhSjLkuu.jpg",
"release_date": 993772800
},
{
"id": 234157,
"title": "Mechanical Marvels: Clockwork Dreams",
"overview": "Documentary presented by Professor Simon Schaffer which charts the amazing and untold story of automata - extraordinary clockwork machines designed hundreds of years ago to mimic and recreate life. The film brings the past to life in vivid detail as we see how and why these masterpieces were built.",
"genres": ["Documentary"],
"poster": "https://image.tmdb.org/t/p/w500/dHDCmw9kzjzXzDbpiImrW7k7xHa.jpg",
"release_date": 1370217600
},
{
"id": 391719,
"title": "The Secret Rules of Modern Living: Algorithms",
"overview": "Without us noticing, modern life has been taken over. Algorithms run everything from search engines on the internet to satnavs and credit card data security. Mathematician Professor Marcus du Sautoy demystifies the hidden world of algorithms.",
"genres": ["Documentary"],
"poster": "https://image.tmdb.org/t/p/w500/17AEcWyF7zhVtmVOSoMhxIaNchC.jpg",
"release_date": 1443052800
},
{
"id": 48038,
"title": "Evolver",
"overview": "Meet Evolver, the ultimate toy for the Cyberpunk generation, a virtual-reality game brought to fierce, three-dimensional life. Suddenly, some fatal accidents begin to happen, leading to a terrifying missing link.",
"genres": ["Action", "Horror", "Science Fiction"],
"poster": "https://image.tmdb.org/t/p/w500/ht71HZit00PcmDR6CSz2qgUqsEa.jpg",
"release_date": 792374400
},
{
"id": 435878,
"title": "How to Build a Human",
"overview": "Gemma Chan, the star of Humans, explores Artificial Intelligence and builds an AI version of herself. Are AI humans just around the corner?",
"genres": ["Documentary"],
"poster": "https://image.tmdb.org/t/p/w500/5WYsNXnbrz1mikqmLgnQDquzNmD.jpg",
"release_date": 1477699200
}
]
Try Different Queries
Feel free to experiment with various movie-related queries about AI, such as:
Which movies delve into the relationship between humans and AI?
Can you recommend films that highlight the ethical implications of artificial intelligence?
What are some documentaries about AI and its impact on society?
By utilizing the search endpoint, you can discover a wealth of information stored in your movies dataset.
You can access the full code for this project on GitHub at the following link: Advanced Search API.
Additionally, I've created an interactive web app that demonstrates the use of this API. You can find the repository for that here: Advanced Search Web App.
Stay tuned, as I will soon be writing a blog on how to deploy these projects to a hosted service!
In Conclusion
By following these steps, you've built a robust semantic search API using FastAPI and AI technologies that understands user intent beyond mere keywords. You can now explore various applications of this technology across different domains like e-commerce or academic research.
I encourage you to explore the code on GitHub and try out the interactive web app to experience the capabilities of this semantic search API firsthand. Consider integrating this technology into your own applications to enhance user experiences and provide more relevant search results.
Thank you for joining me on this journey of building a smarter movie search!