Introduction
What's up everyone! This blog is a tutorial on how to integrate pgvector's Docker image with a LangChain project and use it as a vector database. For this tutorial, I am using Google's embedding model to embed the data and the Gemini-1.5-flash model to generate the response. This blog will walk you through all the important files required for this.
Step 1: Set up pgvector's docker image
Create a docker-compose.yml file to declare pgvector's Docker image and pass all the required parameters to set it up.
services:
  db:
    image: pgvector/pgvector:pg16
    restart: always
    env_file:
      - pgvector.env
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  pg_data:
By default, the database inside the container serves on port 5432, and it is mapped to the same port on the local machine here; the mapping can be updated as required (for example, "5433:5432" would expose it locally on port 5433). Similarly, the name of the volume can be changed as required.
Further, create a pgvector.env file to list all the environment variables required by the Docker image.
POSTGRES_USER=pgvector_user
POSTGRES_PASSWORD=pgvector_passwd
POSTGRES_DB=pgvector_db
Again, you can give these variables any values you like. Note: once the volume is created, the database can only be accessed with these values, unless you delete the existing volume and create a new one.
Start the container with docker compose up -d. This brings us to the end of the first step, which was setting up pgvector's Docker image.
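If you want to double-check that the container is actually reachable before wiring up LangChain, here is a minimal sketch using psycopg (installed as part of Step 2). The connection URI simply reuses the values from pgvector.env above; the file name quick_check.py is my own placeholder, not one of the tutorial files.

# quick_check.py -- hypothetical helper, not part of the tutorial files.
# Assumes the container from docker-compose.yml is running on localhost:5432
# with the credentials defined in pgvector.env.
import psycopg

with psycopg.connect(
    "postgresql://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db"
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        # Printing the server version confirms the database accepts connections.
        print(cur.fetchone()[0])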
Step 2: Function to get Vector DB's instance
Create a db.py file. This will contain a function that declares the pgvector DB instance, which can then be used to work with the database.
Following are the required dependencies, which can be installed using pip install [dependency]:
python-dotenv
langchain_google_genai
langchain_postgres
psycopg
psycopg[binary]
Once the dependencies are installed, the following is the db.py file.
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector

load_dotenv()


def get_vector_store():
    # Get the Gemini API key from environment variables.
    gemini_api_key = os.getenv("GEMINI_API_KEY")

    # Get DB credentials from environment variables.
    postgres_collection = os.getenv("POSTGRES_COLLECTION")
    postgres_connection_string = os.getenv("POSTGRES_CONNECTION_STRING")

    # Initiate Gemini's embedding model.
    embedding_model = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=gemini_api_key
    )

    # Initiate pgvector by passing the connection details and the embedding model.
    vector_store = PGVector(
        embeddings=embedding_model,
        collection_name=postgres_collection,
        connection=postgres_connection_string,
        use_jsonb=True,
    )
    return vector_store
Before explaining this file, we need to add one more .env file to store the environment variables for the project. Therefore, create a .env file.
GEMINI_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX
POSTGRES_COLLECTION=pgvector_documents
POSTGRES_CONNECTION_STRING=postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db
This file contains the API key of your LLM. For the value of POSTGRES_COLLECTION, you can again use any value. The POSTGRES_CONNECTION_STRING, however, is made up of the credentials listed in pgvector.env. It is structured as postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{host on which the DB is serving}/{POSTGRES_DB}. For the host, we are using localhost:5432, since we are using the Docker image and have mapped port 5432 of the local machine to port 5432 of the container.
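If it helps to see how the pieces map onto each other, the small sketch below (purely illustrative, not part of the tutorial files) rebuilds the same connection string in Python from the individual pgvector.env values.

# Illustrative only: how the connection string is assembled from the
# values defined in pgvector.env and the local port mapping.
user = "pgvector_user"
password = "pgvector_passwd"
host = "localhost:5432"  # local port mapped to the container's 5432
database = "pgvector_db"

connection_string = f"postgresql+psycopg://{user}:{password}@{host}/{database}"
print(connection_string)
# postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db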
Once the .env file is set up, let me walk you through the logic of db.py's get_vector_store function. First, we read the values of the variables declared in the .env file. Second, we declare an instance of the embedding model; here, I am using Google's embedding model, but you can use any other. Lastly, we declare an instance of PGVector by passing the embedding model instance, the collection name, and the connection string, and we return this PGVector instance. This brings us to the end of step 2.
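If you want a quick sanity check at this point, the snippet below (a hypothetical quick test, not one of the tutorial files) simply calls get_vector_store. Constructing the PGVector instance should connect to the database and set up its tables, so reaching the final print means the .env values and the container from Step 1 are working together.

# Hypothetical sanity check -- run once after creating db.py and .env.
from db import get_vector_store

vector_store = get_vector_store()
# Building the PGVector instance already talks to the database, so if no
# error was raised here, the connection string and credentials are correct.
print("Vector store ready:", vector_store)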
Step 3: Main file
In this step we will create the main file, app.py. It reads content from the document and stores it in the vector DB; then, when the user passes a query, it gets the relevant chunk of data from the vector DB, passes the query along with that chunk to the LLM, and prints the generated response. Following are the imports required for this file.
# app.py
import PyPDF2
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from db import get_vector_store
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv
load_dotenv()
Following are the dependencies required for this file, in addition to those required for db.py.
PyPDF2 (only if you plan to extract data from a PDF)
langchain
langchain_text_splitters
langchain_google_genai
This file has 4 functions, of which 2 are the most important for this tutorial: store_data and get_relevant_chunk. Hence, here is a detailed explanation of those two functions, followed by a brief explanation of the other 2.
store_data
# app.py
def store_data(data):
    # Step 1: Convert the data into Document type.
    document_data = Document(page_content=data)

    # Step 2: Split the data into chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    documents = text_splitter.split_documents([document_data])

    # Step 3: Get the instance of the vector db and store the data.
    vector_db = get_vector_store()
    if not vector_db:
        return
    vector_db.add_documents(documents)
Purpose
The purpose of this function is to store the received data in the vector DB. Let's understand the steps involved in doing so.
Logic
Step 1. The data it receives is of string type, so it is converted into Document type. This Document is imported from langchain.schema.
Step 2. Once the data is cast into the required type, the next step is to split it into several chunks. To split the data, we use RecursiveCharacterTextSplitter from langchain_text_splitters. The splitter is set to a chunk_size of 1000, which means each chunk of data will contain at most 1000 characters. Moreover, the chunk_overlap of 100 means that up to the last 100 characters of one chunk are repeated at the start of the next chunk. This makes sure that each chunk has proper context when passed to the LLM. Using this splitter, the data is split into chunks (see the short sketch after these steps).
Step 3. The next step is to get the vector DB instance using the get_vector_store function of db.py. Finally, the add_documents method of the vector DB is called with the split documents to store them in the vector DB.
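Here is the short sketch mentioned above: a toy example (not part of app.py) showing what chunk_size and chunk_overlap do. The numbers are deliberately small so the effect is visible.

# Toy illustration of chunk_size / chunk_overlap (tiny values on purpose;
# app.py uses 1000 and 100).
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "LangChain stores documents as chunks. " * 10
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_documents([Document(page_content=text)])

for i, chunk in enumerate(chunks):
    # Each chunk is at most 100 characters, and the tail of one chunk
    # (up to roughly 20 characters) reappears at the start of the next.
    print(i, len(chunk.page_content), repr(chunk.page_content[:40]))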
get_relevant_chunk
def get_relevant_chunk(user_query):
    # Step 1: Get the instance of the vector db.
    vector_db = get_vector_store()
    if not vector_db:
        return

    # Step 2: Get the relevant chunks of data.
    documents = vector_db.similarity_search(user_query, k=2)

    # Step 3: Convert the list of documents into a single string and return it.
    relevant_chunk = " ".join([d.page_content for d in documents])
    return relevant_chunk
Purpose
The purpose of this function is to get the relevant chunk of data from the data stored in the vector DB, using the user's query.
Logic
Step 1. Get the vector DB instance using the get_vector_store function of db.py.
Step 2. Call the similarity_search method of the vector DB, passing the user_query and setting k. Here, k stands for the number of chunks required; if it is set to 2, the call returns the 2 most relevant chunks of data based on the user's query. It can be set according to the project's requirements (an optional variant that also prints relevance scores is sketched after these steps).
Step 3. The relevant chunks received from the vector DB come back as a list. Hence, to get them as one string, Python's .join method is used. Finally, the relevant data is returned.
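As an optional extra, PGVector also exposes similarity_search_with_score, which returns (document, score) pairs instead of bare documents; with the default cosine distance setting, a lower score generally means a closer match. The variant below is only a sketch of how get_relevant_chunk could surface those scores while debugging; the function name is my own.

# Optional sketch: same retrieval, but also printing a relevance score per chunk.
def get_relevant_chunk_with_scores(user_query, k=2):
    vector_db = get_vector_store()
    if not vector_db:
        return ""
    # Each result is a (Document, score) pair; lower scores are closer matches
    # under the default cosine distance.
    results = vector_db.similarity_search_with_score(user_query, k=k)
    for document, score in results:
        print(f"score={score:.4f} | {document.page_content[:60]}")
    return " ".join(document.page_content for document, _ in results)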
Perfect. Those were the two most important functions in this file for this tutorial. Following are the other two functions; however, since this tutorial is about pgvector and the vector DB, I will only quickly walk you through their logic.
def get_document_content(document_path):
    pdf_text = ""
    # Load the document.
    with open(document_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Read the document content page by page.
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            pdf_text += page.extract_text()
    return pdf_text
def prompt_llm(user_query, relevant_data):
    # Initiate a prompt template.
    prompt_template = PromptTemplate(
        input_variables=["user_query", "relevant_data"],
        template="""
        You are a knowledgeable assistant trained to answer questions based on specific content provided to you. Below is the content you should use to respond, followed by a user's question. Do not include information outside the given content. If the question cannot be answered based on the provided content, respond with "I am not trained to answer this."

        Content: {relevant_data}

        User's Question: {user_query}
        """
    )

    # Initiate the LLM instance.
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=gemini_api_key)

    # Chain the template and the LLM.
    chain = prompt_template | llm

    # Invoke the chain by passing the input variables of the prompt template.
    response = chain.invoke({
        "user_query": user_query,
        "relevant_data": relevant_data
    })

    # Return the generated response.
    return response.content
The first function, get_document_content, takes the path to a PDF, opens it, reads it, and returns its content.
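A small aside: PyPDF2 has been folded back into the pypdf package, which exposes the same PdfReader API. If you prefer the actively maintained package, an equivalent version of this function could look like the sketch below (my own substitution, not what the tutorial uses).

# Equivalent reader using pypdf instead of PyPDF2 (pip install pypdf).
from pypdf import PdfReader

def get_document_content(document_path):
    reader = PdfReader(document_path)
    # extract_text() can return an empty string for pages without text.
    return "".join(page.extract_text() or "" for page in reader.pages)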
The second function, prompt_llm, accepts the user's query and the relevant chunk of data. It initiates a prompt template in which the instructions for the LLM are listed, with the user's query and the relevant chunk of data as input variables. It then initiates an instance of the LLM by passing the required parameters, chains the LLM with the prompt template, invokes the chain by passing the values of the prompt template's input variables, and finally returns the generated response.
Finally, once these 4 utility functions are declared, we add the main block of this file, which calls these utility functions to perform the required operations.
if __name__ == "__main__":
    # Get the document content.
    document_content = get_document_content("resume.pdf")

    # Store the data in the vector db.
    store_data(document_content)

    # Declare a variable holding the user's query.
    user_query = "Where does Dev currently works at?"

    # Get the relevant chunk of data for solving the query.
    relevant_chunk = get_relevant_chunk(user_query)

    # Prompt the LLM to generate the response.
    generated_response = prompt_llm(user_query, relevant_chunk)

    # Print the generated response.
    print(generated_response)
For this tutorial, I am using my resume's PDF as the data. The main block first reads the content of this PDF into a string, stores the data in the vector DB, declares a variable containing the user's query, gets the relevant chunk of data for that query, gets the generated response by passing the user's query and the relevant chunk to the LLM, and finally prints the LLM's response. Note that add_documents appends on every run, so re-running the script will store the same PDF again.
To get a more detailed walkthrough of this tutorial, check out this video.
Final Words
This was the tutorial on how to integrate pgvector's Docker image with a LangChain project and use it as a vector DB. Please let me know your feedback, or ask if you have any questions. I will be happy to answer.