Using LLM, Postgres VectorDB, and OpenAI to Perform Semantic Search on PDF Documents

#langchain #postgres #vectordatabase #llm

Objective

The goal of this project is to build a prototype app that performs similarity search on PDF documents using vector search in Postgres. The app uses the LangChain LLM framework and OpenAI to extract semantic vectors (embeddings) from the PDFs and compare them.

TechStack

Postgres with vector search (run in Docker)
OpenAI (an OpenAI API key is required): https://openai.com/
LangChain LLM framework
Python

Source: https://github.com/rajeshkumarbehura/pdf-reader-search

Keywords to learn

LangChain framework, embedding, vector search or vector database, PGVector, document loader

Explanation

PDF is a common document format in organizations, so it is a natural test case for LLM-based semantic search.

We used the book "Teach Yourself Java in 21 Days" for our testing.

This app is a test case for extracting PDF content, embedding it, and storing it in a database. We use Postgres as the vector database and DBeaver as the database viewer. The PGVector class from LangChain handles the table design and the embedding process.

Steps to understand:

Extracted the PDF content and split it into a list of documents.
Created a database connection string and passed it to PGVector (a LangChain class), which creates the embeddings and pushes them into the database. If the table does not exist, it is created automatically.
Loaded the documents and their embeddings into the database.
Used the DBeaver tool to view the data and tables (https://dbeaver.io/).
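The steps above can be sketched roughly as follows. This is a minimal sketch and not the code from the repository: the names build_connection_string and ingest_pdf, the chunk sizes, and the default database settings are assumptions for illustration. The import paths assume an older LangChain release (newer releases move these classes into separate packages), and PyPDFLoader also needs the pypdf package.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host="localhost", port=5432,
                            database="postgres", driver="psycopg2"):
    """Build a SQLAlchemy-style URL of the kind PGVector accepts."""
    # The password is URL-escaped so special characters do not break the URL.
    return (f"postgresql+{driver}://{user}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


def ingest_pdf(pdf_path, connection_string, collection_name, openai_key):
    """Load a PDF, split it into chunks, embed them, and store them in Postgres."""
    # Imports live inside the function so the helper above can be used
    # even when LangChain is not installed.
    from langchain.document_loaders import PyPDFLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores.pgvector import PGVector

    pages = PyPDFLoader(pdf_path).load()  # one Document per PDF page

    # Chunk sizes are illustrative assumptions, not values from the repository.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)

    # from_documents embeds every chunk and writes it to the database.
    return PGVector.from_documents(
        documents=chunks,
        embedding=OpenAIEmbeddings(openai_api_key=openai_key),
        collection_name=collection_name,
        connection_string=connection_string,
    )
```

Calling ingest_pdf requires a running Postgres instance with the pgvector extension and a valid OpenAI key, since every chunk is sent to OpenAI for embedding.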

Wrote a test question and ran it as ask_question(query="What is Incrementing and Decrementing ?")
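A function like ask_question could look roughly like the sketch below. The repository's own implementation may differ. Here the vector store is passed in explicitly as a hypothetical store argument, and the result format (score, page number, text preview) is an assumption. The only LangChain call used is similarity_search_with_score, which returns a list of (document, score) pairs.

```python
def ask_question(store, query, k=3):
    """Return the k closest chunks as (score, page, preview) tuples.

    `store` is any object with LangChain's
    similarity_search_with_score(query, k=...) method, e.g. a PGVector instance.
    """
    results = store.similarity_search_with_score(query, k=k)
    answers = []
    for doc, score in results:
        # PyPDFLoader stores the page number in the chunk metadata.
        page = doc.metadata.get("page")
        # Keep a short single-line preview of the matching chunk.
        preview = doc.page_content[:200].replace("\n", " ")
        answers.append((score, page, preview))
    return answers
```

Passing the store as an argument keeps the function easy to try out with a stand-in object before connecting it to the real database.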

Note: For more search options, check out the LangChain documentation on methods such as
similarity_search
search
similarity_search_with_score
Experiment with these functions to get a better understanding.
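To get a feel for what a scored similarity search does, here is a toy version in plain Python. It assumes the store ranks chunks by cosine distance, where a lower score means a closer match. The three-dimensional vectors and chunk labels are made up for illustration; real OpenAI embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, items, k=2):
    """Return the k (text, distance) pairs closest to the query vector."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up embeddings for three chunks of a Java book.
chunks = [
    ("increment and decrement operators", [0.9, 0.1, 0.0]),
    ("declaring arrays", [0.1, 0.9, 0.1]),
    ("applet life cycle", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question
top = rank_by_distance(query_vec, chunks, k=2)
for text, distance in top:
    print(f"{distance:.3f}  {text}")
```

In the real app the database performs this ranking, so only the query embedding is computed at search time.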

Execution

Run postgess.yml using the docker-compose command.
Update openai_key in the Reader.py file.
Run the main function either to extract the PDF file or to run a search query.
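One possible way to organize the last two steps is sketched below. It is an alternative under stated assumptions, not how the repository does it: the key is read from an OPENAI_API_KEY environment variable instead of being edited into Reader.py, and the extract and search subcommands and the helper names are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a hypothetical command line with 'extract' and 'search' modes."""
    parser = argparse.ArgumentParser(description="PDF semantic search prototype")
    sub = parser.add_subparsers(dest="command", required=True)

    extract = sub.add_parser("extract", help="load a PDF and store its embeddings")
    extract.add_argument("pdf_path")

    search = sub.add_parser("search", help="run a similarity search")
    search.add_argument("query")

    return parser.parse_args(argv)


def get_openai_key():
    """Read the OpenAI key from the environment instead of hardcoding it."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
    return key
```

With this layout, a main function would call parse_args() and then dispatch on args.command to either the extraction code or the search code.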

Reference

https://www.youtube.com/watch?v=zxo3T4aQj6Q&t=1224s
https://github.com/pgvector/pgvector
https://python.langchain.com/docs/integrations/vectorstores/pgvector
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf
https://python.langchain.com/docs/integrations/vectorstores/pgembedding
