Abdulla Ansari

OpenAI LLM Chat with Multiple PDFs with Accurate Results

Introduction

Hi folks,
In today's article, we are going to solve a specific problem: chatting with an OpenAI LLM over multiple PDF documents and getting accurate responses.

I had been searching for a solution to this problem for the last 15 days; I found and tried many approaches, but none of them solved it exactly. So I put together my own solution by reading multiple articles and watching videos.

Problem Statement

Create an OpenAI-based question-answering tool that can answer queries across multiple PDF documents.

Tech Stack

We are going to use Python as the programming language, along with some useful libraries:

  • OpenAI
  • Langchain
  • FastAPI
  • PyPDF2
  • python-dotenv
  • langchain_community
  • FAISS (faiss-cpu)
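You can install everything with pip. The package names below are my mapping from the imports in main.py; pin versions as needed for your environment:

pip install fastapi uvicorn python-dotenv PyPDF2 langchain langchain-community langchain-openai faiss-cpu openai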

Code

Create a directory for the FastAPI app, and inside it create another directory where the PDF files will be stored:

  • main_dir
    • docs
    • main.py
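Also create a .env file next to main.py so load_dotenv() can find your key. The variable name matches what the code reads below; the value here is just a placeholder:

OPENAI_API_KEY=your-openai-api-key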

main.py

import os
from fastapi import FastAPI
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.memory.buffer import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI


load_dotenv()

# load_dotenv() reads the .env file; this line makes the API key dependency explicit
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
# Works around an OpenMP library clash that faiss-cpu can trigger on some systems
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"

app = FastAPI(debug=True, title="Bot API", version="0.0.1")

# Folder containing the PDF documents to chat with
text_folder = 'docs'

# Collect the path of every PDF in the docs folder
pdf_docs = [os.path.join(text_folder, fn) for fn in os.listdir(text_folder)]

def get_pdf_text(pdf_docs):
    """Concatenate the raw text of every page of every PDF."""
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text() or ""  # extract_text() can return nothing for scanned pages
    return text

def get_text_chunks(text):
    """Split the combined text into overlapping ~1000-character chunks."""
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    """Embed the chunks and index them in an in-memory FAISS store."""
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_qa_chain(vectorstore):
    """Wire the LLM, the FAISS retriever, and conversation memory into one chain."""
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain, memory


# Build the index once at startup so each request only pays for retrieval + generation
text = get_pdf_text(pdf_docs)
text_chunks = get_text_chunks(text)
vectorstore = get_vectorstore(text_chunks)
qa_chain, memory = get_qa_chain(vectorstore)

@app.get("/ask-query")
async def ask_query(query: str):
    # ConversationalRetrievalChain expects its input under the "question" key;
    # the chat history is injected automatically from memory
    resp = qa_chain.invoke({"question": query})

    # Reset the memory after three question/answer turns (six messages)
    # so the prompt does not keep growing
    if len(resp["chat_history"]) >= 6:
        memory.clear()

    return {"response": resp["answer"]}
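To try it out, run the app with uvicorn and hit the endpoint; the port and the question here are just examples:

uvicorn main:app --reload
curl "http://127.0.0.1:8000/ask-query?query=What%20are%20these%20documents%20about"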

Conclusion:
This is basic code that does the job as per my requirements. Please install all the requirements, create a .env file, and add your OpenAI API key so it loads into the FastAPI app.
If you still face any issues, let's discuss them in the comments section.


