Abdulla Ansari

OpenAI LLM Chat with Multiple PDFs with Accurate Results

Introduction

Hi folks,
In today's article, we are going to solve a specific problem: chatting with an OpenAI LLM over multiple PDF documents and getting accurate responses.

I had been searching for a solution to this problem for the last 15 days; I found and tried many approaches, but none of them solved it exactly. So I put together my own solution by reading multiple articles and watching videos.

Problem Statement

Create an OpenAI-based question-answering tool that can answer queries across multiple PDF documents.

Tech Stack

We are going to use Python as the programming language, along with some useful libraries:

  • OpenAI
  • Langchain
  • FastAPI
  • PyPDF2
  • python-dotenv
  • langchain_community
  • FAISS (faiss-cpu)
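You can install everything with pip. The package names below are my mapping from the imports in main.py; pin versions as needed for your environment:

pip install fastapi uvicorn python-dotenv PyPDF2 langchain langchain-community langchain-openai faiss-cpu openai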

Code

Create a directory for the FastAPI app, and inside it create another directory where the PDF files will be stored:

  • main_dir
    • docs
    • main.py
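Also create a .env file next to main.py so load_dotenv() can find your key. The variable name matches what the code reads below; the value here is just a placeholder:

OPENAI_API_KEY=your-openai-api-key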

main.py

import os
from fastapi import FastAPI
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.memory.buffer import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI


load_dotenv()

# load_dotenv() reads the .env file; this line makes the API key dependency explicit
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
# Works around an OpenMP library clash that faiss-cpu can trigger on some systems
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"

app = FastAPI(debug=True, title="Bot API", version="0.0.1")

# Folder containing the PDF documents to chat with
text_folder = 'docs'

# Collect the path of every PDF in the docs folder
pdf_docs = [os.path.join(text_folder, fn) for fn in os.listdir(text_folder)]

def get_pdf_text(pdf_docs):
    """Concatenate the raw text of every page of every PDF."""
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text() or ""  # extract_text() can return nothing for scanned pages
    return text

def get_text_chunks(text):
    """Split the combined text into overlapping ~1000-character chunks."""
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    """Embed the chunks and index them in an in-memory FAISS store."""
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_qa_chain(vectorstore):
    """Wire the LLM, the FAISS retriever, and conversation memory into one chain."""
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain, memory


# Build the index once at startup so each request only pays for retrieval + generation
text = get_pdf_text(pdf_docs)
text_chunks = get_text_chunks(text)
vectorstore = get_vectorstore(text_chunks)
qa_chain, memory = get_qa_chain(vectorstore)

@app.get("/ask-query")
async def ask_query(query: str):
    # ConversationalRetrievalChain expects its input under the "question" key;
    # the chat history is injected automatically from memory
    resp = qa_chain.invoke({"question": query})

    # Reset the memory after three question/answer turns (six messages)
    # so the prompt does not keep growing
    if len(resp["chat_history"]) >= 6:
        memory.clear()

    return {"response": resp["answer"]}
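To try it out, run the app with uvicorn and hit the endpoint; the port and the question here are just examples:

uvicorn main:app --reload
curl "http://127.0.0.1:8000/ask-query?query=What%20are%20these%20documents%20about"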

Conclusion:
This is basic code that does the job as per my requirements. Please install all the requirements, create a .env file, and add your OpenAI API key so it loads into the FastAPI app.
If you still face any issues, let's discuss them in the comments section.


