Abdulla Ansari
OpenAI LLM Chat with Multiple PDFs with Accurate Results

Introduction

Hi folks,
In today's article, we are going to solve a specific problem with OpenAI and LLMs: chatting with OpenAI over multiple PDF documents and getting accurate responses.

I spent the last 15 days searching for a solution to this problem. I found and tried many approaches, but none of them solved it exactly, so I pieced together a working solution from several articles and videos.

Problem Statement

Create an OpenAI-based question-answering tool that can answer queries over multiple PDF documents.

Tech Stack

We are going to use Python as the programming language, along with some useful libraries:

  • OpenAI
  • Langchain
  • FastAPI
  • PyPDF2
  • python-dotenv
  • langchain_community
  • FAISS (faiss-cpu)
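All of these can be installed with pip. This is a minimal sketch; the langchain-openai package is included because the code below imports from it, and you may want to pin versions for your environment:

```shell
pip install openai langchain langchain-community langchain-openai fastapi "uvicorn[standard]" PyPDF2 python-dotenv faiss-cpu
```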

Code

Create a directory for the FastAPI app, and inside it create another directory named docs where the PDF files will be stored:

  • main_dir
    • docs
    • main.py
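The code below loads the OpenAI API key from a .env file, so create one next to main.py. The key value here is a placeholder; use your own:

```
OPENAI_API_KEY=sk-...
```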

main.py

import os
from fastapi import FastAPI, HTTPException
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.memory.buffer import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI


load_dotenv()

os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
os.environ["KMP_DUPLICATE_LIB_OK"]="True"

app = FastAPI(debug=True, title="Bot API", version="0.0.1")



text_folder = 'docs'


embedding_function = OpenAIEmbeddings()

pdf_docs = [os.path.join(text_folder, fn) for fn in os.listdir(text_folder)]

def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ""
    return text

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_qa_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain, memory


text = get_pdf_text(pdf_docs)
text_chunks = get_text_chunks(text)
vectorstore = get_vectorstore(text_chunks)
qa_chain, memory = get_qa_chain(vectorstore)

@app.get("/ask-query")
async def query(query: str):
    # ConversationalRetrievalChain expects its input under the "question" key
    resp = qa_chain.invoke({"question": query})

    # Reset the conversation memory once the chat history grows past 6 messages
    if len(resp["chat_history"]) >= 6:
        memory.clear()

    return {"response": resp}
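With the .env file and docs folder in place, the API can be started and queried as below. Running it with uvicorn and the example question are my assumptions, not part of the original post:

```shell
# Run from inside main_dir
uvicorn main:app --reload

# In another terminal, ask a question about the PDFs
curl "http://127.0.0.1:8000/ask-query?query=What%20is%20this%20document%20about"
```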

Conclusion:
This is basic code that does what I needed. Install all the requirements, create a .env file, and add your OpenAI API key so it can be loaded into the FastAPI app.
If you still face any issues, let's discuss them in the comment section.
