Build a Document QA App in 3 Simple Steps with Langchain and Streamlit

In this tutorial, we'll be building an AI-powered document QA web app using Python.

With just a few lines of code, you'll have a working document QA app that you can use to extract information from any PDF. Here's a preview of what we'll be creating in this tutorial:

Doc QA Demo

You can find the source code here.

So let's get started!

1. Set up your environment

I highly recommend that you use a package/environment management tool so that the external dependencies you're using won't affect any of your existing projects.

We'll be using the built-in venv module to create virtual environments.

First, open your terminal and create a virtual environment.

python -m venv venv

and activate it:

venv\Scripts\activate

(The command above is for Windows. On macOS/Linux, run source venv/bin/activate instead.)

Now, let's install the required dependencies:

pip install streamlit pypdf openai faiss-cpu langchain==0.0.77

Finally, we'll need to set an environment variable for the OpenAI API key:

set OPENAI_API_KEY=<YOUR_API_KEY>

(On macOS/Linux, use export OPENAI_API_KEY=<YOUR_API_KEY> instead.)

You can get an API key here.

Now that we're all set, let's start coding our app!

2. Create a QA chain with langchain

Create a file named utils.py, where we'll write the functions for parsing PDFs, creating a vector store, and answering questions.

First, let's import the required dependencies:

from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores.faiss import FAISS
import streamlit as st

Then, we'll add a function to parse PDFs:

def parse_pdf(file):
    pdf = PdfReader(file)
    output = []
    for page in pdf.pages:
        text = page.extract_text()
        output.append(text)

    return "\n\n".join(output)


We can't fit the whole document inside the prompt since GPT-3 has a limited context window. So we'll have to:

  • Split the document into smaller chunks
  • Embed those chunks in a special database called a vector store, which allows us to fetch only the relevant passages for a question by doing a semantic search
def embed_text(text):
    """Split the text and embed it in a FAISS vector store"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800, chunk_overlap=0, separators=["\n\n", ".", "?", "!", " ", ""]
    )
    texts = text_splitter.split_text(text)

    embeddings = OpenAIEmbeddings()
    index = FAISS.from_texts(texts, embeddings)

    return index

The RecursiveCharacterTextSplitter recursively tries to split the document by the given separators. Note that the order of the separators is important: it'll first try to split the document by \n\n, then by ., and so on.
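If you'd like to see the splitter's behavior in isolation, here's a tiny, self-contained example (the sample string and chunk size are made up purely for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A made-up sample: two short paragraphs separated by a blank line.
sample = "LangChain helps build LLM apps.\n\nStreamlit turns scripts into web apps."

splitter = RecursiveCharacterTextSplitter(
    chunk_size=40, chunk_overlap=0, separators=["\n\n", ".", "?", "!", " ", ""]
)

# The full text is longer than chunk_size, so the splitter falls back to the
# first separator ("\n\n"), and each paragraph should end up in its own chunk.
print(splitter.split_text(sample))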

Finally, let's write a function to search the index and pass the relevant passages to GPT for question answering:

def get_answer(index, query):
    """Returns answer to a query using langchain QA chain"""

    docs = index.similarity_search(query)

    chain = load_qa_chain(OpenAI(temperature=0))
    answer = chain.run(input_documents=docs, question=query)

    return answer

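Before we wire up the UI, you can sanity-check these functions from a plain Python script (the file name below is just a placeholder, and this assumes OPENAI_API_KEY is set in your environment):

from utils import parse_pdf, embed_text, get_answer

# Hypothetical local test: point this at any PDF you have on disk.
index = embed_text(parse_pdf("example.pdf"))
print(get_answer(index, "What is this document about?"))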

Now, let's create a simple UI for our app.

3. Build the web app with Streamlit

Streamlit makes it easy to create web apps using Python in minutes.

First, create a file named app.py and import Streamlit and the functions we made earlier.

import streamlit as st
from utils import parse_pdf, embed_text, get_answer

Now, the whole UI can be created with just a couple of lines:

st.header("Doc QA")
uploaded_file = st.file_uploader("Upload a pdf", type=["pdf"])

if uploaded_file is not None:
    index = embed_text(parse_pdf(uploaded_file))
    query = st.text_area("Ask a question about the document")
    button = st.button("Submit")
    if button:
        st.write(get_answer(index, query))

That's it🎉 Now, open up your terminal and run:

streamlit run app.py

and see your document QA app in action.

Optimizing the app

Let's see how we can optimize the app and make our life easier.

Caching the results

You'll notice that the app re-runs parse_pdf() and embed_text() every time we ask a question, because Streamlit re-executes the whole script on each interaction. To fix this, we'll cache the results. An easy way to do this is with the @st.cache decorator.

In utils.py, add the decorator just above each function definition:

@st.cache
def parse_pdf(file):

@st.cache
def embed_text(text):

Now, save the file and rerun the app. You'll see that after you ask the first question, subsequent ones are answered much faster, since the PDF is no longer re-parsed and re-embedded on every run.

Managing secrets

In step 1, we set the OpenAI API key on the command line, which is cumbersome to retype every time we run the app from a new terminal. So let's load the API key from a file instead:

  • Create a directory called .streamlit at the root of your app.
  • Inside it, create a file named secrets.toml and add the following:
OPENAI_API_KEY = "<YOUR_API_KEY>"

Put the API key inside double quotes.

Now, you no longer need to type in the API key every time you spin up a new terminal.
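Under the hood, Streamlit loads everything in secrets.toml into st.secrets, and (as far as I'm aware) top-level string values are also exported as environment variables, which is how the OpenAI client finds the key. If you'd like the app to fail with a friendly message when the key is missing, an optional guard like this at the top of app.py is one way to do it:

import streamlit as st

# Optional sketch: stop early with a clear error if the key isn't configured.
if "OPENAI_API_KEY" not in st.secrets:
    st.error("OPENAI_API_KEY is missing. Add it to .streamlit/secrets.toml.")
    st.stop()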

Important: If you're using git, make sure to add secrets.toml to your .gitignore file before committing.

Wrap-up and next steps

Congrats🎉 you made an AI-powered document QA app in just 3 easy steps. If you want to deploy this app, Streamlit Community Cloud lets you share and deploy your apps for free in just a few minutes.
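If you do deploy to Community Cloud, note that it installs dependencies from a requirements.txt at the root of your repo. Mirroring the packages we installed in step 1 (pin versions however you like) would look something like this:

streamlit
pypdf
openai
faiss-cpu
langchain==0.0.77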

I encourage you to further develop this app, for example, by adding sources to the answers and adding support for more file types. You can learn more about langchain in their well-written documentation, which includes excellent examples for every use case.
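As a starting point for adding sources, langchain has a load_qa_with_sources_chain helper you could try swapping into get_answer(). The sketch below is only a rough outline, and it assumes you attach a "source" entry to each chunk's metadata when building the index (e.g. by passing metadatas to FAISS.from_texts):

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI

def get_answer_with_sources(index, query):
    """Rough sketch of get_answer() that also asks the model to cite its sources."""
    docs = index.similarity_search(query)
    chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
    result = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
    return result["output_text"]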

If you have any questions, feel free to leave a comment below.

Happy coding💻

🙌 Hey! If you enjoy my content and want to show some love, feel free to buy me a coffee. Each cup helps me create more useful content for incredible developers like you!

Top comments (4)

dongdongzhang

great article! but you lack the altair==4 in requirements.txt

dadoo-ai

Thank you so much !

Jiwanczuk

Awesome guide! Thank you very much for sharing this.

Sasmitha Manathunga

Glad you enjoyed it