Naimul Karim

I Built a PDF Q&A App with RAG, FAISS, and Llama 3.1 — Here's Everything I Learned

TL;DR: I built an end-to-end RAG application that lets you upload any PDF and chat with it. It uses FAISS for vector search, sentence-transformers for embeddings, and Llama 3.1 via Groq for free LLM inference. This article walks through the full architecture, every bug I hit, and how to build it yourself.


Why I Built This

I was working through a Machine Learning course final project and kept running into the same problem: I had dozens of research papers and lecture notes as PDFs, and finding specific information meant scrolling through hundreds of pages manually.

So I built a solution. An AI assistant that reads your PDFs, understands them semantically, and answers your questions in plain English — with explanations, not just copied text.

The result is PDF Q&A Pro: a multi-tab Streamlit app powered by a RAG pipeline.

Here's what it looks like in action:

  • Upload one or more PDFs
  • Ask any question in natural language
  • Get a detailed, explained answer with the exact source page cited
  • Generate summaries, key insights, and topic analyses with one click

And the best part — it runs entirely on free APIs.


What is RAG and Why Should You Care?

Before we dive into code, let me explain the core concept: Retrieval-Augmented Generation (RAG).

The naive approach to PDF Q&A is to paste the entire document into an LLM. That breaks immediately for anything longer than a few pages — LLMs have context limits, and sending 200 pages is expensive and slow.

RAG solves this elegantly:

```
Instead of: [Entire PDF] + Question → LLM → Answer

RAG does:   Question → Find relevant chunks → [Top 4 chunks] + Question → LLM → Answer
```

You only send the relevant parts of the document to the LLM. This makes it fast, cheap, and accurate.

The pipeline has two phases:

Indexing Phase (run once per upload):

```
PDF → Extract Text → Split into Chunks → Embed Chunks → Store in FAISS
```

Query Phase (run per question):

```
Question → Embed Question → FAISS Search → Top-K Chunks → LLM → Answer
```
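
If the diagram feels abstract, here is the entire idea in a few lines of raw sentence-transformers and FAISS, stripped of any framework. This is a toy sketch with hard-coded strings, not the app's code:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Indexing phase: embed the chunks once and store the vectors
chunks = ["FAISS is a library for vector search.", "Llama 3.1 is a language model."]
vectors = model.encode(chunks)               # float32 array, shape (n_chunks, 384)
index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search, no training needed
index.add(vectors)

# Query phase: embed the question and fetch the closest chunk
_, ids = index.search(model.encode(["What is FAISS?"]), k=1)
print(chunks[ids[0][0]])  # -> "FAISS is a library for vector search."
```

Everything LangChain adds on top of this is convenience: loaders, metadata, and retriever plumbing.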

The Tech Stack

Here's what I used and why:

| Component | Choice | Why |
| --- | --- | --- |
| PDF loading | LangChain PyPDFLoader | Handles multi-page extraction with metadata |
| Text splitting | RecursiveCharacterTextSplitter | Preserves sentence boundaries |
| Embeddings | all-MiniLM-L6-v2 | 90 MB, runs on CPU, excellent quality |
| Vector store | FAISS | In-memory, millisecond search, no server needed |
| LLM | Llama 3.1 8B via Groq | Free, fast (< 2 s), genuinely good quality |
| Frontend | Streamlit | Fast to build, easy to deploy |
| Orchestration | LangChain | Clean pipeline composition (loaders, splitters, retrievers) |

Building It — Step by Step

Step 1: Install Dependencies

```bash
# Install the CPU-only PyTorch wheels first, from PyTorch's dedicated index
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Then install everything else from PyPI
pip install streamlit langchain langchain-community langchain-core \
  langchain-text-splitters faiss-cpu pypdf sentence-transformers \
  python-dotenv groq
```

Use the CPU-only PyTorch build — it's 10x smaller and perfectly sufficient since we're not training anything.
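
A quick way to confirm the CPU-only build actually landed (the version string of CPU wheels carries a `+cpu` suffix):

```python
import torch

print(torch.__version__)          # something like "2.4.0+cpu" for the CPU build
print(torch.cuda.is_available())  # False is expected and fine here
```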

Step 2: Set Up Your Free Groq API Key

  1. Go to console.groq.com
  2. Sign up with Google — takes 30 seconds
  3. Click API Keys → Create API Key
  4. Create a .env file:
```
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxx
```

Groq's free tier gives you 14,400 requests/day. More than enough.
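
Before wiring the key into the pipeline, a ten-second smoke test saves debugging later. This mirrors the exact chat call used in Step 3:

```python
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads GROQ_API_KEY from .env
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Reply with exactly: pong"}],
)
print(resp.choices[0].message.content)
```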

Step 3: The RAG Pipeline (rag_pipeline.py)

Let's build the core ML logic:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from groq import Groq
from dotenv import load_dotenv
import os
import tempfile

load_dotenv()

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
GROQ_MODEL      = "llama-3.1-8b-instant"
GROQ_API_KEY    = os.getenv("GROQ_API_KEY")
```

PDF Loading and Chunking:

```python
def load_and_index(uploaded_files):
    all_chunks = []

    # Split into overlapping chunks; the overlap prevents losing
    # information at chunk boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )

    for uploaded_file in uploaded_files:
        # Save to a temp file so PyPDFLoader can read it
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            tmp.write(uploaded_file.read())
            tmp_path = tmp.name

        loader    = PyPDFLoader(tmp_path)
        documents = loader.load()
        os.unlink(tmp_path)  # clean up

        # Tag each page with its source filename
        for doc in documents:
            doc.metadata["source_file"] = uploaded_file.name

        all_chunks.extend(splitter.split_documents(documents))

    # Embed every chunk and build the in-memory FAISS index
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={"device": "cpu"},
    )
    return FAISS.from_documents(all_chunks, embeddings)
```

Why chunk_overlap=50? Imagine a sentence that spans two chunks. Without overlap, the context at the boundary gets lost. With a 50-character overlap, both chunks contain the boundary text, so retrieval stays accurate. (RecursiveCharacterTextSplitter counts characters, not tokens, so both chunk_size and chunk_overlap are character counts.)
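
You can watch the overlap happen by splitting a toy string; split_text is the plain-string sibling of split_documents, and the tiny sizes here are just to make the effect visible:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
pieces = splitter.split_text(
    "Machine learning lets computers learn patterns from data without explicit rules."
)
for p in pieces:
    print(repr(p))  # consecutive pieces repeat up to 10 characters of shared text
```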

The LLM Call:

```python
def answer_question(question: str, context: str, history: list) -> str:
    client = Groq(api_key=GROQ_API_KEY)

    messages = [
        {
            "role": "system",
            "content": """You are an intelligent assistant and expert tutor.
Given relevant excerpts from a PDF document and a question:
1. Explain the concept thoroughly in simple, clear language
2. Use the document as your primary source but add helpful context
3. Structure answers with bullet points when helpful
4. Always cite which file and page the information came from
5. End with a key takeaway in one sentence"""
        }
    ]

    # Inject the last 6 messages as conversational memory
    for h in history[-6:]:
        messages.append({"role": h["role"], "content": h["content"]})

    messages.append({
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}"
    })

    response = client.chat.completions.create(
        model=GROQ_MODEL,
        messages=messages,
        temperature=0.3,
        max_tokens=1024,
    )
    return response.choices[0].message.content.strip()
```

Building the Chain:

```python
def build_qa_chain(vectorstore):
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    def run_chain(inputs: dict):
        docs    = retriever.invoke(inputs["question"])
        context = "\n\n".join(
            f"[{doc.metadata.get('source_file')} | Page {doc.metadata.get('page', 0)+1}]\n{doc.page_content}"
            for doc in docs
        )
        answer = answer_question(
            inputs["question"],
            context,
            inputs.get("history", [])
        )
        return answer, docs

    return run_chain, retriever
```
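
Before touching the UI, the whole pipeline can be smoke-tested from a plain script. An open() file object has the .read() and .name attributes that load_and_index expects from Streamlit uploads; paper.pdf here is a stand-in for any local PDF:

```python
from rag_pipeline import load_and_index, build_qa_chain

with open("paper.pdf", "rb") as f:  # any local PDF
    vectorstore = load_and_index([f])

qa_fn, retriever = build_qa_chain(vectorstore)
answer, docs = qa_fn({"question": "What is the main topic?", "history": []})
print(answer)
```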

Step 4: The Streamlit Frontend (app.py)

```python
import streamlit as st
from rag_pipeline import load_and_index, build_qa_chain
from dotenv import load_dotenv

load_dotenv()

st.set_page_config(page_title="PDF Q&A Pro", page_icon="📄", layout="wide")
st.title("📄 PDF Q&A Pro")

# Sidebar: upload
with st.sidebar:
    uploaded_files = st.file_uploader(
        "Upload PDFs", type="pdf", accept_multiple_files=True
    )
    if uploaded_files:
        file_key = "_".join(sorted(f.name for f in uploaded_files))
        if st.session_state.get("file_key") != file_key:
            with st.spinner("Indexing PDFs..."):
                vs = load_and_index(uploaded_files)
                st.session_state.qa_fn, st.session_state.retriever = build_qa_chain(vs)
                st.session_state.file_key = file_key
                st.session_state.messages = []
            st.success("Ready!")

# Chat interface
for msg in st.session_state.get("messages", []):
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if question := st.chat_input("Ask anything about your PDFs..."):
    # Guard against questions asked before any PDF is indexed
    if "qa_fn" not in st.session_state:
        st.warning("Upload a PDF first.")
        st.stop()

    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            answer, sources = st.session_state.qa_fn({
                "question": question,
                "history": st.session_state.messages[:-1],
            })
        st.markdown(answer)

        with st.expander("📚 Sources"):
            for i, doc in enumerate(sources, 1):
                st.markdown(
                    f"**Chunk {i}** — `{doc.metadata.get('source_file')}` "
                    f"| Page {doc.metadata.get('page', 0)+1}"
                )
                st.caption(doc.page_content[:300] + "...")

    st.session_state.messages.append({"role": "assistant", "content": answer})
```

Step 5: Run It

```bash
python -m streamlit run app.py
```

Open http://localhost:8501, upload a PDF, and ask your first question.


The Bugs I Hit (So You Don't Have To)

This is the part most tutorials skip. Here's every error I encountered and how I fixed it:

1. LangChain Import Errors

LangChain v0.2+ split into multiple packages. If you see ModuleNotFoundError, use these correct imports:

```python
# OLD (broken in v0.2+)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate

# NEW (correct)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
```

Lesson: Always check which sub-package a LangChain class lives in. The migration guide at python.langchain.com is your friend.

2. Hugging Face Free Tier Model Availability

Not all models work on the free inference API. I discovered this the hard way after getting 400 and 404 errors from flan-t5 and Mistral.

Test your models before building around them:

```python
import requests

models = ["google/flan-t5-large", "facebook/bart-large-cnn", "deepset/roberta-base-squad2"]
token  = "hf_yourtoken"
headers = {"Authorization": f"Bearer {token}"}

for model in models:
    r = requests.post(
        f"https://router.huggingface.co/hf-inference/models/{model}",
        headers=headers,
        json={"inputs": "test"}
    )
    print(f"{model}: {r.status_code}")
```

Lesson: Test API endpoints programmatically. Don't assume availability from documentation.

3. Extractive vs Generative Models

I initially used deepset/roberta-base-squad2 thinking it would answer questions. It only copies exact spans of text from the document — it can't explain or elaborate.

```
Extractive (roberta-squad2):
Q: "What is machine learning?"
A: "a subset of artificial intelligence"  ← just copied from PDF

Generative (Llama 3.1):
Q: "What is machine learning?"
A: "Machine learning is a branch of AI that enables computers to learn 
   patterns from data without being explicitly programmed. According to 
   page 3 of your document..."  ← actual explanation
```

Lesson: For explanation-heavy use cases, you need a generative model. Extractive models are only useful for simple fact extraction.
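
You can reproduce the difference locally with the transformers pipeline; the extractive model can only ever return a span that already exists in the context string you hand it:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="What is machine learning?",
    context="Machine learning is a subset of artificial intelligence.",
)
print(result["answer"])  # a verbatim span, e.g. "a subset of artificial intelligence"
```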

4. The InferenceClient Version Conflict

The huggingface_hub library's InferenceClient had breaking changes across versions, causing:

```
'InferenceClient' object has no attribute 'post'
```

The fix: bypass the wrapper entirely and call the REST API directly.

```python
import requests

def call_hf_api(prompt, token, model_url):
    response = requests.post(
        model_url,
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": prompt, "options": {"wait_for_model": True}},
        timeout=60
    )
    return response.json()[0]["generated_text"]
```

Lesson: When library wrappers cause version conflicts, calling the REST API directly is a reliable fallback.

5. Prompt Engineering Makes or Breaks Everything

Early system prompt:

"Use only the context to answer. Do not make up information."
Enter fullscreen mode Exit fullscreen mode

Result: Short, unhelpful answers that just copied document sentences.

Improved system prompt:

"You are an expert tutor. Explain thoroughly in simple language. 
Add examples. Structure with bullets. Cite sources. End with a key takeaway."
Enter fullscreen mode Exit fullscreen mode

Result: Rich, explanatory answers that actually help the user understand.

Lesson: The system prompt is the most impactful variable in your entire pipeline. Spend real time on it.


Key Architecture Decisions

Why FAISS over Chroma/Pinecone?
FAISS runs entirely in-memory with no external server. For a student project or MVP, the simplicity is unbeatable. Chroma is great when you need persistence; Pinecone when you need scale.

Why chunk_size=500?
Too small and chunks lose context; too large and retrieval gets less precise while burning more of the LLM's context window. 500 characters with a 50-character overlap is a well-tested sweet spot for most documents.

Why k=4 chunks?
Sending 4 × 500 = ~2,000 characters of context (roughly 500 tokens) gives the LLM enough information without overwhelming it or blowing the budget.

Why Groq over OpenAI?
Groq is genuinely free (not just a trial), returns responses in under 2 seconds thanks to custom LPU hardware, and Llama 3.1 8B is good enough for document Q&A. For a course project, there's no reason to pay.


What I'd Add Next

If I were to extend this project:

  1. Persistent FAISS index — Save the index to disk so re-uploading the same document doesn't re-embed everything (see the sketch after this list)
  2. Streaming responses — Stream LLM tokens to the UI for a ChatGPT-like feel
  3. Hybrid search — Combine FAISS semantic search with BM25 keyword search for better recall
  4. Document comparison — "How does Document A's approach differ from Document B's?"
  5. Export to Anki — Auto-generate flashcards from document content
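
Item 1 is closer than it looks: LangChain's FAISS wrapper already ships save_local and load_local. A sketch of what that would take (recent versions require the allow_dangerous_deserialization flag because the docstore is pickle-backed):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# After the first indexing run (vectorstore comes from load_and_index):
vectorstore.save_local("faiss_index")

# On later runs, load from disk instead of re-embedding everything:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
```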

The Full Feature List

The complete app (link at the bottom) includes:

  • ✅ Multi-PDF chat — cross-document search with source attribution
  • ✅ Per-document chat — isolated history per file
  • ✅ Conversational memory — last 6 turns injected as context
  • ✅ Document dashboard — type, complexity, tone, top topics
  • ✅ Auto-summary and key insights generation
  • ✅ Topic analysis — most discussed themes with frequency
  • ✅ Source highlighting — file name + page number on every answer
  • ✅ Download chat history, summaries, and insights
  • ✅ Token and cost tracking per query (see the sketch after this list)
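
The token tracking in the last item relies on Groq's SDK mirroring the OpenAI response shape: every chat completion carries a usage object. A minimal sketch, reading it off the response inside answer_question:

```python
# Inside answer_question, right after client.chat.completions.create(...):
usage = response.usage
print(f"prompt={usage.prompt_tokens} "
      f"completion={usage.completion_tokens} "
      f"total={usage.total_tokens}")
```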

What This Project Taught Me

Building this was genuinely humbling. The ML concepts — RAG, embeddings, vector search — took an afternoon to understand. The debugging took two days.

And that's actually the point. The error messages were where the real learning happened:

  • A 404 from a Hugging Face endpoint taught me to always test APIs before building around them
  • An import error taught me that library versioning is a real engineering concern, not just housekeeping
  • A bad answer from an extractive model taught me the difference between finding text and understanding text
  • A flat, unhelpful LLM response taught me that prompt engineering is a skill worth investing in

If you're learning ML, my advice is simple: pick a problem, build something end-to-end, and let the bugs teach you.


Resources

- GitHub Repository: github.com/naimulkarim/pdf-qa-app
- RAG Paper (Lewis et al., 2020): arxiv.org/abs/2005.11401
- Embedding Model: huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- Groq (Free LLM API): console.groq.com


If this helped you, drop a ❤️ and share it with someone learning ML. And if you build something on top of this — I'd love to see it in the comments.
