TL;DR: I built an end-to-end RAG application that lets you upload any PDF and chat with it. It uses FAISS for vector search, sentence-transformers for embeddings, and Llama 3.1 via Groq for free LLM inference. This article walks through the full architecture, every bug I hit, and how to build it yourself.
Why I Built This
I was working through a Machine Learning course final project and kept running into the same problem: I had dozens of research papers and lecture notes as PDFs, and finding specific information meant scrolling through hundreds of pages manually.
So I built a solution. An AI assistant that reads your PDFs, understands them semantically, and answers your questions in plain English — with explanations, not just copied text.
The result is PDF Q&A Pro: a multi-tab Streamlit app powered by a RAG pipeline.
Here's what it looks like in action:
- Upload one or more PDFs
- Ask any question in natural language
- Get a detailed, explained answer with the exact source page cited
- Generate summaries, key insights, and topic analyses with one click
And the best part — it runs entirely on free APIs.
What is RAG and Why Should You Care?
Before we dive into code, let me explain the core concept: Retrieval-Augmented Generation (RAG).
The naive approach to PDF Q&A is to paste the entire document into an LLM. That breaks immediately for anything longer than a few pages — LLMs have context limits, and sending 200 pages is expensive and slow.
RAG solves this elegantly:
Instead of: [Entire PDF] + Question → LLM → Answer
RAG does: Question → Find relevant chunks → [Top 4 chunks] + Question → LLM → Answer
You only send the relevant parts of the document to the LLM. This makes it fast, cheap, and accurate.
The pipeline has two phases:
Indexing Phase (run once per upload):
PDF → Extract Text → Split into Chunks → Embed Chunks → Store in FAISS
Query Phase (run per question):
Question → Embed Question → FAISS Search → Top-K Chunks → LLM → Answer
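Everything in the query phase reduces to nearest-neighbor search over embedding vectors. A toy sketch makes the mechanic concrete — the hand-made 3-dimensional vectors below stand in for real embeddings (all-MiniLM-L6-v2 produces 384 dimensions), and FAISS does this same ranking, just much faster:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": in reality these come from a sentence-transformer model
chunks = {
    "Neural networks learn weights via backpropagation.": [0.9, 0.1, 0.0],
    "The invoice is due within 30 days.":                 [0.0, 0.2, 0.9],
    "Gradient descent minimizes the loss function.":      [0.8, 0.3, 0.1],
}

def retrieve(question_vec, k=2):
    # Rank every chunk by similarity to the question, keep the top k
    ranked = sorted(chunks, key=lambda c: cosine(chunks[c], question_vec), reverse=True)
    return ranked[:k]

# A question about training neural nets lands near the ML chunks,
# and the unrelated invoice chunk is filtered out before the LLM ever sees it
top = retrieve([0.85, 0.2, 0.05])
```

Only `top` is sent to the LLM — that filtering step is the whole point of RAG.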
The Tech Stack
Here's what I used and why:
| Component | Choice | Why |
|---|---|---|
| PDF Loading | LangChain PyPDFLoader | Handles multi-page extraction with metadata |
| Text Splitting | RecursiveCharacterTextSplitter | Preserves sentence boundaries |
| Embeddings | all-MiniLM-L6-v2 | 90MB, runs on CPU, excellent quality |
| Vector Store | FAISS | In-memory, millisecond search, no server needed |
| LLM | Llama 3.1 8B via Groq | Free, fast (< 2s), genuinely good quality |
| Frontend | Streamlit | Fast to build, easy to deploy |
| Orchestration | LangChain LCEL | Clean pipeline composition |
Building It — Step by Step
Step 1: Install Dependencies
pip install streamlit langchain langchain-community langchain-core \
langchain-text-splitters faiss-cpu pypdf sentence-transformers \
python-dotenv groq torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cpu
Use the CPU-only PyTorch build — it's 10x smaller and perfectly sufficient since we're not training anything.
Step 2: Set Up Your Free Groq API Key
- Go to console.groq.com
- Sign up with Google — takes 30 seconds
- Click API Keys → Create API Key
- Create a `.env` file in your project root:
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxx
Groq's free tier gives you 14,400 requests/day. More than enough.
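For the curious: `load_dotenv()` (used in Step 3) does little more than the simplified sketch below — parse `KEY=VALUE` lines into the environment. The real library also handles quoting, comments, and override behavior, so use it rather than this:

```python
import os

def load_env_file(path=".env"):
    # Simplified version of what python-dotenv's load_dotenv() does:
    # read KEY=VALUE lines and put them into os.environ
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # setdefault: an already-set environment variable wins
                os.environ.setdefault(key.strip(), value.strip())
```

After loading, the key is available anywhere in the process via `os.getenv("GROQ_API_KEY")`.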
Step 3: The RAG Pipeline (rag_pipeline.py)
Let's build the core ML logic:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnableLambda
from groq import Groq
from dotenv import load_dotenv
import tempfile, os
load_dotenv()
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
GROQ_MODEL = "llama-3.1-8b-instant"
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
PDF Loading and Chunking:
def load_and_index(uploaded_files):
    all_chunks = []
    for uploaded_file in uploaded_files:
        # Save to a temp file so PyPDFLoader can read it
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            tmp.write(uploaded_file.read())
            tmp_path = tmp.name
        loader = PyPDFLoader(tmp_path)
        documents = loader.load()
        os.unlink(tmp_path)  # clean up

        # Tag each page with its source filename
        for doc in documents:
            doc.metadata["source_file"] = uploaded_file.name

        # Split into overlapping chunks
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,  # overlap prevents losing info at boundaries
        )
        chunks = splitter.split_documents(documents)
        all_chunks.extend(chunks)

    # Embed and index
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={"device": "cpu"},
    )
    vectorstore = FAISS.from_documents(all_chunks, embeddings)
    return vectorstore
Why chunk_overlap=50? Imagine a sentence that spans two chunks. Without overlap, the context at the boundary gets lost. With a 50-character overlap, both chunks contain the boundary text, so retrieval stays accurate. (Note that RecursiveCharacterTextSplitter measures chunk_size and chunk_overlap in characters, not tokens.)
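The real splitter respects paragraph and sentence boundaries before falling back to characters, but a stripped-down sliding-window version makes the overlap mechanic concrete:

```python
def chunk_text(text, chunk_size=20, chunk_overlap=5):
    # Simplified character window; RecursiveCharacterTextSplitter additionally
    # tries to break on paragraphs and sentences before raw characters
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

text = "RAG retrieves relevant chunks before generation."
chunks = chunk_text(text)
# The last 5 characters of each chunk reappear at the start of the next one,
# so a fact straddling a boundary is fully contained in at least one chunk
```

With `chunk_overlap=0`, any sentence cut by a boundary would exist only as two meaningless halves; the overlap guarantees one intact copy.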
The LLM Call:
def answer_question(question: str, context: str, history: list) -> str:
    client = Groq(api_key=GROQ_API_KEY)
    messages = [
        {
            "role": "system",
            "content": """You are an intelligent assistant and expert tutor.
Given relevant excerpts from a PDF document and a question:
1. Explain the concept thoroughly in simple, clear language
2. Use the document as your primary source but add helpful context
3. Structure answers with bullet points when helpful
4. Always cite which file and page the information came from
5. End with a key takeaway in one sentence"""
        }
    ]
    # Inject the last 6 messages as conversational memory
    for h in history[-6:]:
        messages.append({"role": h["role"], "content": h["content"]})
    messages.append({
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}"
    })
    response = client.chat.completions.create(
        model=GROQ_MODEL,
        messages=messages,
        temperature=0.3,
        max_tokens=1024,
    )
    return response.choices[0].message.content.strip()
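That `history[-6:]` slice is the entire memory mechanism: only the six most recent turns ride along with each request, so the prompt stays bounded no matter how long the conversation runs. Pulled out on its own, the message assembly logic looks like this:

```python
def build_messages(system_prompt, history, question, context, window=6):
    # Keep only the most recent `window` turns so the prompt stays small
    messages = [{"role": "system", "content": system_prompt}]
    messages += [
        {"role": h["role"], "content": h["content"]} for h in history[-window:]
    ]
    messages.append(
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    )
    return messages

# A 20-turn conversation still produces only 1 + 6 + 1 = 8 messages
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(20)
]
msgs = build_messages("You are a tutor.", history, "What is RAG?", "some context")
```

The trade-off: the bot forgets anything said more than six turns ago, but each request stays cheap and well within the model's context window.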
Building the Chain:
def build_qa_chain(vectorstore):
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    def run_chain(inputs: dict):
        docs = retriever.invoke(inputs["question"])
        context = "\n\n".join(
            f"[{doc.metadata.get('source_file')} | Page {doc.metadata.get('page', 0) + 1}]\n{doc.page_content}"
            for doc in docs
        )
        answer = answer_question(
            inputs["question"],
            context,
            inputs.get("history", []),
        )
        return answer, docs

    return run_chain, retriever
Step 4: The Streamlit Frontend (app.py)
import streamlit as st
from rag_pipeline import load_and_index, build_qa_chain
from dotenv import load_dotenv

load_dotenv()

st.set_page_config(page_title="PDF Q&A Pro", page_icon="📄", layout="wide")
st.title("📄 PDF Q&A Pro")

# Sidebar: upload
with st.sidebar:
    uploaded_files = st.file_uploader(
        "Upload PDFs", type="pdf", accept_multiple_files=True
    )
    if uploaded_files:
        file_key = "_".join(sorted(f.name for f in uploaded_files))
        # Only re-index when the set of uploaded files actually changes
        if st.session_state.get("file_key") != file_key:
            with st.spinner("Indexing PDFs..."):
                vs = load_and_index(uploaded_files)
                st.session_state.qa_fn, st.session_state.retriever = build_qa_chain(vs)
                st.session_state.file_key = file_key
                st.session_state.messages = []
        st.success("Ready!")

# Chat interface
for msg in st.session_state.get("messages", []):
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if question := st.chat_input("Ask anything about your PDFs..."):
    # Guard: without this, asking a question before uploading crashes the app
    if "qa_fn" not in st.session_state:
        st.warning("Upload at least one PDF first.")
        st.stop()
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            answer, sources = st.session_state.qa_fn({
                "question": question,
                "history": st.session_state.messages[:-1],
            })
        st.markdown(answer)
        with st.expander("📚 Sources"):
            for i, doc in enumerate(sources, 1):
                st.markdown(
                    f"**Chunk {i}** — `{doc.metadata.get('source_file')}` "
                    f"| Page {doc.metadata.get('page', 0) + 1}"
                )
                st.caption(doc.page_content[:300] + "...")
    st.session_state.messages.append({"role": "assistant", "content": answer})
Step 5: Run It
python -m streamlit run app.py
Open http://localhost:8501, upload a PDF, and ask your first question.
The Bugs I Hit (So You Don't Have To)
This is the part most tutorials skip. Here's every error I encountered and how I fixed it:
1. LangChain Import Errors
LangChain v0.2+ split into multiple packages. If you see ModuleNotFoundError, use these correct imports:
# OLD (broken in v0.2+)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
# NEW (correct)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
Lesson: Always check which sub-package a LangChain class lives in. The migration guide at python.langchain.com is your friend.
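If your code has to run on both old and new LangChain installs, the try-new-then-old fallback can be written generically. The helper below is hypothetical (not part of LangChain), shown only to illustrate the pattern:

```python
from importlib import import_module

def import_from_first(module_paths, name):
    """Return attribute `name` from the first module path that imports cleanly."""
    for path in module_paths:
        try:
            return getattr(import_module(path), name)
        except (ImportError, AttributeError):
            continue  # try the next candidate location
    raise ImportError(f"{name!r} not found in any of {module_paths}")

# Usage for the LangChain package split: new location first, old as fallback
# RecursiveCharacterTextSplitter = import_from_first(
#     ["langchain_text_splitters", "langchain.text_splitter"],
#     "RecursiveCharacterTextSplitter",
# )
```

Pinning exact versions in requirements.txt is still the more robust fix; this just keeps notebooks and snippets portable.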
2. Hugging Face Free Tier Model Availability
Not all models work on the free inference API. I discovered this the hard way after getting 400 and 404 errors from flan-t5 and Mistral.
Test your models before building around them:
import requests

models = [
    "google/flan-t5-large",
    "facebook/bart-large-cnn",
    "deepset/roberta-base-squad2",
]
token = "hf_yourtoken"
headers = {"Authorization": f"Bearer {token}"}

for model in models:
    r = requests.post(
        f"https://router.huggingface.co/hf-inference/models/{model}",
        headers=headers,
        json={"inputs": "test"},
    )
    print(f"{model}: {r.status_code}")
Lesson: Test API endpoints programmatically. Don't assume availability from documentation.
3. Extractive vs Generative Models
I initially used deepset/roberta-base-squad2 thinking it would answer questions. It only copies exact spans of text from the document — it can't explain or elaborate.
Extractive (roberta-squad2):
Q: "What is machine learning?"
A: "a subset of artificial intelligence" ← just copied from PDF
Generative (Llama 3.1):
Q: "What is machine learning?"
A: "Machine learning is a branch of AI that enables computers to learn
patterns from data without being explicitly programmed. According to
page 3 of your document..." ← actual explanation
Lesson: For explanation-heavy use cases, you need a generative model. Extractive models are only useful for simple fact extraction.
4. The InferenceClient Version Conflict
The huggingface_hub library's InferenceClient had breaking changes across versions, causing:
'InferenceClient' object has no attribute 'post'
The fix: bypass the wrapper entirely and call the REST API directly.
import requests

def call_hf_api(prompt, token, model_url):
    response = requests.post(
        model_url,
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": prompt, "options": {"wait_for_model": True}},
        timeout=60,
    )
    # Fail loudly on 4xx/5xx instead of crashing on an unexpected JSON shape
    response.raise_for_status()
    return response.json()[0]["generated_text"]
Lesson: When library wrappers cause version conflicts, the underlying REST API is a reliable fallback.
5. Prompt Engineering Makes or Breaks Everything
Early system prompt:
"Use only the context to answer. Do not make up information."
Result: Short, unhelpful answers that just copied document sentences.
Improved system prompt:
"You are an expert tutor. Explain thoroughly in simple language.
Add examples. Structure with bullets. Cite sources. End with a key takeaway."
Result: Rich, explanatory answers that actually help the user understand.
Lesson: The system prompt is the most impactful variable in your entire pipeline. Spend real time on it.
Key Architecture Decisions
Why FAISS over Chroma/Pinecone?
FAISS runs entirely in-memory with no external server. For a student project or MVP, the simplicity is unbeatable. Chroma is great when you need persistence; Pinecone when you need scale.
Why chunk_size=500?
Too small and chunks lose context; too large and retrieval gets less precise while eating into the LLM's context window. 500 characters with a 50-character overlap is a well-tested sweet spot for most documents.
Why k=4 chunks?
Sending 4 × 500 = ~2,000 characters (roughly 500 tokens) of context to the LLM gives it enough information without overwhelming it or blowing the budget.
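These two numbers interact, and a quick back-of-envelope check shows the scale. The figures below assume the common rule of thumb of roughly 4 characters per token:

```python
import math

chunk_size, chunk_overlap, k = 500, 50, 4

def num_chunks(doc_chars):
    # Sliding window: each new chunk advances by (chunk_size - chunk_overlap)
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((doc_chars - chunk_overlap) / step))

# A 200-page PDF at ~1,800 characters per page
chunks = num_chunks(200 * 1800)     # chunks stored in FAISS

context_chars = k * chunk_size      # what actually reaches the LLM per query
context_tokens = context_chars // 4 # rough 4-chars-per-token estimate
```

So a 200-page document becomes a few hundred vectors in FAISS, yet each query ships only about 500 tokens of context — that asymmetry is why RAG stays fast and cheap at any document size.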
Why Groq over OpenAI?
Groq is genuinely free (not just a trial), returns responses in under 2 seconds thanks to custom LPU hardware, and Llama 3.1 8B is good enough for document Q&A. For a course project, there's no reason to pay.
What I'd Add Next
If I were to extend this project:
- Persistent FAISS index — Save the index to disk so re-uploading the same document doesn't re-embed everything
- Streaming responses — Stream LLM tokens to the UI for a ChatGPT-like feel
- Hybrid search — Combine FAISS semantic search with BM25 keyword search for better recall
- Document comparison — "How does Document A's approach differ from Document B's?"
- Export to Anki — Auto-generate flashcards from document content
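The first item is mostly bookkeeping: key the stored index by a hash of the uploaded filenames and skip re-embedding on a cache hit. Here is a minimal sketch of that caching pattern using pickle and a toy index — in the real app you would call FAISS's own save_local/load_local for the store itself:

```python
import hashlib
import os
import pickle

CACHE_DIR = "index_cache"

def cache_key(filenames):
    # Same set of files -> same key -> reuse the stored index
    joined = "|".join(sorted(filenames))
    return hashlib.sha256(joined.encode()).hexdigest()[:16]

def get_or_build(filenames, build_fn):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_key(filenames) + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)       # cache hit: skip re-embedding entirely
    index = build_fn(filenames)         # cache miss: build once, then store
    with open(path, "wb") as f:
        pickle.dump(index, f)
    return index
```

Since embedding is the slowest step of the whole pipeline, this turns a repeat upload from a minute-long wait into a near-instant load.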
The Full Feature List
The complete app (link at the bottom) includes:
- ✅ Multi-PDF chat — cross-document search with source attribution
- ✅ Per-document chat — isolated history per file
- ✅ Conversational memory — last 6 turns injected as context
- ✅ Document dashboard — type, complexity, tone, top topics
- ✅ Auto-summary and key insights generation
- ✅ Topic analysis — most discussed themes with frequency
- ✅ Source highlighting — file name + page number on every answer
- ✅ Download chat history, summaries, and insights
- ✅ Token and cost tracking per query
What This Project Taught Me
Building this was genuinely humbling. The ML concepts — RAG, embeddings, vector search — took an afternoon to understand. The debugging took two days.
And that's actually the point. The error messages were where the real learning happened:
- A 404 from a HuggingFace endpoint taught me to always test APIs before building around them
- An import error taught me that library versioning is a real engineering concern, not just housekeeping
- A bad answer from an extractive model taught me the difference between finding text and understanding text
- A flat, unhelpful LLM response taught me that prompt engineering is a skill worth investing in
If you're learning ML, my advice is simple: pick a problem, build something end-to-end, and let the bugs teach you.
Resources
- GitHub Repository: github.com/naimulkarim/pdf-qa-app
- RAG Paper (Lewis et al., 2020): arxiv.org/abs/2005.11401
- Embedding Model: huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- Groq (Free LLM API): console.groq.com
- FAISS: github.com/facebookresearch/faiss
- LangChain Docs: python.langchain.com
If this helped you, drop a ❤️ and share it with someone learning ML. And if you build something on top of this — I'd love to see it in the comments.