Building a RAG-Based PDF Question Answering System: Engineering Decisions, Failures, and Lessons

Sanjana medidi — Thu, 25 Jun 2026 09:23:41 +0000

A technical deep-dive into StudyMate AI, a Retrieval-Augmented Generation system built with LangChain, FAISS, HuggingFace, and Groq.

As an AI/ML student preparing applications for research internships at companies like Google, I wanted to build something that went beyond the typical classifier or fine-tuning demo. I wanted a project that demonstrated systems thinking — not just model calling. The result was StudyMate AI: a RAG pipeline that lets you upload any PDF and ask questions about it, grounded strictly in the document's content.

This post documents the real engineering decisions I made, the problems I ran into, and what I learned — including the parts that didn't work the first time.

What is RAG and Why Use It?

Retrieval-Augmented Generation (RAG) is a pattern where, instead of asking an LLM to answer from memory, you first retrieve relevant context from a knowledge source and inject it into the prompt. This means:

The model is instructed to answer only from what's in your document
You get source attribution, because every answer can be traced back to the retrieved chunks
Hallucinations are reduced, because the model has the relevant text in front of it instead of relying on recall

The alternative — fine-tuning an LLM on your documents — is expensive, slow, and overkill for a single-document use case. RAG was the right architectural choice here.
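The retrieve-then-inject pattern described above can be sketched in a few lines of plain Python. This is a toy illustration, not the StudyMate AI code: word overlap stands in for embedding similarity, and the sample chunks and the helper names (tokenize, retrieve, build_prompt) are made up for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(
        f"[Source {i + 1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


# Made-up chunks, purely for illustration.
chunks = [
    "The study surveyed language teachers about online resources.",
    "FAISS stores vectors in memory.",
    "Teachers referred students to online dictionaries most often.",
]
question = "Which online resources did teachers recommend?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
```

The prompt that reaches the model contains only the two best-matching chunks, each labelled with a source number, which is what makes attribution possible.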

System Architecture

PDF
 ↓
PyPDFLoader
 ↓
RecursiveCharacterTextSplitter  (chunk_size=800, overlap=100)
 ↓
HuggingFace Embeddings (all-MiniLM-L6-v2)  — runs locally
 ↓
FAISS Vector Store  (in-memory)
 ↓
Custom Two-Stage Retriever
(first_page_chunks + summary chunks + content chunks + broad search)
 ↓
Groq LLM  (llama-3.1-8b-instant)
 ↓
Answer
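To make the chunk_size=800, overlap=100 setting concrete, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, which this version ignores, and the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    `overlap` characters with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 2,000-character dummy text with a repeating alphabet pattern.
text = "".join(chr(97 + i % 26) for i in range(2000))
pieces = split_with_overlap(text)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.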

Engineering Decision 1: HuggingFace Embeddings over OpenAI

The first decision was how to embed the document chunks. OpenAI's embedding API is the popular choice, but it's pay-per-token — during development and testing, that cost accumulates quickly.

HuggingFace's all-MiniLM-L6-v2 runs locally on your machine, costs nothing, and requires no API key. For a single-user, single-PDF system, the retrieval-quality tradeoff is negligible. The broader habit is what carries over to larger systems: choose the tool that fits the actual constraints, not the most popular one.

Engineering Decision 2: FAISS over Hosted Vector Databases

Pinecone and Weaviate are common production choices for vector storage. They offer persistence across sessions, horizontal scaling, and multi-user support. None of that was needed here.

FAISS runs in-memory with zero setup cost. For one user processing one PDF at a time, it's the correct tradeoff. The rule I applied: use the simplest thing that satisfies your actual constraints. Reach for hosted infrastructure when you need persistence, concurrency, or datasets too large for memory — not before.
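For intuition about why an in-memory index is enough at this scale, here is a toy exhaustive search over a Python list. It assumes a flat (brute-force) index and L2 distance; the FlatIndex class is a hypothetical illustration, not FAISS's API, and FAISS does the same kind of work far faster in optimized native code.

```python
import math


class FlatIndex:
    """Toy exhaustive nearest-neighbour index held in a Python list."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((list(vector), payload))

    def search(self, query, k=2):
        """Return the k (distance, payload) pairs closest to query by L2 distance."""
        scored = [(math.dist(query, vec), payload) for vec, payload in self._items]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


index = FlatIndex()
index.add([0.0, 0.0], "chunk a")
index.add([1.0, 0.0], "chunk b")
index.add([5.0, 5.0], "chunk c")

hits = index.search([0.9, 0.1], k=2)
payloads = [payload for _, payload in hits]
```

Everything lives in process memory and disappears when the process exits, which is exactly the persistence tradeoff described above.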

Engineering Decision 3: Migrating from `RetrievalQA` to LCEL

During development I noticed LangChain had deprecated RetrievalQA. Rather than ignore the warning and ship deprecated code, I migrated to the current LCEL (LangChain Expression Language) chain composition pattern.

The old approach was a black box. The new approach is explicit:

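from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# The chain input is a dict such as {"input": "<question>"}; assign() adds a "context" key.
retrieval_chain = (
    RunnablePassthrough.assign(
        # Retrieve documents, then join them into a single context string.
        context=RunnableLambda(retrieve_with_summary) | format_docs
    )
    | prompt  # render the prompt template
    | llm     # send the rendered prompt to the Groq model
)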

Every step is visible — retrieval, formatting, prompting, generation. This matters for debugging and for understanding what the system is actually doing.
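To show what "every step is visible" buys you, here is a toy pipe built in plain Python. It is not LangChain: the Step class and the fake_retrieve and fake_llm helpers are hypothetical stand-ins, used only to show that a composed chain can be cut at any point and inspected.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and composes with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)


def fake_retrieve(query):
    return [f"chunk about {query}"]


def format_docs(docs):
    return "\n\n".join(docs)


def add_context(inputs):
    # Mirrors the assign() step: keep the input and add a "context" key.
    return {**inputs, "context": format_docs(fake_retrieve(inputs["input"]))}


def render_prompt(inputs):
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['input']}"


def fake_llm(prompt):
    return "echo: " + prompt


prompt_only = Step(add_context) | render_prompt  # stop before the model
chain = prompt_only | fake_llm                   # full pipeline

rendered = prompt_only.invoke({"input": "purpose"})
answer = chain.invoke({"input": "purpose"})
```

Invoking prompt_only returns the exact text the model would receive, which is the first thing worth checking when an answer looks wrong.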

The Problem Vanilla RAG Can't Solve

This is where it got interesting.

After building the basic pipeline, I tested it with: "What is the main purpose of this document?"

The response: "I cannot find the answer in the provided documents."

But the document's purpose was clearly stated in the abstract. What went wrong?

Similarity search surfaces locally similar chunks — chunks whose text is semantically close to the query. A query about "purpose" doesn't semantically match individual chunks about methodology or findings, even though the answer exists in the document.

Vanilla RAG is optimized for specific factual questions. Document-level questions — purpose, thesis, overview — require a global view of the document that chunk-level retrieval can't provide.

The Fix: Pre-Generated Summaries + First-Page Pinning

I solved this with two additions:

1. Pre-generated summary chunks

At build time, before any user query, I generate 5 targeted summaries from the first 2,500 characters of the document and store them as special chunks in the vector store:

summaries_to_create = {
    "research_question": "What is the exact research question?",
    "methodology":       "Describe the methodology in one sentence.",
    "findings":          "What are the main findings in one sentence?",
    "conclusions":       "What are the conclusions in one sentence?",
    "limitations":       "What are the limitations in one sentence?"
}

These give the retriever a global view of the document that similarity search alone can't provide.
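A minimal sketch of how summary chunks like these could be built and tagged. The Chunk dataclass, the build_summary_chunks function, the ask_llm callable, and the summary_name metadata key are all assumptions made for illustration; only the chunk_type value "summary" and the 2,500-character cap come from the pipeline described here.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a document object with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_summary_chunks(document_text, questions, ask_llm, max_chars=2500):
    """Ask one question per summary against the start of the document and
    wrap each answer as a chunk tagged chunk_type="summary"."""
    excerpt = document_text[:max_chars]
    chunks = []
    for name, question in questions.items():
        answer = ask_llm(f"{question}\n\nDocument excerpt:\n{excerpt}")
        chunks.append(
            Chunk(
                page_content=f"{name}: {answer}",
                metadata={"chunk_type": "summary", "summary_name": name},
            )
        )
    return chunks


# Stub model so the example runs without an API key.
seen_prompts = []

def stub_llm(prompt):
    seen_prompts.append(prompt)
    return "stub answer"


summary_chunks = build_summary_chunks(
    "A" * 3000,
    {"methodology": "Describe the methodology in one sentence.",
     "findings": "What are the main findings in one sentence?"},
    stub_llm,
)
```

Tagging the chunks in metadata is what lets the retriever later ask for summaries and content separately.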

2. First-page pinning

Pages 0 and 1 (the first two pages, in zero-based numbering) of an academic document almost always contain the abstract and introduction, which is where purpose and topic live. I pin these as always-included context regardless of the query:

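def retrieve_with_summary(inputs):
    """Combine pinned first-page chunks with filtered and unfiltered similarity hits."""
    # Accept either the chain's input dict or a bare query string.
    query = inputs["input"] if isinstance(inputs, dict) else inputs

    # vector_store and chunks are assumed to be module-level objects created at build time.
    summary_results    = vector_store.similarity_search(query, k=2, filter={"chunk_type": "summary"})
    content_results    = vector_store.similarity_search(query, k=3, filter={"chunk_type": "content"})
    broad_results      = vector_store.similarity_search(query, k=2)
    # Pin every chunk from pages 0 and 1; chunks without a page number default to 99 and are skipped.
    first_page_chunks  = [c for c in chunks if c.metadata.get("page", 99) in (0, 1)]

    # Merge in priority order, dropping chunks whose text was already seen.
    seen, all_docs = set(), []
    for doc in first_page_chunks + summary_results + content_results + broad_results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            all_docs.append(doc)
    return all_docs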

After this fix, document-level questions worked correctly.

Hitting Groq's Rate Limit — and Designing Around It

Groq's free tier allows 6,000 tokens per minute (TPM). My initial implementation used ThreadPoolExecutor with multiple workers to generate summaries in parallel. The result: all 5 API calls fired within milliseconds of each other, consuming ~5,000 tokens in one second and triggering a 429 error immediately.

Rate limit reached: Limit 6000, Used 5881, Requested 3378.
Please try again in 32.59s.

This is a real distributed systems constraint — and solving it required thinking about the problem like a systems engineer, not just a model user.

Solution:

max_workers=1: sequential generation eliminates the burst
time.sleep(35) between calls: 35s gives a safe buffer above the 32.59s wait reported in the error
Input capped at [:2500] characters: at roughly 4 characters per token, that is about 600 to 700 tokens of document text per prompt, so each call stays a small fraction of the TPM budget

The tradeoff is ~3 minutes of startup time on the free tier. On Groq's Dev tier (30,000 TPM), the sleep can be removed entirely and workers restored — startup drops to under 10 seconds.
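The pacing logic can be sketched as below. This is an illustration under stated assumptions, not the StudyMate AI code: min_interval_seconds treats the TPM limit as a simple average rate, which may not match how Groq actually accounts for usage, and both function names are hypothetical. The 3378 and 6000 figures are taken from the error message above.

```python
import time


def min_interval_seconds(tokens_per_call: int, tpm_limit: int) -> float:
    """Smallest gap between calls that keeps average usage at or below
    the per-minute token budget (simple average-rate assumption)."""
    if tokens_per_call > tpm_limit:
        raise ValueError("a single call exceeds the per-minute budget")
    return 60.0 * tokens_per_call / tpm_limit


def run_sequentially(tasks, call, pause_seconds, sleep=time.sleep):
    """Run call(task) one at a time, pausing between calls but not before the first."""
    results = []
    for i, task in enumerate(tasks):
        if i > 0:
            sleep(pause_seconds)
        results.append(call(task))
    return results


# Figures from the 429 message: 3378 tokens requested against a 6000 TPM limit.
gap = min_interval_seconds(3378, 6000)

# Dry run with a recording sleep so the example finishes instantly.
pauses = []
outputs = run_sequentially(["a", "b", "c"], str.upper, 35, sleep=pauses.append)
```

Under that assumption the gap comes out just under 34 seconds, so a 35-second sleep sits slightly above it. Passing sleep as a parameter keeps the loop testable without actually waiting.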

Hallucination Guardrail

The system prompt instructs the model to refuse to answer if the context doesn't support an answer:

Strict Rules:
1. Rely ONLY on the clear facts directly mentioned in the context.
2. Do NOT assume, extrapolate, or bring in outside knowledge.
3. If the context does not contain the answer, reply exactly:
   "I cannot find the answer in the provided documents."

Test result with an out-of-scope question:

Q: What is the capital of France?

I cannot find the answer in the provided documents.

Source 1 — The EUROCALL Review, Volume 25, No. 2, September 2017...
Source 2 — ...research question, description of participants...
Source 3 — ...referred their students to electronic or online resources...

The system correctly refuses rather than hallucinating, and returns the chunks it did find — making the reasoning transparent.
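Because the refusal is a fixed sentence, the application can detect it programmatically. The sketch below assembles the rules quoted above into a prompt and checks an answer against the refusal string; the function names, the REFUSAL constant, and the Context/Question layout are assumptions for illustration rather than the actual StudyMate AI prompt code.

```python
REFUSAL = "I cannot find the answer in the provided documents."


def build_guarded_prompt(context: str, question: str) -> str:
    """Prepend the strict rules to the retrieved context and the question."""
    return (
        "Strict Rules:\n"
        "1. Rely ONLY on the clear facts directly mentioned in the context.\n"
        "2. Do NOT assume, extrapolate, or bring in outside knowledge.\n"
        "3. If the context does not contain the answer, reply exactly:\n"
        f'   "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def is_refusal(answer: str) -> bool:
    """True when the model replied with the exact refusal sentence,
    ignoring surrounding whitespace and quotation marks."""
    return answer.strip().strip('"') == REFUSAL


guarded = build_guarded_prompt("Some retrieved text.", "What is the capital of France?")
```

A check like is_refusal makes it possible to count refusals during testing, or to label the returned sources as "closest matches" rather than evidence when the model declines.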

What I Learned

Vanilla RAG is not enough for document-level questions. Chunk similarity search is optimized for factual, specific queries. Any question requiring a global view of the document — purpose, thesis, summary — needs a separate strategy: pre-summarization, large-k retrieval, or a dedicated summary index.

Rate limits are a systems design problem. The solution isn't just adding a sleep — it's understanding the constraint (TPM budget), calculating the safe parameters (tokens per call × calls per minute), and designing the pipeline around them.

Read the deprecation warnings. RetrievalQA and langchain-community both flagged deprecation during development. Ignoring them is technical debt. Evaluating them — deciding when to migrate and when to defer — is engineering judgment.

Stack

Component	Choice
Frontend	Streamlit
LLM	Groq (llama-3.1-8b-instant)
Embeddings	HuggingFace all-MiniLM-L6-v2
Vector Store	FAISS
Chain	LangChain LCEL

DEV Community: Sanjana medidi
