I Built a PDF Chatbot — Here's What Actually Worked

#webdev #python #ai #tutorial

Last month, I needed to let users upload a PDF and ask questions about it. Sounds simple, right? I figured I'd throw some regex at it, maybe a keyword search. Two days later, I was staring at a wall of spaghetti code that failed on any question not phrased exactly like my test cases.

I'm a backend developer, not an NLP researcher. But I needed a solution that was reliable, scalable, and shippable in a few days. Here's the story of what I tried, what failed, and the approach that finally clicked.

The Problem

A client wanted a knowledge base feature: upload PDFs (manuals, reports, etc.), then ask natural language questions and get answers extracted from those PDFs. The documents were unstructured, sometimes hundreds of pages. I had to build this into an existing web app.

What I Tried First (and Why It Hurt)

I started with the naive approach: extract text with PyPDF2, split by paragraphs, build a simple TF-IDF index, and return the most relevant paragraph. Then I'd feed that paragraph into some heuristic answer extraction (e.g., find sentences containing the query words).

It failed spectacularly.

Users asked "What is the maximum temperature?" but the PDF said "operating temp: 150°C". My keyword matching missed it because "maximum" ≠ "operating".
Multi-sentence answers were impossible because I returned only one paragraph.
Ambiguity was everywhere: "the valve" might be mentioned 50 pages earlier.
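
To make the first failure concrete, here is a minimal sketch of exact-word matching (not my original TF-IDF code, just the same underlying idea). The `keyword_score` helper and the distractor sentence are made up for illustration. The passage that actually holds the answer scores zero, while an unrelated sentence that happens to share filler words scores higher.

```python
import re

def words(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query: str, passage: str) -> int:
    """Count distinct query words that appear verbatim in the passage."""
    return len(words(query) & words(passage))

query = "What is the maximum temperature?"
passage = "operating temp: 150°C"                      # contains the real answer
distractor = "What is the warranty period for the unit?"  # hypothetical unrelated text

score_passage = keyword_score(query, passage)
score_distractor = keyword_score(query, distractor)
print(score_passage, score_distractor)
```

Note that "temp" and "temperature" are different tokens to a word matcher, so even the obvious overlap is lost.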

I tried fine-tuning a small BERT model on QA pairs, but that requires a ton of labeled data my client didn't have. Dead end.

What Eventually Worked: Chunk + Embed + Retrieve + Generate

After a week of frustration, I switched to a pipeline that is now almost boringly standard in the AI world, but it works:

Chunk the PDF into overlapping segments of ~500 tokens.
Embed each chunk into a vector using an embedding model (I used OpenAI's text-embedding-ada-002).
Store vectors in a vector database (I used Pinecone, but any will do; even a local FAISS index works for prototyping).
User asks a question → embed the question → retrieve top K chunks by cosine similarity.
Feed those chunks + the question to an LLM (GPT-3.5-turbo) with a prompt that says: "Answer the question using only the context below."

Here's the core Python code I ended up using (simplified):

import os

import numpy as np
import openai  # written against the pre-1.0 openai SDK (openai.Embedding / openai.ChatCompletion)
import tiktoken
from PyPDF2 import PdfReader

# Step 1: Extract text from PDF
def extract_text(pdf_path):
    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        # Image-only (scanned) pages yield no text; `or ""` keeps this safe if nothing comes back
        pages.append(page.extract_text() or "")
    # Join with newlines so words at page boundaries don't run together
    return "\n".join(pages)

# Step 2: Chunk text with overlap
def chunk_text(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokenizer.decode(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, so we don't emit a tail
        # chunk that only repeats the previous chunk's overlap
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Step 3: Embed chunks (you'd do this once and store the vectors)
# Read the key from the environment instead of hardcoding it in source
openai.api_key = os.environ["OPENAI_API_KEY"]

def embed_chunks(chunks, batch_size=100):
    embeddings = []
    # Send chunks in batches: one request with every chunk can exceed request size limits
    for start in range(0, len(chunks), batch_size):
        response = openai.Embedding.create(
            input=chunks[start:start + batch_size],
            model="text-embedding-ada-002"
        )
        embeddings.extend(d["embedding"] for d in response["data"])
    return embeddings

# Assume we have stored embeddings in a vector DB. For simplicity, use a dict.
# In real code, use Pinecone/Weaviate/FAISS.
vector_db = {}

chunks = chunk_text(extract_text("manual.pdf"))
embeddings = embed_chunks(chunks)
for i, emb in enumerate(embeddings):
    vector_db[i] = emb

# Step 4: Retrieve relevant chunks
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve_chunks(query, top_k=3):
    query_emb = openai.Embedding.create(
        input=[query],
        model="text-embedding-ada-002"
    )["data"][0]["embedding"]
    # Cosine similarity against every stored vector (simple loop, fine for a prototype)
    similarities = []
    for idx, emb in vector_db.items():
        sim = cosine_similarity(query_emb, emb)
        similarities.append((idx, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[idx] for idx, _ in similarities[:top_k]]

# Step 5: Generate answer
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know.'"

def answer_question(query):
    context_chunks = retrieve_chunks(query)
    context = "\n\n".join(context_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response["choices"][0]["message"]["content"]

print(answer_question("What is the maximum operating temperature?"))
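
If you want to sanity-check the retrieval step without paying for API calls, one option is to run the same ranking logic over hand-made vectors. This is a sketch under that assumption: `toy_store` and its three-dimensional vectors are invented stand-ins for real embeddings, and `top_k` is a hypothetical helper, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda idx: cosine(query_vec, store[idx]), reverse=True)
    return ranked[:k]

# Hypothetical stand-ins for chunk embeddings
toy_store = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.0, 0.0, 1.0],
}

result = top_k([1.0, 0.05, 0.0], toy_store, k=2)
print(result)
```

The two vectors pointing roughly the same way as the query come back first, which is all the real pipeline relies on.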

Lessons Learned (the hard way)

Chunk size matters. Too small (e.g., 200 tokens) → answers get fragmented across chunks. Too large (e.g., 2000 tokens) → the prompt fills with irrelevant text, and retrieval gets less precise because each embedding has to summarize several topics at once. 500 tokens with a 50-token overlap worked well for my documents.
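
To get a feel for the trade-off, it helps to look at how many chunks each setting produces. The sketch below only computes token offsets, so it runs without a tokenizer. The 60,000-token document length is an assumed example, and `window_bounds` is a hypothetical helper.

```python
def window_bounds(n_tokens, chunk_size=500, overlap=50):
    """Return (start, end) token offsets for each overlapping chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    bounds = []
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # this window already reached the end of the document
    return bounds

# Assumed document length, for illustration only
DOC_TOKENS = 60_000
for size in (200, 500, 2000):
    n = len(window_bounds(DOC_TOKENS, chunk_size=size, overlap=50))
    print(f"chunk_size={size}: {n} chunks")
```

Smaller chunks mean more vectors to store and search. Larger chunks mean fewer vectors, but each retrieved chunk eats more of the prompt.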

Embedding model choice. I started with text-embedding-ada-002 because it's cheap and good enough for general-purpose text. For specialized domains (legal, medical), you might want a domain-specific or fine-tuned embedding model. For my generic manuals, ada-002 was fine.

The LLM is not always honest. Even with the "only use context" prompt, GPT occasionally hallucinated. I added a post-processing step: hedging phrases like "based on the context" are fine, but if the answer makes claims that don't appear in the retrieved chunks, I discard it. You can also stay on a smaller model like GPT-3.5-turbo: it's cheaper than GPT-4 and, in my experience, no more prone to hallucination on this narrow task.
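
One crude way to implement that kind of check is word overlap between the answer and the retrieved context. This is a hypothetical heuristic for illustration, not the exact filter I shipped. The stopword list, the 0.7 threshold, and the function names are all assumptions you would want to tune on real answers.

```python
import re

# Small illustrative stopword list; a real one would be longer
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or", "on", "for", "it", "at"}

def content_words(text):
    """Lowercased alphanumeric words, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def grounding_ratio(answer, context):
    """Fraction of the answer's content words that also occur in the context."""
    answer_words = content_words(answer)
    if not answer_words:
        return 1.0  # nothing to verify
    return len(answer_words & content_words(context)) / len(answer_words)

def is_grounded(answer, context, threshold=0.7):
    """Accept the answer only if enough of its words are backed by the context."""
    return grounding_ratio(answer, context) >= threshold

context = "The pump has an operating temp of 150°C and a rated pressure of 10 bar."
grounded = "The operating temp is 150°C."
invented = "The warranty lasts five years."

print(is_grounded(grounded, context), is_grounded(invented, context))
```

Word overlap will reject correct paraphrases and accept wrong answers that reuse the right vocabulary, so treat it as a first-pass filter rather than a guarantee.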

Cost. Embedding a 100-page PDF (a few hundred chunks at most at this chunk size) costs about $0.02, and you pay it once per document. Each query embedding is negligible. The LLM call costs ~$0.001 per query. For a low-traffic app, that's fine. For high traffic, consider caching frequent answers or using a local LLM like Llama 3.
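
Here is a back-of-the-envelope version of that arithmetic. The per-document and per-query defaults are just the rough figures quoted above, and the document count, query volume, and cache hit rate are invented for the example. Check current pricing before relying on any of it.

```python
def estimate_cost(n_documents, n_queries, cache_hit_rate=0.0,
                  usd_per_document=0.02, usd_per_query=0.001):
    """Rough spend: one-time embedding per document plus one LLM call per cache miss."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    embedding = n_documents * usd_per_document
    generation = n_queries * (1.0 - cache_hit_rate) * usd_per_query
    return embedding + generation

# Hypothetical workload: 50 documents, 10,000 queries
no_cache = estimate_cost(n_documents=50, n_queries=10_000)
with_cache = estimate_cost(n_documents=50, n_queries=10_000, cache_hit_rate=0.3)
print(f"no cache: ${no_cache:.2f}, 30% cache hits: ${with_cache:.2f}")
```

The takeaway is that generation dominates: embedding is a one-time cost per document, while every uncached query pays for an LLM call.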

Alternatives I considered:

LangChain would have saved me some boilerplate, but I wanted to understand every step. I later migrated to LangChain for production – it's solid.
Full-text search (Elasticsearch) combined with an LLM can work too, but keyword retrieval misses paraphrases, so you lose the semantic matching that embeddings give you.
Commercial services like the one at ai.interwestinfo.com offer turnkey solutions – if you don't want to build the infra, that's a valid choice. But the approach I described is open and customizable.

When This Approach Doesn't Work

If your PDFs contain complex tables, diagrams, or handwriting, text extraction alone fails. You'd need OCR or multimodal models.
If your documents are very large (thousands of pages), you need a more sophisticated chunking strategy (e.g., splitting by section) and a vector DB that can handle that many vectors.
If latency is critical (<1s), the LLM call is the bottleneck. You might cache frequent answers or use a smaller, faster model.
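
For the caching idea, a minimal in-memory sketch might look like the following. `AnswerCache` is a hypothetical class, and `fake_answer` is a stand-in for the real pipeline call so the example runs offline. In production you would likely swap the dict for something like Redis and include a document ID in the key.

```python
import re

class AnswerCache:
    """Tiny in-memory cache keyed on a normalized form of the question."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def normalize(question):
        # Lowercase, drop punctuation, collapse whitespace
        return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

    def ask(self, question):
        key = self.normalize(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._answer_fn(question)
        self._store[key] = answer
        return answer

# Stand-in for the real (slow, paid) pipeline call
def fake_answer(question):
    return f"answer to: {question}"

cache = AnswerCache(fake_answer)
first = cache.ask("What is the maximum operating temperature?")
second = cache.ask("what is the MAXIMUM operating temperature")
print(cache.hits, cache.misses)
```

This only catches near-identical phrasings. It also never invalidates entries, so cached answers go stale if the underlying document is replaced.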

What I'd Do Differently Now

I'd start with LangChain from day one, using their RecursiveCharacterTextSplitter and built-in integration with OpenAI and Pinecone. But I'm glad I wrote the raw code first – it helped me debug the pipeline when things broke.
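
The appeal of a structure-aware splitter is that it prefers natural boundaries over arbitrary token offsets. The sketch below is not LangChain's implementation, just a simplified illustration of the idea: it packs whole paragraphs into chunks up to a character budget. `pack_paragraphs` and `max_chars` are hypothetical names.

```python
def pack_paragraphs(text, max_chars=1000):
    """Greedily pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than max_chars becomes its own oversized chunk;
            # a fuller version would split it further (by sentence, say)
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "Intro paragraph.\n\nSafety notes for the valve.\n\nOperating temp: 150°C."
parts = pack_paragraphs(sample, max_chars=50)
print(len(parts))
```

Because chunks end at paragraph breaks, a sentence like the temperature spec is never cut in half, which was a recurring annoyance with fixed-size token windows.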

Also, I'd add a feedback loop: let users rate answers, and use those ratings to tune the retrieval settings or the prompt over time.

Your Turn

This stack (chunk → embed → retrieve → generate) is surprisingly robust. If you're building a document Q&A system, I'd love to hear what worked for you. Did you use a different retrieval method? Sparse vs. dense? What chunking strategy gave you the best results? Drop your experience in the comments.