DEV Community

zhongqiyue


I Built a PDF Chatbot — Here's What Actually Worked

Last month, I needed to let users upload a PDF and ask questions about it. Sounds simple, right? I figured I'd throw some regex at it, maybe a keyword search. Two days later, I was staring at a wall of spaghetti code that failed on any question not phrased exactly like my test cases.

I'm a backend developer, not an NLP researcher. But I needed a solution that was reliable, scalable, and something I could ship in a few days. Here's the story of what I tried, what failed, and the approach that finally clicked.

The Problem

A client wanted a knowledge base feature: upload PDFs (manuals, reports, etc.), then ask natural language questions and get answers extracted from those PDFs. The documents were unstructured, sometimes hundreds of pages. I had to build this into an existing web app.

What I Tried First (and Why It Hurt)

I started with the naive approach: extract text with PyPDF2, split by paragraphs, build a simple TF-IDF index, and return the most relevant paragraph. Then I'd feed that paragraph into some heuristic answer extraction (e.g., find sentences containing the query words).

It failed spectacularly.

  • Users asked "What is the maximum temperature?" but the PDF said "operating temp: 150°C". My keyword matching missed it: "maximum" never appears in that sentence, and "temp" is not the same token as "temperature".
  • Multi-sentence answers were impossible because I returned only one paragraph.
  • Ambiguity was everywhere: "the valve" might be mentioned 50 pages earlier.
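
To make the first failure concrete, here is a minimal sketch of exact-term matching in plain Python. It is not the original TF-IDF code: the tokenizer, the `overlap_score` helper, and the sample strings are illustrative assumptions. The point carries over to TF-IDF, though, since any scorer built on exact terms gives no credit to a paragraph that shares no terms with the query.

```python
import re

def terms(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, paragraph):
    """Count how many distinct query terms appear verbatim in the paragraph."""
    return len(terms(query) & terms(paragraph))

query = "What is the maximum temperature?"
relevant = "operating temp: 150°C"
irrelevant = "The maximum order quantity is listed in the appendix."

print(overlap_score(query, relevant))    # 0
print(overlap_score(query, irrelevant))  # 3
```

The paragraph that actually answers the question scores zero, while an unrelated one wins on "the", "is", and "maximum".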

I tried fine-tuning a small BERT model on QA pairs, but that requires a ton of labeled data my client didn't have. Dead end.

What Eventually Worked: Chunk + Embed + Retrieve + Generate

After a week of frustration, I switched to a pipeline that is now almost boringly standard in the AI world, but it works:

  1. Chunk the PDF into overlapping segments of ~500 tokens.
  2. Embed each chunk into a vector using an embedding model (I used OpenAI's text-embedding-ada-002).
  3. Store vectors in a vector database (I used Pinecone, but any will do; even a local FAISS index works for prototyping).
  4. User asks a question → embed the question → retrieve top K chunks by cosine similarity.
  5. Feed those chunks + the question to an LLM (GPT-3.5-turbo) with a prompt that says: "Answer the question using only the context below."

Here's the core Python code I ended up using (simplified):

import openai
from PyPDF2 import PdfReader
import tiktoken

# Step 1: Extract text from PDF
def extract_text(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
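        # extract_text() can come back empty for image-only pages; the newline
        # keeps words at page boundaries from running together
        text += (page.extract_text() or "") + "\n"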
    return text

# Step 2: Chunk text with overlap
def chunk_text(text, chunk_size=500, overlap=50):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
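    # the loop step is chunk_size - overlap, so it must stay positive
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):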
        chunk_tokens = tokens[i:i+chunk_size]
        # avoid empty chunks
        if len(chunk_tokens) > 0:
            chunks.append(tokenizer.decode(chunk_tokens))
    return chunks

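# Step 3: Embed chunks (you'd do this once and store the vectors)
# Note: openai.Embedding and openai.ChatCompletion are the pre-1.0 openai SDK interface.
openai.api_key = "YOUR_KEY"  # placeholder; load the real key from an environment variable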

def embed_chunks(chunks):
    response = openai.Embedding.create(
        input=chunks,
        model="text-embedding-ada-002"
    )
    return [d["embedding"] for d in response["data"]]

# Assume we have stored embeddings in a vector DB. For simplicity, use a dict.
# In real code, use Pinecone/Weaviate/FAISS.
vector_db = {}

chunks = chunk_text(extract_text("manual.pdf"))
embeddings = embed_chunks(chunks)
for i, emb in enumerate(embeddings):
    vector_db[i] = emb

# Step 4: Retrieve relevant chunks
def retrieve_chunks(query, top_k=3):
    query_emb = openai.Embedding.create(
        input=[query],
        model="text-embedding-ada-002"
    )["data"][0]["embedding"]
    # Cosine similarity (simple loop)
    similarities = []
    for idx, emb in vector_db.items():
        sim = cosine_similarity(query_emb, emb)
        similarities.append((idx, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[idx] for idx, _ in similarities[:top_k]]

def cosine_similarity(a, b):
    import numpy as np
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Step 5: Generate answer
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know.'"

def answer_question(query):
    context_chunks = retrieve_chunks(query)
    context = "\n\n".join(context_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response["choices"][0]["message"]["content"]

print(answer_question("What is the maximum operating temperature?"))
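
The ranking logic in the retrieval step can be exercised without any API calls. The sketch below is a stand-in under stated assumptions: the three-dimensional vectors are made up by hand rather than produced by an embedding model, and `top_k_chunks` is a hypothetical helper name. It shows only the core idea: score every stored vector against the query by cosine similarity and keep the best `k`.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to query_vec, best first."""
    return heapq.nlargest(k, store, key=lambda cid: cosine(query_vec, store[cid]))

# Hand-made toy vectors, not real embeddings.
toy_store = {
    "temperature": [0.9, 0.1, 0.0],
    "warranty": [0.0, 0.2, 0.9],
    "pressure": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k_chunks(toy_query, toy_store))  # ['temperature', 'pressure']
```

A linear scan like this is fine for a few thousand chunks; a vector database takes over the same job once the collection grows.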

Lessons Learned (the hard way)

Chunk size matters. Too small (e.g., 200 tokens) → answers are fragmented across chunks. Too large (e.g., 2000 tokens) → the LLM's context window fills with irrelevant info, and retrieval accuracy drops. 500 tokens with a 50-token overlap worked well for my documents.
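
The size/overlap trade-off is easier to see on a toy input. The sketch below uses a list of integers as stand-in tokens instead of tiktoken output (an assumption, so that it runs without dependencies), and `window_chunks` is a hypothetical name for the same sliding-window idea.

```python
def window_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size; neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(10))  # stand-in for real token ids
for chunk in window_chunks(tokens, chunk_size=4, overlap=1):
    print(chunk)
# [0, 1, 2, 3]
# [3, 4, 5, 6]
# [6, 7, 8, 9]
# [9]
```

Each window starts with the last token of the previous one, which is what keeps a sentence that straddles a boundary retrievable. Note the short tail at the end: it is already covered by the previous window, so dropping or merging it is a reasonable refinement.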

Embedding model choice. I started with text-embedding-ada-002 because it's cheap and good. But for specialized domains (legal, medical) you might want a fine-tuned model. For my generic manuals, it was fine.

The LLM is not always honest. Even with the "only use context" prompt, GPT occasionally hallucinated. I added a post-processing step that compares the answer against the retrieved chunks: phrases like "based on the context" are fine, but if the answer states things not found in the chunks, I discard it. A smaller model like GPT-3.5-turbo is also a sensible default here: it's cheaper than GPT-4 and, in my experience, less prone to hallucination on this narrow task.
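
One way such a check can be approximated is lexical overlap. This is a sketch of that idea, not necessarily the filter used in the project: it measures what share of the answer's content words also occur in the retrieved chunks. `is_grounded`, the small stop-word list, and the 0.6 threshold are all hypothetical choices for illustration.

```python
import re

# Tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on", "for", "it"}

def content_words(text):
    """Lowercased alphanumeric tokens with common stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def is_grounded(answer, chunks, threshold=0.6):
    """True if at least `threshold` of the answer's content words occur in the chunks."""
    words = content_words(answer)
    if not words:
        return False
    context = set(content_words(" ".join(chunks)))
    supported = sum(1 for w in words if w in context)
    return supported / len(words) >= threshold

sample_chunks = ["Operating temp: 150°C. Do not exceed this value during startup."]

print(is_grounded("The operating temp is 150°C.", sample_chunks))                  # True
print(is_grounded("The valve must be replaced every six months.", sample_chunks))  # False
```

A lexical test like this is blunt: it rejects correct paraphrases and would also reject the "I don't know" fallback, so that reply needs to be handled before the check runs.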

Cost. Embedding a 100-page PDF (say 1000 chunks) costs about $0.02. Each query embedding is negligible. The LLM call costs ~$0.001 per query. For a low-traffic app, that's fine. For high traffic, consider caching frequent answers or using a local LLM like Llama 3.
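
A back-of-the-envelope estimate is easy to script, which helps when prices or volumes change. In the sketch below the per-1K-token rates are placeholders, not quoted OpenAI prices, and the function names and token counts are assumptions; plug in the current rates from the provider's pricing page and your own chunk counts.

```python
def embedding_cost(num_chunks, tokens_per_chunk, price_per_1k_tokens):
    """One-off cost of embedding every chunk at a flat per-1K-token rate."""
    return num_chunks * tokens_per_chunk / 1000 * price_per_1k_tokens

def query_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one chat completion, with input and output tokens priced separately."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

# Placeholder rates in USD per 1K tokens: assumptions, replace with real ones.
EMBED_RATE = 0.0001
CHAT_INPUT_RATE = 0.0005
CHAT_OUTPUT_RATE = 0.0015

one_off = embedding_cost(num_chunks=1000, tokens_per_chunk=500,
                         price_per_1k_tokens=EMBED_RATE)
# Assumed request shape: 3 retrieved chunks of ~500 tokens plus ~200 tokens
# of prompt, and ~150 tokens of generated answer.
per_query = query_cost(1700, 150, CHAT_INPUT_RATE, CHAT_OUTPUT_RATE)

print(f"embedding: ${one_off:.4f}, per query: ${per_query:.6f}")
```

The embedding cost is paid once per document, while the per-query cost scales with traffic, so the second number is the one to watch.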

Alternatives I considered:

  • LangChain would have saved me some boilerplate, but I wanted to understand every step. I later migrated to LangChain for production – it's solid.
  • Full-text search (Elasticsearch) combined with an LLM can work too, but you lose semantic matching at the retrieval step.
  • Commercial services like the one at ai.interwestinfo.com offer turnkey solutions – if you don't want to build the infra, that's a valid choice. But the approach I described is open and customizable.

When This Approach Doesn't Work

  • If your PDFs contain complex tables, diagrams, or handwriting, text extraction alone fails. You'd need OCR or multimodal models.
  • If your documents are very large (thousands of pages), you need a more sophisticated chunking strategy (e.g., by sections) and a vector DB built for that scale, not the in-memory dict and linear scan from the example above.
  • If latency is critical (<1s), the LLM call is the bottleneck. You might cache frequent answers or use a smaller model.

What I'd Do Differently Now

I'd start with LangChain from day one, using their RecursiveCharacterTextSplitter and built-in integration with OpenAI and Pinecone. But I'm glad I wrote the raw code first – it helped me debug the pipeline when things broke.

Also, I'd add a feedback loop: let users rate answers, and use those ratings to tune the retrieval settings or the prompt over time.

Your Turn

This stack (chunk → embed → retrieve → generate) is surprisingly robust. If you're building a document Q&A system, I'd love to hear what worked for you. Did you use a different retrieval method? Sparse vs. dense? What chunking strategy gave you the best results? Drop your experience in the comments.
