Last month, I needed to let users upload a PDF and ask questions about it. Sounds simple, right? I figured I'd throw some regex at it, maybe a keyword search. Two days later, I was staring at a wall of spaghetti code that failed on any question not phrased exactly like my test cases.
I'm a backend developer, not an NLP researcher. But I needed a solution that was reliable, scalable, and something I could ship in a few days. Here's the story of what I tried, what failed, and the approach that finally clicked.
The Problem
A client wanted a knowledge base feature: upload PDFs (manuals, reports, etc.), then ask natural language questions and get answers extracted from those PDFs. The documents were unstructured, sometimes hundreds of pages. I had to build this into an existing web app.
What I Tried First (and Why It Hurt)
I started with the naive approach: extract text with PyPDF2, split by paragraphs, build a simple TF-IDF index, and return the most relevant paragraph. Then I'd feed that paragraph into some heuristic answer extraction (e.g., find sentences containing the query words).
It failed spectacularly.
- Users asked "What is the maximum temperature?" but the PDF said "operating temp: 150°C". My keyword matching missed it because "maximum" ≠ "operating".
- Multi-sentence answers were impossible because I returned only one paragraph.
- Ambiguity was everywhere: "the valve" might be mentioned 50 pages earlier.
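The vocabulary mismatch in the first bullet is easy to reproduce. Below is a minimal sketch of TF-IDF ranking using only the standard library; it is not the code from my original attempt, and the three paragraphs are made-up stand-ins for a manual. The paragraph that actually holds the answer shares no tokens with the question, so it scores zero while an irrelevant one wins.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; no stemming, no synonyms.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return (vectors, idf) with one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical paragraphs, invented for this illustration.
paragraphs = [
    "Operating temp: 150°C.",
    "The maximum warranty period is two years.",
    "Ambient temperature does not affect the warranty.",
]
vectors, idf = tfidf_vectors(paragraphs)

query = "What is the maximum temperature?"
# Query terms that never appear in the corpus carry no weight.
query_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
scores = [cosine(query_vec, v) for v in vectors]
best = max(range(len(paragraphs)), key=scores.__getitem__)

print(scores)
print("top paragraph:", paragraphs[best])
```

The warranty paragraph ranks first only because it shares "maximum", "is" and "the" with the question. Stemming or a synonym list would patch this one case, but every new phrasing needs another patch.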
I also tried fine-tuning a small BERT model on QA pairs, but that requires a ton of labeled data my client didn't have. Dead end.
What Eventually Worked: Chunk + Embed + Retrieve + Generate
After a week of frustration, I switched to a pipeline that is now almost boringly standard in the AI world, but it works:
- Chunk the PDF into overlapping segments of ~500 tokens.
- Embed each chunk into a vector using an embedding model (I used OpenAI's text-embedding-ada-002).
- Store vectors in a vector database (I used Pinecone, but any will do; even a local FAISS index works for prototyping).
- User asks a question → embed the question → retrieve top K chunks by cosine similarity.
- Feed those chunks + the question to an LLM (GPT-3.5-turbo) with a prompt that says: "Answer the question using only the context below."
Here's the core Python code I ended up using (simplified):
import os

import numpy as np
import openai  # uses the legacy (pre-1.0) SDK interface: openai.Embedding / openai.ChatCompletion
import tiktoken
from PyPDF2 import PdfReader

openai.api_key = os.environ["OPENAI_API_KEY"]  # read the key from the environment, not the source
EMBED_MODEL = "text-embedding-ada-002"

# Step 1: Extract text from PDF
def extract_text(pdf_path):
    reader = PdfReader(pdf_path)
    # Guard with `or ""` in case a page (e.g. a scanned image) yields no text;
    # join with newlines so words at page boundaries don't run together.
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Step 2: Chunk text with overlap
def chunk_text(text, chunk_size=500, overlap=50):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        # Stop once a chunk reaches the end; otherwise the final chunk would
        # contain nothing but the previous chunk's overlap.
        if i + chunk_size >= len(tokens):
            break
    return chunks

# Step 3: Embed chunks (you'd do this once and store the vectors)
def embed_chunks(chunks, batch_size=100):
    embeddings = []
    # Send the chunks in batches rather than one oversized request.
    for start in range(0, len(chunks), batch_size):
        response = openai.Embedding.create(
            input=chunks[start:start + batch_size],
            model=EMBED_MODEL,
        )
        embeddings.extend(d["embedding"] for d in response["data"])
    return embeddings

# Assume we have stored embeddings in a vector DB. For simplicity, use a dict.
# In real code, use Pinecone/Weaviate/FAISS.
chunks = chunk_text(extract_text("manual.pdf"))
vector_db = dict(enumerate(embed_chunks(chunks)))  # chunk index -> embedding

# Step 4: Retrieve relevant chunks
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve_chunks(query, top_k=3):
    query_emb = embed_chunks([query])[0]
    # Cosine similarity against every stored vector (simple loop)
    similarities = [(idx, cosine_similarity(query_emb, emb)) for idx, emb in vector_db.items()]
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[idx] for idx, _ in similarities[:top_k]]

# Step 5: Generate answer
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know.'"

def answer_question(query):
    context = "\n\n".join(retrieve_chunks(query))
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(answer_question("What is the maximum operating temperature?"))
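If you want to sanity-check the retrieval step without spending API calls, you can run the same ranking logic on hand-made vectors. This is a small stdlib-only sketch: `toy_index` and its 3-dimensional vectors are invented for illustration (real ada-002 embeddings have 1536 dimensions), and `top_k` is a hypothetical helper name, not part of any library.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    scored = ((cosine_similarity(query_vec, vec), idx) for idx, vec in index.items())
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Made-up stand-ins for stored chunk embeddings.
toy_index = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.0, 0.0, 1.0],
}

result = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(result)
```

`heapq.nlargest` avoids sorting the whole list when you only need a handful of results. That makes little difference at a few thousand chunks, which is roughly where a brute-force loop stops being comfortable and a real vector index starts to pay off.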
Lessons Learned (the hard way)
Chunk size matters. Too small (e.g., 200 tokens) → answers are fragmented across chunks. Too large (e.g., 2000 tokens) → the LLM's context window fills with irrelevant info, and retrieval accuracy drops. 500 tokens with a 50-token overlap worked well for my documents.
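To get a feel for how chunk size and overlap interact, it helps to look at the token offsets alone. The sketch below needs no tokenizer: `chunk_spans` is a hypothetical helper that returns (start, end) offsets, and the 60,000-token document length is an assumed figure, not a measurement from my PDFs.

```python
def chunk_spans(n_tokens, chunk_size=500, overlap=50):
    """Return (start, end) token offsets for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each chunk start advances
    spans = []
    for start in range(0, n_tokens, stride):
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # Last chunk reached; anything further would be pure overlap.
            break
    return spans


# Consecutive chunks share `overlap` tokens.
print(chunk_spans(1000, chunk_size=500, overlap=50))

# Assumed document length; overlap kept at 10% of the chunk size.
for size in (200, 500, 2000):
    spans = chunk_spans(60_000, chunk_size=size, overlap=size // 10)
    print(f"chunk_size={size}: {len(spans)} chunks")
```

Smaller chunks mean more vectors to store and search. Larger chunks mean fewer vectors but more tokens sent to the LLM per query, since top K is counted in chunks.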
Embedding model choice. I started with text-embedding-ada-002 because it's cheap and good. But for specialized domains (legal, medical) you might want a fine-tuned model. For my generic manuals, it was fine.
The LLM is not always honest. Even with the "only use context" prompt, GPT occasionally hallucinated. I added a post-processing step: hedging phrases like "based on the context" are fine, but if the answer states things that can't be found in the retrieved chunks, I discard it. Model choice matters too: a smaller model like GPT-3.5-turbo is cheaper than GPT-4, and for this narrow task I found it less prone to hallucination.
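One crude way to approximate that kind of check is plain word overlap between each answer sentence and the retrieved chunks. The sketch below is an illustration under that assumption, not my production filter: `unsupported_sentences`, the stopword list, and the 0.6 threshold are all made up for the example, and a paraphrased but correct answer would be flagged too.

```python
import re

# Small, hand-picked stopword list (an assumption; tune for your documents).
STOPWORDS = {
    "the", "a", "an", "is", "are", "of", "to", "in", "and", "or",
    "it", "that", "this", "for", "on", "with", "as", "at", "by", "be",
}


def content_words(text):
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def unsupported_sentences(answer, chunks, min_overlap=0.6):
    """Return answer sentences whose content words are mostly absent from the chunks."""
    context_vocab = content_words(" ".join(chunks))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & context_vocab) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


chunks = ["Operating temp: 150°C. Do not exceed the rated pressure of 10 bar."]
grounded = "The operating temp is 150°C."
invented = "The valve was redesigned in 2019 by the original manufacturer."

print(unsupported_sentences(grounded, chunks))
print(unsupported_sentences(invented, chunks))
```

Lexical overlap only catches answers that drift far from the source text. It cannot tell "150°C" from "not 150°C", so treat it as a cheap first filter, not a guarantee.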
Cost. Embedding is cheap: even a generous estimate of 1000 chunks of 500 tokens for a 100-page PDF comes to 500K tokens, or about $0.05 at $0.0001 per 1K tokens for text-embedding-ada-002. Each query embedding is negligible. The LLM call costs ~$0.001 per query. For a low-traffic app, that's fine. For high traffic, consider caching frequent answers or using a local LLM like Llama 3.
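The arithmetic behind estimates like these is simple enough to keep in a helper. In the sketch below, the per-1K-token prices are assumed placeholders that will go stale, so check the current pricing page before trusting the output. The 100-token prompt overhead and 150-token answer length are guesses as well.

```python
def embedding_cost(n_chunks, tokens_per_chunk, usd_per_1k_tokens):
    """One-time cost of embedding a document."""
    return n_chunks * tokens_per_chunk / 1000 * usd_per_1k_tokens


def query_cost(context_tokens, answer_tokens, usd_per_1k_input, usd_per_1k_output):
    """Cost of a single LLM call (the query embedding is ignored as negligible)."""
    return (context_tokens / 1000 * usd_per_1k_input
            + answer_tokens / 1000 * usd_per_1k_output)


# ASSUMED prices in USD per 1K tokens; replace with current list prices.
EMBED_PRICE = 0.0001
INPUT_PRICE = 0.0005
OUTPUT_PRICE = 0.0015

one_time = embedding_cost(n_chunks=1000, tokens_per_chunk=500, usd_per_1k_tokens=EMBED_PRICE)

# 3 retrieved chunks of 500 tokens, plus ~100 tokens of prompt and question (a guess).
per_query = query_cost(
    context_tokens=3 * 500 + 100,
    answer_tokens=150,
    usd_per_1k_input=INPUT_PRICE,
    usd_per_1k_output=OUTPUT_PRICE,
)

print(f"embedding (one-time): ${one_time:.4f}")
print(f"per query:            ${per_query:.6f}")
print(f"10,000 queries:       ${per_query * 10_000:.2f}")
```

The per-query cost scales with top K times chunk size, so the retrieval settings drive the bill as well as the answer quality.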
Alternatives I considered:
- LangChain would have saved me some boilerplate, but I wanted to understand every step. I later migrated to LangChain for production – it's solid.
- Full-text search (Elasticsearch) combined with an LLM can work too, but retrieval is then keyword-based, so you lose semantic matching.
- Commercial services like the one at ai.interwestinfo.com offer turnkey solutions – if you don't want to build the infra, that's a valid choice. But the approach I described is open and customizable.
When This Approach Doesn't Work
- If your PDFs contain complex tables, diagrams, or handwriting, text extraction alone fails. You'd need OCR or multimodal models.
- If your documents are very large (thousands of pages), you need a more sophisticated chunking strategy (e.g., by sections) and a better vector DB.
- If latency is critical (<1s), the LLM call is the bottleneck. You might cache frequent answers or use a smaller model.
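For the caching idea in the last bullet, an in-memory cache keyed on a normalized question is often enough to start with. This is a minimal sketch: `AnswerCache` is a hypothetical class, `fake_answer` stands in for the real pipeline call, and the normalization (lowercasing, collapsing whitespace, dropping trailing punctuation) only catches near-identical phrasings.

```python
import re


class AnswerCache:
    """Tiny FIFO cache in front of an expensive answer function."""

    def __init__(self, answer_fn, max_entries=1000):
        self._answer_fn = answer_fn
        self._max_entries = max_entries
        self._store = {}  # dicts preserve insertion order

    @staticmethod
    def _normalize(query):
        collapsed = re.sub(r"\s+", " ", query.strip().lower())
        return collapsed.rstrip("?!. ")

    def ask(self, query):
        key = self._normalize(query)
        if key in self._store:
            return self._store[key]
        answer = self._answer_fn(query)
        if len(self._store) >= self._max_entries:
            # Evict the oldest entry to bound memory use.
            self._store.pop(next(iter(self._store)))
        self._store[key] = answer
        return answer


calls = []


def fake_answer(query):
    # Stand-in for the real retrieve + generate call.
    calls.append(query)
    return f"answer to: {query}"


cache = AnswerCache(fake_answer)
first = cache.ask("What is the maximum operating temperature?")
second = cache.ask("  what is the MAXIMUM operating temperature ")

print(len(calls), first == second)
```

Remember to clear the cache whenever the underlying documents are re-indexed, or users will keep getting answers from the old version.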
What I'd Do Differently Now
I'd start with LangChain from day one, using their RecursiveCharacterTextSplitter and built-in integration with OpenAI and Pinecone. But I'm glad I wrote the raw code first – it helped me debug the pipeline when things broke.
Also, I'd add a feedback loop: let users rate answers, and use those ratings to fine-tune the retrieval or prompt over time.
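On the chunking side, the idea behind a recursive splitter is to break text on the coarsest separator first (paragraphs) and fall back to finer ones (lines, sentences, words) only for pieces that are still too long. The sketch below is a simplified stdlib illustration of that idea, not LangChain's implementation: `recursive_split` is a hypothetical function, it counts characters instead of tokens, and it adds no overlap.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "Section one.\n\nSection two is short.\n\n" + "word " * 100
pieces = recursive_split(sample, max_chars=60)

print(len(pieces))
print(repr(pieces[0]))
```

The two short sections stay together in one chunk, while the long run of words is cut on spaces only because no coarser boundary was available. That is the property that makes this style of splitting friendlier to section-structured manuals than fixed-size windows.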
Your Turn
This stack (chunk → embed → retrieve → generate) is surprisingly robust. If you're building a document Q&A system, I'd love to hear what worked for you. Did you use a different retrieval method? Sparse vs. dense? What chunking strategy gave you the best results? Drop your experience in the comments.