A few months ago, I found myself drowning in documentation. My team had accumulated hundreds of markdown files (API references, internal guides, architecture decisions) spread across a monorepo. Every time someone asked "How do we deploy to staging?" or "What's the rate limit for that endpoint?", I'd either dig through files or ping the person who wrote the doc (who had usually forgotten the details).
I wanted a bot I could ask questions in natural language and get accurate answers instantly. Here's the journey—including the dead ends, the "aha" moments, and the trade-offs I still live with.
The Problem (My Problem)
Our docs were well-organized, but searching was painful. grep -r found keywords but not context. "What's the max request size?" would return a file with "max_request_size: 10MB" but also a dozen irrelevant hits. I needed semantic understanding.
What I Tried That Didn't Work
Attempt 1: Naive Keyword Search
I wrote a simple script that tokenized the query and ranked files by term frequency. It was fast but dumb. "How do I reset my password?" returned nothing about "password reset" because the docs used "change password". Synonym handling was a rabbit hole I didn't want to go down.
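To make the failure mode concrete, here is a minimal sketch of that kind of term-frequency ranking. It is an illustration, not the original script, and the two sample docs and the function names are made up. Note how the filler words in the question ("how", "do", "I", "my") outvote the one word that matters.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_by_term_frequency(query, docs):
    """Score each doc by how often the query's words appear in it."""
    query_terms = set(tokenize(query))
    scores = {}
    for name, text in docs.items():
        counts = Counter(tokenize(text))
        scores[name] = sum(counts[term] for term in query_terms)
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical two-file doc set
docs = {
    "accounts.md": "To change your password, open Settings and pick a new one.",
    "deploy.md": "How do I deploy to staging? I run my deploy script.",
}

ranking = rank_by_term_frequency("How do I reset my password?", docs)
print(ranking)  # [('deploy.md', 5), ('accounts.md', 1)]
```

The deploy page wins because it shares more filler words with the question, while "reset" never matches "change" at all. Stop-word lists and synonym tables can patch this, which is exactly the rabbit hole mentioned above.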
Attempt 2: Fine-Tuning a Small Model
I thought, "Let's fine-tune a BERT model on our docs to answer questions." I spent a weekend collecting Q&A pairs from Slack history. The model learned to parrot common answers but failed on anything slightly novel. Plus, every doc update meant retraining. Not sustainable.
Attempt 3: Prompting GPT-3 with Full Docs
I tried stuffing the entire doc set into a prompt. Token limits killed me. Even with truncation, the model hallucinated answers from irrelevant sections. Cost was also a concern: each query cost only pennies, but that adds up at scale.
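A back-of-the-envelope check shows why this approach hits the wall so quickly. The sketch below uses the rough "about 4 characters per token" rule of thumb for English text rather than a real tokenizer, and the corpus size and the 4096-token window are assumptions for illustration. Check your model's actual limit.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; real tokenizers vary

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def prompt_budget(texts, context_window=4096, reserved_for_answer=500):
    """Return (estimated_tokens, fits) for putting all texts in one prompt."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total <= context_window - reserved_for_answer

# Hypothetical corpus: 300 files of about 4,000 characters each
corpus = ["x" * 4000 for _ in range(300)]
total, fits = prompt_budget(corpus)
print(total, fits)  # 300000 False
```

Even with a much larger window, sending the full corpus on every question means paying for all of those tokens on every query.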
What Eventually Worked: Retrieval-Augmented Generation (RAG)
RAG is the sweet spot: you retrieve the most relevant chunks of your docs, then feed only those into an LLM to generate the answer. No fine-tuning, no squeezing the whole doc set into one prompt, and the model stays grounded in your actual content.
Here's the high-level flow:
- Chunk your documents into pieces (e.g., 500 tokens each).
- Embed each chunk into a vector using an embedding model (e.g., OpenAI text-embedding-3-small).
- Store vectors in a vector database (I used ChromaDB for simplicity).
- Query: embed the user's question, find the top-k similar chunks, then send those chunks + question to an LLM for the final answer.
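Before bringing in any services, the flow above can be sketched end to end with the standard library. This is a toy under loud assumptions: `embed` is a word-count vector standing in for a real embedding model (so it has no semantic understanding), `ToyVectorStore` is a hypothetical stand-in for a vector database, and the three chunks are sample data. The final LLM call is left out, so the sketch stops at building the prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def query(self, question, top_k=3):
        q_vec = embed(question)
        scored = [(cosine_similarity(q_vec, vec), chunk) for chunk, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:top_k]]

def build_prompt(question, chunks):
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

store = ToyVectorStore()
store.add([
    "The API rejects bodies above max_request_size, which defaults to 10MB.",
    "Deploy to staging by running the release pipeline on the staging branch.",
    "Rate limits are enforced per API key.",
])
top = store.query("what is the max_request_size", top_k=1)
prompt = build_prompt("what is the max_request_size", top)
print(prompt)
```

Swapping `embed` for a real embedding model and `ToyVectorStore` for a vector database keeps the same shape: add chunks once, then query and build a prompt per question.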
The Code
I'll show a minimal working version in Python. You'll need openai (version 1.0 or later, since the code calls openai.chat.completions), chromadb, and tiktoken.
import os
import openai
import chromadb
from chromadb.utils import embedding_functions
import tiktoken

# Setup
openai.api_key = os.getenv("OPENAI_API_KEY")
client = chromadb.Client()  # in-memory by default
collection = client.create_collection(
    name="docs",
    embedding_function=embedding_functions.OpenAIEmbeddingFunction(
        api_key=openai.api_key,
        model_name="text-embedding-3-small"
    )
)

# Chunking function (simple by token count)
def chunk_text(text, max_tokens=500):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i+max_tokens]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

# Load documents (assuming each file is one doc)
docs_dir = "./my-docs"
all_chunks = []
for filename in sorted(os.listdir(docs_dir)):
    path = os.path.join(docs_dir, filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories; os.listdir is not recursive
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    chunks = chunk_text(text)
    for j, chunk in enumerate(chunks):
        all_chunks.append({
            "id": f"{filename}_{j}",
            "text": chunk,
            "metadata": {"source": filename}
        })

# Add to ChromaDB (chunks are embedded here by the collection's embedding function)
collection.add(
    ids=[c["id"] for c in all_chunks],
    documents=[c["text"] for c in all_chunks],
    metadatas=[c["metadata"] for c in all_chunks]
)

# Query function
def ask(question, top_k=3):
    # Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=top_k
    )
    # results["documents"] holds one list per query text; we sent one question
    context = "\n\n".join(results["documents"][0])
    # Generate answer
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # cheaper, fast enough
        messages=[
            {"role": "system", "content": "Answer the question based on the context provided. If you cannot answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

# Example
print(ask("What is the max request size?"))
That's it. Run it, and you have a Q&A bot over your docs.
Lessons Learned & Trade-offs
Chunking Strategy Matters
I started with fixed token windows. That sometimes split sentences awkwardly. Better: use paragraph or section boundaries. Libraries like langchain have recursive character text splitters. I wish I'd used one from day one.
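For a sense of what boundary-aware chunking looks like without pulling in a library, here is a minimal sketch that packs whole paragraphs into chunks. It is a simplification and not what langchain does internally. The size budget is counted in words rather than tokens to keep it dependency-free, and `chunk_by_paragraph` is a name invented for this example.

```python
import re

def chunk_by_paragraph(text, max_words=200):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk
    instead of being split mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_words = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if this paragraph would push it over budget
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "First paragraph here.\n\nSecond paragraph, a bit longer.\n\nThird one."
print(chunk_by_paragraph(sample, max_words=8))
```

With `max_words=8`, the first two paragraphs (3 + 5 words) share a chunk and the third starts a new one. A sensible next step would be a fallback that splits oversized paragraphs on sentence boundaries.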
Embedding Model Choice
OpenAI's embeddings are great but cost money. Free alternatives like all-MiniLM-L6-v2 (from sentence-transformers) work well for small doc sets. I benchmarked both; the difference was negligible for my use case.
Retrieval Quality
Top-k is a knob. Too few chunks and you miss context; too many and you confuse the LLM. I settled on k=3 for my docs. Also, consider re-ranking retrieved chunks with a cross-encoder for better precision.
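The re-ranking idea is a two-stage pattern: retrieve a generous candidate list first, then re-score each candidate against the question and keep only the best few. The sketch below shows just that shape. The `overlap_score` function is a toy stand-in, and in a real setup you would pass a function that returns a cross-encoder's relevance score instead. The candidate chunks are made-up sample data.

```python
def rerank(question, candidates, score_fn, keep=3):
    """Re-order first-stage candidates by score_fn(question, chunk); keep the best."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:keep]

def overlap_score(question, chunk):
    """Toy scorer: fraction of question words that also appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Pretend these came back from a first-stage vector search with a larger k
candidates = [
    "staging deploys run nightly",
    "the max request size is 10MB",
    "request logs are kept for 30 days",
]
best = rerank("max request size", candidates, overlap_score, keep=2)
print(best)
```

Because the scorer is just a function argument, the first-stage k and the final `keep` can be tuned independently: retrieve wide, then send only a handful of chunks to the LLM.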
Latency & Cost
Each query involves an embedding call plus an LLM call. Total latency is ~2-3 seconds. Cost is ~$0.001 per query with gpt-4o-mini. For internal tools, that's fine. For a public-facing tool, you'd want caching and maybe a cheaper model.
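The simplest form of caching is an exact-match answer cache in front of the query function. The sketch below is one possible shape, not a production design: `make_cached_ask` is a hypothetical helper, the cache is a plain in-process dict, and `fake_ask` stands in for the real LLM-backed function so the example runs offline.

```python
def make_cached_ask(ask_fn):
    """Wrap ask_fn so repeated questions skip the embedding and LLM calls."""
    cache = {}

    def cached_ask(question):
        # Normalize case and whitespace so trivial variations share an entry
        key = " ".join(question.lower().split())
        if key not in cache:
            cache[key] = ask_fn(question)
        return cache[key]

    return cached_ask

# Stand-in for the real query function; records how often it is called
calls = []
def fake_ask(question):
    calls.append(question)
    return f"answer to: {question}"

cached = make_cached_ask(fake_ask)
first = cached("What is the max request size?")
second = cached("  what is the MAX request size? ")
print(len(calls))  # 1
```

Two caveats: this only catches near-identical wording, and cached answers go stale when the docs change, so the cache needs clearing whenever you re-index.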
When NOT to Use This
- If your docs are tiny (under 50 pages), a simple keyword search + LLM prompt might be enough.
- If you need real-time updates, vector DB re-indexing can be slow. Consider a hybrid approach.
- If your questions require multi-hop reasoning (e.g., "What's the max request size for the endpoint that returns user data?"), RAG alone struggles. You'd need graph-based retrieval or iterative reasoning.
What I'd Do Differently
- Evaluate retrieval early. I assumed my chunking was fine until I saw bad answers. I should have built a small test set of questions and manually checked which chunks were retrieved.
- Use a managed vector DB from the start. ChromaDB is great for prototyping, but for production I moved to Pinecone. If you want a hosted solution with built-in embedding, services like ai.interwestinfo.com offer similar capabilities—though I haven't personally used them.
- Add logging. Without tracking queries and answers, I couldn't improve the system. Now I log every Q&A pair and periodically review failures.
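A retrieval test set can be very small and still be useful. Here is a minimal sketch of a hit-rate check: for each hand-written question, does the expected source file show up in the top-k results? The function name, the test questions, and the `fake_retrieve` stub are all made up for illustration. With ChromaDB, the retriever could wrap `collection.query` and read the "source" field from `results["metadatas"][0]`.

```python
def hit_rate_at_k(test_set, retrieve_fn, k=3):
    """Return (hit_rate, missed_questions) for a hand-labeled test set.

    test_set: list of (question, expected_source) pairs.
    retrieve_fn: callable (question, k) -> list of source names, best first.
    """
    if not test_set:
        return 0.0, []
    misses = []
    for question, expected in test_set:
        if expected not in retrieve_fn(question, k):
            misses.append(question)
    return 1 - len(misses) / len(test_set), misses

# Stub retriever with canned results, standing in for a real vector search
def fake_retrieve(question, k):
    canned = {
        "How do we deploy to staging?": ["deploy.md", "ci.md", "faq.md"],
        "What's the rate limit?": ["faq.md", "auth.md", "limits.md"],
    }
    return canned.get(question, [])[:k]

test_set = [
    ("How do we deploy to staging?", "deploy.md"),
    ("What's the rate limit?", "limits.md"),
]
rate_k3, misses_k3 = hit_rate_at_k(test_set, fake_retrieve, k=3)
rate_k2, misses_k2 = hit_rate_at_k(test_set, fake_retrieve, k=2)
print(rate_k3, rate_k2, misses_k2)  # 1.0 0.5 ["What's the rate limit?"]
```

Running this after every chunking or top-k change separates retrieval problems from generation problems, and the logged failures are a natural source of new test questions.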
Final Thoughts
Building this bot saved my team hours every week. The RAG pattern is powerful because it combines the flexibility of LLMs with the reliability of your own data. It's not perfect—hallucinations still happen, and chunking is more art than science—but it's the best practical solution I've found.
What's your approach to building document Q&A systems? Any chunking tricks I should try?