A few months ago, our team had a problem: every new developer spent days digging through scattered documentation, old slide decks, and Slack threads just to understand how our microservices talked to each other. I thought, why not build a chatbot that can answer those questions? Something like a mini GPT trained on our internal docs.
Spoiler: I made a lot of mistakes before I got anything useful. Here’s the honest story of building a retrieval-augmented generation (RAG) bot for code documentation, including the dead ends, the working approach, and the trade-offs I wish I'd known earlier.
The problem: context overload
I started simple. I dumped all our Markdown files into a single prompt and asked GPT-4: “Answer questions based on this text.” That worked for exactly one question. Then the token limit hit, the model started hallucinating random service names, and every query cost $0.15. For a team of 20 developers, that would burn through our budget in a week.
I needed a way to retrieve only the relevant pieces of documentation, not the entire encyclopedia.
What I tried that didn’t work
1. Keyword search (TF‑IDF)
I built a basic TF‑IDF index. It kind of worked: if a developer asked exactly “What does the auth service return?” and the docs contained those words, it found the right paragraph. But queries that used synonyms, paraphrases, or a different name for the same thing (like “auth service” vs. “login endpoint”) missed completely. Short queries failed, and long queries drifted.
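To make that failure concrete, here is a minimal pure-Python sketch of TF‑IDF scoring. It is not the index I actually built; the two-paragraph corpus, the smoothed IDF formula, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical two-paragraph corpus, for illustration only.
docs = [
    "the login endpoint returns a signed session token",
    "the billing service retries failed payments three times",
]

def tokenize(text):
    return text.lower().split()

def tfidf_vector(tokens, doc_freq, n_docs):
    # Term frequency times a smoothed IDF, so unseen terms never divide by zero.
    counts = Counter(tokens)
    return {
        term: (count / len(tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in counts.items()
    }

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

tokenized = [tokenize(d) for d in docs]
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
doc_vectors = [tfidf_vector(t, doc_freq, len(docs)) for t in tokenized]

def search(query):
    # Returns one similarity score per document, in corpus order.
    q = tfidf_vector(tokenize(query), doc_freq, len(docs))
    return [cosine(q, d) for d in doc_vectors]

# Both queries are about the same thing, but only the second one shares
# a distinctive term with the paragraph that answers it.
scores_synonym = search("auth service response")
scores_exact = search("login endpoint response")
print("synonym query:", scores_synonym)
print("exact query:  ", scores_exact)
```

The synonym query scores zero against the login paragraph and matches the billing paragraph instead, purely because both contain the word “service”. That is the vocabulary mismatch embeddings are meant to smooth over.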
2. Naive chunking without overlap
I split documents into 500-character chunks, embedded them with a small model, and used cosine similarity to find the top 3 chunks. The results were erratic: the retrieved chunks often cut off in the middle of a sentence, or they missed the critical sentence that connected two paragraphs.
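A small sketch shows why fixed-width character chunks hurt. The sample text and the 60-character window are made up to keep the output short (my real chunks were 500 characters), and `naive_chunks` is a hypothetical helper, not code from the project.

```python
def naive_chunks(text, size):
    # Fixed-width character windows: no overlap, no regard for word
    # or sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical documentation snippet, for illustration only.
doc = (
    "The orders service publishes an event after checkout. "
    "The auth service validates the token before any request is forwarded."
)

chunks = naive_chunks(doc, 60)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The first window ends partway through the word “auth”, so no single chunk contains the phrase “auth service” even though the document does. A query about the auth service then has nothing clean to match against.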
3. Just using the biggest model I could find
I tried throwing everything at GPT‑4 with a system prompt like “You are a developer assistant. Answer concisely.” Without proper retrieval, the model either lost the relevant passage among all the irrelevant context or made things up when it couldn't find the answer. And it was slow: 5‑10 seconds per response.
What finally worked: retrieval-augmented generation (RAG)
The breakthrough came when I stopped treating the LLM as a search engine and started treating it as a reading comprehension engine on top of a dedicated search index.
Here’s the pipeline I settled on:
- Chunk documentation into overlapping pieces (~300 tokens, with 50-token overlap).
- Embed each chunk into a vector.
- Store vectors in a fast similarity search index (I used FAISS at first, then moved to a hosted vector store).
- At query time: embed the question, find the top 5 most similar chunks, and feed only those into the LLM prompt.
- Generate answer – the LLM reads the chunks and answers (or says “I don’t know”).
I won’t dump the entire codebase here, but here’s the core Python snippet that made the difference:
import faiss
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# 1. Chunking with overlap
# Note: sizes are counted in whitespace-separated words, a rough proxy for tokens.
def chunk_text(text, chunk_size=300, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        if end == len(words):
            break  # last chunk reached; otherwise start would never pass len(words)
        start = end - overlap
    return chunks

# 2. Embed chunks (using a free model)
model = SentenceTransformer('all-MiniLM-L6-v2')
all_chunks = []  # list of chunk strings; fill this from chunk_text() before encoding
all_embeddings = model.encode(all_chunks)

# 3. Index with FAISS
dimension = all_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.asarray(all_embeddings, dtype="float32"))  # FAISS expects float32

# 4. Retrieve and answer
def answer_query(query, index, chunks, model, llm_client):
    q_emb = model.encode([query])
    distances, indices = index.search(q_emb, 5)
    # FAISS pads the result with -1 when the index holds fewer than 5 vectors
    retrieved = [chunks[i] for i in indices[0] if i != -1]
    context = "\n---\n".join(retrieved)
    prompt = f"""Use the following documentation to answer the question.
If you cannot find the answer, say "I don't know."
Documentation:
{context}
Question: {query}
Answer:"""
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content

# Example usage
# client = OpenAI(api_key="sk-...")
# print(answer_query("How do I deploy the auth service?", index, all_chunks, model, client))
That snippet alone reduced hallucination by ~80% and cut per-query cost to under $0.01 (by using gpt-4o-mini instead of full GPT‑4).
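One detail worth spelling out: my earlier experiment ranked chunks by cosine similarity, while `IndexFlatL2` ranks by (squared) Euclidean distance. If the embeddings are normalized to unit length, the two give the same ordering, because the squared distance equals 2 - 2·cos. Here is a minimal sketch of that identity with made-up 3-dimensional vectors; whether your embedding model normalizes its output is something to check rather than assume.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_unit(a, b):
    # Plain dot product; equals cosine similarity only for unit vectors.
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy "embeddings", invented for illustration.
query = normalize([1.0, 2.0, 0.5])
chunk_vectors = [
    normalize(v)
    for v in ([1.0, 1.8, 0.4], [0.2, 0.1, 3.0], [2.0, 0.5, 0.5])
]

# Smallest distance first vs. largest similarity first.
by_l2 = sorted(range(len(chunk_vectors)),
               key=lambda i: squared_l2(query, chunk_vectors[i]))
by_cosine = sorted(range(len(chunk_vectors)),
                   key=lambda i: -cosine_unit(query, chunk_vectors[i]))
print(by_l2, by_cosine)
```

If the vectors are not normalized, the two rankings can diverge. In that case, normalizing before indexing or using an inner-product index are the usual options.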
The tool that helped me skip boilerplate
I eventually moved from a local FAISS index to a managed solution because our documentation kept growing and I didn't want to re‑index every time someone updated a Confluence page. I tried a few services, including the one at ai.interwestinfo.com (which handled chunking, embedding, and retrieval out‑of‑the‑box). But the technique is what matters – any platform that offers vector search + LLM integration would work similarly.
Lessons learned & trade-offs
Chunk size matters a lot.
- 150 tokens: too little context, the LLM can't see the full picture.
- 1000 tokens: too much noise, retrieval quality drops.
- I landed on 300 tokens with a 50-token overlap, which gave the best balance between precision and recall.
Overlap is not optional.
Without overlap, I missed connections between sections. With 15‑20% overlap, the retriever found the right chunk more consistently.
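Here is a toy sketch of what overlap buys you, using numbered placeholder tokens instead of real documentation. The 300/50 sizes match the ones I settled on (50/300 is roughly 17% overlap); `sliding_chunks` and `contains_phrase` are hypothetical helpers written for this example.

```python
def sliding_chunks(tokens, size, overlap):
    # Each window starts `size - overlap` tokens after the previous one.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def contains_phrase(chunk, phrase):
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))

# Stand-in document: 600 numbered tokens, with a 4-token phrase that
# straddles position 300 (exactly where a 300-token chunk would end).
tokens = [f"w{i}" for i in range(600)]
phrase = tokens[298:302]

no_overlap = sliding_chunks(tokens, size=300, overlap=0)
with_overlap = sliding_chunks(tokens, size=300, overlap=50)

found_without = any(contains_phrase(c, phrase) for c in no_overlap)
found_with = any(contains_phrase(c, phrase) for c in with_overlap)
print(len(no_overlap), found_without)
print(len(with_overlap), found_with)
```

Without overlap the phrase is split across two chunks and neither contains it whole. With a 50-token overlap the second window starts at token 250 and picks it up intact, at the cost of one extra chunk to embed and store.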
Embedding model choice:
- Lightweight models (all-MiniLM-L6-v2) are fast and free but less accurate with code-heavy docs.
- Larger models (text‑embedding‑3‑large) increase cost but improve retrieval.
- For internal dev docs, the small model was good enough – I only needed 90%+ recall, not perfection.
Latency: the full pipeline took just under 2 seconds: ~200 ms to embed the query, ~10 ms for the FAISS search, and ~1.5 seconds for the LLM call. Acceptable for a chat interface.
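If you want the same kind of per-stage breakdown for your own pipeline, a small timing helper is enough. This is a sketch, not my actual instrumentation: the stage names are made up, and the `time.sleep` calls are placeholders for the real embedding, search, and LLM calls.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Record wall-clock time for one pipeline stage, even if it raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Placeholder stages: replace the sleeps with the real calls.
with timed("embed_query"):
    time.sleep(0.02)
with timed("vector_search"):
    time.sleep(0.001)
with timed("llm_call"):
    time.sleep(0.05)

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage:>14}: {seconds * 1000:7.1f} ms ({seconds / total:5.1%})")
```

With numbers like mine the LLM call dominates, so tuning the vector search is unlikely to move the total much.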
Cost:
- Embedding 10,000 chunks with a free model: $0.
- Hosted vector store: $5‑20/month depending on size.
- LLM calls (gpt-4o-mini): ~$0.002 per query. For 200 queries/day, that’s $12/month.
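The arithmetic behind those numbers, as a quick sketch. The per-query prices and query volume come from this post; the 30-day month and the idea of running the old all-in-one prompt at the same volume are assumptions added for comparison.

```python
# Figures quoted in this post; the 30-day month is an assumption.
COST_PER_QUERY_USD = 0.002
QUERIES_PER_DAY = 200
DAYS_PER_MONTH = 30

def monthly_llm_cost(cost_per_query, queries_per_day, days=DAYS_PER_MONTH):
    return cost_per_query * queries_per_day * days

llm_monthly = monthly_llm_cost(COST_PER_QUERY_USD, QUERIES_PER_DAY)

# The original "everything in one prompt" approach cost about $0.15 per query.
# Running it at the same volume is a hypothetical comparison.
naive_monthly = monthly_llm_cost(0.15, QUERIES_PER_DAY)

# Hosted vector store range quoted above: $5 to $20 per month.
total_low = llm_monthly + 5
total_high = llm_monthly + 20

print(f"LLM calls: ${llm_monthly:.2f}/month")
print(f"Total with vector store: ${total_low:.2f} to ${total_high:.2f}/month")
print(f"Naive prompt at the same volume: ${naive_monthly:.2f}/month")
```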
What I’d do differently next time
- Evaluate retrieval properly. I manually checked 50 queries – that’s not enough. I’d build a small test set with known answers and measure recall@k.
- Use hybrid search. Pure vector search can miss exact keyword matches. Next time I’ll combine BM25 + vector similarity (reciprocal rank fusion).
- Let users give feedback. I didn’t log “was this answer helpful?” – now I wish I had, so I could tune chunking, overlap, or the prompt.
- Consider smaller, fine-tuned models. For a very specific domain (internal APIs), a fine-tuned 7B model might be cheaper and faster than a general-purpose LLM.
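The first two items are easy to prototype. Below is a minimal sketch of recall@k and reciprocal rank fusion (RRF) in plain Python. I haven't built this into the bot yet, so treat it as an illustration: the chunk ids and rankings are invented, and `k=60` is a commonly used RRF constant that I'm assuming as the default.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the known-relevant chunks that appear in the top k results.
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of ids, best first. An id's fused score is the
    # sum of 1 / (k + rank) over every ranking it appears in (rank starts at 1).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))

# Hypothetical rankings for one test question with two known-relevant chunks.
bm25_ranking = ["c7", "c2", "c9", "c4"]
vector_ranking = ["c2", "c5", "c7", "c1"]
relevant = {"c2", "c7"}

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])

print("vector recall@2:", recall_at_k(vector_ranking, relevant, 2))
print("fused recall@2: ", recall_at_k(fused, relevant, 2))
print("fused order:    ", fused)
```

In this toy case, vector search alone puts only one of the two relevant chunks in its top 2, while the fused ranking surfaces both. Averaging recall@k over a real test set of questions would show whether hybrid search helps on your own docs.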
Final thoughts
RAG is not magic – it’s a careful engineering puzzle of chunking, embedding, retrieval, and prompt design. The tools (like FAISS, Pinecone, or managed APIs) are just implementation details. What matters is understanding where your pipeline breaks: bad chunks → bad retrieval → bad answers.
We now have a working bot that answers ~80% of onboarding questions correctly. The other 20% are a mix of outdated docs and ambiguous questions. That’s a huge improvement over “ask in #general and wait 30 minutes.”
What’s your approach to building AI assistants for code documentation? Have you hit similar chunking or retrieval issues?