zhongqiyue

Posted on Jun 21

I built a code Q&A bot with RAG – what worked and what failed

#webdev #python #ai #tutorial

A few months ago, our team had a problem: every new developer spent days digging through scattered documentation, old slide decks, and Slack threads just to understand how our microservices talked to each other. I thought, why not build a chatbot that can answer those questions? Something like a mini GPT trained on our internal docs.

Spoiler: I made a lot of mistakes before I got anything useful. Here’s the honest story of building a retrieval-augmented generation (RAG) bot for code documentation, including the dead ends, the working approach, and the trade-offs I wish I'd known earlier.

The problem: context overload

I started simple. I dumped all our Markdown files into a single prompt and asked GPT-4: “Answer questions based on this text.” That worked for exactly one question. Then the token limit hit, the model started hallucinating random service names, and every query cost $0.15. For a team of 20 developers, that would burn through our budget in a week.

I needed a way to retrieve only the relevant pieces of documentation, not the entire encyclopedia.

What I tried that didn’t work

1. Keyword search (TF‑IDF)

I built a basic TF‑IDF index. It kind of worked – if a developer asked exactly “What does the auth service return?” and the docs contained those words, it found the right paragraph. But queries that used synonyms, paraphrases, or different code-specific terms (like “auth service” when the docs said “login endpoint”) missed completely. Short queries matched too little, and long queries drifted toward whatever paragraph shared the most common words.
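To make the vocabulary mismatch concrete, here is a minimal pure-Python TF‑IDF sketch. It is not the index I built; the two sample sentences, the smoothed IDF formula, and every function name are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation. No stemming.
    tokens = (w.strip(".,?!()").lower() for w in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score(query, docs):
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the docs carry no weight at all.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    return [cosine(q, v) for v in vectors]

# Hypothetical documentation snippets.
docs = [
    "The login endpoint returns a signed JWT and a refresh token.",
    "The billing worker emits invoice events to the queue.",
]

exact = score("What does the login endpoint return?", docs)
paraphrase = score("auth service response", docs)
print("exact wording:", exact)
print("paraphrase:   ", paraphrase)
```

The query that reuses the docs' wording ranks the first snippet on top, while the paraphrase shares no terms with either snippet and scores zero everywhere, which is the failure mode described above.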

2. Naive chunking without overlap

I split documents into 500-character chunks, embedded them with a small model, and used cosine similarity to find the top 3 chunks. The results were erratic: the retrieved chunks often cut off in the middle of a sentence, or they missed the critical sentence that connected two paragraphs.
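A minimal sketch of why fixed-size character chunks break sentences. The sample text is made up for illustration, and `chunk_by_chars` is a hypothetical helper, not the code I actually ran.

```python
def chunk_by_chars(text, size=500):
    # Cut every `size` characters, ignoring word and sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical documentation text, repeated to span several chunks.
doc = (
    "The auth service issues a JWT after login. "
    "The token expires after 15 minutes, so clients must call the refresh endpoint. "
) * 10

chunks = chunk_by_chars(doc, size=500)

# Every chunk except the last one should ideally end on a full stop.
cut_mid_sentence = [c for c in chunks[:-1] if not c.rstrip().endswith(".")]
print(f"{len(cut_mid_sentence)} of {len(chunks) - 1} chunk boundaries fall mid-sentence")
print("first chunk ends with:", repr(chunks[0][-20:]))
```

Nothing is lost overall, but each cut lands wherever the character count says, so the embedding of a chunk can describe half a thought.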

3. Just using the biggest model I could find

I tried throwing everything at GPT‑4 with a system prompt like “You are a developer assistant. Answer concisely.” Without proper retrieval, the model either got lost in irrelevant context or made things up when it couldn't find the answer. And it was slow – 5‑10 seconds per response.

What finally worked: retrieval-augmented generation (RAG)

The breakthrough came when I stopped treating the LLM as a search engine and started treating it as a reading comprehension engine on top of a dedicated search index.

Here’s the pipeline I settled on:

1. Chunk documentation into overlapping pieces (~300 tokens, with 50-token overlap).
2. Embed each chunk into a vector.
3. Store vectors in a fast similarity search index (I used FAISS at first, then moved to a hosted vector store).
4. At query time: embed the question, find the top 5 most similar chunks, and feed only those into the LLM prompt.
5. Generate answer – the LLM reads the chunks and answers (or says “I don’t know”).

I won’t dump the entire codebase here, but here’s the core Python snippet that made the difference:

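import faiss
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# 1. Chunking with overlap. Sizes are counted in whitespace-separated words,
#    which only approximates model tokens.
def chunk_text(text, chunk_size=300, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(' '.join(words[start:end]))
        if end == len(words):
            break  # last chunk reached; stepping back by `overlap` here would loop forever
        start = end - overlap
    return chunks

# 2. Embed chunks (using a free model)
model = SentenceTransformer('all-MiniLM-L6-v2')
all_chunks = []  # fill with chunk strings, e.g. chunk_text(doc) for each document (must be non-empty)
# Normalized vectors make L2 distance rank results the same way as cosine similarity.
all_embeddings = model.encode(all_chunks, normalize_embeddings=True)

# 3. Index with FAISS (expects float32 vectors)
dimension = all_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.asarray(all_embeddings, dtype="float32"))

# 4. Retrieve and answer
def answer_query(query, index, chunks, model, llm_client, top_k=5):
    q_emb = np.asarray(model.encode([query], normalize_embeddings=True), dtype="float32")
    distances, indices = index.search(q_emb, top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors.
    retrieved = [chunks[i] for i in indices[0] if i != -1]
    context = "\n---\n".join(retrieved)
    prompt = f"""Use the following documentation to answer the question.
    If you cannot find the answer, say "I don't know."

    Documentation:
    {context}

    Question: {query}
    Answer:"""
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content

# Example usage (the client reads the OPENAI_API_KEY environment variable)
# client = OpenAI()
# print(answer_query("How do I deploy the auth service?", index, all_chunks, model, client))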

That snippet alone reduced hallucination by ~80% and cut per-query cost to under $0.01 (by sending only the top 5 chunks instead of every document, and by using gpt-4o-mini instead of full GPT‑4).

The tool that helped me skip boilerplate

I eventually moved from a local FAISS index to a managed solution because our documentation kept growing and I didn't want to re‑index every time someone updated a Confluence page. I tried a few services, including the one at ai.interwestinfo.com (which handled chunking, embedding, and retrieval out‑of‑the‑box). But the technique is what matters – any platform that offers vector search + LLM integration would work similarly.

Lessons learned & trade-offs

Chunk size matters a lot.

150 tokens: too little context, the LLM can't see the full picture.
1000 tokens: too much noise, retrieval quality drops.
I landed on 300 tokens with a 50-token overlap – it gave the best balance between precision and recall.

Overlap is not optional.
Without overlap, I missed connections between sections. With 15‑20% overlap, the retriever found the right chunk more consistently.
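A toy sketch of what overlap buys you, using placeholder words instead of real documentation. The helper names, the 10-word chunk size, and the 2-word (20%) overlap are assumptions picked to keep the example small.

```python
def chunk_words(words, chunk_size, overlap):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the final words are already covered
    return chunks

def contains_phrase(chunk, phrase):
    # True if `phrase` appears as a contiguous run inside `chunk`.
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))

words = [f"w{i}" for i in range(20)]
# A "sentence" that straddles the boundary between word 9 and word 10.
phrase = ["w8", "w9", "w10", "w11"]

no_overlap = chunk_words(words, chunk_size=10, overlap=0)
with_overlap = chunk_words(words, chunk_size=10, overlap=2)

print("intact without overlap:", any(contains_phrase(c, phrase) for c in no_overlap))
print("intact with overlap:   ", any(contains_phrase(c, phrase) for c in with_overlap))
```

Without overlap the phrase is split across two chunks, so neither embedding represents it fully. With overlap, at least one chunk contains it whole.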

Embedding model choice:

Lightweight models (all-MiniLM-L6-v2) are fast and free but less accurate with code-heavy docs.
Larger models (text‑embedding‑3‑large) increase cost but improve retrieval.
For internal dev docs, the small model was good enough – I only needed 90%+ recall, not perfection.

Latency: the full pipeline took ~2 seconds end to end, almost all of it in the LLM call: roughly 200 ms to embed the query, 10 ms for the FAISS search, and 1.5 seconds for the LLM call. Acceptable for a chat interface.
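If you want a per-stage breakdown like this for your own pipeline, a small timing helper is enough. This is a sketch: `timed` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the real embedding, search, and LLM steps, not measurements.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    # Record wall-clock time spent inside the `with` block under `stage`.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

timings = {}

with timed("embed_query", timings):
    time.sleep(0.01)   # placeholder for model.encode([query])
with timed("vector_search", timings):
    time.sleep(0.001)  # placeholder for index.search(...)
with timed("llm_call", timings):
    time.sleep(0.1)    # placeholder for the chat completion request

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("slowest stage:", slowest)
```

Knowing which stage dominates tells you where optimization is worth the effort. In my case that was clearly the LLM call.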

Cost:

Embedding 10,000 chunks with a free model: $0.
Hosted vector store: $5‑20/month depending on size.
LLM calls (gpt-4o-mini): ~$0.002 per query. For 200 queries/day, that’s $12/month.
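The arithmetic behind those numbers, as a quick sketch. It assumes a 30-day month, and applying the earlier $0.15 full-context price to the same 200 queries/day is only a rough comparison, not something I measured.

```python
def monthly_llm_cost(cost_per_query, queries_per_day, days_per_month=30):
    # Simple linear estimate; ignores retries, caching, and price changes.
    return cost_per_query * queries_per_day * days_per_month

rag_cost = monthly_llm_cost(0.002, 200)    # RAG with gpt-4o-mini
naive_cost = monthly_llm_cost(0.15, 200)   # all docs stuffed into one GPT-4 prompt

# Hosted vector store range quoted above: $5 to $20 per month.
total_low = rag_cost + 5
total_high = rag_cost + 20

print(f"RAG LLM calls:     ${rag_cost:.2f}/month")
print(f"RAG total:         ${total_low:.2f} to ${total_high:.2f}/month")
print(f"Full-context GPT-4: ${naive_cost:.2f}/month")
```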

What I’d do differently next time

Evaluate retrieval properly. I manually checked 50 queries – that’s not enough. I’d build a small test set with known answers and measure recall@k.
Use hybrid search. Pure vector search can miss exact keyword matches. Next time I’ll combine BM25 + vector similarity (reciprocal rank fusion).
Let users give feedback. I didn’t log “was this answer helpful?” – now I wish I had, so I could tune chunking, overlap, or the prompt based on real usage.
Consider smaller, fine-tuned models. For a very specific domain (internal APIs), a fine-tuned 7B model might be cheaper and faster than a general-purpose LLM.
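The first two items are easy to prototype. Below is a minimal sketch of recall@k and reciprocal rank fusion in plain Python. The chunk ids and rankings are made up, and `rrf_k=60` is the constant commonly used for this fusion method, not a value I tuned.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the known-relevant chunks that show up in the top k results.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank_fusion(rankings, rrf_k=60):
    # Each ranking contributes 1 / (rrf_k + rank) to every id it contains.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for one test question.
bm25_ranking = ["c7", "c2", "c9"]     # keyword search
vector_ranking = ["c2", "c4", "c7"]   # embedding search
relevant = ["c2", "c4"]               # hand-labeled correct chunks

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print("fused ranking:", fused)
print("recall@2:", recall_at_k(fused, relevant, k=2))
print("recall@3:", recall_at_k(fused, relevant, k=3))
```

Averaging recall@k over a small labeled set of questions gives a single number to compare chunk sizes, overlap settings, or embedding models against, instead of eyeballing answers.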

Final thoughts

RAG is not magic – it’s a careful engineering puzzle of chunking, embedding, retrieval, and prompt design. The tools (like FAISS, Pinecone, or managed APIs) are just implementation details. What matters is understanding where your pipeline breaks: bad chunks → bad retrieval → bad answers.

We now have a working bot that answers ~80% of onboarding questions correctly. The other 20% are a mix of outdated docs and ambiguous questions. That’s a huge improvement over “ask in #general and wait 30 minutes.”

What’s your approach to building AI assistants for code documentation? Have you hit similar chunking or retrieval issues?

DEV Community