I tried 3 ways to build a Q&A bot over my docs — here's what actually worked

#ai #python #tutorial #webdev

A few months ago, I needed to build a Q&A bot that could answer questions from a messy pile of internal documentation. Think hundreds of Markdown files, PDFs, and even some old Confluence exports. The goal was simple: let support agents ask natural language questions and get accurate answers with citations.

I thought it would be straightforward. I was wrong.

The naive approach: just dump everything into a prompt

My first attempt was embarrassingly simple: concatenate all the docs into a giant prompt and ask GPT-4 to answer. I mean, it works for small stuff, right?

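import openai  # legacy openai<1.0 interface; 1.x moved this call to client.chat.completions.create

# Load the entire documentation dump as one string
with open("all_docs.txt", "r", encoding="utf-8") as f:
    context = f.read()

# Send every doc plus the question in a single request
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: How do I reset a user's password?"}
    ]
)
print(response.choices[0].message.content)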

This worked for exactly one question. Then I hit the token limit: my docs were ~50k tokens, and GPT-4's context window was 8k tokens at the time. I tried truncating, but then the answers were missing critical details. It also cost a fortune per query. Dead end.
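A quick budget check before sending anything makes the mismatch obvious. This is a minimal sketch that assumes roughly 4 characters per token for English text, which is only a rule of thumb; a real tokenizer gives exact counts. The function names and the 500-token reserve for the answer are illustrative choices, not something from the original setup.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate (assumption: ~4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text, context_window, reserved_for_answer=500):
    """True if the prompt plus room for the reply fits in the model's window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


# Stand-in for ~50k tokens of documentation (200,000 characters)
docs = "x" * 200_000

print(fits_in_context(docs, context_window=8_000))
print(fits_in_context("How do I reset a user's password?", context_window=8_000))
```

The first call reports that the full dump does not fit an 8k window, while the bare question does.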

The second try: keyword search + LLM

Next, I built a simple keyword search with Elasticsearch. Index the docs, retrieve the top 3 chunks, feed them to the LLM. This felt smarter.

import openai  # legacy openai<1.0 interface
from elasticsearch import Elasticsearch

# Assumes a local node; recent clients require an explicit host URL
es = Elasticsearch("http://localhost:9200")
# index documents... (omitted for brevity)

def search_and_answer(query):
    # Retrieve the 3 chunks whose text best matches the query terms
    result = es.search(index="docs", body={
        "query": {"match": {"content": query}},
        "size": 3
    })
    chunks = [hit["_source"]["content"] for hit in result["hits"]["hits"]]
    context = "\n\n".join(chunks)

    # Ask the LLM to answer from the retrieved chunks
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
        ]
    )
    return response.choices[0].message.content

It worked better, but keyword search is dumb: it matches words, not meaning. A question like "How do I reset a password?" might not match a document that says "credential recovery process", so I was missing a lot of relevant content. The chunking was also arbitrary. I split on paragraphs, and when an answer was spread across two chunks, the bot could end up with only part of it and give incomplete answers.
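The vocabulary mismatch is easy to reproduce without Elasticsearch. The sketch below is a simplification, not how Elasticsearch scores documents: it just checks which non-trivial terms a query and a document share, on the assumption that a plain `match` query (default analyzer, no synonym filter) needs at least one shared term to return a hit. The sample document sentence and the stopword list are made up for illustration.

```python
import re

# Tiny illustrative stopword list (assumption, not Elasticsearch's)
STOPWORDS = {"how", "do", "i", "a", "the", "to", "for"}


def terms(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_terms(query, document):
    """Terms that appear in both query and document, ignoring stopwords."""
    return (terms(query) - STOPWORDS) & (terms(document) - STOPWORDS)


query = "How do I reset a password?"
doc = "Credential recovery process: follow these steps to restore account access."

print(shared_terms(query, doc))
print(shared_terms(query, "To reset your password, open the Users page."))
```

The first pair shares no terms at all, so a purely lexical search has nothing to match on, even though the document answers the question.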

The approach that finally worked: semantic search with reranking

I needed to understand the meaning of the question, not just the words. That meant embeddings and vector search. I also needed to rank the retrieved chunks better. Here's what I ended up with:

Chunk documents intelligently – split by sections (headers) rather than fixed token counts.
Generate embeddings for each chunk with an embedding model (I used OpenAI's text-embedding-ada-002).
Store in a vector database – I chose ChromaDB for simplicity, but Pinecone or Qdrant work too.
Retrieve top-k chunks by cosine similarity.
Rerank those chunks using a cross-encoder model (like cross-encoder/ms-marco-MiniLM-L-6-v2) to get the most relevant ones.
Feed the top 2 reranked chunks to the LLM for answer generation.
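The first step, chunking by section, can be sketched in a few lines for Markdown input. This is one possible approach under simple assumptions: headers are ATX-style (`#` to `######`), and `#` lines inside fenced code blocks are not handled specially. The function name and the `title`/`text` keys are hypothetical.

```python
import re

# Matches ATX-style Markdown headers such as "## Resetting passwords"
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown):
    """Split a Markdown document into one chunk per header section."""
    chunks = []
    title, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # A new header closes the previous section
            body = "\n".join(lines).strip()
            if body:
                chunks.append({"title": title, "text": body})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section
    body = "\n".join(lines).strip()
    if body:
        chunks.append({"title": title, "text": body})
    return chunks


sample = "Intro text\n# Passwords\nReset via admin panel.\n## SSO\nUse the IdP.\n# Empty\n"
for chunk in chunk_by_headers(sample):
    print(chunk["title"], "->", chunk["text"])
```

Sections with no body (like `# Empty` here) are dropped, and keeping the title next to the text gives the retriever and the LLM a bit more context per chunk.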

Here's the core code:

import chromadb
import openai  # legacy openai<1.0 interface (openai.Embedding / openai.ChatCompletion)
from sentence_transformers import CrossEncoder

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Load the reranker once; reloading the model on every call adds avoidable latency
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Assume documents are already chunked and embedded
# (embedding function is OpenAI's, stored in collection)

def answer_question(question, top_k=10, rerank_top=2):
    # Step 1: Embed the question with the same model used for the chunks
    question_embedding = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=question
    )["data"][0]["embedding"]

    # Step 2: Vector search for the top_k nearest chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k
    )
    candidates = results["documents"][0]

    # Step 3: Rerank by scoring each (question, chunk) pair with the cross-encoder
    pairs = [(question, doc) for doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Sort by score descending and keep the best rerank_top chunks
    scored_docs = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    best_chunks = [doc for _, doc in scored_docs[:rerank_top]]

    # Step 4: Generate answer
    context = "\n\n".join(best_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context. If the answer is not in the context, say 'I don't know'. Cite the source."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content
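The system prompt above asks the model to cite the source, but the context string only carries chunk text, so there is nothing to cite. One way to close that gap, sketched under the assumption that each chunk was stored with a `source` value in its metadata (with ChromaDB, that would come back in the `metadatas` field of the query result), is to label every chunk before joining. The `build_context` helper and the file paths are hypothetical.

```python
def build_context(chunks):
    """Format chunks as numbered blocks that carry their source label."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    return "\n\n".join(blocks)


# Hypothetical chunks; in the pipeline these would pair each reranked
# document with the metadata stored alongside it
chunks = [
    {"source": "auth/passwords.md", "text": "Admins can reset a password from the Users page."},
    {"source": "auth/sso.md", "text": "SSO users reset credentials through the identity provider."},
]

print(build_context(chunks))
```

With labels in the context, the model can refer to `[1]` or to the file name in its answer, and the answer can be checked against the cited chunk.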

This worked. The reranking step was the game-changer – it filtered out irrelevant chunks that were semantically close but not actually answering the question. I tested with 50 tricky questions and got correct answers 80% of the time, compared to 55% with just vector search.
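A comparison like this needs a fixed question set and a repeatable way to score answers. The harness below is a minimal sketch, not the evaluation used for the numbers above: it assumes each test case pairs a question with a phrase a correct answer should contain, and it grades by case-insensitive substring match, which is crude. The stub answer function, the questions, and the expected phrases are invented for the demo.

```python
def contains_expected(answer, expected):
    """Crude grader: the answer must contain the expected phrase."""
    return expected.lower() in answer.lower()


def accuracy(answer_fn, test_cases, is_correct=contains_expected):
    """Fraction of test cases that answer_fn gets right."""
    correct = sum(
        1 for question, expected in test_cases
        if is_correct(answer_fn(question), expected)
    )
    return correct / len(test_cases)


# Stub standing in for a real pipeline such as answer_question()
def stub_answer(question):
    canned = {
        "How do I reset a user's password?": "Open the Users page and click Reset password.",
    }
    return canned.get(question, "I don't know")


test_cases = [
    ("How do I reset a user's password?", "Users page"),
    ("How do I rotate an API key?", "Settings"),
]

print(accuracy(stub_answer, test_cases))
```

Running the same `test_cases` through two pipeline variants (for example with and without reranking) gives directly comparable numbers.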

Trade-offs and limitations

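Cost: Embedding each chunk once is cheap, but querying OpenAI for every question adds up. I switched to a local embedding model later (like all-MiniLM-L6-v2) to reduce costs. Note that switching models means re-embedding every chunk, since vectors from different models aren't comparable.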
Latency: Reranking adds ~200ms per query. For a chatbot, that's acceptable. If you need lower latency, skip reranking and pass a few more of the top vector-search results straight to the LLM instead.
Chunking strategy: I used section headers, but some docs had no clear structure. For those, I used recursive character splitting (LangChain's RecursiveCharacterTextSplitter).
When not to use this: If your docs are tiny (like a single FAQ page), just put them in the prompt. If you need high accuracy on every question, consider fine-tuning a model on your specific domain.
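For docs without headers, the idea behind recursive character splitting is to try coarse separators first (blank lines), then finer ones (newlines, spaces), and only cut mid-word as a last resort. The function below is a simplified sketch of that idea, not LangChain's implementation: it has no chunk overlap, and the separator list and `max_chars` parameter are assumptions.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            # Keep merging parts while the piece still fits
            current = candidate
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too big: retry with finer separators
            pieces.extend(recursive_split(part, max_chars, finer))
    if current.strip():
        pieces.append(current)
    return pieces


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
for piece in recursive_split(text, max_chars=30):
    print(repr(piece))
```

Short paragraphs stay whole, and only the oversized middle paragraph gets broken at word boundaries.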

I also looked into managed services like Interwest AI (https://ai.interwestinfo.com/) which offers similar functionality out of the box. For a quick prototype, that might save time. But I needed full control over chunking and reranking, so I stuck with the custom pipeline.

What I'd do differently next time

Use a local embedding model from the start to avoid API dependency.
Add a feedback loop: when users mark an answer as wrong, log the question and the retrieved chunks to improve chunking or add more documents.
Experiment with different reranking models – the cross-encoder I used is small but there are larger ones that might be more accurate.
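The feedback loop can start as something very small: an append-only JSON Lines file with one record per answer a user flagged as wrong. This is a sketch under assumed names; `log_bad_answer`, the record fields, and the file location are all hypothetical, and a real deployment would likely write to a database instead.

```python
import json
import os
import tempfile
import time


def log_bad_answer(path, question, retrieved_chunks, answer):
    """Append one flagged answer, with what the retriever returned, to a JSONL file."""
    record = {
        "timestamp": time.time(),
        "question": question,
        "retrieved_chunks": retrieved_chunks,
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Demo: write to a throwaway directory
log_path = os.path.join(tempfile.mkdtemp(), "feedback.jsonl")
log_bad_answer(
    log_path,
    question="How do I rotate an API key?",
    retrieved_chunks=["Admins can reset a password from the Users page."],
    answer="I don't know",
)

with open(log_path, encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["question"])
```

Reviewing the logged chunks next to each question shows whether a miss came from retrieval (the right chunk never showed up) or from missing documentation.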

Building a Q&A bot over your own docs is a classic problem with many solutions. The key is to not underestimate the retrieval step. Garbage in, garbage out – even with GPT-4.

What does your setup look like? Are you using vector search, fine-tuning, or something else entirely? I'd love to hear what worked for you.