A few months ago, I needed to build a Q&A bot that could answer questions from a messy pile of internal documentation. Think hundreds of Markdown files, PDFs, and even some old Confluence exports. The goal was simple: let support agents ask natural language questions and get accurate answers with citations.
I thought it would be straightforward. I was wrong.
The naive approach: just dump everything into a prompt
My first attempt was embarrassingly simple: concatenate all the docs into a giant prompt and ask GPT-4 to answer. I mean, it works for small stuff, right?
import openai

# Read the entire documentation dump into a single string
with open("all_docs.txt", "r") as f:
    context = f.read()

# Note: openai.ChatCompletion is the legacy (pre-1.0) SDK interface
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: How do I reset a user's password?"}
    ]
)
print(response.choices[0].message.content)
This worked for exactly one question before I hit the token limit. My docs were ~50k tokens, and GPT-4's context window was 8k at the time. I tried truncating, but then the answers were missing critical details. Plus, it cost a fortune per query. Dead end.
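To make the mismatch concrete, here is a rough pre-flight check for whether a prompt fits in the context window. It is only a sketch: the 4-characters-per-token ratio is a crude rule of thumb for English text rather than a real tokenizer, and the function names and the 500-token answer reserve are made up for illustration.

```python
# Assumption: ~4 characters per token, a rough heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(context: str, question: str,
                    window: int = 8_000, reserve_for_answer: int = 500) -> bool:
    """Return True if context + question still leave room for the answer."""
    prompt_tokens = estimate_tokens(context) + estimate_tokens(question)
    return prompt_tokens + reserve_for_answer <= window


# A ~50k-token dump (about 200k characters) does not fit an 8k window.
big_dump = "a" * 200_000
print(fits_in_context(big_dump, "How do I reset a user's password?"))
```

For real budgeting you would swap the heuristic for the model's actual tokenizer, but even the estimate is enough to show the prompt is several times too large.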
The second try: keyword search + LLM
Next, I built a simple keyword search with Elasticsearch. Index the docs, retrieve the top 3 chunks, feed them to the LLM. This felt smarter.
import openai
from elasticsearch import Elasticsearch

es = Elasticsearch()
# index documents... (omitted for brevity)

def search_and_answer(query):
    # Retrieve the top 3 chunks by keyword match
    result = es.search(index="docs", body={
        "query": {"match": {"content": query}},
        "size": 3
    })
    chunks = [hit["_source"]["content"] for hit in result["hits"]["hits"]]
    context = "\n\n".join(chunks)
    # Ask the LLM to answer from the retrieved chunks
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
        ]
    )
    return response.choices[0].message.content
It worked better, but keyword search is dumb. A question like "How do I reset a password?" might not match a document that says "credential recovery process". I was missing a lot of relevant content. Also, the chunks were arbitrary – I split on paragraphs, but sometimes the answer was spread across two chunks. The bot gave incomplete answers.
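The vocabulary mismatch is easy to reproduce without Elasticsearch. The sketch below compares plain term overlap between a query and two documents. It is a simplification (a real `match` query runs an analyzer and scores with BM25), and the two sample documents and the stopword list are invented for illustration.

```python
import re

# Assumption: a tiny hand-picked stopword list, just enough for this demo.
STOPWORDS = {"how", "do", "i", "a", "the", "to", "of", "is"}


def terms(text: str) -> set[str]:
    """Lowercase word tokens, a crude stand-in for a search analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, document: str) -> set[str]:
    """Terms shared by the query and the document, ignoring stopwords."""
    return (terms(query) & terms(document)) - STOPWORDS


query = "How do I reset a password?"
doc_literal = "To reset a password, open the admin console."
doc_paraphrase = "Credential recovery process: open the admin console."

print(keyword_overlap(query, doc_literal))     # shares 'reset' and 'password'
print(keyword_overlap(query, doc_paraphrase))  # empty set: nothing to match on
```

With no shared terms, a keyword engine has nothing to score the paraphrased document on, even though it is the one the user needs.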
The approach that finally worked: semantic search with reranking
I needed to understand the meaning of the question, not just the words. That meant embeddings and vector search. I also needed to rank the retrieved chunks better. Here's what I ended up with:
- Chunk documents intelligently – split by sections (headers) rather than fixed token counts.
- Generate embeddings for each chunk using a model (I used text-embedding-ada-002).
- Store in a vector database – I chose ChromaDB for simplicity, but Pinecone or Qdrant work too.
- Retrieve top-k chunks by cosine similarity.
- Rerank those chunks using a cross-encoder model (like cross-encoder/ms-marco-MiniLM-L-6-v2) to get the most relevant ones.
- Feed the top 2 reranked chunks to the LLM for answer generation.
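The first step in the list, chunking by headers, could look roughly like the sketch below. It assumes Markdown input with `#`-style headers. The function name, the `"(preamble)"` label, and the dict layout are illustrative choices, not the code from the original pipeline.

```python
import re

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per header section."""
    chunks = []
    title, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            body = "\n".join(lines).strip()
            if body:
                chunks.append({"title": title, "text": body})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    body = "\n".join(lines).strip()
    if body:
        chunks.append({"title": title, "text": body})
    return chunks


doc = "Intro text\n# Reset password\nGo to admin.\n\nClick reset.\n## Empty\n# Billing\nSee invoices."
for chunk in chunk_by_headers(doc):
    print(chunk["title"], "->", repr(chunk["text"]))
```

Keeping the section title next to each chunk also gives the LLM something to cite. One caveat: this naive version treats a `#` comment line inside a fenced code block as a header, so real docs may need fence-aware parsing.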
Here's the core code:
import chromadb
from sentence_transformers import CrossEncoder
import openai

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
# Assume documents are already chunked and embedded
# (embedding function is OpenAI's, stored in collection)

# Load the reranker once at import time rather than on every call
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def answer_question(question, top_k=10, rerank_top=2):
    # Step 1: Embed the question
    question_embedding = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=question
    )["data"][0]["embedding"]

    # Step 2: Vector search
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k
    )

    # Step 3: Rerank each (question, chunk) pair with the cross-encoder
    pairs = [(question, doc) for doc in results["documents"][0]]
    scores = cross_encoder.predict(pairs)
    # Sort by score descending
    scored_docs = sorted(zip(scores, results["documents"][0]), key=lambda x: x[0], reverse=True)
    best_chunks = [doc for _, doc in scored_docs[:rerank_top]]

    # Step 4: Generate answer
    context = "\n\n".join(best_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context. If the answer is not in the context, say 'I don't know'. Cite the source."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content
This worked. The reranking step was the game-changer – it filtered out irrelevant chunks that were semantically close but not actually answering the question. I tested with 50 tricky questions and got correct answers 80% of the time, compared to 55% with just vector search.
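A comparison like this needs only a small harness. The sketch below is one possible shape for it, not the evaluation actually used here. It assumes each test case pairs a question with a phrase the correct answer must contain, which is a crude proxy for correctness, and the stub pipeline and sample questions are invented.

```python
from typing import Callable


def accuracy(answer_fn: Callable[[str], str],
             test_set: list[tuple[str, str]]) -> float:
    """Fraction of questions whose answer contains the expected phrase."""
    if not test_set:
        return 0.0
    hits = sum(
        expected.lower() in answer_fn(question).lower()
        for question, expected in test_set
    )
    return hits / len(test_set)


# Hypothetical stand-in for a real pipeline such as answer_question().
def stub_pipeline(question: str) -> str:
    if "password" in question.lower():
        return "Go to Admin > Users and click Reset Password."
    return "I don't know"


test_set = [
    ("How do I reset a password?", "reset password"),
    ("How do I export invoices?", "billing"),
]
print(accuracy(stub_pipeline, test_set))
```

Running the same test set through the pipeline with and without reranking gives two directly comparable numbers. Substring matching will miss correct paraphrases, so borderline cases still need a manual look.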
Trade-offs and limitations
- Cost: Embedding each chunk once is cheap, but querying OpenAI for every question adds up. I switched to a local embedding model later (like all-MiniLM-L6-v2) to reduce costs.
- Latency: Reranking adds ~200ms per query. For a chatbot, that's acceptable. If you need real-time, skip reranking and just increase top_k.
- Chunking strategy: I used section headers, but some docs had no clear structure. For those, I used recursive character splitting (LangChain's RecursiveCharacterTextSplitter).
- When not to use this: If your docs are tiny (like a single FAQ page), just put them in the prompt. If you need high accuracy on every question, consider fine-tuning a model on your specific domain.
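For readers who have not seen recursive character splitting, the idea is to try coarse separators first (paragraph breaks), and fall back to finer ones (line breaks, spaces, then a hard cut) only for pieces that are still too long. The sketch below is a simplified stand-in written for illustration. It is not LangChain's implementation and leaves out features such as chunk overlap.

```python
def recursive_split(text: str, max_chars: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee\n\n" + "x" * 25
print(recursive_split(sample, max_chars=10))
```

Because paragraph breaks are tried first, chunks tend to end at natural boundaries, which is the main reason to prefer this over cutting every N characters.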
I also looked into managed services like Interwest AI (https://ai.interwestinfo.com/) which offers similar functionality out of the box. For a quick prototype, that might save time. But I needed full control over chunking and reranking, so I stuck with the custom pipeline.
What I'd do differently next time
- Use a local embedding model from the start to avoid API dependency.
- Add a feedback loop: when users mark an answer as wrong, log the question and the retrieved chunks to improve chunking or add more documents.
- Experiment with different reranking models – the cross-encoder I used is small but there are larger ones that might be more accurate.
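The feedback loop does not need much infrastructure to start with. Here is a minimal sketch that appends each flagged answer to a JSON Lines file. The file format, field names, and function names are assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def log_bad_answer(path: str, question: str, chunks: list[str], answer: str) -> None:
    """Append one flagged interaction as a JSON line."""
    record = {
        "ts": time.time(),
        "question": question,
        "chunks": chunks,   # what retrieval handed to the LLM
        "answer": answer,   # what the user marked as wrong
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_feedback(path: str) -> list[dict]:
    """Read all flagged interactions back for review."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reviewing the logged chunks next to the question usually shows whether the failure came from retrieval (the right chunk was never fetched) or from generation (the chunk was there but the answer ignored it).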
Building a Q&A bot over your own docs is a classic problem with many solutions. The key is to not underestimate the retrieval step. Garbage in, garbage out – even with GPT-4.
What's your setup look like? Are you using vector search, fine-tuning, or something else entirely? I'd love to hear what worked for you.