A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, "Let's build an AI chatbot." Simple, right?
Spoiler: It took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.
The Problem
We needed a system where a user could ask "How do I reset my VPN password?" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.
What I Tried First
The Obvious Route: OpenAI Embeddings + Pinecone
I grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with text-embedding-ada-002, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks.
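For reference, fixed-size chunking of this kind can be sketched in a few lines. This is a minimal illustration, not the original pipeline code: the function name `fixed_size_chunks` and its parameters are hypothetical, and only the 500/50 sizes come from the setup described above.

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Demo on synthetic text (hypothetical data): 1,200 characters give three windows.
sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = fixed_size_chunks(sample)
```

The window ignores sentence and table boundaries, so a chunk can start or end in the middle of a procedure.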
It worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let’s just say the CTO started asking questions.
The Open-Source Detour: LlamaIndex + Local LLMs
I switched to LlamaIndex with sentence-transformers/all-MiniLM-L6-v2 for embeddings and a local 7B model via Ollama. No API costs! But accuracy dropped drastically. The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon (like "VPN PAM" or "SAML SSO").
The Key Insight: Chunking and Hybrid Search
After banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.
Chunking Strategy
Fixed chunk sizes don't work for technical docs. A 200-word chunk might cut a table in half. I switched to semantic chunking: split on spaCy sentence boundaries, then merge sentences until adding the next one would push the chunk past about 250 tokens (I approximated tokens with whitespace-separated words). This preserved context.
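```python
import spacy

# Small English pipeline; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def semantic_chunker(text, max_tokens=250):
    """Yield chunks of whole sentences, each holding roughly max_tokens words.

    A single sentence longer than max_tokens becomes its own chunk.
    """
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    # Merge sentences into chunks of ~max_tokens
    chunk = []
    token_count = 0
    for sent in sentences:
        # Whitespace-separated words serve as a rough token count
        tokens = len(sent.split())
        if token_count + tokens > max_tokens and chunk:
            # This sentence would overflow the chunk: emit it and start a new one
            yield " ".join(chunk)
            chunk = [sent]
            token_count = tokens
        else:
            chunk.append(sent)
            token_count += tokens
    if chunk:
        yield " ".join(chunk)
```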
Hybrid Search: Vector + Keyword
Vector search is great for synonyms and concepts, but terrible for exact matches like "VPN password reset". BM25 catches exact keywords but misses semantic similarity. Together, they're gold.
I built a simple hybrid retriever:
```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


class HybridRetriever:
    def __init__(self, chunks, embed_model_name="all-MiniLM-L6-v2"):
        self.chunks = chunks
        self.embedder = SentenceTransformer(embed_model_name)
        # Vector index: unit-length embeddings, so a dot product is cosine similarity
        self.embeddings = self.embedder.encode(
            chunks, show_progress_bar=True, normalize_embeddings=True
        )
        # BM25 index over lowercased whitespace tokens
        tokenized = [chunk.lower().split() for chunk in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        # Vector scores (cosine similarity)
        q_emb = self.embedder.encode([query], normalize_embeddings=True)
        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()
        # BM25 scores; tokenize the query the same way as the chunks
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Min-max normalize both score sets and combine (equal weight)
        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        combined = 0.5 * vec_scores + 0.5 * bm25_scores
        # Indices of the top_k highest combined scores, best first
        top_indices = np.argsort(combined)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
```
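To see what the normalize-and-combine step does, here is a toy example in plain Python with made-up scores. The helper names `min_max` and `fuse` and the `alpha` weight are illustrative assumptions; the retriever above fixes the weight at 0.5.

```python
def min_max(scores, eps=1e-8):
    """Rescale scores to roughly [0, 1] so different scorers are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]


def fuse(vec_scores, bm25_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the vector side."""
    v, b = min_max(vec_scores), min_max(bm25_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(v, b)]


# Made-up scores for three chunks: cosine similarities and raw BM25 values.
vec = [0.82, 0.80, 0.35]
bm25 = [1.2, 7.5, 0.0]

combined = fuse(vec, bm25)
best = max(range(len(combined)), key=combined.__getitem__)
vector_only_best = max(range(len(vec)), key=vec.__getitem__)
```

Chunk 1 is a near tie on vector similarity but the clear BM25 winner, so the combined score ranks it first, while vector search alone would have picked chunk 0. Normalizing first keeps the raw BM25 values from dominating just because their scale is larger.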
Then I added a reranking step using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2). It scored each retrieved chunk against the query and reordered the list. That added ~200ms of latency but, by my rough estimate, about doubled answer quality.
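A rough sketch of a reranking step like this follows. To keep it runnable without downloading a model, the scorer is passed in as a function; `rerank`, `score_fn`, and `word_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in. With sentence-transformers, the scorer would typically be the `predict` method of a `CrossEncoder` loaded with the model name above, which accepts a list of (query, passage) pairs; treat that wiring as an assumption to check against the library docs.

```python
def rerank(query, chunks, score_fn, top_k=3):
    """Reorder chunks by a query-aware relevance score, best first."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = list(score_fn(pairs))  # one score per (query, chunk) pair
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    return ranked[:top_k]


def word_overlap_scorer(pairs):
    """Stand-in scorer for the demo: counts words shared by query and chunk."""
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]


# Hypothetical candidates, as they might come back from the first-stage retriever.
candidates = [
    "Expense reports are due on the last Friday of the month.",
    "To reset your VPN password, open the self-service portal.",
    "The VPN client is supported on macOS and Windows.",
]
top = rerank("reset vpn password", candidates, word_overlap_scorer, top_k=2)
```

The first-stage retriever can stay cheap and return a generous candidate list, because the more expensive pairwise scoring only runs on those few candidates.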
The Breakthrough: It Doesn't Need to Be Perfect
I spent weeks tuning embedding models and chunk sizes. What finally made the system usable was a simple fallback: if the top-retrieved chunk scored below a confidence threshold, the bot replied "I couldn't find that in the docs. Try rephrasing or check the #it-help channel." Users preferred an honest "I don't know" to a confident wrong answer.
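A fallback along these lines might look like the sketch below. Which score to threshold and what value to use are not specified here, so `min_score=0.5` is a placeholder to tune, and `answer_or_fallback` and `generate_fn` (a stand-in for the LLM call) are hypothetical names.

```python
FALLBACK_MESSAGE = (
    "I couldn't find that in the docs. "
    "Try rephrasing or check the #it-help channel."
)


def answer_or_fallback(ranked, generate_fn, min_score=0.5):
    """Answer only if the best retrieved chunk scores at least min_score.

    ranked: list of (chunk, score) pairs sorted best first.
    generate_fn: callable that turns the supporting chunks into an answer.
    """
    if not ranked or ranked[0][1] < min_score:
        return FALLBACK_MESSAGE
    # Pass along only the chunks that clear the threshold
    supporting = [chunk for chunk, score in ranked if score >= min_score]
    return generate_fn(supporting)


# Demo with a fake generator that just joins the supporting chunks.
confident = answer_or_fallback([("Open the portal.", 0.9), ("Unrelated.", 0.1)], " ".join)
unsure = answer_or_fallback([("Unrelated.", 0.2)], " ".join)
```

Checking the threshold before calling the model also means low-confidence queries cost nothing in generation.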
Lessons Learned
- Chunking is more important than the embedding model. Semantic chunking + overlap of ~2 sentences gave the best results.
- Hybrid search beats pure vector for internal docs. Code snippets, error messages, and product names are exact-match queries.
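- Reranking is cheap and high-impact. A lightweight cross-encoder can't add chunks the retriever missed, but it significantly improves their ordering, so the best chunk is far more likely to land at the top.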
- Don't use GPT-4 for every query. For simple factual answers, a smaller model (like GPT-3.5 or a local 7B) works fine. Only escalate to bigger models when the answer isn't clear.
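The sentence overlap mentioned in the first lesson can be sketched as a small variation on sentence merging. This is an illustrative version, not the code that ran in production: `chunk_with_overlap` and its parameters are hypothetical, it takes a list of already-split sentences, and it counts whitespace-separated words as tokens.

```python
def chunk_with_overlap(sentences, max_tokens=250, overlap_sentences=2):
    """Group sentences into chunks, repeating the last few sentences of each
    chunk at the start of the next one."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        tokens = len(sent.split())
        if current and count + tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny demo: four three-word "sentences", six-word budget, one sentence of overlap.
demo = chunk_with_overlap(["a b c", "d e f", "g h i", "j k l"], max_tokens=6, overlap_sentences=1)
```

Because the carried-over sentences count toward the budget, a chunk can run slightly past `max_tokens`; the limit is a target rather than a hard cap.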
What I'd Do Differently
I should have started with a hybrid retriever from day one instead of chasing the "best" embedding model. Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.
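An evaluation harness of that kind can start very small. The sketch below computes a hit rate at k over labeled queries; `hit_rate_at_k`, the chunk-id scheme, and the fake retriever are all hypothetical, and it assumes the retriever returns chunk ids rather than chunk text.

```python
def hit_rate_at_k(labeled_queries, retrieve_fn, k=5):
    """Fraction of queries whose top-k results include a ground-truth chunk.

    labeled_queries: list of (query, set_of_relevant_chunk_ids) pairs.
    retrieve_fn: callable(query, k) -> list of chunk ids, best first.
    """
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve_fn(query, k)[:k]
        if relevant_ids & set(retrieved):
            hits += 1
    return hits / len(labeled_queries)


# Made-up labels and a fake retriever, just to show the shape of the data.
labels = [("vpn reset", {"doc-12"}), ("sso setup", {"doc-7"})]
fake_results = {"vpn reset": ["doc-3", "doc-12"], "sso setup": ["doc-1", "doc-2"]}


def fake_retrieve(query, k):
    return fake_results[query][:k]


score = hit_rate_at_k(labels, fake_retrieve, k=2)
```

Rerunning one number like this after each change to chunking, weights, or reranking replaces guesswork with a direct comparison.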
If you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation.
For a production-ready version, I later simplified things by using a managed retrieval service (like the one at https://ai.interwestinfo.com/), but building it from scratch taught me more.
What's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments!