Everyone talks about how easy it is to build a RAG (Retrieval-Augmented Generation) chat assistant. Just embed some documents, throw them in a vector database, and connect an LLM, right?
Well, I just spent weeks building a production RAG system for a real knowledge base platform with nearly 2,000 articles. And let me tell you — the gap between a weekend hackathon demo and something that actually works for real users is enormous.
What is RAG, Anyway?
In case you haven't heard the acronym a thousand times this year: RAG is a technique where you retrieve relevant documents from your own knowledge base and feed them into an LLM as context, so the model answers from your data instead of relying only on what it learned before its training cutoff.
The theory is simple. The reality is full of landmines.
The Three Landmines Nobody Warns You About
1. Chunking is Everything (And You're Probably Doing It Wrong)
My first attempt? Fixed 500-token chunks with no overlap. The result was terrible: about half the retrieved chunks were cut off mid-sentence and lost critical context.
What actually worked:
- Semantic chunking: Split at natural boundaries (headings, paragraph breaks)
- Overlap strategy: 15-20% overlap between adjacent chunks
- Metadata enrichment: Each chunk carries its source article title, category, and position
Reworking chunking along these lines improved answer quality by an estimated 40%.
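Here's a rough sketch of what those three ideas can look like in Python. It is an illustration under stated assumptions, not the production code: `Chunk`, `chunk_article`, `max_words`, and `overlap_ratio` are hypothetical names, whitespace-separated words stand in for tokens, and only blank lines are treated as boundaries (a fuller version would also split on headings).

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    article_title: str  # metadata carried with every chunk
    category: str
    position: int       # index of the chunk within its article


def chunk_article(title, category, body, max_words=200, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks and carry a small overlap forward."""
    # Assumption: blank lines mark the natural boundaries.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Close the current chunk before it would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(Chunk(" ".join(current), title, category, len(chunks)))
            # Seed the next chunk with the tail of the one just closed.
            keep = int(len(current) * overlap_ratio)
            current = current[-keep:] if keep else []
        current.extend(words)
    if current:
        chunks.append(Chunk(" ".join(current), title, category, len(chunks)))
    return chunks
```

Two limitations worth knowing: a single paragraph longer than `max_words` becomes one oversized chunk, and the overlap is counted in words, so the carried-over tail can start mid-sentence even though each chunk ends on a paragraph boundary.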
2. Vector Search Alone Isn't Enough
Pure cosine similarity on embeddings returns results that are topically similar but often miss the mark for specific technical questions.
The winning combo:
- BM25 (keyword search) for exact technical term matching
- Vector similarity for semantic understanding
- Hybrid ranking with a weighted score (0.4 BM25 + 0.6 vector)
This hybrid approach catches both "how to configure PostgreSQL connection pooling" (BM25 wins) and "why is my database slow under load" (vector wins).
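The weighted score is easy to get subtly wrong, because raw BM25 scores and cosine similarities live on different scales. Below is a minimal sketch of the 0.4/0.6 blend. The min-max normalization step and the names `min_max` and `hybrid_rank` are my assumptions for illustration; the inputs are plain dicts mapping a document id to a raw score from each retriever.

```python
def min_max(scores):
    """Scale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, vector_scores, w_bm25=0.4, w_vector=0.6):
    """Blend two {doc_id: score} dicts into one list, best match first."""
    bm25_n = min_max(bm25_scores) if bm25_scores else {}
    vec_n = min_max(vector_scores) if vector_scores else {}
    # A document missing from one retriever contributes 0 for that side.
    doc_ids = set(bm25_n) | set(vec_n)
    combined = {
        d: w_bm25 * bm25_n.get(d, 0.0) + w_vector * vec_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


ranked = hybrid_rank(
    {"a": 10.0, "b": 2.0, "c": 0.0},   # raw BM25 scores
    {"a": 0.2, "b": 0.9, "c": 0.5},    # cosine similarities
)
```

In this toy example document `b` ranks first: it is only a middling keyword match but the strongest semantic match, and the vector side carries more weight.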
3. Context Window is a Budget, Not a Freebie
Most tutorials stuff as many retrieved chunks as possible into the prompt. But every extra token costs money, and padding the prompt with marginally relevant chunks degrades response quality.
My approach:
- Top 5 most relevant chunks (not 10, not 20)
- Intelligent deduplication: Remove near-identical chunks
- Source attribution: Each answer links back to the original article
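A sketch of how that selection step might look, assuming chunks arrive already ranked, best first. Word-set Jaccard similarity with a 0.9 threshold is my stand-in for "near-identical"; the function names and the prompt wording are hypothetical, and each chunk is assumed to be a dict with `text` and `title` keys.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_context(ranked_chunks, top_k=5, dup_threshold=0.9):
    """Keep the best top_k chunks, skipping near-duplicates of kept ones."""
    selected = []
    for chunk in ranked_chunks:
        if any(jaccard(chunk["text"], s["text"]) >= dup_threshold for s in selected):
            continue  # near-identical to something already kept
        selected.append(chunk)
        if len(selected) == top_k:
            break
    return selected


def build_prompt(question, chunks):
    """Number each chunk so the answer can cite its source, e.g. [2]."""
    sources = "\n\n".join(
        f"[{i}] {c['title']}\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Numbering the sources in the prompt is what makes attribution cheap: the application can map each `[n]` in the answer back to an article link.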
The Architecture That Actually Worked
User Question → Query Rewriting → Hybrid Search (BM25 + Vector)
→ Reranking → Top-5 Selection → Prompt Assembly → LLM Response
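One way to keep a flow like this testable is to write it as a single function that takes each stage as a plain callable. The sketch below is hypothetical glue code, not the production implementation: `answer_question` and its parameter names are made up for illustration, and each candidate is assumed to be a dict with a `title` key.

```python
from typing import Callable, Sequence


def answer_question(
    question: str,
    rewrite: Callable[[str], str],
    search: Callable[[str], Sequence[dict]],
    rerank: Callable[[str, Sequence[dict]], Sequence[dict]],
    assemble: Callable[[str, Sequence[dict]], str],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> dict:
    """Run the stages in order and return the answer with its sources."""
    query = rewrite(question)            # query rewriting
    candidates = search(query)           # hybrid search (BM25 + vector)
    ranked = rerank(query, candidates)   # reranking
    context = list(ranked[:top_k])       # top-5 selection
    # Design choice in this sketch: the prompt uses the user's original
    # wording, while retrieval uses the rewritten query.
    prompt = assemble(question, context)
    return {
        "answer": generate(prompt),
        "sources": [c["title"] for c in context],
    }
```

Because every stage is injected, each one can be swapped or stubbed in a test without touching the others, which helps when the retrieval pipeline gets rewritten.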
Key tech choices:
- Vector DB: Supabase (pgvector) — because it's Postgres and you already know SQL
- Embeddings: OpenAI text-embedding-3-small (fast, cheap, good enough)
- Reranking: Custom scoring function (BM25 + vector + recency boost)
- LLM: GPT-4o-mini for cost efficiency on production traffic
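The recency boost in the reranking step can be sketched as a small adjustment on top of a relevance score. Everything specific here is an assumption for illustration: the exponential decay, the 180-day half-life, the 0.1 weight, and the `relevance` and `updated_at` field names are not taken from the production system. `relevance` is assumed to be a combined search score already scaled to [0, 1].

```python
from datetime import datetime, timedelta, timezone


def recency_boost(updated_at, now, half_life_days=180.0):
    """1.0 for a just-updated article, 0.5 after one half-life, and so on."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def rerank(candidates, now, w_recency=0.1):
    """Sort candidates by relevance blended with a small recency term."""
    def final_score(c):
        boost = recency_boost(c["updated_at"], now)
        return (1.0 - w_recency) * c["relevance"] + w_recency * boost
    return sorted(candidates, key=final_score, reverse=True)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
candidates = [
    {"title": "Old but on point", "relevance": 0.9, "updated_at": now - timedelta(days=360)},
    {"title": "Fresh but vague", "relevance": 0.5, "updated_at": now},
    {"title": "Fresh and on point", "relevance": 0.9, "updated_at": now - timedelta(days=1)},
]
ranked = rerank(candidates, now)
```

With a weight this small, recency mostly acts as a tie-breaker: the year-old article that answers the question still outranks the fresh but vague one.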
The Results
- ~2,000 technical articles in the knowledge base
- Sub-second query latency for most questions
- Source-linked answers — every response cites which article it came from
- Cost per query: ~$0.005 (embedding + LLM call)
What I'd Do Differently
- Start with a reranker model from day one — it's worth the extra compute
- Build a feedback loop early — let users thumbs-up/down answers
- Don't over-engineer chunking — start simple, iterate based on actual query patterns
The Bottom Line
RAG isn't magic. It's engineering. And like any engineering problem, the devil is in the details. The gap between "it works on my laptop" and "it works for thousands of users asking weird questions at 3 AM" is filled with chunking strategies, hybrid search tuning, and relentless iteration.
If you're building a RAG system right now: start simple, measure everything, and expect to rewrite your retrieval pipeline at least three times.
Want to dive deeper into building developer tools? Check out my technical knowledge base at codcompass.com — growing weekly with real-world insights on shipping software.