kol kol

Posted on Jun 5

I Built a RAG Chat Assistant from Scratch — Here's What Nobody Tells You About Production RAG

#codcompass #ai #knowledgebase #webdev

Everyone talks about how easy it is to build a RAG (Retrieval-Augmented Generation) chat assistant. Just embed some documents, throw them in a vector database, and connect an LLM, right?

Well, I just spent weeks building a production RAG system for a real knowledge base platform with nearly 2,000 articles. And let me tell you — the gap between a weekend hackathon demo and something that actually works for real users is enormous.

What is RAG, Anyway?

In case you haven't heard the acronym a thousand times this year: RAG is a technique where you retrieve relevant documents from your own knowledge base and feed them into an LLM as context, so the model answers from your data instead of relying only on what it learned before its training cut-off.

The theory is simple. The reality is full of landmines.

The Three Landmines Nobody Warns You About

1. Chunking is Everything (And You're Probably Doing It Wrong)

My first attempt? Splitting articles into fixed 500-token chunks with no overlap. The result was terrible: half the retrieved chunks were cut off mid-sentence, losing critical context.

What actually worked:

Semantic chunking: Split at natural boundaries (headings, paragraph breaks)
Overlap strategy: 15-20% overlap between adjacent chunks
Metadata enrichment: Each chunk carries its source article title, category, and position

Together, these changes improved answer quality by an estimated 40%.
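Here is a minimal sketch of paragraph-aware chunking with overlap and metadata. It is an illustration, not the code from this project: the `Chunk` fields, the `chunk_article` name, and the word-count limit (used as a rough stand-in for a token count) are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str      # source article title
    category: str   # source article category
    position: int   # index of the chunk within its article


def chunk_article(body, title, category, max_words=300, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks, carrying a tail of each chunk into the next."""
    # Blank lines mark paragraph boundaries; headings become short paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    chunks = []
    current = []  # words in the chunk being built (overlap + new text)
    fresh = 0     # how many of those words are new, not carried-over overlap

    def flush():
        nonlocal current, fresh
        chunks.append(Chunk(" ".join(current), title, category, len(chunks)))
        keep = int(len(current) * overlap_ratio)
        current = current[-keep:] if keep else []
        fresh = 0

    for para in paragraphs:
        words = para.split()
        # Close the chunk before a paragraph that would push it over the limit.
        if fresh and len(current) + len(words) > max_words:
            flush()
        current.extend(words)
        fresh += len(words)
    if fresh:
        flush()
    return chunks
```

A paragraph is never split here, so a single paragraph longer than `max_words` produces an oversized chunk. A real pipeline would likely count tokens with the embedding model's tokenizer and split very long paragraphs at sentence boundaries.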

2. Vector Search Alone Isn't Enough

Pure cosine similarity on embeddings returns results that are topically similar but often miss the mark for specific technical questions.

The winning combo:

BM25 (keyword search) for exact technical term matching
Vector similarity for semantic understanding
Hybrid ranking with a weighted score (0.4 BM25 + 0.6 vector)

This hybrid approach catches both "how to configure PostgreSQL connection pooling" (BM25 wins) and "why is my database slow under load" (vector wins).
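The weighted combination can be sketched in plain Python. This is a toy version under stated assumptions: BM25 runs over whitespace tokens, the vectors are passed in as plain lists rather than fetched from an embedding API, and both score lists are min-max normalized before weighting. The post does not say how the scores were normalized, so that step is a guess.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, query_vec, docs, doc_vecs, w_bm25=0.4, w_vec=0.6):
    """Return (doc_index, score) pairs sorted by weighted hybrid score, best first."""
    keyword = min_max(bm25_scores(query, docs))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [w_bm25 * k + w_vec * s for k, s in zip(keyword, semantic)]
    return sorted(enumerate(combined), key=lambda pair: pair[1], reverse=True)
```

In production the keyword side would more likely come from a search index or Postgres full-text search than from an in-memory loop, but the scoring logic is the same idea.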

3. Context Window is a Budget, Not a Freebie

Most tutorials stuff as many retrieved chunks as possible into the prompt. But every extra token costs money, and padding the prompt with marginally relevant chunks tends to degrade response quality.

My approach:

Top 5 most relevant chunks (not 10, not 20)
Intelligent deduplication: Remove near-identical chunks
Source attribution: Each answer links back to the original article
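A rough sketch of that selection step follows. The chunk dictionaries with `text`, `title`, and `url` keys, the Jaccard word-overlap test for "near-identical", and the 0.9 threshold are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-set overlap between two texts: 0.0 (disjoint) to 1.0 (same words)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_context(ranked_chunks, top_k=5, dup_threshold=0.9):
    """Walk chunks best-first, skipping any that nearly duplicates one already kept."""
    selected = []
    for chunk in ranked_chunks:
        if any(jaccard(chunk["text"], kept["text"]) >= dup_threshold for kept in selected):
            continue
        selected.append(chunk)
        if len(selected) == top_k:
            break
    return selected


def build_prompt(question, chunks):
    """Number each chunk and label it with its source so the answer can cite it."""
    sources = "\n\n".join(
        f"[{i}] {c['title']} ({c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Deduplicating before cutting to five matters: if two of the top five are copies of each other, the model effectively gets only four chunks of evidence for the price of five.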

The Architecture That Actually Worked

User Question → Query Rewriting → Hybrid Search (BM25 + Vector) 
  → Reranking → Top-5 Selection → Prompt Assembly → LLM Response
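As a sketch, the flow above maps onto a single function that takes each stage as a callable. Every name here (`answer_question`, the stage parameters, the `title` key on chunks) is hypothetical, and the stages themselves are left for the caller to supply.

```python
from typing import Callable, Sequence


def answer_question(
    question: str,
    rewrite: Callable[[str], str],
    search: Callable[[str], Sequence[dict]],
    rerank: Callable[[str, Sequence[dict]], Sequence[dict]],
    build_prompt: Callable[[str, Sequence[dict]], str],
    llm: Callable[[str], str],
    top_k: int = 5,
) -> dict:
    """Run one question through the pipeline and return the answer with its sources."""
    query = rewrite(question)            # Query Rewriting
    candidates = search(query)           # Hybrid Search (BM25 + Vector)
    ranked = rerank(query, candidates)   # Reranking
    context = list(ranked)[:top_k]       # Top-5 Selection
    prompt = build_prompt(question, context)  # Prompt Assembly
    return {
        "answer": llm(prompt),           # LLM Response
        "sources": [c["title"] for c in context],
    }
```

Keeping the stages as separate callables makes it easy to swap one out (say, a new reranker) and compare results without touching the rest.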

Key tech choices:

Vector DB: Supabase (pgvector) — because it's Postgres and you already know SQL
Embeddings: OpenAI text-embedding-3-small (fast, cheap, good enough)
Reranking: Custom scoring function (BM25 + vector + recency boost)
LLM: GPT-4o-mini for cost efficiency on production traffic
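A custom scoring function with a recency boost could look something like the sketch below. The 0.4 and 0.6 weights come from the hybrid ranking described earlier; the recency weight, the 180-day half-life, and the candidate fields (`bm25` and `vector` already normalized to the 0 to 1 range, plus an `updated_at` timestamp) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def recency_boost(updated_at, now, half_life_days=180.0):
    """Return 1.0 for a brand-new article, halving every `half_life_days`."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def rerank(candidates, now, w_bm25=0.4, w_vec=0.6, w_recency=0.1):
    """Sort candidates by hybrid score plus a small recency bonus, best first."""
    def score(c):
        return (
            w_bm25 * c["bm25"]
            + w_vec * c["vector"]
            + w_recency * recency_boost(c["updated_at"], now)
        )
    return sorted(candidates, key=score, reverse=True)


# Usage: two articles with identical relevance; the fresher one ranks first.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
candidates = [
    {"title": "Old guide", "bm25": 0.8, "vector": 0.7, "updated_at": now - timedelta(days=720)},
    {"title": "New guide", "bm25": 0.8, "vector": 0.7, "updated_at": now - timedelta(days=30)},
]
ranked = rerank(candidates, now)
```

Keeping the recency weight small means freshness only breaks ties between similarly relevant chunks instead of pushing stale but correct articles out of the results.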

The Results

~2,000 technical articles in the knowledge base
Sub-second query latency for most questions
Source-linked answers — every response cites which article it came from
Cost per query: ~$0.005 (embedding + LLM call)

What I'd Do Differently

Start with a reranker model from day one — it's worth the extra compute
Build a feedback loop early — let users thumbs-up/down answers
Don't over-engineer chunking — start simple, iterate based on actual query patterns

The Bottom Line

RAG isn't magic. It's engineering. And like any engineering problem, the devil is in the details. The gap between "it works on my laptop" and "it works for thousands of users asking weird questions at 3 AM" is filled with chunking strategies, hybrid search tuning, and relentless iteration.

If you're building a RAG system right now: start simple, measure everything, and expect to rewrite your retrieval pipeline at least three times.

Want to dive deeper into building developer tools? Check out my technical knowledge base at codcompass.com — growing weekly with real-world insights on shipping software.

DEV Community