After debugging 20+ broken RAG systems, I've identified the 6 decisions that determine whether yours works. Here's how to get each one right.
The RAG Developer's Trap
Every RAG developer falls into the same trap: you build the basic pipeline, it sort of works, and then you spend weeks tweaking prompt templates — while the real problem sits untouched in your indexing pipeline.
The 80/20 rule: 80% of RAG problems come from indexing, not generation. But 80% of debugging effort goes into generation.
Let's fix that.
Decision 1: Embedding Model — The Single Biggest Lever
The mistake: Using all-MiniLM-L6-v2 for Chinese documents because it's the default in every tutorial.
Why it's wrong: all-MiniLM-L6-v2 was trained primarily on English text. Drop it on Chinese text and it loses an estimated 30-50% of semantic fidelity.
| Language | Use This |
|---|---|
| Chinese | BAAI/bge-large-zh-v1.5 (1024-dim) |
| Chinese + English | BAAI/bge-m3 (multilingual + sparse) |
| English | text-embedding-3-large |
| Code | jina-embeddings-v3 or voyage-code-3 |
Non-negotiable: Indexing model and query model must be byte-for-byte identical. Switch models = rebuild entire index.
Impact: +15-40% Recall@10 for Chinese RAG.
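One way to enforce the "same model on both sides" rule is to store the model name next to the index and check it on every query. The sketch below is a minimal illustration: `IndexManifest` and `check_query_model` are hypothetical names, not part of any library, and a real setup would also want to pin the model revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical record saved alongside the vector index at build time."""
    model_name: str  # embedding model used for indexing
    dimension: int   # embedding width produced by that model


def check_query_model(manifest: IndexManifest, model_name: str, dimension: int) -> None:
    """Raise if the query-time model differs from the one used for indexing."""
    if (model_name, dimension) != (manifest.model_name, manifest.dimension):
        raise ValueError(
            f"Index was built with {manifest.model_name} ({manifest.dimension}-dim), "
            f"but the query uses {model_name} ({dimension}-dim). Rebuild the index."
        )


manifest = IndexManifest(model_name="BAAI/bge-large-zh-v1.5", dimension=1024)
check_query_model(manifest, "BAAI/bge-large-zh-v1.5", 1024)  # same model: no error
```

Calling `check_query_model(manifest, "all-MiniLM-L6-v2", 384)` would raise a `ValueError` instead of silently returning poor matches.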
Decision 2: Chunk Size — Not a Magic Number
The trade-off: Too small (< 100 tokens) = semantic fragmentation. Too large (> 1000 tokens) = noise injection.
| Document Type | Sweet Spot (tokens) | Overlap (tokens) |
|---|---|---|
| FAQ / Short-form | 128-256 | 20 |
| Technical docs | 512 | 50 |
| Long-form articles | 768-1024 | 100 |
| Code | Function boundaries | 0 |
The method matters more than the size. Use recursive splitting, not fixed-length:
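    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Note: chunk_size and chunk_overlap are counted in characters by default,
    # not tokens; pass a token-based length_function to match the table above.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512, chunk_overlap=50,
        # Tried in order: paragraphs, lines, sentences, words, then characters.
        separators=["\n\n", "\n", ". ", " ", ""],
    )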
Impact: +5-15% Recall@10.
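For the "Code" row in the table, splitting on function boundaries can be done with a parser instead of a character splitter. Here is a rough sketch for Python source using the standard `ast` module. `split_python_by_function` is a hypothetical helper, it assumes Python 3.8+ for `end_lineno`, and it skips decorators and module-level statements to stay short.

```python
import ast


def split_python_by_function(source: str) -> list[str]:
    """Return one chunk per top-level function or class, with zero overlap."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


example = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''
chunks = split_python_by_function(example)  # two chunks, one per function
```

Other languages need their own parser, but the idea is the same: let the syntax tree decide where a chunk ends.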
Decision 3: Index Type — HNSW vs IVF
| Scale | Use | Why |
|---|---|---|
| < 1M vectors | HNSW | Recall > 0.95 |
| 1-5M, RAM tight | IVF + PQ | 75% memory savings |
| > 5M | IVF + PQ + Sharding | Horizontal scale |
Key nuance: HNSW has a high insertion cost. If documents stream in continuously, IVF may be the better choice even at small scale.
Impact: +5-15% recall or 2-5x latency improvement.
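The table and the streaming caveat can be written down as a small rule-of-thumb function. This is only a sketch of the heuristics above: `choose_index` and its return labels are made up for illustration, and the fallback to HNSW for 1-5M vectors when RAM is not tight is an assumption, since the table does not cover that case.

```python
def choose_index(n_vectors: int, ram_tight: bool = False, streaming: bool = False) -> str:
    """Map the rule-of-thumb table to an index family (labels are illustrative)."""
    if n_vectors > 5_000_000:
        return "IVF+PQ+sharding"
    if n_vectors >= 1_000_000 and ram_tight:
        return "IVF+PQ"
    if streaming:
        # Frequent inserts: IVF may beat HNSW even at small scale.
        return "IVF"
    # Assumption: default to HNSW whenever none of the rules above apply.
    return "HNSW"


print(choose_index(200_000))
print(choose_index(200_000, streaming=True))
print(choose_index(3_000_000, ram_tight=True))
```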
Decision 4: Metadata — Not Optional
Without metadata filtering, every query searches the entire index. Add department=engineering AND publish_date > 2024-01-01 and the candidate set can shrink from 5M to 50K vectors.
    {
      "source": "internal_wiki",
      "doc_type": "design_doc",
      "publish_date": "2024-06-01",
      "department": "engineering"
    }
Bonus: Inject metadata into prompts. LLMs weight credibility based on source and recency.
Impact: +10-25% precision from filtering alone.
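Here is a minimal sketch of both ideas, filtering on metadata before retrieval and then injecting source and date into the prompt context. It uses plain dictionaries rather than any particular vector store. The second record and the helper names `matches` and `format_for_prompt` are invented for the example.

```python
from datetime import date

chunks = [
    {"text": "Design doc for the ingest service.",
     "metadata": {"source": "internal_wiki", "doc_type": "design_doc",
                  "publish_date": "2024-06-01", "department": "engineering"}},
    {"text": "Travel policy.",  # hypothetical second record
     "metadata": {"source": "hr_portal", "doc_type": "policy",
                  "publish_date": "2023-03-15", "department": "hr"}},
]


def matches(metadata: dict, department: str, published_after: date) -> bool:
    """True if the chunk belongs to the department and is newer than the cutoff."""
    published = date.fromisoformat(metadata["publish_date"])
    return metadata["department"] == department and published > published_after


def format_for_prompt(chunk: dict) -> str:
    """Prefix the chunk text with its source and date so the LLM can see them."""
    meta = chunk["metadata"]
    return f"[source: {meta['source']} | published: {meta['publish_date']}]\n{chunk['text']}"


candidates = [c for c in chunks if matches(c["metadata"], "engineering", date(2024, 1, 1))]
context = "\n\n".join(format_for_prompt(c) for c in candidates)
```

In a real system the filter would be pushed down into the vector store's own filter syntax so that similarity search only touches the matching subset.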
Decision 5: Deduplication — Do It Twice
Enterprise knowledge bases are full of duplicates. Without dedup, your top-10 results might be 7 copies of the same document.
- Document-level (before chunking): MinHash + LSH, threshold 0.85
- Chunk-level (after chunking): SimHash, threshold 0.95
Impact: +10-20% effective recall.
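To make the document-level step concrete, here is a small MinHash sketch using only the standard library. It compares signatures pairwise instead of using LSH, which is fine for a few thousand documents but not for large corpora. The shingle size, the 64 hash seeds, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, size: int = 5) -> set[str]:
    """Character n-grams, so it also works for text without word spaces."""
    text = " ".join(text.split())  # normalize whitespace
    return {text[i:i + size] for i in range(max(1, len(text) - size + 1))}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed stands in for a random permutation."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def minhash(text: str, num_perm: int = 64) -> list[int]:
    """Signature made of the minimum hash value under each seed."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(docs: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first document of each near-duplicate group (pairwise, O(n^2))."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


doc_a = "Reset your password from the account settings page."
doc_b = "0123456789 0123456789"
unique = dedupe([doc_a, doc_a, doc_b])  # the repeated doc_a is dropped
```

The chunk-level pass follows the same shape with SimHash and the stricter 0.95 threshold.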
Decision 6: Query Processing — The Other Half
| Technique | When | Cost |
|---|---|---|
| Query rewriting | Short/fuzzy queries | Low |
| HyDE | Factual Q&A, < 10 words | Medium |
| Multi-path recall + RRF | Semantic + exact-match | Medium |
| Cross-Encoder rerank | Post-retrieval refinement | Medium |
Minimum viable stack: Query rewriting + Cross-Encoder rerank.
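For the multi-path recall row in the table, Reciprocal Rank Fusion (RRF) merges ranked lists without needing comparable scores: each document gets 1 / (k + rank) from every list it appears in. A minimal sketch, assuming the commonly used constant k = 60 and made-up document ids:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical semantic (vector) results
keyword = ["doc_b", "doc_d", "doc_a"]  # hypothetical exact-match results
fused = reciprocal_rank_fusion([dense, keyword])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists rise to the top, and the fused list can then go to the Cross-Encoder for reranking.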
The Optimization Priority Stack
First (biggest impact):
- Embedding model language-appropriate?
- Chunk size reasonable (256-768 tokens)?
- Deduplicating?
Second:
- Query rewriting
- Cross-Encoder reranking
- Metadata filtering
Third:
- Multi-path recall + RRF
- HyDE for short queries
How to Know If You Fixed It
Don't guess. Measure. Build a set of 50 (query, ground_truth) pairs and track Recall@10 and MRR (mean reciprocal rank) before and after every change.
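    def recall_at_k(results, ground_truth, k=10):
        """Return 1 if the ground-truth id is in the top k results, else 0."""
        return int(ground_truth in [r.id for r in results[:k]])

    def mrr(results, ground_truth):
        """Reciprocal rank for one query; average over all queries to get MRR."""
        for i, r in enumerate(results):
            if r.id == ground_truth:
                return 1.0 / (i + 1)
        return 0.0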
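Both metrics only mean something once they are averaged over the whole evaluation set. The sketch below shows one way to do that. `evaluate` is a hypothetical helper that takes any `retrieve` function returning a ranked list of document ids, and the toy retriever stands in for a real pipeline.

```python
from statistics import mean
from typing import Callable


def evaluate(eval_set: list[tuple[str, str]],
             retrieve: Callable[[str], list[str]],
             k: int = 10) -> dict[str, float]:
    """Average hit rate at k and reciprocal rank over (query, ground_truth_id) pairs."""
    hits, reciprocal_ranks = [], []
    for query, truth in eval_set:
        ids = retrieve(query)
        hits.append(1.0 if truth in ids[:k] else 0.0)
        reciprocal_ranks.append(1.0 / (ids.index(truth) + 1) if truth in ids else 0.0)
    return {f"recall@{k}": mean(hits), "mrr": mean(reciprocal_ranks)}


# Toy retriever (hypothetical): maps each query to a fixed ranked list of ids.
fake_results = {"q1": ["d1", "d2"], "q2": ["d9", "d2"], "q3": ["d7"]}
report = evaluate([("q1", "d1"), ("q2", "d2"), ("q3", "d3")],
                  lambda q: fake_results[q])
```

Run it once before a change and once after, on the same pairs, so the two reports are directly comparable.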
If this saved you an afternoon of debugging, give it a unicorn and share it with someone who's still tweaking chunk sizes.