DEV Community

马国锦


Build Your RAG System Right the First Time: 6 Decisions That Make or Break It

After debugging 20+ broken RAG systems, I've identified the 6 decisions that determine whether yours works. Here's how to get each one right.


The RAG Developer's Trap

Every RAG developer falls into the same trap: you build the basic pipeline, it sort of works, and then you spend weeks tweaking prompt templates — while the real problem sits untouched in your indexing pipeline.

My 80/20 rule of thumb: roughly 80% of RAG problems come from indexing, not generation, yet roughly 80% of debugging effort goes into generation.

Let's fix that.


Decision 1: Embedding Model — The Single Biggest Lever

The mistake: Using all-MiniLM-L6-v2 for Chinese documents because it's the default in every tutorial.

Why it's wrong: It was trained predominantly on English text. Drop it on Chinese text and it loses 30-50% of semantic fidelity.

| Language | Use This |
|---|---|
| Chinese | BAAI/bge-large-zh-v1.5 (1024-dim) |
| Chinese + English | BAAI/bge-m3 (multilingual + sparse) |
| English | text-embedding-3-large |
| Code | jina-embeddings-v3 or voyage-code-3 |

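Non-negotiable: documents and queries must be embedded with exactly the same model, including its version, output dimension, and normalization. Vectors from different models do not share a vector space, so switching models means rebuilding the entire index.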
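One way to enforce this is to store a small record of the embedding setup next to the index and check it before serving queries. The sketch below is only an illustration: `EmbeddingConfig` and `assert_compatible` are hypothetical names, not part of any library, and the fields you track may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical record of how the vectors in an index were produced."""
    model_name: str
    dimension: int
    normalize: bool


def assert_compatible(index_cfg: EmbeddingConfig, query_cfg: EmbeddingConfig) -> None:
    """Raise if the query-side embedder differs from the one that built the index."""
    if index_cfg != query_cfg:
        raise ValueError(
            f"Embedding mismatch: index built with {index_cfg}, "
            f"queries use {query_cfg}. Rebuild the index or switch back."
        )


# Assumed values for illustration; persist index_cfg alongside the index itself.
index_cfg = EmbeddingConfig("BAAI/bge-large-zh-v1.5", 1024, normalize=True)
query_cfg = EmbeddingConfig("BAAI/bge-large-zh-v1.5", 1024, normalize=True)
assert_compatible(index_cfg, query_cfg)  # passes silently when the setups match
```

Failing loudly at startup is cheaper than discovering a silent recall drop weeks later.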

Impact: +15-40% Recall@10 for Chinese RAG.


Decision 2: Chunk Size — Not a Magic Number

The trade-off: chunks that are too small (< 100 tokens) fragment meaning; chunks that are too large (> 1000 tokens) inject noise.

| Document Type | Sweet Spot (tokens) | Overlap (tokens) |
|---|---|---|
| FAQ / Short-form | 128-256 | 20 |
| Technical docs | 512 | 50 |
| Long-form articles | 768-1024 | 100 |
| Code | Function boundaries | 0 |
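For the code row, "function boundaries" can be taken literally. Here is a minimal sketch for Python source using the standard `ast` module; `chunk_python_source` is a hypothetical helper, and it deliberately ignores module-level statements, decorators, and nested definitions.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, with zero overlap."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


sample = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_python_source(sample):
    print(chunk, end="\n---\n")
```

Other languages need their own parser, but the idea is the same: let syntax decide where a chunk ends rather than a character count.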

The method matters more than the size. Use recursive splitting, not fixed-length:

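from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap count characters by default (length_function=len),
# not tokens; use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to count tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    # Tried in order: paragraphs, lines, sentences, words, then single characters.
    separators=["\n\n", "\n", ". ", " ", ""],
)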

Impact: +5-15% Recall@10.


Decision 3: Index Type — HNSW vs IVF

| Scale | Use | Why |
|---|---|---|
| < 1M vectors | HNSW | Recall > 0.95 |
| 1-5M, RAM tight | IVF + PQ | 75% memory savings |
| > 5M | IVF + PQ + Sharding | Horizontal scale |

Key nuance: HNSW inserts are expensive because every new vector has to be linked into the graph. If documents stream in continuously, IVF may be the better choice even at small scale.
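To see where the memory savings come from, it helps to run the arithmetic. The sketch below is a rough back-of-the-envelope estimate with hypothetical helper names; it counts only vector storage and ignores HNSW graph links, PQ codebooks, and IVF centroids. The saving depends on the code size you pick: one byte per dimension against float32 works out to 75%.

```python
def flat_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for uncompressed float32 vectors."""
    return n_vectors * dim * bytes_per_float


def pq_bytes(n_vectors: int, n_subquantizers: int, bits_per_code: int = 8) -> int:
    """Storage for the PQ codes only: one code per subquantizer per vector."""
    return n_vectors * n_subquantizers * bits_per_code // 8


# Assumed workload: 5M vectors at 1024 dimensions.
n, dim = 5_000_000, 1024
raw = flat_bytes(n, dim)
for m in (1024, 256, 64):  # number of subquantizers, i.e. bytes per vector at 8 bits
    compressed = pq_bytes(n, n_subquantizers=m)
    print(f"m={m}: {compressed / 1e9:.2f} GB vs {raw / 1e9:.2f} GB raw, "
          f"saving {1 - compressed / raw:.1%}")
```

Smaller codes save more memory but cost recall, so treat the number of subquantizers as something to tune against your evaluation set.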

Impact: +5-15% recall or 2-5x latency improvement.


Decision 4: Metadata — Not Optional

Without metadata filtering, every query searches the entire index. Add a filter such as department = engineering AND publish_date > 2024-01-01 and the candidate set can shrink from 5M to 50K vectors.

{"source": "internal_wiki", "doc_type": "design_doc",
 "publish_date": "2024-06-01", "department": "engineering"}
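Here is a minimal sketch of filter-then-rank in plain Python, assuming each record carries a `meta` dict shaped like the one above. `filtered_search` and the toy records are hypothetical; a real vector database would apply the filter inside the index rather than in a list comprehension.

```python
import math
from datetime import date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, records, department, published_after, k=10):
    """Filter on metadata first, then rank only the survivors by similarity."""
    candidates = [
        r for r in records
        if r["meta"]["department"] == department
        and date.fromisoformat(r["meta"]["publish_date"]) > published_after
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return candidates[:k]


records = [
    {"id": "a", "vector": [1.0, 0.0],
     "meta": {"department": "engineering", "publish_date": "2024-06-01"}},
    {"id": "b", "vector": [0.9, 0.1],
     "meta": {"department": "sales", "publish_date": "2024-06-01"}},
    {"id": "c", "vector": [0.0, 1.0],
     "meta": {"department": "engineering", "publish_date": "2023-03-15"}},
]
hits = filtered_search([1.0, 0.0], records, "engineering", date(2024, 1, 1))
print([r["id"] for r in hits])  # ['a']
```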

Bonus: Inject metadata into prompts. Given the source and publish date, an LLM can weigh credibility and recency when retrieved chunks disagree.
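A sketch of what that injection could look like, assuming retrieved chunks carry the metadata fields shown earlier. `format_context`, the prompt wording, and the sample chunk text are all illustrative, not a recommended template.

```python
def format_context(chunks):
    """Render retrieved chunks with source and date so provenance is visible to the LLM."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["meta"]
        header = (
            f"[{i}] source={meta['source']} type={meta['doc_type']} "
            f"published={meta['publish_date']}"
        )
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


# Made-up chunk for illustration only.
chunks = [{
    "text": "The ingestion service retries failed batches three times.",
    "meta": {"source": "internal_wiki", "doc_type": "design_doc",
             "publish_date": "2024-06-01"},
}]
prompt = (
    "Answer using only the context below. Prefer newer sources when they conflict.\n\n"
    + format_context(chunks)
    + "\n\nQuestion: How many times are failed batches retried?"
)
print(prompt)
```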

Impact: +10-25% precision from filtering alone.


Decision 5: Deduplication — Do It Twice

Enterprise knowledge bases are full of duplicates. Without dedup, your top-10 results might be 7 copies of the same document.

  1. Document-level (before chunking): MinHash + LSH, threshold 0.85
  2. Chunk-level (after chunking): SimHash, threshold 0.95
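The document-level pass can be sketched with nothing but the standard library. This is a simplified illustration under stated assumptions: it compares MinHash signatures pairwise and leaves out the LSH banding step that makes the approach scale, and the shingle size, hash choice, and function names are mine, not from a specific library.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles; these also work for text without whitespace word breaks."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i : i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """Keep the minimum hash per seeded hash function; seeds stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, which approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.85) -> bool:
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold


print(is_near_duplicate("Reset your password from the login page.",
                        "Reset  your password from the login page."))
```

At corpus scale you would bucket signatures with LSH so that only likely duplicates are compared, instead of checking every pair.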

Impact: +10-20% effective recall.


Decision 6: Query Processing — The Other Half

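| Technique | When | Cost |
|---|---|---|
| Query rewriting | Short/fuzzy queries | Low |
| HyDE (Hypothetical Document Embeddings) | Factual Q&A, < 10 words | Medium |
| Multi-path recall + RRF (Reciprocal Rank Fusion) | Semantic + exact-match | Medium |
| Cross-Encoder rerank | Post-retrieval refinement | Medium |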

Minimum viable stack: Query rewriting + Cross-Encoder rerank.
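When you do add multi-path recall, Reciprocal Rank Fusion is short enough to write by hand. A minimal sketch, assuming each retriever returns a ranked list of document ids; `k=60` is the conventional constant, and the example ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


dense_hits = ["d3", "d1", "d7"]    # e.g. from vector search
keyword_hits = ["d1", "d9", "d3"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
# ['d1', 'd3', 'd9', 'd7']
```

RRF only uses ranks, so it sidesteps the problem of dense and keyword scores living on different scales.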


The Optimization Priority Stack

First (biggest impact):
  - Embedding model language-appropriate?
  - Chunk size reasonable (256-768)?
  - Deduplicating?

Second:
  - Query rewriting
  - Cross-Encoder reranking
  - Metadata filtering

Third:
  - Multi-path recall + RRF
  - HyDE for short queries

How to Know If You Fixed It

Don't guess. Measure. Build an evaluation set of about 50 (query, ground_truth) pairs and track Recall@10 and MRR (mean reciprocal rank) before and after every change.

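def recall_at_k(results, ground_truth, k=10):
    """Return 1 if the ground-truth id is among the top k results, else 0."""
    return int(ground_truth in [r.id for r in results[:k]])

def mrr(results, ground_truth):
    """Reciprocal rank for one query; average it over all queries to get MRR."""
    for rank, r in enumerate(results, start=1):
        if r.id == ground_truth:
            return 1.0 / rank
    return 0.0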
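Both metrics only mean something once averaged over the whole evaluation set. Here is a self-contained sketch of that loop; `evaluate`, the `Hit` type, and the canned retriever are hypothetical stand-ins for your own pipeline.

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["id"])  # hypothetical result type exposing an .id attribute


def evaluate(eval_set, retrieve, k=10):
    """Average hit rate at k and reciprocal rank over (query, ground_truth_id) pairs."""
    hits, rr_sum = 0, 0.0
    for query, truth in eval_set:
        ids = [r.id for r in retrieve(query)]
        hits += truth in ids[:k]
        rr_sum += 1.0 / (ids.index(truth) + 1) if truth in ids else 0.0
    n = len(eval_set)
    return {"recall@k": hits / n, "mrr": rr_sum / n}


# Toy retriever standing in for a real pipeline.
canned = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e"]}


def retrieve(query):
    return [Hit(doc_id) for doc_id in canned[query]]


eval_set = [("q1", "a"), ("q2", "d"), ("q3", "z")]
print(evaluate(eval_set, retrieve))
```

Run it before and after each change, and keep the evaluation set fixed so the numbers stay comparable.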

If this saved you an afternoon of debugging, give it a unicorn and share it with someone who's still tweaking chunk sizes.
