kol kol

Posted on Jun 20

I Spent $500 on RAG Infrastructure Before Realizing These 7 Mistakes Were Killing My Results

#codcompass #ai #knowledgebase #webdev

I built a RAG pipeline for private document search. It cost me $500 in vector database compute, weeks of debugging, and a lot of frustration. The results were mediocre — users got irrelevant answers, queries were slow, and the whole thing felt like a fancy keyword search with extra steps.

Then I audited the pipeline step by step. Turns out, I made 7 mistakes that are incredibly common in RAG systems. Fixing them transformed the pipeline from "meh" to genuinely useful.

Here's what I got wrong, and what I changed.

Mistake #1: I Chopped Documents Into Random Pieces

I was splitting documents by fixed token count — 512 tokens per chunk, done. Simple, right?

Wrong. I was destroying semantic context. A paragraph about API authentication would get split mid-sentence, with half in one chunk and half in another. When retrieval ran, the LLM got fragmented context and produced garbage.

The fix: Parent-Document retrieval with semantic chunking.

Split by natural document boundaries first (paragraphs, sections, headers) — these are your "parent documents"
Create smaller child chunks from parents for vector search
When a child chunk matches, return the full parent document to the LLM
Add 10-20% overlap between chunks so boundary information isn't lost

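# What I should have done from the start
CHUNK_CONFIG = {
    "chunk_size": 1000,     # units (characters or tokens) depend on your splitter
    "chunk_overlap": 200,   # 20% of chunk_size, the top of the 10-20% range
    # Natural boundaries, coarsest first: paragraph, line, then sentence enders
    "separator": ["\n\n", "\n", "。", "！", "？"],
}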

Query accuracy jumped 30% after this one change.
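The parent/child steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline I actually ran: the function names, the paragraph-based parent split, and the character-window sizes are all assumptions, and the vector search that picks the matching children is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent this child was cut from


def split_parents(document: str) -> list[str]:
    """Split on blank lines so each paragraph becomes a parent."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def split_children(parent: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Cut a parent into overlapping character windows (20% overlap by default)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    stop = max(len(parent) - overlap, 1)
    return [parent[i:i + size] for i in range(0, stop, step)]


def build_index(document: str) -> tuple[list[str], list[Child]]:
    """Return the parents plus every child tagged with its parent's index."""
    parents = split_parents(document)
    children = [
        Child(text=piece, parent_id=pid)
        for pid, parent in enumerate(parents)
        for piece in split_children(parent)
    ]
    return parents, children


def expand_to_parents(matched: list[Child], parents: list[str]) -> list[str]:
    """Map matched children back to unique parents, keeping match order."""
    seen: set[int] = set()
    result = []
    for child in matched:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result
```

You embed and search the children, then pass whatever matched through expand_to_parents before building the prompt, so the LLM sees whole paragraphs instead of fragments.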

Mistake #2: I Used 0.5:0.5 Weights for Hybrid Search

My vector database supports hybrid search — combining vector similarity with keyword (BM25) matching. I left the weights at the default 50/50 split and assumed that was fine.

It wasn't. For technical documentation, exact keyword matches matter far more than an even split allows. Someone searching for "HNSW ef_construction" needs that exact term, not a semantically similar but wrong answer.

The fix: Dynamic weights based on query type.

Factual queries ("what is X"): 35% vector, 65% keyword
Semantic queries ("how do I build X"): 75% vector, 25% keyword
General queries: 60% vector, 40% keyword

WEIGHTS = {
    "factual": {"vector": 0.35, "keyword": 0.65},
    "semantic": {"vector": 0.75, "keyword": 0.25},
    "general": {"vector": 0.6, "keyword": 0.4},
}

The keyword weight bump for factual queries alone eliminated most of the "almost right but wrong" answers.
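Here is a rough sketch of how those weights could be applied if you fuse the two score lists yourself. The prefix-based classify_query heuristic, the min-max normalization, and the function names are assumptions for illustration; many vector databases expose a single weight parameter and do the fusion internally.

```python
WEIGHTS = {
    "factual": {"vector": 0.35, "keyword": 0.65},
    "semantic": {"vector": 0.75, "keyword": 0.25},
    "general": {"vector": 0.6, "keyword": 0.4},
}


def classify_query(query: str) -> str:
    """Hypothetical prefix heuristic; a real system might use a classifier."""
    q = query.lower().strip()
    if q.startswith(("what is", "what are", "which", "define")):
        return "factual"
    if q.startswith(("how do", "how can", "how to", "why")):
        return "semantic"
    return "general"


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    query: str,
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores, best document first."""
    w = WEIGHTS[classify_query(query)]
    v, k = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc: w["vector"] * v.get(doc, 0.0) + w["keyword"] * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarity is not, so a weighted sum of raw values would be dominated by whichever scale is larger.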

Mistake #3: I Blew Up My Vector Database's Memory

I set ef_construction to the maximum value because "higher is better, right?" On a 50GB+ index, this meant the index build process consumed all available RAM and crashed. Twice.

The fix: Size-appropriate HNSW parameters.

# Don't max this out — your server will cry
HNSW_CONFIG = {
    "M": 16,              # connections per node (8-32 is the sweet spot)
    "ef_construction": 200,  # not 400. Not 1000. 200.
    "ef_search": 50,       # query time, not build time
}

Index build time went from "it crashed" to 45 minutes. Memory usage dropped 70%.
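Before kicking off a build, a back-of-the-envelope size check can catch an obviously oversized index. The sketch below uses a common rule of thumb for float32 HNSW indexes (4 bytes per dimension plus roughly 2 * M four-byte links per vector). It estimates the finished index only, not the extra working memory the build itself needs, and the vector count and dimension in the usage note are made-up numbers rather than my real dataset.

```python
def hnsw_index_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Approximate size of a float32 HNSW index: vector data plus graph links."""
    vector_bytes = dim * 4      # float32 components
    link_bytes = m * 2 * 4      # about 2*M neighbor ids of 4 bytes each
    return num_vectors * (vector_bytes + link_bytes)


def fits_in_ram(index_bytes: int, ram_bytes: int, headroom: float = 0.5) -> bool:
    """True if the index uses at most `headroom` of RAM (0.5 is an assumed margin)."""
    return index_bytes <= ram_bytes * headroom
```

For example, hnsw_index_bytes(1_000_000, 768, 16) comes to 3.2 GB, which fits_in_ram accepts on a 16 GB machine and rejects on a 4 GB one. Treat the result as a floor, since build-time overhead sits on top of it.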

Mistake #4: My Embedding Model Was Too Generic

I was using a general-purpose embedding model trained on Wikipedia and web text. My documents were technical API references and engineering runbooks. The model didn't understand my domain.

The fix: Switch to a model fine-tuned for technical/code content. The difference was night and day — suddenly "migration" and "transform" weren't treated as synonyms just because they're sometimes related in general text.

Mistake #5: I Had No Query Rewrite Layer

Users typed natural questions like "why is my build slow" and the system searched for those exact words in technical documentation that said "CI pipeline optimization" and "build duration analysis." Zero overlap. Zero results.

The fix: A lightweight LLM query rewrite step before retrieval.

User query: "why is my build slow"
→ Rewritten: "CI pipeline performance optimization build duration"
→ Retrieved: Relevant documentation ✅

This single step improved recall by 40%. The cost? About 0.001 cents per query with a small model.
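A minimal sketch of that rewrite step is below. The prompt wording is an assumption, and `complete` stands in for whatever small-model call you use (any function that takes a prompt string and returns text); it is not a real library API.

```python
from typing import Callable

# Assumed prompt wording; tune it to your own documentation's vocabulary.
REWRITE_PROMPT = (
    "Rewrite the user question as a short search query using the vocabulary "
    "of technical documentation. Return only the query.\n\n"
    "Question: {question}\nSearch query:"
)


def rewrite_query(question: str, complete: Callable[[str], str]) -> str:
    """Return the rewritten query, falling back to the original on failure."""
    try:
        rewritten = complete(REWRITE_PROMPT.format(question=question)).strip()
    except Exception:
        # If the model call fails, search with the user's own words.
        return question
    # An empty completion is treated the same as a failure.
    return rewritten or question
```

The fallback is the important design choice: the rewrite layer should never be able to make retrieval worse than having no rewrite at all.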

Mistake #6: I Didn't Filter Duplicate Context

Retrieving top-10 chunks meant I often got the same paragraph 3 times with slightly different wording. The LLM would repeat itself, hallucinate from the repetition, and produce bloated answers.

The fix: Maximal marginal relevance (MMR) re-ranking.

# Instead of returning top-10 most similar
# Return top-10 most similar AND diverse
retrieved = vector_store.search(query, k=20)
diverse = mmr_rerank(retrieved, query, lambda_param=0.7, k=10)

Answers became more concise and covered more ground.
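The snippet above calls mmr_rerank without defining it. One way to write it in plain Python is sketched here, assuming each candidate is a (doc_id, embedding) pair and the query has already been embedded; your vector store may ship its own MMR option, in which case use that instead.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(candidates, query_vec, lambda_param=0.7, k=10):
    """Greedy maximal marginal relevance over (doc_id, vector) pairs.

    lambda_param=1.0 ranks by relevance only; lower values penalize
    candidates that resemble something already selected.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def mmr_score(item):
            _, vec = item
            relevance = cosine(vec, query_vec)
            redundancy = max(
                (cosine(vec, chosen) for _, chosen in selected), default=0.0
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

Each round picks the candidate with the best trade-off between similarity to the query and similarity to what has already been chosen, which is why near-identical paragraphs stop crowding the top 10.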

Mistake #7: I Never Measured Retrieval Quality

I was only evaluating the RAG pipeline end-to-end. If the final answer was bad, I couldn't tell whether the cause was the retrieval, the prompt, or the LLM.

The fix: Separate retrieval evaluation.

Track hit rate: does the retrieved context contain the answer?
Track MRR (Mean Reciprocal Rank): how high in the results is the right chunk?
Build a golden test set of 100 query-document pairs
Only optimize the generation layer once retrieval scores are solid

This saved me from chasing the wrong problems for weeks.
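Both metrics take only a few lines to compute. The sketch below assumes the golden set maps each query to exactly one correct chunk ID and that `results` holds the ranked chunk IDs your retriever returned for each query; the data shapes and names are illustrative.

```python
def hit_rate(results: dict[str, list[str]], golden: dict[str, str], k: int = 10) -> float:
    """Fraction of golden queries whose correct chunk appears in the top k."""
    hits = sum(
        1 for query, relevant in golden.items()
        if relevant in results.get(query, [])[:k]
    )
    return hits / len(golden)


def mean_reciprocal_rank(
    results: dict[str, list[str]], golden: dict[str, str], k: int = 10
) -> float:
    """Average of 1/rank of the correct chunk; a miss contributes 0."""
    total = 0.0
    for query, relevant in golden.items():
        ranked = results.get(query, [])[:k]
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(golden)
```

Run both on every pipeline change. A hit rate that holds steady while MRR drops means the right chunk is still retrieved but is sliding down the ranking.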

The Results After All 7 Fixes

Metric	Before	After
Answer relevance	~45%	~85%
Avg query latency	3.2s	1.8s
Monthly vector DB cost	$180	$95
Duplicate context in responses	60%	8%

The Takeaway

RAG isn't hard because the algorithms are complex. It's hard because there are 7+ knobs, and they all interact with each other.

My advice: fix chunking first, then hybrid search weights, then embedding quality. In that order. Everything else is optimization.

What's your biggest RAG headache? Drop it in the comments — I've probably hit it too.
