DEV Community

kol kol
I Spent $500 on RAG Infrastructure Before Realizing These 7 Mistakes Were Killing My Results

I built a RAG pipeline for private document search. It cost me $500 in vector database compute, weeks of debugging, and a lot of frustration. The results were mediocre — users got irrelevant answers, queries were slow, and the whole thing felt like a fancy keyword search with extra steps.

Then I audited the pipeline step by step. Turns out, I made 7 mistakes that are incredibly common in RAG systems. Fixing them transformed the pipeline from "meh" to genuinely useful.

Here's what I got wrong, and what I changed.

Mistake #1: I Chopped Documents Into Random Pieces

I was splitting documents by fixed token count — 512 tokens per chunk, done. Simple, right?

Wrong. I was destroying semantic context. A paragraph about API authentication would get split mid-sentence, with half in one chunk and half in another. When retrieval ran, the LLM got fragmented context and produced garbage.

The fix: Parent-Document retrieval with semantic chunking.

  1. Split by natural document boundaries first (paragraphs, sections, headers) — these are your "parent documents"
  2. Create smaller child chunks from parents for vector search
  3. When a child chunk matches, return the full parent document to the LLM
  4. Add 10-20% overlap between chunks so boundary information isn't lost

```python
# What I should have done from the start
CHUNK_CONFIG = {
    "chunk_size": 1000,
    "chunk_overlap": 200,  # 20% of chunk_size
    # Try the coarsest boundary first: paragraph, line, sentence, word, character
    "separator": ["\n\n", "\n", ". ", " ", ""],
}
```

Query accuracy jumped 30% after this one change.
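
As a rough illustration, the parent/child steps above can be sketched in plain Python. This is a sketch under assumptions, not the pipeline's actual code: the `Chunk` class and the helper names are hypothetical, sizes are counted in characters rather than tokens, and parents are simply paragraphs separated by blank lines.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: int  # index into the list of parent documents
    text: str       # child text that gets embedded and searched


def split_parents(document: str) -> list[str]:
    """Split on blank lines so each paragraph becomes a parent."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def make_children(parents: list[str], size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Slide a window over each parent; consecutive windows share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    children = []
    for pid, parent in enumerate(parents):
        for start in range(0, len(parent), step):
            children.append(Chunk(pid, parent[start:start + size]))
            if start + size >= len(parent):
                break  # this window already reached the end of the parent
    return children


def parent_for(child: Chunk, parents: list[str]) -> str:
    """Return the full parent to hand to the LLM when a child matches."""
    return parents[child.parent_id]
```

Only the children go into the vector index. At query time, a matching child is swapped for its parent via `parent_for`, so the LLM sees the whole paragraph instead of a fragment.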

Mistake #2: I Used 0.5:0.5 Weights for Hybrid Search

My vector database supports hybrid search — combining vector similarity with keyword (BM25) matching. I left the weights at the default 50/50 split and assumed that was fine.

It wasn't. For technical documentation, exact keyword matches matter far more than a 50/50 split gives them credit for. Someone searching for "HNSW ef_construction" needs that exact term, not a semantically similar but wrong answer.

The fix: Dynamic weights based on query type.

  • Factual queries ("what is X"): 35% vector, 65% keyword
  • Semantic queries ("how do I build X"): 75% vector, 25% keyword
  • General queries: 60% vector, 40% keyword

```python
WEIGHTS = {
    "factual": {"vector": 0.35, "keyword": 0.65},
    "semantic": {"vector": 0.75, "keyword": 0.25},
    "general": {"vector": 0.6, "keyword": 0.4},
}
```

The keyword weight bump for factual queries alone eliminated most of the "almost right but wrong" answers.
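
To make the routing concrete, here is one way the weights could be applied. Treat it as a sketch: `classify_query` is a hypothetical prefix heuristic (a real classifier could be a small model or a rules table), and `hybrid_score` assumes both the vector and BM25 scores have already been normalized to the 0-1 range.

```python
WEIGHTS = {
    "factual": {"vector": 0.35, "keyword": 0.65},
    "semantic": {"vector": 0.75, "keyword": 0.25},
    "general": {"vector": 0.6, "keyword": 0.4},
}


def classify_query(query: str) -> str:
    """Hypothetical heuristic: route on the opening words of the query."""
    q = query.lower().strip()
    if q.startswith(("what is", "what's", "define", "which")):
        return "factual"
    if q.startswith(("how do", "how can", "how to", "why")):
        return "semantic"
    return "general"


def hybrid_score(vector_score: float, keyword_score: float, query: str) -> float:
    """Weighted sum of the two scores; both are assumed to lie in [0, 1]."""
    w = WEIGHTS[classify_query(query)]
    return w["vector"] * vector_score + w["keyword"] * keyword_score
```

Many vector databases accept a single blend parameter per query instead, in which case only the classification step is needed on the application side.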

Mistake #3: I Blew Up My Vector Database's Memory

I set ef_construction to the maximum value because "higher is better, right?" On a 50GB+ index, this meant the index build process consumed all available RAM and crashed. Twice.

The fix: Size-appropriate HNSW parameters.

```python
# Don't max this out or your server will cry
HNSW_CONFIG = {
    "M": 16,                 # connections per node (8-32 is the sweet spot)
    "ef_construction": 200,  # not 400. Not 1000. 200.
    "ef_search": 50,         # query time, not build time
}
```

Index build time went from "it crashed" to 45 minutes. Memory usage dropped 70%.
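
For a quick sanity check before building, a back-of-the-envelope estimate helps. The function below uses a rule of thumb commonly cited for HNSW implementations (float32 vectors plus roughly 2 × M four-byte links per vector at the base layer). It is an approximation only: the function name is made up for this post, it ignores upper layers and build-time working memory, and your database's real overhead may differ.

```python
def hnsw_memory_gb(num_vectors: int, dim: int, m: int) -> float:
    """Rough steady-state index size in GiB for float32 vectors."""
    vector_bytes = dim * 4    # 4 bytes per float32 component
    link_bytes = m * 2 * 4    # ~2*M neighbor ids of 4 bytes each at layer 0
    return num_vectors * (vector_bytes + link_bytes) / 1024**3


# Example: 1M vectors of dimension 768 with M=16
estimate = hnsw_memory_gb(1_000_000, 768, 16)
```

If the estimate is already close to your server's RAM, leave headroom for the build itself, which needs extra working memory on top of the finished index.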

Mistake #4: My Embedding Model Was Too Generic

I was using a general-purpose embedding model trained on Wikipedia and web text. My documents were technical API references and engineering runbooks. The model didn't understand my domain.

The fix: Switch to a model fine-tuned for technical/code content. The difference was night and day — suddenly "migration" and "transform" weren't treated as synonyms just because they're sometimes related in general text.
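
One cheap way to compare candidate models is to check which domain terms they consider near-synonyms. The sketch below assumes you can wrap each model in an `embed(text)` function that returns a vector. `confusable_pairs` and the 0.8 threshold are illustrative choices, not part of any library.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def confusable_pairs(
    embed: Callable[[str], Vector],
    pairs: list[tuple[str, str]],
    threshold: float = 0.8,
) -> list[tuple[str, str]]:
    """Return the term pairs the model scores at or above `threshold`."""
    return [(a, b) for a, b in pairs if cosine(embed(a), embed(b)) >= threshold]
```

Feed it pairs that must stay distinct in your domain, such as ("migration", "transform"). A model that returns many of them is a poor fit for your documents.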

Mistake #5: I Had No Query Rewrite Layer

Users typed natural questions like "why is my build slow" and the system searched for those exact words in technical documentation that said "CI pipeline optimization" and "build duration analysis." Zero overlap. Zero results.

The fix: A lightweight LLM query rewrite step before retrieval.

```text
User query: "why is my build slow"
→ Rewritten: "CI pipeline performance optimization build duration"
→ Retrieved: Relevant documentation ✅
```

This single step improved recall by 40%. The cost? About 0.001 cents per query with a small model.
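
A rewrite layer can be very small. In the sketch below, `llm` is a placeholder for whatever function sends a prompt to your model and returns its text. The prompt wording and the fallback behavior are assumptions for illustration, not a specific vendor's API.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the user question as a short search query using the vocabulary "
    "of technical documentation. Return only the query.\n\nQuestion: {question}"
)


def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    """Ask the model for a rewrite; fall back to the original if it fails."""
    try:
        rewritten = llm(REWRITE_PROMPT.format(question=question)).strip()
    except Exception:
        return question  # retrieval should still run if the rewrite call breaks
    return rewritten or question  # an empty reply also falls back
```

The fallback matters: if the rewrite step errors or times out, the user still gets a search on their original wording. A common variation is to search with both the original and the rewritten query and merge the results.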

Mistake #6: I Didn't Filter Duplicate Context

Retrieving top-10 chunks meant I often got the same paragraph 3 times with slightly different wording. The LLM would repeat itself, hallucinate from the repetition, and produce bloated answers.

The fix: Maximal marginal relevance (MMR) re-ranking.

```python
# Instead of returning the top-10 most similar chunks,
# over-fetch 20 candidates and keep the 10 that are similar AND diverse.
# vector_store and mmr_rerank stand in for your own retriever and re-ranker.
retrieved = vector_store.search(query, k=20)
diverse = mmr_rerank(retrieved, query, lambda_param=0.7, k=10)
```

Answers became more concise and covered more ground.
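
If your vector store doesn't ship an MMR option, the algorithm is short enough to write by hand. This is a minimal pure-Python sketch of standard MMR, with one assumed difference from the call above: it takes the query embedding and a list of `(doc_id, vector)` pairs rather than raw search results, since the exact shape of those depends on your store.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(candidates, query_vec, lambda_param=0.7, k=10):
    """candidates: list of (doc_id, vector). Returns up to k doc_ids."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def mmr_score(item):
            _, vec = item
            relevance = cosine(vec, query_vec)
            # Similarity to the closest chunk already chosen (0 if none yet)
            redundancy = max(
                (cosine(vec, s_vec) for _, s_vec in selected), default=0.0
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

With `lambda_param=1.0` this reduces to plain similarity ranking. Lower values penalize chunks that look like ones already selected, which is what pushes near-duplicates out of the final list.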

Mistake #7: I Never Measured Retrieval Quality

I was evaluating the whole RAG pipeline end-to-end. If the final answer was bad, I didn't know if it was the retrieval, the prompt, or the LLM.

The fix: Separate retrieval evaluation.

  • Track hit rate: does the retrieved context contain the answer?
  • Track MRR (Mean Reciprocal Rank): how high in the results is the right chunk?
  • Build a golden test set of 100 query-document pairs
  • Only optimize the generation layer once retrieval scores are solid

This saved me from chasing the wrong problems for weeks.
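
Both metrics take only a few lines to compute. The sketch below assumes a golden set shaped as `{query: expected_doc_id}` and retrieval results shaped as `{query: [doc_id, ...]}` in ranked order. Those shapes and the function names are illustrative, so adapt them to whatever your retriever returns.

```python
def hit_rate(results: dict[str, list[str]], golden: dict[str, str], k: int = 10) -> float:
    """Share of queries whose expected doc id appears in the top-k results."""
    hits = sum(
        1 for query, expected in golden.items()
        if expected in results.get(query, [])[:k]
    )
    return hits / len(golden)


def mean_reciprocal_rank(results: dict[str, list[str]], golden: dict[str, str]) -> float:
    """Average of 1/rank of the expected doc id; a miss contributes 0."""
    total = 0.0
    for query, expected in golden.items():
        ranked = results.get(query, [])
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(golden)
```

Run both against the golden set after every retrieval change (chunking, weights, embedding model) so regressions show up before they reach the generation layer.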

The Results After All 7 Fixes

| Metric | Before | After |
|---|---|---|
| Answer relevance | ~45% | ~85% |
| Avg query latency | 3.2s | 1.8s |
| Monthly vector DB cost | $180 | $95 |
| Duplicate context in responses | 60% | 8% |

The Takeaway

RAG isn't hard because the algorithms are complex. It's hard because there are 7+ knobs, and every one of them affects the others.

My advice: fix chunking first, then hybrid search weights, then embedding quality. In that order. Everything else is optimization.

What's your biggest RAG headache? Drop it in the comments — I've probably hit it too.
