ihsan_kutluk

Why Dense Search Fails in Production RAG — And How Hybrid Search Fixes It

I built a RAG system following the standard tutorial approach — embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two completely unrelated chunks about feature engineering. That's when I started digging.

This article explains exactly why this happens — and how hybrid search with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.


The Problem — Dense Search Fails on Exact Keywords

Here's a concrete example. I asked my RAG system:

"What are the advantages of the Transformer architecture over traditional RNNs?"

With dense-only search (ChromaDB + all-MiniLM-L6-v2), the top 3 retrieved chunks were:

| Rank | Chunk ID | Source | Relevant? |
| --- | --- | --- | --- |
| 1 | chunk_4 | nlp_temelleri.txt | ✅ Yes (Transformer & self-attention) |
| 2 | chunk_11 | veri_bilimi.txt | ❌ No (MSE, MAE error metrics) |
| 3 | chunk_8 | veri_bilimi.txt | ❌ No (feature engineering) |

The embedding model treated "model evaluation" and "Transformer model performance" as semantically close, because in embedding space they are. But chunks about error metrics and feature engineering are not what I was asking about, and dense search had no way to know that.


What is Hybrid Search?

Hybrid search combines two fundamentally different retrieval strategies:

Dense Retrieval (Semantic Search)

  • Uses neural embeddings (e.g., all-MiniLM-L6-v2)
  • Captures semantic meaning: "automobile" matches "car"
  • Great for paraphrase-style queries
  • Weak at: exact technical terms, proper nouns, version numbers

Sparse Retrieval (BM25)

  • A classic probabilistic keyword matching algorithm
  • Scores documents using term frequency, inverse document frequency, and document-length normalization (TF-IDF family)
  • Great at: exact keyword matching ("Transformer", "RNN", "CUDA")
  • Weak at: synonyms and semantic variations
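
To make the keyword-matching behaviour concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration under assumptions, not the code from the pipeline (which uses rank_bm25): the tokenizer, the `bm25_scores` name, the smoothed IDF variant, the `k1`/`b` values, and the three toy documents are all chosen for this example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption); real pipelines
    # usually strip punctuation and may stem or remove stop words.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact match, no contribution
            # Smoothed IDF; the "+ 1" keeps it positive for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # k1 saturates term frequency, b normalises by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents, invented for this example.
docs = [
    "the transformer uses self-attention instead of recurrence",
    "mse and mae are common regression error metrics",
    "feature engineering creates new input variables",
]
scores = bm25_scores("transformer vs rnn", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first toy document contains the literal token "transformer", so it is the only one with a non-zero score. That is the exact-match behaviour dense retrieval lacks, and also why BM25 alone misses synonyms.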

Neither is perfect alone. Together, they cover each other's blind spots. A query like "Transformer architecture vs RNN" benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.


Reciprocal Rank Fusion (RRF)

Once you have two ranked lists, one from dense and one from BM25, you need to merge them intelligently. A naive approach (averaging raw scores) fails because the scales are not comparable: ChromaDB returns cosine distances, which are bounded and where lower is better, while BM25 returns unbounded TF-IDF-style scores where higher is better.

RRF solves this with a rank-based formula:

RRF_score(doc) = Σ  1 / (k + rank_i(doc))

Where k is a smoothing constant (typically 60) and rank_i(doc) is the document's 1-based position in the i-th ranked list.

The beauty of RRF is that it only cares about rank position, not raw score magnitudes. A document that ranks #1 in dense and #3 in BM25 will score higher than one that ranks #20 in both (with k = 60, roughly 0.032 versus 0.025), regardless of the underlying score scales. This makes it robust across completely different retrieval systems.
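
The formula fits in a few lines of Python. This is a minimal sketch rather than the implementation from the repository: the `rrf_fuse` name and the toy document IDs are made up for illustration, and it assumes that a document missing from one list simply receives no contribution from that list.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs (best first) with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first. Docs absent from a list get nothing from it
    # (an assumption of this sketch).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy rankings, invented for this example.
dense = ["a", "b", "c", "d"]
bm25 = ["d", "a", "e", "b"]
fused = rrf_fuse([dense, bm25])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)

# Rank #1 and #3 versus rank #20 in both lists, with k = 60.
high = 1 / (60 + 1) + 1 / (60 + 3)
low = 2 / (60 + 20)
print(round(high, 4), round(low, 4))
```

In the toy run, "d" is only fourth in the dense list but first in BM25, which is enough to lift it above "b" in the fused order. Raw scores never enter the calculation, so no normalisation step is needed.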


The Reranker

After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.

Rather than another embedding model, I send all 20 candidates to Gemini in a single prompt:

Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}

This is similar in spirit to a cross-encoder, although strictly speaking it is listwise LLM reranking: the LLM reads the query and all passages together, so it can weigh each passage against the query and against the other candidates, something bi-encoder embedding models, which encode query and passage separately, cannot do. The trade-off is cost and latency, but since we're calling it once per query (not once per document), it's manageable.
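
One way to build that prompt and read the reply back safely is sketched below. The helper names (`build_rerank_prompt`, `parse_ranking`), the 0-based `[i]` passage numbering, and the validation rules are assumptions for illustration. The actual Gemini call is deliberately left out, so this shows only the text going in and the JSON coming out.

```python
import json

def build_rerank_prompt(query, passages, top_n=5):
    """Assemble one prompt containing the query and every numbered passage."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    return (
        f"Given this question: {query}\n"
        f"Rank the following {len(passages)} passages by relevance.\n"
        f'Return only: {{"ranking": [the {top_n} most relevant indices]}}\n\n'
        f"{numbered}"
    )

def parse_ranking(raw_text, n_passages, top_n=5):
    """Extract valid, de-duplicated passage indices from the model's reply."""
    # Tolerate extra text around the JSON object, such as code fences.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    ranking = json.loads(raw_text[start:end + 1])["ranking"]
    seen, valid = set(), []
    for idx in ranking:
        is_index = isinstance(idx, int) and not isinstance(idx, bool)
        # Drop anything that is not an in-range integer or is repeated.
        if is_index and 0 <= idx < n_passages and idx not in seen:
            seen.add(idx)
            valid.append(idx)
    return valid[:top_n]

prompt = build_rerank_prompt("What is self-attention?", ["alpha", "beta"])
picked = parse_ranking('Sure! {"ranking": [3, 0, 3, 99, 7]} done', n_passages=20)
print(picked)
```

Validating the indices matters because an LLM can repeat an index or return one that does not exist. Filtering them here keeps a malformed reply from turning into an `IndexError` further down the pipeline.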

The reranker also includes a retry + fallback mechanism: if the API returns a 503 UNAVAILABLE, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly — so the pipeline never crashes.
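
The retry and fallback logic might look roughly like this. It is a sketch under assumptions: `ServiceUnavailable` is a hypothetical stand-in for whatever exception the client library raises on a 503, `call_reranker` is any callable that returns ranked indices, and "retries up to 3 times" is read here as one initial attempt plus three retries.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the client's 503 UNAVAILABLE error."""

def rerank_with_fallback(call_reranker, candidates, top_n=5,
                         max_retries=3, wait_seconds=5, sleep=time.sleep):
    """Try the LLM reranker; after repeated 503s, keep the fused order."""
    for attempt in range(1 + max_retries):  # first try plus the retries
        try:
            indices = call_reranker(candidates)
            return [candidates[i] for i in indices[:top_n]]
        except ServiceUnavailable:
            if attempt < max_retries:
                sleep(wait_seconds)  # fixed wait between attempts
    # Total failure: candidates are already in RRF order, so take the top N.
    return candidates[:top_n]

def always_down(candidates):
    # Simulated outage, invented for this example.
    raise ServiceUnavailable()

waits = []
result = rerank_with_fallback(always_down, list("abcdefg"), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a testing convenience: the simulated outage above records the waits instead of blocking for 15 seconds.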


Real Results

Here's what happened when I ran the same query with both approaches:

Query: "What are the advantages of the Transformer architecture over traditional RNNs?"

| Rank | Dense Only | Hybrid (Dense + BM25 + RRF) |
| --- | --- | --- |
| 1 | chunk_4 ✅ nlp_temelleri.txt | chunk_4 ✅ nlp_temelleri.txt |
| 2 | chunk_11 ❌ veri_bilimi.txt | chunk_3 ✅ nlp_temelleri.txt |
| 3 | chunk_8 ❌ veri_bilimi.txt | chunk_11 ❌ veri_bilimi.txt |

BM25 caught "Transformer" and "RNN" as exact keywords and boosted chunk_3 — a passage about word embeddings and NLP context — from outside the top 3 into rank #2. The two irrelevant data science chunks dropped out.

Evaluation across 5 questions:

| Metric | Score |
| --- | --- |
| Overall Accuracy | 80% (4/5) |
| Citation Coverage | 14/14 successful citations |
| Hybrid vs Dense | BM25 removed 2 irrelevant chunks |
| Resilience | 503 errors handled via retry + fallback |

Every answer cites its source inline (e.g., [1], [2]) with the actual filename, so users can verify the origin of each claim.


The Stack

| Component | Library |
| --- | --- |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector DB | chromadb |
| Sparse retrieval | rank_bm25 |
| Fusion | Custom RRF implementation |
| Reranker + Generator | Google Gemini API (google-genai) |
| Environment | python-dotenv |

Try It Yourself

🔗 github.com/jasstt/rag_project

git clone https://github.com/jasstt/rag_project.git
cd rag_project
pip install -r requirements.txt
# Add your Gemini API key to .env
python src/ingest.py
python main.py
python src/eval.py

I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.

Top comments (1)

Tae Kim

We run the same BM25+dense+RRF pipeline on tariff and trade news, and the exact-match gap you describe is even sharper on HS codes and duty rate numbers — dense retrieval just conflates numerically close strings. One thing worth noting: using a cloud LLM as the reranker adds latency on every query. For our workload we swapped to a local bge-reranker-v2-m3 (cross-encoder running on GPU), which brought rerank latency down to milliseconds vs. seconds per batch.