ihsan_kutluk

Why Dense Search Fails in Production RAG — And How Hybrid Search Fixes It

I built a RAG system following the standard tutorial approach — embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two completely unrelated chunks about feature engineering. That's when I started digging.

This article explains exactly why this happens — and how hybrid search with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.


The Problem — Dense Search Fails on Exact Keywords

Here's a concrete example. I asked my RAG system:

"What are the advantages of the Transformer architecture over traditional RNNs?"

With dense-only search (ChromaDB + all-MiniLM-L6-v2), the top 3 retrieved chunks were:

| Rank | Chunk ID | Source | Relevant? |
|------|----------|--------|-----------|
| 1 | chunk_4 | nlp_temelleri.txt | ✅ Yes: Transformer & self-attention |
| 2 | chunk_11 | veri_bilimi.txt | ❌ No: MSE, MAE error metrics |
| 3 | chunk_8 | veri_bilimi.txt | ❌ No: Feature engineering |

The model saw "model evaluation" and "Transformer model performance" as semantically close — because they are, in embedding space. But they're not what I was asking about. Dense search had no way to know that.


What is Hybrid Search?

Hybrid search combines two fundamentally different retrieval strategies:

Dense Retrieval (Semantic Search)

  • Uses neural embeddings (e.g., all-MiniLM-L6-v2)
  • Captures semantic meaning: "automobile" matches "car"
  • Great for paraphrase-style queries
  • Weak at: exact technical terms, proper nouns, version numbers

Sparse Retrieval (BM25)

  • A classic probabilistic keyword matching algorithm
  • Scores documents based on term frequency and inverse document frequency (TF-IDF family)
  • Great at: exact keyword matching ("Transformer", "RNN", "CUDA")
  • Weak at: synonyms and semantic variations

Neither is perfect alone. Together, they cover each other's blind spots. A query like "Transformer architecture vs RNN" benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.
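
To make the keyword side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is a common textbook variant with a smoothed IDF, not the exact formula used by rank_bm25, and the toy corpus, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (textbook BM25 variant)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # no exact match, no contribution
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus, made up for illustration (not the article's data).
corpus = [
    "the transformer replaces recurrence in the rnn with self attention",
    "mse and mae are common error metrics for model evaluation",
    "feature engineering improves model performance",
]
tokenized = [doc.split() for doc in corpus]
scores = bm25_scores("transformer rnn".split(), tokenized)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A paraphrase such as "recurrent network" would score zero here, which is the blind spot dense retrieval covers.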


Reciprocal Rank Fusion (RRF)

Once you have two ranked lists, one from dense and one from BM25, you need to merge them intelligently. A naive approach (averaging scores) fails because the score scales are not comparable: ChromaDB returns cosine distances, where lower is better, while BM25 returns unbounded TF-IDF-based scores, where higher is better.

RRF solves this with a rank-based formula:

RRF_score(doc) = Σ  1 / (k + rank_i(doc))

Where k is a constant (typically 60) and rank_i(doc) is the document's position in the i-th ranked list.

The beauty of RRF is that it only cares about rank position, not raw score magnitudes. A document that ranks #1 in dense and #3 in BM25 scores higher than one that ranks #20 in both (with k = 60, roughly 0.032 versus 0.025), regardless of the underlying score scales. This makes it robust across completely different retrieval systems.
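
The formula translates into a few lines of Python. This is a minimal sketch rather than the repository's implementation; the function name `rrf_fuse` and the document IDs are hypothetical, and ranks are assumed to start at 1.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Rank positions start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two retrievers.
dense_ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
bm25_ranking = ["doc_d", "doc_a"]

fused = rrf_fuse([dense_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

In this toy input, `doc_d` is last in the dense list but first in BM25, and fusion lifts it above `doc_b` and `doc_c`, which appear in only one list. A document missing from a list simply contributes nothing for that list.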


The Reranker

After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.

Rather than another embedding model, I send all 20 candidates to Gemini in a single prompt:

Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}

This works like a cross-encoder (strictly speaking, it is listwise reranking): the LLM reads the query and all passages together, so it can model interactions between the query and each passage, something bi-encoder embedding models cannot do because they embed the query and the passages separately. The trade-off is cost and latency, but since the call happens once per query (not once per document), it's manageable.
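
A sketch of how such a prompt could be assembled and its reply validated, in plain Python. The helper names `build_rerank_prompt` and `parse_ranking` are hypothetical, zero-based passage indices are an assumption, and the actual Gemini call is left out on purpose.

```python
import json

def build_rerank_prompt(query, passages, top_n=5):
    """Number the passages so the model can refer to them by index."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages))
    return (
        f"Given this question: {query}\n"
        f"Rank the following {len(passages)} passages by relevance.\n"
        f'Return only: {{"ranking": [...]}} with the indices of the top {top_n}.\n\n'
        f"{numbered}"
    )

def parse_ranking(response_text, n_passages, top_n=5):
    """Return validated passage indices, or None if the reply is unusable."""
    try:
        ranking = json.loads(response_text)["ranking"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(ranking, list):
        return None
    seen, valid = set(), []
    for idx in ranking:
        # Drop non-integers, out-of-range indices, and duplicates.
        is_int = isinstance(idx, int) and not isinstance(idx, bool)
        if is_int and 0 <= idx < n_passages and idx not in seen:
            seen.add(idx)
            valid.append(idx)
    return valid[:top_n] or None

prompt = build_rerank_prompt("What is RRF?", ["first passage", "second passage"])
parsed = parse_ranking('{"ranking": [3, 0, 3, 99, 1]}', n_passages=5)
```

Validating the indices matters because the model's output is untrusted text: duplicates and out-of-range indices are dropped, and an unparseable reply returns `None` so the caller can fall back to the fused order. Replies wrapped in code fences are not handled in this sketch.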

The reranker also includes a retry + fallback mechanism: if the API returns a 503 UNAVAILABLE, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly — so the pipeline never crashes.
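
The retry and fallback logic can be sketched as follows. `ServiceUnavailable` is a hypothetical stand-in for whatever exception the client library raises on a 503, `call_reranker` is any function that returns the reordered candidates, and reading "retries up to 3 times" as one initial attempt plus three retries is an assumption.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the client library's 503 UNAVAILABLE error."""

def rerank_with_fallback(call_reranker, candidates, top_n=5,
                         retries=3, wait_seconds=5, sleep=time.sleep):
    """Try the reranker; after repeated 503s, return the fused order unchanged."""
    for attempt in range(1 + retries):
        try:
            return call_reranker(candidates)[:top_n]
        except ServiceUnavailable:
            if attempt < retries:
                sleep(wait_seconds)  # fixed wait before the next attempt
    # Total failure: fall back to the top of the RRF list.
    return candidates[:top_n]

# Simulate an API that is always down; record waits instead of sleeping.
attempts, waits = [], []

def always_down(candidates):
    attempts.append(1)
    raise ServiceUnavailable()

result = rerank_with_fallback(always_down, list(range(20)), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the function testable without real delays. Only the 503 case is caught here; other errors propagate so that genuine bugs are not silently masked by the fallback.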


Real Results

Here's what happened when I ran the same query with both approaches:

Query: "What are the advantages of the Transformer architecture over traditional RNNs?"

| Rank | Dense Only | Hybrid (Dense + BM25 + RRF) |
|------|------------|-----------------------------|
| 1 | chunk_4 ✅ nlp_temelleri.txt | chunk_4 ✅ nlp_temelleri.txt |
| 2 | chunk_11 ❌ veri_bilimi.txt | chunk_3 ✅ nlp_temelleri.txt |
| 3 | chunk_8 ❌ veri_bilimi.txt | chunk_11 ❌ veri_bilimi.txt |

BM25 caught "Transformer" and "RNN" as exact keywords and boosted chunk_3, a passage about word embeddings and NLP context, from outside the top 3 into rank #2. Of the two irrelevant data science chunks, chunk_8 dropped out of the top 3 and chunk_11 slipped from rank #2 to #3.

Evaluation across 5 questions:

| Metric | Score |
|--------|-------|
| Overall Accuracy | 80% (4/5) |
| Citation Coverage | 14/14 successful citations |
| Hybrid vs Dense | BM25 removed 2 irrelevant chunks |
| Resilience | 503 errors handled via retry + fallback |

Every answer cites its source inline (e.g., [1], [2]) with the actual filename, so users can verify the origin of each claim.
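
One way to get citations like these is to number the chunks before they go into the generation prompt. The sketch below assumes each retrieved chunk arrives as a `(filename, text)` pair. The helper name and the sample texts are hypothetical, and the repository may do this differently.

```python
def build_cited_context(chunks):
    """Turn (filename, text) pairs into a numbered context block and a source legend."""
    context_lines, legend_lines = [], []
    for number, (filename, text) in enumerate(chunks, start=1):
        # The model sees the same [n] markers it is asked to cite.
        context_lines.append(f"[{number}] ({filename}) {text}")
        legend_lines.append(f"[{number}] {filename}")
    return "\n".join(context_lines), "\n".join(legend_lines)

# Placeholder texts; only the filenames come from the article.
context, legend = build_cited_context([
    ("nlp_temelleri.txt", "Placeholder text about self-attention."),
    ("veri_bilimi.txt", "Placeholder text about error metrics."),
])
```

The context block goes into the prompt, and the legend can be appended to the answer so each [n] maps back to a filename.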


The Stack

| Component | Library |
|-----------|---------|
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector DB | chromadb |
| Sparse retrieval | rank_bm25 |
| Fusion | Custom RRF implementation |
| Reranker + Generator | Google Gemini API (google-genai) |
| Environment | python-dotenv |

Try It Yourself

🔗 github.com/jasstt/rag_project

git clone https://github.com/jasstt/rag_project.git
cd rag_project
pip install -r requirements.txt
# Add your Gemini API key to .env
python src/ingest.py
python main.py
python src/eval.py

I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.


v1.1 Update: Community Feedback in Action

Shortly after publishing the initial version of this pipeline, I received some incredible feedback from the engineering community. I've integrated three major improvements directly into the codebase:

1. Sentence-Aware Chunking
Instead of blindly cutting text at 500 characters, src/ingest.py now uses NLTK/regex-based sentence boundary detection, so a chunk never ends mid-sentence. It also preserves table-like structures (e.g., lists with pipes or colons) by keeping those rows together. Each chunk is now a coherent unit of text, which improves the quality of what gets embedded and indexed.
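
A minimal regex-only sketch of the idea: split into sentences, then pack whole sentences into chunks up to a size limit. The function name is hypothetical and the splitting rule is a rough heuristic (it will mis-split abbreviations such as "e.g."). The table-preservation logic described above is not included.

```python
import re

def sentence_chunks(text, max_chars=500):
    """Pack whole sentences into chunks of at most max_chars where possible."""
    # Split after ., ! or ? when followed by whitespace (rough heuristic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "One two three. Four five six! Seven eight nine? Ten."
chunks = sentence_chunks(sample, max_chars=30)
```

A single sentence longer than `max_chars` becomes its own oversized chunk rather than being cut, which trades a strict size limit for never splitting mid-sentence.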

2. Skip-Rerank Optimization
LLM rerankers introduce latency. To reduce it, I added a confidence check in src/rerank.py. If the top result from RRF scores significantly higher than the second result (the margin is configured via SKIP_RERANK_THRESHOLD), the pipeline assumes high confidence and skips the LLM reranker entirely, removing the reranking latency for easy questions.
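
The check itself can be very small. The sketch below assumes the threshold is an absolute gap between the two best fused scores; the source does not say whether the real check uses a gap or a ratio, and the value 0.005 is a made-up example.

```python
# Hypothetical value; assumes the threshold is an absolute score gap.
SKIP_RERANK_THRESHOLD = 0.005

def should_skip_rerank(fused_scores, threshold=SKIP_RERANK_THRESHOLD):
    """fused_scores: RRF scores sorted in descending order."""
    if len(fused_scores) < 2:
        return True  # zero or one candidate: nothing to reorder
    # Skip the reranker when the leader is clearly ahead of the runner-up.
    return (fused_scores[0] - fused_scores[1]) >= threshold

clear_winner = should_skip_rerank([0.0328, 0.0160])
close_race = should_skip_rerank([0.0325, 0.0320])
```

With k = 60 and two lists, RRF scores top out at 2/61 (about 0.033), so a useful gap threshold has to be correspondingly small.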

3. Local Cross-Encoder Reranker
To remove the hard dependency on Gemini for reranking, I integrated cross-encoder/ms-marco-MiniLM-L-6-v2. You can now set RERANK_MODE = "local" in the config to run a fully offline cross-encoder that scores each query-chunk pair without hitting any external APIs.


Building in public is a cheat code. A huge thanks to the community for the suggestions.

Top comments (3)

Tae Kim

We run the same BM25+dense+RRF pipeline on tariff and trade news, and the exact-match gap you describe is even sharper on HS codes and duty rate numbers — dense retrieval just conflates numerically close strings. One thing worth noting: using a cloud LLM as the reranker adds latency on every query. For our workload we swapped to a local bge-reranker-v2-m3 (cross-encoder running on GPU), which brought rerank latency down to milliseconds vs. seconds per batch.

Gunjan Tailor

Solid breakdown — RRF being rank-based instead of score-based is exactly why it survives mixing BM25 and cosine. One thing I'd add from going down this same road: the exact-match gap gets worse when the keyword you need was sitting in a table that blind chunking already flattened into "45.2% Q3 Europe" with no headers. BM25 can't match a term that ingestion destroyed — so some of what looks like a retrieval problem is actually an ingestion problem one step upstream. I built docnest (BM25 + ANN + RRF) around that, and found ~70% of factual queries resolve at the keyword/precomputed layer with zero LLM tokens — the reranker only earns its latency on genuinely ambiguous queries. +1 to the local cross-encoder point in the comments.

Ahmet Özel

Strong agreement on hybrid search. One thing I would add from running a chunking/embedding API in production: a lot of "dense search fails" cases are actually upstream chunking problems. If chunks split mid-sentence or shred table rows away from their headers, even hybrid retrieval struggles because the unit being embedded is incoherent. Sentence-aware chunking plus keeping tables intact fixed more of my retrieval misses than tuning the dense/sparse weighting did. Curious whether you rerank after the hybrid merge, or rely on the fusion score alone?