ihsan_kutluk

Posted on Jun 7 • Edited on Jun 18

Why Dense Search Fails in Production RAG — And How Hybrid Search Fixes It

#rag #llm #machinelearning #python

I built a RAG system following the standard tutorial approach — embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two completely unrelated chunks about feature engineering. That's when I started digging.

This article explains exactly why this happens — and how hybrid search with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.

The Problem — Dense Search Fails on Exact Keywords

Here's a concrete example. I asked my RAG system:

"What are the advantages of the Transformer architecture over traditional RNNs?"

With dense-only search (ChromaDB + all-MiniLM-L6-v2), the top 3 retrieved chunks were:

Rank	Chunk ID	Source	Relevant?
1	`chunk_4`	nlp_temelleri.txt	✅ Yes — Transformer & self-attention
2	`chunk_11`	veri_bilimi.txt	❌ No — MSE, MAE error metrics
3	`chunk_8`	veri_bilimi.txt	❌ No — Feature engineering

The model saw "model evaluation" and "Transformer model performance" as semantically close — because they are, in embedding space. But they're not what I was asking about. Dense search had no way to know that.

What is Hybrid Search?

Hybrid search combines two fundamentally different retrieval strategies:

Dense Retrieval (Semantic Search)

Uses neural embeddings (e.g., all-MiniLM-L6-v2)
Captures semantic meaning: "automobile" matches "car"
Great for paraphrase-style queries
Weak at: exact technical terms, proper nouns, version numbers
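
To make the dense side concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It assumes the embeddings are already computed; the 3-dimensional vectors, the chunk names, and the function names are made up for illustration and are not taken from my pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def dense_top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, similarity) pairs closest to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (hypothetical); real all-MiniLM-L6-v2 embeddings have 384 dimensions.
toy_vectors = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.7, 0.3, 0.1],
    "chunk_c": [0.0, 0.2, 0.9],
}
results = dense_top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```

Nothing in this scoring looks at the words themselves, which is why an exact term like "RNN" gets no special treatment.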

Sparse Retrieval (BM25)

A classic probabilistic keyword matching algorithm
Scores documents based on term frequency and inverse document frequency (TF-IDF family)
Great at: exact keyword matching ("Transformer", "RNN", "CUDA")
Weak at: synonyms and semantic variations
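
For intuition, here is a small pure-Python sketch of BM25 scoring. It is not the `rank_bm25` code; the whitespace tokenizer, the `k1` and `b` defaults, and the IDF variant (with `+ 1` inside the log so it stays non-negative) are assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against the query with Okapi-style BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical sentences, not the chunks from my documents).
docs = [
    "the transformer uses self-attention instead of recurrence",
    "an rnn processes tokens one step at a time",
    "mean squared error is a common regression metric",
]
scores = bm25_scores("transformer rnn", docs)
```

The third document shares no term with the query, so its score is exactly zero. That hard cut-off is what dense similarity lacks.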

Neither is perfect alone. Together, they cover each other's blind spots. A query like "Transformer architecture vs RNN" benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.

Reciprocal Rank Fusion (RRF)

Once you have two ranked lists, one from dense and one from BM25, you need to merge them intelligently. A naive approach (averaging scores) fails because the score scales are completely different: ChromaDB returns cosine distances, where lower means closer, while BM25 returns unbounded TF-IDF-based scores, where higher means more relevant.

RRF solves this with a rank-based formula:

RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))

Where k is a constant (typically 60) and rank_i(doc) is the document's position in the i-th ranked list, counted from 1. A document that is missing from a list simply contributes nothing for that list.

The beauty of RRF is that it only cares about rank position, not raw score magnitudes. With k = 60, a document that ranks #1 in dense and #3 in BM25 scores 1/61 + 1/63 ≈ 0.0323, while one that ranks #20 in both scores 2/80 = 0.025, regardless of the underlying score scales. This makes it robust across completely different retrieval systems.
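
The formula fits in a few lines of Python. This is a sketch of the technique rather than the exact code in my repo; the function name and the toy document IDs are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one list of (doc_id, score).

    Ranks are 1-based; a doc missing from a list adds nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy rankings (hypothetical IDs, not results from my pipeline).
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_a", "doc_d", "doc_b"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `doc_a` wins because both retrievers put it first, and `doc_b` comes second because it appears in both lists, even though `doc_d` outranks it in the BM25 list.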

The Reranker

After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.

Rather than another embedding model, I send all 20 candidates to Gemini in a single prompt:

Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}

This is similar in spirit to a cross-encoder: the LLM reads the query and all passages together, so it can weigh each passage directly against the query. Bi-encoder embedding models cannot do that, because they encode the query and each passage separately. The trade-off is cost and latency, but since we're calling it once per query (not once per document), it's manageable.
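
Here is a rough sketch of the two halves around the model call: building the listwise prompt and parsing the reply defensively. The helper names, the zero-based indexing, and the idea that the reply may arrive wrapped in extra text are my assumptions for this example. The Gemini call itself is left out.

```python
import json

def build_rerank_prompt(query, passages, top_k=5):
    """Build one prompt that lists every candidate passage with its index."""
    header = [
        f"Given this question: {query}",
        f"Rank the following {len(passages)} passages by relevance.",
        f'Return only: {{"ranking": [...]}} listing the {top_k} best passage indices.',
    ]
    body = [f"[{i}] {text}" for i, text in enumerate(passages)]
    return "\n".join(header + [""] + body)

def parse_ranking(response_text, num_passages, top_k=5):
    """Extract a clean list of passage indices from the model's reply.

    Raises ValueError when no usable ranking can be recovered.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reranker response")
    data = json.loads(response_text[start:end + 1])  # JSONDecodeError is a ValueError
    raw = data.get("ranking")
    if not isinstance(raw, list):
        raise ValueError("reranker response has no 'ranking' list")
    ranking = []
    for idx in raw:
        is_int = isinstance(idx, int) and not isinstance(idx, bool)
        # Drop out-of-range indices and duplicates instead of trusting the model.
        if is_int and 0 <= idx < num_passages and idx not in ranking:
            ranking.append(idx)
    if not ranking:
        raise ValueError("reranker response contained no valid indices")
    return ranking[:top_k]

prompt = build_rerank_prompt("What is RRF?", ["passage zero", "passage one", "passage two"])
reply = '```json\n{"ranking": [2, 0, 2, 9, 1]}\n```'  # hypothetical model reply
top = parse_ranking(reply, num_passages=3)
```

Validating the indices matters because an LLM can repeat an index or invent one that does not exist.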

The reranker also includes a retry + fallback mechanism: if the API returns a 503 UNAVAILABLE, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly — so the pipeline never crashes.
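
A minimal sketch of that retry and fallback logic is below. `ServiceUnavailable` is a stand-in exception I made up for the example (the real client raises its own error type), and I'm reading "retries up to 3 times" as one initial attempt plus three retries.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a 503 UNAVAILABLE error from the API client."""

def rerank_with_fallback(call_reranker, rrf_ranking, top_k=5,
                         max_retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Try the LLM reranker; on repeated 503s fall back to the RRF order.

    call_reranker is any zero-argument callable that returns a ranked list.
    """
    for attempt in range(1 + max_retries):
        try:
            return call_reranker()[:top_k]
        except ServiceUnavailable:
            if attempt < max_retries:
                sleep(wait_seconds)  # wait before the next retry
    # Every attempt failed: keep the pipeline alive with the RRF top-k.
    return rrf_ranking[:top_k]

# Usage with a flaky fake reranker and a no-op sleep so the demo runs instantly.
attempts = []

def flaky_reranker():
    attempts.append(1)
    if len(attempts) < 2:
        raise ServiceUnavailable()
    return [7, 3, 12, 0, 5, 9]

result = rerank_with_fallback(flaky_reranker, list(range(20)), sleep=lambda s: None)
```

Passing `sleep` in as a parameter is just a convenience so tests do not have to wait 5 seconds per retry.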

Real Results

Here's what happened when I ran the same query with both approaches:

Query: "What are the advantages of the Transformer architecture over traditional RNNs?"

Rank	Dense Only	Hybrid (Dense + BM25 + RRF)
1	`chunk_4` ✅ nlp_temelleri.txt	`chunk_4` ✅ nlp_temelleri.txt
2	`chunk_11` ❌ veri_bilimi.txt	`chunk_3` ✅ nlp_temelleri.txt
3	`chunk_8` ❌ veri_bilimi.txt	`chunk_11` ❌ veri_bilimi.txt

BM25 caught "Transformer" and "RNN" as exact keywords and boosted chunk_3 (a passage about word embeddings and NLP context) from outside the top 3 into rank #2. Of the two irrelevant data science chunks, chunk_8 dropped out of the top 3 and chunk_11 slipped from rank #2 to rank #3.

Evaluation across 5 questions:

Metric	Score
Overall Accuracy	80% (4/5)
Citation Coverage	14/14 successful citations
Hybrid vs Dense	BM25 removed 2 irrelevant chunks
Resilience	503 errors handled via retry + fallback

Every answer cites its source inline (e.g., [1], [2]) with the actual filename, so users can verify the origin of each claim.
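
One way to get citations like this is to number the chunks before they go into the generation prompt and keep a matching source list. The sketch below assumes each chunk is a dict with `text` and `source` keys; that shape, the function name, and the placeholder texts are illustrative.

```python
def build_cited_context(chunks):
    """Return (context, sources) where both use the same [n] numbering."""
    context_lines = []
    source_lines = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{number}] {chunk['text']}")
        source_lines.append(f"[{number}] {chunk['source']}")
    return "\n".join(context_lines), "\n".join(source_lines)

# Placeholder chunk texts (hypothetical); the filenames are the ones used above.
chunks = [
    {"text": "Placeholder text about self-attention.", "source": "nlp_temelleri.txt"},
    {"text": "Placeholder text about error metrics.", "source": "veri_bilimi.txt"},
]
context, sources = build_cited_context(chunks)
```

The model is then asked to cite with the bracketed numbers, and the source list is appended to the answer so each number maps back to a file.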

The Stack

Component	Library
Embeddings	`sentence-transformers` (`all-MiniLM-L6-v2`)
Vector DB	`chromadb`
Sparse retrieval	`rank_bm25`
Fusion	Custom RRF implementation
Reranker + Generator	Google Gemini API (`google-genai`)
Environment	`python-dotenv`

Try It Yourself

🔗 github.com/jasstt/rag_project

git clone https://github.com/jasstt/rag_project.git
cd rag_project
pip install -r requirements.txt
# Add your Gemini API key to .env
python src/ingest.py
python main.py
python src/eval.py

I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.

v1.1 Update: Community Feedback in Action

Shortly after publishing the initial version of this pipeline, I received some incredible feedback from the engineering community. I've integrated three major improvements directly into the codebase:

1. Sentence-Aware Chunking
Instead of blindly cutting text at 500 characters, src/ingest.py now uses NLTK/regex to detect sentence boundaries. It never splits a sentence in half, and it specifically preserves table-like structures (e.g., lists with pipes or colons) by keeping those rows together. This drastically improves the semantic quality of the chunks.
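
As a simplified illustration of the idea, here is a regex-only sketch that packs whole sentences into chunks of at most 500 characters. It is not the code in src/ingest.py: the sentence regex is deliberately naive (it will trip over abbreviations like "e.g."), and the table-preservation logic is left out.

```python
import re

# Split after ., ! or ? when followed by whitespace (a naive assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into sentences without dropping the end punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

def chunk_by_sentences(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized
    chunk rather than being cut in half.
    """
    chunks = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "One two. Three four. Five six."
small_chunks = chunk_by_sentences(sample, max_chars=20)
```

With `max_chars=20`, the first two sentences fit together in one chunk and the third starts a new one, so no sentence is ever split.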

2. Skip-Rerank Optimization
LLM rerankers introduce latency. To reduce it, I added a confidence check in src/rerank.py. If the top-1 result from RRF scores significantly higher than the top-2 result (configured via SKIP_RERANK_THRESHOLD), the pipeline assumes high confidence and skips the LLM reranker entirely, cutting rerank latency to near zero for easy questions.
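
The check itself can be very small. In this sketch I treat the threshold as a relative margin between the top two RRF scores, since raw RRF scores are tiny. That interpretation and the value 0.15 are assumptions for illustration, not necessarily how src/rerank.py defines SKIP_RERANK_THRESHOLD.

```python
SKIP_RERANK_THRESHOLD = 0.15  # hypothetical value: top-1 must lead top-2 by 15%

def should_skip_rerank(fused, threshold=SKIP_RERANK_THRESHOLD):
    """Decide whether the RRF result is confident enough to skip reranking.

    fused is a list of (doc_id, rrf_score) sorted by score, best first.
    """
    if len(fused) < 2:
        return True  # nothing to reorder
    top1_score = fused[0][1]
    top2_score = fused[1][1]
    if top2_score <= 0:
        return True
    relative_margin = (top1_score - top2_score) / top2_score
    return relative_margin >= threshold

# Toy fused lists (hypothetical scores).
clear_winner = [("doc_a", 0.0328), ("doc_b", 0.0161)]
close_call = [("doc_a", 0.0328), ("doc_b", 0.0320)]
skip_clear = should_skip_rerank(clear_winner)
skip_close = should_skip_rerank(close_call)
```

A document ranked first by both retrievers ends up with roughly double the score of one found by only a single retriever, which is the kind of gap this check is meant to catch.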

3. Local Cross-Encoder Reranker
To remove the hard dependency on Gemini for reranking, I integrated cross-encoder/ms-marco-MiniLM-L-6-v2. You can now set RERANK_MODE = "local" in the config to run a fully offline, local cross-encoder that evaluates interactions between the query and the retrieved chunks without hitting any external APIs.
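
A local reranker along these lines could look like the sketch below. The scoring function is passed in so the ranking logic can be tried without downloading a model. `load_cross_encoder_scorer` wraps the `CrossEncoder` class from `sentence-transformers` and its `predict` method; the helper names and the keyword-overlap stand-in scorer are made up for this example.

```python
def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a function that scores (query, passage) pairs with a local model.

    The import is lazy so the rest of this sketch runs without the library.
    """
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(pairs)

def rerank_local(query, passages, score_pairs, top_k=5):
    """Return the indices of the top_k passages, best first."""
    scores = score_pairs([(query, passage) for passage in passages])
    order = sorted(range(len(passages)), key=lambda i: float(scores[i]), reverse=True)
    return order[:top_k]

def overlap_scorer(pairs):
    """Stand-in scorer for the demo: counts shared lowercase words."""
    return [len(set(q.lower().split()) & set(p.lower().split())) for q, p in pairs]

passages = ["mse metric", "transformer vs rnn", "rnn cells"]
best = rerank_local("transformer rnn", passages, overlap_scorer, top_k=2)
```

To use the real model, pass `load_cross_encoder_scorer()` in place of `overlap_scorer`. The model weights are fetched on first load, and after that scoring runs locally.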

Building in public is a cheat code. A huge thanks to the community for the suggestions.

Top comments (3)

Tae Kim • Jun 7

We run the same BM25+dense+RRF pipeline on tariff and trade news, and the exact-match gap you describe is even sharper on HS codes and duty rate numbers — dense retrieval just conflates numerically close strings. One thing worth noting: using a cloud LLM as the reranker adds latency on every query. For our workload we swapped to a local bge-reranker-v2-m3 (cross-encoder running on GPU), which brought rerank latency down to milliseconds vs. seconds per batch.

Gunjan Tailor • Jun 8

Solid breakdown — RRF being rank-based instead of score-based is exactly why it survives mixing BM25 and cosine. One thing I'd add from going down this same road: the exact-match gap gets worse when the keyword you need was sitting in a table that blind chunking already flattened into "45.2% Q3 Europe" with no headers. BM25 can't match a term that ingestion destroyed — so some of what looks like a retrieval problem is actually an ingestion problem one step upstream. I built docnest (BM25 + ANN + RRF) around that, and found ~70% of factual queries resolve at the keyword/precomputed layer with zero LLM tokens — the reranker only earns its latency on genuinely ambiguous queries. +1 to the local cross-encoder point in the comments.

Ahmet Özel • Jun 8

Strong agreement on hybrid search. One thing I would add from running a chunking/embedding API in production: a lot of "dense search fails" cases are actually upstream chunking problems. If chunks split mid-sentence or shred table rows away from their headers, even hybrid retrieval struggles because the unit being embedded is incoherent. Sentence-aware chunking plus keeping tables intact fixed more of my retrieval misses than tuning the dense/sparse weighting did. Curious whether you rerank after the hybrid merge, or rely on the fusion score alone?