I built a RAG system following the standard tutorial approach — embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two completely unrelated chunks about feature engineering. That's when I started digging.
This article explains exactly why this happens — and how hybrid search with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.
The Problem — Dense Search Fails on Exact Keywords
Here's a concrete example. I asked my RAG system:
"What are the advantages of the Transformer architecture over traditional RNNs?"
With dense-only search (ChromaDB + all-MiniLM-L6-v2), the top 3 retrieved chunks were:
| Rank | Chunk ID | Source | Relevant? |
|---|---|---|---|
| 1 | chunk_4 | nlp_temelleri.txt | ✅ Yes: Transformer & self-attention |
| 2 | chunk_11 | veri_bilimi.txt | ❌ No: MSE, MAE error metrics |
| 3 | chunk_8 | veri_bilimi.txt | ❌ No: Feature engineering |
The embedding model treated "model evaluation" and "Transformer model performance" as semantically close, because in embedding space they are. But they're not what I was asking about, and dense search had no way to know that.
What is Hybrid Search?
Hybrid search combines two fundamentally different retrieval strategies:
Dense Retrieval (Semantic Search)
- Uses neural embeddings (e.g., all-MiniLM-L6-v2)
- Captures semantic meaning: "automobile" matches "car"
- Great for paraphrase-style queries
- Weak at: exact technical terms, proper nouns, version numbers
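To make the dense side concrete, here is a minimal sketch of ranking by cosine similarity. The 3-dimensional vectors and chunk names are invented for illustration; a real pipeline would get 384-dimensional vectors from all-MiniLM-L6-v2 and let the vector database do the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def dense_rank(query_vec, chunk_vecs):
    """Return chunk IDs sorted by descending cosine similarity to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored]

# Toy vectors, made up for this example (not real embeddings).
toy_chunks = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.7, 0.6, 0.1],
    "chunk_c": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
print(dense_rank(toy_query, toy_chunks))
```

Nothing in this ranking looks at the query's actual words, which is why an exact term like a model name gets no special weight.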
Sparse Retrieval (BM25)
- A classic probabilistic keyword matching algorithm
- Scores documents based on term frequency and inverse document frequency (TF-IDF family)
- Great at: exact keyword matching ("Transformer", "RNN", "CUDA")
- Weak at: synonyms and semantic variations
Neither is perfect alone. Together, they cover each other's blind spots. A query like "Transformer architecture vs RNN" benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.
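For intuition on the sparse side, this is a rough sketch of textbook Okapi BM25 scoring in plain Python. It is not the rank_bm25 implementation (that library computes IDF slightly differently), and the whitespace tokenization, the `k1`/`b` defaults, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # "+1" inside the log keeps IDF non-negative (a common variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus, invented for illustration.
docs = [
    "the transformer replaces recurrence with self attention".split(),
    "mean squared error measures model performance".split(),
    "an rnn processes tokens one step at a time".split(),
]
query = "transformer rnn".split()
print(bm25_scores(query, docs))
```

The second toy document shares no term with the query, so it scores exactly zero. That hard cutoff is what dense retrieval lacks.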
Reciprocal Rank Fusion (RRF)
Once you have two ranked lists, one from dense and one from BM25, you need to merge them intelligently. A naive approach (averaging scores) fails because the score scales are completely different: ChromaDB returns cosine distances, which are bounded and where lower is better, while BM25 returns unbounded TF-IDF-based scores where higher is better.
RRF solves this with a rank-based formula:
RRF_score(doc) = Σ 1 / (k + rank_i(doc))
Where k is a constant (typically 60) and rank_i(doc) is the document's position in the i-th ranked list.
The beauty of RRF is that it only cares about rank position, not raw score magnitudes. A document that ranks #1 in dense and #3 in BM25 will score much higher than one that ranks #20 in both — regardless of the underlying score scales. This makes it robust across completely different retrieval systems.
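Plugging numbers into the formula with k = 60: a document ranked #1 and #3 scores 1/61 + 1/63 ≈ 0.0323, while one ranked #20 in both lists scores 2/80 = 0.025. Below is a minimal sketch of the fusion step. The function name `rrf_fuse` and the toy document IDs are mine, not taken from the project, and ranks are assumed to start at 1.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several best-first lists of doc IDs with Reciprocal Rank Fusion.

    Returns (doc_id, score) pairs sorted by descending fused score.
    A document missing from a list simply gets no contribution from it.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy rankings, invented for illustration.
dense_ranking = ["a", "b", "c", "d"]
sparse_ranking = ["a", "d", "b", "e"]

for doc_id, score in rrf_fuse([dense_ranking, sparse_ranking]):
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear in both lists ("a", "b", "d") end up ahead of those that appear in only one, without any score normalization.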
The Reranker
After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.
Rather than another embedding model, I send all 20 candidates to Gemini in a single prompt:
Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}
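This works like a cross-encoder: the LLM reads the query and the passages together, so it can model interactions between the query and each passage, something bi-encoder embedding models cannot do because they encode the query and the passage separately. Since all 20 passages share one prompt, it is closer to a listwise reranker than to a classic cross-encoder that scores one query-passage pair at a time. The trade-off is cost and latency, but since we're calling it once per query (not once per document), it's manageable.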
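One way to assemble that prompt and read the answer back is sketched here. The helper names, the zero-based `[i]` passage numbering, and the defensive parsing are assumptions of mine; the actual Gemini call is left out, so this only covers the text that goes in and the JSON that comes out.

```python
import json

def build_rerank_prompt(query, passages, top_n=5):
    """Build a single listwise reranking prompt for all candidate passages."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages))
    placeholder = ", ".join(f"idx{i}" for i in range(1, top_n + 1))
    return (
        f"Given this question: {query}\n"
        f"Rank the following {len(passages)} passages by relevance.\n"
        f'Return only: {{"ranking": [{placeholder}]}}\n\n'
        f"{numbered}"
    )

def parse_ranking(response_text, n_passages, top_n=5):
    """Extract valid, de-duplicated passage indices from the model's reply.

    Raises ValueError if the reply contains no usable ranking.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in response")
    data = json.loads(response_text[start:end + 1])  # JSONDecodeError is a ValueError
    ranking = data.get("ranking")
    if not isinstance(ranking, list):
        raise ValueError("missing 'ranking' list")
    seen, result = set(), []
    for idx in ranking:
        is_int = isinstance(idx, int) and not isinstance(idx, bool)
        if is_int and 0 <= idx < n_passages and idx not in seen:
            seen.add(idx)
            result.append(idx)
    if not result:
        raise ValueError("ranking contained no valid indices")
    return result[:top_n]

prompt = build_rerank_prompt("What is RRF?", ["passage one", "passage two"])
print(prompt)
print(parse_ranking('{"ranking": [1, 0]}', n_passages=2))
```

Filtering out-of-range and repeated indices matters because an LLM can return an index that was never in the prompt.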
The reranker also includes a retry + fallback mechanism: if the API returns a 503 UNAVAILABLE, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly — so the pipeline never crashes.
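The retry and fallback logic can be sketched as follows. `ServiceUnavailable` is a hypothetical stand-in for whatever exception the API client raises on a 503, `call_reranker` is any function that returns reordered candidates, and I read "retries up to 3 times" as one initial attempt plus three retries.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the API client's 503 UNAVAILABLE error."""

def rerank_with_fallback(call_reranker, rrf_candidates, top_n=5,
                         max_retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Try the reranker; after repeated 503s, fall back to the RRF order."""
    for attempt in range(max_retries + 1):  # first attempt plus the retries
        try:
            return call_reranker(rrf_candidates)[:top_n]
        except ServiceUnavailable:
            if attempt < max_retries:
                sleep(wait_seconds)  # back off before the next retry
    # Every attempt failed: keep the pipeline alive with the RRF top-N.
    return rrf_candidates[:top_n]

def always_down(candidates):
    """Simulates a reranker that is permanently unavailable."""
    raise ServiceUnavailable("503 UNAVAILABLE")

candidates = [f"doc_{i}" for i in range(20)]
# sleep is stubbed out here so the demo finishes instantly.
print(rerank_with_fallback(always_down, candidates, sleep=lambda _: None))
```

Passing `sleep` in as a parameter is only there to make the behavior testable without waiting 15 seconds.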
Real Results
Here's what happened when I ran the same query with both approaches:
Query: "What are the advantages of the Transformer architecture over traditional RNNs?"
| Rank | Dense Only | Hybrid (Dense + BM25 + RRF) |
|---|---|---|
| 1 | chunk_4 ✅ nlp_temelleri.txt | chunk_4 ✅ nlp_temelleri.txt |
| 2 | chunk_11 ❌ veri_bilimi.txt | chunk_3 ✅ nlp_temelleri.txt |
| 3 | chunk_8 ❌ veri_bilimi.txt | chunk_11 ❌ veri_bilimi.txt |
BM25 caught "Transformer" and "RNN" as exact keywords and boosted chunk_3, a passage about word embeddings and NLP context, from outside the top 3 into rank #2. In the table above, that pushed chunk_11 down to rank #3 and chunk_8 out of the top 3.
Evaluation across 5 questions:
| Metric | Score |
|---|---|
| Overall Accuracy | 80% (4/5) |
| Citation Coverage | 14/14 successful citations |
| Hybrid vs Dense | BM25 removed 2 irrelevant chunks |
| Resilience | 503 errors handled via retry + fallback |
Every answer cites its source inline (e.g., [1], [2]) with the actual filename, so users can verify the origin of each claim.
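A sketch of how numbered citations like these could be wired up and checked. The chunk schema (`text` and `source` keys), the helper names, and the idea of counting a citation as successful when its number maps to a retrieved chunk are my assumptions, not necessarily how the project computes citation coverage.

```python
import re

def build_cited_context(chunks):
    """Number each retrieved chunk so the generator can cite it as [n]."""
    lines = [
        f"[{n}] ({chunk['source']}) {chunk['text']}"
        for n, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join(lines)

def check_citations(answer, n_sources):
    """Return (valid, total): how many [n] markers point at a real source."""
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    valid = [n for n in cited if 1 <= n <= n_sources]
    return len(valid), len(cited)

# Toy chunks and answer, invented for illustration.
chunks = [
    {"source": "nlp_temelleri.txt", "text": "Self-attention processes all tokens in parallel."},
    {"source": "nlp_temelleri.txt", "text": "RNNs process tokens one step at a time."},
]
print(build_cited_context(chunks))
print(check_citations("Transformers parallelize training [1], unlike RNNs [2].", len(chunks)))
```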
The Stack
| Component | Library |
|---|---|
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector DB | chromadb |
| Sparse retrieval | rank_bm25 |
| Fusion | Custom RRF implementation |
| Reranker + Generator | Google Gemini API (google-genai) |
| Environment | python-dotenv |
Try It Yourself
🔗 github.com/jasstt/rag_project
git clone https://github.com/jasstt/rag_project.git
cd rag_project
pip install -r requirements.txt
# Add your Gemini API key to .env
python src/ingest.py
python main.py
python src/eval.py
I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.
Top comments (1)
We run the same BM25+dense+RRF pipeline on tariff and trade news, and the exact-match gap you describe is even sharper on HS codes and duty rate numbers — dense retrieval just conflates numerically close strings. One thing worth noting: using a cloud LLM as the reranker adds latency on every query. For our workload we swapped to a local bge-reranker-v2-m3 (cross-encoder running on GPU), which brought rerank latency down to milliseconds vs. seconds per batch.