<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ihsan_kutluk</title>
    <description>The latest articles on DEV Community by ihsan_kutluk (@jasstt).</description>
    <link>https://dev.to/jasstt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972978%2F4354481d-8507-4a11-9678-46db70fe31f2.png</url>
      <title>DEV Community: ihsan_kutluk</title>
      <link>https://dev.to/jasstt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jasstt"/>
    <language>en</language>
    <item>
      <title>Why Dense Search Fails in Production RAG — And How Hybrid Search Fixes It</title>
      <dc:creator>ihsan_kutluk</dc:creator>
      <pubDate>Sun, 07 Jun 2026 21:10:38 +0000</pubDate>
      <link>https://dev.to/jasstt/why-dense-search-fails-in-production-rag-and-how-hybrid-search-fixes-it-237k</link>
      <guid>https://dev.to/jasstt/why-dense-search-fails-in-production-rag-and-how-hybrid-search-fixes-it-237k</guid>
      <description>&lt;p&gt;I built a RAG system following the standard tutorial approach: embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two unrelated chunks, one about error metrics and one about feature engineering. That's when I started digging.&lt;/p&gt;

&lt;p&gt;This article explains exactly why this happens — and how &lt;strong&gt;hybrid search&lt;/strong&gt; with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem — Dense Search Fails on Exact Keywords
&lt;/h2&gt;

&lt;p&gt;Here's a concrete example. I asked my RAG system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the advantages of the Transformer architecture over traditional RNNs?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With &lt;strong&gt;dense-only search&lt;/strong&gt; (ChromaDB + &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;), the top 3 retrieved chunks were:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Chunk ID&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Relevant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chunk_4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;nlp_temelleri.txt&lt;/td&gt;
&lt;td&gt;✅ Yes — Transformer &amp;amp; self-attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chunk_11&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;veri_bilimi.txt&lt;/td&gt;
&lt;td&gt;❌ No — MSE, MAE error metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chunk_8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;veri_bilimi.txt&lt;/td&gt;
&lt;td&gt;❌ No — Feature engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The embedding model saw "model evaluation" and "Transformer model performance" as semantically close, because in embedding space they are. But they're not what I was asking about, and dense search alone has no signal that would tell the two apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Hybrid Search?
&lt;/h2&gt;

&lt;p&gt;Hybrid search combines two fundamentally different retrieval strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense Retrieval (Semantic Search)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses neural embeddings (e.g., &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Captures semantic meaning: "automobile" matches "car"&lt;/li&gt;
&lt;li&gt;Great for paraphrase-style queries&lt;/li&gt;
&lt;li&gt;Weak at: exact technical terms, proper nouns, version numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sparse Retrieval (BM25)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A classic probabilistic keyword matching algorithm&lt;/li&gt;
&lt;li&gt;Scores documents based on term frequency, inverse document frequency, and document length normalization (TF-IDF family)&lt;/li&gt;
&lt;li&gt;Great at: exact keyword matching ("Transformer", "RNN", "CUDA")&lt;/li&gt;
&lt;li&gt;Weak at: synonyms and semantic variations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is perfect alone. Together, they cover each other's blind spots. A query like &lt;em&gt;"Transformer architecture vs RNN"&lt;/em&gt; benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.&lt;/p&gt;
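
&lt;p&gt;To make the sparse side concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the repository's code (which relies on &lt;code&gt;rank_bm25&lt;/code&gt;). The function name &lt;code&gt;bm25_scores&lt;/code&gt;, the whitespace tokenizer, the sample documents, and the parameter values &lt;code&gt;k1=1.5&lt;/code&gt; and &lt;code&gt;b=0.75&lt;/code&gt; are all assumptions.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # no exact match, no contribution
            # Smoothed IDF (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document length normalization.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus, not the chunks from the article.
docs = [
    "the transformer uses self-attention instead of recurrence",
    "feature engineering improves model performance",
    "an rnn processes tokens one step at a time",
]
scores = bm25_scores("transformer rnn", docs)
print(scores)
```

&lt;p&gt;The feature engineering sentence scores exactly zero because it shares no token with the query. That hard cutoff is the property dense retrieval lacks.&lt;/p&gt;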




&lt;h2&gt;
  
  
  Reciprocal Rank Fusion (RRF)
&lt;/h2&gt;

&lt;p&gt;Once you have two ranked lists (one from dense, one from BM25) you need to merge them intelligently. A naive approach (averaging scores) fails because the score scales are not comparable: ChromaDB returns cosine distances, where lower is better, while BM25 returns unbounded TF-IDF-based scores, where higher is better.&lt;/p&gt;

&lt;p&gt;RRF solves this with a rank-based formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score(doc) = Σ  1 / (k + rank_i(doc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;k&lt;/code&gt; is a constant (typically 60), &lt;code&gt;rank_i(doc)&lt;/code&gt; is the document's position (starting at 1) in the i-th ranked list, and the sum runs over every list that contains the document.&lt;/p&gt;

&lt;p&gt;The beauty of RRF is that it only cares about &lt;em&gt;rank position&lt;/em&gt;, not raw score magnitudes. A document that ranks #1 in dense and #3 in BM25 will score much higher than one that ranks #20 in both — regardless of the underlying score scales. This makes it robust across completely different retrieval systems.&lt;/p&gt;
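
&lt;p&gt;The formula above can be sketched in a few lines of Python. This is a minimal version written for this explanation, not necessarily identical to the custom implementation in the repository. The function name &lt;code&gt;rrf_fuse&lt;/code&gt; and the document IDs are hypothetical.&lt;/p&gt;

```python
def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers.
dense = ["a", "b", "c"]
bm25 = ["b", "d", "a"]
fused = rrf_fuse([dense, bm25])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

&lt;p&gt;Documents that appear in both lists (&lt;code&gt;b&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt;) end up ahead of those that appear in only one. Plugging in the numbers from the paragraph above with k = 60: ranks #1 and #3 give 1/61 + 1/63 ≈ 0.0323, while rank #20 in both lists gives 2/80 = 0.025.&lt;/p&gt;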




&lt;h2&gt;
  
  
  The Reranker
&lt;/h2&gt;

&lt;p&gt;After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.&lt;/p&gt;

&lt;p&gt;Rather than another embedding model, I send all 20 candidates to &lt;strong&gt;Gemini&lt;/strong&gt; in a single prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works like a &lt;strong&gt;cross-encoder&lt;/strong&gt; (strictly speaking, it is listwise LLM reranking): the LLM reads the query and all passages together, so it can consider &lt;em&gt;interaction effects&lt;/em&gt; between the query and each passage. Bi-encoder embedding models cannot do this because they embed the query and each passage separately. The trade-off is cost and latency, but since we're calling it once per query (not once per document), it's manageable.&lt;/p&gt;
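
&lt;p&gt;A rough sketch of the two pieces around that single call: building the prompt and validating the reply. No Gemini call is made here. The helper names &lt;code&gt;build_rerank_prompt&lt;/code&gt; and &lt;code&gt;parse_ranking&lt;/code&gt;, the &lt;code&gt;[i]&lt;/code&gt; passage numbering, and the exact prompt wording beyond the three lines shown above are assumptions for illustration.&lt;/p&gt;

```python
import json


def build_rerank_prompt(query, passages, top_n=5):
    """Number each passage so the model can answer with indices only."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages))
    return (
        f"Given this question: {query}\n"
        f"Rank the following {len(passages)} passages by relevance.\n"
        f'Return only: {{"ranking": [...]}} with the {top_n} best indices.\n\n'
        f"{numbered}"
    )


def parse_ranking(raw, n_passages, top_n=5):
    """Extract valid, unique passage indices from the model's JSON reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in reply")
    ranking = json.loads(raw[start:end + 1])["ranking"]
    seen = []
    for idx in ranking:
        # Drop duplicates and indices that point outside the candidate list.
        if isinstance(idx, int) and idx in range(n_passages) and idx not in seen:
            seen.append(idx)
    return seen[:top_n]


# Hypothetical reply with chatter, a duplicate, and an out-of-range index.
reply = 'Sure! {"ranking": [2, 0, 2, 9, 1]}'
print(parse_ranking(reply, n_passages=3))
```

&lt;p&gt;Validating the indices matters because an LLM can return duplicates or numbers that do not correspond to any passage. Silently trusting them would put the wrong chunks in front of the generator.&lt;/p&gt;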

&lt;p&gt;The reranker also includes a &lt;strong&gt;retry + fallback mechanism&lt;/strong&gt;: if the API returns a &lt;code&gt;503 UNAVAILABLE&lt;/code&gt;, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly, so a reranker outage does not take the pipeline down.&lt;/p&gt;
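
&lt;p&gt;The retry + fallback logic might look roughly like this. It is a sketch under assumptions: &lt;code&gt;call_reranker&lt;/code&gt; is a hypothetical callable standing in for the Gemini request, and it is assumed to raise &lt;code&gt;RuntimeError&lt;/code&gt; on a 503. The real SDK raises its own exception types, which you would catch instead. The &lt;code&gt;sleep&lt;/code&gt; parameter is there only so the wait can be skipped in tests.&lt;/p&gt;

```python
import time


def rerank_with_fallback(call_reranker, rrf_candidates, top_n=5,
                         max_retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Try the LLM reranker; on repeated failure keep the RRF order."""
    attempts = max_retries + 1  # one initial call plus the retries
    for attempt in range(attempts):
        try:
            return call_reranker(rrf_candidates)[:top_n]
        except RuntimeError:
            # Assumption: the caller raises RuntimeError on a 503.
            if attempt + 1 != attempts:
                sleep(wait_seconds)
    # Total failure: fall back to the top of the RRF list.
    return rrf_candidates[:top_n]


def always_unavailable(candidates):
    """Hypothetical reranker that simulates a permanent outage."""
    raise RuntimeError("503 UNAVAILABLE")


candidates = [f"chunk_{i}" for i in range(20)]
top = rerank_with_fallback(always_unavailable, candidates, sleep=lambda s: None)
print(top)
```

&lt;p&gt;Because the RRF list is already sorted by fused score, slicing it is a reasonable degraded answer: the results are less refined, but the query still completes.&lt;/p&gt;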




&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;p&gt;Here's what happened when I ran the same query with both approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; &lt;em&gt;"What are the advantages of the Transformer architecture over traditional RNNs?"&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Dense Only&lt;/th&gt;
&lt;th&gt;Hybrid (Dense + BM25 + RRF)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_4&lt;/code&gt; ✅ nlp_temelleri.txt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_4&lt;/code&gt; ✅ nlp_temelleri.txt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_11&lt;/code&gt; ❌ veri_bilimi.txt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_3&lt;/code&gt; ✅ nlp_temelleri.txt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_8&lt;/code&gt; ❌ veri_bilimi.txt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chunk_11&lt;/code&gt; ❌ veri_bilimi.txt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BM25 caught "Transformer" and "RNN" as exact keywords and boosted &lt;code&gt;chunk_3&lt;/code&gt; (a passage about word embeddings and NLP context) from outside the top 3 into rank #2. In the table above, that pushed &lt;code&gt;chunk_8&lt;/code&gt; out of the top 3 and moved &lt;code&gt;chunk_11&lt;/code&gt; down from #2 to #3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation across 5 questions:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overall Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80% (4/5)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citation Coverage&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;14/14&lt;/strong&gt; successful citations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid vs Dense&lt;/td&gt;
&lt;td&gt;BM25 removed 2 irrelevant chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resilience&lt;/td&gt;
&lt;td&gt;503 errors handled via retry + fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every answer cites its source inline (e.g., &lt;code&gt;[1]&lt;/code&gt;, &lt;code&gt;[2]&lt;/code&gt;) with the actual filename, so users can verify the origin of each claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sentence-transformers&lt;/code&gt; (&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chromadb&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse retrieval&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rank_bm25&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fusion&lt;/td&gt;
&lt;td&gt;Custom RRF implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker + Generator&lt;/td&gt;
&lt;td&gt;Google Gemini API (&lt;code&gt;google-genai&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-dotenv&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/jasstt/rag_project" rel="noopener noreferrer"&gt;github.com/jasstt/rag_project&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jasstt/rag_project.git
&lt;span class="nb"&gt;cd &lt;/span&gt;rag_project
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="c"&gt;# Add your Gemini API key to .env&lt;/span&gt;
python src/ingest.py
python main.py
python src/eval.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
