<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vivek Patil</title>
    <description>The latest articles on DEV Community by Vivek Patil (@vivek_patil_f9a71d5c65994).</description>
    <link>https://dev.to/vivek_patil_f9a71d5c65994</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3342405%2F5ebc2cc7-6110-49c8-819f-c14f310ff228.png</url>
      <title>DEV Community: Vivek Patil</title>
      <link>https://dev.to/vivek_patil_f9a71d5c65994</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vivek_patil_f9a71d5c65994"/>
    <language>en</language>
    <item>
      <title>Hybrid Retrieval + RRF: How I Got 100% Retrieval Precision in a Production RAG System</title>
      <dc:creator>Vivek Patil</dc:creator>
      <pubDate>Sat, 04 Jul 2026 05:53:04 +0000</pubDate>
      <link>https://dev.to/vivek_patil_f9a71d5c65994/hybrid-retrieval-rrf-how-i-got-100-retrieval-precision-in-a-production-rag-system-1oga</link>
      <guid>https://dev.to/vivek_patil_f9a71d5c65994/hybrid-retrieval-rrf-how-i-got-100-retrieval-precision-in-a-production-rag-system-1oga</guid>
      <description>&lt;p&gt;Originally published at vivekpatil23.hashnode.dev&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem With Naive RAG Nobody Talks About&lt;/strong&gt;&lt;br&gt;
Most RAG tutorials show you the same pipeline: embed your documents, store vectors, embed the query, fetch the top-k nearest neighbors, pass to LLM. It works well enough in demos.&lt;/p&gt;

&lt;p&gt;In production, it quietly fails in two specific situations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 1&lt;/strong&gt; — Exact keyword queries. A user asks "What is the ContextQuery API rate limit?" Your semantic search returns chunks about "API usage patterns" and "request throttling behavior" — conceptually related, but the exact phrase "rate limit" is buried or absent. The LLM hallucinates a number because the retrieved chunk doesn't contain one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 2&lt;/strong&gt; — Short, specific queries. Semantic embeddings excel at capturing meaning but compress specificity. A 3-word query like "Chroma collection schema" gets drowned out by semantically adjacent but contextually wrong chunks.&lt;/p&gt;

&lt;p&gt;These weren't hypothetical failures. They showed up in my evaluation runs on ContextQuery — a production RAG system I built on free-tier infrastructure (NVIDIA NIM embeddings, Chroma Cloud, FastAPI, Next.js 15). My initial retrieval pipeline had a precision ceiling I couldn't break past 72% no matter how I tuned chunk size or overlap.&lt;/p&gt;

&lt;p&gt;The fix was hybrid retrieval using Reciprocal Rank Fusion. Here's exactly how it works and how I implemented it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Reciprocal Rank Fusion&lt;/strong&gt;&lt;br&gt;
RRF is a rank merging algorithm. Instead of picking one retrieval method and hoping it covers all query types, you run multiple retrievers independently, each producing a ranked list of chunks. RRF then merges those ranked lists into a single ranking using this formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;RRF_score(chunk) = Σ 1 / (k + rank_in_retriever)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; is a smoothing constant (typically 60) and &lt;strong&gt;&lt;em&gt;rank_in_retriever&lt;/em&gt;&lt;/strong&gt; is where that chunk appeared in each retriever's results.&lt;/p&gt;

&lt;p&gt;The key insight: a chunk that appears at rank 3 in semantic search AND rank 5 in BM25 gets a higher combined score than a chunk that's rank 1 in only one method. Consensus across retrievers is the signal.&lt;/p&gt;

&lt;p&gt;No neural network. No additional model. Just math on top of your existing retrieval infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Two Retrievers I Combined&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retriever 1 — Semantic Search (NVIDIA NIM Embeddings + Chroma Cloud)&lt;/strong&gt;&lt;br&gt;
Standard dense retrieval. Query gets embedded via NVIDIA's &lt;strong&gt;&lt;em&gt;nvidia/nv-embedqa-e5-v5&lt;/em&gt;&lt;/strong&gt; model, compared against stored document embeddings in Chroma Cloud using cosine similarity. Returns top-k chunks by vector similarity.&lt;/p&gt;

&lt;p&gt;Strength: captures conceptual meaning, handles paraphrasing well.&lt;br&gt;
Weakness: misses exact keyword matches, struggles with short specific queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retriever 2 — BM25 (rank_bm25)&lt;/strong&gt;&lt;br&gt;
Classical sparse retrieval. No embeddings. Scores chunks based on term frequency, inverse document frequency, and document length normalization. The same algorithm that powered search engines before neural networks existed.&lt;/p&gt;

&lt;p&gt;Strength: exact keyword matching, short specific queries, named entities.&lt;/p&gt;

&lt;p&gt;Weakness: no semantic understanding, synonym-blind.&lt;br&gt;
These two retrievers fail in opposite situations. That's exactly why combining them works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
Here's the core RRF merge function from ContextQuery:&lt;br&gt;
python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_bm25&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reciprocal_rank_fusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Merge semantic and BM25 ranked lists using RRF.
    Each result dict must have &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; keys.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;chunk_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# Score semantic results
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chunk_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

    &lt;span class="c1"&gt;# Score BM25 results
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chunk_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

    &lt;span class="c1"&gt;# Sort by combined RRF score descending
&lt;/span&gt;    &lt;span class="n"&gt;ranked_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the BM25 retriever setup:&lt;br&gt;
Python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_bm25_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;BM25Okapi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bm25_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;tokenized_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full retrieval call in the FastAPI endpoint:&lt;/p&gt;

&lt;p&gt;python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Run both retrievers
&lt;/span&gt;    &lt;span class="n"&gt;semantic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chroma_semantic_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bm25_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Merge with RRF
&lt;/span&gt;    &lt;span class="n"&gt;fused_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reciprocal_rank_fusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return top-k from merged list
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fused_results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note I fetch top-10 from each retriever before merging, then cut to top-5 after fusion. This gives RRF enough candidates to actually rerank meaningfully — fetching only top-5 before fusion defeats the purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Results&lt;/strong&gt;&lt;br&gt;
I evaluated ContextQuery using a 16-question test set covering a range of query types: exact keyword queries, conceptual questions, multi-hop questions, and short specific lookups.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Semantic Only&lt;/th&gt;
&lt;th&gt;Hybrid RRF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval Precision&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Faithfulness&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Latency&lt;/td&gt;
&lt;td&gt;~1800ms&lt;/td&gt;
&lt;td&gt;~2400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Retrieval precision measures whether the correct chunk appeared in the top-5 results. Faithfulness measures whether the LLM's answer was grounded in the retrieved content rather than hallucinated.&lt;/p&gt;

&lt;p&gt;The 600ms latency increase comes from running BM25 in parallel alongside the semantic search. For my use case this was an acceptable tradeoff. For latency-critical applications, you could run BM25 on a separate thread and set a timeout fallback to semantic-only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd Do Differently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 12.5% faithfulness gap isn't a retrieval problem. After investigation, the remaining faithfulness failures came from chunk boundary issues — the answer to a question was split across two chunks and neither chunk alone was sufficient. The fix is smarter chunking (semantic chunking over fixed token windows), not better retrieval. Hybrid RRF solved the retrieval problem completely; chunking strategy is the next frontier.&lt;/p&gt;

&lt;p&gt;k=60 is a reasonable default but not universal. The smoothing constant k controls how much weight rank position gets versus pure presence in results. I used 60 (the standard default) and didn't tune it. If your query distribution is heavily keyword-biased, a smaller k rewards BM25 rank more aggressively. Worth experimenting with if you're not hitting the precision numbers you need.&lt;/p&gt;

&lt;p&gt;BM25 index needs to be rebuilt on document updates. Unlike the Chroma vector store which handles upserts natively, the BM25 index in my implementation is rebuilt from scratch on each document ingestion event. Fine at small scale, will become a bottleneck with large corpora. A production fix is incremental index updates or a dedicated sparse retrieval service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Embeddings: NVIDIA NIM (&lt;strong&gt;&lt;em&gt;nvidia/nv-embedqa-e5-v5&lt;/em&gt;&lt;/strong&gt;)&lt;br&gt;
Vector store: Chroma Cloud&lt;br&gt;
Sparse retrieval: rank_bm25&lt;br&gt;
Backend: FastAPI (async)&lt;br&gt;
Frontend: Next.js 15&lt;br&gt;
Observability: LangFuse&lt;br&gt;
Deployment: Render (backend) + Vercel (frontend)&lt;br&gt;
Total infra cost: $0 (all free tier)&lt;/p&gt;

&lt;p&gt;Full source: &lt;a href="https://github.com/vivekpatil200320/contextquery" rel="noopener noreferrer"&gt;https://github.com/vivekpatil200320/contextquery&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
Hybrid retrieval isn't a complex idea — it's two retrievers whose failure modes don't overlap, merged with a formula that takes 10 lines to implement. The results in ContextQuery were significant enough that I now treat it as a default starting point rather than an optimisation.&lt;/p&gt;

&lt;p&gt;If you're building a RAG system and hitting a precision ceiling, add BM25 before you touch chunk size, overlap, or embedding models. It's the highest-leverage change in the retrieval stack.&lt;/p&gt;

&lt;p&gt;Building and writing about production AI systems — find more at &lt;a href="https://vivekpatil23.hashnode.dev" rel="noopener noreferrer"&gt;https://vivekpatil23.hashnode.dev&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
