Vivek Patil

Posted on Jul 4 • Originally published at vivekpatil23.hashnode.dev

Hybrid Retrieval + RRF: How I Got 100% Retrieval Precision in a Production RAG System

The Problem With Naive RAG Nobody Talks About
Most RAG tutorials show you the same pipeline: embed your documents, store the vectors, embed the query, fetch the top-k nearest neighbors, and pass them to the LLM. It works well enough in demos.

In production, it quietly fails in two specific situations:

Situation 1 — Exact keyword queries. A user asks "What is the ContextQuery API rate limit?" Your semantic search returns chunks about "API usage patterns" and "request throttling behavior" — conceptually related, but the exact phrase "rate limit" is buried or absent. The LLM hallucinates a number because the retrieved chunk doesn't contain one.

Situation 2 — Short, specific queries. Semantic embeddings excel at capturing meaning but compress specificity. A 3-word query like "Chroma collection schema" gets drowned out by semantically adjacent but contextually wrong chunks.

These weren't hypothetical failures. They showed up in my evaluation runs on ContextQuery, a production RAG system I built on free-tier infrastructure (NVIDIA NIM embeddings, Chroma Cloud, FastAPI, Next.js 15). My initial retrieval pipeline hit a precision ceiling of 72% that I couldn't break past no matter how I tuned chunk size or overlap.

The fix was hybrid retrieval using Reciprocal Rank Fusion. Here's exactly how it works and how I implemented it.

What Is Reciprocal Rank Fusion
RRF is a rank merging algorithm. Instead of picking one retrieval method and hoping it covers all query types, you run multiple retrievers independently, each producing a ranked list of chunks. RRF then merges those ranked lists into a single ranking using this formula:

RRF_score(chunk) = Σ 1 / (k + rank_in_retriever)

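Where k is a smoothing constant (typically 60), rank_in_retriever is the chunk's 1-based position in a given retriever's results, and the sum runs over the retrievers that returned the chunk. A chunk missing from a retriever's list contributes nothing for that retriever.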

The key insight: a chunk that appears at rank 3 in semantic search AND rank 5 in BM25 gets a higher combined score than a chunk that's rank 1 in only one method. Consensus across retrievers is the signal.
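To check that claim with the default k = 60, here is the arithmetic as a small sketch. The helper name rrf_score is made up for this illustration (it is not part of ContextQuery), and ranks are 1-based.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the retrievers that returned the chunk (1-based ranks)."""
    return sum(1.0 / (k + rank) for rank in ranks)

# Chunk A: rank 3 in semantic search and rank 5 in BM25.
consensus = rrf_score([3, 5])   # 1/63 + 1/65

# Chunk B: rank 1 in one retriever, absent from the other.
single = rrf_score([1])         # 1/61

print(f"consensus: {consensus:.5f}")
print(f"single:    {single:.5f}")
print("consensus wins" if consensus > single else "single wins")
```

Two mid-list appearances add up to roughly twice the score of one first-place appearance, because with k = 60 the individual terms differ only slightly between rank 1 and rank 5.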

No neural network. No additional model. Just math on top of your existing retrieval infrastructure.

The Two Retrievers I Combined

Retriever 1 — Semantic Search (NVIDIA NIM Embeddings + Chroma Cloud)
Standard dense retrieval. Query gets embedded via NVIDIA's nvidia/nv-embedqa-e5-v5 model, compared against stored document embeddings in Chroma Cloud using cosine similarity. Returns top-k chunks by vector similarity.

Strength: captures conceptual meaning, handles paraphrasing well.
Weakness: misses exact keyword matches, struggles with short specific queries.
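Stripped of the embedding model and the vector store, the ranking step is nearest-neighbour search by cosine similarity. The sketch below uses hand-written 3-dimensional toy vectors in place of real embeddings and makes no NVIDIA NIM or Chroma calls; every name in it is hypothetical.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (chunk_id, similarity) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors standing in for stored chunk embeddings (assumed values).
toy_vectors = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.0],
    "chunk-c": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

nearest = top_k_by_cosine(toy_query, toy_vectors, top_k=2)
print(nearest)
```

Nothing in this ranking looks at the literal words of the query, which is why an exact phrase can be outranked by a chunk that merely points in a similar direction.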

Retriever 2 — BM25 (rank_bm25)
Classical sparse retrieval. No embeddings. Scores chunks based on term frequency, inverse document frequency, and document length normalization. It is the same family of ranking functions that search engines relied on before neural retrieval became common.

Strength: exact keyword matching, short specific queries, named entities.
Weakness: no semantic understanding, synonym-blind.

These two retrievers fail in opposite situations. That's exactly why combining them works.
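To make the three BM25 ingredients concrete (term frequency, inverse document frequency, length normalization), here is a from-scratch sketch of a textbook Okapi BM25 variant. It is not the rank_bm25 implementation: the IDF form, the k1 and b defaults, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            f = term_freq.get(term, 0)
            if f == 0:
                continue  # BM25 is synonym-blind: no literal match, no credit
            df = doc_freq[term]
            # Rare terms get a larger IDF; the +1 keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical chunks, written only for this example.
toy_docs = [
    "the api rate limit is enforced per key",
    "request throttling behavior depends on usage patterns",
    "chroma collection schema and metadata fields",
]
scores = bm25_scores("rate limit", toy_docs)
print(scores)
```

Only the first toy chunk contains the literal query terms, so it is the only one with a non-zero score. The "request throttling" chunk, which a dense retriever would likely consider related, gets nothing.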

Implementation
Here's the core RRF merge function from ContextQuery:
python

from rank_bm25 import BM25Okapi
from typing import List, Dict, Any

def reciprocal_rank_fusion(
    semantic_results: List[Dict],
    bm25_results: List[Dict],
    k: int = 60
) -> List[Dict]:
    """
    Merge semantic and BM25 ranked lists using RRF.
    Each result dict must have 'id' and 'content' keys.
    """
    scores: Dict[str, float] = {}
    chunk_map: Dict[str, Dict] = {}

    # Score semantic results
    for rank, chunk in enumerate(semantic_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Score BM25 results
    for rank, chunk in enumerate(bm25_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Sort by combined RRF score descending
    ranked_ids = sorted(scores, key=lambda x: scores[x], reverse=True)
    return [chunk_map[chunk_id] for chunk_id in ranked_ids]

And the BM25 retriever setup:
python

def build_bm25_index(chunks: List[str]) -> BM25Okapi:
    tokenized = [chunk.lower().split() for chunk in chunks]
    return BM25Okapi(tokenized)

def bm25_retrieve(
    query: str,
    bm25_index: BM25Okapi,
    chunks: List[Dict],
    top_k: int = 10
) -> List[Dict]:
    tokenized_query = query.lower().split()
    scores = bm25_index.get_scores(tokenized_query)
    top_indices = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=True
    )[:top_k]
    return [chunks[i] for i in top_indices]

The full retrieval call in the FastAPI endpoint:

python

async def retrieve(query: str, top_k: int = 5) -> List[Dict]:
    # Run both retrievers
    semantic_results = await chroma_semantic_search(query, top_k=10)
    bm25_results = bm25_retrieve(query, bm25_index, all_chunks, top_k=10)

    # Merge with RRF
    fused_results = reciprocal_rank_fusion(semantic_results, bm25_results)

    # Return top-k from merged list
    return fused_results[:top_k]

Note I fetch top-10 from each retriever before merging, then cut to top-5 after fusion. This gives RRF enough candidates to actually rerank meaningfully — fetching only top-5 before fusion defeats the purpose.
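A toy case shows why the candidate depth matters. The chunk ids below are invented, and rrf_ids is a compact id-only variant of the merge function written just for this sketch.

```python
from typing import Dict, List

def rrf_ids(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk ids with RRF (1-based ranks), best first."""
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# "target" sits at rank 7 in both retrievers; everything else appears once.
semantic = ["s1", "s2", "s3", "s4", "s5", "s6", "target", "s8", "s9", "s10"]
bm25 = ["b1", "b2", "b3", "b4", "b5", "b6", "target", "b8", "b9", "b10"]

deep = rrf_ids([semantic, bm25])[:5]            # fetch 10 each, cut to 5 after fusion
shallow = rrf_ids([semantic[:5], bm25[:5]])[:5]  # fetch only 5 each

print("deep:   ", deep)
print("shallow:", shallow)
```

With ten candidates per retriever, the chunk both retrievers agree on scores 2/67 and moves to the top of the fused list, ahead of either rank-1 result at 1/61. With five candidates per retriever it never enters the pool, so fusion has nothing to promote.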

Evaluation Results
I evaluated ContextQuery using a 16-question test set covering a range of query types: exact keyword queries, conceptual questions, multi-hop questions, and short specific lookups.

Metric	Semantic Only	Hybrid RRF
Retrieval Precision	72%	100%
Answer Faithfulness	81%	87.5%
Avg Latency	~1800ms	~2400ms

Retrieval precision here measures whether the correct chunk appeared in the top-5 results, so it is closer to a hit rate at 5 than to precision in the strict IR sense. Faithfulness measures whether the LLM's answer was grounded in the retrieved content rather than hallucinated.
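Under that definition (did the expected chunk show up in the top 5?), the retrieval metric reduces to a few lines. This is a sketch of the calculation only: the function name, the question ids, and the toy data are made up and are not the ContextQuery test set.

```python
from typing import Dict, List

def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    expected: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k retrieved ids."""
    hits = sum(
        1
        for question_id, gold_chunk in expected.items()
        if gold_chunk in retrieved.get(question_id, [])[:k]
    )
    return hits / len(expected)

# Invented example: four questions, one of which misses its expected chunk.
expected = {"q1": "c10", "q2": "c22", "q3": "c31", "q4": "c47"}
retrieved = {
    "q1": ["c10", "c11", "c12", "c13", "c14"],
    "q2": ["c20", "c21", "c22", "c23", "c24"],
    "q3": ["c30", "c32", "c33", "c34", "c35"],  # c31 is missing
    "q4": ["c40", "c41", "c42", "c43", "c47"],
}

print(hit_rate_at_k(retrieved, expected, k=5))
```

A metric of this shape only checks that the chunk is present. It says nothing about whether that chunk contains the whole answer.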

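The 600ms latency increase comes from adding the BM25 retrieval and the fusion step on top of the semantic search. As written, the endpoint above awaits the semantic search before running BM25, so the two calls do not overlap. For my use case this was an acceptable tradeoff. For latency-critical applications, you could run BM25 on a separate thread concurrently with the semantic call and set a timeout fallback to semantic-only.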
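The thread-plus-timeout idea could look roughly like this. Both retrievers are stand-in stubs with artificial delays (fake_semantic_search and fake_bm25_search are hypothetical, not ContextQuery functions), and the sketch assumes Python 3.9+ for asyncio.to_thread.

```python
import asyncio
import time
from typing import Dict, List, Tuple

async def fake_semantic_search(query: str) -> List[Dict]:
    """Stub for the async vector search call."""
    await asyncio.sleep(0.05)
    return [{"id": "sem-1", "content": "semantic hit"}]

def fake_bm25_search(query: str, delay: float) -> List[Dict]:
    """Stub for the synchronous, CPU-bound BM25 call."""
    time.sleep(delay)
    return [{"id": "bm25-1", "content": "keyword hit"}]

async def retrieve_with_fallback(
    query: str, bm25_delay: float, bm25_timeout: float = 0.2
) -> Tuple[List[Dict], List[Dict]]:
    # Start both retrievers so they overlap instead of running back to back.
    semantic_task = asyncio.create_task(fake_semantic_search(query))
    bm25_task = asyncio.create_task(
        asyncio.to_thread(fake_bm25_search, query, bm25_delay)
    )
    semantic_results = await semantic_task
    try:
        # bm25_timeout is the extra wait allowed once semantic results are in.
        bm25_results = await asyncio.wait_for(bm25_task, timeout=bm25_timeout)
    except asyncio.TimeoutError:
        # Fall back to semantic-only. The worker thread is not interrupted;
        # its result is simply ignored.
        bm25_results = []
    return semantic_results, bm25_results

fast = asyncio.run(retrieve_with_fallback("rate limit", bm25_delay=0.0))
slow = asyncio.run(retrieve_with_fallback("rate limit", bm25_delay=0.5, bm25_timeout=0.1))
print("fast BM25:", fast[1])
print("slow BM25:", slow[1])
```

When the BM25 list comes back empty, RRF over the two lists simply preserves the semantic order, so the fusion step needs no special case for the fallback.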

What I'd Do Differently

The 12.5% faithfulness gap isn't a retrieval problem. After investigation, the remaining faithfulness failures came from chunk boundary issues: the answer to a question was split across two chunks and neither chunk alone was sufficient. The fix is smarter chunking (semantic chunking over fixed token windows), not better retrieval. On this test set, hybrid RRF closed the retrieval gap; chunking strategy is the next frontier.

k=60 is a reasonable default but not universal. The smoothing constant k controls how much weight rank position gets versus pure presence in results. I used 60 (the standard default) and didn't tune it. A smaller k makes the score fall off faster with rank, so a top-ranked hit from either retriever counts for more relative to lower-ranked consensus. If your query distribution is heavily keyword-biased, that lets a rank-1 BM25 match win more often. Worth experimenting with if you're not hitting the precision numbers you need.
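To see how k shifts the balance between consensus and a single top hit, the sketch below replays one comparison at a few values of k. The k values and ranks are picked for illustration, not tuned on ContextQuery.

```python
def rrf_score(ranks, k):
    """Sum 1 / (k + rank) over the retrievers that returned the chunk (1-based ranks)."""
    return sum(1.0 / (k + rank) for rank in ranks)

for k in (1, 10, 60):
    consensus = rrf_score([3, 5], k)   # mid-list in both retrievers
    single_top = rrf_score([1], k)     # rank 1 in one retriever only
    winner = "consensus" if consensus > single_top else "single top hit"
    print(f"k={k:>2}: consensus={consensus:.4f}  single={single_top:.4f}  -> {winner}")
```

At k = 1 the single rank-1 hit wins (1/2 against 1/4 + 1/6), while at k = 10 and k = 60 the chunk found by both retrievers wins. Note that k applies to every retriever equally. To favour one retriever specifically you would need per-retriever weights, which plain RRF does not have.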

BM25 index needs to be rebuilt on document updates. Unlike the Chroma vector store which handles upserts natively, the BM25 index in my implementation is rebuilt from scratch on each document ingestion event. Fine at small scale, will become a bottleneck with large corpora. A production fix is incremental index updates or a dedicated sparse retrieval service.
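One possible shape for incremental updates is to keep the corpus statistics BM25 needs (per-document term counts, document frequencies, total length) and adjust them on each upsert instead of re-tokenizing everything. The class below is a hypothetical sketch using a textbook BM25 formula; it is not a rank_bm25 feature, and its search still scans every document.

```python
import math
from collections import Counter
from typing import Dict, List

class IncrementalBM25:
    """Toy BM25 index whose statistics are updated per document."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.term_freqs: Dict[str, Counter] = {}  # doc_id -> term counts
        self.doc_freq: Counter = Counter()        # term -> number of docs containing it
        self.total_len = 0                        # sum of document lengths

    def upsert(self, doc_id: str, text: str) -> None:
        if doc_id in self.term_freqs:
            self.remove(doc_id)  # replace the old version's statistics
        tf = Counter(text.lower().split())
        self.term_freqs[doc_id] = tf
        self.doc_freq.update(tf.keys())
        self.total_len += sum(tf.values())

    def remove(self, doc_id: str) -> None:
        tf = self.term_freqs.pop(doc_id)
        self.doc_freq.subtract(tf.keys())
        self.total_len -= sum(tf.values())

    def search(self, query: str, top_k: int = 10) -> List[str]:
        n = len(self.term_freqs)
        if n == 0:
            return []
        avg_len = self.total_len / n
        scores: Dict[str, float] = {}
        for doc_id, tf in self.term_freqs.items():
            doc_len = sum(tf.values())
            score = 0.0
            for term in query.lower().split():
                f = tf.get(term, 0)
                if f == 0:
                    continue
                df = self.doc_freq[term]
                idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                norm = f + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * f * (self.k1 + 1) / norm
            if score > 0:
                scores[doc_id] = score  # drop documents with no matching term
        return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_k]

index = IncrementalBM25()
index.upsert("a", "rate limit per key")
index.upsert("b", "collection schema")
first = index.search("rate limit")

index.upsert("b", "rate limit for collections")  # update in place, no full rebuild
second = index.search("rate limit")

index.remove("a")
third = index.search("rate limit")
print(first, second, third)
```

Each upsert costs time proportional to the length of the changed document rather than the size of the corpus. For large corpora, an inverted index or a dedicated sparse retrieval service would also avoid the linear scan in search.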

Stack Summary

Embeddings: NVIDIA NIM (nvidia/nv-embedqa-e5-v5)
Vector store: Chroma Cloud
Sparse retrieval: rank_bm25
Backend: FastAPI (async)
Frontend: Next.js 15
Observability: LangFuse
Deployment: Render (backend) + Vercel (frontend)
Total infra cost: $0 (all free tier)

Full source: https://github.com/vivekpatil200320/contextquery

Wrapping Up
Hybrid retrieval isn't a complex idea — it's two retrievers whose failure modes don't overlap, merged with a formula that takes 10 lines to implement. The results in ContextQuery were significant enough that I now treat it as a default starting point rather than an optimisation.

If you're building a RAG system and hitting a precision ceiling, try adding BM25 before you touch chunk size, overlap, or embedding models. In ContextQuery it was the highest-leverage change in the retrieval stack.

Building and writing about production AI systems — find more at https://vivekpatil23.hashnode.dev

Top comments (4)

Tae Kim • Jul 4

The 12.5% faithfulness gap is the more interesting result: RRF surfaces the right chunk, but when the answer spans a boundary, precision and faithfulness decouple — you retrieve something, just not the thing containing the full claim. Semantic chunking helps, but the harder fix is query-aware boundary detection where the chunker knows what questions the corpus will answer. The split-answer case is where purely recall-optimized retrieval hits its ceiling and context assembly becomes the bottleneck.

Vivek Patil • Jul 6

Exactly right — and you've named the thing I glossed over in the writeup. RRF optimizes for recall of the right chunk, but when a claim is semantically distributed across a boundary, you're retrieving near the answer, not the answer itself. The precision metric I used doesn't catch this because it only checks chunk presence, not claim completeness.
The query-aware boundary detection angle is something I haven't seen implemented cleanly anywhere — most semantic chunking still uses embedding similarity between adjacent sentences rather than prospective question coverage. If you've seen an approach that handles the split-answer case well I'd genuinely want to read it — it's the next thing I'm digging into for ContextQuery v2.

Tae Kim • Jul 6

The cleanest split-answer handling I've seen is late chunking combined with a coverage probe: instead of deciding boundaries by embedding similarity, you embed the query alongside each candidate sentence during chunking and check whether any single sentence satisfies more than a threshold of the question's intent — if not, extend the chunk until it does. It's computationally heavier at index time but the boundary becomes query-aware rather than structure-aware, which is exactly what the split-answer case needs. Happy to share the sketch of how we did the coverage probe if that's useful for ContextQuery v2.

Vivek Patil • Jul 14

That's the cleanest framing I've heard for this problem — moving the boundary decision from index time structure to query time intent coverage is the right inversion. The threshold probe idea makes sense: you're essentially asking "does this chunk fully contain the claim?" before committing to the boundary, rather than discovering the split at retrieval time when it's too late to fix.
The computational cost at index time is a fair tradeoff for ContextQuery's use case — the corpus is relatively stable, so heavy indexing amortizes well. The part I'd want to understand is how you calibrate the intent threshold without it becoming a tuning nightmare per corpus.
Yes — please share the sketch. Genuinely useful for v2 and I'll credit the approach in the writeup if I implement it.