Vivek Patil

Posted on • Originally published at vivekpatil23.hashnode.dev

Hybrid Retrieval + RRF: How I Got 100% Retrieval Precision in a Production RAG System

The Problem With Naive RAG Nobody Talks About
Most RAG tutorials show you the same pipeline: embed your documents, store the vectors, embed the query, fetch the top-k nearest neighbors, and pass them to the LLM. It works well enough in demos.

In production, it quietly fails in two specific situations:

Situation 1 — Exact keyword queries. A user asks "What is the ContextQuery API rate limit?" Your semantic search returns chunks about "API usage patterns" and "request throttling behavior" — conceptually related, but the exact phrase "rate limit" is buried or absent. The LLM hallucinates a number because the retrieved chunk doesn't contain one.

Situation 2 — Short, specific queries. Semantic embeddings excel at capturing meaning but compress specificity. A 3-word query like "Chroma collection schema" gets drowned out by semantically adjacent but contextually wrong chunks.

These weren't hypothetical failures. They showed up in my evaluation runs on ContextQuery, a production RAG system I built on free-tier infrastructure (NVIDIA NIM embeddings, Chroma Cloud, FastAPI, Next.js 15). My initial retrieval pipeline hit a precision ceiling of 72% that I couldn't break past, no matter how I tuned chunk size or overlap.

The fix was hybrid retrieval using Reciprocal Rank Fusion. Here's exactly how it works and how I implemented it.

What Is Reciprocal Rank Fusion
RRF is a rank merging algorithm. Instead of picking one retrieval method and hoping it covers all query types, you run multiple retrievers independently, each producing a ranked list of chunks. RRF then merges those ranked lists into a single ranking using this formula:

RRF_score(chunk) = Σ 1 / (k + rank_in_retriever)

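Where k is a smoothing constant (typically 60) and rank_in_retriever is the chunk's 1-based position in a given retriever's results. The sum runs over every retriever that returned the chunk; a retriever that did not return it contributes nothing.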

The key insight: a chunk that appears at rank 3 in semantic search AND rank 5 in BM25 gets a higher combined score than a chunk that's rank 1 in only one method. Consensus across retrievers is the signal.

No neural network. No additional model. Just math on top of your existing retrieval infrastructure.
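
To make the consensus claim concrete, here is the arithmetic for the example above with k = 60 and 1-based ranks. The helper name `rrf_term` is only for illustration and is not part of ContextQuery.

```python
def rrf_term(rank: int, k: int = 60) -> float:
    """Contribution of one retriever that placed a chunk at a 1-based rank."""
    return 1.0 / (k + rank)

# Chunk A: rank 3 in semantic search and rank 5 in BM25
consensus_score = rrf_term(3) + rrf_term(5)  # 1/63 + 1/65

# Chunk B: rank 1 in one retriever, absent from the other
single_score = rrf_term(1)  # 1/61

print(f"consensus: {consensus_score:.4f}")
print(f"single:    {single_score:.4f}")
```

The consensus chunk scores roughly 0.0313 against 0.0164 for the single-retriever chunk, so it ranks first after fusion.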

The Two Retrievers I Combined

Retriever 1 — Semantic Search (NVIDIA NIM Embeddings + Chroma Cloud)
Standard dense retrieval. Query gets embedded via NVIDIA's nvidia/nv-embedqa-e5-v5 model, compared against stored document embeddings in Chroma Cloud using cosine similarity. Returns top-k chunks by vector similarity.

Strength: captures conceptual meaning, handles paraphrasing well.
Weakness: misses exact keyword matches, struggles with short specific queries.

Retriever 2 — BM25 (rank_bm25)
Classical sparse retrieval. No embeddings. Scores chunks based on term frequency, inverse document frequency, and document length normalization. It is the same family of ranking function that search engines relied on long before dense neural retrieval became common.

Strength: exact keyword matching, short specific queries, named entities.
Weakness: no semantic understanding, synonym-blind.

These two retrievers fail in opposite situations. That's exactly why combining them works.
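
For readers who want to see what BM25 actually computes, the following is a minimal pure-Python sketch of a textbook Okapi BM25 score. It is not the rank_bm25 implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the two toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every doc against the query with a textbook Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many docs does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # no exact match, no credit: this is the synonym blindness
            # Non-negative IDF variant: rarer terms weigh more
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 saturates term frequency; b normalises for document length
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy documents, invented for this example
docs = [
    "the api rate limit is enforced per key",
    "request throttling behavior depends on usage patterns",
]
scores = bm25_scores("rate limit", docs)
print(scores)
```

The first document gets a positive score because it contains both query terms. The second scores zero even though "request throttling" is about the same topic, which is exactly the gap semantic search covers.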

Implementation
Here's the core RRF merge function from ContextQuery:
python

from rank_bm25 import BM25Okapi
from typing import List, Dict, Any

def reciprocal_rank_fusion(
    semantic_results: List[Dict],
    bm25_results: List[Dict],
    k: int = 60
) -> List[Dict]:
    """
    Merge semantic and BM25 ranked lists using RRF.
    Each result dict must have 'id' and 'content' keys.
    """
    scores: Dict[str, float] = {}
    chunk_map: Dict[str, Dict] = {}

    # Score semantic results
    for rank, chunk in enumerate(semantic_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Score BM25 results
    for rank, chunk in enumerate(bm25_results):
        chunk_id = chunk["id"]
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank + 1)
        chunk_map[chunk_id] = chunk

    # Sort by combined RRF score descending
    ranked_ids = sorted(scores, key=lambda x: scores[x], reverse=True)
    return [chunk_map[chunk_id] for chunk_id in ranked_ids]

And the BM25 retriever setup:
Python

def build_bm25_index(chunks: List[str]) -> BM25Okapi:
    tokenized = [chunk.lower().split() for chunk in chunks]
    return BM25Okapi(tokenized)

def bm25_retrieve(
    query: str,
    bm25_index: BM25Okapi,
    chunks: List[Dict],
    top_k: int = 10
) -> List[Dict]:
    tokenized_query = query.lower().split()
    scores = bm25_index.get_scores(tokenized_query)
    top_indices = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=True
    )[:top_k]
    return [chunks[i] for i in top_indices]

The full retrieval call in the FastAPI endpoint:

python

async def retrieve(query: str, top_k: int = 5) -> List[Dict]:
    # Run both retrievers
    semantic_results = await chroma_semantic_search(query, top_k=10)
    bm25_results = bm25_retrieve(query, bm25_index, all_chunks, top_k=10)

    # Merge with RRF
    fused_results = reciprocal_rank_fusion(semantic_results, bm25_results)

    # Return top-k from merged list
    return fused_results[:top_k]

Note that I fetch the top 10 from each retriever before merging, then cut to the top 5 after fusion. This gives RRF enough candidates to rerank meaningfully: if each retriever only returns 5, a chunk sitting at rank 6 in both lists never reaches the fusion step.
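
The effect of over-fetching can be shown with a toy example. This sketch fuses plain lists of chunk ids rather than the result dicts used above; `rrf_ids` and the ids are hypothetical, chosen so that one chunk ("x") sits at rank 6 in both retrievers.

```python
from typing import Dict, List

def rrf_ids(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse lists of chunk ids (best first) with RRF; returns ids, best first."""
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Toy rankings: "x" is rank 6 in both lists, every other id appears only once
semantic = ["s1", "s2", "s3", "s4", "s5", "x", "s7", "s8", "s9", "s10"]
bm25 = ["b1", "b2", "b3", "b4", "b5", "x", "b7", "b8", "b9", "b10"]

# Fetch 5 per retriever, then fuse: "x" is never a candidate
top5_then_fuse = rrf_ids([semantic[:5], bm25[:5]])[:5]

# Fetch 10 per retriever, fuse, then cut to 5: "x" wins on consensus
top10_then_fuse = rrf_ids([semantic, bm25])[:5]

print(top5_then_fuse)
print(top10_then_fuse)
```

With 10 candidates per retriever, "x" collects 2/66 and outranks every chunk that only one retriever returned (at most 1/61). With 5 candidates per retriever it is dropped before fusion ever sees it.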

Evaluation Results
I evaluated ContextQuery using a 16-question test set covering a range of query types: exact keyword queries, conceptual questions, multi-hop questions, and short specific lookups.

| Metric | Semantic Only | Hybrid RRF |
| --- | --- | --- |
| Retrieval Precision | 72% | 100% |
| Answer Faithfulness | 81% | 87.5% |
| Avg Latency | ~1800ms | ~2400ms |

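Retrieval precision here measures whether the correct chunk appeared in the top-5 results, which is strictly a top-5 hit rate rather than precision in the classical information-retrieval sense. Faithfulness measures whether the LLM's answer was grounded in the retrieved content rather than hallucinated.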
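
A metric of this kind could be computed along the following lines. This is a sketch under assumptions: it presumes each test question has exactly one labelled correct chunk id, and the function name `hit_rate_at_k` and the two-question data are hypothetical, not the ContextQuery evaluation set.

```python
from typing import Dict, List

def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of questions whose labelled chunk id appears in the top-k results."""
    hits = sum(
        1
        for question, gold_id in relevant.items()
        if gold_id in retrieved.get(question, [])[:k]
    )
    return hits / len(relevant)

# Hypothetical two-question example, not data from the real evaluation
retrieved = {
    "q1": ["c7", "c2", "c9"],  # contains the labelled chunk c2
    "q2": ["c4", "c1", "c8"],  # misses the labelled chunk c5
}
relevant = {"q1": "c2", "q2": "c5"}

print(hit_rate_at_k(retrieved, relevant))  # 0.5
```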

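The roughly 600ms latency increase comes from adding the BM25 pass on top of the semantic search; in the endpoint shown above, BM25 only starts after the semantic call has returned. For my use case this was an acceptable tradeoff. For latency-critical applications, you could run BM25 on a separate thread so it overlaps with the semantic call, and set a timeout that falls back to semantic-only results.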
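
The thread-plus-timeout idea could look roughly like this. It is a sketch, not the ContextQuery endpoint: both retrievers are replaced by stubs (`semantic_search_stub`, `bm25_search_stub`), the function name `retrieve_both` and the delay and timeout values are invented, and it assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time
from typing import Dict, List, Tuple

async def semantic_search_stub(query: str) -> List[Dict]:
    """Stand-in for the async vector search call."""
    await asyncio.sleep(0.01)
    return [{"id": "s1", "content": "semantic hit"}]

def bm25_search_stub(query: str, delay: float) -> List[Dict]:
    """Stand-in for the synchronous BM25 call."""
    time.sleep(delay)
    return [{"id": "b1", "content": "keyword hit"}]

async def retrieve_both(
    query: str, bm25_delay: float, timeout: float
) -> Tuple[List[Dict], List[Dict]]:
    # Start BM25 in a worker thread so it overlaps with the semantic call
    bm25_task = asyncio.create_task(
        asyncio.to_thread(bm25_search_stub, query, bm25_delay)
    )
    semantic = await semantic_search_stub(query)
    try:
        # Extra time granted to BM25 once the semantic results are in
        bm25 = await asyncio.wait_for(bm25_task, timeout=timeout)
    except asyncio.TimeoutError:
        bm25 = []  # fall back to semantic-only results
    return semantic, bm25

semantic, bm25 = asyncio.run(retrieve_both("rate limit", bm25_delay=0.0, timeout=1.0))
print(len(semantic), len(bm25))
```

An empty BM25 list is harmless for the fusion step: RRF over one non-empty list simply preserves that list's order. Note that the worker thread itself cannot be cancelled, so a timed-out BM25 call still runs to completion in the background.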

What I'd Do Differently

The 12.5-point faithfulness gap isn't a retrieval problem. After investigation, the remaining faithfulness failures came from chunk boundary issues: the answer to a question was split across two chunks and neither chunk alone was sufficient. The fix is smarter chunking (semantic chunking over fixed token windows), not better retrieval. Hybrid RRF solved the retrieval problem on this test set; chunking strategy is the next frontier.

k=60 is a reasonable default but not universal. The smoothing constant k controls how much weight rank position gets versus pure presence in results. I used 60 (the standard default) and didn't tune it. A smaller k rewards top-ranked results more aggressively, but it does so for both retrievers equally; if your query distribution is heavily keyword-biased, favouring BM25 specifically takes a per-retriever weight on top of k. Worth experimenting with if you're not hitting the precision numbers you need.
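
One way to get a feel for k is to compare how much more a rank-1 hit is worth than a rank-10 hit at different values. The sketch below also includes a weighted variant, since k on its own treats every retriever the same. Both `top_rank_advantage` and `weighted_rrf` are hypothetical helpers; the weighted form is an extension of plain RRF and is not something ContextQuery uses.

```python
from typing import Dict, List

def top_rank_advantage(k: int, low_rank: int = 10) -> float:
    """How many times more a rank-1 hit is worth than a hit at low_rank."""
    return (k + low_rank) / (k + 1)

def weighted_rrf(
    ranked_lists: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """RRF with a per-retriever multiplier (an extension, not plain RRF)."""
    scores: Dict[str, float] = {}
    for name, ranked in ranked_lists.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weights[name] / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Rank 1 vs rank 10: 5.5x at k=1, about 1.82x at k=10, about 1.15x at k=60
for k in (1, 10, 60):
    print(k, round(top_rank_advantage(k), 2))

# The two retrievers disagree on the order of chunks "a" and "b"
lists = {"semantic": ["a", "b"], "bm25": ["b", "a"]}
bm25_favoured = weighted_rrf(lists, {"semantic": 1.0, "bm25": 2.0})
print(bm25_favoured)  # BM25's preferred chunk "b" comes first
```

At k=60 the ranking is relatively flat, so presence in both lists dominates; at small k the top positions dominate. Doubling the BM25 weight is what tips a disagreement toward BM25's ordering.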

BM25 index needs to be rebuilt on document updates. Unlike the Chroma vector store, which handles upserts natively, the BM25 index in my implementation is rebuilt from scratch on each document ingestion event. That is fine at small scale but will become a bottleneck with large corpora. A production fix is incremental index updates or a dedicated sparse retrieval service.
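
As a rough idea of what incremental updates involve: BM25 only needs per-document term counts, per-term document frequencies, and the average document length, and all three can be adjusted when a single document changes. The class below is a hypothetical sketch of that bookkeeping, not an API offered by rank_bm25, whose `BM25Okapi` is built once from the full corpus as in `build_bm25_index` above.

```python
import math
from collections import Counter
from typing import Dict

class IncrementalBM25Stats:
    """Corpus statistics BM25 needs, updated per document instead of rebuilt."""

    def __init__(self) -> None:
        self.term_counts: Dict[str, Counter] = {}  # doc id -> term frequencies
        self.doc_freq: Counter = Counter()  # term -> number of docs containing it
        self.total_tokens = 0

    def upsert(self, doc_id: str, text: str) -> None:
        if doc_id in self.term_counts:
            self.delete(doc_id)  # drop the old version's contribution first
        tf = Counter(text.lower().split())
        self.term_counts[doc_id] = tf
        self.doc_freq.update(tf.keys())  # +1 for each distinct term
        self.total_tokens += sum(tf.values())

    def delete(self, doc_id: str) -> None:
        tf = self.term_counts.pop(doc_id)
        self.doc_freq.subtract(tf.keys())  # -1 for each distinct term
        self.total_tokens -= sum(tf.values())

    def avg_doc_len(self) -> float:
        return self.total_tokens / len(self.term_counts)

    def idf(self, term: str) -> float:
        n, df = len(self.term_counts), self.doc_freq[term]
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

stats = IncrementalBM25Stats()
stats.upsert("d1", "rate limit per key")
stats.upsert("d2", "collection schema")
stats.upsert("d1", "rate limit per api key")  # only d1's counts are touched
print(stats.total_tokens, stats.avg_doc_len())
```

Each upsert costs time proportional to the length of the changed document rather than the size of the corpus. Scoring on top of these statistics, persistence, and concurrency are left out of the sketch.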

Stack Summary

Embeddings: NVIDIA NIM (nvidia/nv-embedqa-e5-v5)
Vector store: Chroma Cloud
Sparse retrieval: rank_bm25
Backend: FastAPI (async)
Frontend: Next.js 15
Observability: LangFuse
Deployment: Render (backend) + Vercel (frontend)
Total infra cost: $0 (all free tier)

Full source: https://github.com/vivekpatil200320/contextquery

Wrapping Up
Hybrid retrieval isn't a complex idea — it's two retrievers whose failure modes don't overlap, merged with a formula that takes 10 lines to implement. The results in ContextQuery were significant enough that I now treat it as a default starting point rather than an optimisation.

If you're building a RAG system and hitting a precision ceiling, add BM25 before you touch chunk size, overlap, or embedding models. In ContextQuery it was the highest-leverage change in the retrieval stack.

Building and writing about production AI systems — find more at https://vivekpatil23.hashnode.dev
