BM25 + Dense Fusion: When Keyword Search Saves Your RAG

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Me: xgabriel.com | GitHub

A user types ERR_CONN_RESET_4290 into your support bot. The fix lives in a runbook that names that exact code three times. Your retriever returns five chunks about connection timeouts, retry policies, and TLS handshakes. The right document is not in the top 50. The model writes a confident, generic answer about checking your network, and the user closes the chat angrier than they opened it.

That failure has a name. It is the lexical gap, and pure vector search walks into it every day. Embeddings are trained to capture meaning, and ERR_CONN_RESET_4290 has almost no meaning to a sentence transformer. It is a token soup that gets averaged into a vector sitting somewhere near every other error string in the corpus. The exact match that a human would spot in half a second is invisible to cosine similarity.

Where dense retrieval quietly fails

Dense embeddings win on paraphrase. Ask "how do I cancel my plan" and they will find a document titled "terminating your subscription" even though the query and the title share no words. That is the whole reason vector search took over RAG.

The cost is that they smear out anything that depends on the exact characters. The categories where this bites:

Identifiers: order numbers, user IDs, ticket references, commit SHAs.
Product codes and SKUs: MX-4400-BLK, iPhone15,2, part numbers.
Error codes and log lines: ERR_CONN_RESET_4290, HTTP 429, stack frames.
Rare proper nouns: a person, a config flag, an internal service name the model has never seen.
Exact-phrase legal or policy language where one wrong word changes the meaning.

An embedding model maps all of these into a space optimized for semantic neighborhoods. A SKU has no semantic neighborhood. It needs to be matched, not understood.

BM25 is the old tool that still wins here

BM25 is a ranking function from the 1990s built on term frequency and inverse document frequency. It scores a document by how often the query terms appear in it, weighted down by how common those terms are across the whole corpus, and adjusted for document length. No training. No GPU. No vectors.
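
For reference, the description above corresponds to the commonly cited Okapi BM25 form below. This is the textbook version, not necessarily what any given library computes: implementations differ in how they smooth the IDF term and in their default parameters. Here f(t, D) is the count of term t in document D, |D| is the document length in tokens, avgdl is the average document length, N is the number of documents, and n(t) is the number of documents containing t. The parameters k_1 and b are tuning knobs; values around k_1 = 1.2 to 2.0 and b = 0.75 are typical.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

The fraction saturates as f(t, D) grows, so repeating a term ten times does not score ten times higher, and the b term penalizes long documents that contain the term only in passing.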

For exact-term queries it is hard to beat. If ERR_CONN_RESET_4290 appears in only one document in the entire corpus, BM25 ranks that document first: it is the only one with a non-zero score for that term, and the term's inverse document frequency is as high as it can get. The runbook surfaces. The model gets the right context.

Here is a minimal BM25 retriever using the rank_bm25 library over a tokenized corpus.

from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    # keep letters, digits, and underscores so codes like ERR_CONN_RESET_4290 stay one token
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25Index:
    def __init__(self, docs: list[str]):
        self.docs = docs
        tokenized = [tokenize(d) for d in docs]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query: str, k: int) -> list[int]:
        scores = self.bm25.get_scores(tokenize(query))
        ranked = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True,
        )
        return ranked[:k]

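That is the whole sparse side. It returns document indices ranked by lexical relevance. Notice the tokenizer keeps underscores and digits, so an error code stays one token instead of getting shredded. Hyphens and commas still act as separators, so MX-4400-BLK becomes three tokens (mx, 4400, blk). BM25 can still match on those pieces, but widen the character class if your SKUs need to stay whole.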
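
To see what the tokenizer does to the kinds of identifiers listed earlier, here is a small check. The first function repeats the pattern from the retriever. `tokenize_keep_hyphens` is a hypothetical variant, not part of rank_bm25, that also allows hyphens inside a token; whether it helps depends on how your documents and users actually write their codes.

```python
import re

def tokenize(text: str) -> list[str]:
    # same pattern as the retriever: letters, digits, underscores
    return re.findall(r"[a-z0-9_]+", text.lower())

def tokenize_keep_hyphens(text: str) -> list[str]:
    # hypothetical variant: hyphens are allowed *between* word characters
    return re.findall(r"[a-z0-9_]+(?:-[a-z0-9_]+)*", text.lower())

print(tokenize("ERR_CONN_RESET_4290 on login"))
# ['err_conn_reset_4290', 'on', 'login']
print(tokenize("MX-4400-BLK out of stock"))
# ['mx', '4400', 'blk', 'out', 'of', 'stock']
print(tokenize_keep_hyphens("MX-4400-BLK out of stock"))
# ['mx-4400-blk', 'out', 'of', 'stock']
```

The trade-off is that a whole-SKU token no longer matches a partial query such as "4400", and ordinary hyphenated phrases also collapse into single tokens. Index and query must use the same tokenizer either way.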

Why not just pick one

The trap is treating this as a choice. Dense or sparse. It is not. Each fails on the other's strength.

A query like "what is your refund window for damaged items" wants the dense retriever, because the matching policy doc might say "we accept returns for goods that arrive defective within 30 days" with barely a shared word. A query like "MX-4400-BLK out of stock" wants BM25, because the SKU is the whole question.

Real production traffic is a mix of both, often inside the same query: "is MX-4400-BLK covered by your damage policy" needs the SKU matched exactly and the policy matched semantically. You want both retrievers running and their results merged.

Reciprocal rank fusion: merging without tuning scores

The naive merge is to normalize both score lists and add them. This breaks constantly. BM25 scores are unbounded and corpus-dependent. Cosine similarities are bounded and, in practice, tend to bunch up in a narrow band. Putting them on the same scale means picking a normalization and weights that drift the moment your corpus changes.

Reciprocal rank fusion sidesteps the whole problem. It ignores the raw scores and looks only at the rank position of each document in each list. A document that ranks high in either list gets a high fused score. The formula for one document is the sum over every ranked list of 1 / (k + rank), where k is a constant (60 is the common default from the original Cormack et al. paper) and rank is the document's position in that list.

The k constant flattens the gap between the top ranks, so a first-place finish in one list cannot outweigh solid placements in both. And because only ranks enter the formula, RRF stays stable across wildly different score distributions. You never normalize anything.
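
Written out, with R as the set of ranked lists and r(d) as the position of document d in list r (counted from 1, as in the paper; a list that does not contain d contributes nothing), the fused score is:

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
```

With k = 60, rank 1 contributes 1/61 ≈ 0.0164 and rank 10 contributes 1/70 ≈ 0.0143, a gap of roughly 13%. With k = 0 the same two ranks would contribute 1 and 0.1, and whichever list put a document first would effectively decide the outcome.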

from collections import defaultdict

def reciprocal_rank_fusion(
    rankings: list[list[int]],
    k: int = 60,
) -> list[int]:
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        # ranks count from 1, as in the original RRF paper
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

That is the entire fusion step. It takes any number of ranked lists and returns one merged ranking. It does not care whether a list came from BM25, a vector store, or a third retriever you bolt on later.
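
A small worked run makes the behaviour concrete. The document ids and both rankings below are made up for illustration, and the function is repeated (counting ranks from 1) only so the snippet runs on its own.

```python
from collections import defaultdict

def reciprocal_rank_fusion(
    rankings: list[list[int]],
    k: int = 60,
) -> list[int]:
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# made-up rankings: doc ids ordered best-first by each retriever
sparse_hits = [7, 2, 9]   # e.g. from BM25
dense_hits = [4, 2, 7]    # e.g. from the vector index

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # [7, 2, 4, 9]
```

Docs 7 and 2 appear in both lists and land on top, nearly tied (1/61 + 1/63 against 2/62). Docs 4 and 9 each appear in only one list, so even doc 4's first place on the dense side leaves it below both of them.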

The hybrid retriever in about 50 lines

Now wire the two retrievers and fuse them. The dense side here uses a sentence-transformer and cosine similarity over an in-memory matrix, so the example runs without a vector database. Swap DenseIndex for your Qdrant, pgvector, or Pinecone client and the fusion code does not change.

import numpy as np
from sentence_transformers import SentenceTransformer

class DenseIndex:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.model = SentenceTransformer(
            "all-MiniLM-L6-v2"
        )
        emb = self.model.encode(docs, normalize_embeddings=True)
        self.emb = np.asarray(emb)

    def search(self, query: str, k: int) -> list[int]:
        q = self.model.encode(
            [query], normalize_embeddings=True
        )[0]
        sims = self.emb @ q  # cosine, vectors are normalized
        # argsort is ascending, so negate; tolist() yields plain Python ints
        return np.argsort(-sims)[:k].tolist()

class HybridRetriever:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.bm25 = BM25Index(docs)
        self.dense = DenseIndex(docs)

    def search(
        self, query: str, k: int = 5, pool: int = 50
    ) -> list[str]:
        sparse_hits = self.bm25.search(query, pool)
        dense_hits = self.dense.search(query, pool)
        fused = reciprocal_rank_fusion(
            [sparse_hits, dense_hits]
        )
        return [self.docs[i] for i in fused[:k]]

Each retriever pulls a wider pool (50 here) so a document that one side ranks at position 40 still gets a chance to climb under fusion. You fuse the two pools and cut to the final k. The exact-term query lands its document through the BM25 list, the paraphrase query lands its document through the dense list, and the mixed query lands both.

Making it production-shaped

The in-memory version teaches the idea. A few things change when you run this for real.

Run the retrievers in parallel. They share no state. Fire the BM25 query and the vector query concurrently and fuse when both return. The latency of the hybrid step becomes the slower of the two, not the sum.
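
One way to sketch this is with a thread pool from the standard library. `search_in_parallel` and the two `fake_*` retrievers are hypothetical names for illustration; the fakes just sleep to mimic a network round trip to a search engine or vector store. Threads pay off when the retrievers are I/O-bound calls like that. Two CPU-bound, in-process retrievers would contend for the interpreter lock and gain much less.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# a retriever is anything that maps (query, pool) to a ranked list of doc ids
Retriever = Callable[[str, int], list[int]]

def search_in_parallel(
    retrievers: list[Retriever], query: str, pool: int
) -> list[list[int]]:
    # one worker per retriever; results come back in the order given
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as ex:
        futures = [ex.submit(r, query, pool) for r in retrievers]
        return [f.result() for f in futures]

# stand-ins for real clients: each sleeps to mimic a remote call
def fake_sparse(query: str, pool: int) -> list[int]:
    time.sleep(0.2)
    return [7, 2, 9][:pool]

def fake_dense(query: str, pool: int) -> list[int]:
    time.sleep(0.2)
    return [4, 2, 7][:pool]

start = time.perf_counter()
rankings = search_in_parallel(
    [fake_sparse, fake_dense], "ERR_CONN_RESET_4290", 50
)
elapsed = time.perf_counter() - start
print(rankings)            # [[7, 2, 9], [4, 2, 7]]
print(f"{elapsed:.2f}s")   # roughly one sleep, not two
```

The returned list of rankings can be passed straight to the fusion function. In an async codebase the same shape works with asyncio.gather over async clients.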

Push BM25 into your store if it has one. Elasticsearch and OpenSearch ship BM25 as the default scorer. Postgres has built-in full-text search ranked with ts_rank, which is not BM25 (it does not weight terms by how rare they are across the corpus) but can fill the same lexical slot in a fused pipeline. Qdrant, Weaviate, and Milvus now support sparse vectors and hybrid queries natively. You often do not need a separate rank_bm25 index at all, and you avoid holding the whole corpus in memory.

Weight the lists if your traffic skews. RRF can take per-list weights by scaling each contribution. If your corpus is code-heavy and exact matches matter more, multiply the BM25 list's contribution. Start at equal weight and only move it after an eval tells you to.
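
A minimal sketch of that weighting, assuming one multiplier per list. `weighted_rrf` is a hypothetical name, the rankings are made up, and the weight of 2.0 is only there to show the effect, not a recommended value.

```python
from collections import defaultdict

def weighted_rrf(
    rankings: list[list[int]],
    weights: list[float],
    k: int = 60,
) -> list[int]:
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[int, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # same RRF term as before, scaled by the list's weight
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = [7, 2, 9]   # made-up BM25 ranking
dense_hits = [4, 2, 7]    # made-up dense ranking

print(weighted_rrf([sparse_hits, dense_hits], [1.0, 1.0]))
# [7, 2, 4, 9]
print(weighted_rrf([sparse_hits, dense_hits], [2.0, 1.0]))
# [7, 2, 9, 4]
```

Doubling the BM25 weight lets doc 9, which only the sparse side found, overtake doc 4, which only the dense side found. The documents both retrievers agree on stay on top.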

Rerank the fused candidates. Hybrid fusion gets the right document into the candidate set. A cross-encoder reranker then scores each query-document pair from the fused top 50 jointly and reorders them. Fusion fixes recall. Reranking fixes precision. They stack.
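
The rerank step can be sketched independently of any particular model. `rerank` and `overlap_scorer` are hypothetical names: the scorer is a toy word-overlap stand-in so the snippet runs without downloading anything, and the candidate texts are made up. In a real pipeline you would pass a cross-encoder's scoring function instead, as noted in the comment.

```python
from typing import Callable, Sequence

# takes (query, document) pairs, returns one relevance score per pair
ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]

def rerank(
    query: str, candidates: list[str], score_pairs: ScoreFn, k: int = 5
) -> list[str]:
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    order = sorted(
        range(len(candidates)), key=lambda i: scores[i], reverse=True
    )
    return [candidates[i] for i in order[:k]]

def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    # toy stand-in: counts words shared by query and document.
    # With sentence-transformers you could instead pass
    # CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict
    # (the model name is just one example choice).
    out = []
    for q, d in pairs:
        shared = set(q.lower().split()) & set(d.lower().split())
        out.append(float(len(shared)))
    return out

fused_candidates = [
    "retry policy for timeouts",
    "ERR_CONN_RESET_4290 runbook: restart the gateway",
    "TLS handshake overview",
]
top = rerank(
    "what does ERR_CONN_RESET_4290 mean", fused_candidates,
    overlap_scorer, k=2,
)
print(top[0])  # ERR_CONN_RESET_4290 runbook: restart the gateway
```

Keeping the scorer as a parameter means the fusion code and the reranker can be swapped or evaluated separately.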

When you can skip the dense side entirely

If your corpus is small and your queries are almost all exact-term lookups (a parts catalog, a log search, an internal ID resolver), BM25 alone may beat a vector store on both accuracy and cost. No embedding model, no index build, no GPU. Measure before you reach for the heavier tool. The point of hybrid is not that dense is always right. It is that the two retrievers fail in opposite directions, and fusing them covers the gap that either one leaves open.

The next time your bot whiffs on an error code or a SKU, the fix is probably not a better embedding model. It is forty lines of BM25 and a rank fusion you can read in one sitting.

If this was useful

Hybrid retrieval is one of the patterns the RAG Pocket Guide works through end to end — alongside chunking, reranking, and the eval methodology that tells you whether a change like this actually moved recall on your corpus. If your retrieval layer keeps missing the exact thing the user typed, that is the part of the book to start with.