Hybrid Retrieval Fusion: RRF vs Weighted vs Learned: When Each Wins

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Me: xgabriel.com | GitHub

Hybrid retrieval is BM25 plus a dense retriever plus a fusion step. The first two get all the attention. The fusion step is where most teams paste the textbook default (RRF with k=60), stop thinking, and walk away. That single decision often costs more nDCG than swapping the embedding model.

Three fusion strategies, three failure modes, and one specific case where learned fusion beats both of the simpler ones.

Why fusion matters more than people think

A team I worked with had a hybrid setup: BM25 over Postgres full-text, dense retrieval through OpenAI's text-embedding-3-large, top-100 from each, fused with default RRF. The eval suite said nDCG@10 = 0.62.

Same retrievers, same top-100 candidates, same eval queries. They switched the fusion to a weighted blend with proper min-max normalisation and a 0.4/0.6 split favouring dense. nDCG@10 went to 0.70. Eight points from a fifteen-line change.

That gap exists because the two retrievers disagree on the shape of relevance. BM25 is sharp on lexical overlap and goes to zero fast. Dense scores cluster in a narrow band (cosine often between 0.3 and 0.8). RRF throws away both shapes and only looks at rank position. That's why it's safe. It's also why it's almost never optimal.

Reciprocal Rank Fusion (k=60)

RRF was introduced in a 2009 SIGIR paper by Cormack, Clarke, and Buettcher, with experiments on TREC data. The formula is one line:

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """
    rankings: list of ranked doc-id lists, one per retriever.
                Position 0 = best.
    k: smoothing constant. The paper uses 60. Most people copy that.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return scores


def rrf_topk(rankings: list[list[str]], k_rrf: int = 60,
             top_k: int = 10) -> list[tuple[str, float]]:
    scored = rrf(rankings, k_rrf)
    return sorted(scored.items(), key=lambda x: -x[1])[:top_k]

That's it. No score normalisation, no per-retriever tuning, no training data. You hand it ranked lists, you get a fused ranking. It works on day one with anything that produces an order.

Why it sticks: it's score-shape-invariant. BM25 returning 14.7 and a cosine returning 0.83 are both flattened to "rank 1". You can't break it by feeding it scores in mismatched units.
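To see that invariance concretely, here is a minimal sketch with invented toy scores. It multiplies the BM25 scores by an arbitrary factor and checks that the fused output does not move. The helper to_ranking and the score values are assumptions made up for the illustration; rrf is the same function as above, repeated so the snippet runs on its own.

```python
def to_ranking(scores: dict[str, float]) -> list[str]:
    """Sort doc ids by score, best first. Ties are broken by doc id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return scores


# Toy scores, invented for illustration only.
bm25 = {"a": 14.7, "b": 9.2, "c": 3.1}
dense = {"b": 0.83, "c": 0.80, "a": 0.41}

# Rescaling by a positive factor changes the units but not the order,
# so RRF sees exactly the same input.
bm25_rescaled = {d: s * 1000.0 for d, s in bm25.items()}

fused = rrf([to_ranking(bm25), to_ranking(dense)])
fused_rescaled = rrf([to_ranking(bm25_rescaled), to_ranking(dense)])

print(fused == fused_rescaled)
print(to_ranking(fused))
```

The same rescaling would change the result of any score-based blend, which is the trade RRF makes: robustness in exchange for ignoring how confident each retriever was.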

Why k=60: the original paper picked it empirically on TREC data. The constant softens the dominance of the top-1 result, so a doc that's rank 5 on retriever A and rank 12 on retriever B can still beat a doc that's rank 1 on A and absent from B (1/65 + 1/72 ≈ 0.029 against 1/61 ≈ 0.016). Smaller k makes top ranks more dominant. Larger k flattens the curve. The fused ranking is rarely sensitive to the exact value: on most corpora, anywhere from 40 to 80 produces nearly identical results.
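The effect of k is easy to check by hand. The sketch below compares a doc that is mid-ranked in both lists (ranks 5 and 12, chosen for illustration) with a doc that tops one list and is missing from the other, at a few values of k. The helper rrf_score is a hypothetical name; it uses 1-indexed ranks, which is equivalent to the 1 / (k + rank + 1) form in the code above.

```python
def rrf_score(ranks: list[int], k: int) -> float:
    """RRF score of one doc, given its 1-indexed rank in each list that returned it."""
    return sum(1.0 / (k + r) for r in ranks)


for k in (1, 10, 60):
    both = rrf_score([5, 12], k)   # mid-ranked by both retrievers
    single = rrf_score([1], k)     # top of one list, absent from the other
    winner = "both-lists doc" if both > single else "single-list doc"
    print(f"k={k:>2}  both={both:.4f}  single={single:.4f}  -> {winner}")
```

With a very small k the single rank-1 hit wins; by k=10 the doc that both retrievers agree on has overtaken it, and at k=60 the margin is wide.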

The honest limitation: RRF can't express "I trust the dense retriever 1.5x more on this domain". Every retriever contributes the same shape, weighted equally. If one of your retrievers is noticeably stronger, RRF is leaving points on the table.

Weighted fusion, and the score-normalisation trap

The instinct is straightforward. Take both score lists, blend them, weight them.

final = w_bm25 * bm25_score + w_dense * dense_score

This blows up the moment you actually try it. BM25 scores have no upper bound and on a typical corpus range from 0 to ~30. Cosine similarity sits between -1 and 1 (and, for real query-document pairs, usually 0.3 to 0.8). Add them naively and BM25 wins every fusion by accident, because the numbers are bigger.
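A quick sketch of the failure, using invented toy scores where the two retrievers disagree completely. The function name naive_sum and the numbers are assumptions for the illustration, not anything from a real system.

```python
def naive_sum(bm25: dict[str, float], dense: dict[str, float],
              w_bm25: float, w_dense: float) -> list[str]:
    """Blend raw scores with no normalisation and return doc ids, best first."""
    ids = set(bm25) | set(dense)
    fused = {d: w_bm25 * bm25.get(d, 0.0) + w_dense * dense.get(d, 0.0)
             for d in ids}
    return sorted(fused, key=lambda d: -fused[d])


# Toy scores: BM25 prefers a > b > c, dense prefers c > b > a.
bm25 = {"a": 22.0, "b": 18.5, "c": 4.0}
dense = {"c": 0.82, "b": 0.55, "a": 0.31}

bm25_order = sorted(bm25, key=lambda d: -bm25[d])

print(naive_sum(bm25, dense, 0.4, 0.6))  # same as the BM25 order
print(naive_sum(bm25, dense, 0.1, 0.9))  # still the BM25 order
```

Even with 90% of the weight on dense, the fused order is the BM25 order. The weights are not doing what they appear to do until both inputs live on the same scale.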

You have to normalise both into the same range first. Min-max into [0, 1] is the standard move:

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    vals = list(scores.values())
    lo, hi = min(vals), max(vals)
    if hi - lo < 1e-9:
        # all scores equal, degenerate case, return uniform 1.0
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    bm25: dict[str, float],
    dense: dict[str, float],
    w_bm25: float = 0.4,
    w_dense: float = 0.6,
) -> dict[str, float]:
    bm25_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    all_ids = set(bm25_n) | set(dense_n)
    fused = {}
    for doc_id in all_ids:
        # missing from one retriever = 0 contribution
        fused[doc_id] = (
            w_bm25 * bm25_n.get(doc_id, 0.0)
            + w_dense * dense_n.get(doc_id, 0.0)
        )
    return fused

Two things often go wrong here.

Missing-doc penalty. A doc that BM25 returned but the dense retriever didn't shouldn't automatically get a hard 0 on the dense side; that biases the blend against anything only one retriever found. Some teams instead fill the missing score with the 10th percentile of the scores that are present. Pick a policy and document it.

Per-query vs corpus-level normalisation. Min-max over a single query's top-100 is the right default. Some implementations normalise over the entire corpus, precomputed once. That kills sensitivity, because the per-query score distribution is what carries the signal.
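For the missing-doc policy, one possible shape is sketched below. It assumes both inputs are already min-max normalised per query, and it fills a missing score with a nearest-rank percentile of the scores that retriever did return. The names percentile_floor and weighted_fusion_with_floor, and the pct parameter, are hypothetical; treat this as one policy among several, not a recommendation.

```python
import math


def percentile_floor(values: list[float], pct: float = 10.0) -> float:
    """Nearest-rank percentile of the present scores; 0.0 if there are none."""
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = max(0, math.ceil(pct * len(ordered) / 100.0) - 1)
    return ordered[idx]


def weighted_fusion_with_floor(
    bm25_n: dict[str, float],    # already normalised to [0, 1]
    dense_n: dict[str, float],   # already normalised to [0, 1]
    w_bm25: float = 0.4,
    w_dense: float = 0.6,
    pct: float = 10.0,
) -> dict[str, float]:
    # One fill value per retriever, computed from that retriever's own scores.
    bm25_fill = percentile_floor(list(bm25_n.values()), pct)
    dense_fill = percentile_floor(list(dense_n.values()), pct)
    ids = set(bm25_n) | set(dense_n)
    return {
        d: w_bm25 * bm25_n.get(d, bm25_fill) + w_dense * dense_n.get(d, dense_fill)
        for d in ids
    }
```

On short candidate lists the 10th percentile collapses to the minimum, which after min-max is 0, so the policy only makes a difference on realistic list sizes such as top-100.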

Once you have normalisation right, the weights matter. On most corpora, dense gets 0.5–0.7 because semantic match generalises better. On product-catalog or code-search corpora where exact terms matter (SKU numbers, function names), BM25 weight goes up. The only honest way to find the right weight is grid search over your eval set, which the bench script below handles.

Learned fusion with LightGBM

Weighted fusion treats the weights as static. The real world isn't static. For a short keyword query like "outbox pattern", BM25 should win. For "what happens when a downstream consumer is slow", dense should dominate. Learned fusion lets the model decide per query.

The trick is to frame it as learning-to-rank. For each (query, doc) pair, gather features: BM25 score, dense score, BM25 rank, dense rank, length ratios, query type signals. Use LightGBM's LambdaRank objective to learn a scoring function that maximises nDCG directly.

import lightgbm as lgb
import numpy as np


def build_features(
    query: str,
    candidates: list[str],
    bm25: dict[str, float],
    dense: dict[str, float],
    doc_lengths: dict[str, int],
) -> np.ndarray:
    # rank lookups (1-indexed, 999 = not present)
    bm25_order = sorted(bm25, key=lambda d: -bm25[d])
    dense_order = sorted(dense, key=lambda d: -dense[d])
    bm25_rank = {d: i + 1 for i, d in enumerate(bm25_order)}
    dense_rank = {d: i + 1 for i, d in enumerate(dense_order)}

    rows = []
    q_len = len(query.split())
    for doc_id in candidates:
        rows.append([
            bm25.get(doc_id, 0.0),
            dense.get(doc_id, 0.0),
            bm25_rank.get(doc_id, 999),
            dense_rank.get(doc_id, 999),
            1.0 / bm25_rank.get(doc_id, 999),
            1.0 / dense_rank.get(doc_id, 999),
            doc_lengths.get(doc_id, 0),
            q_len,
            doc_lengths.get(doc_id, 0) / max(q_len, 1),
        ])
    return np.array(rows, dtype=np.float32)


def train_learned_fusion(
    X: np.ndarray,
    y: np.ndarray,            # relevance labels (0..4 typical)
    group: list[int],         # docs per query
) -> lgb.Booster:
    dataset = lgb.Dataset(X, label=y, group=group)
    params = {
        "objective": "lambdarank",
        "metric": "ndcg",
        "ndcg_eval_at": [10],
        "learning_rate": 0.05,
        "num_leaves": 31,
        "min_data_in_leaf": 20,
        "verbosity": -1,
    }
    # early stopping in real code — left out for brevity
    return lgb.train(params, dataset, num_boost_round=200)

LightGBM expects features grouped by query. group=[100, 100, 100] means three queries with 100 candidates each. Labels are graded relevance: 0 (irrelevant), 1 (somewhat), 2 (relevant), 3 (highly relevant), 4 (perfect).
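The bookkeeping around group is where most first attempts go wrong, so here is a small stdlib-only sketch of it. It assumes you keep a parallel list of query ids and doc ids in the same row order as the feature matrix, and that preds holds one score per row (for example, what booster.predict(X) returns). The function names group_sizes and rank_per_query are hypothetical helpers, not part of LightGBM.

```python
from itertools import groupby


def group_sizes(qids: list[str]) -> list[int]:
    """Docs-per-query in row order. Rows for one query must be contiguous."""
    sizes = [sum(1 for _ in run) for _, run in groupby(qids)]
    if len(sizes) != len(set(qids)):
        raise ValueError("rows for the same query id are not contiguous")
    return sizes


def rank_per_query(
    qids: list[str],
    doc_ids: list[str],
    preds: list[float],
    top_k: int = 10,
) -> dict[str, list[tuple[str, float]]]:
    """Split flat per-row predictions back into one ranked list per query."""
    by_query: dict[str, list[tuple[str, float]]] = {}
    for qid, doc_id, score in zip(qids, doc_ids, preds):
        by_query.setdefault(qid, []).append((doc_id, float(score)))
    return {
        qid: sorted(rows, key=lambda r: -r[1])[:top_k]
        for qid, rows in by_query.items()
    }


qids = ["q1", "q1", "q2", "q2", "q2"]
doc_ids = ["a", "b", "c", "d", "e"]
preds = [0.1, 0.9, 0.5, 0.7, 0.2]   # stand-in for model output

print(group_sizes(qids))
print(rank_per_query(qids, doc_ids, preds))
```

The model's scores are only comparable within a query, so the sort has to happen per query, never over the whole prediction array.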

This needs labelled data. A few thousand graded query-doc pairs is the entry ticket. Below that, you're better off with weighted fusion and a careful grid search. Once you're past roughly 1k QPS in production and have click logs you can mine into implicit labels, learned fusion typically beats tuned weighted fusion by 3–6 points nDCG@10.

The trade is real: more infrastructure, model retraining cadence, the eternal feature-drift question. Worth it only when retrieval quality is on the critical path and you have the labels.

A 100-line fusion benchmark

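Stop arguing about RRF vs weighted on Slack. Run them on your corpus. This script does it in about a hundred lines: RRF, the 0.4/0.6 weighted blend, and a small weight grid. Learned fusion is not included, because it needs a trained model and labelled data.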

# bench_fusion.py — drop into your repo, swap in your own retrievers.
import json
import math
import numpy as np
from pathlib import Path


def load_qrels(path: str) -> dict[str, dict[str, int]]:
    """TREC qrels format: query_id  0  doc_id  relevance"""
    qrels: dict[str, dict[str, int]] = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the qrels file
        qid, _, doc_id, rel = line.split()
        qrels.setdefault(qid, {})[doc_id] = int(rel)
    return qrels


def ndcg_at_k(ranking: list[str], rels: dict[str, int], k: int = 10) -> float:
    dcg = 0.0
    for i, doc_id in enumerate(ranking[:k]):
        gain = (2 ** rels.get(doc_id, 0)) - 1
        dcg += gain / math.log2(i + 2)
    ideal_rels = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(((2 ** r) - 1) / math.log2(i + 2)
               for i, r in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0


def fuse_rrf(rankings, k_rrf=60):
    s = {}
    for r in rankings:
        for rank, d in enumerate(r):
            s[d] = s.get(d, 0.0) + 1.0 / (k_rrf + rank + 1)
    return sorted(s, key=lambda d: -s[d])


def minmax(scores):
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi - lo < 1e-9:
        return {d: 1.0 for d in scores}
    return {d: (v - lo) / (hi - lo) for d, v in scores.items()}


def fuse_weighted(bm25_s, dense_s, w_b=0.4, w_d=0.6):
    bn, dn = minmax(bm25_s), minmax(dense_s)
    ids = set(bn) | set(dn)
    fused = {d: w_b * bn.get(d, 0.0) + w_d * dn.get(d, 0.0) for d in ids}
    return sorted(fused, key=lambda d: -fused[d])


def run_bench(queries_path: str, qrels_path: str, runs_path: str):
    # queries_path is not read here: runs.json is already keyed by query id.
    # runs.json: {qid: {"bm25": {doc:score,...}, "dense": {doc:score,...}}}
    qrels = load_qrels(qrels_path)
    runs = json.loads(Path(runs_path).read_text())
    rrf_scores, w_scores = [], []
    grid = [(0.2, 0.8), (0.3, 0.7), (0.4, 0.6), (0.5, 0.5), (0.6, 0.4)]
    grid_scores = {g: [] for g in grid}

    for qid, retriever_runs in runs.items():
        bm25_s = retriever_runs["bm25"]
        dense_s = retriever_runs["dense"]
        bm25_rank = sorted(bm25_s, key=lambda d: -bm25_s[d])
        dense_rank = sorted(dense_s, key=lambda d: -dense_s[d])
        rels = qrels.get(qid)
        if not rels:
            # no judgments for this query: skip it instead of averaging in a 0
            continue

        rrf_scores.append(ndcg_at_k(fuse_rrf([bm25_rank, dense_rank]), rels))
        w_scores.append(
            ndcg_at_k(fuse_weighted(bm25_s, dense_s, 0.4, 0.6), rels)
        )
        for wb, wd in grid:
            grid_scores[(wb, wd)].append(
                ndcg_at_k(fuse_weighted(bm25_s, dense_s, wb, wd), rels)
            )

    print(f"RRF k=60       nDCG@10: {np.mean(rrf_scores):.4f}")
    print(f"Weighted 0.4/0.6 nDCG@10: {np.mean(w_scores):.4f}")
    print("Grid search:")
    for g, scores in sorted(grid_scores.items(),
                            key=lambda kv: -np.mean(kv[1])):
        print(f"  bm25={g[0]} dense={g[1]} : {np.mean(scores):.4f}")


if __name__ == "__main__":
    run_bench("queries.tsv", "qrels.txt", "runs.json")

Feed it a TREC-format qrels file, a JSON of per-query retriever scores, and you get a comparison table you can defend in a design review. Most teams discover their "tuned" weights are off by 0.1–0.2 from the actual optimum.

The gotcha: score shapes across retrievers

The trap nobody warns you about: score normalisation hides the fact that different retrievers produce different shapes, not just different scales.

BM25 scores follow a long-tail distribution: a couple of strong hits, then a drop, then a long mediocre tail. Cosine similarity over normalised embeddings produces a tight bell. Dot products from non-normalised embeddings can be unbounded and follow yet another shape.

Min-max normalisation maps both to [0, 1] but preserves the shape. A 0.9 BM25-normalised score means "this doc is dramatically better than the rest of the candidates". A 0.9 cosine-normalised score often just means "this doc is slightly better than its neighbours, which are also close to the top". Same number, different meaning.
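A toy illustration of "same number, different meaning". The two score lists below are invented to mimic the shapes described above (two strong hits and a long tail for BM25, a tight band for cosine). The helper names are hypothetical.

```python
def min_max(scores: list[float]) -> list[float]:
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def share_above(norm: list[float], threshold: float = 0.5) -> float:
    """Fraction of candidates whose normalised score exceeds the threshold."""
    return sum(1 for s in norm if s > threshold) / len(norm)


# Invented top-10 score lists, best first.
bm25_like = [24.0, 21.0, 6.0, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0]
cosine_like = [0.71, 0.70, 0.69, 0.68, 0.66, 0.64, 0.62, 0.60, 0.58, 0.55]

print(share_above(min_max(bm25_like)))    # 2 of 10 candidates above 0.5
print(share_above(min_max(cosine_like)))  # 6 of 10 candidates above 0.5
```

After min-max, a high normalised score is rare on the BM25 side and common on the dense side, so the same weight buys very different amounts of separation from each retriever.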

Two practical workarounds:

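Z-score, then squash. Standardise to mean 0 and variance 1, clip the outliers, then squash with a sigmoid. Note that z-scoring followed by plain min-max changes nothing, because min-max is invariant to shifting and rescaling; the clip and the sigmoid are what do the work. Together they stop a single outlier from stretching the whole range and align the shapes better.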

Rank-based features in learned fusion. When you let LightGBM see the raw rank position alongside the score, it implicitly learns shape compensation. That's why learned fusion edges out weighted fusion on heterogeneous retrievers, even when the score-level gap is small.

If you're sticking with weighted fusion, swap min-max for (score - mean) / std, clip outliers to ±3 sigma, then sigmoid. Cleaner blend, less retriever-domination drift across queries.
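That recipe can be sketched in a few lines of stdlib Python. Two choices here are assumptions rather than anything prescribed above: the standard deviation is the population one (pstdev), and a doc missing from one retriever is given sigmoid(-clip), the lowest value a present doc can receive. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def zscore_sigmoid(scores: dict[str, float], clip: float = 3.0) -> dict[str, float]:
    """Standardise per query, clip to +/- clip sigma, squash into (0, 1)."""
    if not scores:
        return {}
    vals = list(scores.values())
    mu, sigma = fmean(vals), pstdev(vals)
    if sigma < 1e-9:
        # all scores equal: no signal, map everything to the sigmoid midpoint
        return {d: 0.5 for d in scores}
    out = {}
    for d, s in scores.items():
        z = max(-clip, min(clip, (s - mu) / sigma))
        out[d] = 1.0 / (1.0 + math.exp(-z))
    return out


def weighted_fusion_z(
    bm25: dict[str, float],
    dense: dict[str, float],
    w_bm25: float = 0.4,
    w_dense: float = 0.6,
    clip: float = 3.0,
) -> dict[str, float]:
    bn, dn = zscore_sigmoid(bm25, clip), zscore_sigmoid(dense, clip)
    # Assumed missing-doc policy: same score as a doc clipped at -clip sigma.
    floor = 1.0 / (1.0 + math.exp(clip))
    ids = set(bn) | set(dn)
    return {d: w_bm25 * bn.get(d, floor) + w_dense * dn.get(d, floor) for d in ids}
```

The weights found by grid search under min-max will not transfer to this normalisation, so rerun the grid after switching.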

What fusion strategy are you running in production, and have you actually benchmarked it against the alternatives, or is it still the default RRF you copy-pasted from a tutorial?

If this was useful

Fusion is one of those parts of a RAG pipeline that looks like a one-liner until you measure it. The RAG Pocket Guide walks through the retrieval-and-reranking chain end to end: chunking choices, hybrid setups, fusion strategies, the cross-encoder rerank layer that sits on top, and the eval setup that tells you whether any of it is moving the needle. The chapter on fusion expands on the score-shape issue and walks through training a learned fuser on click logs.