Felipe Araújo

Posted on Jun 19

Replacing Cross-Encoder Reranking with a Weighted Hybrid Score

#ai #nlp #performance #rag

My RAG pipeline had a bottleneck, and the fix turned out to be just simple math.

The problem

My retrieval pipeline for Uma Busca de Gelo e Fogo, a RAG system over the full A Song of Ice and Fire corpus (~66k paragraphs), follows a fairly standard hybrid retrieval setup:

Dense retrieval (ChromaDB, bge-m3)
    → BM25 sparse (BM25Okapi)
        → RRF fusion
            → Cross-encoder rerank (bge-reranker-v2-m3)
                → Top chunks → LLM

The cross-encoder (bge-reranker-v2-m3) was doing its job, reordering the fused candidates by genuine semantic relevance. The problem was the cost: on CPU, reranking just 10 chunks took 8.6 seconds. The full search pipeline averaged 12.57 seconds per query. That's before the LLM even starts generating a response.

In a chat interface, that is painfully slow: nobody wants to wait 12 seconds for a search step before the answer even starts to appear.

The insight: I already had the signals, I just wasn't using them

Before the cross-encoder ever runs, the pipeline already computes three independent relevance signals for every candidate chunk:

bm25_score — lexical relevance from BM25Okapi
dense_cosine — semantic similarity from the dense retrieval step (1 − cosine distance from ChromaDB)
rrf_score — the fused rank score from Reciprocal Rank Fusion

All three were being computed, used internally for fusion, and then discarded before reranking. The cross-encoder was reading the raw chunk text and recomputing relevance from scratch — expensive and, in part, redundant with signals already sitting in memory.
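For readers who haven't implemented the fusion step, here is a minimal sketch of how an rrf_score like the one above can be produced. It is not the project's actual code: the function name rrf_fuse, the chunk IDs, and the constant k=60 (a commonly used default) are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    Returns a dict mapping chunk ID -> fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical chunk IDs, best-first, from each retriever.
dense_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c2", "c4", "c1"]

rrf_scores = rrf_fuse([dense_ranking, bm25_ranking])
fused_order = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
print(fused_order)
```

Note that RRF only looks at rank positions, so the raw BM25 and cosine values carry information that the fused score alone does not.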

So the question became: what if, instead of running a second neural model, I just combined the signals I already had?

The solution: reranking as a weighted sum

The replacement really is this simple: min-max normalize each signal across the candidate set (BM25 and cosine live on very different scales), then take a weighted sum per chunk:

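def normalize(values):
    # Min-max scale to [0, 1]. Assumed edge cases: an empty list stays empty,
    # and a list of identical values maps to all zeros.
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def lightweight_rerank(chunks, weights):
    # Normalize each signal across the candidate set so the scales are comparable.
    norm_bm25 = normalize([c["bm25_score"] for c in chunks])
    norm_dense = normalize([c["dense_cosine"] for c in chunks])
    norm_rrf = normalize([c["rrf_score"] for c in chunks])

    # Final score is a weighted sum of the three normalized signals.
    for chunk, b, d, r in zip(chunks, norm_bm25, norm_dense, norm_rrf):
        chunk["final_score"] = (
            weights["bm25"] * b +
            weights["dense"] * d +
            weights["rrf"] * r
        )

    return sorted(chunks, key=lambda c: c["final_score"], reverse=True)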

Default weights: bm25=0.3, dense=0.5, rrf=0.2. No transformer forward pass. No GPU. No 1.5GB model loaded into memory. Just a weighted sum of numbers that were already sitting in the pipeline.

I kept the original cross-encoder code completely intact, gated behind an environment variable (RERANKER_MODE=cross_encoder|lightweight), so I could A/B test instead of betting the whole pipeline on a hunch.
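A gate like that can be very small. The sketch below is one way to wire it, not the project's actual code: the function names, the fallback to lightweight when the variable is unset, and passing the two rerankers in as callables are all assumptions made to keep the example self-contained.

```python
import os

# Default weights quoted in the article.
DEFAULT_WEIGHTS = {"bm25": 0.3, "dense": 0.5, "rrf": 0.2}
VALID_MODES = {"cross_encoder", "lightweight"}


def get_reranker_mode(environ=os.environ):
    """Read RERANKER_MODE, assuming 'lightweight' when it is unset."""
    mode = environ.get("RERANKER_MODE", "lightweight").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown RERANKER_MODE: {mode!r}")
    return mode


def rerank(query, chunks, cross_encoder_fn, lightweight_fn, environ=os.environ):
    """Dispatch to whichever reranker the environment selects.

    cross_encoder_fn(query, chunks) and lightweight_fn(chunks, weights)
    are hypothetical callables standing in for the two real code paths.
    """
    if get_reranker_mode(environ) == "cross_encoder":
        return cross_encoder_fn(query, chunks)
    return lightweight_fn(chunks, DEFAULT_WEIGHTS)
```

Failing loudly on an unknown mode avoids silently benchmarking the wrong reranker during an A/B run.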

The result: ~13x faster

Metric	Cross-Encoder	Lightweight	Improvement
Reranking step (10 chunks)	8.6s	~0.01s	~860x faster
Full search pipeline	12.57s	0.96s	13.1x faster
Model RAM footprint	~1.5GB	0	—

The reranking step itself went from being the dominant cost in the pipeline to being essentially free, a handful of arithmetic operations on already-computed numbers. The full pipeline now responds in under a second instead of nearly 13.

But does the ranking still make sense?

Speed alone doesn't matter if the lightweight reranker is just putting irrelevant chunks first. So I compared the actual ordering it produces against the cross-encoder's ordering, across 18 test queries:

Metric	Value
Overlap@10	1.000
NDCG@10	0.889
MRR	0.458

Overlap@10 = 1.000 means both methods return the exact same 10 candidate chunks; the only difference is the order they're placed in. NDCG@10 = 0.889 confirms that the order is, overall, quite close to what the cross-encoder would produce. MRR = 0.458 is the more honest number: the cross-encoder's top pick isn't always the lightweight reranker's top pick, though it usually lands in the top 2-3 (a reciprocal rank of 0.5 corresponds to second place).
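To make those three numbers concrete, here is a minimal sketch of how they can be computed for one query when the cross-encoder's ordering is treated as the reference. The exact definitions behind the table aren't spelled out above, so this is an assumption: in particular, the NDCG gain used here (a chunk ranked i-th by the cross-encoder earns k − i) is one reasonable choice among several, and the chunk IDs are made up.

```python
import math


def overlap_at_k(reference, candidate, k=10):
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    return len(set(reference[:k]) & set(candidate[:k])) / k


def reciprocal_rank(reference, candidate):
    """1 / position of the reference's top pick within the candidate ordering."""
    top = reference[0]
    return 1.0 / (candidate.index(top) + 1) if top in candidate else 0.0


def ndcg_at_k(reference, candidate, k=10):
    """NDCG with an assumed graded gain derived from the reference ranking."""
    gain = {cid: k - i for i, cid in enumerate(reference[:k])}
    dcg = sum(
        gain.get(cid, 0) / math.log2(pos + 2)
        for pos, cid in enumerate(candidate[:k])
    )
    ideal = sum(
        g / math.log2(pos + 2)
        for pos, g in enumerate(sorted(gain.values(), reverse=True))
    )
    return dcg / ideal if ideal else 0.0


# Hypothetical orderings for a single query, best-first.
cross_encoder_order = ["a", "b", "c"]
lightweight_order = ["b", "a", "c"]

print(overlap_at_k(cross_encoder_order, lightweight_order, k=3))
print(reciprocal_rank(cross_encoder_order, lightweight_order))
print(ndcg_at_k(cross_encoder_order, lightweight_order, k=3))
```

Averaging reciprocal_rank over all test queries gives the MRR; the other two are averaged the same way.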

That gap matters, and I'm not going to pretend it doesn't, which is part of why this isn't the end of the story (more on that below).

Why this works at all

The honest technical reason this isn't crazy: BM25, dense cosine similarity, and RRF score are already decent relevance signals on their own. That's the whole premise of hybrid retrieval. The cross-encoder's job was to refine an already-reasonable ordering, not to find relevance from nothing. When Overlap@10 is 1.000, the heavy lifting (deciding which 10 chunks matter) was already done upstream by retrieval and fusion. In this setup, the cross-encoder was mostly fine-tuning an ordering that was largely correct already, and a weighted sum of existing signals can approximate that fine-tuning at a fraction of the cost.

I also tried something more rigorous than guessing the weights: I used the cross-encoder's own scores as a training signal and fit a linear regression (numpy.linalg.lstsq) over the three normalized signals. The result: R² = 0.128. In plain terms, the three signals combined linearly explain only about 13% of the variance in what the cross-encoder actually scores. The cross-encoder is doing something genuinely non-linear, picking up on relationships between query and text that a weighted sum of BM25/cosine/RRF can't represent, no matter how the weights are tuned.
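For anyone who wants to repeat that experiment on their own pipeline, a minimal sketch of the fit looks like this. The function name, the added intercept column, and the synthetic data are assumptions for illustration; only the use of numpy.linalg.lstsq and the R² readout come from the description above.

```python
import numpy as np


def fit_linear_weights(signals, targets):
    """Least-squares fit of cross-encoder scores on the three signals.

    signals: array of shape (n, 3) with normalized bm25, dense, rrf values.
    targets: array of shape (n,) with cross-encoder scores.
    Returns (coefficients including intercept, R squared).
    """
    signals = np.asarray(signals, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Append a column of ones so the fit can learn an intercept (an assumption).
    X = np.column_stack([signals, np.ones(len(signals))])
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    predictions = X @ coef
    ss_res = np.sum((targets - predictions) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic stand-in data, NOT the article's measurements.
rng = np.random.default_rng(0)
demo_signals = rng.random((200, 3))
demo_targets = demo_signals @ np.array([0.3, 0.5, 0.2]) + rng.normal(0, 0.5, 200)

demo_coef, demo_r2 = fit_linear_weights(demo_signals, demo_targets)
print(demo_coef, demo_r2)
```

A low R² on real data says the cross-encoder's scores are poorly predicted by any linear mix of the three signals, which is a statement about score values rather than a direct measure of ranking quality.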

That's a useful negative result. It tells me the manual default weights are about as good as this approach is going to get: there's no hidden linear combination waiting to be discovered. If I want to close the remaining gap with the cross-encoder, a linear combination of these three signals isn't the path.

What this means in practice

For my use case, a chat interface where someone asks questions about ASOIAF lore and expects a fast, conversational response, this trade was worth it. Going from 12.57s to 0.96s per search step is the difference between "usable in a live chat" and "noticeably broken." And the lightweight reranker isn't reordering chunks randomly; it's a reasonable approximation that gets the same candidate set, mostly in the same order.

What I'm not claiming: that this is a drop-in replacement for a cross-encoder in every RAG system. If your bottleneck isn't latency, or if you need maximum precision regardless of cost, the cross-encoder is still doing real work that a weighted sum can't fully replicate (that R²=0.128 result makes that explicit).

What's next

This change was isolated to the reranking step specifically so I could measure it cleanly. But the lightweight reranker isn't the end of the optimization work. It's a starting point for a few things I'm actively testing now:

Switching the generation model. I'm currently experimenting with Qwen3.6 instead of Llama 3.3 70B for the generation step, to see if it handles the retrieved context more reliably.
Re-embedding with optimized chunking. I'm rebuilding the corpus embeddings with a revised chunking strategy, which should change the quality of what dense retrieval surfaces in the first place, upstream of anything the reranker does.
Adjusting the generation prompt. I'm tuning the prompt so the LLM anchors its answers more tightly to the retrieved context, independent of which reranker is feeding it chunks.

Each of those is a separate variable, and I'm keeping them isolated rather than changing everything at once — which is exactly how I caught the reranker's actual impact in the first place. The next article will cover what happens when those land.

Follow this rebuild in public: the project is live at buscadegeloefogo.vercel.app, and the source is on GitHub.
