Siddharth Pandey

Posted on Jun 21

Why Your Reranker Isn't Helping Your RAG Pipeline (And How to Prove It)

#rag #opensource #llm #typescript

You add a cross-encoder reranker to your RAG pipeline, measure answer quality on a test set, see a marginal improvement on 3 of 8 questions, and ship it. Six weeks later your p99 retrieval latency has climbed by 200 ms and you're paying Cohere API costs on every call. Nobody has revisited the decision because there's no data to revisit. The reranker is in the pipeline now. It probably helps.

That "probably" is the problem. RAGScope (available on npm) gives you a per-query metric that tells you whether your reranker is earning its cost or actively making things worse.

What "Reranker Gain" Actually Measures

When your RAG app runs a query, it emits OpenTelemetry spans: a retrieval span carrying chunk IDs, scores, and content; an optional reranking span; and an LLM span containing the full prompt text. RAGScope receives these via OTLP on port 4321 and analyzes the full trace end-to-end.

The rerank-gain metric answers one question: did the reranker pull the chunks the LLM actually used toward the top of the list? RAGScope compares each chunk's retrieval rank (its position before reranking) against its reranked rank (its position after), then measures the average rank improvement of the chunks that ended up in the LLM's prompt. A chunk that was retrieved 8th but reranked to 2nd and appeared in the prompt counts as a large positive gain. A chunk that was retrieved 2nd, reranked to 9th, and got dropped from the prompt counts as a loss.
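The comparison described above can be sketched in a few lines of TypeScript. This is only an illustration of the averaging step, not RAGScope's actual implementation: the `RankedChunk` shape and the `averageRankGain` name are hypothetical, ranks are assumed to be 1-based, and the sketch covers only chunks that reached the prompt. How RAGScope scores chunks that were dropped from the prompt, and how it maps the average onto the 0-100 score, is not modeled here.

```typescript
// Hypothetical shape for illustration; RAGScope's real span attributes may differ.
interface RankedChunk {
  id: string;
  retrievalRank: number; // 1-based position before reranking
  rerankedRank: number;  // 1-based position after reranking
}

// Average rank improvement over the chunks that appeared in the LLM's prompt.
// Positive means promoted, negative means demoted.
// Returns null when none of the ranked chunks were used.
function averageRankGain(
  chunks: RankedChunk[],
  usedChunkIds: Set<string>
): number | null {
  const used = chunks.filter((c) => usedChunkIds.has(c.id));
  if (used.length === 0) return null;
  const total = used.reduce(
    (sum, c) => sum + (c.retrievalRank - c.rerankedRank),
    0
  );
  return total / used.length;
}

const exampleChunks: RankedChunk[] = [
  { id: "a", retrievalRank: 8, rerankedRank: 2 }, // promoted 6 positions
  { id: "b", retrievalRank: 3, rerankedRank: 1 }, // promoted 2 positions
  { id: "c", retrievalRank: 5, rerankedRank: 4 }, // promoted 1 position
  { id: "d", retrievalRank: 2, rerankedRank: 9 }, // demoted, not in the prompt
];

// Chunks a, b, and c reached the prompt: (6 + 2 + 1) / 3 = 3
const exampleGain = averageRankGain(exampleChunks, new Set(["a", "b", "c"]));
```

Here `exampleGain` is 3, which corresponds to the "promoted avg +3.0 ranks" style of readout.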

The metric only appears in the score when the trace contains a reranker span. When it does, the weights renormalize automatically — precision drops from 40% to 35%, efficiency from 30% to 25%, and rerank-gain takes a 15% slice alongside uniqueness at 15% and coverage at 10%. Traces without a reranker span score exactly as before, so you can compare directly.
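To make the weighting concrete, here is a minimal sketch that combines per-metric scores using the reranker-trace weights quoted above. It assumes the overall score is a plain weighted average of 0-100 metric scores, which the article implies but does not state. The `compositeScore` function and the `MetricName` type are hypothetical names, and the sample scores are made up for illustration.

```typescript
type MetricName =
  | "precision"
  | "efficiency"
  | "rerankGain"
  | "uniqueness"
  | "coverage";

// Weights quoted in the article for traces that contain a reranker span.
const WEIGHTS_WITH_RERANKER: Record<MetricName, number> = {
  precision: 0.35,
  efficiency: 0.25,
  rerankGain: 0.15,
  uniqueness: 0.15,
  coverage: 0.1,
};

// Assumption: the overall score is a weighted average of the metric scores.
// Dividing by the weight total keeps the result on the 0-100 scale even if
// the weights do not sum to exactly 1.
function compositeScore(
  scores: Record<MetricName, number>,
  weights: Record<MetricName, number>
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const name of Object.keys(weights) as MetricName[]) {
    weighted += scores[name] * weights[name];
    totalWeight += weights[name];
  }
  return weighted / totalWeight;
}

// Made-up scores: strong retrieval, but a reranker that demotes used chunks.
const sampleScores: Record<MetricName, number> = {
  precision: 90,
  efficiency: 80,
  rerankGain: 25,
  uniqueness: 70,
  coverage: 60,
};

const sampleComposite = compositeScore(sampleScores, WEIGHTS_WITH_RERANKER);
```

With these sample numbers the weighted terms are 31.5, 20, 3.75, 10.5, and 6, so a poor rerank-gain score costs the trace several points even when precision is high.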

Reading the Signal — What Good and Bad Look Like

A reranker earning its cost looks like this:

│  ✓  rerank-gain  88  █████████░  used chunks promoted avg +3.0 ranks

The chunks the LLM actually used were promoted an average of 3 positions by the reranker. That means the reranker is doing its job: surfacing the relevant material higher so it reaches the prompt and lands near the edges where the LLM attends most.

A reranker not earning its cost:

│  ✗  rerank-gain  25  ███░░░░░░░  used chunks demoted avg -2.0 ranks
│  → Reranker is not surfacing the chunks the LLM actually uses

The chunks the LLM used were demoted by the reranker. They reached the prompt despite the reranker, not because of it. The reranker added latency and cost and then moved the useful material further back in the queue.

The key insight is that RAGScope measures gain on the chunks that actually appeared in the LLM's prompt, not on the full ranked list. A reranker can shuffle 10 results around impressively while consistently pushing the 3 chunks the LLM uses toward positions 7, 8, and 9. That's not a reranker working; that's a reranker actively degrading retrieval for this query type.

What to Do When the Reranker Is Hurting

The first step is query segmentation. RAGScope scores every query your pipeline processes individually. Run it for a day and you'll have a distribution of rerank-gain scores broken down by query type. If your reranker earns a score of 80+ on factual lookups but consistently scores below 30 on comparison queries, you have a model-query-type mismatch, not a broken reranker.
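One way to run that segmentation yourself is sketched below. It assumes you have exported per-query rerank-gain scores and attached your own query-type label to each one. The `ScoredQuery` shape, the `summarizeByQueryType` helper, and the labels are all hypothetical, and the default threshold of 30 simply mirrors the figure mentioned above.

```typescript
// Hypothetical record: one row per scored query, labeled by you.
interface ScoredQuery {
  queryType: string;  // e.g. "factual" or "comparison"
  rerankGain: number; // 0-100 rerank-gain score for that query
}

interface SegmentSummary {
  queryType: string;
  count: number;
  mean: number;
  flagged: boolean; // true when the mean falls below the threshold
}

// Group scores by query type and flag segments whose mean is low.
function summarizeByQueryType(
  queries: ScoredQuery[],
  lowThreshold = 30
): SegmentSummary[] {
  const buckets = new Map<string, number[]>();
  for (const q of queries) {
    const scores = buckets.get(q.queryType) ?? [];
    scores.push(q.rerankGain);
    buckets.set(q.queryType, scores);
  }
  return Array.from(buckets.entries()).map(([queryType, scores]) => {
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    return { queryType, count: scores.length, mean, flagged: mean < lowThreshold };
  });
}

// Made-up data for illustration.
const sampleQueries: ScoredQuery[] = [
  { queryType: "factual", rerankGain: 85 },
  { queryType: "factual", rerankGain: 90 },
  { queryType: "comparison", rerankGain: 20 },
  { queryType: "comparison", rerankGain: 28 },
];

const segmentSummary = summarizeByQueryType(sampleQueries);
```

In this made-up sample the factual segment averages 87.5 and the comparison segment averages 24, so only the comparison segment is flagged.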

The second step is checking your reranker's training domain. Cross-encoders trained on MS MARCO work well for web-search-style queries. If your documents are internal API docs, legal contracts, or medical literature, the reranker may be applying a relevance signal that's semantically misaligned with your content. A low rerank-gain score on a specific document type is a strong signal to evaluate a domain-specific model.

If the rerank-gain score is consistently low across query types, the simplest intervention is removing the reranker entirely and routing that latency budget into a higher TOP_K with tighter similarity thresholds. RAGScope's precision metric will tell you immediately whether that trade works: if precision improves and efficiency holds, you've recovered the latency without losing quality.
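As a rough illustration of the "higher TOP_K with tighter similarity thresholds" trade, the sketch below filters candidates by a minimum similarity and then keeps the best `topK`. The `ScoredChunk` shape, the `selectChunks` helper, and the example values are assumptions for illustration. Your vector store may already expose equivalent options, in which case you would set them there instead.

```typescript
// Hypothetical shape: a retrieved chunk with its similarity score (higher is better).
interface ScoredChunk {
  id: string;
  similarity: number;
}

// Keep at most topK chunks, and only those at or above minSimilarity.
// filter() returns a new array, so the caller's input is not mutated by sort().
function selectChunks(
  candidates: ScoredChunk[],
  topK: number,
  minSimilarity: number
): ScoredChunk[] {
  return candidates
    .filter((c) => c.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}

// Made-up candidates for illustration.
const sampleCandidates: ScoredChunk[] = [
  { id: "a", similarity: 0.9 },
  { id: "b", similarity: 0.5 },
  { id: "c", similarity: 0.8 },
  { id: "d", similarity: 0.75 },
  { id: "e", similarity: 0.3 },
];

// A wider TOP_K paired with a tighter threshold: up to 8 chunks, none below 0.7.
const selectedChunks = selectChunks(sampleCandidates, 8, 0.7);
```

Here the threshold, not TOP_K, is what bounds the result: chunks a, c, and d pass, while b and e are excluded regardless of how large TOP_K is.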

Conclusion

A reranker is not always additive. It introduces latency, API cost, and an additional failure mode on every query — and most teams have no per-query signal to determine whether it's paying for itself. Aggregate quality metrics on a test set don't expose query-level degradation.

RAGScope's rerank-gain metric gives you that signal query by query, live in your terminal as the pipeline runs. Start it with npx ragscope start, add OTLP instrumentation to your retrieval and reranker calls, and you'll know within the first few queries whether the reranker is earning its place in your pipeline.

GitHub · npm

Key Takeaways

rerank-gain measures the average rank improvement of the chunks the LLM actually used — not the full ranked list, which can mask per-query degradation.
The metric only appears when the trace contains a reranker span; weights renormalize automatically (precision 35%, efficiency 25%, rerank-gain 15%, uniqueness 15%, coverage 10%).
A reranker with consistently negative rerank-gain is demoting the chunks the LLM uses — adding cost and latency for a net-negative retrieval outcome.
Query segmentation reveals whether the reranker works for some query types but not others, pointing to model-query-type mismatch.
If rerank-gain is consistently low across query types, removing the reranker and increasing TOP_K is often a better trade — RAGScope's precision score will validate it immediately.

Top comments (4)

Tae Kim • Jun 21

Reranker latency cost surprised me too. I tried a cross-encoder on a RAG pipeline and the gains on average precision were real but marginal on most queries. What helped was selective reranking: only call the reranker when the top-k similarity scores are clustered close together, meaning retrieval is uncertain. When the top chunk is clearly ahead, skip it. Reduced calls by about 60 percent with almost no precision drop on the cases that mattered.

Siddharth Pandey • Jun 23

Selective reranking based on score clustering is a smart gate — essentially using retrieval confidence as a skip signal. RAGScope's rerank-gain metric measures this exact outcome: whether chunks the LLM actually used moved up in rank after reranking. You'd see that 60% of skipped calls had a clear top chunk (high scoreNormalized, no clustering), and rerank-gain would still be high because the uncertain cases — where it ran — did the heavy lifting.

Tae Kim • Jun 24

The 60% stat makes the gating decision tractable — if a clear top chunk is already there, you're paying latency for nothing. My gate was simpler: skip reranking when top-1 BM25 score exceeded a threshold AND bi-encoder agreed. Not as principled as rerank-gain but cheap to compute inline. The gap you're describing — measuring whether uncertain cases actually benefited — is exactly what I was missing. Makes me want to add that metric to see how much I'm over- or under-gating.

Siddharth Pandey • Jun 24

Exactly — and that's the gap rerank-gain closes. Try it once on a real trace and you'll see immediately whether your uncertain cases are actually getting meaningful rank movement or just adding latency for noise. Would love to hear what you find.