You add a cross-encoder reranker to your RAG pipeline, measure answer quality on a test set, see a marginal improvement on 3 of 8 questions, and ship it. Six weeks later your p99 retrieval latency has climbed by 200 ms and you're paying Cohere API costs on every call. Nobody has revisited the decision because there's no data to revisit. The reranker is in the pipeline now. It probably helps.
That "probably" is the problem. RAGScope gives you a per-query metric that tells you exactly whether your reranker is earning its cost or actively making things worse.
What "Reranker Gain" Actually Measures
When your RAG app runs a query, it emits OpenTelemetry spans: a retrieval span carrying chunk IDs, scores, and content; an optional reranking span; and an LLM span containing the full prompt text. RAGScope receives these via OTLP on port 4321 and analyzes the full trace end-to-end.
The rerank-gain metric answers one question: did the reranker pull the chunks the LLM actually used toward the top of the list? RAGScope compares each chunk's retrieval rank (its position before reranking) against its reranked rank (its position after), then measures the average rank improvement of the chunks that ended up in the LLM's prompt. A chunk that was retrieved 8th but reranked to 2nd and appeared in the prompt counts as a large positive gain. A chunk that was retrieved 2nd, reranked to 9th, and got dropped from the prompt counts as a loss.
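To make the rank comparison concrete, here is a minimal Python sketch of the "average rank improvement of used chunks" idea. It is an illustration under assumptions, not RAGScope's implementation: the function name `average_rank_gain`, the list-of-IDs inputs, and the choice to score only chunks that reached the prompt are all hypothetical, and it does not model how chunks dropped from the prompt are penalized or how the average is mapped to a 0-100 score.

```python
from statistics import mean


def average_rank_gain(retrieval_order, reranked_order, used_chunk_ids):
    """Mean of (retrieval rank - reranked rank) over chunks that reached the prompt.

    Ranks are 1-based positions in each list. A positive result means the
    reranker promoted the used chunks; a negative result means it demoted them.
    """
    retrieval_rank = {cid: i for i, cid in enumerate(retrieval_order, start=1)}
    reranked_rank = {cid: i for i, cid in enumerate(reranked_order, start=1)}

    # Only chunks present in both rankings can be compared.
    gains = [
        retrieval_rank[cid] - reranked_rank[cid]
        for cid in used_chunk_ids
        if cid in retrieval_rank and cid in reranked_rank
    ]
    return mean(gains) if gains else 0.0


# Hypothetical chunk IDs, in retrieval order and then in reranked order.
retrieved = ["a", "b", "c", "d", "e", "f", "g", "h"]
reranked = ["h", "c", "a", "b", "d", "e", "f", "g"]

promoted = average_rank_gain(retrieved, reranked, ["h", "c"])  # 8th->1st, 3rd->2nd
demoted = average_rank_gain(retrieved, reranked, ["a", "b"])   # 1st->3rd, 2nd->4th
```

In this toy data the same reranking is a win when the LLM used `h` and `c` and a loss when it used `a` and `b`, which is why the gain is computed against the used chunks rather than the whole list.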
The metric only appears in the score when the trace contains a reranker span. When it does, the weights renormalize automatically — precision drops from 40% to 35%, efficiency from 30% to 25%, and rerank-gain takes a 15% slice alongside uniqueness at 15% and coverage at 10%. Traces without a reranker span score exactly as before, so you can compare directly.
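The effect of the 15% slice can be sketched in a few lines of Python. The weights below are the ones stated above for traces with a reranker span; everything else is an assumption for illustration: the source does not say how RAGScope combines the sub-scores, so the plain weighted sum, the `composite_score` name, and the sample sub-score values are hypothetical.

```python
# Weights for traces that contain a reranker span (from the article).
RERANK_WEIGHTS = {
    "precision": 0.35,
    "efficiency": 0.25,
    "rerank_gain": 0.15,
    "uniqueness": 0.15,
    "coverage": 0.10,
}


def composite_score(sub_scores, weights=RERANK_WEIGHTS):
    """Weighted sum of 0-100 sub-scores (assumed aggregation, not confirmed)."""
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(weights[name] * sub_scores[name] for name in weights)


# Hypothetical sub-scores for one query; only rerank_gain differs between runs.
base = {"precision": 80, "efficiency": 70, "uniqueness": 90, "coverage": 60}
good_reranker = composite_score({**base, "rerank_gain": 88})
bad_reranker = composite_score({**base, "rerank_gain": 25})
```

Under these assumptions, a 63-point swing in rerank-gain moves the overall score by 0.15 × 63 = 9.45 points, so a misbehaving reranker is visible in the headline number without drowning out precision and efficiency.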
Reading the Signal — What Good and Bad Look Like
A reranker earning its cost looks like this:
│ ✓ rerank-gain 88 █████████░ used chunks promoted avg +3.0 ranks
The chunks the LLM actually used were promoted an average of 3 positions by the reranker. That means the reranker is doing its job: surfacing the relevant material higher so it reaches the prompt and lands near the edges where the LLM attends most.
A reranker not earning its cost:
│ ✗ rerank-gain 25 ███░░░░░░░ used chunks demoted avg -2.0 ranks
│ → Reranker is not surfacing the chunks the LLM actually uses
The chunks the LLM used were demoted by the reranker. They reached the prompt despite the reranker, not because of it. The reranker added latency and cost and then moved the useful material further back in the queue.
The key insight is that RAGScope measures gain on the chunks that actually appeared in the LLM's prompt, not on the full ranked list. A reranker can shuffle 10 results around impressively while consistently pushing the 3 chunks the LLM uses toward positions 7, 8, and 9. That's not a reranker working; that's a reranker actively degrading retrieval for this query type.
What to Do When the Reranker Is Hurting
The first step is query segmentation. RAGScope scores every query your pipeline processes individually. Run it for a day and you'll have a distribution of rerank-gain scores broken down by query type. If your reranker earns a score of 80+ on factual lookups but consistently scores below 30 on comparison queries, you have a model-query-type mismatch, not a broken reranker.
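A segmentation pass like this can be sketched in Python once you have scores labeled by query type. This is a rough sketch under assumptions: the `(query_type, score)` records, the type labels, and the `summarize_by_query_type` helper are hypothetical, and how you collect scores from RAGScope is not covered here. The 80 and 30 cutoffs are the ones mentioned above.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_query_type(records, low_threshold=30, high_threshold=80):
    """Group rerank-gain scores by query type and label each group.

    records: iterable of (query_type, rerank_gain_score) pairs.
    Returns {query_type: (average_score, verdict)}.
    """
    by_type = defaultdict(list)
    for query_type, score in records:
        by_type[query_type].append(score)

    summary = {}
    for query_type, scores in by_type.items():
        avg = mean(scores)
        if avg < low_threshold:
            verdict = "hurting"
        elif avg >= high_threshold:
            verdict = "earning"
        else:
            verdict = "mixed"
        summary[query_type] = (avg, verdict)
    return summary


# Hypothetical day of scores, labeled by hand or by a simple classifier.
records = [
    ("factual", 85),
    ("factual", 90),
    ("comparison", 20),
    ("comparison", 28),
]
summary = summarize_by_query_type(records)
```

A split like "factual earning, comparison hurting" points at a mismatch between the reranker and one query type, which suggests routing or swapping the model for that type rather than removing the reranker everywhere.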
The second step is checking your reranker's training domain. Cross-encoders trained on MS MARCO work well for web-search-style queries. If your documents are internal API docs, legal contracts, or medical literature, the reranker may be applying a relevance signal that's semantically misaligned with your content. A low rerank-gain score on a specific document type is a strong signal to evaluate a domain-specific model.
If the rerank-gain score is consistently low across query types, the simplest intervention is removing the reranker entirely and routing that latency budget into a higher TOP_K with tighter similarity thresholds. RAGScope's precision metric will tell you immediately whether that trade works: if precision improves and efficiency holds, you've recovered the latency without losing quality.
Conclusion
A reranker is not always additive. It introduces latency, API cost, and an additional failure mode on every query — and most teams have no per-query signal to determine whether it's paying for itself. Aggregate quality metrics on a test set don't expose query-level degradation.
RAGScope's rerank-gain metric gives you that signal query by query, live in your terminal as the pipeline runs. Start it with npx ragscope start, add OTLP instrumentation to your retrieval and reranker calls, and you'll know within the first few queries whether the reranker is earning its place in your pipeline.
Key Takeaways
- rerank-gain measures the average rank improvement of the chunks the LLM actually used — not the full ranked list, which can mask per-query degradation.
- The metric only appears when the trace contains a reranker span; weights renormalize automatically (precision 35%, efficiency 25%, rerank-gain 15%, uniqueness 15%, coverage 10%).
- A reranker with consistently negative rerank-gain is demoting the chunks the LLM uses — adding cost and latency for a net-negative retrieval outcome.
- Query segmentation reveals whether the reranker works for some query types but not others, pointing to model-query-type mismatch.
- If rerank-gain is consistently low across query types, removing the reranker and increasing TOP_K is often a better trade — RAGScope's precision score will validate it immediately.