- Book: RAG Pocket Guide
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team I spoke with was debugging a sudden recall regression. Their RAG pipeline had been fine for months. Then product moved the search box from the docs site to an internal admin tool, and answers got worse overnight. Same corpus, same embedding model, same bge-reranker-v2-m3 in stage two. Recall@5 on their golden set dropped from 0.83 to 0.71.
The instinct was that something in the index drifted. The actual cause was simpler. The new query distribution was short, typo-heavy admin queries: "refnd 14d policy", "webhook timeout pcfg". The cross-encoder reordered the candidate set worse than the bi-encoder alone.
Pulling the reranker out raised recall@5 back to 0.82.
The advice that ships with every RAG tutorial is "always rerank." On average, it's good advice. It's also wrong often enough that you should know when.
The benchmarks tell on themselves
If you read the public reranker numbers carefully, they say two things at once. The published numbers from Pinecone, BAAI, and the cross-encoder literature all show strong average lifts. The Pinecone reranker post reports double-digit NDCG gains on TREC and meaningful averages over dense retrieval. The In Defense of Cross-Encoders paper shows large cross-encoders beating dense bi-encoders by several NDCG points across BEIR. Those are real wins.
But every one of those papers also has per-dataset breakdowns. On a few BEIR sets the gain is under one point. On at least one, depending on the cross-encoder you pick, the rerank stage barely moves the needle or moves it the wrong way. The averaging hides the variance.
You ship one corpus, with one query distribution and one set of hard negatives. The averaged graph does not tell you which side of the variance you are on.
Four cases where the reranker hurts
These are the patterns that keep coming up across teams I work with. None are exotic, and all four are visible in your traces if you look.
1. Domain mismatch between reranker training and your corpus
Most off-the-shelf cross-encoders are trained on MS MARCO — web passages, factoid queries. If your corpus is legal contracts, medical guidelines, source code, or compliance policy, the reranker is being asked to score relevance in a register it never saw at scale. It will not refuse. It will produce confident, wrong scores.
The bi-encoder has the same problem in principle, but the bi-encoder is just a similarity function. Its mistakes are diffuse rather than confident. A poorly-fit cross-encoder is confident and wrong. It will pull a hard negative to the top because the surface tokens look like a passage answer in MS MARCO.
You see this most loudly on legal-style retrieval where two clauses share most of their vocabulary and only one answers the question.
2. Short or typo-heavy queries
Cross-encoders are query-sensitive in a way bi-encoders are not. The bi-encoder embeds your query into a smoothed semantic neighbourhood; small spelling errors mostly survive that. The cross-encoder reads the query as text and uses every token through full attention. A 3-word misspelled admin query gives it almost nothing to attend to, and what it does attend to is noise.
This is the case the team above hit. Their query log was 60% under 5 tokens, with a typo rate of about 18%. The reranker had nothing useful to lift on those queries. On the ambiguous ones, it actively reordered the right answer down.
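The cheap diagnostic is to measure your query distribution before blaming the index: what share of the log is short, and what share contains tokens the corpus has never seen. A minimal sketch, assuming one query per line in `queries.txt` and one document per line in `corpus.txt` (both file names are placeholders), with out-of-vocabulary tokens as a crude typo proxy:

```python
with open("queries.txt") as f:
    queries = [q.strip() for q in f if q.strip()]

vocab = set()
with open("corpus.txt") as f:
    for line in f:
        vocab.update(line.lower().split())

short = [q for q in queries if len(q.split()) < 5]
# Crude typo proxy: any token that never appears in the corpus.
oov = [q for q in queries
       if any(tok.strip("?,.!") not in vocab for tok in q.lower().split())]

print(f"short (<5 tokens): {len(short) / len(queries):.0%}")
print(f"has OOV token:     {len(oov) / len(queries):.0%}")
```

If the short and OOV shares look anything like the 60% and 18% above, stratify your eval before trusting any reranker number.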
3. Hard negatives the reranker mis-rates
Every corpus has documents that look like they answer the question but do not. A pricing page for the EU plan vs the US plan. A v1 API spec next to a v2 API spec. The Q3 incident postmortem next to the Q4 one. Bi-encoders treat these as topically close: both ride near the top of the dense list, because embedding similarity is decided by topic-level features.
A cross-encoder will pick between them based on textual evidence. If the wrong one happens to share more surface phrases with the query, it will boost it. You traded a topic-level mistake (often acceptable) for a specificity-level mistake, which is almost always unacceptable to the LLM that consumes top-k.
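You can probe for this before shipping the reranker: take a handful of known lookalike pairs, score the query against the gold chunk and its hard negative with both stages, and see which one the cross-encoder prefers. A minimal sketch, with invented pricing snippets standing in for your real pairs:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder

bi = SentenceTransformer("BAAI/bge-small-en-v1.5")
ce = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "what does the EU team plan cost per seat"
gold = "EU pricing: the Team plan is EUR 12 per seat per month, billed annually."
lookalike = "US pricing: the Team plan is $14 per seat per month, billed annually."

# Stage one view: cosine similarity (embeddings are normalized).
q_emb, gold_emb, look_emb = bi.encode(
    [query, gold, lookalike], normalize_embeddings=True
)
print("bi-encoder   :", float(q_emb @ gold_emb), float(q_emb @ look_emb))

# Stage two view: cross-encoder relevance scores per (query, doc) pair.
print("cross-encoder:", ce.predict([(query, gold), (query, lookalike)]))
```

If the lookalike wins under the cross-encoder on pairs like these, you are in this failure mode, and the aggregate benchmark number will not warn you.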
4. Top-k truncated below the reranker's effective range
The two-stage pattern only works if stage one returns a candidate set wide enough that the right answer is in it often enough. The reranker can only re-order what it sees. Suppose your bi-encoder recall@50 is barely better than recall@5. Then you are running a slow rescoring pass on a candidate set that already had the right answer near the top. There is nothing left to lift.
I see this most often when teams cut top-k from 50 to 20 or 10 to fit a latency budget without measuring what happened to recall@k upstream.
A cheap experiment you can run today
You do not need BEIR to find out which side of the variance your corpus is on. Take a frozen eval set of (query, gold chunk) pairs, your bi-encoder, and your reranker. Sentence-Transformers ships both stages. The script below uses synthetic-but-typical data in place of your eval set so it runs end-to-end.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

bi = SentenceTransformer("BAAI/bge-small-en-v1.5")
ce = CrossEncoder("BAAI/bge-reranker-v2-m3")

# In production: load your real (query, gold_id) pairs
# and your real document corpus.
corpus = [
    "Refunds are permitted within 14 days, "
    "except for digital goods after download.",
    "Refunds are permitted for any subscription "
    "plan within 30 days of purchase.",
    "All sales are final on enterprise contracts.",
    "Returns must be initiated through the customer "
    "support portal within the refund window.",
    "Webhook timeouts are configured per integration "
    "in the admin panel under Integrations > Limits.",
]
queries = [
    ("digital goods refund?", 0),
    ("how long do I have to refund a subscription", 1),
    ("can I refund an enterprise contract", 2),
    ("webhook timeout settings", 4),
    ("refnd 14d policy", 0),  # typo case
]

doc_emb = bi.encode(corpus, normalize_embeddings=True)

def recall_at_k(retrieved_ids, gold_id, k):
    return int(gold_id in retrieved_ids[:k])

bi_recalls, ce_recalls = [], []
for q, gold in queries:
    # Stage one: dense retrieval with the bi-encoder.
    q_emb = bi.encode([q], normalize_embeddings=True)
    sims = (q_emb @ doc_emb.T).flatten()
    bi_top = np.argsort(-sims)[:50].tolist()
    bi_recalls.append(recall_at_k(bi_top, gold, 5))

    # Stage two: rerank the same candidates with the cross-encoder.
    pairs = [(q, corpus[i]) for i in bi_top]
    ce_scores = ce.predict(pairs)
    ce_top = [bi_top[i] for i in np.argsort(-ce_scores)]
    ce_recalls.append(recall_at_k(ce_top, gold, 5))

print("recall@5 bi-encoder:", np.mean(bi_recalls))
print("recall@5 + rerank:  ", np.mean(ce_recalls))
```
Run it on your real eval set with at least 200 pairs and stratify the result by query length and query class (typo vs clean, short vs long, in-domain vs new-segment). The aggregate number will hide the regressions. The strata will not.
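Stratifying is a few lines of bookkeeping on top of that loop: tag each eval pair with a class, record a hit or miss per stage, and aggregate per class. A minimal sketch over placeholder per-query rows (collect the real `(stratum, bi_hit, ce_hit)` rows inside the loop above):

```python
from collections import defaultdict
import numpy as np

# One row per eval query: (stratum, bi-encoder hit@5, +rerank hit@5).
# These rows are placeholders; collect the real ones in the eval loop.
results = [
    ("short+typo", 1, 0),
    ("short", 1, 1),
    ("long", 0, 1),
    ("long", 1, 1),
]

by_stratum = defaultdict(list)
for stratum, bi_hit, ce_hit in results:
    by_stratum[stratum].append((bi_hit, ce_hit))

for stratum, rows in sorted(by_stratum.items()):
    bi = np.mean([r[0] for r in rows])
    ce = np.mean([r[1] for r in rows])
    print(f"{stratum:12s} recall@5 bi={bi:.2f} +rerank={ce:.2f} (n={len(rows)})")
```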
A decision rule that works
One rule survives every audit:
Only rerank when bi-encoder `recall@50` is meaningfully higher than `recall@5`. A 25-point delta or more is the floor.
The logic is mechanical. The reranker reorders the top-50. If 90% of the gold chunks are already in your bi-encoder's top-5, the reranker has at most 10% of queries to help with — and on those it has to be right more often than wrong, against an opinionated cross-encoder's failure modes. The math rarely works.
If recall@50 is 0.95 and recall@5 is 0.78, the reranker has 17 points of headroom and real queries to lift. That is the regime the published benchmarks live in. That is where reranking earns its slot.
If recall@50 is 0.84 and recall@5 is 0.81, the reranker has 3 points to fight over. Adding 80–200ms of latency to fight for 3 points is a bad trade. You also re-expose yourself to all four failure modes above.
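The check itself is two numbers and a subtraction, recorded in the same eval loop. A sketch of the rule, with the recall figures as placeholders for whatever your own run produces:

```python
# Record recall@50 alongside recall@5 in the eval loop, e.g.:
#   bi_recalls_50.append(recall_at_k(bi_top, gold, 50))

recall_at_5 = 0.78   # placeholder: your measured bi-encoder recall@5
recall_at_50 = 0.95  # placeholder: your measured bi-encoder recall@50

headroom = recall_at_50 - recall_at_5
if headroom >= 0.25:
    print(f"{headroom:.0%} headroom: a reranker has room to earn its latency")
else:
    print(f"{headroom:.0%} headroom: skip stage two, fix stage one instead")
```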
What to do instead when reranking is not the answer
Three moves cover most of what reranking would have done, and they do not have the failure modes above.
Hybrid retrieval. BM25 or SPLADE in parallel with your dense retriever, fused with reciprocal rank fusion. Often beats reranker-on-top-of-dense in settings where queries have rare terms or exact phrases.
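A minimal sketch of the fusion step, assuming you already have one ranked list of doc ids per retriever; the k=60 constant is the conventional RRF default, not something tuned to any particular corpus:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = [3, 7, 1, 9]    # placeholder id lists from each retriever
dense_top = [7, 2, 3, 5]
print(reciprocal_rank_fusion([bm25_top, dense_top]))
```

The appeal of RRF is that it only looks at ranks, so you never have to calibrate BM25 scores against cosine similarities.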
Fine-tune the reranker on your domain. Off-the-shelf rerankers fail on domain mismatch. A 5,000-pair fine-tune of bge-reranker-v2-m3 on your own (query, hard-negative, positive) triples often closes the gap. Cheaper than people expect: a single GPU afternoon for most corpora.
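As a sketch of what that looks like with the classic `CrossEncoder.fit` training loop in sentence-transformers (newer releases move training to a Trainer-style API, so check your installed version); the triple, batch size, and step counts below are placeholders:

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader

# Placeholder triple: mine (query, positive, hard-negative) from your own logs.
triples = [
    ("digital goods refund?",
     "Refunds are permitted within 14 days, except for digital goods after download.",
     "Refunds are permitted for any subscription plan within 30 days of purchase."),
]

train_samples = []
for query, positive, hard_negative in triples:
    train_samples.append(InputExample(texts=[query, positive], label=1.0))
    train_samples.append(InputExample(texts=[query, hard_negative], label=0.0))

model = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=100,
          output_path="bge-reranker-v2-m3-domain")
```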
Query rewriting before retrieval. For the short or typo-heavy case, a 100ms LLM call that rewrites the query into a fuller form often beats a reranker stage. The reranker was failing because the query was bad. Fix the query.
The honest framing
Reranking is not magic. It is a useful tool with a well-known failure surface. The teams that get the most out of it are the ones who measured first, picked stage two only when stage one had room to be lifted, and stratified their evals so a single average score could not lie to them.
If you are running a reranker in production right now and you cannot answer "what is my bi-encoder recall@50 minus recall@5", you do not know whether your reranker is helping or hurting. Find out before your next incident does.
If this was useful
The two-stage retrieval pattern gets its own chapter in the RAG Pocket Guide, including the failure modes above, the eval methodology that catches them, and the fine-tuning recipe for when off-the-shelf rerankers do not fit your corpus.
