Abdullah Shahin

Posted on Jun 3

RAG reranking for production agents: four approaches, four failure modes

#ai #rag #agents

Most agents that "hallucinate" in production aren't actually hallucinating. The right context existed in the index. It just didn't make it to the top of the retrieval window.

Reranking is the layer that decides whether your agent sees the answer or the noise. And the choice between reranker types shapes the failure mode you'll spend the next quarter debugging.

I keep seeing teams pick a reranker the way you'd pick a vector DB — benchmark on a public dataset, ship the winner, move on. That works for retrieval-augmented chatbots. It doesn't work for agents, because the failure modes are different in a way the benchmarks don't surface — and because, as we learned the hard way building HiveIn, there is no single reranker that fits every retrieval call you make once you have more than one shape of query.

The shape of the silent failure:

User → Agent: "Cancel my subscription."
Agent → Retrieval: query embedding
Retrieval → Agent: top-5 = [pricing FAQ, tier comparison, upgrade flow, …] (the correct doc was in top-50 but didn't reach top-5)
Agent → Tool: cancel_account(wrong_target_id)
Tool → User: "Done." (wrong action executed — nobody knows yet)

The right doc existed. The reranker didn't surface it. The agent acted anyway. That's the gap this article is about.

The four approaches, and what each one breaks on

1. Bi-encoder top-k, no rerank

Just vector search. Cosine similarity over the query embedding and the document embeddings, take top-k, hand to the model.

P50 latency: ~30ms
Cost: near-zero per query
Quality ceiling: low

Failure mode: topically similar but query-mismatched. Bi-encoders score on topic overlap, not query-answer fit. "How do I cancel my subscription" pulls the pricing FAQ, the tier comparison page, and the upgrade flow — all topical, none answering the question. The model gets handed a context window full of adjacent documents and either confabulates an answer that sounds right, or — if it's an agent — confidently fires the wrong tool against the wrong target.

This is the default and it's almost always wrong for agent workloads. The latency is great. Everything else is a problem.
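To make the failure concrete, here is a minimal sketch of bi-encoder top-k using only the standard library. The three-dimensional vectors and document names are made-up stand-ins for real embeddings, chosen so that the query lands near the pricing cluster.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (hypothetical values).
docs = {
    "pricing_faq": [0.9, 0.4, 0.1],
    "tier_comparison": [0.8, 0.5, 0.2],
    "cancel_howto": [0.7, 0.3, 0.6],
}
query = [0.9, 0.45, 0.15]  # "cancel my subscription", embedded near the pricing cluster

print(top_k(query, docs, k=2))
```

With these toy numbers the two topical neighbours fill the top-2 and cancel_howto never reaches the model, which is the shape of the failure described above.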

2. Cross-encoder rerankers (Cohere Rerank, BGE-reranker, Voyage rerank-2)

Top-50 from the bi-encoder gets re-scored by a cross-encoder that processes (query, candidate) pairs jointly, attending across both. Top-5 goes to the model.

P50 latency: 100–300ms
Cost: per-token, scales with candidate count × candidate length
Quality ceiling: high

Failure mode: P99 latency and provider drift. The mean looks fine. The tail breaks SLAs because a cross-encoder can't precompute document representations the way a bi-encoder can: every (query, candidate) pair needs its own full forward pass at query time, so the work grows with candidate count and candidate length. Hosted rerankers compound this with provider-side queueing during peak load.

The other thing nobody tells you: when the provider quietly rolls a new reranker version, your offline eval suite doesn't catch it. Your top-1 results shift, your agent's behavior shifts, and the only signal is a slow drift in user complaints over the following week. Cross-encoders are a black box you don't own.
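A rough sketch of the rerank stage, plus one cheap way to notice silent drift: pin the top-1 result for a small golden query set and measure how many flip on a re-run. The score_pair callable stands in for whatever cross-encoder you call; toy_score and top1_change_rate are hypothetical helpers for illustration, not any provider's API.

```python
from typing import Callable, Dict, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> List[Tuple[str, float]]:
    """Re-score every (query, candidate) pair and keep the best top_n.

    score_pair stands in for one cross-encoder forward pass; one call per
    candidate is why cost grows with candidate count and length.
    """
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def top1_change_rate(golden: Dict[str, str], current: Dict[str, str]) -> float:
    """Fraction of golden queries whose top-1 document differs in the current run."""
    if not golden:
        return 0.0
    changed = sum(1 for q, doc in golden.items() if current.get(q) != doc)
    return changed / len(golden)
```

Running the golden set on a schedule and alerting when the change rate moves is one way to turn "a slow drift in user complaints" into a signal you see the same day. How large a change should page someone is something you would have to calibrate yourself.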

3. Late-interaction models (ColBERT, ColBERTv2, JaColBERT)

Token-level similarity computed at retrieval time, using pre-computed per-token embeddings. Sits between bi-encoder and cross-encoder on the quality/latency curve.

P50 latency: ~50ms
Cost: at query time, cheap. At storage time, expensive.
Quality ceiling: high

Failure mode: index storage at scale. Per-token embeddings inflate your index size 10–30x versus a bi-encoder. Works great when your corpus is small or your infra budget is large. Becomes operationally untenable somewhere around 10M+ documents — the index stops fitting on the box you wanted it to fit on, and the next box up doubles your retrieval-tier cost.

A lot of teams adopt ColBERT during prototyping when the corpus is small, then quietly migrate off it 18 months later when the cost curve catches up. If you can predict that trajectory in advance, skip it.
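For reference, the late-interaction scoring step (MaxSim) is small enough to sketch in plain Python. The two-dimensional token vectors are toy values; a real index stores one vector per document token, which is where the storage cost comes from.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token.

    Assumes doc_tokens is non-empty and vectors are already normalized.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

doc = [[1.0, 0.0], [0.6, 0.8]]       # two document token vectors (toy values)
short_query = [[1.0, 0.0], [0.0, 1.0]]
long_query = short_query + [[1.0, 0.0]]

print(maxsim_score(short_query, doc))
print(maxsim_score(long_query, doc))
```

The longer query scores higher against the same document simply because it contributes one more term to the sum, so raw MaxSim totals are only comparable between queries of the same length.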

4. LLM-as-reranker

Take the top-N candidates from the bi-encoder, format them into a prompt, and ask a small LLM to rank them for the query. Sometimes this is GPT-4o-mini, sometimes a fine-tuned 1B model, sometimes the same model that's about to use the retrieved context.

P50 latency: 500ms–2s
Cost: tokens × N, plus the inference call itself
Quality ceiling: highest

Failure mode: stochastic ordering and cache hostility. Same query, same candidates, same model — the LLM can return a different ordering on a repeat call. You can lower the temperature, but you can't eliminate it without losing the reasoning that made you choose an LLM reranker in the first place. And caching is harder than the other approaches because the prompt encodes both the query and the candidates, so cache keys explode.

LLM rerankers are the highest-ceiling option and the most expensive thing to operate. They're rarely the right default. They're often the right escalation — used selectively when the cheaper rerankers are uncertain.
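One way to see the cache problem is to write the key down. This is a sketch under assumptions: rerank_cache_key is a hypothetical helper, and sorting the candidate IDs only makes sense if your prompt builder also presents candidates in that same canonical order.

```python
import hashlib
import json

def rerank_cache_key(model: str, query: str, candidate_ids: list) -> str:
    """Build a deterministic cache key for one (model, query, candidate set)."""
    payload = json.dumps(
        {
            "model": model,
            "query": query.strip().lower(),
            # Sorted so the same candidate set maps to the same key
            # regardless of the order retrieval returned it in.
            "candidates": sorted(candidate_ids),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A cache hit also pins the ordering for repeat calls, which sidesteps the stochasticity for exact repeats. The catch is visible in the key: swap a single candidate and you get a fresh key, so hit rates fall off quickly once the recall pass is even slightly unstable.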

The decision matrix

Approach	P50 latency	Quality ceiling	Where it breaks
Bi-encoder only	~30ms	Low	Query-intent mismatch
Cross-encoder	100–300ms	High	P99 tail, provider drift
Late-interaction	~50ms	High	Index storage at scale
LLM rerank	500ms–2s	Highest	Stochasticity, cost, cache

A reasonable default for an agent stack today: bi-encoder for the cheap recall pass, cross-encoder on the top-50, LLM rerank reserved for cases where the cross-encoder's top-1 score is ambiguous.
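"Ambiguous" needs a definition before you can route on it. The sketch below assumes one possible definition, a small gap between the top-1 and top-2 cross-encoder scores; the 0.05 margin is a placeholder and would need calibrating for the specific reranker in use.

```python
def is_ambiguous(scores, min_margin=0.05):
    """True when the best score does not clearly beat the runner-up."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) < min_margin

def choose_ranker(cross_encoder_scores, min_margin=0.05):
    """Escalate to the LLM reranker only when the cross-encoder result is a near-tie."""
    if is_ambiguous(cross_encoder_scores, min_margin):
        return "llm_rerank"
    return "cross_encoder"
```

A clear winner such as [0.91, 0.40, 0.12] stays on the cross-encoder path, while a near-tie such as [0.52, 0.50, 0.49] escalates.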

What "score" actually means (and why it bites you)

Before going further, the part that trips up almost every team building this for the first time: the number a reranker returns is not the same kind of number a vector search returns, and the numbers different rerankers return are not comparable to each other.

A bi-encoder score is a cosine similarity (or a normalized dot product). It lives in roughly [-1, 1], the magnitudes drift by embedding model and normalization scheme, and it's a measurement of topical similarity in the embedding space — not a probability that the chunk answers the query.

A cross-encoder score depends entirely on which cross-encoder. Cohere returns a 0–1 calibrated relevance probability you can almost reason about across queries. BGE-reranker emits raw logits where the absolute number is meaningless: only the ranking within a query matters, and comparing scores across two different queries tells you nothing. Voyage normalizes differently again.

ColBERT's score is the sum, over query tokens, of each token's maximum similarity to any document token. It is unbounded and scales with query length: a score of 8.4 for a four-token query means something completely different from 8.4 for a twenty-token query. LLM-as-reranker scores are usually fabrications the model attaches after the fact to justify the ordering it already chose; treat them as ordinal at best.

Here's the same idea laid out as a reference:

Scorer	Range	What the number actually means
Bi-encoder cosine	[-1.0, 1.0]	Topical similarity in embedding space — not a probability of relevance
Cohere Rerank	[0.0, 1.0]	Calibrated relevance probability — almost comparable across queries
BGE-reranker	Unbounded raw logits	Only within-query ranking is meaningful — absolute number is noise
Voyage rerank-2	[0.0, 1.0]	Normalized within Voyage's training distribution; not portable
ColBERT max-sim sum	Unbounded	Scales with query length — same number means different things at different lengths
RRF fusion	Sum of 1/(k + rank) over the fused lists	Tiny absolute values; high-confidence cutoffs are sub-0.1
DBSF fusion	Distribution-normalized	High-confidence cutoffs are ~1.0+ — ~16x bigger number for the same idea
LLM-as-reranker	Whatever the model returned	Post-hoc justification — treat as ordinal, not numeric

And then there's hybrid retrieval, where you're already fusing dense and sparse scores via either Reciprocal Rank Fusion or Distribution-Based Score Fusion — and those two produce wildly different number ranges. We use both modes for different query shapes in HiveIn's retrieval layer, and the "high confidence" threshold we use for one is more than an order of magnitude different from the threshold for the other. Same retrieval pipeline. Same documents. Same idea of "the model is confident." Two totally different absolute numbers.
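The RRF side of that gap is easy to see in a few lines. This is a generic sketch of Reciprocal Rank Fusion, not HiveIn's implementation; k=60 is the commonly used constant and is an assumption here.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the first document in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

dense = ["a", "b", "c"]    # dense retriever's order (toy IDs)
sparse = ["a", "c", "d"]   # sparse retriever's order (toy IDs)
fused = rrf_fuse([dense, sparse])
print(fused[0])
```

With two lists and k=60, the best possible score is 2/61, roughly 0.033, for a document ranked first by both retrievers. That is why a "very confident" RRF cutoff is a tiny number, and why a threshold carried over from any other scoring space will never fire.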

The trap I keep seeing teams fall into is this: they swap a reranker, port over their old if score > 0.7 threshold, and silently lose half their gates because 0.7 meant something completely different in the old scoring space. Or worse, they layer reranking onto an existing retrieval pipeline and start comparing the post-rerank score against thresholds that were calibrated for the raw retrieval score.

The score's distribution matters more than the absolute number. Distributions are per-(model, query-class). You cannot compare across rerankers, and you cannot compare across fusion modes. Anything you build on top of the score has to be calibrated against the specific pipeline producing it.
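One defensive pattern, sketched under assumptions: key every threshold by the pipeline that produced the score, and refuse to answer when no calibrated value exists. The pipeline names, query classes, and cutoffs below are hypothetical placeholders.

```python
from typing import Dict, Tuple

# Hypothetical cutoffs; real values must come from each pipeline's own score distribution.
THRESHOLDS: Dict[Tuple[str, str], float] = {
    ("rrf", "keyword"): 0.03,
    ("dbsf", "keyword"): 1.0,
}

def passes_gate(score: float, pipeline: str, query_class: str) -> bool:
    """Compare a score only against the cutoff calibrated for its own pipeline."""
    key = (pipeline, query_class)
    if key not in THRESHOLDS:
        # Fail loudly rather than reuse a cutoff from a different scoring space.
        raise KeyError(f"no calibrated threshold for {key}")
    return score >= THRESHOLDS[key]
```

The point of the KeyError is that swapping a reranker or fusion mode breaks at deploy time instead of silently disabling the gate.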

The agent-specific dimension nobody benchmarks

For chatbots, reranking is a quality-vs-latency tradeoff and a sane default mostly works. For agents, there's a third axis the benchmarks don't measure: how silent is the failure.

A chatbot user who gets a bad answer re-prompts. The damage is a moment of annoyance.

An agent that gets bad retrieval makes a confident tool call against the wrong target. It fires the email to the wrong customer. It hits the API with the wrong record ID. It executes the workflow it thinks the retrieved doc was describing, and the retrieved doc was describing something else. The retrieval failure becomes a tool-execution incident, and by the time anyone notices, the action has already happened.

The pattern that keeps showing up in the agent post-mortems I read, and in the traces we work through ourselves, is roughly this: when the top-1 reranker score sits below the corpus's historical 25th percentile for that query class, the probability that the next tool call is wrong rises sharply — often roughly double the baseline rate. The reranker already knew. The system just didn't let that knowledge inform the next decision.
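A percentile check like that can be sketched with the standard library. The history list is assumed to hold past top-1 scores for the same pipeline and query class; low_confidence is a hypothetical helper, and the numbers are toy values.

```python
import statistics

def low_confidence(top1_score: float, history: list, percentile: int = 25) -> bool:
    """True when top1_score falls below the given percentile of historical top-1 scores.

    Assumes history has at least two values and 1 <= percentile <= 99.
    """
    # With n=100, quantiles returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(history, n=100, method="inclusive")
    return top1_score < cut_points[percentile - 1]

# Toy history of top-1 scores for one (pipeline, query class) pair.
history = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(low_confidence(0.3, history))
print(low_confidence(0.6, history))
```

Because the cutoff is derived from the pipeline's own history rather than written as a constant, it moves with the score distribution instead of going stale when the reranker changes.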

What we learned building HiveIn's retrieval layer

The reason I'm convinced reranking is a policy problem and not a ranking problem is that we tried to make it a ranking problem first, and a single reranker stopped working almost immediately.

The first lesson was that no single reranker fit every retrieval call we make. HiveIn's planner queries memory for different shapes of context — tool definitions, prior workflow decisions, policy guidelines, memory snapshots. A reranker tuned for "find the right tool for this intent" was wrong for "find the most recent decision about this topic" was wrong for "find every chunk of this guideline that bears on this query." We tried picking one. Then we tried picking the best for the dominant case. Both ended up being bad in the cases they weren't tuned for.

What we landed on is a multi-signal rerank that blends retrieval confidence with term coverage, multi-chunk presence within a source artifact, query-decomposition breadth, and recency — with weights that shift based on the query shape itself. A short keyword query and a decomposed multi-sentence query don't get the same blend, because what "good" means is different for each.
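As an illustration only, a blend like this can be expressed as a weight table keyed by query shape. The shape names, signal names, and weights below are invented for the sketch and are not HiveIn's actual values; each signal is assumed to be pre-normalized to [0, 1].

```python
# Hypothetical weights per query shape; each row sums to 1.0.
WEIGHTS = {
    "keyword": {
        "retrieval": 0.5, "term_coverage": 0.3, "multi_chunk": 0.1,
        "breadth": 0.0, "recency": 0.1,
    },
    "decomposed": {
        "retrieval": 0.4, "term_coverage": 0.1, "multi_chunk": 0.2,
        "breadth": 0.2, "recency": 0.1,
    },
}

def blend(signals: dict, query_shape: str) -> float:
    """Weighted sum of signals; a missing signal counts as 0.0."""
    weights = WEIGHTS[query_shape]
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

chunk = {"retrieval": 0.8, "term_coverage": 1.0, "multi_chunk": 0.0,
         "breadth": 0.5, "recency": 0.2}
print(blend(chunk, "keyword"), blend(chunk, "decomposed"))
```

The same chunk gets a different blended score under each shape, which is the whole idea: exact term coverage dominates for a short keyword query and matters much less once the query has been decomposed.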

The second lesson — and the one I'd put first in retrospect — is that the rerank gate cannot be a single number. The thresholds we use to decide "the retrieval layer is confident enough to skip reranking" are wildly different absolute values depending on which fusion strategy is running underneath, and we had to calibrate them per fusion mode. If we'd hard-coded one threshold, every config switch would have silently broken the gate. The same hard-coded magic number reads as "very confident" in one mode and "barely above noise" in the other.

The third lesson is the one that ties this back to agents specifically: reranking can hurt when retrieval is already confident. We added a confidence-aware taper that backs off the reranker's influence the more certain the underlying retrieval was — at full confidence, the rerank weights drop to zero and the raw retrieval score wins. Without this, the recency and coherence signals would occasionally demote a chunk that the underlying hybrid retrieval was already very sure about, in favor of a fresher-but-slightly-off-topic chunk. That kind of silent demotion is exactly the failure mode where the agent confidently acts on the wrong context — the right doc was retrieved, the right doc was retrieved first, and reranking pushed it to position three.

The taper looks roughly like this:

Raw retrieval score	Rerank influence	What happens to the ordering
Below threshold	1.0 (full)	Multi-signal blend decides everything
At threshold	1.0 (full)	Still fully reranked
Above threshold	Linearly tapering toward 0	Reranker influence fades; retrieval starts to dominate
At maximum	0.0	Pure retrieval — reranker doesn't touch ordering

The shape isn't novel — it's the same idea as "trust the strong signal when you have one" — but wiring it into the rerank pipeline turned out to matter more than any of the other reranker tuning we did.
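A minimal sketch of a taper with that shape. The function names are hypothetical, and final_score shows just one way to apply the influence weight, assuming the raw and rerank scores have already been put on the same scale.

```python
def rerank_influence(raw_score: float, threshold: float, max_score: float) -> float:
    """Weight in [0, 1] given to the reranker for a given raw retrieval score."""
    if raw_score <= threshold:
        return 1.0   # at or below threshold: fully reranked
    if raw_score >= max_score:
        return 0.0   # at maximum: pure retrieval
    # Linear taper between threshold and max_score.
    return (max_score - raw_score) / (max_score - threshold)

def final_score(raw_score: float, rerank_score: float,
                threshold: float, max_score: float) -> float:
    """Blend rerank and raw scores according to the taper."""
    w = rerank_influence(raw_score, threshold, max_score)
    return w * rerank_score + (1.0 - w) * raw_score
```

With a threshold of 0.6 and a maximum of 1.0, a raw score of 0.8 gives the reranker half its usual influence, and a raw score of 1.0 leaves the retrieval score untouched no matter what the reranker says.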

None of these are clever ideas. They're things that broke in production until we changed the shape of the problem. The shape we ended up with is: retrieval and reranking are a pipeline of confidence signals, not a single ranking step, and the downstream system needs to read the whole pipeline's output to decide whether to act.

What scales: reranking as a policy input

The teams shipping reliable agents aren't picking one reranker and tuning it forever. They're treating reranking as a layered policy:

1. Cheap recall pass. Bi-encoder top-50. Fast, cacheable, intentionally over-recalls.
2. Quality reranker on the top-50. Cross-encoder or ColBERT, whichever fits your corpus shape and storage budget.
3. Multi-signal blend, not single-score. Whatever reranker you put on top, treat its output as one signal among several: term coverage, breadth, recency, and artifact coherence are all cheap to compute alongside.
4. LLM rerank for ambiguous cases only. When the top-1 score from step 2 is borderline, escalate the top-5 to an LLM ranker before the agent gets to act.
5. Trace the score distribution as a first-class signal. Not just "did we retrieve": log the full score distribution per query, surface drift in the dashboard the same way you'd surface latency drift, and wire the score into the gate that decides whether the next tool call gets to execute.

End-to-end, that looks like:

User query arrives
Bi-encoder top-50 — ~30ms, intentionally over-recalls
Quality reranker on the top-50 — cross-encoder or ColBERT, whichever fits the corpus
Multi-signal blend — retrieval + term coverage + coherence + breadth + recency, with weights that shift by query shape
If top-1 score is borderline → escalate the top-5 to an LLM rerank
Trace the score distribution — log it per query, surface drift in the dashboard
Tool-execution gate consumes the score:
- Above threshold → ✅ agent acts
- Below threshold → ⚠️ surface low-confidence, ask user, or abort

The last step is where reranking stops being a retrieval problem and starts being a policy problem. The reranker score becomes input to the tool-execution gate, alongside the policy classes the agent is allowed to invoke. That's the layer where you actually stop bad actions from happening — not by making retrieval perfect, but by making the system honest about when retrieval isn't confident enough to act on.
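A gate of that kind might look roughly like the sketch below. The Decision names, the two-threshold split, and the policy-class check are assumptions made for illustration; the thresholds would have to come from the calibrated distribution of whichever pipeline produced the score.

```python
from enum import Enum

class Decision(Enum):
    ACT = "act"
    ASK_USER = "ask_user"
    ABORT = "abort"

def gate(top1_score: float, act_threshold: float, abort_threshold: float,
         tool_policy_class: str, allowed_classes: set) -> Decision:
    """Decide whether a tool call may run, given retrieval confidence and policy."""
    if tool_policy_class not in allowed_classes:
        return Decision.ABORT        # policy forbids this class of tool outright
    if top1_score >= act_threshold:
        return Decision.ACT          # confident enough to execute
    if top1_score >= abort_threshold:
        return Decision.ASK_USER     # surface low confidence instead of acting
    return Decision.ABORT
```

The middle band is the useful part: it gives the agent a way to say "I'm not sure this is the right document" before the action happens rather than after.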

The framing that keeps proving itself: an agent should be allowed to act in proportion to its confidence in what it's acting on. Reranking is one of the cleanest measurements of that confidence you'll ever get. Most stacks throw it away as soon as the top-5 gets passed to the model.

I'm building hivein.ai in this space — runtime tool-execution policy and observability for production agents, including retrieval-confidence as a first-class signal in the policy layer. We're in invite-only beta and looking for design partners actively shipping agents to prod.

If your stack has hit the shape of this problem, with silent retrieval failures becoming tool-execution incidents, I'd genuinely like to compare notes. Drop a comment, or use the landing page agent, which is the fastest way to describe your setup and see whether the patterns line up.
