I shipped my fifth RAG pipeline to production in February. Recall@10 was 0.94. The team ran a demo, the executive nodded, we declared victory. Two weeks later customer complaints started landing. The model was citing stale 2023 policy docs, ignoring the 2026 rewrite that ranked 4th. Somewhere between rank 4 and rank 1, the answer everyone needed was getting buried.
That is the thing nobody warns you about with RAG. Your retriever can be statistically excellent at top-10 and still hand the LLM the wrong top-3. The model only reads what is in the prompt. If the right chunk is at position 7, it might as well be at position 700.
The fix is a reranker layer: a second, smaller model whose only job is to re-score the top-K candidates with a query-aware comparison the first-stage retriever could not afford. Done right, it is the cheapest precision win in the entire RAG stack: a 40-60% improvement on precision@3 for under 200ms of added latency.
Done wrong, it is a single point of failure that 504s your endpoint when Hugging Face has a bad day, or runs up a Cohere bill nobody approved.
Here is the production reranker layer I run today: two models (a local cross-encoder plus Cohere's managed API), reciprocal rank fusion to combine signals, latency and cost budgets, graceful degradation when something is down, and an evaluation harness so you can actually measure whether reranking helps on your data.
Every code block runs.
The shape of the production reranker layer
The naive blog-post version is a three-box diagram: "first-stage retriever -> reranker -> LLM." That is enough for a demo. In production, every one of those boxes has at least two failure modes that quietly destroy answer quality.
The minimum viable production reranker has six pieces:
- A first-stage retriever that returns 50-100 candidates, not 10. Recall is cheap here, precision is not.
- A primary reranker — local cross-encoder for cost, latency, and offline survivability.
- A fallback reranker — managed API (Cohere) for when the local model is degraded or absent.
- A score fusion strategy — reciprocal rank fusion when you have multiple candidate sources or multiple rerankers.
- A latency/cost budget that bounds the second stage and degrades gracefully.
- An evaluation harness with golden queries and answer-relevance labels so you can prove reranker value on your domain.
Here is how each piece looks when it is actually wired up.
Step 1: Generate fat candidate sets
The most common reranker mistake is calling it on only the top 10 candidates. That gives the second stage almost no signal to work with. Top-K for the first stage should be 50-100 documents; the cross-encoder will cut that back down.
# retrieval.py
from dataclasses import dataclass
from typing import List
@dataclass
class Candidate:
doc_id: str
text: str
source: str
metadata: dict
first_stage_score: float
first_stage_rank: int
def retrieve_candidates(query: str, top_k: int = 80) -> List[Candidate]:
"""First-stage retrieval. Replace internals with your vector store + BM25."""
vector_hits = _vector_search(query, top_k=top_k)
bm25_hits = _bm25_search(query, top_k=top_k)
seen, merged = set(), []
for rank, hit in enumerate(vector_hits + bm25_hits):
if hit["doc_id"] in seen:
continue
seen.add(hit["doc_id"])
merged.append(Candidate(
doc_id=hit["doc_id"],
text=hit["text"],
source=hit["source"],
metadata=hit.get("metadata", {}),
first_stage_score=hit["score"],
first_stage_rank=rank,
))
return merged[:top_k]
def _vector_search(query: str, top_k: int) -> list:
# placeholder — wire to ChromaDB / Qdrant / pgvector
return []
def _bm25_search(query: str, top_k: int) -> list:
# placeholder — wire to OpenSearch / rank_bm25
return []
The point of the dataclass is that downstream code never has to peek into the retriever. Every reranker, every fusion function, every monitor reads the same shape.
Step 2: Local cross-encoder reranker (BGE)
BAAI's BGE-reranker-v2-m3 is the best free cross-encoder I have used in 2026. Roughly 568M parameters, multilingual, runs on CPU at ~80ms per query for 50 candidates if you batch right.
# rerankers/bge.py
from typing import List, Tuple
import torch
from sentence_transformers import CrossEncoder
_BGE_MODEL = None
_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def _load():
global _BGE_MODEL
if _BGE_MODEL is None:
_BGE_MODEL = CrossEncoder(
"BAAI/bge-reranker-v2-m3",
device=_DEVICE,
max_length=512,
)
return _BGE_MODEL
def bge_rerank(query: str, candidates: list, top_n: int = 10,
batch_size: int = 32) -> List[Tuple[int, float]]:
"""Returns (candidate_index, score) sorted by score desc, top_n results."""
if not candidates:
return []
model = _load()
pairs = [(query, c.text) for c in candidates]
scores = model.predict(
pairs,
batch_size=batch_size,
show_progress_bar=False,
convert_to_numpy=True,
)
indexed = sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True)
return indexed[:top_n]
Three things people get wrong here:
- Module-level model — load once per process. Reloading on every request adds 3-8 seconds of cold start.
- Batch size 32 — the sweet spot on CPU. Going higher does not help, going lower wastes throughput.
- max_length=512 — chunks longer than 512 tokens get silently truncated. If your chunks are 1024+ tokens, either re-chunk for reranking or use a long-context reranker like Jina ColBERT-v2.
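If re-chunking at ingest time is not an option, you can window the text at rerank time instead. Here is a minimal sketch of that approach; the function name, window sizes, and the max-pooling choice are mine, not part of the pipeline above, and the character counts are only a rough proxy for the 512-token limit.
# rerankers/windows.py
from typing import List, Tuple

from rerankers.bge import _load


def bge_rerank_windowed(query: str, candidates: list, top_n: int = 10,
                        window_chars: int = 1500, stride: int = 750
                        ) -> List[Tuple[int, float]]:
    """Score long candidates by their best-scoring window (hypothetical helper)."""
    if not candidates:
        return []
    model = _load()
    pairs, owners = [], []
    for i, c in enumerate(candidates):
        text = c.text
        # overlapping character windows as a rough stand-in for 512 tokens
        for s in range(0, max(len(text) - stride, 1), stride):
            pairs.append((query, text[s:s + window_chars]))
            owners.append(i)
    scores = model.predict(pairs, batch_size=32,
                           show_progress_bar=False, convert_to_numpy=True)
    # max-pool window scores back onto their parent candidate
    best = {}
    for owner, score in zip(owners, scores.tolist()):
        best[owner] = max(score, best.get(owner, float("-inf")))
    ranked = sorted(best.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
Max over windows keeps a long document competitive if any part of it answers the query; mean-pooling would punish it for its irrelevant sections.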
Step 3: Managed reranker fallback (Cohere)
When the local model is absent (small footprint deployment), unavailable (GPU host down), or just too slow under load, you want a managed API to take over. Cohere Rerank is the lowest-friction option in 2026 — single call, no infra, around $1 per 1k searches.
# rerankers/cohere.py
import os
from typing import List, Tuple
import cohere
_CLIENT = None
def _client():
global _CLIENT
if _CLIENT is None:
_CLIENT = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
return _CLIENT
def cohere_rerank(query: str, candidates: list, top_n: int = 10,
model: str = "rerank-v3.5", timeout_s: float = 1.5
) -> List[Tuple[int, float]]:
if not candidates:
return []
docs = [c.text for c in candidates]
resp = _client().rerank(
model=model,
query=query,
documents=docs,
top_n=top_n,
        # per-request timeout via the v2 SDK's request options
        request_options={"timeout_in_seconds": timeout_s},
)
return [(r.index, float(r.relevance_score)) for r in resp.results]
Note the explicit timeout. If you do not set one, the SDK default is 60s — that is long enough for an upstream incident to take your endpoint down with it. 1.5s is enough for Cohere's p99 plus network and gives the orchestrator room to fall back.
Step 4: Score fusion with reciprocal rank fusion
RRF is the right default when you are combining results from different scorers — say, your local cross-encoder and Cohere — or different first-stage retrievers (vector + BM25 + a domain-specific keyword search).
The math is embarrassingly simple. For each ranked list, every document gets a score 1 / (k + rank) where k is a smoothing constant (60 is the published default). Sum those scores across all lists. Sort.
# fusion.py
from collections import defaultdict
from typing import List, Tuple
def reciprocal_rank_fusion(
ranked_lists: List[List[Tuple[int, float]]],
k: int = 60,
weights: List[float] = None,
) -> List[Tuple[int, float]]:
"""
Each ranked_list is [(candidate_index, score), ...] sorted score desc.
Returns the fused ranking [(candidate_index, fused_score), ...].
"""
if weights is None:
weights = [1.0] * len(ranked_lists)
if len(weights) != len(ranked_lists):
raise ValueError("weights length must match ranked_lists length")
fused = defaultdict(float)
for rlist, w in zip(ranked_lists, weights):
for rank, (idx, _score) in enumerate(rlist, start=1):
fused[idx] += w * (1.0 / (k + rank))
return sorted(fused.items(), key=lambda x: x[1], reverse=True)
Three things to know about RRF in production:
- Ignores raw scores. That is the point. Cohere returns 0-1 calibrated, BGE returns logits, BM25 returns BM25 scores. They are not comparable. Rank is comparable.
- k=60 is rarely worth tuning. I have run sweeps from k=10 to k=200 across four production deployments. The win over default is under 1% NDCG@10 in every case.
- Weights matter when one source is materially stronger. If your golden-set evaluation shows the cross-encoder beats Cohere on your data by 8% NDCG, weight it 1.5 vs 1.0. Do not do this without the eval — the intuition is wrong roughly half the time.
When you have only one reranker, you do not need fusion. When you have two — or two rerankers plus the original first-stage — RRF is the lowest-risk way to combine them.
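As a concrete, made-up example of the weighted case: two rerankers over the same candidate list, with the cross-encoder weighted 1.5 because it won on the golden set. The scores here are invented for illustration.
# fusion_example.py
from fusion import reciprocal_rank_fusion

# hypothetical (candidate_index, score) output from two rerankers
bge_ranked = [(4, 7.2), (0, 5.1), (9, 3.8), (2, 1.4)]        # raw logits
cohere_ranked = [(0, 0.93), (4, 0.88), (7, 0.41), (2, 0.30)]  # calibrated 0-1

fused = reciprocal_rank_fusion(
    [bge_ranked, cohere_ranked],
    weights=[1.5, 1.0],   # cross-encoder gets the extra weight
)
print(fused[:3])  # candidate indices 4, 0, 2 with their fused RRF scores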
Step 5: The production wrapper
This is the piece that ties everything together. It enforces a latency budget, picks the active strategy, falls back when the primary is down, tracks cost, and logs the metadata an oncall engineer will need at 3am.
# rerank_service.py
import logging
import time
from dataclasses import dataclass, field
from typing import List, Optional
from rerankers.bge import bge_rerank
from rerankers.cohere import cohere_rerank
from fusion import reciprocal_rank_fusion
log = logging.getLogger("rerank")
_COHERE_COST_PER_SEARCH = 0.001 # $1 / 1k searches as of 2026-05
@dataclass
class RerankResult:
candidates: list
strategy: str
duration_ms: float
primary_failed: bool = False
cost_usd: float = 0.0
debug: dict = field(default_factory=dict)
def rerank(query: str,
candidates: list,
top_n: int = 10,
latency_budget_ms: int = 600,
daily_cohere_budget_usd: float = 5.0,
cohere_spent_today_usd: float = 0.0,
strategy: str = "fusion") -> RerankResult:
"""
strategy:
- "local" -> BGE only
- "cohere" -> Cohere only
- "fusion" -> BGE + Cohere via RRF (with fallback)
"""
start = time.monotonic()
deadline = start + latency_budget_ms / 1000.0
primary_failed = False
cost_usd = 0.0
bge_ranked, cohere_ranked = None, None
if strategy in ("local", "fusion"):
try:
bge_ranked = bge_rerank(query, candidates, top_n=top_n)
except Exception as e:
log.warning("bge rerank failed: %s", e)
primary_failed = True
cohere_allowed = (cohere_spent_today_usd + _COHERE_COST_PER_SEARCH
<= daily_cohere_budget_usd)
time_left_ms = (deadline - time.monotonic()) * 1000
if strategy in ("cohere", "fusion") and cohere_allowed and time_left_ms > 200:
try:
cohere_ranked = cohere_rerank(query, candidates, top_n=top_n,
timeout_s=min(1.5, time_left_ms / 1000))
cost_usd += _COHERE_COST_PER_SEARCH
except Exception as e:
log.warning("cohere rerank failed: %s", e)
if strategy == "cohere":
primary_failed = True
sources = [r for r in (bge_ranked, cohere_ranked) if r]
if not sources:
log.error("all rerankers failed, returning first-stage order")
ranked = [(i, 1.0 / (i + 1)) for i in range(len(candidates))][:top_n]
used = "first_stage_fallback"
elif len(sources) == 1:
ranked = sources[0]
used = "bge" if sources[0] is bge_ranked else "cohere"
else:
ranked = reciprocal_rank_fusion(sources)[:top_n]
used = "rrf_bge_cohere"
out = [candidates[idx] for idx, _ in ranked]
duration_ms = (time.monotonic() - start) * 1000
return RerankResult(
candidates=out,
strategy=used,
duration_ms=duration_ms,
primary_failed=primary_failed,
cost_usd=cost_usd,
debug={"input_count": len(candidates),
"output_count": len(out),
"deadline_exceeded": duration_ms > latency_budget_ms},
)
Five details that matter:
- The latency budget is a deadline, not a target. Past it, the function returns whatever it has. An LLM call after the reranker is far more expensive than a slightly worse top-3.
- The cost budget is checked before the call. A reranker that quietly burns past your daily Cohere budget is worse than no reranker.
- Failure is observable. primary_failed=True should fire an alert, not just a log line. You want to know within minutes when the local model goes degraded (see the call-site sketch after this list).
- First-stage fallback exists. If both rerankers fail, return the first-stage order. The pipeline must not 500 because reranking is broken.
- RerankResult is a dataclass, not a dict. Saves you from typos in metric names six months later.
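For completeness, here is a sketch of what the call site looks like. The helper names (send_alert, record_metrics, cohere_spend_today) are placeholders for whatever alerting and metering you already run, not real APIs.
# query_path.py
from retrieval import retrieve_candidates
from rerank_service import rerank


def cohere_spend_today() -> float:
    # placeholder -- wire to whatever meters your daily API spend
    return 0.0


def send_alert(message: str, extra: dict) -> None:
    # placeholder -- wire to your paging / alerting system
    pass


def record_metrics(**fields) -> None:
    # placeholder -- wire to your metrics backend
    pass


def answer_context(query: str) -> list:
    candidates = retrieve_candidates(query, top_k=80)
    result = rerank(
        query,
        candidates,
        top_n=10,
        latency_budget_ms=600,
        cohere_spent_today_usd=cohere_spend_today(),
    )
    if result.primary_failed:
        send_alert("reranker primary failed", extra=result.debug)
    record_metrics(strategy=result.strategy,
                   duration_ms=result.duration_ms,
                   cost_usd=result.cost_usd)
    # these top-10 chunks go into the LLM prompt
    return result.candidates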
Step 6: An evaluation harness that proves reranking helps
Most production reranker deployments I audit have no evaluation. The reranker was added because a tutorial said to. Without an eval set, you cannot tell whether it is helping, hurting, or breaking even on your queries. On roughly 30% of domains I have measured, BGE actually loses to a well-tuned BM25+vector hybrid.
You need a golden set: 50-200 (query, relevant_doc_ids) pairs labeled by a human. Then a metric that captures top-K precision. NDCG@10 is the standard.
# eval_reranker.py
import math
from typing import Callable, List, Set
def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
dcg = 0.0
for i, doc_id in enumerate(retrieved_ids[:k]):
if doc_id in relevant_ids:
dcg += 1.0 / math.log2(i + 2)
ideal_hits = min(len(relevant_ids), k)
idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
return dcg / idcg if idcg > 0 else 0.0
def evaluate(golden: List[dict],
retrieve_fn: Callable[[str], list],
rerank_fn: Callable[[str, list], list],
k: int = 10) -> dict:
baseline_scores, reranked_scores = [], []
for row in golden:
query, relevant = row["query"], set(row["relevant_doc_ids"])
candidates = retrieve_fn(query)
baseline_ids = [c.doc_id for c in candidates[:k]]
reranked = rerank_fn(query, candidates)[:k]
reranked_ids = [c.doc_id for c in reranked]
baseline_scores.append(ndcg_at_k(baseline_ids, relevant, k))
reranked_scores.append(ndcg_at_k(reranked_ids, relevant, k))
n = len(golden)
baseline_avg = sum(baseline_scores) / n
reranked_avg = sum(reranked_scores) / n
return {
"n": n,
"baseline_ndcg": baseline_avg,
"reranked_ndcg": reranked_avg,
"lift": reranked_avg - baseline_avg,
"lift_pct": (reranked_avg - baseline_avg) / max(baseline_avg, 1e-9) * 100,
}
Wire it up:
# eval_run.py
import json
from retrieval import retrieve_candidates
from rerank_service import rerank
from eval_reranker import evaluate
with open("golden_set.json") as f:
golden = json.load(f)
def reranker(query, candidates):
return rerank(query, candidates, top_n=10).candidates
report = evaluate(golden, retrieve_candidates, reranker, k=10)
print(json.dumps(report, indent=2))
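The golden_set.json the script reads is nothing exotic: a list of query/relevant-ID rows. A hypothetical seed script follows; the queries and doc IDs are invented, and yours should come from human labeling.
# golden_seed.py
import json

# invented rows -- replace with queries and doc IDs labeled by a human
golden = [
    {"query": "what changed in the 2026 remote work policy",
     "relevant_doc_ids": ["policy-2026-remote-v2"]},
    {"query": "how do I expense a conference ticket",
     "relevant_doc_ids": ["finance-guide-07", "finance-faq-12"]},
]

with open("golden_set.json", "w") as f:
    json.dump(golden, f, indent=2)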
You should see a lift in the 10-30% range on most domains. If you see a regression, the reranker is the wrong fit or the chunks are too long for the model. Either way, the eval told you before the customer did.
Wire-up checklist
Before any of this code touches a real customer:
- First-stage retrieval returns 50-100 candidates, not 10.
- Cross-encoder model loaded once at process start, not per request.
- Cohere call has an explicit timeout under 2s.
- Latency budget is a deadline, with first-stage fallback if it is exceeded.
- Daily cost budget is checked before each Cohere call.
- Cross-encoder failure fires an alert, not just a log line.
- Evaluation harness has at least 50 labeled queries before you ship.
- NDCG@10 lift is measured monthly, not just at launch — embedding drift is real.
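The monthly check is easiest to keep honest as a test. A minimal sketch, assuming the evaluate() harness above; the thresholds are placeholders you should set from your own eval history.
# test_rerank_regression.py
import json

from retrieval import retrieve_candidates
from rerank_service import rerank
from eval_reranker import evaluate

MIN_RERANKED_NDCG = 0.80   # placeholder floor -- set from your own history
MIN_LIFT = 0.02            # reranking must still beat first-stage order


def test_reranker_still_helps():
    with open("golden_set.json") as f:
        golden = json.load(f)
    report = evaluate(
        golden,
        retrieve_candidates,
        lambda q, c: rerank(q, c, top_n=10).candidates,
        k=10,
    )
    assert report["reranked_ndcg"] >= MIN_RERANKED_NDCG
    assert report["lift"] >= MIN_LIFT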
What this fixes
The incident that opened this post (the model answering from stale 2023 policy because the right doc sat at rank 4) was fixed by adding exactly this layer. NDCG@10 on the eval set went from 0.71 (vector + BM25 hybrid) to 0.88 (hybrid + BGE + Cohere fused via RRF). p95 query latency went up by 240ms. Customer escalations dropped to zero in the next sprint.
Reranking is one of the few RAG improvements that is cheap to add, easy to measure, and almost always positive on domain-specific data. The thing that breaks people is treating it like a single Cohere call instead of a layer with fallback, budgets, and evidence.
The pieces above are what survives an actual production incident. They are also the pieces nobody puts in their hello-world tutorial. Build the layer once, evaluate it monthly, and you can stop wondering whether your top-3 chunks are the right ones.
If you are interested in the whole RAG stack, my earlier piece on building a production-ready RAG pipeline with Python and ChromaDB covers chunking, ingestion idempotency, and hybrid retrieval — the pieces that produce the candidate set this reranker layer consumes. The LLM evaluation harness piece shows how to put NDCG@10 regressions behind a CI gate so a quiet retrieval drift cannot ship.