Gabriel Anhaia

Reranker Selection: Cross-Encoder vs LLM-as-Reranker vs ColBERT: Which Earns Its Latency


Three rerankers. Three latency budgets. Three accuracy ceilings. Pick the wrong one and your RAG pipeline pays the latency tax without the recall payoff.

The team you join always has a reranker. They almost never have a benchmark that says why that reranker. Somebody read a blog post in 2024, dropped in bge-reranker-base, and the line in the config file never moved again. Meanwhile the corpus drifted, queries got longer, p95 latency crept up, and nobody opened the lid.

This post opens the lid. Real code for the three shapes: cross-encoder, LLM-as-reranker, ColBERT-style late interaction. A reproducible bench. And the one gotcha that breaks every multi-reranker setup if you skip it.

The three reranker shapes in 60 seconds

A bi-encoder (your retriever) embeds query and document independently. Cheap, parallelisable, indexable. Good recall, mediocre precision at the top of the list.

A reranker reads query and document together and scores the pair. That joint reading is what lifts precision. The three shapes differ in how they do the joint reading.

  • Cross-encoder. A single transformer eats [query, SEP, doc] and emits one scalar. Fast on GPU, sub-100ms for top-K=50. The workhorse.
  • LLM-as-reranker. You ask Claude or GPT to rank or score documents. Slowest, most flexible, charges per token.
  • ColBERT / late interaction. Query and doc are encoded into per-token vectors. A MaxSim operator scores token-level matches. Middle ground in latency, particularly strong on long documents.

Three architectures, three cost shapes. The benchmark below tells you which one earns its latency on your corpus.

Cross-encoder: the default that's right 70% of the time

BAAI's bge-reranker-v2-m3 is what most production setups should start with. It's multilingual, sub-100ms on a single A10G for top-50, and Apache-2.0 licensed. Cohere's hosted rerank-v3.5 is the equivalent if you'd rather not run GPUs.

Here's a runnable scoring loop using FlagEmbedding, the official wrapper from BAAI:

# pip install -U FlagEmbedding torch
from FlagEmbedding import FlagReranker
from typing import Iterable

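# use_fp16 cuts memory ~40% on Ampere+ with negligible recall loss.
reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,
)

def rerank(
    query: str,
    candidates: Iterable[str],
    top_k: int = 10,
) -> list[tuple[int, float]]:
    pairs = [(query, doc) for doc in candidates]
    if not pairs:
        return []
    # normalize=True maps logits to [0, 1] via sigmoid; passing it to
    # compute_score works whether or not the constructor accepts it.
    # compute_score returns one scalar per pair; some FlagEmbedding
    # versions return a bare float for a single pair, so wrap it.
    scores = reranker.compute_score(pairs, normalize=True)
    if isinstance(scores, float):
        scores = [scores]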
    ranked = sorted(
        enumerate(scores),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return ranked[:top_k]

# top_50 came out of your bi-encoder / BM25 hybrid
top_50 = retriever.search(query, k=50)
top_10 = rerank(query, [c.text for c in top_50], top_k=10)

The shape of the numbers on a single A10G, batch of 50, doc length capped at 512 tokens: roughly 60-90ms wall clock per query in fp16. CPU-only on a modern Xeon: 700-1500ms. You don't want that in a chat path.

Two knobs people forget:

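  1. Document truncation. FlagReranker truncates each query-document pair at max_length, which defaults to 512 tokens, so long PDFs get silently cut off. Either pre-chunk (the recommended path), or raise max_length and re-measure latency, or evaluate a larger variant such as bge-reranker-v2-minicpm-layerwise.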
  2. Batch size on the GPU. The wrapper batches internally, but if you call it once per candidate from an async handler you'll get serial latency. Always pass the full pair list in one call.
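Both knobs can be handled in one place. The sketch below is a minimal illustration rather than anything FlagEmbedding ships: `score_pairs` is a hypothetical callable standing in for a batched scorer such as `reranker.compute_score`, whitespace words are used as a rough proxy for tokens, and the 300/50 window sizes are assumptions to tune. It splits long documents into overlapping windows, scores every window in a single call, and keeps the best window score per document.

```python
from typing import Callable, Sequence

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (words approximate tokens)."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

def rerank_long_docs(
    query: str,
    docs: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], list[float]],
    top_k: int = 10,
) -> list[tuple[int, float]]:
    """Score every chunk in one batched call, then max-pool per document."""
    pairs, owner = [], []
    for doc_idx, doc in enumerate(docs):
        for chunk in chunk_words(doc):
            pairs.append((query, chunk))
            owner.append(doc_idx)
    scores = score_pairs(pairs)  # one call, so the scorer sees the full batch
    best: dict[int, float] = {}
    for doc_idx, s in zip(owner, scores):
        best[doc_idx] = max(best.get(doc_idx, float("-inf")), s)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Max-pooling is one choice among several; it suits corpora where a single passage answers the query. Remember that the pair count grows with document length, so the latency figures above no longer apply once documents are chunked.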

LLM-as-reranker: slow, smart, charges per question

You can ask an LLM to rank or score. Two patterns work in production: pointwise scoring (one call per doc, easy to parallelise, more tokens) and listwise ranking (one call for the whole shortlist, fewer tokens, model has to fit the candidates).
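For contrast, the pointwise pattern can be sketched in a few lines. This is an illustration under assumptions: `score_one` is a hypothetical callable that wraps one LLM call and returns a relevance score for a single (query, doc) pair, and the thread pool size is arbitrary. The calls are independent, which is why this shape parallelises easily.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def pointwise_rerank(
    query: str,
    candidates: list[str],
    score_one: Callable[[str, str], float],
    top_k: int = 10,
    max_workers: int = 8,
) -> list[tuple[int, float]]:
    """Score each (query, doc) pair independently and sort by score."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so scores[i] belongs to candidates[i]
        scores = list(pool.map(lambda doc: score_one(query, doc), candidates))
    ranked = sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

The trade-off is token count: the instructions and the query are resent once per candidate, so the input bill is higher than a single listwise call over the same shortlist.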

Listwise with Claude looks like this:

# pip install anthropic
import json
from anthropic import Anthropic

client = Anthropic()

LISTWISE_PROMPT = """You rerank search results for relevance to the query.

Query: {query}

Candidates (id and text):
{candidates}

Return a JSON array of objects {{"id": <int>, "score": <0-1>}}
sorted by score desc. Score 1.0 = directly answers the query,
0.0 = unrelated. Return ONLY the JSON, no prose."""

def llm_rerank(query: str, candidates: list[str], top_k: int = 10):
    body = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": LISTWISE_PROMPT.format(
                query=query, candidates=body,
            ),
        }],
    )
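    # the model returns a JSON array; parse defensively
    text = resp.content[0].text
    # keep only the outermost [...] in case the reply is wrapped in prose
    start, end = text.find("["), text.rfind("]")
    scored = json.loads(text[start:end + 1])
    # drop ids the model invented or repeated before indexing back
    seen, valid = set(), []
    for x in scored:
        i = int(x["id"])
        if 0 <= i < len(candidates) and i not in seen:
            seen.add(i)
            valid.append((i, float(x["score"])))
    valid.sort(key=lambda kv: kv[1], reverse=True)
    return [(i, candidates[i]) for i, _ in valid[:top_k]]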

Two things will hurt you here. First, the model occasionally invents an id that wasn't in the input. Validate every id is in range before you index back into candidates. Second, listwise is sensitive to candidate order in the prompt. Shuffle them before sending, otherwise you'll bake retrieval-rank bias into the rerank.
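Both guards fit in a pair of small helpers. The sketch below is one possible shape, not part of the Anthropic SDK: `shuffle_for_prompt` and `map_back` are hypothetical names, and `scored` is assumed to be the parsed JSON array of `{"id", "score"}` objects the prompt asks for.

```python
import random
from typing import Optional

def shuffle_for_prompt(candidates: list[str], seed: Optional[int] = None):
    """Return candidates in random order plus a map back to original indices."""
    order = list(range(len(candidates)))
    random.Random(seed).shuffle(order)
    shuffled = [candidates[i] for i in order]
    return shuffled, order  # order[prompt_id] == original index

def map_back(scored: list[dict], order: list[int]) -> list[tuple[int, float]]:
    """Translate prompt ids to original indices, dropping bad or repeated ids."""
    seen, out = set(), []
    for item in scored:
        try:
            pid, score = int(item["id"]), float(item["score"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed entry: skip it rather than crash
        if 0 <= pid < len(order) and pid not in seen:
            seen.add(pid)
            out.append((order[pid], score))
    out.sort(key=lambda kv: kv[1], reverse=True)
    return out
```

Send `shuffled` to the model, then pass its parsed reply and `order` to `map_back`. A fixed seed makes a bench run repeatable; in production you would normally leave it unset.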

Cost math, May 2026: Haiku 4.5 at roughly $1/$5 per million in/out tokens. A typical rerank of 50 candidates of 250 tokens each plus a small JSON output runs ~$0.015 per query. At 100 queries/second sustained that's 360,000 queries and roughly $5,400 per hour. Not viable for high-volume chat. Viable for ops dashboards, legal search, internal knowledge bases where the queries arrive at human pace.
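The arithmetic is worth keeping as code so you can plug in your own shortlist size and prices. In this sketch the 500 output tokens (about ten per JSON entry) are an assumption, and the prompt's fixed instructions are ignored because they are small next to 12,500 candidate tokens.

```python
def cost_per_query(n_docs, tokens_per_doc, output_tokens,
                   usd_per_m_in, usd_per_m_out):
    """Dollar cost of one listwise rerank call."""
    input_tokens = n_docs * tokens_per_doc
    total = input_tokens * usd_per_m_in + output_tokens * usd_per_m_out
    return total / 1_000_000

# 50 docs x 250 tokens in, ~500 tokens out, at $1 / $5 per million tokens
per_query = cost_per_query(50, 250, 500, 1.0, 5.0)
per_hour_at_100_qps = per_query * 100 * 3600

print(f"per query: ${per_query:.4f}")
print(f"per hour at 100 QPS: ${per_hour_at_100_qps:,.0f}")
```

At one query per second the same call costs about $54 an hour, which is the regime where this reranker starts to make sense.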

The accuracy ceiling, though, is the highest of the three. A capable model can reason about negation, conditionals, and intent in ways a 568M-parameter cross-encoder usually cannot. When the query is "show me API endpoints that don't require auth", an LLM tends to get the don't right. The cross-encoder often doesn't.

ColBERT / late interaction: the long-document specialist

ColBERTv2 (Santhanam et al., 2022; the maintained package is colbert-ai on PyPI) splits the difference. It encodes each query token and each document token to its own vector, then scores a query-doc pair as:

score(Q, D) = sum over each query token q of:
                max over doc tokens d of: cosine(q, d)

That's MaxSim. Each query token finds its best-matching token in the doc, and the doc's score is the sum. The intuition: a long document only needs to match a few key tokens really well for it to be relevant, and ColBERT's per-token scoring catches that pattern where a single-vector dense retriever flattens it.
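To make the operator concrete, here is MaxSim written out in NumPy for one query and one document. It is a sketch of the formula above, not the colbert-ai implementation; the random-looking toy vectors in the usage note are purely illustrative.

```python
import numpy as np

def maxsim(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction score. Q is (q_tokens, dim), D is (d_tokens, dim)."""
    # L2-normalise rows so the dot product equals cosine similarity
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Qn @ Dn.T  # (q_tokens, d_tokens) cosine matrix
    # best-matching doc token per query token, summed over query tokens
    return float(sim.max(axis=1).sum())
```

With a two-token query `[[1, 0], [0, 1]]`, a document containing a token aligned with each query token scores the maximum of 2.0, while a document whose only token is `[1, 1]` scores about 1.41. Since each cosine is at most 1, the score can never exceed the number of query tokens.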

Sketched usage with the maintained library:

# pip install colbert-ai torch
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

cfg = ColBERTConfig(doc_maxlen=300, nbits=2)
ckpt = Checkpoint("colbert-ir/colbertv2.0", colbert_config=cfg)

def colbert_score(query: str, docs: list[str]) -> list[float]:
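    Q = ckpt.queryFromText([query])           # (1, q_tokens, dim)
    D = ckpt.docFromText(docs, bsize=32)[0]   # (n_docs, d_tokens, dim)
    # MaxSim computed explicitly. The embeddings are L2-normalised, so the
    # dot product is cosine; padded doc positions are expected to be zeros.
    sim = D.float() @ Q.float().transpose(1, 2)   # (n_docs, d_tokens, q_tokens)
    return sim.max(dim=1).values.sum(dim=1).tolist()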

Where ColBERT wins: long docs (legal, contracts, research papers, manuals) where the relevance signal is concentrated in a few sentences buried inside 3,000 tokens of context. A single-vector bi-encoder averages that signal away, and a cross-encoder truncated at 512 tokens may never see it; ColBERT's per-token MaxSim recovers it.

Where ColBERT hurts: memory. Storing per-token vectors at fp16 for a 10M-doc corpus is roughly 50× the footprint of a single-vector index. The PLAID engine and 2-bit quantisation in ColBERTv2 bring it down a lot, but it's still a serious GPU and disk commitment compared to a bi-encoder.
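A back-of-envelope calculation shows where a figure like 50× comes from. Every input here is an assumption chosen for illustration: 300 stored token vectors per document (matching `doc_maxlen=300`), 128 dimensions per ColBERT vector, and a 768-dimensional single-vector baseline. The 2-bit line counts only the compressed residuals and ignores centroid ids and index overhead.

```python
def index_bytes(n_docs, vectors_per_doc, dim, bits_per_value):
    """Raw vector storage in bytes, ignoring index structures."""
    return n_docs * vectors_per_doc * dim * bits_per_value / 8

n_docs = 10_000_000
single = index_bytes(n_docs, 1, 768, 16)       # one 768-dim fp16 vector per doc
late_fp16 = index_bytes(n_docs, 300, 128, 16)  # 300 token vectors, fp16
late_2bit = index_bytes(n_docs, 300, 128, 2)   # same vectors, 2-bit residuals

for name, size in [("single", single), ("late fp16", late_fp16),
                   ("late 2-bit", late_2bit)]:
    print(f"{name}: {size / 1e9:.1f} GB ({size / single:.2f}x)")
```

Under these assumptions the fp16 late-interaction store is 768 GB against about 15 GB for the single-vector index, and 2-bit compression brings the vectors down to roughly 96 GB. Shorter documents or a larger baseline embedding shrink the ratio, so run the numbers for your own corpus.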

Bench setup: what to measure and how

You can't pick a reranker by reading blog posts (this one included). You measure on your own corpus.

The standard public bench is MS MARCO passage. It's English, it's web search, and it's well-labelled. Run it first as a sanity check that the model behaves as advertised. Then run a second, custom bench on your own corpus and your own queries. The second bench is the one that matters; the first is a smoke test.

Methodology, what to record per query:

| Metric | Why |
| --- | --- |
| Recall@10 | Did the right doc make the final top-10? |
| nDCG@10 | Weighted by rank position; rewards top-1 over top-9 |
| MRR@10 | Sensitive to the very top spot |
| p50 / p95 latency | The number your SRE cares about |
| Cost per 1k queries | The number your CFO cares about |

A minimal bench loop:

import time, statistics
from dataclasses import dataclass

@dataclass
class Run:
    name: str
    recall_at_10: float
    ndcg_at_10: float
    p95_ms: float
    cost_per_1k: float

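def bench(name, rerank_fn, queries, gold, candidates_by_q, cost_per_q):
    # rerank_fn must return (candidate, score) pairs whose candidate has an
    # .id; gold maps each query id to one relevant doc id; ndcg_at_k is a
    # helper defined elsewhere
    latencies, hits, ndcgs = [], 0, []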
    for q in queries:
        t0 = time.perf_counter()
        top10 = rerank_fn(q.text, candidates_by_q[q.id], top_k=10)
        latencies.append((time.perf_counter() - t0) * 1000)
        ids = [c.id for c, _ in top10]
        if gold[q.id] in ids:
            hits += 1
        ndcgs.append(ndcg_at_k(ids, gold[q.id], k=10))
    return Run(
        name=name,
        recall_at_10=hits / len(queries),
        ndcg_at_10=statistics.mean(ndcgs),
        p95_ms=statistics.quantiles(latencies, n=20)[18],
        cost_per_1k=cost_per_q * 1000,
    )

Use the same top-50 candidate set across all three rerankers so you're measuring the reranker, not the retriever. Use 1,000 queries minimum or your latency percentiles are noise. Run twice and confirm the second run matches the first within a percent. Cold caches lie.
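The bench loop calls an `ndcg_at_k` helper it never defines. A minimal version, assuming exactly one relevant document per query (the same shape as `gold[q.id]` in the loop), could look like this; `mrr_at_k` is added for the MRR@10 row of the metrics table. With graded or multiple relevant documents you would need the full DCG/IDCG form instead.

```python
import math

def ndcg_at_k(ranked_ids, gold_id, k=10):
    """nDCG@k with one relevant doc, so the ideal DCG is 1.0 (gold at rank 1)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr_at_k(ranked_ids, gold_id, k=10):
    """Reciprocal rank of the gold doc, or 0.0 if it is outside the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0
```

With the gold document at rank 2, nDCG is 1/log2(3), about 0.63, and the reciprocal rank is 0.5. Average each over all queries to get the reported metric.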

A representative shape from a custom bench against a 1.2M-doc support-knowledge-base corpus, 1k held-out labelled queries, single A10G:

| Reranker | Recall@10 | nDCG@10 | p95 latency | $/1k queries |
| --- | --- | --- | --- | --- |
| No rerank (bi-encoder top-10) | 0.71 | 0.58 | 12ms | ~0 |
| BGE reranker v2-m3 (fp16) | 0.86 | 0.74 | 84ms | ~0 (self-host) |
| Cohere rerank v3.5 | 0.88 | 0.76 | 140ms | $2.00 |
| Claude Haiku 4.5 listwise | 0.91 | 0.81 | 1100ms | ~$15 |
| ColBERTv2 (PLAID) | 0.87 | 0.77 | 220ms | ~0 (self-host) |

Your numbers won't match these. That's the point. Run the bench.

Per-corpus verdict

The headline reading from the table: every reranker beats no reranker by a wide margin. The choice between them is a trade between latency and the last 5-7 nDCG points.

A pattern that holds across many corpora:

  • High-volume chat / search where p95 must stay under 300ms: cross-encoder. BGE if you own the GPU, Cohere if you don't.
  • Low-volume / high-stakes (legal, medical, ops on-call search): LLM-as-reranker. The cost is fine when QPS is low and the cost of a wrong answer is high.
  • Long-doc corpora (technical manuals, legal contracts, research libraries): ColBERT. Its per-token scoring is the difference-maker once docs cross ~1,500 tokens.

When a cross-encoder is good enough and you keep paying for an LLM reranker, you're burning money to move nDCG by 5-7 points that your users wouldn't notice. When a cross-encoder is not enough, when your eval shows it plateau-ing well below the LLM's number on the queries that matter, you have a real reason to spend.

The gotcha: reranker scores aren't comparable across models

This one bites every team that runs more than one reranker.

Scores from different rerankers live on different distributions. BGE-v2 normalised through sigmoid lands in [0, 1] and a relevant pair sits around 0.55-0.75. Cohere's rerank scores are also [0, 1] but a relevant pair sits closer to 0.85-0.95. Claude returns whatever your prompt asks for, with no calibration behind the number. ColBERT's MaxSim is a sum over query tokens, so it is capped by query length rather than by 1: a relevant pair on a long doc might score 18, a short one might score 4.

If your code does anything like this:

# WRONG: works for one reranker, silently breaks when you swap
THRESHOLD = 0.7
relevant = [c for c, s in scored if s >= THRESHOLD]

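...then swapping reranker turns your filter into either a sieve or a wall. With BGE's scale, where relevant pairs sit around 0.55-0.75, 0.7 throws away a large share of them. With Cohere's scale, 0.7 keeps almost everything. With ColBERT's MaxSim, where scores run well above 1, 0.7 filters out nothing at all.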

The fix is per-corpus calibration. Pick a held-out labelled set, sweep thresholds, and pick the one that hits your target precision-recall point, for that reranker, on that corpus.

import numpy as np

def calibrate_threshold(scores, labels, target_recall=0.9):
    # scores and labels aligned; label=1 means truly relevant
    scores = np.array(scores)
    labels = np.array(labels)
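    # sweep candidate thresholds across the whole score distribution;
    # starting at the median would miss recall targets that need more
    # than half of the candidates kept. Returns (None, -1.0) if no
    # threshold reaches target_recall.
    candidates = np.quantile(scores, np.linspace(0.0, 0.99, 100))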
    best_t, best_precision = None, -1.0
    for t in candidates:
        kept = scores >= t
        if kept.sum() == 0:
            continue
        recall = labels[kept].sum() / max(labels.sum(), 1)
        if recall < target_recall:
            continue
        precision = labels[kept].sum() / kept.sum()
        if precision > best_precision:
            best_precision = precision
            best_t = t
    return best_t, best_precision

Store the calibrated threshold per (reranker, corpus) pair. Re-run calibration on every model swap and every meaningful corpus update. Treat it as a versioned artifact, not a constant in code.
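One way to make that concrete is a small JSON store keyed by reranker and corpus version. This is a sketch under assumptions: the `Calibration` fields, the `::` key format, and the file layout are all hypothetical, and a real deployment might keep the same record in a config service or model registry instead.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass(frozen=True)
class Calibration:
    reranker: str        # model name, ideally with its revision
    corpus_version: str  # whatever identifies the indexed snapshot
    threshold: float
    precision: float
    target_recall: float

def save_calibration(cal: Calibration, path: Path) -> None:
    """Insert or overwrite the entry for this (reranker, corpus) pair."""
    store = json.loads(path.read_text()) if path.exists() else {}
    store[f"{cal.reranker}::{cal.corpus_version}"] = asdict(cal)
    path.write_text(json.dumps(store, indent=2, sort_keys=True))

def load_threshold(reranker: str, corpus_version: str, path: Path) -> float:
    """Look up the threshold; refuse to guess when the pair is uncalibrated."""
    store = json.loads(path.read_text())
    key = f"{reranker}::{corpus_version}"
    if key not in store:
        # fail loudly instead of silently reusing another model's threshold
        raise KeyError(f"no calibration for {key}; re-run calibration")
    return store[key]["threshold"]
```

The important design choice is the hard failure on a missing key: a swapped model with no calibration should stop the deploy, not inherit the old 0.7.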

The other half of this gotcha: when you A/B two rerankers in production, never compare raw scores. Compare downstream metrics like answer correctness, click-through, user thumbs. The score from reranker A says nothing about reranker B's relevance, and vice versa.


If this was useful

Reranker choice is one of the few RAG tuning levers that pays back on day one. The wrong default costs you precision or latency or both, and a one-day bench tells you which. The RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production has a full chapter on reranker selection with the bench loop, the score-calibration patterns, and the production failure modes that don't show up until you've shipped. If this post helped you frame the trade-offs, the book takes you the rest of the way.

Which reranker are you running in production today, and when was the last time you re-benched it against the alternatives? Drop your numbers in the comments.
