Rerankers and the Latency Budget: When Cross-Encoders Are Worth It

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Me: xgabriel.com | GitHub

You added a reranker because a blog post said you should. Recall went up on the test set. Then a product manager pinged you about the search box feeling slow, and your p95 had crept from 180ms to 900ms. Now you are staring at a trace, trying to decide whether the reranker is paying for the latency it costs.

That decision has a wrong answer in both directions. Skip reranking and your top-5 is full of chunks that matched on vocabulary but not meaning. Rerank everything with the heaviest model you can find and your search box stalls. The interesting question is not "reranker yes or no." It is which reranker, fed how many candidates, for which traffic.

What a reranker actually does

Your first-stage retriever (dense vectors, BM25, or a hybrid of both) represents every document independently of the query, at index time. With dense retrieval the query embedding never meets the document text; the two are only compared as finished vectors. That is what makes it fast and what makes it imprecise. Two chunks can sit equally close to the query vector while one answers the question and one just shares its topic.

A reranker re-scores a small candidate set with the query and document seen together. That joint view is where the precision comes from. The three approaches below differ in how they compute that joint score, and the cost of that score is the whole story.

Cross-encoder: the joint-attention default

A cross-encoder feeds the query and a candidate document into one transformer as a single sequence. Every query token attends to every document token. The model emits one relevance score. You run it once per candidate.

from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, docs: list[str], top_k: int = 5):
    pairs = [(query, d) for d in docs]
    scores = model.predict(pairs)
    ranked = sorted(
        zip(docs, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return ranked[:top_k]

This is the strongest precision per dollar for most teams. bge-reranker-v2-m3 and Cohere's hosted Rerank both land here. The cost is linear in candidate count: every document is a full forward pass through the model. Rerank 100 candidates and you pay for 100 passes. That linearity is the lever you control with top-k, and we get to that below.
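One way to see that linearity on your own hardware is to time the rerank step at a few candidate counts. This is a minimal sketch, not a benchmark harness: `time_rerank` and its parameters are names made up for this example, and it accepts any callable that takes a query and a list of documents.

```python
import time
from typing import Callable, Sequence


def time_rerank(
    rerank_fn: Callable[[str, list[str]], object],
    query: str,
    docs: Sequence[str],
    candidate_counts: Sequence[int] = (10, 25, 50, 100),
    repeats: int = 3,
) -> dict[int, float]:
    """Return best-of-`repeats` wall-clock milliseconds per candidate count."""
    timings: dict[int, float] = {}
    for n in candidate_counts:
        subset = list(docs[:n])  # the first n candidates, as the first stage would hand them over
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            rerank_fn(query, subset)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            best = min(best, elapsed_ms)
        timings[n] = best
    return timings
```

Passing the `rerank` function from above as `rerank_fn` should give timings that grow roughly in proportion to the candidate count, with some step effects from batching.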

LLM rerank: a general model doing a specific job

LLM reranking hands the candidates to a chat model and asks it to score or order them. Sometimes it is one prompt with all candidates listed, sometimes a pairwise tournament, sometimes a per-document yes/no.

import json
from openai import OpenAI

client = OpenAI()

RANK_PROMPT = """Score each passage 0-10 for how well it
answers the query. Return JSON: {"scores": [int, ...]}
in passage order. Score only relevance, not writing quality."""

def llm_rerank(query: str, docs: list[str], top_k: int = 5):
    listing = "\n\n".join(
        f"[{i}] {d}" for i, d in enumerate(docs)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RANK_PROMPT},
            {"role": "user",
             "content": f"Query: {query}\n\n{listing}"},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
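    # The model can return the wrong number of scores; zip() would
    # silently drop documents, so check before ranking.
    scores = json.loads(resp.choices[0].message.content)["scores"]
    if len(scores) != len(docs):
        raise ValueError(
            f"expected {len(docs)} scores, got {len(scores)}"
        )
    ranked = sorted(
        zip(docs, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return ranked[:top_k]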

The win is reasoning. An LLM can tell that a passage about "the late-payment penalty" answers a query about "what happens if I pay the invoice after the due date" even when the surface words barely overlap. It also reads instructions, so you can bias it toward recency or a document type without retraining.

The cost is latency you do not control well. One LLM call on a packed context of 50 candidates runs 600ms to 2s depending on the model and how much text each chunk carries. Pairwise tournaments multiply that. You are renting a large general model to do a job a much smaller cross-encoder (bge-v2-m3 is around 568M parameters) does in a fraction of the time. LLM rerank earns its place when relevance needs genuine reasoning over the passage, not when it is standing in for a model you did not want to host.
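To put a number on "multiply", here is a rough call-count sketch for the three prompting styles mentioned earlier. The function name, the strategy labels, and the `window` parameter are assumptions for illustration. The pairwise case assumes a naive all-pairs comparison; smarter tournament schemes need fewer calls.

```python
import math


def llm_calls(n_candidates: int, strategy: str, window: int = 50) -> int:
    """Count the chat-completion calls needed to rank n candidates."""
    if n_candidates < 0 or window < 1:
        raise ValueError("n_candidates must be >= 0 and window >= 1")
    if strategy == "listwise":
        # One prompt per window of candidates that fits in the context.
        return math.ceil(n_candidates / window)
    if strategy == "pointwise":
        # One yes/no or score call per document.
        return n_candidates
    if strategy == "pairwise":
        # Naive all-pairs comparison: n choose 2.
        return n_candidates * (n_candidates - 1) // 2
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("listwise", "pointwise", "pairwise"):
    print(name, llm_calls(50, name))
# listwise 1
# pointwise 50
# pairwise 1225
```

Pointwise and pairwise calls can run concurrently, so wall-clock time does not grow as fast as the call count, but the token bill and the rate-limit pressure do.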

ColBERT: precision that scales differently

ColBERT sits between the two. Instead of one vector per document, it stores one vector per token, then scores a query against a document with late interaction — every query token finds its best-matching document token, and those maxes sum to the score.
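The late-interaction score is small enough to write out by hand. The sketch below uses plain Python lists and cosine similarity so it runs without dependencies; `maxsim_score` is a name chosen for this example, and real implementations do the same thing with batched matrix operations on a GPU.

```python
import math

Vector = list[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def maxsim_score(query_tokens: list[Vector], doc_tokens: list[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_tokens:
        return 0.0
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    score = 0.0
    for qv in q:
        # Each query token keeps only its best-matching document token.
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


query = [[1.0, 0.0], [0.0, 1.0]]
on_topic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vague = [[1.0, 1.0]]
assert maxsim_score(query, on_topic) > maxsim_score(query, vague)
```

The document side of that computation (`doc_tokens`) does not depend on the query, which is what lets it be stored ahead of time.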

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained(
    "colbert-ir/colbertv2.0"
)

# Index once, offline. Token vectors are precomputed.
rag.index(
    collection=documents,
    index_name="my_corpus",
)

def colbert_search(query: str, top_k: int = 5):
    return rag.search(query, k=top_k)

The trick is that the document token vectors are computed once at index time, not per query. At query time you only encode the query and run the late-interaction match. That makes ColBERT much cheaper per candidate than a cross-encoder, with most of the precision. The bill moves to storage: token-level vectors are large, and the index is bigger than a single-vector store by a wide margin. ColBERT fits teams that have the disk and want cross-encoder-grade ranking without a per-query forward pass on every candidate.
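A back-of-the-envelope estimate shows where the storage goes. The figures below are assumptions, not measurements: 1M chunks, a 1024-dimension float32 vector per chunk for the single-vector store, and about 200 tokens per chunk with 128-dimension float16 token vectors for the token-level index. ColBERTv2 compresses its token vectors, so a real index should come in smaller than this uncompressed figure.

```python
def index_size_gb(
    n_chunks: int,
    vectors_per_chunk: int,
    dim: int,
    bytes_per_value: float,
) -> float:
    """Raw vector storage in gigabytes, ignoring index overhead and compression."""
    return n_chunks * vectors_per_chunk * dim * bytes_per_value / 1e9


# Assumed: one 1024-dim float32 vector per chunk.
single_vector = index_size_gb(1_000_000, 1, 1024, 4)
# Assumed: ~200 token vectors per chunk, 128-dim, float16, uncompressed.
token_level = index_size_gb(1_000_000, 200, 128, 2)

print(f"single-vector: {single_vector:.1f} GB")  # single-vector: 4.1 GB
print(f"token-level:   {token_level:.1f} GB")    # token-level:   51.2 GB
```

Swap in your own chunk length and embedding dimensions before drawing conclusions; the ratio moves a lot with tokens per chunk.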

The three on the same queries

Same first-stage retriever (hybrid dense + BM25, top-100 candidates), same corpus, same query set. Only the rerank step changes.

Reranker	nDCG@10	Added p50 latency	Added p95 latency
None (first-stage only)	0.58	0 ms	0 ms
Cross-encoder (bge-v2-m3, top-100)	0.71	~140 ms	~320 ms
Cross-encoder (top-25)	0.69	~45 ms	~110 ms
ColBERT late interaction	0.70	~35 ms	~90 ms
LLM rerank (gpt-4o-mini, top-50)	0.72	~700 ms	~1900 ms

These are representative numbers for a 1M-chunk corpus on a single A10 GPU for the local models and a hosted API for the LLM, p50/p95 over 500 queries. Treat them as shape, not gospel — your corpus, hardware, and chunk size move every row. Run your own.

Three things to read off the table. The biggest jump is from no reranker to any reranker; the gap between the rerankers is smaller than the gap to baseline. The LLM wins on quality by a thread (0.72 against 0.71) and pays for it with roughly 5x the added latency of the top-100 cross-encoder and more than 15x that of the top-25 one. The cross-encoder at top-25 gives up two nDCG points to recover most of the latency, which is the trade most search boxes should take.
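If you run your own version of the table, the two latency columns can be computed from per-query timings like this. It is one reasonable reading of "added latency" (the percentile of the per-query difference), using the nearest-rank method; `percentile` and `added_latency` are helper names made up for this sketch.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def added_latency(
    baseline_ms: list[float], reranked_ms: list[float]
) -> dict[str, float]:
    """p50/p95 of the extra milliseconds per query; lists are aligned by query."""
    if len(baseline_ms) != len(reranked_ms):
        raise ValueError("both lists need one timing per query")
    deltas = [r - b for b, r in zip(baseline_ms, reranked_ms)]
    return {
        "p50_add": percentile(deltas, 50),
        "p95_add": percentile(deltas, 95),
    }
```

Note that the p95 of the differences is not the same number as the difference of the two p95s, so pick one definition and use it for every row.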

Top-k before the rerank is the real dial

Here is the part teams skip. The reranker only sees what the first stage hands it. If the right document is not in the top-100 candidates, no reranker can promote it — reranking does not retrieve, it reorders. So your candidate count sets a ceiling on quality and a floor on latency at the same time.

def search(query: str, rerank_k: int = 25, final_k: int = 5):
    # First stage casts a wide, cheap net.
    candidates = hybrid_search(query, k=rerank_k)
    # Reranker does the expensive joint scoring on
    # only those candidates.
    return rerank(query, candidates, top_k=final_k)

Push rerank_k up and you raise the recall ceiling, because the answer is more likely to be in the set. You also raise cost linearly for a cross-encoder, because every extra candidate is another forward pass. Push it down and latency drops, but you start dropping good documents before the reranker ever sees them.

The sweet spot is corpus-dependent and you find it by sweeping. Plot nDCG against rerank_k for 10, 25, 50, 100, 200. For most corpora the curve flattens hard somewhere between 25 and 50 — past that point you are paying for forward passes that reorder documents nobody reads. Find the knee, set rerank_k just past it, and stop.
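Once you have the sweep, picking the knee can be as simple as the heuristic below. `find_knee` and its `min_gain` threshold are assumptions for this sketch, and the curve values are illustrative: the 25 and 100 points echo the table above, the rest are made up to show the shape.

```python
def find_knee(curve: dict[int, float], min_gain: float = 0.01) -> int:
    """Return the smallest rerank_k after which the next step adds < min_gain nDCG."""
    ks = sorted(curve)
    if not ks:
        raise ValueError("curve must not be empty")
    for current, nxt in zip(ks, ks[1:]):
        if curve[nxt] - curve[current] < min_gain:
            return current
    # The curve never flattened within the swept range.
    return ks[-1]


# Illustrative sweep: rerank_k -> nDCG@10 (hypothetical values).
sweep = {10: 0.64, 25: 0.69, 50: 0.705, 100: 0.71, 200: 0.711}
print(find_knee(sweep))  # 50
```

This stops at the first flat step, so a noisy curve can fool it. Average over enough queries that adjacent points are not within measurement noise of each other.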

When the rerank pass earns its milliseconds

Reach for a reranker when your first stage is good enough to retrieve the answer somewhere in the top-50 but not good enough to place it at position one or two. That is the precision gap a reranker closes. Measure it: if your context recall at 50 is high and your context precision at 5 is low, reranking is the fix.
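A simplified, ID-based version of that check looks like the following. It assumes you have labeled relevant document IDs per query; evaluation frameworks define context recall and context precision with more nuance, so treat these two functions as a stand-in, with names chosen for this example.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k < 1:
        raise ValueError("k must be >= 1")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical first-stage output: the answers exist, but far down the list.
ranked = [f"d{i}" for i in range(50)]
relevant = {"d30", "d41"}

print(recall_at_k(ranked, relevant, 50))    # 1.0
print(precision_at_k(ranked, relevant, 5))  # 0.0
```

High recall at 50 with low precision at 5, averaged over your query set, is the signature of a ranking problem rather than a retrieval problem.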

Pick the cross-encoder as the default. It gives most teams the best precision per millisecond and runs locally with no per-query API bill. Tune rerank_k down until the nDCG curve says stop.

Reach for ColBERT when you want cross-encoder-grade ranking at lower query-time cost and you can pay for the larger index. The storage is the tax; the latency is the reward.

Reach for LLM rerank when relevance needs reasoning the cross-encoder cannot do (instruction-following, multi-criteria scoring, domains where a fine-tuned reranker does not exist) and you have the latency budget for it. Cache aggressively, because the same query hitting the same candidates should never pay twice.
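A minimal sketch of that cache, keyed on the query, the candidate texts in order, and `top_k`. It wraps any rerank function with the same signature as `llm_rerank`; `make_cached_reranker` is a name invented here, and the in-memory dict is an assumption that a production setup would likely replace with a shared store plus expiry and invalidation on reindex.

```python
import hashlib
import json
from typing import Callable, Optional


def make_cached_reranker(
    rerank_fn: Callable[[str, list[str], int], list],
    cache: Optional[dict] = None,
) -> Callable[..., list]:
    """Wrap a reranker so identical (query, candidates, top_k) calls run once."""
    store: dict = {} if cache is None else cache

    def cached(query: str, docs: list[str], top_k: int = 5) -> list:
        # Candidate order is part of the key: listwise prompts are order-sensitive.
        payload = json.dumps(
            {"q": query, "docs": docs, "k": top_k}, ensure_ascii=False
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in store:
            store[key] = rerank_fn(query, docs, top_k)
        return store[key]

    return cached
```

Usage would be `cached_rerank = make_cached_reranker(llm_rerank)`, then calling `cached_rerank(query, docs, top_k=5)` wherever `llm_rerank` was called. Normalizing the query (case, whitespace) before hashing is a design choice that raises the hit rate at the risk of merging queries that differ in ways that matter.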

Skip reranking when your first stage already nails precision at 5. Small lexical corpora with short queries often do. Adding a reranker there spends latency to reorder a list that was already correct, which is the opposite of the trade you wanted.

The reranker is not free precision. It is precision you buy with milliseconds, and the candidate count is the price tag. Set the budget, sweep the top-k, measure the curve, and put the model where the curve says it pays.

If this was useful

If your retrieval layer is the weak link, the RAG Pocket Guide walks through retrieval, chunking, and reranking patterns end to end — including the candidate-count sweep, the cross-encoder-vs-ColBERT trade-off, and how to set a latency budget you can defend in a design review. It is the coherent version of the scattered advice you find when you search this stuff one blog post at a time.