- Book: RAG Pocket Guide
- Also by me: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture the bug a team I talked to was chewing through. The user asks about the refund clause in their MSA. The retriever pulls the right contract. The right paragraph is in the index. It comes back at rank 7. The model uses the top 5 and confidently answers from a renewal clause two pages earlier. Wrong answer, perfect citation style.
The fix is rarely better embeddings. It is a reranker.
Vector search by itself is a recall machine with a precision problem. It finds the neighborhood. It cannot tell you which house on the block is the one you want. A cross-encoder reranker reads the query and each candidate together and scores actual relevance based on the pair, going beyond embedding proximity. You retrieve top-50 fast, rerank to top-5 carefully, hand the 5 to the LLM.
Build a 50-query test set you can trust
Before you swap models, build a labeled set you can reproduce in an afternoon. Pick 50 real user queries from your logs. For each one, mark which document IDs in your corpus would actually answer it. One to three positives per query is fine.
# data/labels.jsonl — one row per query
# {"qid": "q01", "query": "...", "positives": ["doc_142"]}
# {"qid": "q02", "query": "...", "positives": ["doc_88","doc_91"]}
The metric is recall@5: of the documents that should be in the top 5, how many are. If a query has 2 positives and you retrieve 1 of them at rank 4, recall@5 for that query is 0.5. Average across the 50 queries.
def recall_at_k(retrieved_ids, positives, k=5):
    top = retrieved_ids[:k]
    hits = sum(1 for d in top if d in positives)
    return hits / max(1, len(positives))
Tiny set, real signal. You will see swings of 10 to 25 points when a reranker helps and swings under 3 points when it does not. That is enough to make a decision.
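To get that single number, loop the metric over the labeled file. A minimal harness sketch, assuming a retrieve_fn you supply that takes a raw query and returns ranked doc IDs:

import json

def mean_recall_at_5(labels_path, retrieve_fn):
    rows = [json.loads(line) for line in open(labels_path)]
    scores = [
        recall_at_k(retrieve_fn(row["query"]), row["positives"], k=5)
        for row in rows
    ]
    return sum(scores) / len(scores)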
Stage one: top-50 from the vector DB
The retriever stays the same. You only widen the funnel. Most setups already pull top-5 directly from a vector store. Pull top-50 instead, then send those candidates to the reranker.
# stage 1: dense retrieval, k=50
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def retrieve_candidates(query_vec, k=50):
    hits = client.search(
        collection_name="docs",
        query_vector=query_vec,
        limit=k,
    )
    return [
        {"id": h.id, "text": h.payload["text"]}
        for h in hits
    ]
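query_vec comes from whatever embedding model built the index; the retriever and the index have to agree on that. A minimal sketch assuming a sentence-transformers model, where the model name is a placeholder rather than a recommendation:

# assumption: swap in the model that actually built the "docs" collection
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def embed(query):
    return embedder.encode(query).tolist()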
Two things to check before you bolt a reranker on. First, is your positive document even in the top-50? If not, the reranker cannot save you and the bug is upstream: chunking, embedding model, or the query itself. Second, how often is it in top-50 but not in top-5? That gap is the reranker's job.
On a 50-query test set like this (illustrative numbers from a 12k-chunk product-doc corpus, embedded with a general-purpose multilingual model and indexed in Qdrant), the dense retriever lands at roughly 0.94 recall@50 and 0.61 recall@5. Roughly a third of the queries have the right doc somewhere in the 50 but miss the cutoff at 5. Classic reranker territory.
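Both checks fall out of one loop over the labeled set. A diagnostic sketch, reusing retrieve_candidates and the embed helper above; the wiring is hypothetical, so adjust it to your pipeline:

def funnel_gap(rows, k_wide=50, k_narrow=5):
    # count queries where a positive lands inside top-50 but outside top-5
    stuck_in_the_middle = 0
    for row in rows:
        ids = [c["id"] for c in retrieve_candidates(embed(row["query"]), k=k_wide)]
        hit_wide = any(p in ids for p in row["positives"])
        hit_narrow = any(p in ids[:k_narrow] for p in row["positives"])
        if hit_wide and not hit_narrow:
            stuck_in_the_middle += 1
    return stuck_in_the_middle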
Stage two: Cohere Rerank
Cohere's hosted reranker is the lowest-friction option. One API call, no GPU, multilingual. The current model id is rerank-v3.5 per the Cohere Rerank docs.
import cohere

co = cohere.ClientV2()  # reads COHERE_API_KEY

def rerank_cohere(query, candidates, top_n=5):
    docs = [c["text"] for c in candidates]
    resp = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=docs,
        top_n=top_n,
    )
    out = []
    for r in resp.results:
        c = candidates[r.index]
        out.append({
            "id": c["id"],
            "text": c["text"],
            "score": r.relevance_score,
        })
    return out
Each result gives you index (a pointer back into your candidate list) and relevance_score (a 0 to 1 float). The results come back already sorted by score and cut to top_n, so you map the indices back onto your candidates and hand the 5 to the generator.
On the same illustrative 50-query test set, plugging Cohere rerank-v3.5 over the same top-50 candidates moves recall@5 from 0.61 to 0.83. A jump that size tells you the bug lives in ranking, and the retrieval funnel itself was already wide enough.
Stage two, alt: BGE-reranker-v2-m3 locally
If you cannot send documents to a hosted API (regulated data, air-gapped deploy, or you just want the cost line to be zero), BAAI/bge-reranker-v2-m3 is the open-source default. Multilingual, good size, runs on a single GPU or even CPU for small batches. See the model card on Hugging Face.
from FlagEmbedding import FlagReranker

reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,
)

def rerank_bge(query, candidates, top_n=5):
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [
        {**c, "score": s}
        for c, s in ranked[:top_n]
    ]
normalize=True applies a sigmoid so scores land in 0 to 1, which makes thresholding easier when you want to drop a candidate that scored too low to bother sending to the LLM.
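A minimal filter sketch; the 0.3 cutoff is an illustrative number, not a recommendation, so tune it against your labeled set:

def drop_weak_candidates(reranked, min_score=0.3):
    # keep only candidates the reranker scored above the cutoff
    return [c for c in reranked if c["score"] >= min_score]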
On the same illustrative 50-query test set, BGE-reranker-v2-m3 over the same top-50 moves recall@5 from 0.61 to 0.79. A few points behind Cohere on this corpus. Within the noise on plenty of others. Worth running on yours before you decide.
Cost and latency, plainly
A reranker is not free. You are paying for a second pass that reads each candidate alongside the query, which is exactly why it works.
The shape to keep in your head:
- Cohere Rerank. Hosted call adds network latency. Illustrative shape: roughly 150 to 400 ms for 50 candidates of a few hundred tokens each. Billed per search. Check the current price on the Cohere pricing page. It will be a small fraction of your generation cost, but it is per query, so a high-QPS app should model it.
- BGE-reranker-v2-m3 self-hosted. Illustrative shape on a single L4 GPU: 50 candidates of 300 tokens each run in roughly 80 to 200 ms with fp16. On CPU, expect a few seconds: fine for a chat app, painful for autocomplete. Operational cost is the GPU you are already paying for.
The latency tradeoff is the one to take seriously. You added a stage. If your end-to-end budget was 1.2 seconds and the LLM eats 800 ms, you have 400 ms for retrieval plus rerank plus everything else. Measure before you commit.
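One way to see where the budget goes is to time each stage on its own. A rough sketch with a stopwatch helper; the sample query is made up:

import time

def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000  # milliseconds

qvec, t_embed = timed(embed, "refund clause in the MSA")
cands, t_retrieve = timed(retrieve_candidates, qvec, k=50)
top5, t_rerank = timed(rerank_cohere, "refund clause in the MSA", cands, top_n=5)
print(f"embed {t_embed:.0f}ms, retrieve {t_retrieve:.0f}ms, rerank {t_rerank:.0f}ms")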
Where rerankers do not help
A reranker fixes ranking. It does not fix:
- The right document is not in the top-50. Go back and look at chunking, the embedding model, or query rewriting.
- The corpus does not contain the answer. No reranker invents content.
- The query is ambiguous and pulls two equally valid neighborhoods. The fix lives in query understanding upstream of retrieval.
- Your top-1 was already correct and the LLM still got it wrong. That is a prompting problem; the reranker is not the layer that helps.
If your recall@50 is already mediocre, a reranker will polish a candidate set that does not contain the answer. You will see recall@5 stay flat and conclude the reranker is broken. The reranker is fine. The funnel is leaking earlier.
The whole loop, end to end:
def answer(query):
    qvec = embed(query)                          # your existing embedding call
    cands = retrieve_candidates(qvec, k=50)      # stage 1: wide and fast
    top5 = rerank_cohere(query, cands, top_n=5)  # stage 2: narrow and careful
    return generate(query, top5)                 # your existing LLM call
Next move: build the labeled 50 today, baseline recall@5 against your current retriever, then run one of the two rerankers above on the same top-50 and rerun the metric. If recall jumps, ship it behind a flag and watch end-to-end latency for a week. If it stays flat, that is your signal to spend the next sprint on chunking, the embedding model, or query rewriting before you touch the ranker again.
If this was useful
The RAG Pocket Guide walks through retrieval, chunking, and reranking patterns with the failure modes you actually hit in production. The Prompt Engineering Pocket Guide is the companion for the layer right after retrieval, turning the top 5 into an answer that does not hallucinate the citation it just got handed.

