Reranking: Retrieve Fast, Then Reorder Precisely (Better RAG)

#ai #beginners #llm #rag

Your RAG retriever pulls 50 candidate docs in milliseconds, but the best one is often sitting at rank 7, not rank 1. Reranking fixes the order by rescoring those candidates with a slower, more accurate model. It is often one of the cheapest ways to get a large gain in RAG quality.

🥇 Watch the reorder: https://dev48v.infy.uk/prompt/day14-reranking.html

Two encoders, two jobs

Bi-encoder (retrieval): embeds the query and every doc separately, then finds the nearest vectors. Doc embeddings are computed once at index time, so it is fast and scales to millions of docs. But it never looks at the query and a doc together, so the ranking is coarse.
Cross-encoder (reranking): feeds the query AND one doc into the model together and outputs a relevance score for that specific pair. That makes it more accurate, but it needs one forward pass per (query, doc) pair at query time, which is far too slow to run over your whole corpus.
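
To make the structural difference concrete, here is a minimal sketch using toy stand-ins rather than real models. `embed`, `cosine` and `cross_score` are hypothetical helpers invented for this illustration: the "bi-encoder" is just a bag-of-words vector, and the "cross-encoder" is a hand-written scorer that rewards query phrases appearing in the doc. Real encoders are neural models and far more capable; the only point here is that one side scores from two independently built vectors, while the other sees both texts at once.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a bi-encoder: a bag-of-words vector built from ONE text alone.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Similarity between two vectors that were computed independently.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_score(query: str, doc: str) -> int:
    # Toy stand-in for a cross-encoder: it reads the query and the doc TOGETHER,
    # so it can reward query word pairs that appear adjacent and in order in the doc.
    q, d = query.lower().split(), doc.lower().split()
    shared_words = len(set(q) & set(d))
    shared_bigrams = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    return shared_words + 2 * shared_bigrams

query = "python calls rust"
doc_a = "how python calls rust"
doc_b = "how rust calls python"

# Separate encoding: both docs contain the same words, so the toy vectors tie.
bi_a = cosine(embed(query), embed(doc_a))
bi_b = cosine(embed(query), embed(doc_b))

# Joint scoring: only doc_a contains the query's phrases in order.
cross_a = cross_score(query, doc_a)
cross_b = cross_score(query, doc_b)

print(bi_a, bi_b)
print(cross_a, cross_b)
```

In this toy setup the two docs are indistinguishable to the separate-encoding scorer, while the joint scorer gives doc_a 7 and doc_b 3. A real bi-encoder would not be this blind to word order; the sketch exaggerates the gap to show where it comes from.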

The pattern: retrieve wide, rerank narrow

Retrieve the top ~50 cheaply with the bi-encoder, then cross-encode just those 50 (query, doc) pairs, re-sort by the new scores, and keep the top 5 for your LLM. A relevant doc that the bi-encoder buried at rank 7 can climb to #1, which gives the LLM better context to answer from.
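
The two-stage pattern above can be sketched in a few lines. This is an illustration under assumptions, not a production pipeline: `retrieve_then_rerank`, `word_overlap` and `phrase_aware` are hypothetical names, the two scorers are toy functions standing in for a bi-encoder and a cross-encoder, and stage 1 loops over every doc where a real system would query a vector index built ahead of time.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    docs: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    retrieve_k: int = 50,
    keep_n: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1 (retrieve wide): rank the whole corpus with the cheap scorer,
    # keep only the top retrieve_k candidates.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_k]
    # Stage 2 (rerank narrow): run the precise scorer on the shortlist only.
    rescored = [(precise_score(query, d), d) for d in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:keep_n]

def word_overlap(query: str, doc: str) -> float:
    # Toy cheap scorer: number of shared words, ignoring order.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_aware(query: str, doc: str) -> float:
    # Toy precise scorer: shared words plus a bonus for shared adjacent word pairs.
    q, d = query.lower().split(), doc.lower().split()
    shared_bigrams = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    return word_overlap(query, doc) + 2.0 * shared_bigrams

docs = [
    "how rust calls python",
    "how python calls rust",
    "rust ownership basics",
    "python list comprehension tips",
    "baking sourdough bread",
]

results = retrieve_then_rerank(
    "python calls rust", docs, word_overlap, phrase_aware, retrieve_k=3, keep_n=2
)
for score, doc in results:
    print(score, doc)
```

Here the cheap scorer ties the first two docs and leaves "how python calls rust" in second place; after the rerank it moves to the top. The precise scorer ran on 3 docs, not all 5, which is the whole reason for the split.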

The trade-off

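Reranking adds latency and cost per query, but only on a small shortlist. Tune "retrieve K, keep N" to balance recall vs. speed: a larger K makes it more likely the right doc is in the shortlist but means more cross-encoder passes, while N controls how much context you hand to the LLM.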
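
A rough back-of-the-envelope sketch of that trade-off, assuming the cross-encoder scores pairs in fixed-size batches and each batch takes about the same time. `rerank_cost` is a hypothetical helper, and the 20 ms per batch and batch size of 16 are made-up numbers for illustration only; measure your own model and hardware.

```python
import math

def rerank_cost(retrieve_k: int, ms_per_batch: float, batch_size: int) -> dict:
    # The cross-encoder scores one (query, doc) pair per shortlisted doc,
    # so the work grows with K, not with the size of the corpus.
    batches = math.ceil(retrieve_k / batch_size)
    return {
        "pairs_scored": retrieve_k,
        "batches": batches,
        "added_ms": batches * ms_per_batch,
    }

# Invented timing figures, for illustration only.
small = rerank_cost(retrieve_k=50, ms_per_batch=20.0, batch_size=16)
large = rerank_cost(retrieve_k=200, ms_per_batch=20.0, batch_size=16)
print(small)
print(large)
```

Under these assumed numbers, K=50 adds 4 batches (80 ms) and K=200 adds 13 batches (260 ms) to every query, regardless of how many docs are in the index.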

🔨 Full pipeline (vector search top-k → cross-encoder rerank → top-n → LLM) on the page: https://dev48v.infy.uk/prompt/day14-reranking.html

Part of PromptFromZero. 🌐 https://dev48v.infy.uk

Top comments (1)

Ahmet Özel • Jun 23

Good explanation of the split. One thing worth adding from production: the reranker can only fix order, not recall. If the right chunk is not in your top 50 candidates, no cross-encoder will save you. I usually tune the bi-encoder K first by measuring recall@K offline, pick the smallest K that still captures the gold chunk, and only then put the cross-encoder on top. That keeps latency sane, because cross-encoding 50 pairs per query adds up fast under load. Adding hybrid retrieval (BM25 plus dense) before the rerank also lifts recall on exact terms like error codes or IDs that pure vector search tends to miss.
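
The offline K tuning described in this comment could look roughly like the sketch below. It assumes you already have, for each evaluation query, the retriever's ranked list of chunk IDs and one known gold chunk ID. `recall_at_k`, `smallest_k`, `fake_ranking` and the sample data are all hypothetical and exist only to make the example runnable.

```python
from typing import Dict, List, Optional

def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    # Fraction of queries whose gold chunk ID appears in the first k retrieved IDs.
    if not ranked:
        return 0.0
    hits = sum(1 for q, ids in ranked.items() if gold[q] in ids[:k])
    return hits / len(ranked)

def smallest_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, str],
    candidates: List[int],
    target: float,
) -> Optional[int]:
    # Return the smallest candidate K whose recall@K reaches the target, else None.
    for k in sorted(candidates):
        if recall_at_k(ranked, gold, k) >= target:
            return k
    return None

def fake_ranking(gold_id: str, position: int, length: int = 10) -> List[str]:
    # Build a made-up ranked list with the gold chunk at a chosen 0-based position.
    ids = [f"d{i}" for i in range(length)]
    ids[position] = gold_id
    return ids

# Made-up evaluation set: gold chunks sit at ranks 1, 3, 7 and 2.
gold = {"q1": "g1", "q2": "g2", "q3": "g3", "q4": "g4"}
ranked = {
    "q1": fake_ranking("g1", 0),
    "q2": fake_ranking("g2", 2),
    "q3": fake_ranking("g3", 6),
    "q4": fake_ranking("g4", 1),
}

for k in (1, 3, 5, 10):
    print(k, recall_at_k(ranked, gold, k))

best_k = smallest_k(ranked, gold, candidates=[1, 3, 5, 10], target=1.0)
print(best_k)
```

On this invented data, recall is 0.25 at K=1, 0.75 at K=3 and K=5, and 1.0 at K=10, so K=10 is the smallest candidate that captures every gold chunk. Anything the retriever misses at the chosen K stays missed, since the reranker only reorders what it is given.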