DEV Community

Devanshu Biswas

Reranking: Retrieve Fast, Then Reorder Precisely (Better RAG)

Your RAG retriever pulls 50 candidate docs in milliseconds, but the best one is often sitting at rank 7, not rank 1. Reranking fixes the order with a slower, smarter model. It's one of the cheapest big wins in RAG quality.

🥇 Watch the reorder: https://dev48v.infy.uk/prompt/day14-reranking.html

Two encoders, two jobs

  • Bi-encoder (retrieval): embeds the query and every doc separately, then finds the doc vectors nearest to the query vector. Fast and scalable: doc embeddings are computed once at index time, so you can search millions. But it never looks at the query and a doc together, so the ranking is coarse.
  • Cross-encoder (reranking): feeds the query AND one doc into the model together and outputs a relevance score for that pair. Accurate, but it needs a full model pass per (query, doc) pair at query time, so it's far too slow to run over your whole corpus.
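To make the difference concrete, here is a toy sketch. These are stand-ins, not real models: `embed` is a bag-of-words vector and `cross_score` is a hypothetical phrase-overlap heuristic, whereas a real cross-encoder is a transformer reading both texts jointly. What carries over is the shape: `embed` sees one text at a time, so doc vectors can be cached at index time, while `cross_score` needs the query and doc together and must run once per pair at query time.

```python
import math
from collections import Counter

# Toy "bi-encoder": encodes ONE text into a vector (bag of words here;
# a real bi-encoder is a neural embedding model).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing words, so non-shared words add nothing.
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "cross-encoder" (hypothetical heuristic): takes query AND doc together,
# rewarding query word pairs that appear in the same order in the doc.
def cross_score(query: str, doc: str) -> float:
    q, d = query.lower().split(), doc.lower().split()
    phrase_hits = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    word_hits = len(set(q) & set(d))
    return 2.0 * phrase_hits + word_hits

query = "reset my password"
docs = ["my password policy", "how to reset my password in settings"]

q_vec = embed(query)                  # query encoded alone
doc_vecs = [embed(d) for d in docs]   # docs encoded alone; cacheable at index time
for d, v in zip(docs, doc_vecs):
    print(f"{d!r}: bi={cosine(q_vec, v):.3f} cross={cross_score(query, d):.1f}")
# 'my password policy': bi=0.667 cross=4.0
# 'how to reset my password in settings': bi=0.655 cross=7.0
```

The short doc wins on cosine because its few words overlap heavily with the query; the pair-aware scorer prefers the doc that actually contains the query's phrases in order.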

The pattern: retrieve wide, rerank narrow

Retrieve the top ~50 cheaply with the bi-encoder, then cross-encode just those 50 (query, doc) pairs, re-sort by the new scores, and keep the top 5 for your LLM. The genuinely best doc can now climb to #1, so the LLM answers from better context.
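The retrieve-wide, rerank-narrow pattern fits in one function. This is a minimal sketch, not any library's API: `retrieve_then_rerank`, `embed`, `similarity`, and `cross_score_batch` are hypothetical names, passed in as callables so any bi-encoder and cross-encoder can be plugged in. Stage 1 is a brute-force scan for clarity; a real system would query an approximate nearest-neighbor index instead.

```python
from typing import Callable, Sequence, TypeVar

Vec = TypeVar("Vec")

def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    doc_vecs: Sequence[Vec],                  # bi-encoder embeddings, computed once at index time
    embed: Callable[[str], Vec],              # bi-encoder: encodes ONE text
    similarity: Callable[[Vec, Vec], float],  # e.g. cosine or dot product
    cross_score_batch: Callable[[list[tuple[str, str]]], Sequence[float]],  # cross-encoder over pairs
    k: int = 50,
    n: int = 5,
) -> list[str]:
    # Stage 1 (retrieve wide): rank every doc vector against the query vector, keep top k.
    q_vec = embed(query)
    shortlist = sorted(
        range(len(docs)),
        key=lambda i: similarity(q_vec, doc_vecs[i]),
        reverse=True,
    )[:k]

    # Stage 2 (rerank narrow): score only the k shortlisted (query, doc) pairs.
    pairs = [(query, docs[i]) for i in shortlist]
    scores = cross_score_batch(pairs)

    # Re-sort the shortlist by cross-encoder score; keep the top n for the LLM prompt.
    reranked = sorted(zip(scores, shortlist), key=lambda t: t[0], reverse=True)
    return [docs[i] for _, i in reranked[:n]]
```

Because `cross_score_batch` takes the whole list of pairs, a batched model fits directly; for example, sentence-transformers' `CrossEncoder.predict` accepts a list of (query, doc) pairs and returns one score per pair. Ties in the cross-encoder score keep their retrieval order, since Python's `sorted` is stable even with `reverse=True`.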

The trade-off

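Reranking adds latency and cost to every query, but only in proportion to the shortlist size, not the corpus size. Tune "retrieve K, keep N" to balance recall vs. speed: a larger K makes it more likely the best doc is in the shortlist at all, but means more cross-encoder passes; N sets how many docs reach the LLM's context.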
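The K side of this trade-off is simple arithmetic. A minimal sketch, assuming you know (e.g. from a labeled eval set) the 1-based retrieval rank of the most relevant doc for each query; `shortlist_tradeoff` and the ranks below are hypothetical, not measurements. Recall@K is the ceiling on what any reranker can fix, because a doc that never makes the shortlist is never cross-encoded, and the number of pairs scored per query is exactly K.

```python
def shortlist_tradeoff(best_doc_ranks: list[int], k_values: list[int]) -> list[tuple[int, float, int]]:
    """Return (K, recall@K, cross-encoder pairs scored per query) for each K."""
    if not best_doc_ranks:
        raise ValueError("need at least one query")
    rows = []
    for k in k_values:
        # The reranker can only promote docs the retriever already put in the top K.
        hits = sum(1 for rank in best_doc_ranks if rank <= k)
        rows.append((k, hits / len(best_doc_ranks), k))
    return rows

# Hypothetical best-doc ranks for five queries (one of them the "rank 7" case).
ranks = [1, 7, 3, 12, 2]
for k, recall, pairs in shortlist_tradeoff(ranks, [5, 10, 20]):
    print(f"K={k:>2}: recall@K={recall:.0%}, pairs scored per query={pairs}")
# K= 5: recall@K=60%, pairs scored per query=5
# K=10: recall@K=80%, pairs scored per query=10
# K=20: recall@K=100%, pairs scored per query=20
```

Real eval data rarely looks this tidy, but the shape holds: recall can only rise as K grows, while the number of pairs to score grows linearly with K.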

🔨 Full pipeline (vector search top-k → cross-encoder rerank → top-n → LLM) on the page: https://dev48v.infy.uk/prompt/day14-reranking.html

Part of PromptFromZero. 🌐 https://dev48v.infy.uk
