If your RAG app sometimes answers from the wrong document even though the right one was in your database, the fix usually isn't a better embedding model — it's adding a reranker. It's the single highest-leverage upgrade to a basic retrieval pipeline, and it's easy to bolt on.
This is Day 8 of PromptFromZero, where I break down one technique a day.
Why plain retrieval gets the order wrong
Standard RAG embeds your question and grabs the nearest document vectors. It's lightning fast — but that embedding squashes a whole passage into a single vector, so it often ranks a doc that merely shares words with your question above the one that actually answers it. Good enough for a first pass; not good enough to feed straight to the LLM.
const candidates = await vectorSearch(query, { k: 25 }); // fast, coarse
Bi-encoder vs cross-encoder (the key distinction)
Retrieval uses a bi-encoder: it encodes the query and each document separately, then compares the vectors. Fast, because every doc is pre-encoded once — but it never sees the query and the doc together.
A cross-encoder feeds the pair into the model together, so it can judge true relevance with full context. Far more accurate — at the cost of running once per candidate.
bi-encoder: sim(embed(query), embed(doc)) ← fast, approximate
cross-encoder: model(query + doc) → score ← slow, precise
The pattern: retrieve wide, rerank, keep narrow
You can't run an expensive cross-encoder over millions of documents. So:
- Retrieve a wide shortlist cheaply — grab the top 25, not the top 3.
- Rerank just those 25 with a cross-encoder, which re-reads each (query, doc) pair and assigns a precise relevance score.
- Keep only the best 2–4 for the prompt.
const scored = await reranker.rank(query, candidates);
const context = scored.sort((a, b) => b.score - a.score).slice(0, 3);
In the interactive demo on the page, pick a question and watch the truly-relevant doc (highlighted green) sit at position #5 under plain retrieval, then climb to #1 after reranking. Same documents, much better order.
No reranker model? An LLM can do it
If you don't want a dedicated reranker (Cohere Rerank, bge-reranker, etc.), just ask an LLM to score each candidate:
Rate 0-10 how well this passage answers the question.
Question: {q}
Passage: {doc}
Score:
Run it per candidate, sort by the number. Slower and pricier per query, but zero extra infrastructure and surprisingly strong.
Why fewer, better passages win
After reranking you throw most candidates away. Fewer, higher-quality passages mean a shorter prompt, lower cost, and far less chance the model gets distracted by an irrelevant chunk that happened to share keywords. In RAG, the quality of your context beats the quantity every single time.
The tradeoff
Reranking adds a step — latency and a little cost. But for anything where a wrong answer is expensive (support, legal, medical, internal search), the accuracy jump is well worth it. If your RAG "knows" the answer but won't surface it, reach for a reranker first.
👉 See it reorder live (watch the right doc climb): https://dev48v.infy.uk/prompt/day8-rag-rerank.html
🌐 All techniques: https://dev48v.infy.uk/promptfromzero.php
Tomorrow: HyDE — write a hypothetical answer first, then search with that.
Top comments (0)