Md Ayan Arshad

I Increased Retrieval From Top-5 to Top-20. My Answers Got Worse

The standard advice for improving RAG retrieval quality is: retrieve more candidates, then filter down. Bigger pool, better reranker, better answers. I followed that advice in my RAG system. On PDFs, going from top-5 to top-20 made my RAGAS scores drop. The answers got worse, not better.

Here's what actually happened and the experiment design that explained it.

TL;DR

PDFs (40 QA pairs, 5 technical documents):

| Condition | RAGAS SUM | Context Precision |
| --- | --- | --- |
| top-5, no reranker (baseline) | 3.4330 | 0.8102 |
| top-20, no reranker | 3.4051 ↓ | 0.8118 |
| top-20 → Cohere rerank → top-5 | 3.4843 | 0.8368 |

GitHub code (50 QA pairs, encode/httpx repo):

| Condition | RAGAS SUM | Context Precision |
| --- | --- | --- |
| top-5, no reranker (baseline) | 3.5680 | 0.7812 |
| top-20, no reranker | 3.5766 | 0.7812 ← identical |
| top-20 → Cohere rerank → top-5 | 3.7079 | 0.9335 |

On PDFs, more candidates without a quality filter made scores drop. On code, a 4x larger pool produced zero improvement in Context Precision (0.7812 versus 0.7812). In both cases, every gain came entirely from the reranker.

The standard advice

Most RAG tutorials recommend something like: retrieve top-20 or top-50 candidates, then rerank to top-5. The reasoning is intuitive: a bigger retrieval pool gives the reranker more material to work with, so the final 5 chunks are better quality.

That reasoning isn't wrong. But it hides an important assumption: the reranker is present. Without it, a bigger pool doesn't help. It actively hurts.

To separate these two effects, I designed a 3-condition experiment. Most people only test "with reranker vs. without reranker," which confounds pool size and reranking quality in a single comparison. Breaking it into three conditions isolates what's actually causing the change.

The 3-condition experiment design

```
Condition A: top-5,  no reranker     → baseline
Condition B: top-20, no reranker     → isolates pool size effect
Condition C: top-20 → Cohere → top-5 → isolates reranker contribution
```

The logic:

  • If C > B: the reranker is doing real work, not just benefiting from more candidates
  • If B > A: a bigger pool helps even without reranking
  • If B ≈ A or B < A: pool size doesn't matter and all improvement comes from the reranker
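
In code, the three conditions differ in exactly two knobs: the retrieval k and whether the reranker runs. Here's a minimal sketch of the experiment loop; `retrieve`, `rerank`, `generate_answer`, and `evaluate_ragas` are hypothetical stand-ins for the real pipeline, not the project's actual function names:

```python
# Hypothetical sketch of the 3-condition experiment loop.
# retrieve / rerank / generate_answer / evaluate_ragas are stand-in helpers.

def run_condition(qa_pairs, k, use_reranker):
    records = []
    for qa in qa_pairs:
        chunks = retrieve(qa["question"], k=k)  # embedding-similarity top-k
        if use_reranker:
            chunks = rerank(qa["question"], chunks, top_n=5)  # quality filter
        answer = generate_answer(qa["question"], chunks)
        records.append({
            "question": qa["question"],
            "answer": answer,
            "contexts": [c["text"] for c in chunks],
            "ground_truth": qa["ground_truth"],
        })
    return evaluate_ragas(records)  # Faithfulness, Ctx Precision, Ctx Recall, ...

results = {
    "A: top-5, no reranker":      run_condition(qa_pairs, k=5,  use_reranker=False),
    "B: top-20, no reranker":     run_condition(qa_pairs, k=20, use_reranker=False),
    "C: top-20 -> rerank -> 5":   run_condition(qa_pairs, k=20, use_reranker=True),
}
```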

Running this on two different data types produced two different failure modes. Both pointed at the same root cause.

Result 1: PDFs

Corpus: 5 technical PDFs (FastAPI, Kubernetes, React, Stripe API reference, AWS overview), with 40 QA pairs.

| Condition | Faithfulness | Ctx Precision | Ctx Recall | RAGAS SUM |
| --- | --- | --- | --- | --- |
| top-5, no reranker | 0.9137 | 0.8102 | 0.8917 | 3.4330 |
| top-20, no reranker | 0.9004 | 0.8118 | 0.8750 | 3.4051 |
| top-20 → Cohere → top-5 | 0.9267 | 0.8368 | 0.8929 | 3.4843 |

Condition B scored lower than Condition A on every metric except Context Precision, where it gained 0.0016, which is statistically meaningless. Overall RAGAS SUM dropped from 3.4330 to 3.4051.

More candidates made the answers worse.

The reranker (Condition C) recovered the loss and added on top of it: SUM 3.4843, Context Precision 0.8368. The difference between C and B is entirely the reranker's contribution.

Result 2: GitHub code

Corpus: the encode/httpx repository (90 files), with 50 QA pairs on function behavior and parameters. Full experiment code and eval sets are in the repo.

| Condition | Ctx Precision | Ctx Recall | RAGAS SUM |
| --- | --- | --- | --- |
| top-5, no reranker | 0.7812 | 0.9700 | 3.5680 |
| top-20, no reranker | 0.7812 | 0.9700 | 3.5766 |
| top-20 → Cohere → top-5 | 0.9335 | 0.9300 | 3.7079 |

Condition B versus Condition A: Context Precision 0.7812 versus 0.7812. Identical. A 4x larger retrieval pool produced zero improvement in precision.

Then the reranker: Context Precision jumps from 0.7812 to 0.9335. That's +0.1523, the largest precision gain in any experiment across this entire project. RAGAS SUM 3.7079 is the highest score in the project. The PDF best was 3.4843.

One tradeoff worth naming: Context Recall dropped slightly, from 0.9700 to 0.9300, when the reranker was added. The reranker filters aggressively for relevance; occasionally it discards a chunk that contained useful information but didn't score highest on the query. For most QA use cases, a +0.1523 precision gain at the cost of -0.0400 recall is clearly the right tradeoff. But it's real, and worth monitoring if recall matters more than precision for your use case.

Every point of the precision improvement came from the reranker, not from the pool size.

Why this happens

Without a reranker, top-k selection is based purely on embedding similarity. The retriever returns the k chunks whose vectors are closest to the query vector. At top-5, those are the 5 closest. At top-20, you get those same 5 plus 15 more, which are further away in embedding space and increasingly likely to be noise.

Those 15 extra chunks go directly into the LLM's context window. The LLM sees 20 chunks instead of 5. The signal-to-noise ratio drops. The answers get worse.
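
The superset effect is easy to demonstrate. Below is a toy numpy sketch where random vectors stand in for real embeddings; it illustrates the ranking mechanism, not the project's actual retrieval code:

```python
import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=384)           # stand-in for an embedded query
chunk_vecs = rng.normal(size=(1000, 384))  # stand-ins for embedded chunks

# Cosine similarity of the query against every chunk
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)

top5 = set(np.argsort(sims)[-5:])
top20 = set(np.argsort(sims)[-20:])
assert top5 <= top20  # the same 5 chunks, plus 15 strictly weaker matches
```

Top-20 is always top-5 plus the next 15 weaker matches, so without a downstream filter the extra candidates can only dilute the context.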

The reranker changes the game because it operates on a completely different signal. Cohere's reranker doesn't use vector proximity — it reads the query and each chunk as text, then scores relevance directly. It can distinguish between a chunk that contains the query's keywords but doesn't answer the question, and a chunk that answers the question using different words. Embedding similarity can't do that.

So the reranker takes the noisy top-20 pool and discards 15 chunks. The 5 it keeps are genuinely relevant, not just vectorially close. That's why Context Precision jumped from 0.7812 to 0.9335 on code and why adding more candidates without the reranker did nothing.
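
The rerank step itself is roughly one API call. Here is a sketch using the Cohere Python SDK; the model name and response fields are assumptions that may vary across SDK versions, and `question` and `candidates` are hypothetical variables from the surrounding pipeline:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

response = co.rerank(
    model="rerank-english-v3.0",                 # assumed model name
    query=question,
    documents=[c["text"] for c in candidates],   # the noisy top-20 pool
    top_n=5,                                     # keep only the 5 best
)

# Each result carries an index into `documents` plus a relevance score
top_chunks = [candidates[r.index] for r in response.results]
```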

The "reranker does real work" proof

The 3-condition design specifically tests this.

If all the improvement in Condition C came from the larger pool rather than the reranker, then Condition B (same pool, no reranker) would show similar gains. It didn't: on code, B and A were identical; on PDFs, B was worse than A.

Every gain in Condition C came from the reranker acting on the larger pool. The pool size is not the lever. The reranker is.

This matters practically. A common optimization people reach for is "increase k." It's a one-line config change. But the data shows it has no effect without a reranker, and can actively hurt. The right lever is adding a reranker, not increasing k.

What I learned

  • Increasing retrieval candidates without a reranker adds noise, not signal: on PDFs, top-20 without a reranker scored lower than top-5 on every metric
  • On code, expanding from top-5 to top-20 produced 0.0000 improvement in Context Precision; pool size was genuinely irrelevant
  • The 3-condition design (top-5 / top-20 / top-20+rerank) is the correct way to test this; "with vs. without reranker" conflates two separate effects
  • The reranker's advantage is operating on text, not vectors; it catches semantic relevance that embedding similarity misses
  • +0.1523 Context Precision on code is the largest single-component gain in this project: one API call, one reranker, that result

The practical takeaway

If you're trying to improve RAG answer quality, don't reach for a larger k first.

Add a reranker. Then increase k if you want to give it more to work with.

Increasing k without a reranker gives the LLM more context to get confused by. With a reranker, a larger pool means the right chunks are more likely to be in the candidate set before filtering. The order matters.

A top-20 retrieve → Cohere rerank → top-5 pipeline consistently outperformed both top-5 (baseline) and top-20 without reranking across two separate data types and 90 total QA pairs. The pattern is stable.
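
Putting the ordering into code, here is a hedged sketch of that pipeline; `retrieve` and `generate_answer` are hypothetical helpers, and `co` is a Cohere client as in the earlier sketch:

```python
def answer_with_rerank(question, k_retrieve=20, k_final=5):
    # 1. Cast a wide net with cheap embedding search
    candidates = retrieve(question, k=k_retrieve)

    # 2. Filter with the reranker, which reads text instead of comparing vectors
    response = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=[c["text"] for c in candidates],
        top_n=k_final,
    )
    top_chunks = [candidates[r.index] for r in response.results]

    # 3. Only the filtered chunks reach the LLM's context window
    return generate_answer(question, top_chunks)
```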


Part of an ongoing series on building and evaluating a production RAG system.
Full code on GitHub: Reverse Engineering YC Startup
Previous post: I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.
