Md Ayan Arshad

I Increased Retrieval From Top-5 to Top-20. My Answers Got Worse

The standard advice for improving RAG retrieval quality is: retrieve more candidates, then filter down. Bigger pool, better reranker, better answers. I followed that advice in my RAG system. On PDFs, going from top-5 to top-20 made my RAGAS scores drop. The answers got worse, not better.

Here's what actually happened and the experiment design that explained it.

TL;DR

PDFs (40 QA pairs, 5 technical documents):

| Condition | RAGAS SUM | Context Precision |
| --- | --- | --- |
| top-5, no reranker (baseline) | 3.4330 | 0.8102 |
| top-20, no reranker | 3.4051 ↓ | 0.8118 |
| top-20 → Cohere rerank → top-5 | 3.4843 | 0.8368 |

GitHub code (50 QA pairs, encode/httpx repo):

| Condition | RAGAS SUM | Context Precision |
| --- | --- | --- |
| top-5, no reranker (baseline) | 3.5680 | 0.7812 |
| top-20, no reranker | 3.5766 | 0.7812 ← identical |
| top-20 → Cohere rerank → top-5 | 3.7079 | 0.9335 |

On PDFs, more candidates without a quality filter made scores drop. On code, a 4x larger pool produced zero improvement in Context Precision (0.7812 versus 0.7812). In both cases, every gain came entirely from the reranker.

The standard advice

Most RAG tutorials recommend something like: retrieve top-20 or top-50 candidates, then rerank to top-5. The reasoning is intuitive: a bigger retrieval pool gives the reranker more material to work with, so the final 5 chunks are better quality.

That reasoning isn't wrong. But it hides an important assumption: the reranker is present. Without it, a bigger pool doesn't help. It actively hurts.

To separate these two effects, I designed a 3-condition experiment. Most people only test "with reranker vs. without reranker," which confounds pool size and reranking quality in a single comparison. Breaking it into three conditions isolates what's actually causing the change.

The 3-condition experiment design

```
Condition A: top-5,  no reranker     → baseline
Condition B: top-20, no reranker     → isolates pool size effect
Condition C: top-20 → Cohere → top-5 → isolates reranker contribution
```

The logic:

  • If C > B: the reranker is doing real work, not just benefiting from more candidates
  • If B > A: a bigger pool helps even without reranking
  • If B ≈ A or B < A: pool size doesn't matter and all improvement comes from the reranker
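
In code, the three conditions differ in exactly two knobs: the retrieval k and whether the reranker runs. Here's a minimal sketch of the experiment loop; `retrieve`, `rerank`, `generate_answer`, and `evaluate_ragas` are hypothetical stand-ins for the real pipeline, not the project's actual function names:

```python
# Hypothetical sketch of the 3-condition experiment loop.
# retrieve / rerank / generate_answer / evaluate_ragas are stand-in helpers.

def run_condition(qa_pairs, k, use_reranker):
    records = []
    for qa in qa_pairs:
        chunks = retrieve(qa["question"], k=k)  # embedding-similarity top-k
        if use_reranker:
            chunks = rerank(qa["question"], chunks, top_n=5)  # quality filter
        answer = generate_answer(qa["question"], chunks)
        records.append({
            "question": qa["question"],
            "answer": answer,
            "contexts": [c["text"] for c in chunks],
            "ground_truth": qa["ground_truth"],
        })
    return evaluate_ragas(records)  # Faithfulness, Ctx Precision, Ctx Recall, ...

results = {
    "A: top-5, no reranker":      run_condition(qa_pairs, k=5,  use_reranker=False),
    "B: top-20, no reranker":     run_condition(qa_pairs, k=20, use_reranker=False),
    "C: top-20 -> rerank -> 5":   run_condition(qa_pairs, k=20, use_reranker=True),
}
```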

Running this on two different data types produced two different failure modes. Both pointed at the same root cause.

Result 1: PDFs

Corpus: 5 technical PDFs (FastAPI, Kubernetes, React, Stripe API reference, AWS overview), with 40 QA pairs.

| Condition | Faithfulness | Ctx Precision | Ctx Recall | RAGAS SUM |
| --- | --- | --- | --- | --- |
| top-5, no reranker | 0.9137 | 0.8102 | 0.8917 | 3.4330 |
| top-20, no reranker | 0.9004 | 0.8118 | 0.8750 | 3.4051 |
| top-20 → Cohere → top-5 | 0.9267 | 0.8368 | 0.8929 | 3.4843 |

Condition B scored lower than Condition A on every metric except Context Precision, where it gained 0.0016, which is statistically meaningless. Overall RAGAS SUM dropped from 3.4330 to 3.4051.

More candidates made the answers worse.

The reranker (Condition C) recovered the loss and added on top of it: SUM 3.4843, Context Precision 0.8368. The difference between C and B is entirely the reranker's contribution.

Result 2: GitHub code

Corpus: the encode/httpx repository (90 files), with 50 QA pairs on function behavior and parameters. Full experiment code and eval sets are in the repo.

| Condition | Ctx Precision | Ctx Recall | RAGAS SUM |
| --- | --- | --- | --- |
| top-5, no reranker | 0.7812 | 0.9700 | 3.5680 |
| top-20, no reranker | 0.7812 | 0.9700 | 3.5766 |
| top-20 → Cohere → top-5 | 0.9335 | 0.9300 | 3.7079 |

Condition B versus Condition A: Context Precision 0.7812 versus 0.7812. Identical. A 4x larger retrieval pool produced zero improvement in precision.

Then the reranker: Context Precision jumps from 0.7812 to 0.9335. That's +0.1523, the largest precision gain in any experiment across this entire project. RAGAS SUM 3.7079 is the highest score in the project. The PDF best was 3.4843.

One tradeoff worth naming: Context Recall dropped slightly, from 0.9700 to 0.9300, when the reranker was added. The reranker filters aggressively for relevance; occasionally it discards a chunk that contained useful information but didn't score highest on the query. For most QA use cases, a +0.1523 precision gain at the cost of -0.0400 recall is clearly the right tradeoff. But it's real, and worth monitoring if recall matters more than precision for your use case.

Every point of the precision improvement came from the reranker, not from the pool size.

Why this happens

Without a reranker, top-k selection is based purely on embedding similarity. The retriever returns the k chunks whose vectors are closest to the query vector. At top-5, those are the 5 closest. At top-20, you get those same 5 plus 15 more, which are further away in embedding space and increasingly likely to be noise.

Those 15 extra chunks go directly into the LLM's context window. The LLM sees 20 chunks instead of 5. The signal-to-noise ratio drops. The answers get worse.
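
The superset effect is easy to demonstrate. Below is a toy numpy sketch where random vectors stand in for real embeddings; it illustrates the ranking mechanism, not the project's actual retrieval code:

```python
import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=384)           # stand-in for an embedded query
chunk_vecs = rng.normal(size=(1000, 384))  # stand-ins for embedded chunks

# Cosine similarity of the query against every chunk
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)

top5 = set(np.argsort(sims)[-5:])
top20 = set(np.argsort(sims)[-20:])
assert top5 <= top20  # the same 5 chunks, plus 15 strictly weaker matches
```

Top-20 is always top-5 plus the next 15 weaker matches, so without a downstream filter the extra candidates can only dilute the context.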

The reranker changes the game because it operates on a completely different signal. Cohere's reranker doesn't use vector proximity — it reads the query and each chunk as text, then scores relevance directly. It can distinguish between a chunk that contains the query's keywords but doesn't answer the question, and a chunk that answers the question using different words. Embedding similarity can't do that.

So the reranker takes the noisy top-20 pool and discards 15 chunks. The 5 it keeps are genuinely relevant, not just vectorially close. That's why Context Precision jumped from 0.7812 to 0.9335 on code and why adding more candidates without the reranker did nothing.
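
The rerank step itself is roughly one API call. Here is a sketch using the Cohere Python SDK; the model name and response fields are assumptions that may vary across SDK versions, and `question` and `candidates` are hypothetical variables from the surrounding pipeline:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

response = co.rerank(
    model="rerank-english-v3.0",                 # assumed model name
    query=question,
    documents=[c["text"] for c in candidates],   # the noisy top-20 pool
    top_n=5,                                     # keep only the 5 best
)

# Each result carries an index into `documents` plus a relevance score
top_chunks = [candidates[r.index] for r in response.results]
```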

The "reranker does real work" proof

The 3-condition design specifically tests this.

If all the improvement in Condition C came from the larger pool rather than the reranker, then Condition B (same pool, no reranker) would show similar gains. It didn't: on code, B and A were identical; on PDFs, B was worse than A.

Every gain in Condition C came from the reranker acting on the larger pool. The pool size is not the lever. The reranker is.

This matters practically. A common optimization people reach for is "increase k." It's a one-line config change. But the data shows it has no effect without a reranker, and can actively hurt. The right lever is adding a reranker, not increasing k.

What I learned

  • Increasing retrieval candidates without a reranker adds noise, not signal: on PDFs, top-20 without a reranker scored lower than top-5 on every metric
  • On code, expanding from top-5 to top-20 produced 0.0000 improvement in Context Precision; pool size was genuinely irrelevant
  • The 3-condition design (top-5 / top-20 / top-20+rerank) is the correct way to test this; "with vs. without reranker" conflates two separate effects
  • The reranker's advantage is operating on text, not vectors; it catches semantic relevance that embedding similarity misses
  • +0.1523 Context Precision on code is the largest single-component gain in this project: one API call, one reranker, that result

The practical takeaway

If you're trying to improve RAG answer quality, don't reach for a larger k first.

Add a reranker. Then increase k if you want to give it more to work with.

Increasing k without a reranker gives the LLM more context to get confused by. With a reranker, a larger pool means the right chunks are more likely to be in the candidate set before filtering. The order matters.

A top-20 retrieve → Cohere rerank → top-5 pipeline consistently outperformed both top-5 (baseline) and top-20 without reranking across two separate data types and 90 total QA pairs. The pattern is stable.
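
Putting the ordering into code, here is a hedged sketch of that pipeline; `retrieve` and `generate_answer` are hypothetical helpers, and `co` is a Cohere client as in the earlier sketch:

```python
def answer_with_rerank(question, k_retrieve=20, k_final=5):
    # 1. Cast a wide net with cheap embedding search
    candidates = retrieve(question, k=k_retrieve)

    # 2. Filter with the reranker, which reads text instead of comparing vectors
    response = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=[c["text"] for c in candidates],
        top_n=k_final,
    )
    top_chunks = [candidates[r.index] for r in response.results]

    # 3. Only the filtered chunks reach the LLM's context window
    return generate_answer(question, top_chunks)
```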


Part of an ongoing series on building and evaluating a production RAG system.
Full code on GitHub: Reverse Engineering YC Startup
Previous post: I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.
