<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Md Ayan Arshad</title>
    <description>The latest articles on DEV Community by Md Ayan Arshad (@ayanarshad02).</description>
    <link>https://dev.to/ayanarshad02</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817591%2F0fa1e4d1-8be4-4467-b7f3-2fa3543e804b.png</url>
      <title>DEV Community: Md Ayan Arshad</title>
      <link>https://dev.to/ayanarshad02</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayanarshad02"/>
    <language>en</language>
    <item>
      <title>Wrote an article about why increasing retrieval from top-5 to top-20 worsens my answer quality: https://dev.to/ayanarshad02/i-increased-retrieval-from-top-5-to-top-20-my-answers-got-worse-3mke</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Fri, 08 May 2026 03:21:48 +0000</pubDate>
      <link>https://dev.to/ayanarshad02/wrote-article-about-why-increasing-retrieval-from-top-5-to-top-20-worsens-my-answer-quality-4ffn</link>
      <guid>https://dev.to/ayanarshad02/wrote-article-about-why-increasing-retrieval-from-top-5-to-top-20-worsens-my-answer-quality-4ffn</guid>
      <description>&lt;p&gt;Wrote an article about why increasing retrieval from top-5 to top-20 worsened my answer quality: &lt;a href="https://dev.to/ayanarshad02/i-increased-retrieval-from-top-5-to-top-20-my-answers-got-worse-3mke"&gt;I Increased Retrieval From Top-5 to Top-20. My Answers Got Worse&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>I Increased Retrieval From Top-5 to Top-20. My Answers Got Worse</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Thu, 07 May 2026 13:53:07 +0000</pubDate>
      <link>https://dev.to/ayanarshad02/i-increased-retrieval-from-top-5-to-top-20-my-answers-got-worse-3mke</link>
      <guid>https://dev.to/ayanarshad02/i-increased-retrieval-from-top-5-to-top-20-my-answers-got-worse-3mke</guid>
      <description>&lt;p&gt;The standard advice for improving RAG retrieval quality is: retrieve more candidates, then filter down. Bigger pool, better reranker, better answers. I followed that advice in &lt;a href="https://github.com/AyanArshad02/kapa-inspired-rag-mcp" rel="noopener noreferrer"&gt;my RAG System&lt;/a&gt;. On PDFs, going from top-5 to top-20 made my RAGAS scores drop. The answers got worse, not better.&lt;/p&gt;

&lt;p&gt;Here's what actually happened and the experiment design that explained it.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PDFs&lt;/strong&gt; (40 QA pairs, 5 technical documents):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;th&gt;Context Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;top-5, no reranker (baseline)&lt;/td&gt;
&lt;td&gt;3.4330&lt;/td&gt;
&lt;td&gt;0.8102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20, no reranker&lt;/td&gt;
&lt;td&gt;3.4051 ↓&lt;/td&gt;
&lt;td&gt;0.8118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20 → Cohere rerank → top-5&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.4843&lt;/strong&gt; ↑&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8368&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GitHub code&lt;/strong&gt; (50 QA pairs, &lt;code&gt;encode/httpx&lt;/code&gt; repo):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;th&gt;Context Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;top-5, no reranker (baseline)&lt;/td&gt;
&lt;td&gt;3.5680&lt;/td&gt;
&lt;td&gt;0.7812&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20, no reranker&lt;/td&gt;
&lt;td&gt;3.5766&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.7812&lt;/strong&gt; ← identical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20 → Cohere rerank → top-5&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.7079&lt;/strong&gt; ↑&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9335&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On PDFs, more candidates without a quality filter made scores drop. On code, a 4x larger pool produced zero improvement in Context Precision: 0.7812 versus 0.7812. In both cases, every gain came entirely from the reranker.&lt;/p&gt;

&lt;h2&gt;
  
  
  The standard advice
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials recommend something like: retrieve top-20 or top-50 candidates, then rerank to top-5. The reasoning is intuitive: a bigger retrieval pool gives the reranker more material to work with, so the final 5 chunks are higher quality.&lt;/p&gt;

&lt;p&gt;That reasoning isn't wrong. But it hides an important assumption: that a reranker is present. Without one, a bigger pool doesn't help. It actively hurts.&lt;/p&gt;

&lt;p&gt;To separate these two effects, I designed a three-condition experiment. Most people only test "with reranker vs. without reranker", which confounds pool size and reranking quality in a single comparison. Breaking it into three conditions isolates what's actually causing the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-condition experiment design
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Condition A: top-5,  no reranker    → baseline
Condition B: top-20, no reranker    → isolates pool size effect
Condition C: top-20 → Cohere → top-5 → isolates reranker contribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If C &amp;gt; B: the reranker is doing real work, not just benefiting from more candidates&lt;/li&gt;
&lt;li&gt;If B &amp;gt; A: a bigger pool helps even without reranking&lt;/li&gt;
&lt;li&gt;If B ≈ A or B &amp;lt; A: pool size doesn't matter and all improvement comes from the reranker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running this on two different data types produced two different failure modes. Both pointed at the same root cause.&lt;/p&gt;
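
&lt;p&gt;To make the three conditions concrete, here is a minimal sketch of how each run differs. The &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;rerank&lt;/code&gt; helpers are hypothetical stand-ins for the vector search and Cohere rerank steps in the actual pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the three experimental conditions.
# retrieve() and rerank() are hypothetical helpers standing in for the
# vector search and Cohere rerank steps in the real pipeline.

def condition_a(query):
    # Baseline: top-5 straight from the vector index.
    return retrieve(query, k=5)

def condition_b(query):
    # Pool-size effect only: top-20 with no quality filter.
    return retrieve(query, k=20)

def condition_c(query):
    # Reranker contribution: over-retrieve, then filter back down to 5.
    candidates = retrieve(query, k=20)
    return rerank(query, candidates, top_n=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;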

&lt;h2&gt;
  
  
  Result 1: PDFs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Corpus:&lt;/strong&gt; 5 technical PDFs (FastAPI, Kubernetes, React, the Stripe API reference, an AWS overview) with 40 QA pairs&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Faithfulness&lt;/th&gt;
&lt;th&gt;Ctx Precision&lt;/th&gt;
&lt;th&gt;Ctx Recall&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;top-5, no reranker&lt;/td&gt;
&lt;td&gt;0.9137&lt;/td&gt;
&lt;td&gt;0.8102&lt;/td&gt;
&lt;td&gt;0.8917&lt;/td&gt;
&lt;td&gt;3.4330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20, no reranker&lt;/td&gt;
&lt;td&gt;0.9004&lt;/td&gt;
&lt;td&gt;0.8118&lt;/td&gt;
&lt;td&gt;0.8750&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.4051&lt;/strong&gt; ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20 → Cohere → top-5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9267&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8368&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8929&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.4843&lt;/strong&gt; ↑&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Condition B scored lower than Condition A on every metric except Context Precision, where it gained 0.0016, a statistically meaningless difference. Overall RAGAS SUM dropped from 3.4330 to 3.4051.&lt;/p&gt;

&lt;p&gt;More candidates made the answers worse.&lt;/p&gt;

&lt;p&gt;The reranker (Condition C) recovered the loss and added on top of it: SUM 3.4843, Context Precision 0.8368. The difference between C and B is entirely the reranker's contribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Result 2: GitHub code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Corpus:&lt;/strong&gt; the &lt;code&gt;encode/httpx&lt;/code&gt; repository (90 files), with 50 QA pairs on function behavior and parameters. Full experiment code and eval sets are in the &lt;a href="https://github.com/AyanArshad02/kapa-inspired-rag-mcp" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Ctx Precision&lt;/th&gt;
&lt;th&gt;Ctx Recall&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;top-5, no reranker&lt;/td&gt;
&lt;td&gt;0.7812&lt;/td&gt;
&lt;td&gt;0.9700&lt;/td&gt;
&lt;td&gt;3.5680&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20, no reranker&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.7812&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.9700&lt;/td&gt;
&lt;td&gt;3.5766&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top-20 → Cohere → top-5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9335&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.9300&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.7079&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Condition B versus Condition A: Context Precision 0.7812 versus 0.7812. Identical. A 4x larger retrieval pool produced zero improvement in precision.&lt;/p&gt;

&lt;p&gt;Then the reranker: Context Precision jumps from 0.7812 to 0.9335. That's +0.1523, the largest precision gain in any experiment across this entire project. The RAGAS SUM of 3.7079 is also the highest score in the project (the PDF best was 3.4843).&lt;/p&gt;

&lt;p&gt;One tradeoff worth naming: Context Recall dropped slightly from 0.9700 to 0.9300 when the reranker was added. The reranker filters aggressively for relevance; occasionally it discards a chunk that contained useful information but didn't score highest on the query. For most QA use cases, a +0.1523 precision gain at the cost of -0.0400 recall is clearly the right tradeoff. But the loss is real, and worth monitoring if recall matters more than precision for your use case.&lt;/p&gt;

&lt;p&gt;Every point of the precision improvement came from the reranker, not from the pool size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Without a reranker, the top-k selection is based purely on embedding similarity. The embedding model retrieves the 20 chunks whose vectors are closest to the query vector. At top-5, those are the 5 closest. At top-20, you get those same 5 plus 15 more, which are farther away in embedding space and increasingly likely to be noise.&lt;/p&gt;

&lt;p&gt;Those 15 extra chunks go directly into the LLM's context window. The LLM sees 20 chunks instead of 5. The signal-to-noise ratio drops. The answers get worse.&lt;/p&gt;

&lt;p&gt;The reranker changes the game because it operates on a completely different signal. Cohere's reranker doesn't use vector proximity — it reads the query and each chunk as text, then scores relevance directly. It can distinguish between a chunk that contains the query's keywords but doesn't answer the question, and a chunk that answers the question using different words. Embedding similarity can't do that.&lt;/p&gt;

&lt;p&gt;So the reranker takes the noisy top-20 pool and discards 15 chunks. The 5 it keeps are genuinely relevant, not just vectorially close. That's why Context Precision jumped from 0.7812 to 0.9335 on code and why adding more candidates without the reranker did nothing.&lt;/p&gt;
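
&lt;p&gt;For reference, here is a minimal sketch of that retrieve-then-rerank step. It assumes the Cohere Python SDK's rerank endpoint plus a hypothetical &lt;code&gt;vector_search&lt;/code&gt; helper; the model name and parameters are illustrative, not the exact configuration used in these experiments:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import cohere

co = cohere.Client("YOUR_API_KEY")  # assumes the Cohere Python SDK

def retrieve_and_rerank(query, top_k=20, final_k=5):
    # Step 1: over-retrieve by embedding similarity (hypothetical helper
    # returning a list of chunk strings).
    candidates = vector_search(query, k=top_k)

    # Step 2: rerank on text relevance rather than vector proximity.
    response = co.rerank(
        model="rerank-english-v3.0",  # illustrative model name
        query=query,
        documents=candidates,
        top_n=final_k,
    )

    # Keep only the reranker's top picks, in the reranker's order.
    return [candidates[r.index] for r in response.results]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;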

&lt;h2&gt;
  
  
  The "reranker does real work" proof
&lt;/h2&gt;

&lt;p&gt;The three-condition design tests exactly this.&lt;/p&gt;

&lt;p&gt;If all the improvement in Condition C came from the larger pool rather than the reranker, then Condition B (same pool, no reranker) would show similar gains. It didn't: on code, B and A were identical; on PDFs, B was worse than A.&lt;/p&gt;

&lt;p&gt;Every gain in Condition C came from the reranker acting on the larger pool. The pool size is not the lever. The reranker is.&lt;/p&gt;

&lt;p&gt;This matters practically. A common optimization people reach for is "increase k." It's a one-line config change. But the data shows it has no effect without a reranker, and can actively hurt. The right lever is adding a reranker, not increasing k.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Increasing retrieval candidates without a reranker adds noise, not signal: on PDFs, top-20 without a reranker scored lower than top-5 on every metric&lt;/li&gt;
&lt;li&gt;On code, expanding from top-5 to top-20 produced 0.0000 improvement in Context Precision; the pool size was genuinely irrelevant&lt;/li&gt;
&lt;li&gt;The three-condition design (top-5 / top-20 / top-20+rerank) is the right way to test this; "with vs. without reranker" conflates two separate effects&lt;/li&gt;
&lt;li&gt;The reranker's advantage is that it operates on text, not vectors; it catches semantic relevance that embedding similarity misses&lt;/li&gt;
&lt;li&gt;The +0.1523 Context Precision gain on code is the largest single-component gain in this project: one API call, one reranker, that result&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The practical takeaway
&lt;/h2&gt;

&lt;p&gt;If you're trying to improve RAG answer quality, don't reach for a larger k first.&lt;/p&gt;

&lt;p&gt;Add a reranker. Then increase k if you want to give it more to work with.&lt;/p&gt;

&lt;p&gt;Increasing k without a reranker gives the LLM more context to get confused by. With a reranker, a larger pool means the right chunks are more likely to be in the candidate set before filtering. The order matters.&lt;/p&gt;

&lt;p&gt;A top-20 retrieve → Cohere rerank → top-5 pipeline consistently outperformed both top-5 (baseline) and top-20 without reranking across two separate data types and 90 total QA pairs. The pattern is stable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of an ongoing series on building and evaluating a production RAG system.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Full code on GitHub: &lt;a href="https://github.com/AyanArshad02/kapa-inspired-rag-mcp" rel="noopener noreferrer"&gt;Reverse Engineering YC Startup&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Previous post: &lt;a href="https://dev.to/ayanarshad02/i-tested-chunking-on-docs-pdfs-and-code-the-winner-changed-every-time-1lof"&gt;I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Mon, 04 May 2026 16:32:41 +0000</pubDate>
      <link>https://dev.to/ayanarshad02/i-tested-chunking-on-docs-pdfs-and-code-the-winner-changed-every-time-1lof</link>
      <guid>https://dev.to/ayanarshad02/i-tested-chunking-on-docs-pdfs-and-code-the-winner-changed-every-time-1lof</guid>
      <description>&lt;p&gt;I assumed chunking was a solved problem. Pick a text splitter, set 512 tokens, add some overlap, move on. After I ran structured experiments across three different data types, that assumption collapsed. The best chunker for markdown documentation actively hurt performance on code. The winner changed completely depending on what I was chunking.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;th&gt;Headline metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Markdown docs&lt;/td&gt;
&lt;td&gt;HeadingAwareChunker&lt;/td&gt;
&lt;td&gt;MRR 0.755 vs SlidingWindow 0.687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDFs&lt;/td&gt;
&lt;td&gt;RecursiveChar (512 tok)&lt;/td&gt;
&lt;td&gt;Context Recall 0.9250, RAGAS SUM 3.4249&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub code&lt;/td&gt;
&lt;td&gt;CodeBlockAwareChunker&lt;/td&gt;
&lt;td&gt;RAGAS SUM 3.5680 — highest across all experiments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RecursiveChar won on PDFs. The same chunker scored 0.5690 Context Precision on code, meaning roughly half the retrieved chunks were irrelevant. There is no universal best chunker. The data type decides.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was building
&lt;/h2&gt;

&lt;p&gt;A RAG system that ingests documentation sites, PDFs, and GitHub repositories for multiple tenants, then answers developer questions with citations. Before embedding anything, I had to decide how to chunk each source type.&lt;/p&gt;

&lt;p&gt;The standard advice is "use a recursive text splitter." Every tutorial does this. But markdown docs have headings, PDFs have paragraphs, and code has functions. A function is a complete semantic unit; split it at token 256 and you've lost the return type, the error handling, the docstring. None of that is recoverable at query time.&lt;/p&gt;

&lt;p&gt;So I ran experiments, changing one variable per experiment: the chunker.&lt;/p&gt;

&lt;p&gt;The embedding model, retrieval method, reranker, LLM, and eval set stayed fixed.&lt;br&gt;
RAGAS scored every pipeline on the same frozen question set.&lt;/p&gt;

&lt;p&gt;Three data types, three experiments. Here's what happened.&lt;/p&gt;

&lt;p&gt;The full implementation, experiment notebooks, and eval sets are on &lt;a href="https://github.com/AyanArshad02/kapa-inspired-rag-mcp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18ag03gcgh5ra7wfbk26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18ag03gcgh5ra7wfbk26.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment 1: Documentation (.md / .mdx)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Corpus:&lt;/strong&gt; FastAPI and Supabase documentation, 78 QA pairs generated by GPT-4o, frozen after generation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunkers tested:&lt;/strong&gt; HeadingAwareChunker (HAC), SlidingWindow-128, RecursiveChar, SemanticBlock&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key metric:&lt;/strong&gt; MRR (Mean Reciprocal Rank). Recall@5 tells you whether the answer is somewhere in the top 5; MRR tells you whether the right chunk comes first, not just eventually.&lt;/p&gt;
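
&lt;p&gt;To make the distinction concrete, here is a minimal sketch of both metrics over an eval set, where each entry is the 1-based rank of the first correct chunk (or None if it was never retrieved). The numbers in the example call are illustrative, not from this experiment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def recall_at_k(rank, k=5):
    # 1 if the correct chunk appears anywhere in the top k, else 0.
    return 1.0 if rank is not None and rank &amp;lt;= k else 0.0

def mrr(ranks):
    # Mean Reciprocal Rank: rank 1 scores 1.0, rank 2 scores 0.5, a miss
    # scores 0. Rewards putting the right chunk first, not just in the top 5.
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)

# Illustrative: correct chunk at rank 1, rank 3, and missing for 3 questions.
print(recall_at_k(3), mrr([1, 3, None]))  # 1.0, (1 + 1/3 + 0) / 3 ≈ 0.44
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;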

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunker&lt;/th&gt;
&lt;th&gt;MRR (no reranker)&lt;/th&gt;
&lt;th&gt;Chunks produced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HeadingAwareChunker&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.755&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;127&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SlidingWindow-128&lt;/td&gt;
&lt;td&gt;0.687&lt;/td&gt;
&lt;td&gt;259&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;HAC produced the same Recall@5 as SlidingWindow (~0.82) but with significantly better MRR. The right answer appeared at rank 1 more often. And HAC did it with 127 chunks versus SlidingWindow's 259: half the chunks, better ranking, cheaper retrieval.&lt;/p&gt;

&lt;p&gt;Why? Markdown documentation is already structured by headings. Each section covers one concept, one API endpoint, one configuration option. HAC splits exactly at those heading boundaries. SlidingWindow ignores them entirely; it cuts at token count, which means a chunk might start halfway through one concept and end halfway through the next.&lt;/p&gt;

&lt;p&gt;The embedding model then has to encode a chunk that mixes two ideas. The resulting vector is somewhere between them, and retrieval becomes imprecise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: HeadingAwareChunker.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment 2: PDFs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Corpus:&lt;/strong&gt; 5 technical PDFs (FastAPI concepts, Kubernetes architecture, React patterns, the Stripe API reference, an AWS overview) with 40 QA pairs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunkers tested:&lt;/strong&gt; SlidingWindow-128, SemanticBlock, RecursiveChar (512 tokens, 50 overlap). HeadingAwareChunker was not included here: &lt;code&gt;pymupdf4llm&lt;/code&gt; extracts PDFs to Markdown, but the heading hierarchy in PDFs is inconsistent across documents, and font-size-based heading detection is fragile enough that HAC's boundaries would be unreliable. The experiment focused on chunkers that work on paragraph-level structure, which is what the extraction reliably produces.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunker&lt;/th&gt;
&lt;th&gt;Context Recall&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RecursiveChar&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9250&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4249&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SlidingWindow-128&lt;/td&gt;
&lt;td&gt;0.8750&lt;/td&gt;
&lt;td&gt;3.3691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SemanticBlock&lt;/td&gt;
&lt;td&gt;0.8167&lt;/td&gt;
&lt;td&gt;3.2627&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RecursiveChar won by a clear margin. Context Recall 0.9250 versus SlidingWindow's 0.8750.&lt;/p&gt;

&lt;p&gt;The reason is specific to how I extracted the PDFs. I used &lt;code&gt;pymupdf4llm&lt;/code&gt;, which converts PDFs to Markdown. The output is clean paragraphs with heading markers. RecursiveChar's default split points (double newlines, then single newlines) align naturally with those paragraph boundaries. It didn't need to classify blocks or detect headings. The structure was already there; RC just respected it.&lt;/p&gt;
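
&lt;p&gt;A minimal sketch of that setup, assuming &lt;code&gt;pymupdf4llm&lt;/code&gt; and LangChain's recursive splitter; the file name is hypothetical, and the splitter counts characters by default, so matching the 512-token setting exactly would need a token-based length function:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pymupdf4llm
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract the PDF to Markdown: clean paragraphs plus heading markers.
md_text = pymupdf4llm.to_markdown("stripe_api_reference.pdf")  # hypothetical file

# Recursive splitting tries separators in order: paragraph breaks first,
# then single newlines, then spaces as a last resort.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # illustrative; measured in characters by default
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(md_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;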

&lt;p&gt;SemanticBlock failed on the Stripe API PDF. That document's navigation sidebar produced 12-token noise chunks, fragment after fragment of menu items. Those wasted retrieval slots on every single query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: RecursiveChar.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note what just happened: HAC won on docs, RC won on PDFs. Two different data types, two different winners, and the experiments were only half done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment 3: GitHub code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Corpus:&lt;/strong&gt; the &lt;code&gt;encode/httpx&lt;/code&gt; repository, 90 files (60 Python, 29 Markdown, 1 text). 50 QA pairs focused on function behavior, parameters, and return values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunkers tested:&lt;/strong&gt; CodeBlockAwareChunker (CBAC), RecursiveChar, SlidingWindow-128&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunker&lt;/th&gt;
&lt;th&gt;Ctx Precision&lt;/th&gt;
&lt;th&gt;Ctx Recall&lt;/th&gt;
&lt;th&gt;RAGAS SUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CodeBlockAwareChunker&lt;/td&gt;
&lt;td&gt;0.7812&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9700&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.5680&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SlidingWindow-128&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8278&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.9150&lt;/td&gt;
&lt;td&gt;3.4957&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RecursiveChar&lt;/td&gt;
&lt;td&gt;0.5690&lt;/td&gt;
&lt;td&gt;0.9400&lt;/td&gt;
&lt;td&gt;3.2856&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RecursiveChar scored 0.5690 on Context Precision, which means roughly half of the retrieved chunks were irrelevant to the question. The same chunker that won on PDFs failed on code.&lt;/p&gt;

&lt;p&gt;The failure mode is straightforward. Python code is full of blank lines: between a function's docstring and its body, between logical sections inside a method, between a guard clause and the main logic. RecursiveChar splits at blank lines. So it routinely bundled two or three unrelated functions into a single chunk, averaging 457 tokens. When someone asks "what does &lt;code&gt;Client.send()&lt;/code&gt; return," the retrieved chunk contains &lt;code&gt;send()&lt;/code&gt; plus &lt;code&gt;get()&lt;/code&gt; plus the &lt;code&gt;__init__&lt;/code&gt; method. Everything but a focused answer.&lt;/p&gt;

&lt;p&gt;CBAC doesn't use blank lines. For Python files it uses the &lt;code&gt;ast&lt;/code&gt; module: it finds the exact byte offsets of every function and class definition in the syntax tree, then extracts each one as a separate chunk. Zero false splits. The average chunk was 120 tokens, one complete function.&lt;/p&gt;
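
&lt;p&gt;Here is a minimal sketch of that mechanism, assuming Python 3.8+ for &lt;code&gt;ast.get_source_segment&lt;/code&gt;; this isn't the project's CBAC implementation, just the core idea it describes, and the file path is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def chunk_python_source(source):
    """Yield one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node,
            # so each chunk is one complete, unsplit definition.
            yield ast.get_source_segment(source, node)

with open("httpx/_client.py") as f:  # illustrative path
    chunks = list(chunk_python_source(f.read()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;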

&lt;p&gt;SlidingWindow-128 had the best Context Precision (0.8278); small windows avoid the bundling problem. But it split functions mid-body: a function's return value might land in the next window. That killed recall: 0.9150 versus CBAC's 0.9700.&lt;/p&gt;

&lt;p&gt;CBAC with a full reranker pipeline achieved RAGAS SUM 3.7079, the highest score across all experiments in this project (the PDF best was 3.4843).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: CodeBlockAwareChunker.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the results differ and why they shouldn't surprise you
&lt;/h2&gt;

&lt;p&gt;Each experiment picked a different chunker, but every result points at the same question: what is the natural semantic unit of this data?&lt;/p&gt;

&lt;p&gt;For markdown documentation, it's the section under a heading. That's a discrete concept, authored that way intentionally.&lt;/p&gt;

&lt;p&gt;For PDFs extracted to Markdown, it's the paragraph. The extraction tool already produces those boundaries. The chunker just has to respect them.&lt;/p&gt;

&lt;p&gt;For code, it's the function or class. A function is the smallest unit of behavior that makes sense alone. Split it and the chunk becomes meaningless without the surrounding context.&lt;/p&gt;

&lt;p&gt;Text splitters, recursive or sliding window, don't know any of this. They operate on character counts, token counts, or blank lines. None of those correspond to semantic boundaries in code. That's the root cause of RecursiveChar's 0.5690 Context Precision. It wasn't a hyperparameter problem. It was a conceptual mismatch.&lt;/p&gt;

&lt;p&gt;There's also a second effect worth naming: chunk count matters. HAC's 127 chunks versus SlidingWindow's 259 on the same corpus is not a coincidence. Fewer chunks means fewer candidates for noise to enter the retrieval pool. The embedding space is less diluted, and rank 1 is cleaner.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The optimal chunker is determined by the data type, not by chunk size or overlap settings&lt;/li&gt;
&lt;li&gt;RecursiveChar's blank-line heuristic is a real liability for code; the 0.5690 Context Precision proves it&lt;/li&gt;
&lt;li&gt;Smaller average chunks (120 tokens) outperformed larger ones (457 tokens) on code by a significant margin; chunk size is a symptom, not a cause&lt;/li&gt;
&lt;li&gt;Visual inspection of actual chunks before running RAGAS catches structural bugs that aggregate scores smooth over; I caught CBAC producing 8 KB chunks on Go files before the experiment ran&lt;/li&gt;
&lt;li&gt;Freezing the eval set before the first experiment is non-negotiable, because regenerating it mid-experiment would invalidate every comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The practical takeaway
&lt;/h2&gt;

&lt;p&gt;There is no universal best chunker.&lt;/p&gt;

&lt;p&gt;For markdown documentation: split at heading boundaries.&lt;br&gt;
For PDFs: convert to Markdown first, then split at paragraph boundaries.&lt;br&gt;
For code: use an AST parser.&lt;/p&gt;

&lt;p&gt;A generic 512-token splitter will technically work on all three. It will not be optimal on any of them. And on code specifically, the degradation is not marginal; it's a near-halving of retrieval precision.&lt;/p&gt;

&lt;p&gt;Pick the chunker that matches the semantic structure of the data, not the one that's easiest to configure.&lt;/p&gt;

&lt;p&gt;The harder version of this problem is mixed content: a PDF with embedded code blocks, or a GitHub repo where half the files are Python and half are Markdown. Each file type still needs its own chunking strategy, which means the chunker has to detect content type at the file level and route accordingly, as in the sketch below. That's what the connector layer in this project handles, but it's a separate problem worth its own post.&lt;/p&gt;
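
&lt;p&gt;As a rough sketch of that routing idea (not the project's connector layer, just its shape), with hypothetical chunker classes named after the winners above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Hypothetical chunker classes; each name refers to the strategy that won
# for that data type in the experiments above.
CHUNKER_BY_SUFFIX = {
    ".md":  HeadingAwareChunker(),
    ".mdx": HeadingAwareChunker(),
    ".pdf": RecursiveCharChunker(chunk_size=512, overlap=50),
    ".py":  CodeBlockAwareChunker(),
}

def chunk_file(path):
    # Route each file to the chunker matching its content type; fall back
    # to the recursive splitter for anything unrecognized.
    suffix = Path(path).suffix.lower()
    chunker = CHUNKER_BY_SUFFIX.get(suffix, RecursiveCharChunker(chunk_size=512, overlap=50))
    return chunker.chunk(Path(path).read_text())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;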




&lt;p&gt;&lt;em&gt;I'm building a &lt;a href="https://github.com/AyanArshad02/kapa-inspired-rag-mcp" rel="noopener noreferrer"&gt;production RAG system&lt;/a&gt; that ingests multiple source types with per-source-type chunking strategies. Future posts cover the reranker experiments, eval methodology, and the CI pipeline I built around RAGAS scores.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>programming</category>
      <category>datascience</category>
    </item>
    <item>
      <title>5 Critical Failures We Hit Shipping a Multi-Tenant RAG Chatbot to 500+ Enterprises</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Sat, 04 Apr 2026 06:26:56 +0000</pubDate>
      <link>https://dev.to/ayanarshad02/we-shipped-a-rag-chatbot-to-500-enterprise-tenants-heres-what-actually-broke-first-1jia</link>
      <guid>https://dev.to/ayanarshad02/we-shipped-a-rag-chatbot-to-500-enterprise-tenants-heres-what-actually-broke-first-1jia</guid>
      <description>&lt;p&gt;Our first enterprise tenant onboarded on a Monday.&lt;/p&gt;

&lt;p&gt;By Wednesday, 30% of their documents had been silently indexed as empty strings. No error. No exception. The chatbot just said "I don't have enough information", confidently, every time.&lt;/p&gt;

&lt;p&gt;That was Failure #1. There were four more.&lt;/p&gt;

&lt;p&gt;Here's the honest account of shipping a multi-tenant RAG chatbot to 500+ enterprise clients — what broke, in what order, and what we should have caught earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsulg56rhevv4oloo38vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsulg56rhevv4oloo38vj.png" alt=" " width="800" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The System We Built
&lt;/h2&gt;

&lt;p&gt;Before the failures, the context.&lt;/p&gt;

&lt;p&gt;We built a RAG chatbot for enterprise warehouse management. Each tenant had their own isolated knowledge base — SOPs, compliance documents, operational guides. Users queried only their tenant's data. Scale target: ~25,000 queries per day at full rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Document Upload → Type Detection → Preprocessing → Chunking → Embedding → Pinecone&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;User Query → Cache Check → Query Rewrite → Hybrid Search (BM25 + Vector) → RRF Fusion → Reranker → LLM → Response&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Two pipelines in the design. One EC2 fleet in reality, which became Failure #4.&lt;/p&gt;

&lt;p&gt;Indexing consumed from SQS. Query API sat behind an ALB. One Pinecone namespace per tenant, every query scoped to the authenticated tenant's namespace before touching the vector DB.&lt;/p&gt;

&lt;p&gt;The architecture decisions were mostly right.&lt;/p&gt;

&lt;p&gt;What broke was the assumptions underneath them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #1: The PDF Preprocessing Assumption (Week 1)
&lt;/h2&gt;

&lt;p&gt;We assumed all enterprise documents were text-based PDFs.&lt;/p&gt;

&lt;p&gt;They weren't.&lt;/p&gt;

&lt;p&gt;About 30% of what tenants uploaded were scanned PDFs: images of physical pages with no text layer. When PyMuPDF opened these files, it returned empty strings. We embedded empty strings. We indexed empty chunks. No error. No exception. Just silent failure.&lt;/p&gt;

&lt;p&gt;Users asked questions. Retrieval returned nothing relevant. The LLM said "I don't have enough information." Users assumed the chatbot was broken. They were right, just not for the reason they thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; a preprocessing gate that checks average characters per page. If avg_chars_per_page &amp;lt; 100, no text layer exists, so we trigger OCR via AWS Textract before chunking. We also added an admin-facing flag marking documents as "pending OCR" so tenants know their document is processing, not lost.&lt;/p&gt;
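
&lt;p&gt;A minimal sketch of that gate, assuming PyMuPDF for extraction; the Textract hand-off is elided since the exact OCR integration isn't shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import fitz  # PyMuPDF

OCR_THRESHOLD = 100  # average chars per page below this: likely a scanned PDF

def needs_ocr(pdf_path):
    doc = fitz.open(pdf_path)
    total_chars = sum(len(page.get_text()) for page in doc)
    avg_chars_per_page = total_chars / max(doc.page_count, 1)
    return avg_chars_per_page &amp;lt; OCR_THRESHOLD

def preprocess(pdf_path):
    if needs_ocr(pdf_path):
        # No usable text layer: mark as "pending OCR" and hand off to the
        # OCR step (AWS Textract) instead of silently indexing empty strings.
        return {"status": "pending_ocr", "path": pdf_path}
    return {"status": "ready", "path": pdf_path}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;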

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Never assume your input format. Garbage input produces zero output in RAG. Preprocessing is the most boring part of the pipeline and the most catastrophic to skip.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #2: Headers, Footers, and the Chunk Contamination Problem
&lt;/h2&gt;

&lt;p&gt;Even for text-based PDFs, every chunk was contaminated.&lt;/p&gt;

&lt;p&gt;Enterprise documents have headers and footers on every page. "Softeon WMS User Guide — Confidential — Page 14 of 203." When you chunk a 200-page document into 512-token pieces, that text bleeds into hundreds of chunks.&lt;/p&gt;

&lt;p&gt;The retrieval impact was subtle but real. Queries about "confidential" topics surfaced chunks with "Confidential" in the footer, not because the content was relevant, but because BM25 was matching on that exact term. Relevance scores were quietly polluted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; A stripping step before chunking. Text appearing in the top 5% and bottom 5% of every page gets flagged and removed. We also converted tables to markdown before chunking; a raw table extracted as "Product Price Refund Laptop 999 30 days" is useless for retrieval. The same table as structured markdown is self-contained and semantically meaningful.&lt;/p&gt;
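
&lt;p&gt;A minimal sketch of the position-based stripping, again assuming PyMuPDF; block coordinates make the top/bottom 5% rule straightforward (the table-to-markdown conversion is a separate step not shown here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import fitz  # PyMuPDF

def page_body_text(page, margin=0.05):
    """Return page text with blocks in the top/bottom 5% of the page dropped."""
    height = page.rect.height
    top_cut = height * margin
    bottom_cut = height * (1 - margin)
    kept = []
    for x0, y0, x1, y1, text, *rest in page.get_text("blocks"):
        # Skip blocks entirely inside the header band (above top_cut) or
        # entirely inside the footer band (below bottom_cut).
        if y1 &amp;lt; top_cut or y0 &amp;gt; bottom_cut:
            continue
        kept.append(text)
    return "\n".join(kept)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;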

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Most RAG tutorials skip directly to chunking size debates — 256 vs 512 tokens. They assume clean input. Real enterprise documents are not clean.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #3: The Parallel Pipeline Was Actually Sequential
&lt;/h2&gt;

&lt;p&gt;We ran BM25 and vector search in what we thought was parallel.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;

&lt;p&gt;The original implementation called BM25, waited for the result, then called Pinecone. "Parallel" on the architecture diagram. Sequential in the code. At p50 this cost us ~200ms we couldn't afford.&lt;/p&gt;

&lt;p&gt;The fix is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong — sequential
&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pinecone_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Right — parallel
&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;pinecone_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Latency becomes the max of the two, not the sum. Dropped our p95 by ~180ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; "Parallel" on a diagram and "parallel" in code are different things. Profile your pipeline stage by stage. The bottleneck is always somewhere surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #4: One Tenant's Upload Degraded Everyone's Query Latency
&lt;/h2&gt;

&lt;p&gt;This one took a week to diagnose.&lt;/p&gt;

&lt;p&gt;We noticed periodic p99 spikes, not consistent, not tied to query volume. Random, unpredictable.&lt;/p&gt;

&lt;p&gt;The cause: our indexing pipeline and query pipeline were on the same EC2 instances.&lt;/p&gt;

&lt;p&gt;When a large tenant uploaded 500 documents, the embedding loop hammered the instance CPU. Live users querying on the same instance saw response time jump from 800ms to 6+ seconds. The indexer and query service were invisible to each other in code — but very visible to each other on the metal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Complete infrastructure separation. Indexing workers on a dedicated EC2 fleet, completely outside the ALB. The query fleet has no knowledge that indexing is happening. A document upload spike now has zero effect on query latency for any tenant. The SQS queue buffers upload bursts and feeds indexing workers at a controlled pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Load isolation is not just an architectural principle. It's a user experience decision. Enterprise tenants don't care about your architecture, they care that the chatbot was slow when they needed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #5: The Namespace Isolation Gap We Almost Missed
&lt;/h2&gt;

&lt;p&gt;Multi-tenant isolation in Pinecone is handled by namespaces. One namespace per tenant. Every write tags it. Every read is scoped to it.&lt;/p&gt;

&lt;p&gt;What we almost shipped: namespace scoped at the request body level.&lt;/p&gt;

&lt;p&gt;A bad actor passing a forged tenant_id in the request body could scope the query to a different tenant's namespace. Subtle. Critical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong — trusting request body
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;

&lt;span class="c1"&gt;# Right — trusting validated token only
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;  &lt;span class="c1"&gt;# resolved from JWT at API layer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Namespace resolved exclusively from the validated JWT token at the API layer. The request body's tenant_id is ignored entirely. By the time a request reaches the vector DB call, the namespace has already been locked to the authenticated tenant — it cannot be overridden.&lt;/p&gt;

&lt;p&gt;If we had shipped the original version, any authenticated user who knew another tenant's ID could have queried their private documents. In a WMS context serving enterprise clients, that's not a security incident, that's a contract termination and a legal conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Namespace isolation is not the same as security. Enforce tenant identity at the authentication layer, not the application layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Still Haven't Built
&lt;/h2&gt;

&lt;p&gt;We don't have automated RAG evaluation in production.&lt;/p&gt;

&lt;p&gt;No RAGAS running continuously. No Precision@5 after every deployment. Human review by an internal QA team, representative queries, manual quality ratings. It works at current scale. It won't at full rollout.&lt;/p&gt;

&lt;p&gt;What I'd build next with two weeks:&lt;br&gt;
→ A golden evaluation set with 200 curated question-to-chunk pairs from real tenant queries. That's your retrieval quality baseline.&lt;br&gt;
→ RAGAS Faithfulness in CI/CD, run on every deployment; it blocks release if faithfulness drops more than 5% from baseline (see the sketch below).&lt;br&gt;
→ Context Precision tracking, which tells you whether your reranker is actually earning its latency cost.&lt;/p&gt;
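
&lt;p&gt;A rough sketch of what that CI gate could look like, assuming the ragas library's evaluate API and a hypothetical golden_set.json with question/answer/contexts/ground_truth fields; the baseline value and the 5% rule are the ones described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision

BASELINE_FAITHFULNESS = 0.91  # illustrative baseline from the golden set
MAX_DROP = 0.05               # block release on a drop of more than 5%

# Hypothetical file: one row per curated query with known-good answer and chunks.
rows = json.load(open("golden_set.json"))
result = evaluate(Dataset.from_list(rows), metrics=[faithfulness, context_precision])

mean_faithfulness = result.to_pandas()["faithfulness"].mean()
if mean_faithfulness &amp;lt; BASELINE_FAITHFULNESS * (1 - MAX_DROP):
    print(f"Faithfulness {mean_faithfulness:.3f} regressed vs. baseline, blocking release")
    sys.exit(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;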

&lt;h2&gt;
  
  
  The One Thing That Mattered Most
&lt;/h2&gt;

&lt;p&gt;RAG systems fail at the edges of the pipeline, not the center.&lt;/p&gt;

&lt;p&gt;Most engineering effort goes into the center: embedding models, reranking algorithms, chunk sizes. The real production failures happen at the edges: what goes into the indexer, what happens when two workloads compete for the same compute, and where tenant identity gets resolved.&lt;/p&gt;

&lt;p&gt;What broke first in your RAG pipeline? Drop it in the comments. The failures nobody writes about are always the most useful. I'll compile the best ones into a follow-up post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>security</category>
    </item>
    <item>
      <title>Why Our RAG System Was Silently Returning Wrong Answers — And How We Fixed It</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Wed, 11 Mar 2026 00:19:49 +0000</pubDate>
      <link>https://dev.to/ayanarshad02/why-our-rag-system-was-silently-returning-wrong-answers-and-how-we-fixed-it-386g</link>
      <guid>https://dev.to/ayanarshad02/why-our-rag-system-was-silently-returning-wrong-answers-and-how-we-fixed-it-386g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cgbkpcvxtjvlb2fylie.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cgbkpcvxtjvlb2fylie.webp" alt="Our RAG System Was Confidently Wrong"&gt;&lt;/a&gt;&lt;br&gt;
For 3 days, our RAG system was confident.&lt;/p&gt;

&lt;p&gt;Every query returned an answer. Response times were stable. No errors in the logs. By every operational metric, the system was working.&lt;/p&gt;

&lt;p&gt;Our RAGAS faithfulness score told a different story.&lt;/p&gt;

&lt;p&gt;It had dropped from 0.91 to 0.67 without a single code change.&lt;/p&gt;

&lt;p&gt;That meant roughly 1 in 3 responses was making claims our own retrieved context didn’t support. The system wasn’t crashing. It was hallucinating — silently, at scale, with complete confidence.&lt;/p&gt;

&lt;p&gt;Here is what happened.&lt;/p&gt;
&lt;h2&gt;
  
  
  The System When It Started Failing
&lt;/h2&gt;

&lt;p&gt;We were running a production RAG system serving enterprise clients across a large document corpus. Each client had their own isolated set of documents — product configuration files, setup guides, operational workflows — queried daily to answer operational questions.&lt;/p&gt;

&lt;p&gt;The state of the system when the drift began:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;25K+ documents across corpus&lt;/li&gt;
&lt;li&gt;500+ enterprise tenants&lt;/li&gt;
&lt;li&gt;1 Pinecone namespace per tenant&lt;/li&gt;
&lt;li&gt;5 chunks retrieved per query (top-K)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; GPT-4 for generation, text-embedding-ada-002 for embeddings, Pinecone with one namespace per tenant, FastAPI on ECS. Isolation was strict — no cross-tenant reads, ever.&lt;/p&gt;

&lt;p&gt;A note on the namespace decision: Pinecone namespaces share the same index and billing unit; 500 namespaces cost the same as 1. We chose namespaces over metadata filtering (a tenant_id filter on a single index) for one specific reason: metadata filtering requires every query to carry the correct filter, and one bug means Tenant A can read Tenant B's data. For enterprise clients, that risk surface isn't acceptable. Namespaces make cross-tenant leakage structurally impossible at query time.&lt;/p&gt;

&lt;p&gt;Namespaces give us defense-in-depth isolation at the infrastructure layer rather than relying on application-level filtering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; API latency, error rates, cost dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer quality monitoring:&lt;/strong&gt; None.&lt;/p&gt;

&lt;p&gt;That was the bug. Not in the code. In the architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Changed, And Why We Didn’t See It
&lt;/h2&gt;

&lt;p&gt;Three days before we caught the drift, a large batch of new documents was ingested. Different document type than the existing corpus — denser, longer sentences, more domain-specific terminology. Same domain, different structure.&lt;/p&gt;

&lt;p&gt;The new documents changed the distribution of our Pinecone namespaces. Queries that had previously retrieved highly relevant chunks now retrieved chunks that were topically related but not directly answering the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cosine similarity scores:&lt;/strong&gt; 0.76, 0.79, 0.81. High enough to clear any threshold we’d set. &lt;code&gt;text-embedding-ada-002&lt;/code&gt; couldn't distinguish between "this chunk discusses this topic" and "this chunk contains the specific answer this query is asking for." Retrieval looked confident. The chunks were wrong.&lt;/p&gt;

&lt;p&gt;GPT-4 did what LLMs do when context is adjacent-but-imprecise: it filled the gaps with plausible-sounding claims not present in the retrieved text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAGAS faithfulness:&lt;/strong&gt; 0.91 → 0.67.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Precision:&lt;/strong&gt; 0.84 → 0.61.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We had no alert for either. We found out from a user.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core failure:&lt;/strong&gt; we instrumented everything easy to measure — latency, throughput, cost, error rates — and nothing that measured correctness. We were flying blind on the one metric that determined whether the system was actually useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One clarification on how we were running RAGAS.&lt;/strong&gt; We were not evaluating live traffic — that would be prohibitively expensive and slow. We maintained a golden evaluation set of ~300 representative queries with known-good answers and source chunks, curated when the system first launched. RAGAS ran against that set nightly, and on every ingestion event. The drop from 0.91 to 0.67 showed up the morning after the batch ingestion. We just had no alert configured to catch it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What We Tried First — And Why It Wasn’t Enough
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First instinct:&lt;/strong&gt; improve retrieval.&lt;/p&gt;

&lt;p&gt;We raised the similarity threshold from 0.70 to 0.78, rejecting chunks below that score. Retrieval precision improved. We also started returning no results for legitimate queries with unusual phrasing. Users got empty responses. That was worse.&lt;/p&gt;

&lt;p&gt;We increased top-K from 5 to 10. It slightly helped recall. It also sent 2× the tokens into every LLM call, which compounded a cost problem already building at 500+ active tenants.&lt;/p&gt;

&lt;p&gt;Context Precision recovered to 0.78. Faithfulness only reached 0.81, still below our 0.85 target.&lt;/p&gt;

&lt;p&gt;The retrieval fixes were necessary. They were not sufficient. We needed a layer that caught the gap between what retrieval returned and what the LLM claimed. Retrieval improvement was treating the symptom. We needed to treat the cause.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Fix: Grounding Validation as a First-Class Architecture Layer
&lt;/h2&gt;

&lt;p&gt;We added a grounding validation step that runs after every LLM response, before it’s returned to the user.&lt;/p&gt;

&lt;p&gt;The mechanism:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract the factual claims from the generated response using a structured extraction prompt. This step is imperfect; LLM-based claim extraction can miss or misinterpret implicit claims, so we treat it as a signal, not a definitive verdict. A claim is flagged as unsupported if no retrieved chunk scores above a similarity threshold for that claim.&lt;/li&gt;
&lt;li&gt;Score each claim against the retrieved chunks and classify it as supported, unsupported, or contradicted.&lt;/li&gt;
&lt;li&gt;Flag any response where more than 15% of claims are unsupported or contradicted.&lt;/li&gt;
&lt;li&gt;Regenerate flagged responses with an explicit grounding instruction injected into the prompt.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The regeneration prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Your response must only make claims directly supported by the provided context. If the context does not contain the answer, say so explicitly. Do not infer or extrapolate.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical implementation detail:&lt;/strong&gt; The claim extraction and verification call runs against &lt;code&gt;gpt-4o-mini&lt;/code&gt;, not GPT-4. Running a full GPT-4 call for every response validation would double our inference cost and add 600–800ms of latency. With &lt;code&gt;gpt-4o-mini&lt;/code&gt;, the validation step adds approximately 180–220ms on average for a 3–5 sentence response. That number is model-dependent; it will be higher on a slower model and lower with a fine-tuned classifier.&lt;/p&gt;
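
&lt;p&gt;A condensed sketch of that validation loop, using the OpenAI SDK with &lt;code&gt;gpt-4o-mini&lt;/code&gt; as described; the prompts and the 15% threshold follow the description above, but the claim scoring is collapsed into a single structured call rather than the per-claim similarity scoring used in production, and &lt;code&gt;generate&lt;/code&gt; is a hypothetical wrapper around the normal GPT-4 answer call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from openai import OpenAI

client = OpenAI()
FLAG_THRESHOLD = 0.15  # regenerate if over 15% of claims are not supported

GROUNDING_INSTRUCTION = (
    "Your response must only make claims directly supported by the provided "
    "context. If the context does not contain the answer, say so explicitly. "
    "Do not infer or extrapolate."
)

def unsupported_claim_rate(answer, chunks):
    """Fraction of claims in the answer not supported by the retrieved chunks."""
    prompt = (
        "Extract the factual claims from the ANSWER, then label each claim "
        "supported, unsupported, or contradicted using only the CONTEXT. "
        'Reply as JSON: {"claims": [{"text": "...", "label": "..."}]}\n\n'
        "CONTEXT:\n" + "\n\n".join(chunks) + "\n\nANSWER:\n" + answer
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    claims = json.loads(resp.choices[0].message.content)["claims"]
    unsupported = sum(1 for c in claims if c["label"] != "supported")
    return unsupported / max(len(claims), 1)

def answer_with_grounding(generate, query, chunks):
    answer = generate(query, chunks)  # hypothetical GPT-4 generation wrapper
    if unsupported_claim_rate(answer, chunks) &amp;gt; FLAG_THRESHOLD:
        # Regenerate once with the explicit grounding instruction injected.
        answer = generate(query, chunks, extra_system=GROUNDING_INSTRUCTION)
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;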

&lt;p&gt;&lt;strong&gt;After deploying:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before:  Faithfulness 0.67  |  Context Precision 0.61  |  31% unsupported claim rate
After:   Faithfulness 0.91  |  Context Precision 0.87  |  &amp;lt;4% unsupported claim rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lkl8ekyylllf9dylbt5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lkl8ekyylllf9dylbt5.webp" alt="Production RAG Architecture : 500+ Enterprise Clients"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Decision — And When You’d Make It Differently
&lt;/h2&gt;

&lt;p&gt;We made an explicit trade-off: ~200ms of additional latency in exchange for verifiable answer quality.&lt;/p&gt;

&lt;p&gt;For our use case — enterprise clients making operational decisions based on the chatbot’s answers — that trade-off was not a discussion. The 200ms is noise. The trust cost of a wrong answer at enterprise scale is not.&lt;/p&gt;

&lt;p&gt;But this decision is not universal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Constraint                            Decision          Rationale
──────────────────────────────────────────────────────────────────────────────
Consumer product, SLA &amp;lt; 500ms         Run async          Log failures, don't block.
                                                         Stakes are low, UX matters more.

Low-stakes (drafting, summarisation)  Skip it            User edits the output.
                                                         Grounding matters less.

High-volume, cost-sensitive           Sample 10%         Statistical signal at
                                                         1/10th the overhead.

Enterprise / regulated / high-stakes  Mandatory sync     A wrong answer has real
                                                         downstream consequences.

Multi-tenant, strict isolation        Mandatory + audit  Every response must be
                                                         traceable to a source chunk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Principle:&lt;/strong&gt; grounding validation is always worth measuring. Whether to block on it synchronously depends on your SLA and the cost of a wrong answer in your domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern: Checking Faithfulness After Generation Is the Wrong Architecture
&lt;/h2&gt;

&lt;p&gt;Here is the deeper architectural mistake this exposed.&lt;/p&gt;

&lt;p&gt;We were checking faithfulness after generation — as a post-hoc audit — rather than as a gating condition on the response pipeline. The audit told us something was wrong. It didn’t stop a wrong answer from being returned.&lt;/p&gt;

&lt;p&gt;The correct architecture treats grounding validation as a blocking step in the response pipeline, not an observability metric reviewed after the fact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrong: Generate → Return → [async] Validate → Log failure → Review weekly
Right: Generate → Validate → [if flagged] Regenerate → Return → Log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The async pattern gives you observability. It does not give you correctness. For any system where answer quality has downstream consequences, post-hoc monitoring is not a substitute for inline validation.&lt;/p&gt;

&lt;p&gt;We caught our failure because a user noticed. That should never be the detection mechanism for a production system serving enterprise clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed Permanently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Quality monitoring is now first-class. RAGAS faithfulness and context precision are scored against our golden evaluation set on every ingestion event and every deployment. Grafana alerts fire if either drops more than 10% from the established baseline. We do not run RAGAS on live traffic; it's too slow and expensive at scale. The golden set gives us the signal we need.&lt;/li&gt;
&lt;li&gt;Document ingestion now triggers a quality gate. When new documents are ingested, we run the benchmark query set against the updated index before traffic is shifted. Faithfulness drops &amp;gt;5% → ingestion rolled back.&lt;/li&gt;
&lt;li&gt;Grounding validation is synchronous and non-configurable. The ~200ms cost is included in our SLA. Not optional for any enterprise-tier query.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The One-Line Takeaway
&lt;/h2&gt;

&lt;p&gt;Your RAG system will hallucinate. The question is whether you find out before your users do.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>rag</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
