
For 3 days, our RAG system was confident.
Every query returned an answer. Response times were stable. No errors in the logs. By every operational metric, the system was working.
Our RAGAS faithfulness score told a different story.
It had dropped from 0.91 to 0.67 without a single code change.
That meant roughly 1 in 3 responses was making claims our own retrieved context didn’t support. The system wasn’t crashing. It was hallucinating — silently, at scale, with complete confidence.
Here is what happened.
The System When It Started Failing
We were running a production RAG system serving enterprise clients across a large document corpus. Each client had their own isolated set of documents — product configuration files, setup guides, operational workflows — queried daily to answer operational questions.
The state of the system when the drift began:
- 25K+ documents across the corpus
- 500+ enterprise tenants
- 1 Pinecone namespace per tenant
- 5 chunks retrieved per query (top-K)
Stack: GPT-4 for generation, text-embedding-ada-002 for embeddings, Pinecone with one namespace per tenant, FastAPI on ECS. Isolation was strict — no cross-tenant reads, ever.
A note on the namespace decision: Pinecone namespaces share the same index and billing unit, so 500 namespaces cost the same as one. We chose namespaces over metadata filtering (a tenant_id filter on a single index) for one specific reason: metadata filtering requires every query to carry the correct filter, and a single bug means Tenant A can read Tenant B's data. For enterprise clients, that risk surface isn't acceptable. Namespaces make cross-tenant leakage structurally impossible at query time, giving us defense-in-depth isolation at the infrastructure layer rather than relying on application-level filtering.
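The routing logic this buys us is simple enough to sketch. A minimal illustration (function names and the namespace scheme are hypothetical, not our production identifiers) of why the namespace approach removes the "forgotten filter" failure mode:

```python
# Illustrative per-tenant namespace routing. The namespace is derived
# from the authenticated tenant, never from user input, so there is no
# tenant_id filter for application code to forget.

def namespace_for(tenant_id: str) -> str:
    # One Pinecone namespace per tenant: the namespace IS the
    # isolation boundary.
    return f"tenant-{tenant_id}"

def build_query(tenant_id: str, embedding: list[float], top_k: int = 5) -> dict:
    # Keyword arguments for a Pinecone Index.query() call.
    return {
        "vector": embedding,
        "top_k": top_k,
        "namespace": namespace_for(tenant_id),
        "include_metadata": True,
    }
```

With metadata filtering, the equivalent of `namespace_for` would live in every query path; here it lives in one place.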
Monitoring: API latency, error rates, cost dashboards.
Answer quality monitoring: None.
That was the bug. Not in the code. In the architecture.
What Changed, And Why We Didn’t See It
Three days before we caught the drift, a large batch of new documents was ingested. Different document type than the existing corpus — denser, longer sentences, more domain-specific terminology. Same domain, different structure.
The new documents changed the distribution of our Pinecone namespaces. Queries that had previously retrieved highly relevant chunks now retrieved chunks that were topically related but not directly answering the query.
Cosine similarity scores: 0.76, 0.79, 0.81. High enough to clear any threshold we’d set. text-embedding-ada-002 couldn't distinguish between "this chunk discusses this topic" and "this chunk contains the specific answer this query is asking for." Retrieval looked confident. The chunks were wrong.
GPT-4 did what LLMs do when context is adjacent-but-imprecise: it filled the gaps with plausible-sounding claims not present in the retrieved text.
RAGAS faithfulness: 0.91 → 0.67.
Context Precision: 0.84 → 0.61.
We had no alert for either. We found out from a user.
The core failure: we instrumented everything easy to measure — latency, throughput, cost, error rates — and nothing that measured correctness. We were flying blind on the one metric that determined whether the system was actually useful.
One clarification on how we were running RAGAS. We were not evaluating live traffic — that would be prohibitively expensive and slow. We maintained a golden evaluation set of ~300 representative queries with known-good answers and source chunks, curated when the system first launched. RAGAS ran against that set nightly, and on every ingestion event. The drop from 0.91 to 0.67 showed up the morning after the batch ingestion. We just had no alert configured to catch it.
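The alert we were missing is a few lines of logic. A sketch of baseline-drift detection over the nightly golden-set run (baseline numbers come from the incident above; the 10% threshold is illustrative):

```python
# Compare a nightly RAGAS run against the established baseline and
# return alert messages for any metric that has drifted too far.

BASELINE = {"faithfulness": 0.91, "context_precision": 0.84}
ALERT_DROP = 0.10  # alert if a metric falls more than 10% below baseline

def check_drift(nightly: dict[str, float]) -> list[str]:
    alerts = []
    for metric, base in BASELINE.items():
        score = nightly[metric]
        if score < base * (1 - ALERT_DROP):
            alerts.append(f"{metric} dropped {base:.2f} -> {score:.2f}")
    return alerts
```

Fed the post-ingestion scores (0.67 and 0.61), this fires twice; fed the baseline scores, it stays silent. We had the scores. We didn't have this function wired to a pager.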
What We Tried First — And Why It Wasn’t Enough
First instinct: improve retrieval.
We raised the similarity threshold from 0.70 to 0.78, rejecting chunks below that score. Retrieval precision improved. We also started returning no results for legitimate queries with unusual phrasing. Users got empty responses. That was worse.
We increased top-K from 5 to 10. Slightly helped recall. Sent 2× the tokens into every LLM call, which compounded a cost problem already building at 500+ active tenants.
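Both fixes amount to two constants and a filter. A sketch of the chunk selection after the changes (the numbers mirror the ones above; `matches` stands in for whatever (id, score) pairs your vector store returns):

```python
# First-pass retrieval fix: raised similarity threshold, larger top-K.

SIM_THRESHOLD = 0.78   # raised from 0.70
TOP_K = 10             # raised from 5

def select_chunks(matches: list[tuple[str, float]]) -> list[str]:
    # Keep at most TOP_K chunks, dropping anything below the threshold.
    kept = [cid for cid, score in matches[:TOP_K] if score >= SIM_THRESHOLD]
    # Failure mode we hit: a legitimate query with unusual phrasing can
    # score below the threshold on every chunk -> empty context.
    return kept
```

The empty-list case is exactly the "empty responses" problem described above: the filter has no way to distinguish a bad query from a badly embedded one.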
Context Precision recovered to 0.78. Faithfulness only reached 0.81, still below our 0.85 target.
The retrieval fixes were necessary. They were not sufficient. We needed a layer that caught the gap between what retrieval returned and what the LLM claimed. Retrieval improvement was treating the symptom. We needed to treat the cause.
The Real Fix: Grounding Validation as a First-Class Architecture Layer
We added a grounding validation step that runs after every LLM response, before it’s returned to the user.
The mechanism:
- Extract the factual claims from the generated response using a structured extraction prompt. This step is imperfect: LLM-based claim extraction can miss or misinterpret implicit claims, so we treat it as a signal, not a definitive verdict.
- Score each claim against the retrieved chunks and classify it as supported, unsupported, or contradicted. A claim counts as unsupported if no retrieved chunk scores above a similarity threshold for that claim.
- Flag any response where more than 15% of claims are unsupported or contradicted.
- Regenerate flagged responses with an explicit grounding instruction injected into the prompt.
The regeneration prompt:
“Your response must only make claims directly supported by the provided context. If the context does not contain the answer, say so explicitly. Do not infer or extrapolate.”
Critical implementation detail: the claim extraction and verification call runs against gpt-4o-mini, not GPT-4. Running a full GPT-4 call for every response validation would double our inference cost and add 600–800ms of latency. With gpt-4o-mini, the validation step adds roughly 180–220ms on average for a 3–5 sentence response. That number is model-dependent: it will be higher on a slower model and lower with a fine-tuned classifier.
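Stripped of the LLM calls, the gating rule itself is trivial. A sketch of the 15% flagging decision (the per-claim verdicts would come from the gpt-4o-mini verification call, which is not shown here):

```python
# Decision logic for the grounding gate: regenerate when more than 15%
# of extracted claims are unsupported or contradicted.

UNSUPPORTED_LIMIT = 0.15

def should_regenerate(verdicts: list[str]) -> bool:
    # verdicts: "supported", "unsupported", or "contradicted", one per claim.
    if not verdicts:
        return False  # nothing extracted -> nothing to gate on
    bad = sum(v in ("unsupported", "contradicted") for v in verdicts)
    return bad / len(verdicts) > UNSUPPORTED_LIMIT
```

One design note: the threshold is a ratio, not an absolute count, so a long response with one shaky sentence is treated differently from a three-sentence response that is one-third fabricated.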
After deploying:
Before: Faithfulness 0.67 | Context Precision 0.61 | 31% unsupported claim rate
After: Faithfulness 0.91 | Context Precision 0.87 | <4% unsupported claim rate
The Core Decision — And When You’d Make It Differently
We made an explicit trade-off: ~200ms of additional latency in exchange for verifiable answer quality.
For our use case — enterprise clients making operational decisions based on the chatbot’s answers — that trade-off was not a discussion. The 200ms is noise. The trust cost of a wrong answer at enterprise scale is not.
But this decision is not universal:
| Constraint | Decision | Rationale |
| --- | --- | --- |
| Consumer product, SLA < 500ms | Run async | Log failures, don't block. Stakes are low, UX matters more. |
| Low-stakes (drafting, summarisation) | Skip it | User edits the output. Grounding matters less. |
| High-volume, cost-sensitive | Sample 10% | Statistical signal at 1/10th the overhead. |
| Enterprise / regulated / high-stakes | Mandatory sync | A wrong answer has real downstream consequences. |
| Multi-tenant, strict isolation | Mandatory + audit | Every response must be traceable to a source chunk. |
Principle: grounding validation is always worth measuring. Whether to block on it synchronously depends on your SLA and the cost of a wrong answer in your domain.
The Anti-Pattern: Checking Faithfulness After Generation Is the Wrong Architecture
Here is the deeper architectural mistake this exposed.
We were checking faithfulness after generation — as a post-hoc audit — rather than as a gating condition on the response pipeline. The audit told us something was wrong. It didn’t stop a wrong answer from being returned.
The correct architecture treats grounding validation as a blocking step in the response pipeline, not an observability metric reviewed after the fact.
Wrong: Generate → Return → [async] Validate → Log failure → Review weekly
Right: Generate → Validate → [if flagged] Regenerate → Return → Log
The async pattern gives you observability. It does not give you correctness. For any system where answer quality has downstream consequences, post-hoc monitoring is not a substitute for inline validation.
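The "Right" pipeline is a small reordering of calls. A sketch of the blocking shape with one regeneration attempt (`generate`, `validate`, `regenerate`, and `log` are stand-ins for the real LLM and logging calls, injected here so the control flow is the only thing on display):

```python
# Blocking validation pipeline: the response cannot reach the user
# without passing through the grounding gate first.

def answer(query, chunks, generate, validate, regenerate, log):
    response = generate(query, chunks)
    result = validate(response, chunks)
    if result["flagged"]:
        # One retry with the explicit grounding instruction injected.
        response = regenerate(query, chunks)
        result = validate(response, chunks)
    log(query, response, result)  # observability comes last, after gating
    return response
```

Note what moved: `log` went from being the safety mechanism to being a side effect. The gate is the `if`, not the dashboard.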
We caught our failure because a user noticed. That should never be the detection mechanism for a production system serving enterprise clients.
What We Changed Permanently
- Quality monitoring is now first-class. RAGAS faithfulness and context precision are scored against our golden evaluation set on every ingestion event and every deployment. Grafana alerts fire if either drops more than 10% from the established baseline. We do not run RAGAS on live traffic; it's too slow and expensive at scale. The golden set gives us the signal we need.
- Document ingestion now triggers a quality gate. When new documents are ingested, we run the benchmark query set against the updated index before traffic is shifted. Faithfulness drops >5% → ingestion rolled back.
- Grounding validation is synchronous and non-configurable. The ~200ms cost is included in our SLA. Not optional for any enterprise-tier query.
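The ingestion gate from the second point reduces to one comparison. A sketch of the rollback decision (`run_benchmark` and `rollback` stand in for the real benchmark run against the updated index and the real rollback operation; the 5% threshold is the one stated above):

```python
# Ingestion quality gate: benchmark the updated index before shifting
# traffic, and roll back if faithfulness drops more than 5%.

ROLLBACK_DROP = 0.05

def ingestion_gate(baseline_faithfulness: float, run_benchmark, rollback) -> bool:
    new_score = run_benchmark()
    if new_score < baseline_faithfulness * (1 - ROLLBACK_DROP):
        rollback()
        return False  # ingestion rejected, old index stays live
    return True       # ingestion accepted
```

Had this gate existed, the batch ingestion that started this incident would have been rejected the night it happened, not discovered by a user three days later.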
The One-Line Takeaway
Your RAG system will hallucinate. The question is whether you find out before your users do.
