
For 3 days, our RAG system was confident.
Every query returned an answer. Response times were stable. No errors in the logs. By every operational metric, the system was working.
Our RAGAS faithfulness score told a different story.
It had dropped from 0.91 to 0.67 without a single code change.
That meant roughly 1 in 3 responses was making claims our own retrieved context didn’t support. The system wasn’t crashing. It was hallucinating — silently, at scale, with complete confidence.
Here is what happened.
The System When It Started Failing
We were running a production RAG system serving enterprise clients across a large document corpus. Each client had their own isolated set of documents — product configuration files, setup guides, operational workflows — queried daily to answer operational questions.
The state of the system when the drift began:
- 25K+ documents across the corpus
- 500+ enterprise tenants
- 1 Pinecone namespace per tenant
- 5 chunks retrieved per query (top-K)
Stack: GPT-4 for generation, text-embedding-ada-002 for embeddings, Pinecone with one namespace per tenant, FastAPI on ECS. Isolation was strict — no cross-tenant reads, ever.
A note on the namespace decision: Pinecone namespaces share the same index and billing unit, so 500 namespaces cost the same as one. We chose namespaces over metadata filtering (a tenant_id filter on a single index) for one specific reason: metadata filtering requires every query to carry the correct filter, and a single bug means Tenant A can read Tenant B's data. For enterprise clients, that risk surface isn't acceptable. Namespaces make cross-tenant leakage structurally impossible at query time, giving us defense-in-depth isolation at the infrastructure layer rather than relying on application-level filtering.
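The routing logic this buys us is simple enough to sketch. A minimal illustration (function names and the namespace scheme are hypothetical, not our production identifiers) of why the namespace approach removes the "forgotten filter" failure mode:

```python
# Illustrative per-tenant namespace routing. The namespace is derived
# from the authenticated tenant, never from user input, so there is no
# tenant_id filter for application code to forget.

def namespace_for(tenant_id: str) -> str:
    # One Pinecone namespace per tenant: the namespace IS the
    # isolation boundary.
    return f"tenant-{tenant_id}"

def build_query(tenant_id: str, embedding: list[float], top_k: int = 5) -> dict:
    # Keyword arguments for a Pinecone Index.query() call.
    return {
        "vector": embedding,
        "top_k": top_k,
        "namespace": namespace_for(tenant_id),
        "include_metadata": True,
    }
```

With metadata filtering, the equivalent of `namespace_for` would live in every query path; here it lives in one place.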
Monitoring: API latency, error rates, cost dashboards.
Answer quality monitoring: None.
That was the bug. Not in the code. In the architecture.
What Changed, And Why We Didn’t See It
Three days before we caught the drift, a large batch of new documents was ingested. Different document type than the existing corpus — denser, longer sentences, more domain-specific terminology. Same domain, different structure.
The new documents changed the distribution of our Pinecone namespaces. Queries that had previously retrieved highly relevant chunks now retrieved chunks that were topically related but not directly answering the query.
Cosine similarity scores: 0.76, 0.79, 0.81. High enough to clear any threshold we’d set. text-embedding-ada-002 couldn't distinguish between "this chunk discusses this topic" and "this chunk contains the specific answer this query is asking for." Retrieval looked confident. The chunks were wrong.
GPT-4 did what LLMs do when context is adjacent-but-imprecise: it filled the gaps with plausible-sounding claims not present in the retrieved text.
RAGAS faithfulness: 0.91 → 0.67.
Context Precision: 0.84 → 0.61.
We had no alert for either. We found out from a user.
The core failure: we instrumented everything easy to measure — latency, throughput, cost, error rates — and nothing that measured correctness. We were flying blind on the one metric that determined whether the system was actually useful.
One clarification on how we were running RAGAS. We were not evaluating live traffic — that would be prohibitively expensive and slow. We maintained a golden evaluation set of ~300 representative queries with known-good answers and source chunks, curated when the system first launched. RAGAS ran against that set nightly, and on every ingestion event. The drop from 0.91 to 0.67 showed up the morning after the batch ingestion. We just had no alert configured to catch it.
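The alert we were missing is a few lines of logic. A sketch of baseline-drift detection over the nightly golden-set run (baseline numbers come from the incident above; the 10% threshold is illustrative):

```python
# Compare a nightly RAGAS run against the established baseline and
# return alert messages for any metric that has drifted too far.

BASELINE = {"faithfulness": 0.91, "context_precision": 0.84}
ALERT_DROP = 0.10  # alert if a metric falls more than 10% below baseline

def check_drift(nightly: dict[str, float]) -> list[str]:
    alerts = []
    for metric, base in BASELINE.items():
        score = nightly[metric]
        if score < base * (1 - ALERT_DROP):
            alerts.append(f"{metric} dropped {base:.2f} -> {score:.2f}")
    return alerts
```

Fed the post-ingestion scores (0.67 and 0.61), this fires twice; fed the baseline scores, it stays silent. We had the scores. We didn't have this function wired to a pager.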
What We Tried First — And Why It Wasn’t Enough
First instinct: improve retrieval.
We raised the similarity threshold from 0.70 to 0.78, rejecting chunks below that score. Retrieval precision improved. We also started returning no results for legitimate queries with unusual phrasing. Users got empty responses. That was worse.
We increased top-K from 5 to 10. Slightly helped recall. Sent 2× the tokens into every LLM call, which compounded a cost problem already building at 500+ active tenants.
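Both fixes amount to two constants and a filter. A sketch of the chunk selection after the changes (the numbers mirror the ones above; `matches` stands in for whatever (id, score) pairs your vector store returns):

```python
# First-pass retrieval fix: raised similarity threshold, larger top-K.

SIM_THRESHOLD = 0.78   # raised from 0.70
TOP_K = 10             # raised from 5

def select_chunks(matches: list[tuple[str, float]]) -> list[str]:
    # Keep at most TOP_K chunks, dropping anything below the threshold.
    kept = [cid for cid, score in matches[:TOP_K] if score >= SIM_THRESHOLD]
    # Failure mode we hit: a legitimate query with unusual phrasing can
    # score below the threshold on every chunk -> empty context.
    return kept
```

The empty-list case is exactly the "empty responses" problem described above: the filter has no way to distinguish a bad query from a badly embedded one.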
Context Precision recovered to 0.78. Faithfulness only reached 0.81, still below our 0.85 target.
The retrieval fixes were necessary. They were not sufficient. We needed a layer that caught the gap between what retrieval returned and what the LLM claimed. Retrieval improvement was treating the symptom. We needed to treat the cause.
The Real Fix: Grounding Validation as a First-Class Architecture Layer
We added a grounding validation step that runs after every LLM response, before it’s returned to the user.
The mechanism:
- Extract the factual claims from the generated response using a structured extraction prompt. This step is imperfect: LLM-based claim extraction can miss or misinterpret implicit claims, so we treat it as a signal, not a definitive verdict.
- Score each claim against the retrieved chunks and classify it as supported, unsupported, or contradicted. A claim counts as unsupported if no retrieved chunk scores above a similarity threshold for that claim.
- Flag any response where more than 15% of claims are unsupported or contradicted.
- Regenerate flagged responses with an explicit grounding instruction injected into the prompt.
The regeneration prompt:
“Your response must only make claims directly supported by the provided context. If the context does not contain the answer, say so explicitly. Do not infer or extrapolate.”
Critical implementation detail: the claim extraction and verification call runs against gpt-4o-mini, not GPT-4. Running a full GPT-4 call for every response validation would double our inference cost and add 600–800ms of latency. With gpt-4o-mini, the validation step adds roughly 180–220ms on average for a 3–5 sentence response. That number is model-dependent: it will be higher on a slower model and lower with a fine-tuned classifier.
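Stripped of the LLM calls, the gating rule itself is trivial. A sketch of the 15% flagging decision (the per-claim verdicts would come from the gpt-4o-mini verification call, which is not shown here):

```python
# Decision logic for the grounding gate: regenerate when more than 15%
# of extracted claims are unsupported or contradicted.

UNSUPPORTED_LIMIT = 0.15

def should_regenerate(verdicts: list[str]) -> bool:
    # verdicts: "supported", "unsupported", or "contradicted", one per claim.
    if not verdicts:
        return False  # nothing extracted -> nothing to gate on
    bad = sum(v in ("unsupported", "contradicted") for v in verdicts)
    return bad / len(verdicts) > UNSUPPORTED_LIMIT
```

One design note: the threshold is a ratio, not an absolute count, so a long response with one shaky sentence is treated differently from a three-sentence response that is one-third fabricated.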
After deploying:
Before: Faithfulness 0.67 | Context Precision 0.61 | 31% unsupported claim rate
After: Faithfulness 0.91 | Context Precision 0.87 | <4% unsupported claim rate
The Core Decision — And When You’d Make It Differently
We made an explicit trade-off: ~200ms of additional latency in exchange for verifiable answer quality.
For our use case — enterprise clients making operational decisions based on the chatbot’s answers — that trade-off was not a discussion. The 200ms is noise. The trust cost of a wrong answer at enterprise scale is not.
But this decision is not universal:
| Constraint | Decision | Rationale |
| --- | --- | --- |
| Consumer product, SLA < 500ms | Run async | Log failures, don't block. Stakes are low, UX matters more. |
| Low-stakes (drafting, summarisation) | Skip it | User edits the output. Grounding matters less. |
| High-volume, cost-sensitive | Sample 10% | Statistical signal at 1/10th the overhead. |
| Enterprise / regulated / high-stakes | Mandatory sync | A wrong answer has real downstream consequences. |
| Multi-tenant, strict isolation | Mandatory + audit | Every response must be traceable to a source chunk. |
Principle: grounding validation is always worth measuring. Whether to block on it synchronously depends on your SLA and the cost of a wrong answer in your domain.
The Anti-Pattern: Checking Faithfulness After Generation Is the Wrong Architecture
Here is the deeper architectural mistake this exposed.
We were checking faithfulness after generation — as a post-hoc audit — rather than as a gating condition on the response pipeline. The audit told us something was wrong. It didn’t stop a wrong answer from being returned.
The correct architecture treats grounding validation as a blocking step in the response pipeline, not an observability metric reviewed after the fact.
Wrong: Generate → Return → [async] Validate → Log failure → Review weekly
Right: Generate → Validate → [if flagged] Regenerate → Return → Log
The async pattern gives you observability. It does not give you correctness. For any system where answer quality has downstream consequences, post-hoc monitoring is not a substitute for inline validation.
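The "Right" pipeline is a small reordering of calls. A sketch of the blocking shape with one regeneration attempt (`generate`, `validate`, `regenerate`, and `log` are stand-ins for the real LLM and logging calls, injected here so the control flow is the only thing on display):

```python
# Blocking validation pipeline: the response cannot reach the user
# without passing through the grounding gate first.

def answer(query, chunks, generate, validate, regenerate, log):
    response = generate(query, chunks)
    result = validate(response, chunks)
    if result["flagged"]:
        # One retry with the explicit grounding instruction injected.
        response = regenerate(query, chunks)
        result = validate(response, chunks)
    log(query, response, result)  # observability comes last, after gating
    return response
```

Note what moved: `log` went from being the safety mechanism to being a side effect. The gate is the `if`, not the dashboard.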
We caught our failure because a user noticed. That should never be the detection mechanism for a production system serving enterprise clients.
What We Changed Permanently
- Quality monitoring is now first-class. RAGAS faithfulness and context precision are scored against our golden evaluation set on every ingestion event and every deployment. Grafana alerts fire if either drops more than 10% from the established baseline. We do not run RAGAS on live traffic; it's too slow and expensive at scale. The golden set gives us the signal we need.
- Document ingestion now triggers a quality gate. When new documents are ingested, we run the benchmark query set against the updated index before traffic is shifted. Faithfulness drops >5% → ingestion rolled back.
- Grounding validation is synchronous and non-configurable. The ~200ms cost is included in our SLA. Not optional for any enterprise-tier query.
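The ingestion gate from the second point reduces to one comparison. A sketch of the rollback decision (`run_benchmark` and `rollback` stand in for the real benchmark run against the updated index and the real rollback operation; the 5% threshold is the one stated above):

```python
# Ingestion quality gate: benchmark the updated index before shifting
# traffic, and roll back if faithfulness drops more than 5%.

ROLLBACK_DROP = 0.05

def ingestion_gate(baseline_faithfulness: float, run_benchmark, rollback) -> bool:
    new_score = run_benchmark()
    if new_score < baseline_faithfulness * (1 - ROLLBACK_DROP):
        rollback()
        return False  # ingestion rejected, old index stays live
    return True       # ingestion accepted
```

Had this gate existed, the batch ingestion that started this incident would have been rejected the night it happened, not discovered by a user three days later.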
The One-Line Takeaway
Your RAG system will hallucinate. The question is whether you find out before your users do.
