Remember Space Invaders. Level one, the invaders crawl. You pick them off easily. You feel like you have a system.
Level five, they move faster. You adapt. Better aim, better timing. You still clear the screen.
Level ten, the gaps are almost gone. You are playing better than you ever have. It does not matter. The invaders reach the bottom anyway. The ceiling was never about your skill. It was built into the game.
RAG works the same way. The data proves it.
Riddhesh wrote an honest piece on RAG in 2026 that is worth reading. He gets most of it right. But the data leads to a conclusion his article stops short of drawing.
RAG does not just have problems. It has a ceiling. Hybrid search, reranking, GraphRAG, and agentic pipelines all get you closer to it. None of them move it.
TL;DR
- Basic RAG achieves 30-60% accuracy on complex domain queries, documented across multiple independent benchmarks.
- Well-engineered production RAG achieves 70-85% on controlled benchmarks. On real enterprise documents, independent studies consistently find the commercial floor sits at 60-75%.
- Every current frontier model, including GPT-5, Claude Sonnet 4.5, and Grok-4, exceeds 10% hallucination on enterprise-length document summarisation. Vectara HHEM Leaderboard, April 2026.
- Westlaw AI is accurate on 59% of legal queries. LexisNexis Lexis+ AI, the best performer, is accurate on 65%. Peer-reviewed study, Journal of Empirical Legal Studies, Stanford and Yale, 2025.
- The hallucination problem is not narrowing. It is widening. GPT-5.5 simultaneously holds the highest creative writing score of any frontier model tested and an 86% hallucination rate on the AA-Omniscience benchmark.
- A three-layer governance stack of input governance, AI accuracy, and certified truth closes the evaluation gap that every honest RAG analysis identifies but none resolves.
The 96% Number Is Real. It Also Does Not Apply to the Hard Cases.
The "40-96% hallucination reduction" figure Riddhesh cites comes from well-tuned pipelines running hybrid retrieval and reranking on straightforward factual queries. That range is accurate for those conditions.
The problem is that those conditions do not describe the queries that carry real risk in legal, healthcare, and financial domains.
The academic literature on real-world RAG performance tells a more difficult story. Naive or basic RAG achieves 30-60% accuracy. Well-engineered production RAG achieves 70-85% on narrow benchmarks. Advanced hybrid and agentic RAG reaches 85-90% on those same narrow benchmarks. On complex multi-hop reasoning and table-based tasks, even research-grade systems remain below 80%. Most production RAG deployments on complex enterprise documents sit in the 60-75% trustworthy range. That is the realistic commercial floor.
The Vals AI Legal Research Report, October 2025, provides one of the clearest measurements of what complexity does to RAG accuracy. Legal AI tools scored 78-81% on straightforward tasks. On complex multi-jurisdictional queries, the same tools dropped 14 accuracy points. A system that loses 14 percentage points when the query gets harder is not a reliable instrument for the work it is deployed to do.
| RAG System / Study | Query Type | Accuracy |
|---|---|---|
| Basic RAG (multiple benchmarks) | Complex domain queries | 30-60% |
| Production RAG v2/v3 | Controlled benchmark queries | 70-85% |
| Production RAG v2/v3 | Real enterprise documents | 60-75% (commercial floor) |
| Legal AI tools (Vals AI, Oct 2025) | Complex multi-jurisdictional | 64-67% (14pt drop from simple) |
| Westlaw AI (Thomson Reuters) | Legal queries | 59% (41% hallucination) |
| LexisNexis Lexis+ AI | Legal queries | 65% (35% hallucination) |
Sources: Magesh et al., "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools", Journal of Empirical Legal Studies, peer-reviewed and published April 2025, Stanford and Yale. Vals AI Legal Research Report, October 2025. RAGBench, Friel et al., 2024. FinanceBench, Islam et al., 2023.
Westlaw and LexisNexis are the most mature, most heavily funded legal RAG deployments in the world. They explicitly claimed that RAG largely prevents or eliminates hallucination in legal research. The peer-reviewed evidence found the opposite. In April 2026, a US federal court found errors in a legal brief prepared using Westlaw's CoCounsel, the newest version of the product released after the study was conducted. The problem has not been fixed by the latest release.
That is a description of the ceiling.
The Hallucination Problem Is Getting Worse, Not Better
The most important finding in the current literature is not that RAG hallucinates. It is that the hallucination problem is widening as models get more capable.
The Vectara HHEM Leaderboard, updated April 2026, measures summarisation faithfulness in RAG pipelines: given a source document, does the model stay grounded in what it actually contains? The April 2026 results are unambiguous. All current frontier models, including GPT-5, Claude Sonnet 4.5, and Grok-4, exceed 10% hallucination on enterprise-length document summarisation. Vectara's explanation is that more capable models overthink summarisation tasks: their reasoning causes them to deviate from source material in ways that smaller, more focused models do not. Raw capability and grounding faithfulness do not move together.
GPT-5.5 makes this precise. On 29 April 2026, it topped the Short-Story Creative Writing Benchmark with a score of 3.01, the highest of any frontier model. Benchmarked independently in the same period, the AA-Omniscience hallucination evaluation recorded an 86% hallucination rate for the same model. GPT-5.5 holds the highest creative writing score of any frontier model tested alongside one of the highest hallucination rates on record.
This is not a coincidence. The creative writing benchmark measures a model's ability to produce maximally convincing, coherent output regardless of factual grounding. The hallucination benchmark measures the same model's tendency to produce plausible, coherent content not supported by the source material. These are the same underlying capability in two different evaluation contexts. A model that excels at constructing convincing output regardless of truth is, by the same measure, highly capable at constructing convincing output that contradicts the documents it was given.
| Model | Hallucination Rate | Benchmark |
|---|---|---|
| GPT-5.5 | 86% | AA-Omniscience, April 2026 |
| Gemini 3 Pro | 88% | AA-Omniscience, November 2025 |
| Claude Opus 4.7 | 36% | AA-Omniscience, April 2026 |
| Gemini 3.1 Pro Preview | 50% | AA-Omniscience, April 2026 |
| All frontier models (GPT-5, Claude Sonnet 4.5, Grok-4) | More than 10% | Vectara HHEM, enterprise dataset, April 2026 |
Sources: Vectara HHEM Hallucination Leaderboard, April 2026. AA-Omniscience benchmark, Artificial Analysis, as reported by The Decoder.
The industry is investing billions of dollars to build models that are better at producing convincing output. Every dollar of that investment widens the hallucination gap, because capability and confabulation are driven by the same architectural property. There is no version of this trajectory in which the hallucination problem solves itself.
Why the Ceiling Is Built Into the Game
Back to Space Invaders. The game gets harder not because you play worse but because the mechanic of aim, fire, and move has a hard speed limit. Once the invaders move faster than human reaction time, the outcome is fixed by the game design, not the player.
RAG has the same problem. The mechanic is this:
- A query arrives.
- The retrieval layer finds chunks that are semantically similar to the query.
- Those chunks go into the prompt.
- The model generates a response from the injected context.
Step four is where the ceiling lives. The model is doing autoregressive generation, predicting the most plausible next token from the context it was handed. It is not verifying anything. It is generating from a constrained distribution.
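To make the mechanic concrete, here is a minimal sketch in Python. The `retriever` and `llm` callables are hypothetical stand-ins for whatever vector store and model a real pipeline wires in; nothing about the sketch is specific to any vendor.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

def answer_query(query: str, retriever, llm, top_k: int = 5) -> str:
    """Minimal sketch of the four-step RAG mechanic.

    `retriever` and `llm` are hypothetical callables standing in for
    whatever vector store and model a real pipeline uses.
    """
    # 1. A query arrives.
    # 2. The retrieval layer finds chunks semantically similar to the query.
    chunks: list[Chunk] = retriever(query, top_k=top_k)

    # 3. Those chunks go into the prompt.
    context = "\n---\n".join(c.text for c in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 4. The model generates a response from the injected context.
    #    Nothing in this step verifies the answer; it is next-token
    #    prediction over a constrained distribution.
    return llm(prompt)
```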
A RAG system is four compounding failure points: document parsing, retrieval, reranking, and answer generation. Even if each stage performs at 95% reliability, the end-to-end result is 0.95 x 0.95 x 0.95 x 0.95, approximately 81% correctness. A pipeline where every individual component is near-perfect still delivers one wrong answer in five. The failure is not in any single component. It is the architecture.
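The arithmetic is easy to check:

```python
# Four stages at 95% reliability each multiply out to roughly 81% end to end.
stages = {"parsing": 0.95, "retrieval": 0.95, "reranking": 0.95, "generation": 0.95}

end_to_end = 1.0
for name, reliability in stages.items():
    end_to_end *= reliability

print(f"End-to-end correctness: {end_to_end:.1%}")              # ~81.5%
print(f"Expected wrong answers per five queries: {5 * (1 - end_to_end):.2f}")  # ~0.93, roughly one in five
```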
The Meta CRAG benchmark proved this precisely. Adding RAG improved accuracy from 34% to 44%. It shifted the ceiling slightly upward while leaving the fundamental failure mode entirely intact.
On complex queries that require reasoning across multiple documents, interpreting temporal dependencies, or applying regulatory frameworks, the retrieval step starts to break down. It returns partially relevant chunks. It returns the right chunks in the wrong order. It misses chunks entirely. The model then generates a plausible-sounding answer from imperfect material.
In legal, healthcare, and finance, almost every consequential query is complex in exactly this way. A contract dispute touches multiple clauses, multiple precedents, and multiple jurisdictional rules simultaneously. A diagnosis requires integrating patient history, current symptoms, drug interactions, and current clinical guidelines. A compliance determination requires reading regulations, internal policy, transaction history, and reporting requirements together. These are the queries where the ceiling shows, and they are precisely the queries where getting it wrong costs the most.
The $67.4 Billion Problem
The cost of AI confabulation in high-stakes domains is not hypothetical.
The paper "The Misdiagnosis That Cost $67.4 Billion" (SSRN 6609519) documents the aggregate economic cost of AI-generated errors in healthcare, legal, and financial contexts. The figure represents losses from outputs that were plausible, confident, and wrong.
In April 2026, the South African government withdrew its draft national AI policy after independent verification found that at least 6 of its 67 academic citations were fabricated by AI. The journals referenced were real. The cited papers did not exist. Communications Minister Solly Malatsi stated publicly that the failure had compromised the integrity and credibility of the draft policy. The vendor billed for every token of that policy. The vendor disclaimed all liability for its contents. The minister faced public accountability.
You are billed for every token, correct or confabulated. The vendor disclaims liability by design. The insurer will not cover the risk. The court holds you responsible. This is the commercial reality of deploying AI without a governance architecture.
The Evaluation Gap Nobody Has Closed
Riddhesh flags the evaluation gap as a weakness of RAG v3. He is right, and he is not the first to say so. It appears in every serious RAG analysis written in the last two years.
The evaluation gap is simple: you cannot know, at the moment a query is answered, whether that answer is correct. You can measure accuracy across a test set after the fact. You cannot certify a specific answer to a specific query before it reaches the user.
Ragas, TruLens, and every other evaluation framework give you a distribution. They tell you the system is 83% accurate on average. They cannot tell you whether this contract interpretation or this clinical recommendation is in the 83% or the 17%.
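A toy illustration of the gap: the aggregate score is knowable, the status of any individual answer is not, because at query time nothing distinguishes the 83% from the 17%. The 83% figure and the simulated system below are illustrative, not measurements from any framework.

```python
import random

random.seed(7)
test_set_accuracy = 0.83  # the kind of aggregate figure an evaluation framework reports

def rag_answer_is_correct() -> bool:
    # Stand-in for a real system whose long-run accuracy is 83%.
    return random.random() < test_set_accuracy

# At query time there is no per-answer signal, only the distribution.
this_answer = rag_answer_is_correct()
print(f"Measured accuracy over the test set: {test_set_accuracy:.0%}")
print(f"Is this specific answer correct? {this_answer}  <- only knowable after the fact")
```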
An audit trail applied to RAG output records what the system produced. It does not certify that what was produced is true. A dated, timestamped log of hallucinated legal clauses and fabricated citations is not a compliance asset. It is a record of liability.
For a customer support bot, that is acceptable.
For a legal brief or a clinical recommendation, it is not.
Three Layers, Not One
The answer to this problem is not a better retrieval pipeline. It is a governance stack with three distinct layers.
Layer 1: Input Governance
Before any model processes a query, the input needs to be classified and governed. Is this a legal query? Medical? Financial? General knowledge? The domain determines the accuracy standard that applies, the verification steps that run, and the format the output must follow.
Without this layer, a model applies the same probabilistic generation process to "write me a poem" and "interpret this indemnity clause." The stakes are different. The process should reflect that.
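As a hedged sketch of what this layer might look like: the domain label decides which accuracy standard, verification steps, and output format apply. The domains, keyword rules, thresholds, and policy fields below are illustrative assumptions, not a production classifier.

```python
from dataclasses import dataclass

@dataclass
class GovernancePolicy:
    domain: str
    accuracy_threshold: float
    verification_steps: list[str]
    output_format: str

# Illustrative policies: regulated domains get stricter thresholds and checks.
POLICIES = {
    "legal":   GovernancePolicy("legal",   0.95, ["citation_check", "consensus"], "memo"),
    "medical": GovernancePolicy("medical", 0.95, ["guideline_check", "consensus"], "structured_note"),
    "general": GovernancePolicy("general", 0.80, [], "free_text"),
}

def classify_domain(query: str) -> str:
    # Toy keyword classifier; a real layer would use a proper model or rules engine.
    q = query.lower()
    if any(term in q for term in ("clause", "indemnity", "jurisdiction")):
        return "legal"
    if any(term in q for term in ("diagnosis", "dosage", "symptom")):
        return "medical"
    return "general"

def govern(query: str) -> GovernancePolicy:
    return POLICIES[classify_domain(query)]

print(govern("Interpret this indemnity clause").domain)  # legal
print(govern("Write me a poem").domain)                   # general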
Layer 2: AI Accuracy Through Structured Context
Confabulation in AI output has a single root cause: ambiguity in AI input. The model fills gaps with statistically probable content when the input does not provide a complete specification. The solution is to eliminate those gaps before the model is ever invoked.
A structured approach decomposes the subject matter into four non-overlapping, deterministic pillars of context before the model sees anything. Together these pillars form a complete specification with no gaps. The model transcribes from that specification. It does not interpret, estimate, or invent. Identical input produces identical output, every time. This is deterministic AI, not probabilistic AI.
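As a rough sketch of the idea, under the assumption that the four pillars are simply four required, non-overlapping context fields (the article does not enumerate them), the model is only invoked once the specification is complete:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Specification:
    pillar_1: str  # placeholder name: the pillars are not enumerated in this article
    pillar_2: str
    pillar_3: str
    pillar_4: str

    def is_complete(self) -> bool:
        # The model is only invoked when every pillar is populated,
        # leaving no gaps for it to fill probabilistically.
        return all([self.pillar_1, self.pillar_2, self.pillar_3, self.pillar_4])

def build_prompt(spec: Specification) -> str:
    if not spec.is_complete():
        raise ValueError("Incomplete specification: resolve missing context before invoking the model")
    # Identical specification -> identical prompt; with generation pinned down,
    # identical output follows.
    return "\n\n".join([spec.pillar_1, spec.pillar_2, spec.pillar_3, spec.pillar_4])
```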
This is what moves accuracy on complex queries from the mid-60s to 70% and well above, through structured elimination of input ambiguity rather than retrieval heuristics. The basic four-pillar single-pass implementation of this approach, tested in April 2026 at default browser chat temperature with no API control, achieved 70% exact accuracy against basic RAG's 20% on identical real-world unstructured content. That is the entry-level result. It already matches or exceeds the best commercial RAG systems on complex queries.
Layer 3: Certified Truth
This layer closes the evaluation gap.
Multiple AI models are queried independently across five scoring pillars. The system also performs reference integrity verification, checking that cited sources adequately support the specific claims they are attributed to. Where responses converge above a 95% confidence threshold across all models and all pillars, the output is certified. Where they fall below 95%, the query does not go to general human review. It goes to a structured dossier that identifies precisely what failed: the specific reference that does not adequately support the claim, the specific scoring pillar that fell short, the specific discrepancy between model responses. The human reviewer resolves the specific flagged item. That decision is logged individually as part of a permanent audit trail.
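A hedged sketch of that certification gate follows. The data shapes, pillar names, and scoring mechanics are illustrative assumptions; the 95% threshold, the reference integrity check, and the escalation dossier follow the description above.

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.95

@dataclass
class PillarScores:
    model: str
    scores: dict[str, float]    # scoring pillar -> confidence for this model
    references_supported: bool  # result of the reference integrity check

@dataclass
class Dossier:
    failed_items: list[str] = field(default_factory=list)

def certify(results: list[PillarScores]) -> tuple[bool, Dossier]:
    dossier = Dossier()
    for r in results:
        if not r.references_supported:
            dossier.failed_items.append(f"{r.model}: cited source does not support the claim")
        for pillar, score in r.scores.items():
            if score < CONFIDENCE_THRESHOLD:
                dossier.failed_items.append(f"{r.model}: pillar '{pillar}' at {score:.0%} (< 95%)")
    certified = not dossier.failed_items
    return certified, dossier

# Below threshold, nothing is released: the dossier routes the specific failed
# items to a human reviewer, and each resolution is logged for the audit trail.
```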
The certified output is not 70% accurate. It is 95%+ accurate by design, because everything below that threshold never reaches the user without a human resolving the specific identified issue first. This satisfies the documented evidence requirements of ISO 42001, the international standard for AI management systems.
A RAG audit trail documents what an uncertain system produced. A governed audit trail proves what a certified architecture verified, which references it validated, which citations it checked, which discrepancies it flagged, and which human decisions resolved them. These are not the same instrument.
| Layer | Function | What It Replaces |
|---|---|---|
| Input Governance | Domain classification, input validation before model invocation | Prompt engineering and guardrails |
| AI Accuracy | Four-pillar structured context eliminates input ambiguity | RAG retrieval pipeline |
| Certified Truth | Multi-model consensus, reference integrity verification, human review gate at 95% | RAG evaluation frameworks |
The Reason RAG Can Never Win: Stateless by Design
There is one architectural fact about every RAG system in production today that the industry does not advertise. Every single one of them runs on stateless API calls.
Each request starts from zero. The model has no memory of the previous query. No accumulating understanding of the domain. No context that carries forward. The knowledge base is external and static. The model itself contributes nothing that persists between calls. When query 48 arrives, the model has no knowledge that query 47 ever happened.
This is not an engineering oversight. It is a deliberate architectural choice that the entire API-based AI industry is built on. It is also the reason the ceiling is permanent.
RAG Cannot Learn From Its Own Errors
If a RAG system hallucinates on query 47 today, it will hallucinate identically on query 47 tomorrow. There is no feedback loop. There is no correction mechanism. The same input produces the same wrong output indefinitely because there is no production experience to learn from. Each call is the first call. The system has no mechanism to improve from what it has seen in deployment, because from its perspective it has never seen anything before.
This is the deepest reason the 60-75% commercial floor does not move. It is not that the retrieval is not good enough. It is that the model has no accumulated understanding of the domain it is serving. Every call is a cold start.
Complex Reasoning Across a Session Is Structurally Impossible
Legal, medical, and financial queries are rarely single-shot. A real legal analysis requires holding reasoning from step one while processing step five. A diagnosis requires connecting observations made early in a consultation with findings made later. A compliance determination requires building a picture across multiple regulatory dimensions simultaneously.
A stateless API call cannot do this. Each intermediate step is a fresh model with no memory of what the previous step established. RAG vendors address this by injecting prior context back into each new call. That compounds token cost with every step. It compounds the hallucination surface with every step. And it still does not produce genuine accumulated reasoning, because the model is not remembering. It is being handed a summary of what it previously said and asked to continue from it. These are not the same thing.
The compounding failure point arithmetic from the previous section gets worse under this constraint. A four-stage reasoning chain where each stage has an 81% end-to-end accuracy ceiling produces an overall result of 0.81 x 0.81 x 0.81 x 0.81, approximately 43% correctness. That is not a benchmark result. That is the mathematical consequence of chaining stateless probabilistic calls through a multi-step reasoning task.
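The same arithmetic, plus the token growth from re-injecting prior context into each stateless call. The per-step token figures are illustrative assumptions, not measurements.

```python
stage_ceiling = 0.81  # end-to-end ceiling of a single RAG pass (previous section)
steps = 4             # a four-stage reasoning chain

chain_correctness = stage_ceiling ** steps
print(f"Chained correctness: {chain_correctness:.0%}")  # ~43%

# Re-injected context grows with every step because each call starts from zero.
base_context_tokens = 2_000    # assumed retrieved context per step
summary_tokens_per_step = 500  # assumed prior-step summary carried forward

total_tokens = 0
for step in range(1, steps + 1):
    tokens_this_call = base_context_tokens + summary_tokens_per_step * (step - 1)
    total_tokens += tokens_this_call
    print(f"Step {step}: {tokens_this_call:,} prompt tokens")
print(f"Total prompt tokens across the chain: {total_tokens:,}")
```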
What Persistent Context Actually Changes
A system with genuine session persistence is categorically different from a stateless API chain. The model builds understanding across the conversation. Reasoning established in step one is genuinely available in step five, not as injected text the model is processing cold, but as context the model has already reasoned from. Corrections made during the session carry forward. Domain understanding accumulates. The model is not starting over. It is continuing.
This is not an incremental improvement on RAG. It is a different class of system. The hallucination ceiling that applies to stateless API calls does not apply in the same way to a persistent session where the model has genuine accumulated context. The input ambiguity that causes confabulation is progressively resolved across the session rather than reset to zero with each call.
RAG vendors cannot solve this by engineering better pipelines. The stateless architecture is the foundation their entire cost model, their entire infrastructure, and their entire API contract is built on. Changing it means rebuilding from the ground up. The ceiling is not a parameter they can tune. It is load-bearing.
In Space Invaders, the invaders reset to the top of the screen at the start of every level. RAG resets to zero at the start of every query. You cannot win a game where your progress is erased before the next move.
What This Means for the RAG Market
RAG is not going away. A $2.76 billion market growing at 49% annually is not dead technology, and Riddhesh is right to say so.
But that market is growing because enterprises have no alternative they know about, not because RAG has solved the accuracy problem in regulated domains. Every legal, healthcare, and finance team running RAG knows their system hallucinates. They manage the risk through disclaimer language, scope limits, and human review bolted on after the fact. They are containing the problem, not eliminating it.
The three-layer approach does not compete with RAG where RAG works well: knowledge freshness, private data access, and high-volume simple queries. It addresses a different question entirely: what do you do when the query is complex, the domain is regulated, and a wrong answer carries legal, clinical, or financial consequences?
A better retrieval algorithm is not the answer. A different architecture is.
The Honest Summary
| Question | Production RAG | Three-Layer Governance Stack |
|---|---|---|
| Hallucination rate on enterprise documents? | More than 10% for all current frontier models (Vectara, April 2026) | Certifiable threshold through multi-model consensus |
| Accuracy on complex domain queries? | 60-75% commercial floor on real enterprise documents | Structured input eliminates ambiguity before generation |
| Can output be certified for regulated domains? | No. Probabilistic by design. | Yes. Multi-model consensus above 95% with human review below. |
| Does it produce a compliance-grade audit trail? | No. RAG logs what was produced, not whether it is true. | Yes. Full chain of custody including reference verification and human decisions. |
| Does the hallucination problem improve over time? | No. It widens as models get more capable. | More capable models strengthen the governance mechanism automatically. |
RAG solves the knowledge boundary problem. It does not solve the reasoning problem. When the reasoning problem is what exposes you to a lawsuit, a knowledge boundary is necessary but not sufficient.
The ceiling is real. The only question is whether your use case can afford to hit it.
In Space Invaders, hitting the ceiling costs you a quarter.
In legal, healthcare, and financial AI, it costs rather more than that.
Further Reading
- The Misdiagnosis That Cost $67.4 Billion - SSRN 6609519
- The Dual-Use Dilemma - SSRN 6641679
- CLARA Self-Governing Architecture - SSRN 6652458
- Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools - Magesh et al., Journal of Empirical Legal Studies, 2025
- Vectara HHEM Hallucination Leaderboard - Updated April 2026
- FRAMES Benchmark: Fact, Fetch, and Reason - Google DeepMind, 2024
Russel Hawkins is the founder of SnapStak.ai and inventor of the ConteX engine and CLARA governance architecture. He is building production implementations of the three-layer AI accuracy stack for legal, healthcare, and financial domains.