Bala Paranj

Posted on Jun 3 • Edited on Jun 25

Fallacies of GenAI Development #5: Better Context Prevents Hallucination

#ai #softwaredevelopment #architecture #engineering

✓ Human-authored analysis; AI used for formatting and proofreading.

This is the fifth in a series of eight posts on the false assumptions teams make when building with generative AI. Fallacy #1 covered the generation-engineering gap. Fallacy #2 covered plausible vs. correct. Fallacy #3 covered AI verifying AI. Fallacy #4 covered removing the review gate. This post covers the assumption that better input guarantees correct output.

The Fallacy

"If we give the AI better documentation, up-to-date APIs, and more context, it won't hallucinate."

Why it's tempting

You've seen the problem firsthand. You ask the AI to call an API. It uses a deprecated endpoint. You ask it to implement a library function. It invents a method that doesn't exist. You ask it to follow your team's coding conventions. It generates code that looks like it read the conventions from 2019.

The diagnosis seems obvious: the AI doesn't have the right information. The fix seems obvious: give it the right information. RAG retrieves relevant documents before the AI generates. Context Hub tools fetch the latest API documentation. System prompts inject your team's conventions. The context window fills with the information the AI needs.

And it works. The AI generates code that uses the current API. It follows the latest conventions. The hallucination rate drops measurably. The investment in context infrastructure pays off.

So what's the fallacy?

Why it's wrong

Better context reduces hallucination. It does not eliminate it. Production failures live in the gap between reduced and eliminated.

Three reasons context can't close the gap completely:

Reason 1: The AI can ignore the context

Retrieval-Augmented Generation retrieves documents and places them in the context window. The model is SUPPOSED to use them. But the model can, and does, override the retrieved context with patterns from its training data when those patterns are more strongly reinforced than what the context says.

Liu et al. (2023) documented this as the "lost in the middle" problem: LLM performance degrades significantly when relevant information sits in the middle of a long context window. Performance is highest when that information appears at the beginning or end, and drops in between. This alone shows that "more context" is not a linear path to "more correctness."

But the problem is deeper than positional bias. Research into parametric versus non-parametric memory (Longpre et al., 2021; Neeman et al., 2022) shows that when an LLM's internal training weights (parametric) conflict with the provided RAG context (non-parametric), the model frequently defaults to its training — especially when the training data was highly reinforced. The model's internal weights represent billions of patterns. The context window represents a few thousand tokens. When the two conflict, the model doesn't reliably choose the context.

You can test this yourself: provide a document that explicitly contradicts the model's training data. Ask a question about it. The model will sometimes answer from the document and sometimes answer from its training. The percentage depends on the model, the prompt, and the specific conflict. It's rarely, if ever, 100% from the document. The model isn't reading and following instructions. It's predicting the most likely next token given ALL its inputs: training weights and context window combined.
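The experiment above can be sketched as a small harness. This is a minimal illustration, not a benchmark: `ask_model` is a hypothetical callable standing in for whatever LLM client you use, `make_stub_model` is a fake that cycles through canned replies so the example runs offline, and matching answers by substring is a deliberately crude assumption.

```python
import itertools
from collections import Counter
from typing import Callable, Iterable


def context_adherence(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    document: str,
    question: str,
    document_answer: str,   # the answer the contradicting document supports
    training_answer: str,   # the answer the model likely learned in training
    trials: int = 20,
) -> Counter:
    """Tally how often replies follow the document versus the training prior."""
    prompt = (
        "Answer using only this document.\n\n"
        f"{document}\n\nQuestion: {question}"
    )
    tally: Counter = Counter()
    for _ in range(trials):
        reply = ask_model(prompt).lower()
        # Crude substring matching; a reply containing both counts as "document".
        if document_answer.lower() in reply:
            tally["document"] += 1
        elif training_answer.lower() in reply:
            tally["training"] += 1
        else:
            tally["other"] += 1
    return tally


def make_stub_model(replies: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real model: ignores the prompt and cycles canned replies."""
    cycle = itertools.cycle(replies)
    return lambda prompt: next(cycle)


stub = make_stub_model([
    "The capital is Zubrowka.",   # follows the (fictional) document
    "The capital is Paris.",      # falls back to training
    "The capital is Zubrowka.",
    "I am not sure.",
])
tally = context_adherence(
    stub,
    document="As of this memo, the capital of France is Zubrowka.",
    question="What is the capital of France?",
    document_answer="Zubrowka",
    training_answer="Paris",
    trials=8,
)
print(dict(tally))
```

Swapping the stub for a real client and running enough trials gives a rough adherence rate for one specific conflict. The number will differ by model, prompt, and conflict, which is the point of the exercise.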

Reason 2: Context is necessary, verification is separate

Context answers: "What information should the AI use?"
Verification answers: "Did the AI use the information correctly?"

These are different questions with different mechanisms. Having the right ingredients doesn't guarantee the dish is correct. A chef with perfect ingredients can still combine them wrong, overcook the protein, or plate the wrong dish entirely. The ingredients are necessary. The taste test is still required.

Modern benchmarks confirm this gap. The Retrieval-Augmented Generation Benchmark (RGB) shows that models often fail to reason correctly over retrieved documents even when the retrieval is 100% accurate. Having the right documents in the context window and using them correctly are independent capabilities — and current models fail at the second even when the first is perfect.

In engineering terms:

Context pipeline:       Ensures the AI HAS the right information
                        → vector database, retrieval, ranking, injection
                        → measured by: retrieval accuracy, relevance scores

Verification pipeline:  Ensures the AI USED the information correctly
                        → specification checks, contract tests, property verification
                        → measured by: output correctness against declared properties

Most teams invest heavily in the first pipeline and have nothing for the second. They measure retrieval quality — "did we give the AI the right documents?" — but not output correctness — "did the output satisfy the properties it should satisfy?"
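A minimal sketch of the two measurements side by side, under stated assumptions: the `Generation` record, the document ID, and the substring-based timeout check are all hypothetical stand-ins, not part of any real RAG framework.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Generation:
    retrieved_ids: set[str]   # documents the retriever returned
    relevant_ids: set[str]    # documents a reviewer marked as needed
    output: str               # the code the model produced


def retrieval_recall(gen: Generation) -> float:
    """Context-pipeline metric: share of needed documents that were retrieved."""
    if not gen.relevant_ids:
        return 1.0
    return len(gen.retrieved_ids & gen.relevant_ids) / len(gen.relevant_ids)


def property_results(
    gen: Generation, checks: dict[str, Callable[[str], bool]]
) -> dict[str, bool]:
    """Verification-pipeline metric: which declared properties the output satisfies."""
    return {name: check(gen.output) for name, check in checks.items()}


gen = Generation(
    retrieved_ids={"payments-api-v3"},
    relevant_ids={"payments-api-v3"},
    output="resp = requests.post(url, json=payload)",
)
# Crude placeholder check: a real one would parse the code, not search substrings.
checks = {"external_call_has_timeout": lambda code: "timeout=" in code}

print(retrieval_recall(gen))
print(property_results(gen, checks))
```

Here retrieval scores perfectly while the only declared property fails. A dashboard that tracks just the first number would report this generation as a success.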

This is like a restaurant that invests millions in sourcing the finest ingredients, tracks every supplier relationship, and monitors freshness down to the hour, but has no quality check on the finished dish. The ingredients are world-class. The plate that reaches the customer might still be bland.

Reason 3: Context doesn't cover properties

Even with perfect context — every document retrieved, every API current, every convention injected — the AI has no mechanism to enforce PROPERTIES that span the generated output.

"No function in this module may call an external service without a timeout." That's a property. It's not in any API documentation. It's not in any retrieved document. It's an architectural decision the team made. The AI has no way to know about it from context alone, because it was never written as a retrievable document — it lives in the team's shared understanding.

"This data pipeline must preserve message ordering." That's a constraint. It's implicit in the architecture. Even if someone wrote it down, RAG would have to retrieve THAT SPECIFIC document for THAT SPECIFIC generation. The probability of the right document being retrieved for the right generation at the right time decreases as the number of such properties grows.
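To put rough numbers on that last claim, here is a back-of-the-envelope sketch. It assumes each applicable property is retrieved independently with the same probability, which is a simplification introduced here rather than anything measured; the 0.95 figure is illustrative only.

```python
def chance_all_retrieved(per_property_recall: float, n_properties: int) -> float:
    """Probability that every one of n applicable properties is retrieved,
    assuming independent retrievals with equal recall (a simplifying assumption)."""
    return per_property_recall ** n_properties


for n in (1, 5, 10, 20):
    print(n, round(chance_all_retrieved(0.95, n), 3))
```

Under these assumptions, 95% recall per property leaves roughly a 60% chance of retrieving all ten applicable properties, and about 36% for twenty.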

Context can provide facts. It can't provide the complete set of properties that must hold across every generated artifact. Properties are exhaustive ("ALL functions must have timeouts"). Context is sampled ("HERE are some relevant documents").
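An exhaustive property like the timeout rule can be checked mechanically over every call in a file rather than sampled. The sketch below uses Python's standard `ast` module and assumes, for illustration, that "external service call" means a direct `requests.<method>(...)` call; a real codebase would need its own definition, and calls that pass `**kwargs` would be flagged even if a timeout is hidden inside.

```python
import ast

# Assumption: these requests-library entry points count as external calls.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "request"}


def calls_missing_timeout(source: str) -> list[int]:
    """Return line numbers of requests.<method>(...) calls with no timeout= keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_METHODS
        )
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        if is_requests_call and not has_timeout:
            missing.append(node.lineno)
    return sorted(missing)


generated = (
    "import requests\n"
    "requests.get('https://example.com/a')\n"
    "requests.post('https://example.com/b', timeout=5)\n"
)
print(calls_missing_timeout(generated))
```

The check inspects ALL calls in the output and does not care which documents were in the context window when the code was generated.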

There's a deeper structural reason RAG misses properties. RAG retrieves by semantic similarity: vector math that finds documents "close to" the query in meaning space. A security invariant ("never log PII") is semantically FAR from a function that processes user data. They aren't similar in the vector space. So the RAG system often fails to retrieve the security rule for that specific generation, because security constraints and implementation code don't look alike in embeddings. Facts live near the code that uses them. Properties live in a different semantic neighborhood entirely.
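A toy illustration of ranking by similarity. Loud assumption: this uses bag-of-words cosine similarity, not a real embedding model. Real embeddings are dense and capture more than word overlap, so the gap will be less extreme in practice. The document names and texts are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Rank documents by similarity to the query and keep the best k."""
    q = bag_of_words(query)
    ranked = sorted(
        docs, key=lambda name: cosine(q, bag_of_words(docs[name])), reverse=True
    )
    return ranked[:k]


docs = {
    "profile_api": "Update the user profile email address with PATCH /users/{id}",
    "pii_rule": "Never log PII",
}
query = "Write a function that updates the user email address"

print(top_k(query, docs, k=1))
print(cosine(bag_of_words(query), bag_of_words(docs["pii_rule"])))
```

The API document wins the ranking because it shares vocabulary with the task. The rule that constrains the task shares none, so with a small `k` it never reaches the context window.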

Mishra et al. (2022) demonstrated a related failure: LLMs struggle significantly with negation and constraints in instructions. "Never log PII" is a negative property. Even when the constraint is explicitly in the context, the model's training data is composed largely of positive examples ("do Y"), not negative ones ("never do X"). The context provides the constraint. The model's training biases it against following that constraint.

The boom

Month 1-3: The context investment. The team builds a sophisticated RAG pipeline. Vector database, document preprocessing, chunk optimization, relevance scoring. The AI generates better code. Hallucination rate drops from 15% to 5%. The team publishes an internal blog post celebrating the improvement.

Month 4-6: The 5% that matters. The 5% residual hallucination isn't random. It's concentrated in the hard cases — the ones where the context conflicts with the training data, the ones where the property isn't in any retrievable document, the ones where the AI needs to reason about composition rather than pattern-match individual facts. These are disproportionately the cases that cause production incidents.

Month 7-9: The false confidence incident. A developer generates code for a new feature. RAG retrieves the correct API documentation. The AI uses the correct API endpoint. The code compiles. The tests pass. But the code violates an architectural constraint — it makes a synchronous call to an external service without a timeout, inside a transaction. Nothing in the retrieved context mentioned the timeout requirement. The constraint was in the team's architecture decision records, which were never indexed in the vector database. The service hangs. The database connection pool exhausts. The outage lasts four hours.

Here's the cruel irony: better context made the hallucination MORE dangerous, not less. The code used the correct API. It followed the correct conventions. It looked MORE correct than code generated without RAG — which made everyone trust it more. Better context doesn't stop the AI from generating violations. It gives the AI better facts to wrap the violation in. The plausibility goes up. The scrutiny goes down. The damage goes up.

Post-mortem: "We had the right context. The AI used the right API. But the AI didn't know about the timeout constraint because we didn't retrieve it. And we didn't retrieve it because we didn't index it. And we didn't index it because we didn't know we needed to."

The post-mortem reveals the structural gap: context improves what the AI generates FROM. It doesn't check what the AI generates AGAINST. The retrieved documents were correct. The generated code was plausible. The property that was violated was never in the context pipeline — and no verification pipeline existed to catch it.

Month 10: The response. The team indexes their ADRs in the vector database. They add more documents to the RAG pipeline. The context gets better. The next incident is a different property that wasn't indexed. The whack-a-mole cycle begins: each incident reveals a property that should have been in the context but wasn't. The team can never index EVERYTHING because they don't have a complete list of everything that matters.

The resolution: verify the output, not just the input

Context and verification are complementary. Both are necessary. Neither is sufficient alone.

INPUT QUALITY (context):
    "Give the AI the right information"
    → RAG, Context Hub, up-to-date docs, system prompts
    → Reduces hallucination
    → Measured by: retrieval accuracy

OUTPUT CORRECTNESS (verification):
    "Check that the AI used the information correctly"
    → Specification gates, contract tests, property checks
    → Catches violations regardless of context quality
    → Measured by: properties satisfied per change

The team that invests only in context is optimizing input quality and hoping output correctness follows. Sometimes it does. The 5% where it doesn't is where the incidents live.

The team that adds verification checks the output REGARDLESS of how it was generated. Whether the AI had perfect context or no context at all, the verification catches violations — because the verification checks PROPERTIES, not INPUTS.

The kitchen analogy, completed

Context = Ingredients
    Better ingredients → better chance of a good dish
    But: a chef can still combine them wrong

Verification = Taste test
    Check the finished dish against the standard
    Catches errors regardless of ingredient quality

Great restaurants have BOTH:
    World-class sourcing AND quality checks on every plate

Great engineering has BOTH:
    RAG pipeline AND specification verification on every change

Investing in RAG without investing in verification is like a restaurant that sources the finest ingredients but has no chef tasting the dish before it reaches the customer. Most dishes will be fine. The ones that aren't will be the ones the customer remembers.

What the context pipeline can't replace

Five categories of problems that better context cannot solve:

1. Compositional correctness. The AI generates Function A correctly (right context retrieved) and Function B correctly (right context retrieved). The composition of A and B violates a cross-cutting property. No individual document in the RAG pipeline covers the composition — because the property emerges from the interaction, not from any single component.

2. Architectural constraints. "All inter-service calls must use gRPC with deadline propagation." This is an organizational decision, not a fact in a document. Even if indexed, RAG must retrieve it for EVERY generation that involves a service call. One missed retrieval = one violation.

3. Negative properties. "This function must NEVER log PII." Context tells the AI what TO do. Negative properties tell it what NOT to do. The AI can follow the positive context perfectly and still violate a negative property that wasn't in the retrieved documents — and even when the constraint is retrieved, the model's training bias toward positive examples means it struggles to follow negative instructions reliably (Mishra et al., 2022).

4. Implicit conventions. "Error handling in this codebase uses result types, not exceptions." The convention is embedded in 10,000 existing functions. RAG might retrieve a few examples. The AI might still generate exceptions because its training data favors exceptions. The context helps. It doesn't guarantee.

5. Mathematical properties. "The sum of all line items must equal the invoice total." This is arithmetic. The AI can have perfect context about the invoice schema and still generate code where floating-point arithmetic introduces a rounding error. The property requires verification, not context.
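The arithmetic case in item 5 is easy to reproduce. A minimal sketch, assuming amounts are stored as integer cents; the function name is hypothetical:

```python
def invoice_is_consistent(line_items_cents: list[int], invoice_total_cents: int) -> bool:
    """Property check: the line items must sum exactly to the invoice total."""
    return sum(line_items_cents) == invoice_total_cents


# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the float sum is not equal to 0.30.
float_items = [0.10, 0.20]
print(sum(float_items) == 0.30)               # False

# The same amounts as integer cents compare exactly.
print(invoice_is_consistent([10, 20], 30))    # True
```

No amount of schema documentation in the context window changes how floats behave. Only a check on the result catches the mismatch.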

What you can do this week

1. List five properties that must hold in your system. Not facts the AI needs to know — PROPERTIES the output must satisfy. "All endpoints require authentication." "No database query uses string concatenation for parameters." "All monetary amounts use integer cents, not floating-point dollars." These are your verification targets.

2. Check: are any of these in your RAG pipeline? Probably not. Properties live in people's heads, in architecture decision records, in tribal knowledge. They're not the kind of documents teams typically index for retrieval. This gap is your vulnerability.

3. Add one property as a mechanical check. Not in the RAG pipeline — in the CI pipeline. A check that runs on every change, regardless of what context the AI had. The property either holds or it doesn't. The check is the verification layer that context can't provide.

4. Audit your RAG pipeline against your properties. Pick 10 recent successful retrievals — cases where RAG provided the right documents and the AI generated correct-looking code. For each one, ask: "If the AI had used this context perfectly but ignored our internal timeout policy (or our PII logging rule, or our authentication requirement), would anything have caught it?" If the answer is no for even one, you have a context-only architecture. The context is doing its job. The verification layer doesn't exist yet.
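As one possible shape for step 3, here is a sketch of a mechanical check for the "no string concatenation in database queries" property, using Python's standard `ast` module. It assumes queries go through a method named `execute` (as in DB-API cursors) and flags f-strings, `+` or `%` expressions, and `.format()` calls in the first argument. The function names and the error message are illustrative, and the heuristic will miss queries built on an earlier line.

```python
import ast


def unsafe_sql_lines(source: str) -> list[int]:
    """Return line numbers where .execute() receives a dynamically built string."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        is_execute_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and bool(node.args)
        )
        if not is_execute_call:
            continue
        query = node.args[0]
        built_dynamically = isinstance(query, (ast.JoinedStr, ast.BinOp)) or (
            isinstance(query, ast.Call)
            and isinstance(query.func, ast.Attribute)
            and query.func.attr == "format"
        )
        if built_dynamically:
            flagged.append(node.lineno)
    return sorted(flagged)


def main(paths: list[str]) -> int:
    """Check each file; return 1 if any violates the property, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in unsafe_sql_lines(handle.read()):
                print(f"{path}:{line}: SQL built from strings; use bound parameters")
                failed = True
    return 1 if failed else 0


sample = (
    'cur.execute("SELECT * FROM t WHERE id = " + user_id)\n'
    'cur.execute("SELECT * FROM t WHERE id = %s", (user_id,))\n'
    'cur.execute(f"SELECT * FROM t WHERE id = {user_id}")\n'
)
print(unsafe_sql_lines(sample))
```

A CI step could pass the changed files to `main` and use its return value as the process exit code, so the build fails whenever the property is violated, whatever the AI did or did not have in context.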

Your context pipeline is probably good. Keep investing in it. Better retrieval means fewer hallucinations. But add the verification pipeline alongside it — because the hallucinations that survive good context are the ones that cause the most damage. They're the subtle ones. The plausible ones. The ones that pass every test because nobody wrote a test for that specific property. And they're the ones that a mechanical check catches on every change, every time, regardless of what was or wasn't in the context window.

Next in the series: Fallacy #6, "AI-Generated Code Is an Asset." Why every line of generated code is a liability, why the right unit of progress isn't code volume, and what Unix figured out about composable functions 50 years ago.

The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.

References

Liu, N.F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL.
Longpre, S., et al. (2021). "Entity-Based Knowledge Conflicts in Question Answering." EMNLP 2021.
Mishra, S., et al. (2022). "Cross-Task Generalization via Natural Language Crowdsourcing Instructions." ACL 2022.
Neeman, E., et al. (2022). "DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering." arXiv preprint; published at ACL 2023.
