You ask an LLM-powered agent to fix a flaky test. It reads the stack trace, notices the failure happens right after a database call, and patches it with a retry. The test still fails. The model saw a correlation — failure near a database call — and never checked whether that call caused the failure. That gap has a precise name. Judea Pearl, who won the 2011 Turing Award for formalizing probabilistic and causal reasoning, would say the agent never left the bottom rung of the Ladder of Causation.
This isn't a prompt-engineering problem you can patch away. It's a statement about what data-driven systems can and cannot compute — and it explains a lot of what you see go wrong with LLM tools.
The three rungs of the ladder
Pearl's causal hierarchy, laid out for a general audience in The Book of Why (2018, written with Dana Mackenzie), sorts every question you can ask into three rungs, and each rung needs information the one below it cannot supply.
Rung 1 is association. "What does seeing X tell me about Y?" Written formally, it is the conditional probability P(Y | X). Correlation, pattern recognition, curve fitting, ordinary supervised learning, and next-token prediction all live here. Example: users who open the billing page churn at a higher rate.
Rung 2 is intervention. "What happens to Y if I do X?" Pearl gives this its own notation, P(Y | do(X)) — the do-operator — because acting is not the same as observing. Example: if we redesign the billing page, does churn drop? The Rung 1 correlation cannot tell you. Maybe confused users both visit billing and churn, and the page itself changes nothing.
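To make the gap between observing and doing concrete, here is a minimal simulation of the billing-page story. Everything in it is an assumption made up for illustration: the variable names, the 30% share of confused users, and the visit and churn probabilities. The toy world is wired so that confusion drives both the billing visit and churn, while the page itself has no effect.

```python
import random

def simulate(n, rng, force_visit=None):
    """Draw n users from a toy causal model (all probabilities are invented).

    confused -> visits billing page
    confused -> churns
    The page visit has no effect on churn in this model.
    """
    rows = []
    for _ in range(n):
        confused = rng.random() < 0.3
        if force_visit is None:
            # Observation: the user decides, and confusion drives the visit.
            visits = rng.random() < (0.8 if confused else 0.2)
        else:
            # Intervention, do(visit): we set the value and cut the link
            # from confusion to the visit.
            visits = force_visit
        churns = rng.random() < (0.5 if confused else 0.1)
        rows.append((visits, churns))
    return rows

def churn_rate(rows, visited):
    """Share of users who churned among those whose visit flag equals `visited`."""
    selected = [churn for visit, churn in rows if visit == visited]
    return sum(selected) / len(selected)

rng = random.Random(0)

# Rung 1: P(churn | visit) - P(churn | no visit), from passive observation.
observed = simulate(200_000, rng)
assoc_gap = churn_rate(observed, True) - churn_rate(observed, False)

# Rung 2: P(churn | do(visit)) - P(churn | do(no visit)), from forcing the visit.
do_visit = simulate(200_000, rng, force_visit=True)
do_skip = simulate(200_000, rng, force_visit=False)
causal_gap = churn_rate(do_visit, True) - churn_rate(do_skip, False)

print(f"observational gap:  {assoc_gap:.3f}")
print(f"interventional gap: {causal_gap:.3f}")
```

With these made-up parameters the observational gap works out analytically to about 0.21, while the interventional gap is zero by construction, and the simulation should land close to both. The only reason the code can compute the second number is that it has the causal model in hand; the observed rows alone would not tell you which link to cut.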
Rung 3 is counterfactual. "Would this specific user have churned if they had not hit the broken page — given that they did hit it, and did churn?" This is reasoning about alternatives to events that already happened. It is what you do every time you say "that bug would not have shipped if we'd had a test for it."
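A counterfactual can be computed when you are willing to write down the mechanism. The sketch below follows the usual three steps (abduction, action, prediction) on a deliberately tiny model. The mechanism, the hidden traits `fragile` and `unhappy`, and their probabilities are all hypothetical, and the model assumes that hitting the broken page is unrelated to those hidden traits.

```python
from fractions import Fraction
from itertools import product

# Invented priors over two hidden traits of a user.
P_FRAGILE = Fraction(1, 2)   # would churn if they hit the broken page
P_UNHAPPY = Fraction(1, 5)   # would churn no matter what

def churns(hit_broken_page, fragile, unhappy):
    """Structural equation: the assumed mechanism that decides churn."""
    return (hit_broken_page and fragile) or unhappy

def prior(fragile, unhappy):
    """Prior probability of one combination of hidden traits."""
    pf = P_FRAGILE if fragile else 1 - P_FRAGILE
    pu = P_UNHAPPY if unhappy else 1 - P_UNHAPPY
    return pf * pu

worlds = list(product([False, True], repeat=2))  # (fragile, unhappy)

# Step 1, abduction: keep only the hidden states consistent with what
# actually happened (the user hit the page and churned), then renormalize.
consistent = {w: prior(*w) for w in worlds if churns(True, *w)}
total = sum(consistent.values())
posterior = {w: p / total for w, p in consistent.items()}

# Steps 2 and 3, action and prediction: replay the same user with the
# page hit switched off and see how often they still churn.
p_churn_anyway = sum(p for w, p in posterior.items() if churns(False, *w))

# Rung 2 comparison: churn rate if nobody hits the page, ignoring what
# we know about this particular user.
p_churn_do_no_hit = sum(prior(*w) for w in worlds if churns(False, *w))

print("P(would have churned anyway | hit page, churned) =", p_churn_anyway)
print("P(churn | do(no hit)) =", p_churn_do_no_hit)
```

Under these assumptions the counterfactual answer is 1/3, while the population-level interventional answer is 1/5. The two differ because the counterfactual conditions on what actually happened to this user, which is exactly the information the lower rungs discard.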
The rungs are ordered for a reason. The Causal Hierarchy Theorem — formalized by Elias Bareinboim and colleagues building on Pearl's work — makes the separation rigorous: in general, data from a lower rung cannot answer a question on a higher rung. No amount of Rung 1 observation settles a Rung 2 question on its own.
Why more data does not climb the ladder
The part developers miss is that this is a structural limit, not a sample-size limit. More rows do not help.
Here is the intuition. Two completely different causal worlds can produce the exact same observational distribution. Pearl's stock example: a rooster crows every morning just before sunrise. The data (crow, then sun, every single day, for years) is equally consistent with "the rooster's crow causes the sunrise" and "the approaching dawn causes the crow." To pick the right one you need an assumption that does not come from the data: knowledge about how the world is actually wired. Strip that away and the dataset is mute, no matter how large it gets.
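The rooster argument can be checked exactly with two toy worlds. In this sketch, `world_a` and `world_b` are hypothetical models invented for illustration: each covers a single time slot in which the rooster either crows or not and the sun either comes up or not, with a 50/50 chance on the root variable.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def world_a(do_crow=None):
    """World A: the crow causes the sun. Returns P(crow, sun) as a dict."""
    dist = {}
    for crow, p in ((False, HALF), (True, HALF)):
        if do_crow is not None:
            crow = do_crow          # intervention overrides the rooster
        sun = crow                  # the sun copies the crow
        dist[(crow, sun)] = dist.get((crow, sun), 0) + p
    return dist

def world_b(do_crow=None):
    """World B: the sun causes the crow. Returns P(crow, sun) as a dict."""
    dist = {}
    for sun, p in ((False, HALF), (True, HALF)):
        # The crow copies the sun unless we force it.
        crow = sun if do_crow is None else do_crow
        dist[(crow, sun)] = dist.get((crow, sun), 0) + p
    return dist

def p_sun(dist):
    """Marginal probability that the sun is up."""
    return sum(p for (crow, sun), p in dist.items() if sun)

# Rung 1: passive observation cannot tell the two worlds apart.
same_observations = world_a() == world_b()

# Rung 2: force the rooster to crow and the worlds disagree.
sun_if_forced_a = p_sun(world_a(do_crow=True))
sun_if_forced_b = p_sun(world_b(do_crow=True))

print("identical observational data:", same_observations)
print("P(sun | do(crow)) in world A:", sun_if_forced_a)
print("P(sun | do(crow)) in world B:", sun_if_forced_b)
```

Both worlds produce the same table of observations, so collecting more rows from that table cannot separate them. They only come apart under the intervention, where world A says the sun is certain to rise and world B leaves it at one half.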
This is the part that matters for your stack. Retrieval-augmented generation adds more Rung 1 evidence. It can genuinely cut hallucinations that come from missing facts — if the model never saw that your API returns a 429 under load, putting that in context fixes it. What retrieval does not do is hand the model a do-operator. You can index every incident postmortem your company has ever written, and the model still cannot compute what would happen if you changed the retry policy — unless something in that text already spells out the causal structure for it.
Treating retrieval as a fix for reasoning gaps is a common and expensive mistake. RAG improves grounding — what the model knows. It does not change which rung the model operates on. If your failure mode is the model confidently asserting a cause-and-effect relationship, adding more documents to the context window will not fix it. You are giving a Rung 1 system more Rung 1 data.
What this means for your LLM tools and agents
A language model trained to predict the next token is modeling P(next token | preceding text), which is Rung 1 at a scale no statistician of Pearl's generation imagined. It does Rung 1 work genuinely well. The trouble starts when a prompt poses a causal or counterfactual question. The model does not run a causal computation; it produces the text patterns most strongly associated with questions of that shape.
Sometimes that works. If the training corpus contains enough worked causal reasoning about a topic — and for well-trodden topics it does — the pattern-match lands on a correct answer, and it looks like reasoning. It breaks down when the situation has no close textual precedent: a novel system, your particular codebase, a chain of two or three interventions stacked on each other. That is the profile of a large share of production hallucinations. The model is not lying. It is doing Rung 1 work on a Rung 2 question, and presenting the result with the same fluency either way.
Agents make the gap sharper. An agent acting in the world asks a Rung 2 question at every step — "if I run this command, what state results?" An agent whose only signal is logs of past runs has Rung 1 data about those runs. It performs well when the new situation matches the distribution it has seen, and degrades, often silently, when it does not. "Works in the demo, fails in production" is frequently this exact mismatch.
The practical move is to stop asking these tools to climb a rung they cannot, and to use them hard where Rung 1 is the job: autocomplete, boilerplate, format translation, summarizing a diff, surfacing a pattern across files. Then supply the causal model yourself — explicit constraints in the prompt, tests that encode your cause-and-effect expectations, and review that checks the reasoning rather than only the output.
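One way to encode a cause-and-effect expectation in a test is to change exactly one factor and hold the rest fixed, which is a small-scale intervention. The sketch below revisits the flaky-test story under invented assumptions: `nightly_report`, the two fetch functions, and the hour-23 bug are all hypothetical stand-ins, not code from any real project.

```python
def nightly_report(fetch_rows, hour):
    """Hypothetical code under test.

    It calls the database first, then fails on an hour-handling bug,
    so the stack trace shows the failure right after the DB call.
    """
    rows = fetch_rows()
    if hour == 23:
        raise ValueError("report window rolled over to the next day")
    return len(rows)

def real_fetch():
    """Stand-in for the real database call."""
    return ["row-1", "row-2"]

def stub_fetch():
    """Replacement that removes the database from the picture."""
    return []

def fails(fetch_rows, hour):
    """Run the code under test and report whether it raised."""
    try:
        nightly_report(fetch_rows, hour)
    except ValueError:
        return True
    return False

# Baseline: the failing configuration seen in the stack trace.
baseline = fails(real_fetch, hour=23)

# Intervention 1: swap only the database call, hold the hour fixed.
db_swapped = fails(stub_fetch, hour=23)

# Intervention 2: change only the hour, hold the database call fixed.
hour_changed = fails(real_fetch, hour=12)

print("baseline fails:            ", baseline)
print("fails with DB stubbed out: ", db_swapped)
print("fails with hour changed:   ", hour_changed)
```

In this toy setup, stubbing the database leaves the failure in place and changing the hour removes it, so a retry around the database call was never going to help. The point is the shape of the check rather than the specific bug: each run changes one suspected cause and nothing else.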
None of this is an argument against AI tooling. It is an argument for matching the tool to the rung. Pearl's hierarchy gives you a fast check before you delegate a task: am I asking this model to recognize a pattern, or to reason about what a change would cause? The first is what it was built for. The second is still on you.
Originally published at pickuma.com.