Why "It Works" Is the Wrong Bar for AI-Generated Code in Agentic Systems
The most dangerous line of code in your agentic pipeline is not the one that crashes. It is the one that runs fine in isolation, gets merged because it passed review, and then silently degrades your system's reliability at scale. AI code generation is producing a lot of that second kind right now, and the engineering community is starting to name it.
What Actually Happened
Two conversations have been colliding in engineering circles lately. The first is about building reliable agentic AI systems, specifically the hard operational problem of making multi-step LLM workflows actually hold together in production. The second is a more personal one: experienced engineers describing the specific moment they reject AI-generated code even when it is technically correct and the tests pass.
These two conversations sound separate. They are not. They are describing the same failure mode from two different vantage points.
The reliability conversation focuses on system design: retry logic, fallback chains, observability hooks, deterministic checkpointing between agent steps. The code review conversation focuses on something harder to quantify: whether the code carries forward enough human understanding of the system to be safely modified six months from now by someone who was not in the room.
My take is that the second problem is actually upstream of the first. Unreliable agentic systems are often unreliable because the code composing them was accepted on the wrong criteria.
The Technical Detail That Matters
When an LLM generates code, it optimizes for local correctness. It solves the function signature you gave it. What it does not do is reason about the broader system contract: what happens when the upstream API returns a 429 mid-workflow, whether this retry helper will compose correctly with your existing circuit breaker, or whether the abstraction it chose will make the next feature easy or impossible.
That is not a model quality problem. That is a fundamental mismatch between what LLMs are optimizing for and what production systems actually need.
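To make the 429 case concrete, here is a minimal sketch of what an explicit system contract could look like: upstream HTTP statuses are mapped onto a small shared error hierarchy that says whether a retry makes sense. The class names, the `retryable` flag, and `error_for_status` are all hypothetical and chosen for illustration; your system's taxonomy will differ.

```python
from typing import Optional


class PipelineError(Exception):
    """Base of a hypothetical shared error taxonomy."""
    retryable = False


class RateLimitedError(PipelineError):
    retryable = True  # a 429 usually means "try again later"


class UpstreamUnavailableError(PipelineError):
    retryable = True  # 5xx responses are treated as transient in this sketch


class InvalidRequestError(PipelineError):
    retryable = False  # retrying the same bad request will not help


def error_for_status(status: int) -> Optional[PipelineError]:
    """Translate an HTTP status into the shared taxonomy (None means success)."""
    if 200 <= status < 300:
        return None
    if status == 429:
        return RateLimitedError("upstream returned 429")
    if status >= 500:
        return UpstreamUnavailableError(f"upstream returned {status}")
    # Everything else is treated as non-retryable here; that is an assumption.
    return InvalidRequestError(f"upstream returned {status}")
```

A retry helper or circuit breaker can then branch on `retryable` instead of parsing status codes or message strings in several places.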
In agentic architectures specifically, this gets worse. Each agent node is a function composition point. If the code at any node lacks clear failure semantics, your orchestrator cannot make intelligent decisions about whether to retry, escalate, or abort. A function that swallows exceptions and returns a default value looks fine in unit tests. In a multi-step agent chain, it produces ghost completions: results that look successful but carry corrupted state forward.
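A small sketch of the difference, assuming a hypothetical scoring step whose upstream call fails. `fetch_score`, `UpstreamError`, and `ScoreResult` are illustrative names, not part of any real library.

```python
from dataclasses import dataclass
from typing import Optional


class UpstreamError(Exception):
    """Hypothetical error raised when the upstream scoring call fails."""


def fetch_score(doc_id: str) -> float:
    # Stand-in for a real upstream call; it always fails here so the
    # two wrappers below can be compared side by side.
    raise UpstreamError(f"scoring service unavailable for {doc_id}")


def score_swallowing(doc_id: str) -> float:
    # Anti-pattern: the failure is converted into a plausible-looking value.
    try:
        return fetch_score(doc_id)
    except Exception:
        return 0.0


@dataclass(frozen=True)
class ScoreResult:
    ok: bool
    value: Optional[float] = None
    error: Optional[str] = None


def score_explicit(doc_id: str) -> ScoreResult:
    # The failure stays visible, so the caller can retry, escalate, or abort.
    try:
        return ScoreResult(ok=True, value=fetch_score(doc_id))
    except UpstreamError as exc:
        return ScoreResult(ok=False, error=str(exc))
```

A unit test that only checks "returns a float" passes for `score_swallowing`, and the next agent step has no way to tell a genuine score of 0.0 from a failed call. With `score_explicit`, the failure is part of the return value.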
The engineers rejecting working AI code are usually rejecting it for exactly this reason. The code does not make its failure modes legible. It does not signal what it owns, what it borrows, and what it explicitly does not handle.
What This Means for Builders
If you are building RAG pipelines or agent systems right now, the practical implication is this: your code review criteria need to be updated for an AI-assisted workflow.
"Does it pass tests" is not sufficient. The bar needs to be: does this code make its operational behavior legible to a human who will debug it at 2am? Does it make failure modes explicit rather than swallowed? Does it fit the existing error taxonomy of the system, or does it invent a new one?
For multi-tenant platforms, this is even more acute. A retrieval function that silently returns empty results instead of surfacing a quota error will produce tenant-visible quality degradation that looks like a model problem, not an infrastructure problem. That is an expensive bug to trace.
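As an illustration, here is a sketch of the two retrieval behaviors. The backend `search_index`, the `quota_remaining` parameter, and `QuotaExceededError` are assumptions made up for this example.

```python
class QuotaExceededError(Exception):
    """Hypothetical error for a tenant that has exhausted its retrieval quota."""

    def __init__(self, tenant_id: str):
        super().__init__(f"retrieval quota exceeded for tenant {tenant_id}")
        self.tenant_id = tenant_id


def search_index(tenant_id: str, query: str, quota_remaining: int) -> list:
    # Stand-in for a real vector or keyword search backend.
    if quota_remaining <= 0:
        raise QuotaExceededError(tenant_id)
    return [f"doc matching {query!r}"]


def retrieve_silent(tenant_id: str, query: str, quota_remaining: int) -> list:
    # Anti-pattern: a quota failure is indistinguishable from "no documents found".
    try:
        return search_index(tenant_id, query, quota_remaining)
    except Exception:
        return []


def retrieve_explicit(tenant_id: str, query: str, quota_remaining: int) -> list:
    # No blanket except: QuotaExceededError propagates with the tenant attached,
    # so the failure is attributed to infrastructure rather than to the model.
    return search_index(tenant_id, query, quota_remaining)
```

With `retrieve_silent`, the generation step receives an empty context and produces a weak answer. With `retrieve_explicit`, the caller sees which tenant hit which limit.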
The reliability frameworks people are building for agentic systems (checkpointing, structured retries, explicit state machines between agent steps) only work if the code they are orchestrating is honest about what it does and does not guarantee. Bad abstractions defeat good orchestration.
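A minimal sketch of why that honesty matters, assuming each step reports one of three outcomes. `Outcome`, `StepResult`, and `run_pipeline` are hypothetical names, and the in-memory checkpoint list stands in for whatever durable store a real system would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Outcome(Enum):
    OK = "ok"
    RETRYABLE = "retryable"
    FATAL = "fatal"


@dataclass
class StepResult:
    outcome: Outcome
    value: Any = None
    detail: str = ""


def run_pipeline(steps, state, max_attempts=3):
    """Run (name, step) pairs in order, routing on each step's declared outcome."""
    checkpoints = []  # in-memory stand-in for durable checkpoint storage
    for name, step in steps:
        for _attempt in range(max_attempts):
            result = step(state)
            if result.outcome is Outcome.OK:
                state = result.value
                checkpoints.append((name, state))  # record state after each success
                break
            if result.outcome is Outcome.FATAL:
                return Outcome.FATAL, checkpoints  # abort without retrying
            # RETRYABLE: fall through and try the same step again
        else:
            # Retry budget exhausted; the caller decides how to escalate.
            return Outcome.RETRYABLE, checkpoints
    return Outcome.OK, checkpoints
```

The loop can only route correctly because every step declares its outcome. A step that swallowed its error and returned a default would be recorded as `OK` and checkpointed like any other success.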
One Thing to Do Today
Pull up the last three pieces of AI-generated code that got merged into your agent or retrieval pipeline. Do not ask whether they work. Ask whether each function's failure behavior is explicit and legible. Check whether exceptions are surfaced or swallowed, whether error returns are typed or stringly-typed, and whether the abstraction boundary matches what your orchestrator actually needs to route on. If you find a function that returns a default on failure without logging or raising, that is your first refactor.
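One way to start that audit is a rough scan with Python's standard `ast` module. The sketch below flags `except` handlers that neither raise nor call anything. Treating "contains any function call" as a proxy for "logs the error" is a crude assumption, so expect both false positives and false negatives; the function name and sample source are made up for illustration.

```python
import ast


def find_swallowed_exceptions(source: str) -> list:
    """Return line numbers of except handlers that neither raise nor call anything."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Collect every AST node inside the handler body.
        inner = [n for stmt in node.body for n in ast.walk(stmt)]
        has_raise = any(isinstance(n, ast.Raise) for n in inner)
        has_call = any(isinstance(n, ast.Call) for n in inner)  # may be logging
        if not has_raise and not has_call:
            flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        return ""

def load_strict(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError("cannot load") from exc
'''

# Flags the handler in load() but not the one in load_strict().
print(find_swallowed_exceptions(SAMPLE))
```

Treat the output as a list of places to read, not as a verdict. Some handlers legitimately return a default, and the question is whether that choice is deliberate and visible to the caller.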