Why "It Works" Is the Wrong Bar for AI-Generated Code in Agentic Systems

#ai #machinelearning #technology #programming

The most dangerous line of code in your agentic pipeline is not the one that crashes. It is the one that runs fine in isolation, gets merged because it passed review, and then silently degrades your system's reliability at scale. AI code generation is producing a lot of that second kind right now, and the engineering community is starting to name it.

What Actually Happened

Two conversations have been colliding in engineering circles lately. The first is about building reliable agentic AI systems, specifically the hard operational problem of making multi-step LLM workflows actually hold together in production. The second is a more personal one: experienced engineers describing the specific moment they reject AI-generated code even when it is technically correct and the tests pass.

These two conversations sound separate. They are not. They are describing the same failure mode from two different vantage points.

The reliability conversation focuses on system design: retry logic, fallback chains, observability hooks, deterministic checkpointing between agent steps. The code review conversation focuses on something harder to quantify: whether the code carries forward enough human understanding of the system to be safely modified six months from now by someone who was not in the room.

My take is that the second problem is actually upstream of the first. Unreliable agentic systems are often unreliable because the code composing them was accepted on the wrong criteria.

The Technical Detail That Matters

When an LLM generates code, it optimizes for local correctness. It solves the function signature you gave it. What it does not do is reason about the broader system contract: what happens when the upstream API returns a 429 mid-workflow, whether this retry helper will compose correctly with your existing circuit breaker, or whether the abstraction it chose will make the next feature easy or impossible.
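One concrete way the composition problem shows up is retry amplification. The sketch below is illustrative only: `with_retries`, `RateLimited`, and `flaky_upstream` are hypothetical names, and the upstream is a fake that always behaves like a 429. It shows that wrapping an already-retrying call in a second retry layer multiplies attempts rather than adding them.

```python
class RateLimited(Exception):
    """Hypothetical error standing in for an upstream HTTP 429."""


def with_retries(fn, attempts):
    """Call fn up to `attempts` times, re-raising the last RateLimited."""
    def wrapped():
        last = None
        for _ in range(attempts):
            try:
                return fn()
            except RateLimited as exc:
                last = exc
        raise last
    return wrapped


calls = 0


def flaky_upstream():
    """Fake upstream that is always rate limited; counts how often it is hit."""
    global calls
    calls += 1
    raise RateLimited("429 Too Many Requests")


# The generated helper retries 3 times; the existing client layer also retries 3 times.
inner = with_retries(flaky_upstream, attempts=3)
outer = with_retries(inner, attempts=3)

try:
    outer()
except RateLimited:
    pass

print(calls)  # 9 upstream calls for one logical request
```

Each helper is correct on its own. The problem only exists at the point where they are stacked, which is exactly the part a locally scoped prompt never sees.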

That is not a model quality problem. That is a fundamental mismatch between what LLMs are optimizing for and what production systems actually need.

In agentic architectures specifically, this gets worse. Each agent node is a function composition point. If the code at any node lacks clear failure semantics, your orchestrator cannot make intelligent decisions about whether to retry, escalate, or abort. A function that swallows exceptions and returns a default value looks fine in unit tests. In a multi-step agent chain, it produces ghost completions: results that look successful but carry corrupted state forward.
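To make the ghost completion concrete, here is a minimal sketch with made-up names (`StepResult`, `parse_price_swallowing`, `parse_price_explicit`, `apply_discount`). It is not taken from any real pipeline; it only contrasts a node that swallows its error with one that lets the failure surface.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    value: float
    ok: bool = True


def parse_price_swallowing(raw: str) -> StepResult:
    """Swallows the error and returns a default: looks successful downstream."""
    try:
        return StepResult(float(raw))
    except ValueError:
        return StepResult(0.0)  # ghost completion: ok=True with a made-up value


def parse_price_explicit(raw: str) -> StepResult:
    """Lets the failure propagate so the orchestrator can decide what to do."""
    return StepResult(float(raw))  # raises ValueError on bad input


def apply_discount(step: StepResult) -> StepResult:
    """Next node in the chain; trusts whatever state it is handed."""
    return StepResult(step.value * 0.9)


ghost = apply_discount(parse_price_swallowing("N/A"))
print(ghost)  # StepResult(value=0.0, ok=True)

try:
    apply_discount(parse_price_explicit("N/A"))
except ValueError as exc:
    print(f"step failed visibly: {exc}")
```

The swallowing version would pass a unit test that only checks "returns a StepResult". The downstream node has no way to tell that the zero is invented.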

The engineers rejecting working AI code are usually rejecting it for exactly this reason. The code does not make its failure modes legible. It does not signal what it owns, what it borrows, and what it explicitly does not handle.
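One possible shape for legible failure modes is a small shared error hierarchy that the orchestrator can route on. This is a sketch under assumptions: the class names and the three actions (retry, escalate, abort) are hypothetical, and a real system would likely carry more context on each error.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABORT = "abort"


class AgentError(Exception):
    """Base class for the (hypothetical) shared error taxonomy."""
    action = Action.ABORT


class TransientError(AgentError):
    """Timeouts, rate limits: safe to try again."""
    action = Action.RETRY


class NeedsHumanError(AgentError):
    """Ambiguous or policy-sensitive input: hand off to a person."""
    action = Action.ESCALATE


def route(exc: Exception) -> Action:
    """Errors inside the taxonomy carry their own routing; anything else aborts."""
    if isinstance(exc, AgentError):
        return exc.action
    return Action.ABORT


print(route(TransientError("upstream returned 429")))  # Action.RETRY
print(route(KeyError("missing field")))                # Action.ABORT
```

The design choice worth noticing is the default: an error the taxonomy does not know about aborts instead of being retried or ignored, so new failure types are visible the first time they occur.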

What This Means for Builders

If you are building RAG pipelines or agent systems right now, the practical implication is this: your code review criteria need to be updated for an AI-assisted workflow.

"Does it pass tests" is not sufficient. The bar needs to be: does this code make its operational behavior legible to a human who will debug it at 2am? Does it make failure modes explicit rather than swallowed? Does it fit the existing error taxonomy of the system, or does it invent a new one?

For multi-tenant platforms, this is even more acute. A retrieval function that silently returns empty results instead of surfacing a quota error will produce tenant-visible quality degradation that looks like a model problem, not an infrastructure problem. That is an expensive bug to trace.
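A rough sketch of the difference, using a stand-in index rather than any real vector store client (`FakeIndex`, `QuotaExceededError`, and `retrieve` are all hypothetical names): an empty list should mean "searched and found nothing", and a quota failure should be a different signal entirely.

```python
from dataclasses import dataclass


class QuotaExceededError(Exception):
    """Hypothetical error: the tenant's retrieval quota is exhausted."""

    def __init__(self, tenant_id: str):
        super().__init__(f"retrieval quota exceeded for tenant {tenant_id}")
        self.tenant_id = tenant_id


@dataclass
class FakeIndex:
    """Stand-in for a vector store client; not a real library API."""
    remaining_quota: int
    docs: list

    def search(self, query: str) -> list:
        return [d for d in self.docs if query.lower() in d.lower()]


def retrieve(index: FakeIndex, tenant_id: str, query: str) -> list:
    """Empty list means 'searched, found nothing'. Quota failure raises instead."""
    if index.remaining_quota <= 0:
        raise QuotaExceededError(tenant_id)
    index.remaining_quota -= 1
    return index.search(query)


index = FakeIndex(remaining_quota=1, docs=["Refund policy", "Shipping times"])
print(retrieve(index, "tenant-a", "refund"))  # ['Refund policy']

try:
    retrieve(index, "tenant-a", "refund")
except QuotaExceededError as exc:
    print(exc)  # retrieval quota exceeded for tenant tenant-a
```

With the error carrying the tenant ID, the on-call engineer starts from "tenant-a is out of quota" instead of "answers got worse for some customers".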

The reliability frameworks people are building for agentic systems (checkpointing, structured retries, explicit state machines between agent steps) only work if the code they are orchestrating is honest about what it does and does not guarantee. Bad abstractions defeat good orchestration.
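As a minimal sketch of why honest failure matters to checkpointing, assume a three-step workflow whose state is saved to a JSON file after every step. The step names, the `run_workflow` function, and the file layout are all hypothetical. The runner can only resume correctly if a failing step raises; a step that returns a default would be recorded as done.

```python
import json
from pathlib import Path

STEPS = ["plan", "retrieve", "answer"]  # hypothetical agent steps, in order


def run_workflow(handlers: dict, checkpoint: Path) -> dict:
    """Run steps in order, persisting state after each one so a rerun resumes."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": [], "data": {}}

    for step in STEPS:
        if step in state["done"]:
            continue  # already completed in a previous run
        # A handler that raises leaves the checkpoint at the last good step.
        state["data"][step] = handlers[step](state["data"])
        state["done"].append(step)
        checkpoint.write_text(json.dumps(state))
    return state
```

If the `retrieve` handler raises on a quota error, the file still says only `plan` is done, and the next run picks up at `retrieve`. If it had returned an empty default instead, the checkpoint would lock in the bad state and the rerun would skip straight past it.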

One Thing to Do Today

Pull up the last three pieces of AI-generated code that got merged into your agent or retrieval pipeline. Do not ask whether they work. Ask whether each function's failure behavior is explicit and legible. Check whether exceptions are surfaced or swallowed, whether error returns are typed or stringly-typed, and whether the abstraction boundary matches what your orchestrator actually needs to route on. If you find a function that returns a default on failure without logging or raising, that is your first refactor.
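If the codebase is Python, part of that audit can be roughed out with the standard library's `ast` module. The sketch below is a heuristic, not a linter: `find_swallowed_exceptions` is a made-up helper, and it assumes "logging" means a call to a method named after a logging level, such as `logger.warning(...)`.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of except blocks that neither raise nor call a logger.

    Heuristic only: 'logging' here means a call to a method named like a
    logging level (e.g. logger.warning). Adjust for your codebase.
    """
    log_methods = {"debug", "info", "warning", "error", "exception", "critical"}
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        raises = logs = False
        for child in ast.walk(node):
            if isinstance(child, ast.Raise):
                raises = True
            elif (isinstance(child, ast.Call)
                  and isinstance(child.func, ast.Attribute)
                  and child.func.attr in log_methods):
                logs = True
        if not (raises or logs):
            flagged.append(node.lineno)
    return sorted(flagged)


sample = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        return ""

def load_loudly(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError(f"cannot read {path}") from exc
'''

print(find_swallowed_exceptions(sample))  # [5]
```

A flagged line is a prompt for a human to look, not a verdict. Some defaults on failure are deliberate, and the point of the review is to find out which ones.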

Top comments (1)

Nazar Boyko • Jun 21

Ghost completions is a good name for the exact thing that bites in agent chains. A function that catches an error and returns a default sails through unit tests and then feeds bad state to the next node, which still thinks everything worked. The fix you point at, making failure legible so the orchestrator can route on it, is the right bar. The only thing I'd add is this isn't new to AI. People have always written code that swallows the error and returns a default, but a model produces it at volume and wraps it in tidy structure, so it slips past review far more often.