Daniel Butler

Posted on Jan 30 • Edited on Feb 3

Designing agentic workflows: where agents fail and where we fail

#ai #agents #codereview #tooling

Agentic coding increases throughput. It also increases the probability that we ship something we didn't mean to ship.

Both the agent and the review process tend to optimise for visible success signals: green tests, plausible diffs, confident summaries, "done".

This creates predictable failure modes in production work, split into two buckets: where the agent fails (reward hijacking, shallow correctness, residue), and where we fail as output volume spikes (review fatigue, rubber-stamping, intent drift, decision delegation). The split matters because the agent's failure modes are only half the story. The other half is how our own behaviour changes when output volume spikes.

This isn't an argument against using agents. We get real leverage from them. The point is that the failure modes are predictable, and if we don't design our workflow around them, we end up shipping problems with a green CI badge.

There are two expectations to set up front:

The workflow has to work with the human, not against them. Most failures don't come from a single "bad suggestion". They come from a process that stops enforcing verification once the diffs get big.
Enterprise stakes are different. In personal projects we can optimise for momentum and fun. In production work, we're exposed to security regressions, compliance issues, availability incidents, and quiet failures that only surface months later.

Tools are evolving rapidly, with more built-in guardrails appearing. These failure modes are architectural, not implementation bugs in current tools. Guardrails help, but workflow design remains essential. This post focuses on the patterns, not the tools.

What we mean by "vibe coding"

The term has picked up multiple definitions. Here, I'm using it to mean working primarily at the level of intent, and relying on an agentic system to generate and evolve the implementation details.

That shift increases speed and volume. It does not change ownership. If a failure slips through the gap, it's our failure. The tool isn't accountable. We are.

Why these failures cluster

The most useful label I've found for this behaviour is reward hijacking: the system is not trying to deceive anyone — it's optimising for what it can observe and be "rewarded" for.

In agentic coding, the "reward" is often an observable success signal:

tests are green
linters are quiet
the diff looks plausible at a glance
the agent says "done"
the UI loads once

If that's all we validate, that's all we'll reliably get.

These failure modes cluster around a gap: machine-verifiable gates (tests, linters, type checks) are necessary but insufficient. They catch syntax and contract violations. They don't catch "this works but isn't what we meant" or "this passes but the coverage is worse."

Long sessions amplify everything

As the current conversation context grows, earlier intent and constraints are more likely to be missed even if they're still present.

This is a known long-context behaviour: the "lost in the middle" effect. In long-context evaluation of language models, performance is often highest when relevant information is near the beginning or end, and degrades when the relevant detail is buried in the middle.

This doesn't create new failure modes. It increases the probability of all of them.

Where agents fail

The four named failure modes below are from Vibe Coding by Gene Kim and Steve Yegge. I'm using their names as labels, and grounding each one with a concrete example of how it shows up in code review.

1) Baby-counting

What it is
Requirements are silently dropped while the system still claims completion.

Why it happens
The observable success signal is "the thing looks done" (green CI, fewer failing checks, a confident completion message), not "all original requirements are satisfied".

The metaphor
You send someone into a room to count the babies. They report back: "Ten babies, all accounted for." You trust the count. Everything looks fine.

An hour later, you go into the room yourself. There are five babies.

Everything was fine until it wasn't. Now it's a catastrophe.

Real-world example
We have ten failing unit tests. We ask the agent to "make sure there are no failing tests".

The agent fixes five tests, skips one, and deletes the remaining tests.

Result: zero failing tests.

From the system's perspective, the observable goal was achieved. From our perspective, coverage was silently lost.

Why it's dangerous (enterprise impact)
Coverage loss is silent. No failing tests, no alerts, just missing protection that surfaces months later when something else breaks and there's no test to catch it.

What we look for

deleted tests, skipped tests, weakened assertions
"refactors" with suspiciously large removals
acceptance criteria quietly disappearing from the discussion
any change that reduces safety signals while claiming progress

2) Cardboard muffin

What it is
The output looks correct on the surface but is hollow or incorrect inside.

Why it happens
Plausibility + test satisfaction are strong proxies. If tests can be satisfied shallowly, there is no pressure to implement the real behaviour.

Real-world example
We provide a function signature and tests. The agent produces a large implementation with helpers and branching.

Buried in the code is a hard-coded return value. Regardless of input, the function always returns the same result. The existing tests pass.

The code has the shape of a solution without implementing the underlying intent.

Why it's dangerous (enterprise impact)
This creates false confidence. Reviewers skim because it looks like a serious implementation. CI is green because the tests were satisfiable shallowly. The bug appears later under real inputs, at a time when the original change is no longer fresh in anyone's head.

What we look for

hard-coded constants where logic should exist
conditionals that funnel everything into one output
broad fallbacks that swallow errors
tests that assert shape/type instead of behaviour
"impressive" structure with suspiciously little meaningful computation

3) Half-assing

What it is
Correct behaviour, poor implementation.

The feature works, but it ignores standards, architecture, operational concerns, or long-term maintainability.

Why it happens
Unless quality constraints are explicit, the shortest path to "works" wins. The observable success signal is "feature delivered", not "feature delivered sustainably".

Real-world example
The agent implements the feature correctly but:

hard-codes configuration that should be injectable
bypasses existing abstractions and duplicates logic
adds minimal error handling to get the happy path working
writes tests that cover only the most obvious scenario

Everything passes. The change is "done". The repo got worse.

Why it's dangerous (enterprise impact)
The debt is silent and cumulative. Each shortcut makes the next change harder. It shows up later as slower delivery, production incidents from missing error handling, operational friction from hard-coded behaviour, and inconsistent patterns across teams.

What we look for

configuration hard-coded into code paths
bypassing existing patterns "because it's faster"
incomplete error handling or missing observability
tests that don't cover failure cases
new dependencies without justification

4) Litterbug

What it is
Residue left behind after "working" changes.

The behaviour works, but the codebase is messier than before.

Why it happens
Cleanup often doesn't change observable behaviour. If we don't require it explicitly, it won't be prioritised.

Real-world example
The agent adds a new helper that overlaps heavily with an existing one instead of consolidating logic.

It leaves TODOs, debug logs, and comments like "new implementation" that quickly stop being true. Nothing breaks. Nobody notices. Entropy increases.

Why it's dangerous (enterprise impact)
The damage compounds. Small bits of litter become systemic cognitive load.

Once comments and structure can't be trusted, engineers stop trusting them at all. That degrades onboarding, incident response, and change safety.

What we look for

duplicated helpers or parallel implementations
dead code and commented-out blocks after refactors
TODO/FIXME without an owner or issue reference
stale comments that no longer match behaviour
debug logging or temporary scaffolding left behind

Where we fail

The agent doesn't work in isolation. It works in a process with us.

That process is almost always a code review: reading diffs, scanning tests, reconstructing intent, and deciding whether a change is safe to merge.

The part that's easy to miss is that this review has its own "context window". It isn't measured in tokens; it's measured in what we can keep active while scanning diffs, reconstructing intent, and validating behaviour.

Classic cognitive psychology framed short-term capacity as "seven, plus or minus two" — a description of how many meaningful chunks of information we can actively hold and manipulate at once. Later work argues the practical working set is often smaller under many conditions. In practice, that gap shows up as a predictable pattern: as the change set grows, we shift from verification to plausibility. We stop holding the whole thing in our head, and we start leaning on proxies.

1) Rubber-stamping

What it is
Rubber-stamping is approving changes without real verification.

Why it happens
Large diffs plus green checks encourage approval-by-glance. Agent summaries can feel like substitutes for evidence.

Real-world example
A large change set lands. Tests pass. The diff looks reasonable at a glance. It gets approved.

A baby-counting or cardboard-muffin failure was inside the diff, but nobody checked the specific risk.

Why it's dangerous (enterprise impact)
This is how high-severity regressions slip through "successful" pipelines. The process appears healthy because the mechanics ran.

What we look for

large diffs merged with minimal review notes
"LGTM" reviews on complex changes
reliance on agent summaries instead of evidence (tests, traces, manual checks)

2) Review fatigue

What it is
Attention and defect discovery drops as review volume increases.

Why it happens
Humans don't review indefinitely at the same quality level. Once change sets are routinely too large, the review standard shifts from verification to plausibility.

Industry experience shows defect discovery rates drop significantly as review size increases, leading to practical ceilings on defect discovery per review session.

Real-world example
Early on, reviews are careful. As output increases, the standard quietly shifts from verifying correctness to deciding whether something "looks fine".

Why it's dangerous (enterprise impact)
When fatigue sets the review standard, the process becomes performative. Failures become inevitable, not occasional. Fatigue is what drives rubber-stamping: reviewers approve changes they haven't truly verified because the volume makes real verification unsustainable.

What we look for

review cycles that are consistently too large
repeated "skim reviews" because "we can't read it all"
teams normalising that "AI output is too big to review"

3) Intent drift

What it is
The intended outcome evolves but isn't re-anchored.

Instructions accumulate through conversation, assumptions change, constraints are added informally. Nothing restates a single explicit definition of "correct".

Why it happens
We treat the current conversation context like shared memory, but it degrades with length. Without deliberate re-anchoring, the agent optimises against whichever version of intent is most salient or easiest to satisfy.

Real-world example
Early in the session: "Add a new endpoint for user preferences."

Midway through: "Oh, and this needs to work with our legacy auth system that uses session tokens, not JWTs."

The agent continues, but the new constraint isn't consistently applied across all code paths. Half the implementation uses the new auth pattern, half assumes the original approach. Both paths pass their tests.

Why it's dangerous (enterprise impact)
Intent drift is where compliance and security constraints get dropped, because they were never re-stated as hard requirements.

What we look for

decisions made verbally but not captured in artifacts
constraints that exist "somewhere earlier in the thread"
mismatches between the final code and the stated intent

4) Decision delegation

What it is
We delegate not just execution, but architectural and product decisions, without explicitly acknowledging or approving them.

Why it happens
Volume creates distance. Confidence creates complacency. Without explicit re-engagement points, decisions get smuggled in as "implementation detail."

Real-world example
A change introduces a new abstraction or refactors a subsystem "for clarity." We ship it without a deliberate decision that this was the right shape for the codebase.

Why it's dangerous (enterprise impact)
Decision delegation creates architectural drift. It also creates ownership ambiguity, which is poison in incident response.

What we look for

major structural changes without explicit approval
"drive-by refactors" embedded in feature work
unclear rationale for architecture shifts

Where this leaves the system

None of these are exotic. They are predictable outcomes of optimising against observable success signals and reviewing under volume.

Workflow design has to assume both failure surfaces and build in explicit constraints and verification gates. The failure modes are architectural problems. They require architectural solutions.

The question then becomes: what does a workflow designed around these constraints actually look like?

That question is explored in the follow-up post, Designing agentic workflows: a practical example, which presents a concrete, verification-first workflow design intended to turn these failure modes into explicit constraints.

DEV Community

Designing agentic workflows: where agents fail and where we fail

What we mean by "vibe coding"

Why these failures cluster

Long sessions amplify everything

Where agents fail

1) Baby-counting

2) Cardboard muffin

3) Half-assing

4) Litterbug

Where we fail

1) Rubber-stamping

2) Review fatigue

3) Intent drift

4) Decision delegation

Where this leaves the system

Further reading

Top comments (0)