Keesan

Posted on Jun 30

What 12 failure classes and 30 Billion tokens spent taught us about trusting AI coding agents

#claude #productivity #opensource #devops

We've been watching AI coding agents fail in production for long enough that we started keeping a taxonomy.

Not "the agent hallucinated" — that's not a failure class, it's a category. The real failure modes are specific, they repeat, and crucially, they each require a different fix.

Here's what we found across hundreds of real runs, and why it changed how we think about agent governance.

The failure modes that actually kill agent runs:

1. Hallucination —
The agent generates code that looks right and tests that confirm it, but the test is testing the wrong thing. This is the scariest class because it has a green result.

The fix is grounding: forcing the agent back to the actual repo state before the next attempt.

2. Scope creep — The agent modifies files outside the task boundary. Usually well-intentioned — it "fixes" something adjacent — always dangerous.

The fix is file scope enforcement: deny-listed paths that roll back automatically on violation.

3. Fake-passing tests —
The agent writes tests that pass but don't test the actual behavior. Closely related to hallucination but distinct: the code is often correct, the test just isn't covering the right cases.

The fix is verifier separation — your test command is the ground truth, not the agent's confidence level.

4. Budget pressure shortcuts —
When a run is approaching its token budget, agent behavior degrades. It starts making confident guesses instead of reading files. Results get worse as context gets longer.

The fix is pre-execution budget preflight: stop the attempt before it starts if it's projected to breach remaining budget, rather than letting it run degraded.

5. Context bloat —
By attempt 5, the agent is paying to resend everything that failed four times. Token cost grows exponentially across retries while signal stays flat.

The fix is context distillation: compress prior attempt history into a structured summary before the next attempt, not a raw failure dump.

6. Environment mismatch —
The agent passes in CI but the verifier runs in a different environment. Node version, pnpm vs npm, missing env vars.

The fix is environment canonicalization in the run contract.

7. Approval boundary violations —

The agent modifies files that should require human sign-off: config, migrations, CI definitions. Often not malicious, just overambitious.

The fix is policy routing — flag these attempts for a different approval path before execution.

8. Injection in tool output —
Tool call results (file reads, search results) contain content that looks like instructions. The agent follows them.

The fix is a safety leash that scans for injection patterns before admitting tool results into context.

9. Secret exposure —
The agent picks up .env values or API keys in file reads and includes them in output.

The fix is pre-execution scanning for secret-like values in task text and tool results.

10. Repo grounding failure —
The agent makes changes that conflict with current HEAD because it's working from a stale view of the repo.

The fix is repo-state verification before each attempt.

11. Verifier command exploitation —
The agent modifies the test itself to make it pass rather than fixing the code. More common than you'd expect.

The fix is read-only verification: the verifier command runs in a scope where test files can't be modified.

12. Terminal failure —
A class of errors where retrying won't help: the task is malformed, the repo is in a state that can't satisfy the objective.

The fix is hard exit — don't retry, roll back, log the terminal state, stop spending.

Why this matters for how you govern agents
The common pattern across all 12: they require different responses.

Most agent frameworks treat failure as binary — it passed or it didn't, retry or stop. But a hallucination needs a grounding check.

A scope creep needs a rollback. Budget pressure needs an early exit. Context bloat needs compression. Treating them all as "retry" is how you burn $4,200 over a long weekend.

The other pattern: most of these are detectable before the next attempt runs, not after. Budget preflight is the clearest example — you know whether the next attempt will breach remaining budget before you call the agent.

Injection scanning can happen before the tool result enters context.

File scope can be enforced before any write is admitted.

That's the shift we made building MartinLoop: pre-execution enforcement as the primary defense, post-execution logging as the audit trail. Not the other way around.

What this looks like in practice
Before a run starts,

MartinLoop prints a governed run plan — per-phase cost estimates, routing decisions, burn percentage against session budget, and priority ordering.

After a run completes, it prints a receipt: every commit, every repo, every feature.

A session we ran last week on our own codebase: $9.60 estimated, $16 cap, 13 commits across 3 repos, 9 new features, estimate held.

The agent calculated the budget itself — that's not a number you type in. It's the governance layer doing pre-execution cost estimation before any attempt is admitted.

Try it (bash)

npx -y martin-loop@latest demo

Full install:

npm install -g martin-loop
martin run "fix the auth regression" --budget 3 --verify "pnpm test"

MCP for Claude Code:

claude mcp add --scope user martin-loop -- npx -y @martinloop/mcp

**Open source, Apache 2.0: Github Repo
(please do us a favor and star the repo if you like it so we can keep it OSS)

What failure modes have you hit that aren't on this list?

We're still building the taxonomy — genuinely curious what's showing up in real runs.

Top comments (1)

Max Quimby • Jul 3

This taxonomy is genuinely useful because it separates classes that look identical from the outside but need opposite fixes. Hallucination and fake-passing tests both show green — but one is cured by grounding and the other by verifier separation, and conflating them means you keep applying the wrong remedy and stay broken.

The one that's burned me most is #4, budget-pressure shortcuts. It's insidious because the degradation correlates with exactly the runs you most need to trust: the long, hard ones. The agent stops reading files and starts confidently guessing right when the task finally got complex enough to strain the budget. Preflighting the attempt instead of letting it limp forward is the move — and I'd add that the compressed summary from #5 is itself a place hallucination sneaks back in, since a lossy digest of four failed attempts can silently drop the one detail that actually mattered.

Is 12 a closed set or a living list? My bet is "stale repo grounding" (#10) and "injection in tool output" (#8) are about to fork into subclasses as agents get more filesystem and network reach — injection-via-retrieved-doc vs injection-via-tool-result already feel like different animals.