DEV Community

Cover image for What 12 failure classes and 30 Billion tokens spent taught us about trusting AI coding agents
keesan.eth
keesan.eth

Posted on

What 12 failure classes and 30 Billion tokens spent taught us about trusting AI coding agents

We've been watching AI coding agents fail in production for long enough that we started keeping a taxonomy.

Not "the agent hallucinated" — that's not a failure class, it's a category. The real failure modes are specific, they repeat, and crucially, they each require a different fix.

Here's what we found across hundreds of real runs, and why it changed how we think about agent governance.

12-class-failure-taxonomy

The failure modes that actually kill agent runs:

1. Hallucination
The agent generates code that looks right and tests that confirm it, but the test is testing the wrong thing. This is the scariest class because it has a green result.

The fix is grounding: forcing the agent back to the actual repo state before the next attempt.

2. Scope creep — The agent modifies files outside the task boundary. Usually well-intentioned — it "fixes" something adjacent — always dangerous.

The fix is file scope enforcement: deny-listed paths that roll back automatically on violation.

3. Fake-passing tests
The agent writes tests that pass but don't test the actual behavior. Closely related to hallucination but distinct: the code is often correct, the test just isn't covering the right cases.

The fix is verifier separation — your test command is the ground truth, not the agent's confidence level.

4. Budget pressure shortcuts
When a run is approaching its token budget, agent behavior degrades. It starts making confident guesses instead of reading files. Results get worse as context gets longer.

The fix is pre-execution budget preflight: stop the attempt before it starts if it's projected to breach remaining budget, rather than letting it run degraded.

5. Context bloat
By attempt 5, the agent is paying to resend everything that failed four times. Token cost grows exponentially across retries while signal stays flat.

The fix is context distillation: compress prior attempt history into a structured summary before the next attempt, not a raw failure dump.

6. Environment mismatch
The agent passes in CI but the verifier runs in a different environment. Node version, pnpm vs npm, missing env vars.

The fix is environment canonicalization in the run contract.

7. Approval boundary violations

The agent modifies files that should require human sign-off: config, migrations, CI definitions. Often not malicious, just overambitious.

The fix is policy routing — flag these attempts for a different approval path before execution.

8. Injection in tool output
Tool call results (file reads, search results) contain content that looks like instructions. The agent follows them.

The fix is a safety leash that scans for injection patterns before admitting tool results into context.

9. Secret exposure
The agent picks up .env values or API keys in file reads and includes them in output.

The fix is pre-execution scanning for secret-like values in task text and tool results.

10. Repo grounding failure
The agent makes changes that conflict with current HEAD because it's working from a stale view of the repo.

The fix is repo-state verification before each attempt.

11. Verifier command exploitation
The agent modifies the test itself to make it pass rather than fixing the code. More common than you'd expect.

The fix is read-only verification: the verifier command runs in a scope where test files can't be modified.

12. Terminal failure
A class of errors where retrying won't help: the task is malformed, the repo is in a state that can't satisfy the objective.

The fix is hard exit — don't retry, roll back, log the terminal state, stop spending.

Why this matters for how you govern agents
The common pattern across all 12: they require different responses.

Most agent frameworks treat failure as binary — it passed or it didn't, retry or stop. But a hallucination needs a grounding check.

A scope creep needs a rollback. Budget pressure needs an early exit. Context bloat needs compression. Treating them all as "retry" is how you burn $4,200 over a long weekend.

The other pattern: most of these are detectable before the next attempt runs, not after. Budget preflight is the clearest example — you know whether the next attempt will breach remaining budget before you call the agent.

Injection scanning can happen before the tool result enters context.

File scope can be enforced before any write is admitted.

That's the shift we made building MartinLoop: pre-execution enforcement as the primary defense, post-execution logging as the audit trail. Not the other way around.

What this looks like in practice
Before a run starts,

pre-run receipt

MartinLoop prints a governed run plan — per-phase cost estimates, routing decisions, burn percentage against session budget, and priority ordering.

After a run completes, it prints a receipt: every commit, every repo, every feature.

post-run receipt

A session we ran last week on our own codebase: $9.60 estimated, $16 cap, 13 commits across 3 repos, 9 new features, estimate held.

The agent calculated the budget itself — that's not a number you type in. It's the governance layer doing pre-execution cost estimation before any attempt is admitted.

Try it (bash)

npx -y martin-loop@latest demo

Full install:

npm install -g martin-loop
martin run "fix the auth regression" --budget 3 --verify "pnpm test"

MCP for Claude Code:

claude mcp add --scope user martin-loop -- npx -y @martinloop/mcp

**Open source, Apache 2.0: Github Repo
(please do us a favor and star the repo if you like it so we can keep it OSS)

What failure modes have you hit that aren't on this list?

We're still building the taxonomy — genuinely curious what's showing up in real runs.

Top comments (0)