The Audit Tax: Why Your Agent Made You Slower

#ai #agents #llm #codereview

Originally published in Temrel, a weekly newsletter on agentic engineering.

You ask an agent to code an update. It takes about 90 seconds to produce the PR. You then spend the next 90 minutes reading it line by line to see if you trust it. You might, whisper it, be shipping code even slower than you were before.

Agent-based development velocity is borrowed time, re-invoiced with interest at review time. The agent writes the PR in seconds; you pay for that speed in the time it takes to decide whether to trust what it has written. This is the Audit Tax.

This is a deliberate sequel to last week's "Stop prompting, start looping." Verification was one of our six dials, and today we focus on that one.

The bottleneck moved while you were watching the leaderboard

Code generation is effectively solved. By mid-2026, even the die-hard holdouts can't seriously argue that coding agents underperform humans in commercial environments. The hard part now is verification.

The old scoreboard measures the wrong thing: model benchmarks, tokens per second, and the rest. The real measurement is how quickly agent-produced code gets into production.

According to LinearB's 2026 Software Engineering Benchmarks Report, AI PRs take 4.6x longer to get reviewed. That is a product of higher volume and faster delivery, and it is the biggest blocker to AI engineering productivity.

Reviewing agent code is harder than reviewing human code

Verification is harder than it looks. You can't interrogate the agent and trust the answer; the hallucination might be buried in the reasoning. Your old heuristics for reviewing human code are unfit for the task:

Agent-written PRs always look clean and self-confident, whether they work or not. Sloppy formatting and thin documentation no longer signal a weak PR, so you can't kick it back on those grounds.

Enforcing small diffs doesn't work either. Try it and "4.6x longer" becomes a stretch goal; you'll be drowning in PRs forever.

Individual reliability means nothing now. John, the old hand who always shipped clean code and earned a cursory review? John's gone. There's just Claude now.

And don't forget: you contribute to The Sloppening every time you push slop to the codebase.

Stop paying the tax by hand. Build the verification layer.

Get your cheap, deterministic gates in first: typecheck, tests, lint, build. You already have them, they're virtually free and fast, and they catch stupid mistakes. Anthropic calls these code-based graders.
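A minimal sketch of that first layer, assuming a Python project: each gate is a command whose exit code decides pass or fail. The gate names and commands in `EXAMPLE_GATES` (`mypy`, `pytest`, `ruff`, `python -m build`) are illustrative assumptions; swap in whatever your project already runs.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    output: str


def run_gates(gates):
    """Run (name, command) gates in order and stop at the first failure."""
    results = []
    for name, command in gates:
        try:
            proc = subprocess.run(command, capture_output=True, text=True)
            passed = proc.returncode == 0
            output = proc.stdout + proc.stderr
        except FileNotFoundError as exc:
            # A missing tool counts as a failed gate, not a skipped one.
            passed, output = False, str(exc)
        results.append(GateResult(name, passed, output))
        if not passed:
            break  # fail fast: later gates never run
    return results


def all_passed(results, gates):
    """True only if every gate ran and every gate passed."""
    return len(results) == len(gates) and all(r.passed for r in results)


# Hypothetical gate commands; substitute your project's real ones.
EXAMPLE_GATES = [
    ("typecheck", ["mypy", "."]),
    ("tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
    ("build", [sys.executable, "-m", "build"]),
]
```

Calling `all_passed(run_gates(EXAMPLE_GATES), EXAMPLE_GATES)` from CI gives a single yes/no answer before any model or human looks at the diff.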

Then add a review subagent: in Anthropic's terms, a model-based grader. Have it check the diff against the stated intent, not just whether it builds and runs.
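One way the review pass could be shaped, as a sketch rather than a prescribed design. The model call is injected as `ask_model`, a hypothetical callable that takes a prompt string and returns the reply text, so no particular provider SDK is assumed. The APPROVE/REJECT reply format is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    reasons: List[str] = field(default_factory=list)


def build_review_prompt(intent: str, diff: str) -> str:
    """Ask for the verdict on the first line so the reply is easy to parse."""
    return (
        "You are reviewing a pull request.\n"
        f"Stated intent:\n{intent}\n\n"
        f"Diff:\n{diff}\n\n"
        "Reply with APPROVE or REJECT on the first line, "
        "then one reason per line."
    )


def review_diff(intent: str, diff: str, ask_model: Callable[[str], str]) -> Review:
    """Grade a diff against its stated intent using an injected model call."""
    reply = ask_model(build_review_prompt(intent, diff))
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    # Fail closed: anything other than an exact APPROVE is a rejection.
    approved = bool(lines) and lines[0].upper() == "APPROVE"
    return Review(approved=approved, reasons=lines[1:])
```

Failing closed matters here: an empty, rambling, or malformed reply is treated as a rejection, so a confused reviewer never waves a PR through.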

Then human-in-the-loop: a person's eyes on anything that survives the deterministic and agent-review gates. The machines clear the early hurdles, and the human lets the output hit production. Anthropic calls these human graders.
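The ordering of the three layers can be written down as a small decision function. This is a sketch under the assumption that each layer reduces to a boolean; the `Decision` names are made up for illustration.

```python
from enum import Enum


class Decision(Enum):
    BLOCKED_BY_GATES = "blocked_by_gates"
    BLOCKED_BY_REVIEW = "blocked_by_review"
    NEEDS_HUMAN = "needs_human"
    MERGE = "merge"


def decide(gates_passed: bool, review_approved: bool, human_approved: bool) -> Decision:
    """Cheapest layer first; a human only sees what the machines let through."""
    if not gates_passed:
        return Decision.BLOCKED_BY_GATES
    if not review_approved:
        return Decision.BLOCKED_BY_REVIEW
    if not human_approved:
        return Decision.NEEDS_HUMAN
    return Decision.MERGE
```

Nothing reaches `MERGE` without the human flag, and nothing reaches the human without clearing both machine layers.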

Evals make verification repeatable, not vibes

Anthropic recommends starting evals early, and so do I. Record the cases where the agent misses requirements, and once you have around 20, start building your evals.
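Recording misses can be as simple as appending to a JSON Lines file. A minimal sketch: the field names (`task`, `expected`, `observed`) and the file layout are assumptions, and the threshold of 20 comes from the rule of thumb above.

```python
import json
from pathlib import Path

EVAL_THRESHOLD = 20  # "around 20" from the text; tune to taste


def record_failure(log_path, task, expected, observed):
    """Append one missed-requirement case as a JSON line."""
    entry = {"task": task, "expected": expected, "observed": observed}
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_failures(log_path):
    """Read every recorded case back; a missing log means no cases yet."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def ready_for_evals(log_path, threshold=EVAL_THRESHOLD):
    """True once enough real failures exist to seed an eval set."""
    return len(load_failures(log_path)) >= threshold
```

Each logged entry later becomes one eval case, so the suite is built from failures the agent actually produced rather than ones you imagined.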

Add your deterministic checks plus an LLM-as-judge for the fuzzy intent. Wire them to triggers so you don't kick them off by hand.
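A sketch of how the two grader types might combine in one eval run. `produce` (the agent under test) and `judge` (the LLM-as-judge) are hypothetical injected callables, so the harness does not depend on any specific model API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    requirement: str
    check: Callable[[str], bool]  # deterministic, code-based grader


def run_evals(
    cases: List[EvalCase],
    produce: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Run each case through the deterministic check, then the judge."""
    results = {}
    for case in cases:
        output = produce(case.requirement)
        # Cheap check first; the judge is only consulted if it passes.
        results[case.name] = bool(case.check(output) and judge(case.requirement, output))
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed both graders."""
    return sum(results.values()) / len(results) if results else 0.0
```

To take the manual step out, call `run_evals` from whatever trigger you already have (a CI job on every agent PR, for example) and fail the job when `pass_rate` drops below a threshold you choose.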

Anthropic has published an in-depth blog post on eval methodology, though it is lighter on technical implementation. Take that as a sign of how early this step in the agentic loop still is.

Action steps (do this week)

Measure your tax: time-to-generate a PR versus time-to-merge it. The gap is the bill.
Add one mandatory CI gate the agent cannot merge past (start with tests or typecheck).
Stand up a 20-case eval from last month's actual agent failures.
Add a "review" pass that checks diffs against intent before they reach you.
Re-measure the gap. Watch the tax drop.
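The first and last steps come down to simple arithmetic on three timestamps per PR. A sketch, assuming you can export when the task was requested, when the PR was opened, and when it merged. The field names are hypothetical, and time-to-merge is taken here as open-to-merge.

```python
from datetime import datetime
from statistics import median


def audit_tax_seconds(pr):
    """Time-to-merge minus time-to-generate for one PR, in seconds."""
    generate = (pr["opened_at"] - pr["requested_at"]).total_seconds()
    to_merge = (pr["merged_at"] - pr["opened_at"]).total_seconds()
    return to_merge - generate


def median_audit_tax(prs):
    """Median gap across PRs; the median resists one runaway review."""
    return median(audit_tax_seconds(pr) for pr in prs)


# The opening example: 90 seconds to generate, 90 minutes to review.
example_pr = {
    "requested_at": datetime(2026, 1, 5, 9, 0, 0),
    "opened_at": datetime(2026, 1, 5, 9, 1, 30),
    "merged_at": datetime(2026, 1, 5, 10, 31, 30),
}
print(audit_tax_seconds(example_pr))  # 5310.0 seconds, about 88.5 minutes
```

Run it before and after adding the gates; the change in the median is the number to watch.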

Why this matters

This is the reframing of the dev career ladder. We started with context engineering (2024), then loop engineering (2026). Follow the thread and you become one of the top players in software development, set up well for what's next.

Whoever owns verification owns the bottleneck, and whoever owns the bottleneck owns the leverage. Code generation is solved. The tax is rigorous evaluation.

Pay the tax on purpose, or pay it by accident.

Subscribe to Temrel for weekly agentic engineering field notes.

Top comments (1)

Raju Dandigam • Jul 1

This is exactly the adoption curve many engineering teams are hitting. The agent made the PR faster, but the human still has to rebuild enough context to decide whether the change is safe. That means the bottleneck moved from generation to verification, and the team may not be faster unless the verification layer improves. I think the winning workflows will pair agents with strong receipts: tests, traces, diffs tied to intent, dependency evidence, and objective gates that reduce review ambiguity.