Updated May 2026.
A coworker pulled the workweek breakdown last Wednesday and showed it to me on Slack. 11.4 hours per week reviewing AI-generated code. 9.8 hours per week writing code by hand. The two numbers crossed sometime in late 2025 and nobody put up a sign. The first instinct is to call this an AI win — look how much we shipped! The second instinct, which takes longer to arrive, is the one that matters: review time is the bottleneck now, and review is the layer where you discover what's actually broken.
This piece is about that second instinct. Specifically, about what review is asking developers to do that nobody trained them for, and why every "AI-native" testing tool you can name still doesn't fix it.
What the verification gap actually is
The verification gap is the distance between code that compiles, passes type checks, passes the test suite an AI agent wrote alongside it, and code that survives contact with a real user doing something the agent did not anticipate. Compilation, types, and tests reflect what the author thought to check. The gap is everything the author didn't think to check.
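To make the gap concrete, here's a minimal sketch (every name in it is hypothetical, invented for this post): the implementation, the test generated beside it, and the input neither one anticipated.

```typescript
import { strict as assert } from "node:assert";

// The agent's implementation: parse a quantity field from an order form.
export function parseQuantity(raw: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1) throw new Error("invalid quantity");
  return n;
}

// The test the agent wrote alongside it. It compiles, type-checks, passes.
assert.equal(parseQuantity("3"), 3);
assert.throws(() => parseQuantity("0"));
assert.throws(() => parseQuantity("abc"));

// What a real user typed. Number("2e3") is 2000, a valid integer the
// author never pictured, so 2000 units land in the cart and every
// check above stays green.
assert.equal(parseQuantity("2e3"), 2000); // passes silently; that's the gap
```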
Sonar's January 2026 State of Code survey put a number on the unease: 96% of developers report they don't fully trust that AI-generated code is functionally correct, and only 48% always check it before committing. Tariq Shaukat, Sonar's CEO, framed the shift in the press release:
"We are witnessing a fundamental shift in software engineering where value is no longer defined by the speed of writing code, but by the confidence in deploying it."
That survey is four months old now, but it aged in the wrong direction. Lightrun's April 2026 State of AI-Powered Engineering Report found that 43% of AI-generated code changes need manual debugging in production after passing QA, and that 88% of companies need 2-3 redeploy cycles to confirm an AI fix actually works. The trust gap got worse, not better, between January and April.
Why your test suite isn't catching what your agent ships
A short answer first: the tests an AI coding agent writes share the same assumptions as the code it writes. The same agent picks the inputs, the same agent picks the assertions, and the same agent decides when the function is "done." A test suite generated this way is an echo, not a check.
I noticed this on our own product. We have a build flow where Claude Code writes a feature, writes the test for the feature, runs the test, sees green, and moves on. For a stretch of weeks the suite was clean and the product had real bugs. The specific failure mode was that the agent only tested what it asked itself to test. Boundary cases it didn't think about were not in the test file, because the test file was a transcript of the agent's own reasoning. Bugs sat outside the transcript.
There's a parallel from cognitive psychology that fits cleanly here: the confirmation bias loop. The tests you can write reflect the bugs you've already considered. New bugs come from places you haven't considered yet, which means a writer acting as its own reviewer cannot, by construction, find them. That's the discovery layer the workweek inversion is pointing at: reviewers spend 11.4 hours per week filling in what the agent didn't think of.
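I'm not claiming there's a clean fix, but you can watch the loop break the moment someone or something else picks the inputs. Property-based testing is one concrete version of that; here's a sketch using the fast-check library against the parseQuantity example above, with an invariant I chose for illustration:

```typescript
import fc from "fast-check";
import { strict as assert } from "node:assert";
import { parseQuantity } from "./parseQuantity"; // hypothetical path to the sketch above

// The generator, not the author, picks the inputs: random strings mixed
// with stringified doubles. The invariant: anything parseQuantity accepts
// must have been a plain decimal integer in the first place.
fc.assert(
  fc.property(fc.oneof(fc.string(), fc.double().map(String)), (raw) => {
    let n: number;
    try {
      n = parseQuantity(raw);
    } catch {
      return; // rejecting an input is always acceptable
    }
    assert.match(raw, /^\d+$/, `accepted non-decimal input ${JSON.stringify(raw)}`);
    assert.equal(n, parseInt(raw, 10));
  })
);
// Running this typically throws with a counterexample like "1e+21" or " 3":
// inputs that were never in the hand-written suite, because nobody had
// considered them yet.
```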
What the AI testing tools actually solve
Open the AI testing category and you get a tag cloud: Mabl, Octomind, testRigor, Shiplight, Checksum, BlinqIO, BaseRock, Bugster, Testers.AI, MuukTest, and a half-dozen more. Pick any four. Underneath the brand colors the input shape is the same: a human still has to author what to verify. YAML files, natural-language prompts, recorded user sessions, pasted Figma flows. Different surface, identical floor.
That's a real category. It just isn't the same problem as discovery. These tools execute tests faster and maintain selectors better than what a small team would write by hand. If you have a list of flows you want covered, they cover them well. If you don't have that list, or if your list reflects only the flows you happened to think of, the tool inherits your blind spot.
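To be concrete about what "author what to verify" means, here's the authored shape, written as a generic Playwright-style flow rather than any one vendor's format (the URL and selectors are hypothetical):

```typescript
import { test, expect } from "@playwright/test";

// A human wrote down this flow, these inputs, and this assertion.
// A tool can execute and self-heal it brilliantly and still only ever
// verify the flows somebody thought to write down.
test("checkout applies a discount code", async ({ page }) => {
  await page.goto("https://staging.example.com/checkout"); // hypothetical URL
  await page.getByLabel("Discount code").fill("SAVE20");
  await page.getByRole("button", { name: "Apply" }).click();
  await expect(page.getByText("Discount applied")).toBeVisible();
});
// Nobody wrote the " SAVE20" (leading space) variant, so nobody runs it.
```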
Aikido, Snyk, and Semgrep belong to a different category entirely: they scan code and dependencies for known vulnerability patterns and CVEs. That layer is useful too, and so is the test-execution layer above. But neither will tell you the checkout flow on your staging environment silently fails when a user applies a discount code with a leading space. That's a different problem.
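Here's roughly what that bug looks like; a hypothetical checkout handler with no CVE, no vulnerable dependency, and nothing for a scanner to object to:

```typescript
// Hypothetical checkout code. Every line is "secure"; one of them is wrong.
const DISCOUNTS = new Map<string, number>([["SAVE20", 0.2]]);

function applyDiscount(total: number, code: string): number {
  const rate = DISCOUNTS.get(code); // " SAVE20" !== "SAVE20": the lookup misses
  return rate === undefined ? total : total * (1 - rate);
}

// A user pastes the code from an email with a stray leading space.
// No exception, no log line, no failed assertion: they just pay full price.
applyDiscount(100, " SAVE20"); // returns 100, silently
```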
What we built, and what we don't do
We built Muggle Test for the discovery layer specifically. You paste a URL. An agent crawls the product, builds a model of the user-facing flows, and runs them; an LLM judges each outcome instead of a pre-written assertion. There's no test file to author and no YAML for the user to write.
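For the shape of that loop, here's a heavily simplified sketch; it's a reconstruction from the description above, not our actual code, and crawl, executeFlow, and askJudge are hypothetical stand-ins for the crawler, browser, and LLM layers:

```typescript
interface Flow { name: string; steps: string[] }
interface Outcome { flow: Flow; finalHtml: string; consoleErrors: string[] }

// Hypothetical helpers standing in for the real layers.
declare function crawl(url: string): Promise<Flow[]>;
declare function executeFlow(flow: Flow): Promise<Outcome>;
declare function askJudge(prompt: string): Promise<string>;

async function discoverAndJudge(url: string): Promise<string[]> {
  const failures: string[] = [];
  for (const flow of await crawl(url)) { // model the user-facing flows
    const outcome = await executeFlow(flow);
    // No pre-written assertion: a model judges the observed outcome.
    const verdict = await askJudge(
      `A user just tried to "${flow.name}". Given the final page state and
console errors below, did the flow plausibly succeed? Reply PASS or FAIL
with a one-line reason.

${outcome.finalHtml}
${outcome.consoleErrors.join("\n")}`
    );
    if (verdict.startsWith("FAIL")) failures.push(`${flow.name}: ${verdict}`);
  }
  return failures;
}
```

The tradeoff is visible right in the sketch: the judge can only rule on what the crawl surfaced, which is exactly the limitation the next paragraph owns up to.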
The DEV.to-honest version of what that actually means: we cover happy paths and the unhappy paths a competent crawler discovers in one pass. Our crawl can miss multi-step authenticated flows that depend on prior session state, where step 4 is only reachable if steps 1, 2, and 3 left specific data behind. Those still need human direction. We're working on it. We're also web-only: server-rendered and SPA apps both work, but mobile is on the roadmap and not in the product yet.
That's the scope. If you already have a Cypress suite you trust, we're probably not for you. If you ship through Cursor or Claude Code and your test coverage matches whatever the last prompt asked for, we're a layer you don't currently have.
What the next twelve months reveal
The interesting question isn't whether AI coding will keep getting faster. It will. The question is what gets exposed when generation is free and discovery isn't. Two things to watch as 2026 closes:
- Whether the workweek ratio skews further (12+ hours reviewing, 8 hours writing), and what teams do when review time exceeds half the engineering week.
- Whether the AI testing category fragments along the discovery line, with one set of tools writing tests for humans to maintain and a different set of tools discovering what humans didn't think to test in the first place.
If you've shipped an AI-generated feature this month, here's a thing worth checking: pull the last user session that hit a real production bug. Trace it back to the PR. Then ask whether any test in the test file would have caught it, and whether anyone would have written that test if the bug hadn't already happened. If the answer is no on both counts, the verification gap isn't an abstraction. It's the next bug you ship.