Muggle AI

Actually, vibe coding didn't kill testing — agentic engineering did

Updated May 2026.

A few weeks ago, the agent shipped a one-line fix on a utility I've used a dozen times. CI green. Diff readable. The PR description sounded confident. Six hours later, a completely different surface broke in production, because the small fix had a downstream behavior I never observed. I didn't open the page. The agent had, in a sense. It ran its checks and narrated what it saw. I trusted the narration.

That trust is the problem this post is about.

What changed when "I prompted" became "the AI shipped"

Behavioral verification of the running web product is the missing layer in agentic engineering.

Simon Willison's "Vibe coding and agentic engineering are getting closer than I'd like" hit 784 points on Hacker News on May 6, 2026. Andrej Karpathy gave the same transition a more flattering label, agentic engineering, but the mechanism is identical. The same coding agent now drafts the diff, runs the tests it just wrote, articulates the change, and ships. A human used to sit in at least one of those seats. Now the human sits downstream of the whole loop, reading a description.

The shift is economic, not cultural. When the agent does eighty percent of the typing, the marginal cost of opening the page and clicking around gets priced against the rate at which the next diff is already arriving. So the page-open stops happening. You spot-check. You trust the description. You ship.

What the CodeRabbit data actually says (and what it doesn't)

CodeRabbit's State of AI vs Human Code Generation Report compared 470 pull requests: 320 AI-co-authored, 150 human-only. AI-co-authored PRs contain approximately 1.7x more issues overall: 10.83 issues per PR against 6.45. That's the headline.

The subcategory number people are paraphrasing wrong online: AI-co-authored PRs are 2.74x more likely to add cross-site scripting bugs. That figure is about XSS, full stop, not "security vulnerabilities" in general. Underneath it sit three more subcategory ratios: 1.88x for improper password handling, 1.91x for insecure direct object references, 1.82x for insecure deserialization.

XSS matters here because of where it lives. A CVE scanner can flag a known-vulnerable dependency, but it never executes the live page, and the questions that decide whether XSS exists only have answers there: whether escaping actually happens, whether a posted value round-trips into the DOM cleanly, whether what comes back to a real user is what the prompt asked for. The 2.74x ratio is naming a failure class that only shows up in the running web product.
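
To make that concrete, here is a minimal sketch of the round-trip in Playwright. Everything specific, the staging URL, the selectors, the comment form, is a hypothetical stand-in; the point is that the assertion runs against the rendered DOM, not against the source.

```typescript
// A minimal round-trip check with Playwright. The URL, selectors, and
// field names are hypothetical stand-ins for a real comment form.
import { test, expect } from '@playwright/test';

test('posted value round-trips into the DOM as text, not markup', async ({ page }) => {
  const probe = '<img src=x onerror="window.__xss=1">';

  await page.goto('https://staging.example.com/comments');
  await page.fill('#comment-body', probe);
  await page.click('button[type="submit"]');

  // The assertion is about live behavior: the probe must come back as
  // inert text. If the browser parsed it as markup, an <img> element
  // now exists where the comment text should be.
  const rendered = page.locator('.comment').last();
  await expect(rendered).toHaveText(probe);
  await expect(rendered.locator('img')).toHaveCount(0);
});
```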

Sitting next to that data is the ICSE 2026 paper "Vibe Coding in Practice" by Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe, a systematic grey literature review of 101 practitioner sources and 518 firsthand behavioral accounts. They name the cause directly: "speed–quality trade-off paradox where vibe coders are motivated by speed and accessibility, yet quality assurance practices are frequently overlooked, with many skipping testing." PR quality went down. Practitioner testing rates went down. Both at the same time.

Where existing testing tools fail on agentic-built code

Pick the strongest tool in the closest adjacent category and trace one named failure through it.

Cursor BugBot reviews diffs. It reads code and surfaces issues at the diff layer — high precision on patterns visible in the patch, including some XSS-shaped ones. The class of bug that the 2.74x number is mostly about does not live in the diff. It lives in the round-trip from form submission to rendered DOM, sometimes through a sanitizer config, sometimes through a template layer two repos away. A reviewer agent that reads the diff in isolation cannot reproduce the round-trip. It can flag a string interpolation that looks unsafe; it cannot confirm the browser actually renders the unsafe state.
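
For a sense of what that looks like in code, here is a hypothetical fragment of the kind a diff reviewer sees in isolation. DOMPurify stands in for whatever sanitizer a real repo would use, and every name here is illustrative.

```typescript
// Hypothetical code of the kind a diff reviewer sees in isolation.
// DOMPurify stands in for whatever sanitizer the repo actually uses.
import DOMPurify from 'dompurify';

// In the scenario above, this config lives "two repos away". If its
// allowlist ever grows an event-handler attribute, renderComment
// below becomes an XSS sink, and no single diff shows both halves.
const SANITIZE_CONFIG = {
  ALLOWED_TAGS: ['b', 'i', 'a'],
  ALLOWED_ATTR: ['href'],
};

function renderComment(commentEl: HTMLElement, body: string): void {
  // A diff reviewer can flag this line as an innerHTML sink. Whether
  // it is exploitable depends on SANITIZE_CONFIG, and whether the
  // browser actually renders the unsafe state is only observable on
  // the live page.
  commentEl.innerHTML = DOMPurify.sanitize(body, SANITIZE_CONFIG);
}
```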

The same gap recurs across the adjacent categories. CVE scanners enumerate known sinks but don't load the page. What about test-from-code frameworks like Playwright or Cypress? They require you to already know the assertion you want to write, which is exactly the artifact you don't yet have right after an agent ships a change you didn't fully read. By the time a test-from-traffic platform notices the failure, production users have already met it.

None of those are bad tools. They just aren't sitting in the seat the bug walks through.

What behavioral verification means in practice

In one sentence: open the running web product, operate it the way a real user would, compare the live behavior against the intent that produced the prompt.

That is a different artifact from a test suite. A test suite asks the program a question and accepts the program's answer. Behavioral verification asks the user-facing surface whether it does what the user asked it to do, and the program doesn't get a vote.

Concretely, on a change the agent just shipped (a sketch of these steps as a script follows the list):

  1. Load the deployed preview or the local build of the running web product.
  2. Walk the user flow the prompt described — not the flow the agent claims to have tested, but the one a confused human would actually attempt.
  3. Type unexpected values into the form. Refresh. Click back. Retry.
  4. Compare the live result against the wording of the original prompt, not against the test the agent generated to confirm itself.
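
Here is what those four steps can look like as a script, using Playwright as one possible harness. The flow, the selectors, and the prompt's wording are all hypothetical; the structure, especially asserting against the prompt rather than against the agent's generated test, is the point.

```typescript
import { test, expect } from '@playwright/test';

// The original prompt's wording is the source of truth for step 4,
// not the test the agent generated to confirm itself.
const PROMPT = 'after saving, the user sees their new display name in the header';

test(PROMPT, async ({ page }) => {
  // 1. Load the deployed preview or the local build.
  await page.goto(process.env.PREVIEW_URL ?? 'http://localhost:3000');

  // 2. Walk the flow the prompt described.
  await page.getByRole('link', { name: 'Settings' }).click();

  // 3. Type unexpected values into the form, then save.
  await page.getByLabel('Display name').fill("  Åsa O'Brien <b>  ");
  await page.getByRole('button', { name: 'Save' }).click();

  // 3, continued. Refresh; the saved state should survive.
  await page.reload();

  // 4. Compare the live result against the prompt's wording: the header
  // shows the name as text, trimmed, with no stray markup.
  await expect(page.getByRole('banner')).toContainText("Åsa O'Brien");
  await expect(page.getByRole('banner').locator('b')).toHaveCount(0);
});
```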

This isn't a new category. It's the category that used to be filled by a human, badly and at low scale, but filled. What changed isn't that the layer became valuable. The layer became impossible to fill manually at the speed everything else now runs.

What this means if you're shipping AI-co-authored code today

The question I now write at the top of every agent-shipped PR is shorter than any review checklist:

Did anyone, human or otherwise, open the running web product after this change and confirm it behaves the way the prompt asked for?

If the only "yes" comes from the agent that wrote the code, the author is self-certifying their own work. Useful information. Not a signoff.

The honest thing I'll add, the part I haven't seen a clean answer to yet across six platforms this week (among them the r/devops "ban 'I built…' posts" thread, Joe Colantonio's LinkedIn post, and Zhimin Zhan's running list of test automation tools companies have quietly deprecated): nobody has named what fills the seat once the human can't sit there anymore. I have a guess about the shape. Software that uses the product, not software that reads more code. I don't have the name. If you've built or seen something that does this well, say so in the comments. I'd rather find out from a builder than from the next production incident.


Ben Deng writes about software testing at muggle-ai.com. The longer version of this argument lives on his Substack: When vibe coding becomes agentic engineering, who tests the agents?
