Tenzai published Bad Vibes earlier this year: fifteen vibe-coded apps run through five AI-testing tools, every tool scored on what it caught. The headline findings are blunt. Zero of fifteen apps had CSRF protection on state-changing routes. SSRF showed up in every single testing tool — the scanner that was supposed to check your code had the same vulnerability class it was built to find.
Tenzai's methodology is rigorous and their framing is fair. Nothing in this post argues against running Snyk or Semgrep. If you only buy one layer, buy the scanner. The point of this post is what the scanner cannot tell you, which is what Tenzai's paper also does not claim it can.
The week after the paper dropped, we ran a simple experiment. We took an Amex test card (the one that starts with 3782-) and tried to complete checkout on five live vibe-coded apps we'd found in public launch threads. The card was accepted by four of the five. The fifth returned a 500.
We grabbed the handler on that fifth app. It looked fine:
```js
// card.js
function validateCard(number) {
  const cleaned = number.replace(/\s/g, "");
  if (!/^\d{16}$/.test(cleaned)) { // assumes every card number has 16 digits
    throw new Error("Invalid card number");
  }
  return luhn(cleaned); // luhn() defined elsewhere in the app
}

async function checkout(req, res) {
  const card = validateCard(req.body.card); // throws on 15-digit cards; no try/catch
  const charge = await paymentProvider.charge(card, req.body.amount);
  res.json({ ok: true, charge });
}
```
Reads the card, validates sixteen digits, runs Luhn, charges. On a Visa or Mastercard test number (4111-1111-1111-1111, sixteen digits) this passes. On an Amex test number (fifteen digits, starts with 3782), validateCard throws, the uncaught error climbs up the async stack, the framework's default error handler returns 500. There's no code-level CVE here. The scanner signs off because there is nothing to flag. Any unit test on validateCard probably asserts that malformed inputs get rejected, which they do; the integration test, meanwhile, almost certainly used the standard 4111... Visa number, because that's what's in the tutorial.
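For contrast, here is a minimal sketch of what brand-aware validation looks like. The prefix and length rules are the standard published ones (Amex is 15 digits, Visa is 13/16/19, and so on); the `luhn()` helper is assumed to exist, as in the original handler, so this version returns the cleaned number and leaves the Luhn check to the caller.

```javascript
// Brand-aware card validation sketch. Length rules vary by brand,
// which is exactly what the sixteen-digit regex above misses.
const BRAND_RULES = [
  { brand: "amex",       prefix: /^3[47]/,           lengths: [15] },
  { brand: "visa",       prefix: /^4/,               lengths: [13, 16, 19] },
  { brand: "mastercard", prefix: /^5[1-5]|^2[2-7]/,  lengths: [16] },
  { brand: "discover",   prefix: /^6(011|5)/,        lengths: [16, 19] },
];

function validateCard(number) {
  const cleaned = String(number).replace(/[\s-]/g, "");
  const rule = BRAND_RULES.find((r) => r.prefix.test(cleaned));
  if (!rule || !rule.lengths.includes(cleaned.length) || !/^\d+$/.test(cleaned)) {
    throw new Error("Invalid card number");
  }
  return cleaned; // caller should still run luhn(cleaned) before charging
}
```

The point is not this particular table; it's that length is a property of the brand, not of "cards," and the checkout handler still needs a try/catch so a rejected card becomes a 400, not a 500.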
This is the gap Tenzai's own methodology documents but does not fill. They counted code-level findings across 15 apps × 5 tools and published the distribution. Two other recent studies sit at the same layer: Veracode found 45% of LLM-generated code failed OWASP Top 10 across 100+ models, and CSA reported a 62% overall vulnerability rate using a similar static methodology. Useful numbers. None of them measure what happens when a real user tries to buy something. Our informal answer after an afternoon of testing five apps: one in five breaks on Amex. Probably also breaks on Discover, Diners, UnionPay — we didn't check.
The pharma parallel, briefly
Drug safety has the same structural problem and figured it out half a century earlier. In vitro assays (isolated cells in a dish) catch one class of toxicity: direct molecular damage. Clinical trials catch a different class: effects that only appear when the compound meets a living metabolism, dosing schedule, and patient population. You do not run clinical trials instead of in vitro assays. You run both, because each answers a question the other cannot. Nobody in pharma argues the in vitro people have been replaced. The people running clinical trials are not "more rigorous," they are testing a different surface.
Scanners and discovery-based testing are the same relationship. The scanner sees the code; a discovery agent sees the running app. Either one alone is a partial answer. Both together is the answer.
What discovery-based testing (Layer 3) adds that scanners (Layer 1) cannot
A discovery agent given a deployed URL walks the user journeys the app actually has. It fills the checkout form with a valid Visa test number and checks for success, then retries with an Amex number, a Discover number, a billing address in a country the form's validation library does not know about, and a coupon code that triggers some conditional path buried in the server's state machine. Each attempt is a separate journey; each result is a specific fact about the running app.
The journey either completes or it does not. Completion is a binary, measurable fact about the deployed system. It does not require a written selector, a mock, or a testing script authored in advance. It does require the running system, which is exactly what scanners cannot see.
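That journey loop can be sketched in a few lines. Everything here is illustrative: the `/checkout` path, the payload shape, and the `post` function (injected so the sketch is not tied to one HTTP client) are all assumptions; the card numbers are the standard public test numbers for each brand. The only assertion per journey is "did it complete?"

```javascript
// One journey per card brand; one binary fact per journey.
const JOURNEYS = [
  { name: "visa",     card: "4111111111111111" },
  { name: "amex",     card: "378282246310005" },
  { name: "discover", card: "6011111111111117" },
];

// `post(path, body)` is a placeholder for whatever HTTP client you use;
// it should resolve to something with { ok, status }.
async function runJourneys(post) {
  const results = [];
  for (const j of JOURNEYS) {
    let res;
    try {
      res = await post("/checkout", { card: j.card, amount: 100 });
    } catch (e) {
      res = { ok: false, status: 0 }; // network failure is also a failed journey
    }
    results.push({ journey: j.name, completed: res.ok, status: res.status });
  }
  return results;
}
```

Run against the fifth app from our experiment, the Visa journey completes and the Amex journey reports `completed: false, status: 500`, which is the whole finding in one row of output.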
We have limits. Our agents will miss race conditions that only appear under concurrent load, because we do not yet generate sustained traffic patterns well, and bugs that only surface after thirty days of accumulated data are out of reach for any run we do this afternoon. CVEs in a dependency are also not our job; that is what Layer 1 is for.
One concrete next step
If you ship AI-generated code to a preview URL and you have not personally tried to complete the highest-value journey on it using an unusual-but-valid input (Amex card, non-US address, apostrophe in last name, long email), do that before you read another testing article. The bug is probably already there; the only question is whether you find it or a user does.
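If it helps, here are those four inputs as a checklist you can paste into whatever form-filling harness you already have. The field names are placeholders; map them onto your own checkout form.

```javascript
// Unusual-but-valid inputs for the highest-value journey.
// Every value here is legal; a 500 on any of them is a bug.
const EDGE_INPUTS = [
  { field: "card",     value: "378282246310005",              why: "Amex: 15 digits, not 16" },
  { field: "country",  value: "NZ",                           why: "non-US address format" },
  { field: "lastName", value: "O'Brien",                      why: "apostrophe through escaping" },
  { field: "email",    value: "a".repeat(64) + "@example.com", why: "long but valid local part" },
];
```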
Tenzai counted the code-level bugs. Go count the behavioral ones on your own deploy.