DEV Community: Muggle AI

The Suite and the Code Came From the Same Prompt

Muggle AI — Sat, 18 Apr 2026 19:20:26 +0000

If you're using Claude Code or Cursor with Playwright MCP, your test suite and your feature code are coming out of the same agent session. Sometimes literally the same context window.

Your dashboard says everything passes. That's probably true. It's also not what you think it is.

The Structural Problem

Here's the thing a passing suite actually tells you, when the agent wrote both sides:

The assertions the author thought to write are satisfied by the code the author wrote.

That's it. It's not a claim about correctness. It's not a claim about user-facing behavior. It's a statement about internal consistency between two artifacts produced by the same model with the same brief.

Compare that with what you're assuming it means:

The product works for the users who will hit it.

The gap between those two statements is where the bugs live.

A Concrete Shape of It

I don't want to reproduce a real Playwright MCP session verbatim, but the test body looked structurally like this:

test('user submits form and sees confirmation', async ({ page }) => {
  await page.goto('/form');
  await page.fill('[data-testid="email"]', 'test@example.com');
  await page.click('[data-testid="submit"]');
  await expect(page.locator('[data-testid="confirmation"]')).toBeVisible();
});

The agent added the data-testid attributes to the component in the same task. So the assertion is checking for a selector the agent itself just wrote. The test passes. The test has always passed, from the moment the agent wrote both files, because it cannot structurally fail — the assertion and the markup were produced together.

What the test does not check, and cannot check: whether confirmation shows up for a user on Safari iOS with a stale service worker. Whether the email field accepts a plus sign that the backend later rejects. Whether Enter-to-submit hits the same path as the button click. Whether a second submission double-fires.

None of those were in the brief. So none of them are in the test. And if you point the agent at the same code later and ask it to add more tests, it will add tests for the things its understanding-of-the-code implies are worth checking — which is the same brief, again.
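To make one of those unchecked cases concrete, here is a minimal sketch of the plus-sign mismatch. Both validators are hypothetical stand-ins and neither comes from the session described above: `clientAcceptsEmail` plays the form's check and `serverAcceptsEmail` plays a stricter backend rule. A test written from the form's point of view passes, while a sweep of valid-but-unusual inputs exposes the disagreement.

```javascript
// Hypothetical validators, assumed for illustration only.
// The form accepts anything shaped like local@domain.tld.
function clientAcceptsEmail(email) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// The assumed backend only allows letters, digits, dots, underscores and
// hyphens before the @, so a plus sign is rejected.
function serverAcceptsEmail(email) {
  return /^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/.test(email);
}

// Return every input the form lets through but the backend refuses.
function findDisagreements(emails) {
  return emails.filter((e) => clientAcceptsEmail(e) && !serverAcceptsEmail(e));
}

const candidates = [
  "test@example.com",         // the address from the brief
  "test+signup@example.com",  // valid, and common for filtering mail
  "first.last@example.co.uk",
];

console.log(findDisagreements(candidates));
```

The only address a brief-driven test would try is the first one, and both sides agree on it. The disagreement only shows up for an input nobody inside the loop thought to send.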

The Mirror Problem, Stated Plainly

This is the one-liner I keep using internally because nothing else fits:

The mirror doesn't catch what the mirror doesn't know to show.

The suite is a reflection of the author's model of the product. When the author is an LLM and the suite-writer is the same LLM, you have a reflection of a reflection. Everything inside the loop validates everything else inside the loop. Everything outside the loop is invisible by construction.

Ken Thompson's 1984 Turing lecture on trusting trust put the same problem at a different layer: a self-compiling compiler can carry a backdoor that appears nowhere in its source, because any check you write runs through the thing being checked. Thompson offered no in-loop fix. The countermeasure that came later, David A. Wheeler's diverse double-compiling, had to come from outside the toolchain: a second, independently built compiler. Same shape as what we're talking about here: in-loop verification cannot see what the loop didn't know to look for.

Industry numbers say the same thing less romantically. Veracode's State of Software Security has held AI-generated code at roughly 45-55% OWASP pass rate for two years while HumanEval and friends keep trending upward. The models got better at the test; the code got no safer in the wild.

What This Isn't

I'm going to pre-empt the reasonable pushback, because it matters.

If you have a mature Cypress suite maintained by QA engineers who own the domain — if three humans are keeping a Page Object Model alive and a domain expert is writing assertions — this post is not about you. Unit tests on business logic are not the problem. Snyk, Semgrep, Aikido are not the problem; they do real work in the layer they claim to cover.

The problem is specifically: tool-written code + tool-written tests + dashboard-as-truth. That's the workflow most teams I talk to are actually running in April 2026. The workflow is new enough that the test-authoring feedback loop from the pre-LLM era hasn't caught up.

A Second Reader

The fix is not more tests from the same brief. The fix is a reader that didn't write the paper. Something that looks at the preview URL and derives user flows from the product surface, not from the test intents. The flows it finds will overlap heavily with what your existing suite covers; the interesting ones are the ones it finds that your suite never considered, because those are the ones your users are quietly hitting.

Honest Admission

We built our own version of this (Muggle Test) partly because we had to: we'd been benchmarking our own testing product against a suite the same tools had helped us write, and the first time we ran a non-shared-brief reader over our preview URL, it surfaced a category of regression we'd never configured against. That is embarrassing and worth saying out loud.

Full piece with the Veracode/Georgia Tech proof stack and the academic-review analog on Substack → What If Your Benchmark Is the Bug?

Static scanners caught zero behavioral bugs in 15 AI-coded apps. Here's why that's the expected result.

Muggle AI — Fri, 17 Apr 2026 15:56:21 +0000

Tenzai published Bad Vibes earlier this year: fifteen vibe-coded apps run through five AI-testing tools, every tool scored on what it caught. The headline findings are blunt. Zero of fifteen apps had CSRF protection on state-changing routes. SSRF showed up in every single testing tool — the scanner that was supposed to check your code had the same vulnerability class it was built to find.

Tenzai's methodology is rigorous and their framing is fair. Nothing in this post argues against running Snyk or Semgrep. If you only buy one layer, buy the scanner. The point of this post is what the scanner cannot tell you, which is what Tenzai's paper also does not claim it can.

The week after the paper dropped, we ran a simple experiment. We took an Amex test card (the one that starts with 3782-) and tried to complete checkout on five live vibe-coded apps we'd found in public launch threads. Four of the five accepted the card. The fifth returned a 500.

We grabbed the handler on that fifth app. It looked fine:

`// card.js
function validateCard(number) {
  const cleaned = number.replace(/\s/g, "");
  if (!/^\d{16}$/.test(cleaned)) {
    throw new Error("Invalid card number");
  }
  return luhn(cleaned);
}

async function checkout(req, res) {
  const card = validateCard(req.body.card);
  const charge = await paymentProvider.charge(card, req.body.amount);
  res.json({ ok: true, charge });
}`

Reads the card, validates sixteen digits, runs Luhn, charges. On a Visa or Mastercard test number (4111-1111-1111-1111, sixteen digits) this passes. On an Amex test number (fifteen digits, starts with 3782), validateCard throws, the uncaught error climbs up the async stack, and the framework's default error handler returns 500. There's no code-level CVE here. The scanner signs off because there is nothing to flag. Any unit test on validateCard probably asserts that malformed inputs get rejected, which they do; the integration test, meanwhile, almost certainly used the standard 4111... Visa number, because that's what's in the tutorial.
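One way a handler could avoid this failure is to check length per card brand and to report a validation failure as a client error instead of letting it surface as a 500. The sketch below is an illustration under stated assumptions, not the app's actual fix: the `BRANDS` table is deliberately incomplete, and `luhnValid`, `validateCard` and `validationStatus` are hypothetical names.

```javascript
// Simplified brand table, assumed for illustration; real issuer ranges
// (Discover, Diners, UnionPay, newer Mastercard prefixes) are not covered.
const BRANDS = [
  { name: "amex", prefix: /^3[47]/, lengths: [15] },
  { name: "visa", prefix: /^4/, lengths: [13, 16, 19] },
  { name: "mastercard", prefix: /^5[1-5]/, lengths: [16] },
];

// Standard Luhn checksum: double every second digit from the right.
function luhnValid(digits) {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Returns a result object instead of throwing, so the caller picks the status code.
function validateCard(number) {
  const cleaned = String(number).replace(/[\s-]/g, "");
  if (!/^\d+$/.test(cleaned)) return { ok: false, reason: "non-digit characters" };
  const brand = BRANDS.find((b) => b.prefix.test(cleaned));
  if (!brand) return { ok: false, reason: "unsupported brand" };
  if (!brand.lengths.includes(cleaned.length)) {
    return { ok: false, reason: "wrong length for " + brand.name };
  }
  if (!luhnValid(cleaned)) return { ok: false, reason: "checksum failed" };
  return { ok: true, brand: brand.name, number: cleaned };
}

// Maps the validation step to the HTTP status a handler might send.
function validationStatus(cardInput) {
  return validateCard(cardInput).ok ? 200 : 422;
}

console.log(validateCard("3782 822463 10005"));
console.log(validateCard("4111-1111-1111-1111"));
```

Returning a result object rather than throwing is a design choice for the sketch: it keeps an unexpected card format from ever reaching the framework's default error handler.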

This is the gap Tenzai's own methodology documents but does not fill. They counted code-level findings across 15 apps × 5 tools and published the distribution. Two other recent studies sit at the same layer: Veracode found 45% of LLM-generated code failed OWASP Top 10 across 100+ models, and CSA reported a 62% overall vulnerability rate using a similar static methodology. Useful numbers. None of them measure what happens when a real user tries to buy something. Our informal answer after an afternoon of testing five apps: one in five breaks on Amex. Probably also breaks on Discover, Diners, UnionPay — we didn't check.

The pharma parallel, briefly

Drug safety has the same structural problem and figured it out half a century earlier. In vitro assays (isolated cells in a dish) catch one class of toxicity: direct molecular damage. Clinical trials catch a different class: effects that only appear when the compound meets a living metabolism, dosing schedule, and patient population. You do not run clinical trials instead of in vitro assays. You run both, because each answers a question the other cannot. Nobody in pharma argues the in vitro people have been replaced. The people running clinical trials are not "more rigorous," they are testing a different surface.

Scanners and discovery-based testing are the same relationship. The scanner sees the code; a discovery agent sees the running app. Either one alone is a partial answer. Both together is the answer.

What discovery testing (Layer 3) adds that the scanner (Layer 1) cannot

A discovery agent given a deployed URL walks the user journeys the app actually has. It fills the checkout form with a valid Visa test number and checks for success, then retries with an Amex number, a Discover number, a billing address in a country the form's validation library does not know about, and a coupon code that triggers some conditional path buried in the server's state machine. Each attempt is a separate journey; each result is a specific fact about the running app.

The journey either completes or it does not. Completion is a binary, measurable fact about the deployed system. It does not require a written selector, a mock, or a testing script authored in advance. It does require the running system, which is exactly what scanners cannot see.
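The shape of such a run can be sketched as a list of input variants and one binary outcome per journey. Everything here is hypothetical: the `attemptCheckout` parameter stands in for whatever drives the deployed app (a browser agent, in practice, and almost certainly asynchronous), and `fakeApp` is a stub that mimics the sixteen-digit check from the handler shown earlier so the sketch runs on its own.

```javascript
// Unusual-but-valid inputs for one journey. The card numbers are the
// widely published Visa and Amex test numbers; the rest are illustrative.
const variants = [
  { label: "visa", card: "4111111111111111", country: "US" },
  { label: "amex", card: "378282246310005", country: "US" },
  { label: "visa, non-US address", card: "4111111111111111", country: "NZ" },
];

// Stub for the deployed app (an assumption so the example is runnable).
// It reproduces the sixteen-digit check and reports an HTTP-like status.
function fakeApp(input) {
  return /^\d{16}$/.test(input.card) ? { status: 200 } : { status: 500 };
}

// Run every variant through the supplied driver and record completion
// as a plain boolean per journey.
function runJourneys(inputs, attemptCheckout) {
  return inputs.map((input) => {
    const result = attemptCheckout(input);
    return { label: input.label, completed: result.status === 200 };
  });
}

const report = runJourneys(variants, fakeApp);

// List the journeys that did not complete.
console.log(report.filter((r) => !r.completed).map((r) => r.label));
```

Nothing in the runner knows about selectors or mocks. It only needs a way to attempt the journey and a way to tell whether it finished.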

We have limits. Our agents will miss race conditions that only appear under concurrent load, because we do not yet generate sustained traffic patterns well, and bugs that only surface after thirty days of accumulated data are out of reach for any run we do this afternoon. CVEs in a dependency are also not our job; that is what Layer 1 is for.

One concrete next step

If you ship AI-generated code to a preview URL and you have not personally tried to complete the highest-value journey on it using an unusual-but-valid input (Amex card, non-US address, apostrophe in last name, long email), do that before you read another testing article. The bug is probably already there; the only question is whether you find it or a user does.

Tenzai counted the code-level bugs. Go count the behavioral ones on your own deploy.

Why AI Output Quality Plateaus — And What Actually Raises the Ceiling

Muggle AI — Thu, 16 Apr 2026 13:25:44 +0000

Ira Glass made this observation about creative work that stuck with a lot of people: the reason your early work is bad isn't that your ability is low. It's that your taste is already high. You can hear the gap between what you made and what you were trying to make. You know it's not there yet. That's your taste working against you.

The observation was about writers and filmmakers developing their craft, but it maps onto AI output quality more precisely than most people realize.

AI output quality plateaus because AI eliminates the execution gap but cannot close the taste gap — the distance between recognizing good work and producing it. Process and guardrails raise the floor. They don't move the ceiling. This article explains what the taste gap is, why longer specs can't close it, and the one practice that does.

AI closes the ability gap. It does not close the taste gap.

What the Ability Gap Actually Was

Before AI, execution was expensive. Writing a draft took hours. Coding a feature took days. The bottleneck was the doing.

A lot of bad output existed because doing was costly. People shipped the second draft when they knew a fifth draft would be better. Teams built the expedient implementation because the elegant one would take three more days. The execution gap — between knowing what good looks like and being able to produce it — was the binding constraint.

AI collapses that gap dramatically. The draft takes minutes. The feature takes hours. A BCG and Harvard study of 758 consultants measured this directly: consultants who scored below average at baseline gained 43% on task quality when given AI access. The floor rose sharply.

This is real. The gains at the bottom are genuine and significant.

The New Binding Constraint

When AI makes execution cheap, the binding constraint shifts from ability to taste: knowing which problem to solve, which feature to cut, and when technically correct output is wrong for a specific user. AI cannot supply this judgment. It averages across acceptable options. The result is output that works but disappoints anyone with specific standards.

A 2025 arXiv paper on AI output variance found that generative AI systematically compresses human output distribution — the floor rises but the ceiling drops. Ted Chiang called ChatGPT "a blurry JPEG of the web": structure preserved, fine detail lost. Amanda Askell at Anthropic described the dynamic as LLMs providing "the average of what everyone wants."

The average of what everyone wants is not the best version of anything specific.

This is where the taste gap becomes visible. The AI can execute. The AI cannot judge what's worth executing. It cannot tell you which angle on the problem is the interesting one, which feature to cut because the product is already doing too much, when technically correct is the wrong move for this user.

Those judgments are taste. Taste is why good AI output disappoints people with high standards — they can hear what it could have been.

The Spec Problem Is a Taste Problem

There's a concrete version of this that any developer who has worked with AI on a real project has encountered.

Write a thorough spec — two thousand words, every endpoint, every edge case. Feed it to the AI. The AI builds everything on the list. Every requirement is met. The product works.

It also feels like a toy.

Not broken. Not missing features. Just hollow — like a homework assignment that proves the concept without understanding what the concept is for. The AI followed the spec and had no understanding of what production-level software actually needs, or what good design feels like from the user's side. The spec described the parts. Building the parts is not building the product.

The missing information is not writable as spec text. It's taste: the accumulated pattern recognition that tells an experienced engineer when a loading state will feel broken even at 200ms, when an empty state communicates abandonment, when the technically correct dropdown is the wrong choice. That knowledge didn't make it into the spec because it's not articulable as requirements.

Andrej Karpathy later walked back his famous "vibe coding" framing: "You still need taste, architecture thinking." The developer's role shifted from coder to orchestrator, but orchestrating well still requires knowing what good looks like.

Evaluative Taste vs. Generative Taste

Evaluative taste is the ability to judge existing work — scoring, ranking, filtering. Generative taste is the ability to decide what to create — which topic matters, which angle resonates, which details to include and which to cut. AI is improving at evaluative taste. Generative taste remains a human capacity.

There's a nuance worth naming here. AI can learn some forms of taste.

A 2024 study found AI achieved 59% accuracy evaluating research pitches — identifying which proposals were strong. That's evaluative taste: scoring what exists against criteria.

Generative taste is different. It's knowing what to create before it exists, which angle is worth pursuing, what the product needs that nobody asked for. The 59% accuracy on scoring does not transfer to 59% accuracy on generating what's worth scoring highly.

Paul Graham's "Taste for Makers" essay argued that taste can be evaluated but not manufactured. You can articulate what makes something good after seeing it. You cannot turn that articulation into a reliable procedure for generating good things. The articulation is always incomplete. Goodhart's Law runs here too: once you optimize against a quality proxy, the proxy stops measuring quality.

The Practice That Moves the Ceiling

The gap is not fixed. Taste develops. The question is how.

Before accepting AI output — code, copy, analysis — pause and name one thing you'd change if you had unlimited time. Not a bug. Not a missing requirement. The thing that's technically fine but wrong for this specific situation.

That practice does two things. First, it forces the articulation of the taste judgment, which is how taste becomes more precise over time. Second, when you can name the thing, you can respecify it. The next AI iteration has a real target instead of an implicit one.

The ability gap closed fast. The taste gap closes through repetition of this specific exercise: recognize, name, specify. Not through better prompts or longer specs.

Glass was right about creative work. Your taste precedes your ability. With AI, ability is nearly free. What remains is closing the taste gap — and that one you have to do yourself.

What did the last AI output get technically right that still felt wrong — and could you name exactly what was off?

Two kinds of AI testing shipped this month. They solve completely different problems.

Muggle AI — Wed, 15 Apr 2026 18:03:17 +0000

Lovable shipped $100 AI pentests. Meta proved LLM-generated tests catch 4x more bugs. Both shipped this month. They solve completely different problems — and the confusion between AI security testing, AI test automation, and AI-generated test suites is making it harder to know which one you need. Neither one touches the layer where most teams are actually losing users.

On March 24, Lovable launched integrated security pentesting in partnership with Aikido — a $1B security unicorn — for $100 per pentest. The same week, Meta published research on a system called JiTTests (arXiv: 2601.22832) showing that LLM-generated unit tests can catch bugs at scale inside a production engineering organization. Both are real advances. Both are well-executed. And both are getting lumped under "AI testing" in a way that obscures what they actually do — and what neither of them touches.

It's worth pulling these apart carefully, because the gap between them is where a lot of teams are quietly bleeding.

What the Lovable + Aikido pentest covers

The Lovable integration runs a full whitebox + blackbox + greybox pentest against your deployed application: OWASP Top 10, LLM Top 10, privilege escalation, IDOR, authentication bypasses. It delivers results in 1–4 hours for $100, against a traditional range of $5K–$50K for an equivalent manual engagement. At that price point, security testing becomes something you can do per deploy rather than per quarter.

That's a meaningful shift. But the boundaries matter: this tests Lovable-built apps only, it tests the deployed application, and it looks for security vulnerabilities. A pentest will tell you whether an attacker can access data they shouldn't. It won't tell you whether your checkout flow breaks when a user applies a coupon code on mobile. That's not a security failure — it's a behavioral failure, and it's explicitly out of scope.

What Meta's JiTTests covers

The Meta paper is the more technically interesting result. The core idea: instead of maintaining a static test suite that grows stale, generate fresh unit tests per code diff — tests specifically designed to fail on the incoming change if it introduces a bug. These are catching tests, not hardening tests.
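The distinction can be shown with a toy example. This is not Meta's pipeline, only a sketch of the idea under assumed names: `parentImpl` is the code before a diff, `changedImpl` is the code after it, and `classify` labels a generated check by how it behaves on each revision.

```javascript
// Code before the diff: applies a percentage discount.
function parentImpl(total, percent) {
  return total - (total * percent) / 100;
}

// Code after a hypothetical diff that accidentally treats percent as a fraction.
function changedImpl(total, percent) {
  return total - total * percent;
}

// A generated check: 10% off 200 should be 180.
function tenPercentOff(impl) {
  return impl(200, 10) === 180;
}

// A weaker check that both revisions satisfy: a zero discount changes nothing.
function zeroDiscount(impl) {
  return impl(200, 0) === 200;
}

// Label a check by its behavior on the parent and the changed revision.
function classify(check) {
  const onParent = check(parentImpl);
  const onChange = check(changedImpl);
  if (onParent && !onChange) return "candidate catch";
  if (onParent && onChange) return "passes on both";
  return "fails on parent";
}

console.log(classify(tenPercentOff));
console.log(classify(zeroDiscount));
```

A check that passes on both revisions can still be kept to guard against future regressions, but it says nothing about the incoming change. A candidate catch is only a candidate: something still has to decide whether the failure is a bug or an intended change in behavior.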

The numbers are compelling: 22,126 tests analyzed, 4x more candidate catches compared to hardening-style tests, and 70% reduction in human review time. The pipeline used Llama 3.3-70B, Gemini 3 Pro, and Claude Sonnet 4 as assessors. Of 41 candidate catches surfaced to human reviewers, 8 were confirmed bugs — 4 of them serious.

The caveats are real, and the paper acknowledges them. Eight confirmed from 41 is a small sample. The oracle problem (determining whether a test failure signals a real bug or a spec change) remains unsolved and requires human judgment. JiTTests works at the unit level: individual functions and their immediate behaviors. It's not testing sequences of actions. It's not testing how a user navigates through your product. And it requires the diff to exist, so by definition it can't catch bugs that live in the interaction between components rather than inside a single changed function.
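To put a rough number on how small that sample is, the confirmed-bug rate can be given a confidence interval. The sketch below uses the Wilson score interval at 95%, which is chosen here for illustration and is not something the post attributes to the paper; `wilsonInterval` is a hypothetical helper.

```javascript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to a 95% confidence level.
function wilsonInterval(successes, trials, z = 1.96) {
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { rate: p, low: center - half, high: center + half };
}

// 8 confirmed bugs out of 41 candidate catches shown to reviewers.
const { rate, low, high } = wilsonInterval(8, 41);
console.log(rate.toFixed(3), low.toFixed(3), high.toFixed(3));
```

With 8 of 41, the point estimate is about 20% and the interval runs from roughly 10% to 34%, which is wide enough that the true confirmation rate is still an open question.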

The gap neither of them fills

There's a third category that both systems structurally ignore: user journey testing.

The checkout flow that silently dead-ends when a promo code is applied. The signup that completes on desktop but drops users on mobile Safari after the email confirmation step. The dashboard that loads correctly in isolation but throws a 403 when navigated to from a shared link. These are behavioral bugs. They only surface when a real user clicks through a sequence of actions — and they're invisible to both a security scanner and a per-diff unit test generator.

A single broken checkout flow costs more in customer lifetime value than a year of testing infrastructure — and most teams discover it only when a customer emails to say something is broken.

Security testing doesn't touch these because they're not vulnerabilities. Code-level catching doesn't touch them because they're not regressions in a single function; they're emergent failures in multi-step flows. Right now, the only reliable way to catch them is manual QA, end-to-end test suites that someone has to write and maintain, or actual user reports.

Penligent published a taxonomy in March 2026 noting that "AI testing" now refers to at least five distinct categories, and that the terminology itself is obscuring which problems are actually being addressed.

Muggle AI is building specifically on this journey testing layer. The approach is paste a URL, get journey coverage across your key flows (muggle.ai). That's a different class of problem from security scanning or unit diffing: you're not testing what code does, you're testing what a user experiences through a sequence of real steps.

Which layer you actually need right now

This isn't a "you need all three" post — that's easy to say and hard to act on. Here's a more honest framing:

If you're handling payments or sensitive user data: security testing is the non-negotiable starting point. The Lovable + Aikido model makes this accessible at a price that removes the excuse.

If you're shipping AI-generated code at speed (vibe coding, rapid prototyping, whatever you want to call it), code-level catching of the JiTTests variety addresses the specific risk that your diff introduces a regression no one reviewed. Those are different threat models.

If users are dropping off or churning from flows that "should work," neither of those tools will find the problem. That's a journey testing gap, and most small teams have none of the three layers covered.

Developer trust in AI-generated output has already slipped — Stack Overflow data shows it falling from 69% to 54% — and the pressure to ship fast hasn't changed. The testing infrastructure hasn't kept pace with the generation infrastructure. That's the actual problem statement.

What March 2026 actually shipped

Two out of three layers got serious investment this month. Security testing is now accessible to teams that previously couldn't afford it. Unit-level catching is showing real signal at Meta's scale, even if the confirmed-bug sample is small. Both are genuine progress.

The third layer — testing what users actually experience when they click through your product — is the hardest to automate, and it didn't ship this month. Testing a behavioral flow requires understanding intent, state, and sequence in a way that doesn't reduce to "does this function return the right value" or "is this endpoint vulnerable to injection." The industry knows what the gap is. The hard part is that solving it means building something that can reason about user experience, not just code behavior. That's a different class of problem — and it's the one most teams discover only after a customer emails to tell them something is broken.

Which of these three layers does your team actually have covered?