Muggle AI

The Suite and the Code Came From the Same Prompt

If you're using Claude Code or Cursor with Playwright MCP, your test suite and your feature code are coming out of the same agent session. Sometimes literally the same context window.

Your dashboard says everything passes. That's probably true. It's also not what you think it is.

The Structural Problem

Here's the thing a passing suite actually tells you, when the agent wrote both sides:

The assertions the author thought to write are satisfied by the code the author wrote.

That's it. It's not a claim about correctness. It's not a claim about user-facing behavior. It's a statement about internal consistency between two artifacts produced by the same model with the same brief.

Compare that with what you're assuming it means:

The product works for the users who will hit it.

The gap between those two statements is where the bugs live.

A Concrete Shape of It

A test body from a real Playwright MCP session (which I won't reproduce verbatim) looked structurally like this:

```typescript
test('user submits form and sees confirmation', async ({ page }) => {
  await page.goto('/form');
  await page.fill('[data-testid="email"]', 'test@example.com');
  await page.click('[data-testid="submit"]');
  await expect(page.locator('[data-testid="confirmation"]')).toBeVisible();
});
```

The agent added the data-testid attributes to the component in the same task. So the assertion is checking for a selector the agent itself just wrote. The test passes. The test has always passed, from the moment the agent wrote both files, because it cannot structurally fail — the assertion and the markup were produced together.

What the test does not check, and cannot check: whether the confirmation shows up for a user on Safari iOS with a stale service worker; whether the email field accepts a plus sign the backend later rejects; whether Enter-to-submit hits the same path as the button click; whether a second submission double-fires.
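The plus-sign case is the easiest of these to make concrete: a permissive front-end check accepts the address, a stricter back-end check rejects it, and no test written from the shared brief exercises the disagreement. A minimal sketch; both regexes are hypothetical stand-ins, not taken from any real codebase:

```typescript
// Hypothetical front-end and back-end email checks that disagree on
// plus-addressing. Both patterns are illustrative stand-ins.

// Permissive front-end check: anything@anything.tld
const frontEndAccepts = (email: string): boolean =>
  /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);

// Stricter back-end check that forgot to allow '+' in the local part.
const backEndAccepts = (email: string): boolean =>
  /^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/.test(email);

const probe = 'test+tag@example.com';
console.log(frontEndAccepts(probe)); // true  — the form submits
console.log(backEndAccepts(probe));  // false — the API rejects it later
```

A suite generated from one brief validates each side against itself; only a check that spans both layers, written from outside the brief, can see the mismatch.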

None of those were in the brief. So none of them are in the test. And if you point the agent at the same code later and ask it to add more tests, it will add tests for the things its understanding-of-the-code implies are worth checking — which is the same brief, again.
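Double-firing, for what it's worth, is cheap to guard against in the product itself rather than hoping a test catches it. A minimal sketch of an in-flight guard around a submit handler; the names are illustrative, not from any codebase discussed here:

```typescript
// Wrap an async submit handler so a second invocation while the first
// is still in flight becomes a no-op instead of a duplicate request.
function onceInFlight<T>(handler: () => Promise<T>): () => Promise<T | undefined> {
  let inFlight = false;
  return async () => {
    if (inFlight) return undefined; // drop the duplicate submission
    inFlight = true;
    try {
      return await handler();
    } finally {
      inFlight = false;
    }
  };
}

// Usage: two rapid clicks result in one call.
let calls = 0;
const submit = onceInFlight(async () => {
  calls++;
  await new Promise((r) => setTimeout(r, 10)); // simulated request latency
});
Promise.all([submit(), submit()]).then(() => console.log(calls)); // 1
```

The point isn't this particular pattern; it's that nothing in the agent's brief asked for it, so nothing in the agent's suite asserts it.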

The Mirror Problem, Stated Plainly

This is the one-liner I keep using internally because nothing else fits:

The mirror doesn't catch what the mirror doesn't know to show.

The suite is a reflection of the author's model of the product. When the author is an LLM and the suite-writer is the same LLM, you have a reflection of a reflection. Everything inside the loop validates everything else inside the loop. Everything outside the loop is invisible by construction.

Ken Thompson's 1984 Turing Award lecture, "Reflections on Trusting Trust," put the same problem at a different layer: a self-compiling compiler can carry a backdoor that appears nowhere in its source, because every check you write runs through the thing being checked. The countermeasure has to come from outside the toolchain, a second compiler built from unrelated source. Same shape as what we're talking about here: in-loop verification cannot see what the loop didn't know to look for.

Industry numbers say the same thing less romantically. Veracode's State of Software Security has AI-generated code hovering at roughly a 45-55% OWASP pass rate for two years, while HumanEval and friends keep trending upward. The models got better at the benchmark; the code got no safer in the wild.

What This Isn't

I'm going to pre-empt the reasonable pushback, because it matters.

If you have a mature Cypress suite maintained by QA engineers who own the domain — if three humans are keeping a Page Object Model alive and a domain expert is writing assertions — this post is not about you. Unit tests on business logic are not the problem. Snyk, Semgrep, Aikido are not the problem; they do real work in the layer they claim to cover.

The problem is specifically: tool-written code + tool-written tests + dashboard-as-truth. That's the workflow most teams I talk to are actually running in April 2026. It's new enough that the test-authoring feedback loops of the pre-LLM era haven't caught up with it.

A Second Reader

The fix is not more tests from the same brief. The fix is a reader that didn't write the paper. Something that looks at the preview URL and derives user flows from the product surface, not from the test intents. The flows it finds will overlap heavily with what your existing suite covers; the interesting ones are the ones it finds that your suite never considered, because those are the ones your users are quietly hitting.
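As a toy illustration of deriving flows from the product surface rather than from test intents: scan the rendered markup for interactive elements and diff them against the selectors the existing suite asserts on. Everything below (the markup, the suite, the regex scan) is invented for the example; a real second reader would walk the live DOM of a preview URL, not regex over an HTML string:

```typescript
// Toy "second reader": enumerate interactive testids on the surface and
// report the ones no existing suite selector touches.

function surfaceTestIds(html: string): Set<string> {
  // Naive regex scan; a real reader would walk the DOM of a live preview.
  const ids = new Set<string>();
  for (const m of html.matchAll(/data-testid="([^"]+)"/g)) ids.add(m[1]);
  return ids;
}

function uncovered(html: string, suiteSelectors: string[]): string[] {
  const covered = new Set(
    suiteSelectors
      .map((s) => /\[data-testid="([^"]+)"\]/.exec(s)?.[1])
      .filter((id): id is string => id !== undefined),
  );
  return [...surfaceTestIds(html)].filter((id) => !covered.has(id));
}

const markup = `
  <input data-testid="email" />
  <button data-testid="submit">Send</button>
  <button data-testid="resend">Resend</button>
`;
const suite = ['[data-testid="email"]', '[data-testid="submit"]'];
console.log(uncovered(markup, suite)); // [ 'resend' ]
```

The output is exactly the interesting set: elements users can reach that no assertion in the shared-brief suite ever mentions.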

Honest Admission

We built our own version of this (Muggle Test) partly because we had to: we'd been benchmarking our own testing product against a suite the same tools had helped us write, and the first time we ran a non-shared-brief reader over our preview URL, it surfaced a category of regression we'd never configured against. That is embarrassing and worth saying out loud.

Full piece with the Veracode/Georgia Tech proof stack and the academic-review analog on Substack →
What If Your Benchmark Is the Bug?
