Agentic Testing Has a Discovery Gap Nobody Talks About
Updated April 2026.
Microsoft moved Playwright MCP into the Playwright CLI this month. Then the tutorials piled up: Hawks & Owls's Medium walkthrough wiring it into Cursor and Claude Code, alexop.dev's "Building an AI QA Engineer" with full repo, plus a Frontend Masters Agentic Playwright workshop. Every one of these is excellent. Every one answers the same question: how do I get an AI agent to author Playwright tests for me?
None of them answers the question that comes first.
Which tests should the agent be writing?
The four-stage pipeline, and the one stage nobody automated
E2E testing has four stages: discovery → generation → execution → verification. The agentic testing wave of 2025–2026 automated three of them. Generation got Playwright MCP, the Cursor sub-agent tutorials, and Mabl's Test Creation Agent. Execution got QA Wolf's managed runners and CI integrations everywhere. Verification got LLM-as-judge plus screenshot diffs.
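To make the pipeline concrete, here's a minimal TypeScript sketch of the four stages as a single interface. The names are illustrative, not any vendor's API; the point is that three of the four methods have real implementations behind them and one doesn't.

// Illustrative only: no vendor ships this exact interface.
type Journey = { name: string; steps: string[] };
type RunResult = { passed: boolean; screenshots: string[] };
type Verdict = { ok: boolean; notes: string };

interface E2EPipeline {
  discover(appUrl: string): Promise<Journey[]>;  // still a human with a list
  generate(journey: Journey): Promise<string>;   // Playwright MCP, Mabl, etc.
  execute(specPath: string): Promise<RunResult>; // managed runners, CI
  verify(run: RunResult): Promise<Verdict>;      // LLM-as-judge, screenshot diffs
}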
Discovery is still a human writing a list.
The discovery stage is the work of deciding which user journeys deserve a test in the first place: checkout, login, password reset, the weird flow finance built last quarter. Plus the work of deciding which failure modes would actually hurt if they shipped broken. Every tutorial assumes you walk in with that list. The list is the bottleneck.
This is why a developer on r/QualityAssurance last week could read four "complete guides to agentic QA" published in thirty days — Katalon, QA Wolf, Tricentis, Momentic — and conclude that nobody agreed on what the term meant. They were each describing a different stop on the same incomplete pipeline.
Why no agent decides which tests to write
External financial auditors don't check every transaction in a public company's ledger. They sample. The audit's value depends almost entirely on which transactions got sampled, not on how thoroughly each sampled one was checked. Bad sampling plus perfect testing equals useless audit. E2E testing has the same shape, and the sampling step is the part nobody automated.
Discovery is harder than generation, and most of the difficulty is upstream of code. There's no schema for "what users do." Only the front-end, the URL structure, the visible state transitions, and a lot of inference. Once you've inferred the journeys, you still have to weight them: a no-code tool will happily generate twenty tests for your homepage carousel and zero for the signup-to-billing path because it has no model of what failure costs. And then there's the state-vs-page problem. Crawling every URL is not the same as exercising every state, and most generation agents conflate the two.
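A hypothetical example of the state-vs-page problem, in TypeScript for concreteness: three distinct checkout states live behind one URL, so a crawler that keys on URLs sees one page where a discovery agent needs to see three journeys.

// Hypothetical data, for illustration: one route, three testable states.
type AppState = {
  url: string;         // what a URL crawler keys on
  stateId: string;     // what actually needs test coverage
  reachedBy: string[]; // interactions, not links
};

const checkoutStates: AppState[] = [
  { url: "/checkout", stateId: "empty-cart", reachedBy: ["visit with no items"] },
  { url: "/checkout", stateId: "card-declined", reachedBy: ["submit expired card"] },
  { url: "/checkout", stateId: "3ds-challenge", reachedBy: ["submit card requiring 3DS"] },
];

// Deduplicating by URL collapses three journeys into one "page":
const pagesSeen = new Set(checkoutStates.map((s) => s.url)).size; // 1
const statesToCover = checkoutStates.length;                      // 3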
In code terms, a discovery-first agent doesn't take a test description as input. It takes a URL.
# generation-first (the 2026 default)
$ playwright-agent --spec "test the checkout flow with a valid card"
> generates test_checkout_valid.spec.ts
# discovery-first (the missing piece)
$ muggle-test https://your-staging-url.com
> crawls the web product
> identifies 14 user journeys, ranked by blast radius
> proposes a test plan
> waits for you to approve or edit before generating anything
The output of the discovery step isn't code. It's a plan a human approves. Generation comes after.
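For concreteness, here's a sketch of what that plan could look like as data. The field names are my assumptions, not any tool's actual schema:

// Assumed shape, not Muggle Test's real output: a ranked, reviewable plan.
type PlanStatus = "proposed" | "approved" | "edited" | "rejected";

interface ProposedTest {
  journey: string;     // e.g. "signup → billing"
  blastRadius: number; // 0 to 1, estimated cost if this ships broken
  status: PlanStatus;
}

const plan: ProposedTest[] = [
  { journey: "signup → billing", blastRadius: 0.95, status: "proposed" },
  { journey: "checkout with valid card", blastRadius: 0.9, status: "proposed" },
  { journey: "homepage carousel", blastRadius: 0.1, status: "proposed" },
];

// Generation runs only over entries a human has flipped to "approved".
const toGenerate = plan.filter((t) => t.status === "approved");

The blastRadius ordering is the whole point: it's what keeps twenty carousel tests from crowding out the signup-to-billing path.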
Have you seen a single agentic testing tutorial that addresses any of these three discovery problems: no schema for user behavior, no model of failure cost, states conflated with pages? I haven't. If you have one I should read, drop it in the comments. I'm collecting counter-examples.
What the differentiation actually is
Honest market read. Shiplight wants you writing YAML intent files. testRigor wants you writing natural-language test descriptions. QA Wolf generates Playwright code against your codebase. Each is honest about authoring being the workflow. None claims to do the discovery step on your behalf. The dishonest ones are the "complete guide" posts that imply discovery is solved because generation is solved.
Generation isn't the bottleneck anymore. Authoring cost collapsed two years ago. The bottleneck is upstream: knowing what to write tests for is still a human's job in 2026, and most of those humans are the engineers who didn't have time to write the tests in the first place.
Where this approach breaks
I'd be lying if I said discovery-first is solved. Token budgets are real on a big site: a first crawl can take 8–12 minutes before the agent has anything useful to say, longer than a full CI run on a pre-written Cypress suite. We don't do mobile yet. If you already have a Playwright suite you trust, we're probably not solving your problem.
For teams shipping fast on the web with no QA function, the discovery step is what closes the gap between "we generated 200 tests" and "we have meaningful coverage." That's the version of agentic testing worth wanting. Most of what's currently shipping is just authoring, faster.
One thing we built
We built the discovery step as a web product. Paste a URL, the agent crawls, you review the plan before any test gets generated. It's on Product Hunt today, no real volume yet. If you want to poke at it and tell us what's broken, that would actually help: Muggle Test on Product Hunt.
What's the most useful test in your suite that nobody would have written if an agent had to guess? Curious which journeys you'd put on that list.