Anton Gulin

Playwright Just Shipped the Fix For Flaky Tests I Built 3 Years Ago

I shipped a self-healing test framework three years ago. Nobody called it agentic then. The word "agent" was what your antivirus company ran on your laptop.

I called my three internal components Planner, Generator, and Healer. Not because I'd read a paper — because those were the three jobs the pipeline needed and I was out of clever names.

Last October, Playwright v1.56 shipped native Test Agents. Three of them.

They're called Planner, Generator, and Healer.

This week's v1.59 release added the infrastructure that makes the three-role pattern actually viable in production — video receipts via page.screencast, MCP interop via browser.bind(), and async disposables for clean resource management. The agents shipped in October. The AI test automation architecture they need shipped last week.

So this is a post about a pattern that just got validated by the team that ships the framework I bet my career on. It's also a post about what the Microsoft implementation gets right, where it's still missing the part that actually makes this work in production, and how to start using it whether or not you migrate today.

If you're a QA architect, test lead, or SDET who's ever been told to "just make the flaky tests pass" — this one's for you.

The Problem: the flake tax nobody budgets for

Here's a number every engineering manager underestimates: the flake tax.

On a team I worked with years ago — mid-stage B2B SaaS, 12 engineers, 8 services — the suite had about 1,200 end-to-end tests. Roughly 4% flaked per run. Sounds tolerable. It wasn't.

  • 4% of 1,200 tests ≈ 50 spurious failures per run; × 20 PR runs per day ≈ 1,000 spurious failures per day
  • Every spurious failure triggers a re-run, a triage, a Slack thread
  • On a good week, 3 engineers burned a full day each chasing ghosts
  • On a bad week (release freeze, CI degradation, upstream flake), the fallout could consume the whole team for two sprints
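The arithmetic above is worth making concrete so you can plug in your own suite's numbers. A back-of-the-envelope sketch, assuming flake rate means the fraction of tests that fail spuriously on any given run:

```typescript
// Back-of-the-envelope flake-tax model, using the numbers above.
// flakeRate is the fraction of tests that produce a spurious
// failure on any given run (an assumption; measure your own).
function spuriousFailures(tests: number, flakeRate: number, runsPerDay: number): number {
  return Math.round(tests * flakeRate * runsPerDay);
}

// 1,200 tests, 4% flake, 20 PR runs a day:
console.log(spuriousFailures(1200, 0.04, 20)); // 960 — roughly 1,000 a day
```

Every one of those failures triggers a re-run or a human glance, which is where the engineer-weeks go.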

That's the flake tax. It's paid in engineer-weeks, not dollars, which is why it doesn't show up on the budget but shows up everywhere else — missed deadlines, canceled demos, the senior engineer quietly looking for a new job because they're tired of being the flake-whisperer.

The traditional fix is discipline: write better locators, wait on the right events, don't trust the backend, quarantine flakes, review the quarantine weekly, blah blah. All true. All inadequate at scale. Discipline is linear; flake is exponential.

Eventually I stopped fighting flake as a writer and started designing against it as an architect.

The Drama: the 2-week death-march that broke me

I won't name the company or the release. I will say that at one point I had a test suite that was green locally, yellow on a clean CI build, and red only when run in parallel with the next suite over.

The failure was non-deterministic. The reproduction wasn't. It happened every Tuesday between 10:14 AM and 10:22 AM.

We lost two weeks to it. I tried everything. I tried everything again. I tried everything in a different order. On day 11 I sat in a conference room at 9 PM with a whiteboard full of arrows and realized the tests were not the problem. The test infrastructure was the problem. My framework assumed the application was the only thing being tested. It wasn't. The CI runner was being tested too. So was the database snapshot restore job. So was the deployment timing on the staging environment.

We fixed that specific bug. But the death-march taught me the thing I'd refused to see: test maintenance is not a writing problem. It's an architecture problem. The tests don't need more discipline. The framework around them needs more intelligence.

That's where the three-role pattern was born.

The Solution: An AI Test Automation Architecture in Three Roles

Here's the pattern, condensed. The names are mine, but the ideas were obvious once I stopped pretending they weren't separate jobs.

Planner

Job: given a feature, a user story, or a production incident, produce a structured test plan. Not test code — a plan. A list of flows, edge cases, pre-conditions, cleanup.

Why it's separate: planning and writing are different skills. If one component does both jobs, tests drift from plans. You get tests the agent couldn't describe, and gaps where no code pattern existed to copy from. Planning first forces completeness before cleverness.

What I built three years ago: a template-driven plan generator that read from PR descriptions, Jira tickets, and production alerts, and produced a Markdown spec engineers reviewed before any code was written. Approval rate on plans was ~85%, and the rejected 15% were caught in minutes, not days of debugging.
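A plan-as-artifact can be as simple as a typed structure rendered to Markdown for PR review. A minimal sketch — the `TestPlan` shape and field names here are illustrative, not the schema of my old framework or of any shipping tool:

```typescript
// Hypothetical plan shape. Field names are illustrative.
interface TestPlan {
  feature: string;
  flows: string[];         // happy paths to cover
  edgeCases: string[];     // boundaries, error states
  preconditions: string[]; // data/auth the tests assume
  cleanup: string[];       // what must be torn down after
}

// Render the plan to Markdown so it can be reviewed in a PR
// before any test code exists.
function renderPlan(plan: TestPlan): string {
  const section = (title: string, items: string[]) =>
    `## ${title}\n${items.map((i) => `- ${i}`).join("\n")}`;
  return [
    `# Test plan: ${plan.feature}`,
    section("Flows", plan.flows),
    section("Edge cases", plan.edgeCases),
    section("Preconditions", plan.preconditions),
    section("Cleanup", plan.cleanup),
  ].join("\n\n");
}
```

The point is the artifact, not the code: a reviewer can reject a bad plan in minutes, before the Generator ever runs.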

Generator

Job: take an approved plan and produce the test code. Choose the locators, write the assertions, set up the fixtures.

Why it's separate: code generation benefits from narrow context (the plan), not broad context (the whole codebase). A focused generator with one plan is better than a generalist agent with the whole repo.

What I built: a generator that output Playwright/TypeScript tests from plan Markdown, with locator strategies (data-testid preferred, role-based fallback, text-based last-resort), fixture scaffolding, and soft-assertion patterns baked in.
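That locator preference order is easy to encode. A sketch of the core decision — given what the generator knows about a target element, emit the Playwright locator expression, preferring stable strategies over brittle ones. The `ElementInfo` shape is mine for illustration; `getByTestId`, `getByRole`, and `getByText` are real Playwright APIs:

```typescript
// What the generator knows about a target element,
// in descending order of reliability. Illustrative shape.
interface ElementInfo {
  testId?: string; // data-testid — most stable
  role?: { name: string; accessibleName: string }; // ARIA role fallback
  text?: string; // visible text — last resort
}

// Emit the Playwright locator expression for one element.
function emitLocator(el: ElementInfo): string {
  if (el.testId) return `page.getByTestId(${JSON.stringify(el.testId)})`;
  if (el.role)
    return `page.getByRole(${JSON.stringify(el.role.name)}, { name: ${JSON.stringify(el.role.accessibleName)} })`;
  if (el.text) return `page.getByText(${JSON.stringify(el.text)})`;
  throw new Error("no usable locator strategy for element");
}

console.log(emitLocator({ testId: "checkout-btn" }));
// → page.getByTestId("checkout-btn")
```

The narrow input is the design choice: the generator never reasons about the repo, only about one element at a time from one approved plan.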

Healer

Job: when a test fails, diagnose whether the failure is real (app bug), structural (locator stale after a UI refactor), or environmental (CI flake). Fix the structural ones. Flag the real ones. Quarantine the environmental ones with context.

Why it's separate: this is the one nobody wanted to believe at the time. Healing is not about re-running failed tests until they pass. That's not healing; that's hiding. Real healing is triage plus targeted mutation plus review.

What I built: a healer that diffed the current DOM against the last green run, proposed three locator candidates when the old one was stale, scored each against a stability heuristic, and opened a PR with the one-line change for a human to review. Merge rate on healer PRs was ~80%. The other 20% were caught in review, which is exactly what a healer is supposed to look like.
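The stability heuristic doesn't need to be clever. A sketch of the kind of scoring that made the candidate ranking work — the weights and patterns here are illustrative, not the ones from my original implementation:

```typescript
// Score a candidate selector for expected stability.
// Higher is better. Weights are illustrative, not tuned.
function stabilityScore(selector: string): number {
  let score = 0;
  if (selector.includes("data-testid")) score += 10; // survives UI refactors
  if (/getByRole|\[role=/.test(selector)) score += 6; // semantic, fairly stable
  if (/:nth-child|:nth-of-type/.test(selector)) score -= 8; // breaks on reorder
  if (/\.css-|\.sc-/.test(selector)) score -= 5; // generated class names churn
  score -= selector.split(">").length - 1; // deep chains are brittle
  return score;
}

// Pick the best of the healer's candidates for the human-reviewed PR.
function bestCandidate(candidates: string[]): string {
  return [...candidates].sort((a, b) => stabilityScore(b) - stabilityScore(a))[0];
}
```

The human review step is what keeps a heuristic this crude safe: a bad pick costs one rejected PR, not a silently broken test.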

The Numbers

I don't love citing numbers without naming the shop, but the NDAs are explicit on that. So here's what's defensible:

  • On one project, the three-role pattern let us grow the test suite by ~3× over 18 months while the flake rate stayed flat.
  • On another, we cut the test-maintenance time-per-engineer by more than a third in the first quarter after rollout.
  • On a third, the Healer caught a UI-refactor regression pattern (100+ tests stale from a single CSS rename) and produced a single-PR fix overnight. The alternative would have been a 3-week cleanup sprint.

These numbers are not magic. They are the mechanical consequence of separating concerns and instrumenting the boundary between them. If you already do this with your services in production, you already know why it works.

Now Playwright Ships It Natively

Playwright v1.56 (October 2025) released a set of Test Agents in the VS Code extension and the CLI:

  • Planner agent — explores the app, writes structured test plans
  • Generator agent — converts plans into test code
  • Healer agent — fixes failing tests with AI assistance

The release notes span three versions: v1.56 shipped the agents themselves, v1.58 shipped the token-efficient CLI (playwright-cli), and v1.59 shipped the agent-facing APIs — browser.bind() for MCP interop and page.screencast for video receipts. The naming and the split are what matter — Microsoft built the same architecture I built. They built it better in several specific ways and worse in one.

What Microsoft got right

Each agent is separate. You can run Planner alone, pass its output to Generator, and never touch Healer. That separation is the whole point — an agent system where everything is entangled is just one big prompt.

The agents are optional. You don't have to buy in all at once. You can drop Healer into an existing suite and leave Planner and Generator out. That's how adoption actually happens in enterprise shops.

They shipped the infrastructure, not just the agents. Two pieces matter here:

  • browser.bind() — added in v1.59. It exposes a running browser over a named pipe or websocket. Any MCP client can attach.
  • Playwright MCP Bridge — a free Chrome extension that connects your already-open tabs to a local Playwright MCP server. Your real cookies. Your real profile. Your real logged-in session. Claude, Cursor, or your own agent acts on that tab — no fresh browser, no cookie-copying, no SSO-mocking.

Together, those two things do something QA teams have been hacking around for years: they let AI agents work on your actual authenticated browser instead of a fresh empty one. Microsoft built the plumbing. You don't have to.

What the official implementation is still missing

The contract. Self-healing is not a feature; it's a contract between the test, the app, and the team. The Healer agent will happily propose fixes — but who reviews them? Who owns the approval policy? Who escalates when the Healer's fix rate drops? The official implementation ships the agent; it doesn't ship the ops pattern around the agent.

That ops pattern is the hard part. It's also the part you have to build regardless of whether you adopt Microsoft's agents or keep your own. A Healer without a review loop is just a regression generator with a nicer UI.

What to Do (Whether or Not You Migrate)

If you're running Playwright already, the path is obvious: try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its output to what you'd have written. Repeat ten times. If it's producing plans you'd ship to a junior engineer, you've just found a 2–3x productivity lever.

If you're on Selenium, Cypress, or something older, the migration math got better with v1.59 this month — but the pattern is portable. You don't need Microsoft's implementation to build this. You need:

  1. Plans as artifacts. Markdown. Version-controlled. Reviewable.
  2. Generators with narrow context. One plan in. One test file out. No repo-wide reasoning.
  3. A healer with a review loop. It proposes, a human approves, CI enforces. If the human always approves, your healer is working. If the human always rejects, your healer is broken. If it's 80/20, it's doing its job.
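That last rule of thumb is worth wiring into whatever dashboard you already watch. A trivial sketch of the health check, with thresholds loosely derived from the 80/20 guideline above (the exact cutoffs are my assumption, not a standard):

```typescript
// Classify healer health from its PR approval rate, per the
// rule of thumb above. Thresholds are illustrative.
type HealerHealth = "working" | "broken" | "unknown";

function healerHealth(approved: number, rejected: number): HealerHealth {
  const total = approved + rejected;
  if (total === 0) return "unknown"; // no signal yet — don't trust it either way
  const rate = approved / total;
  return rate >= 0.6 ? "working" : "broken"; // ~80/20 and above is the healthy zone
}
```

If the metric trends toward "broken", fix the healer's heuristics before touching the tests; the tests are just where the symptom shows up.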

Start with the Healer if you're drowning. Start with the Planner if you're understaffed. Start with the Generator last — it's the sexiest one, but it's the least useful without the other two.

And if you're the AI QA Architect on a team that doesn't have this yet: this post is your new case study. Print it. Paste it in your design doc. Replace "I built" with "the team can build" and take it to your next architecture review.

The Takeaway

Three years ago this pattern was a weird thing a weird architect built because nothing off the shelf solved the problem.

Last October it shipped as a native feature in the framework every serious web team uses. This week's v1.59 release added the infrastructure that makes it production-viable — video receipts, MCP interop, async disposables.

If you're still treating flake as a writing problem, you're three years behind the curve. If you're treating it as an architecture problem, you're on the curve. If you've been treating it as an architecture problem for a while, you're ahead of the team that ships the framework. That's a fine place to be.

The pattern worked then. It ships natively now — agents in v1.56, infrastructure in v1.59. The contract around it is still yours to build.

That's the job.


Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.

Originally published at https://www.anton.qa on April 23, 2026.
