110 failing Playwright tests. Login flows, multi-step form wizards, search filters, file uploads, complex user workflows. Some failures came from missing UI steps. Some from dirty state left behind by previous runs. Some from stale selectors. I fixed all of them in 2 hours. I didn't write a single line of test code.
I built a plugin, playwright-autopilot (https://github.com/kaizen-yutani/playwright-autopilot), that does it.
How the debugging workflow actually works
When you run a test through the plugin, a lightweight capture hook is injected into Playwright's worker process. It monkey-patches BrowserContext._initialize to add an instrumentation listener: no modifications to Playwright's source code, and it works with any existing installation.
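The monkey-patching idea itself is simple: replace a prototype method with a wrapper that attaches instrumentation, then delegates to the original. A minimal sketch of that pattern, using a stand-in class rather than Playwright's real internals:

```typescript
// Stand-in for the patched class; NOT Playwright's actual BrowserContext.
class FakeBrowserContext {
  _initialize(): string {
    return "context-ready";
  }
}

const events: string[] = [];

// Save the original method, then replace it with a wrapper that
// runs the instrumentation first and delegates afterwards.
const original = FakeBrowserContext.prototype._initialize;
FakeBrowserContext.prototype._initialize = function (): string {
  events.push("instrumentation attached"); // hook fires on every init
  return original.call(this);              // original behavior is untouched
};

// Any code path that initializes a context now runs through the hook.
const result = new FakeBrowserContext()._initialize();
```

Because the patch wraps rather than replaces behavior, callers see no difference, which is why this works against any existing Playwright installation.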
From that point, every browser action is recorded:
DOM snapshots — full ARIA tree of the page captured before and after each click, fill, select, and navigation. When a test fails, you see exactly what the page looked like at the moment of failure, and what it looked like one step before.
Network requests — URL, method, status code, timing, request body, response body. Filter by status (400+ to find failed API calls), by URL pattern, or by method.
Console output — errors, warnings, and logs tied to the specific action that produced them. Not a wall of text — scoped to the step that matters.
Screenshots — captured at the point of failure.
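Conceptually, all of that rolls up into one record per action. The field names below are my guess at the shape, not the plugin's actual schema:

```typescript
// Hypothetical shape of one recorded action; the real schema may differ.
interface ActionRecord {
  step: number;
  action: "click" | "fill" | "select" | "navigate";
  selector: string;
  domBefore: string; // ARIA tree snapshot before the action
  domAfter: string;  // ARIA tree snapshot after it settles
  network: { url: string; method: string; status: number }[];
  console: string[]; // console lines scoped to this step only
  error?: string;    // present when the action failed
}

const failedStep: ActionRecord = {
  step: 7,
  action: "click",
  selector: "button[name='Submit']",
  domBefore: "button 'Submit' [disabled]",
  domAfter: "button 'Submit' [disabled]",
  network: [{ url: "/api/checkout", method: "POST", status: 422 }],
  console: ["Warning: required field 'country' is empty"],
  error: "locator resolved to hidden element",
};

// Filtering for failed API calls (status 400+) is one expression away.
const failures = failedStep.network.filter((r) => r.status >= 400);
```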
The AI doesn't dump all of this into context at once. It's built on MCP (Model Context Protocol), so it pulls data on demand — action timeline first, then drills into the specific failing step, checks the DOM snapshot, inspects the network response, reads the console. 32 tools, each returning just what's needed. Token-efficient by design.
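The on-demand pattern can be sketched as a tool registry: the first call is cheap (just the action timeline), and deeper data is fetched only for the step that matters. The tool names and data here are illustrative, not the plugin's real 32 tools:

```typescript
// Illustrative MCP-style tool registry: each tool returns a small,
// scoped slice of the capture instead of the whole recording.
type StepData = Record<number, string>;

const capture = {
  timeline: [
    "1 navigate /checkout",
    "2 fill getByLabel('Email')",
    "3 click getByRole('button', { name: 'Submit' }) FAILED",
  ],
  dom: { 3: "select 'Country' [no option selected]" } as StepData,
  network: { 3: "POST /api/checkout -> 422" } as StepData,
};

// Hypothetical tool names; the real plugin exposes its own set.
const tools: Record<string, (step?: number) => string> = {
  e2e_timeline: () => capture.timeline.join("\n"),
  e2e_dom_snapshot: (step) => capture.dom[step ?? -1] ?? "",
  e2e_network: (step) => capture.network[step ?? -1] ?? "",
};

// Cheap first call: just the action list.
const timeline = tools.e2e_timeline();
// Then drill into the failing step only.
const dom = tools.e2e_dom_snapshot(3);
```

The token savings come from the scoping: the model never holds snapshots for steps it isn't investigating.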
It thinks in user flows, not selectors
Before touching code, the agent maps the intended user journey: "a user logs in, fills out a multi-step form, uploads a file, submits." It walks through the steps a real user would perform and compares that against what the test actually did.
When a step is missing — a dropdown never selected, a required field never filled, a radio button never clicked — it finds the existing page object method in your codebase and adds the call. No new abstractions. Minimal diff.
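At its core, that comparison is a set difference between the intended journey and what the test actually executed. A sketch, with hypothetical step names:

```typescript
// Sketch: find which intended user-journey steps the test never performed.
// Step names are illustrative, not a real API.
const intendedFlow = [
  "login",
  "fill-personal-details",
  "select-country",
  "upload-file",
  "submit",
];

const executedSteps = ["login", "fill-personal-details", "upload-file", "submit"];

function missingSteps(intended: string[], executed: string[]): string[] {
  const done = new Set(executed);
  return intended.filter((step) => !done.has(step));
}

const missing = missingSteps(intendedFlow, executedSteps);
```

Here the difference is "select-country": the dropdown a real user would pick but the test skipped, which is exactly the kind of gap the agent fills with an existing page object call.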
It follows your architecture
Page Object Model, business/service layer, whatever pattern your team uses — it reads your codebase and works within it. Uses getByRole(), getByTestId(), web-first assertions. No page.evaluate() hacks, no waitForTimeout, no try/catch around Playwright actions.
If the application itself is broken — 500s regardless of input, unhandled exceptions in app code — it tells you that instead of working around it.
It learns and remembers
After a test passes, the plugin automatically saves the verified user flow — the exact sequence of interactions that make up the happy path. Next time that test breaks, the agent already knows the intended journey and jumps straight to identifying what changed.
Run e2e_build_flows once across your suite and it captures every test's journey. The agent gets faster over time.
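On disk, a saved flow might look something like this. The format is my guess at the idea, not the plugin's actual file layout:

```json
{
  "test": "checkout completes with valid card",
  "steps": [
    { "action": "navigate", "target": "/login" },
    { "action": "fill", "target": "getByLabel('Email')" },
    { "action": "click", "target": "getByRole('button', { name: 'Sign in' })" },
    { "action": "select", "target": "getByLabel('Country')" },
    { "action": "click", "target": "getByRole('button', { name: 'Place order' })" }
  ]
}
```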
A real example
A checkout test was failing with "locator resolved to hidden element." The usual debugging path: open trace viewer, find the step, read the DOM, realize a country dropdown was never selected so the shipping section never rendered. 20 minutes if you're fast.
The plugin found the same root cause in one run. It pulled the DOM snapshot at the failing step, saw the unselected dropdown with its options sitting right there in the ARIA tree, searched the page objects for selectCountry(), found it, added the call in the service layer, re-ran the test. Passed. One fix, 12 seconds of AI thinking.
Get started
Add the marketplace
/plugin marketplace add kaizen-yutani/playwright-autopilot
Install the plugin
/plugin install kaizen-yutani/playwright-autopilot
Then prompt: Fix all failing e2e tests
https://github.com/kaizen-yutani/playwright-autopilot — star it, try it on your flakiest test, tell me what breaks.
Comments
DOM snapshots before and after each action is the killer feature here. The hardest part of debugging flaky E2E tests is reproducing the exact page state when it failed. If you can see the ARIA tree at the moment of failure, you skip the entire reproduction cycle.
Question: how does this handle tests that fail due to timing issues — like an element appearing 50ms later than expected? Is the snapshot taken at the point of failure or continuously throughout the test run?
Snapshots are taken at the action boundary — before Playwright starts waiting, and after it finishes (whether it succeeded or timed out).
So for a 50ms late element: Playwright's auto-waiting happens inside that boundary. If it retries and finds the element, you get normal before/after. If it times out, onAfterCall still fires with the error — and the "before" snapshot shows exactly what was in the DOM when the wait started.
You also get a diff between before and after: which lines were added, which were removed. So instead of reading two full ARIA trees, you see exactly what changed on the page as a result of that action. It immediately shows you whether that section appeared or not. :)
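The diff idea can be sketched as a plain line-set difference between the two snapshots. This is a simplification; assume the real plugin's diff is more careful:

```typescript
// Sketch of a before/after snapshot diff: which ARIA-tree lines were
// added or removed by the action. A real implementation would use a
// proper diff algorithm; set difference is enough to show the idea.
function snapshotDiff(before: string, after: string) {
  const beforeLines = new Set(before.split("\n"));
  const afterLines = new Set(after.split("\n"));
  return {
    added: [...afterLines].filter((l) => !beforeLines.has(l)),
    removed: [...beforeLines].filter((l) => !afterLines.has(l)),
  };
}

const before = "combobox 'Country'\nbutton 'Continue' [disabled]";
const after =
  "combobox 'Country' [expanded]\nbutton 'Continue'\nregion 'Shipping options'";

const diff = snapshotDiff(before, after);
// diff.added now contains the shipping section that appeared after the action
```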