What I Learned Wiring Playwright Into MCP: The Good, the Flaky, and the Surprising

#ai #playwright #test #programming

The first time it worked, I just sat there.

I'd typed one sentence into a chat window. Something like "go to the staging site, log in with the test account, and add two items to the cart." A browser opened on its own. The cursor moved. Fields filled in. A product landed in the cart. I hadn't written a locator, a click, or a single line of a spec file.

My first reaction wasn't "this is the future." It was "what exactly just happened, and can I trust any of it?"

That question is pretty much what this whole post is about. I've spent the last few years writing Playwright suites the normal way. Selectors, fixtures, page objects, all of it. Wiring Playwright into MCP felt like watching a tool I knew well start speaking a different language. Some of it was genuinely brilliant. Some of it drove me up the wall. And a few things caught me completely off guard.

What this actually is, in plain terms

If you haven't touched it yet, here's the short version.

MCP, the Model Context Protocol, is a way for an AI model to call external tools through one shared interface. The Playwright MCP server is one of those tools. Instead of you writing the browser automation, it hands the model a set of browser actions. Navigate, click, type, read the page, take a snapshot. The model decides what to do and Playwright carries it out.
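
To make "one shared interface" a bit more concrete, here's a rough sketch of what the model's side of the conversation can look like. MCP tool calls are JSON-RPC 2.0 requests with the method "tools/call". The tool names and argument shapes below (browser_navigate, browser_snapshot, browser_click, the element and ref fields) are my assumptions about what a Playwright MCP server exposes, and the URL and ref value are made up, so check the tool list your own server reports before relying on any of them.

```typescript
// Shape of an MCP tool call: a JSON-RPC 2.0 request with method "tools/call".
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Build one request. Each call gets the next sequential id.
function toolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Hypothetical plan a model might emit for "go to the staging site and open login".
// Tool names, argument fields, the URL, and the ref are illustrative assumptions.
const plan: ToolCallRequest[] = [
  toolCall("browser_navigate", { url: "https://staging.example.com" }),
  toolCall("browser_snapshot", {}),
  toolCall("browser_click", { element: "Log in link", ref: "e12" }),
];

console.log(JSON.stringify(plan, null, 2));
```

The point is only that every browser action arrives as the same kind of message. The model picks the tool name and arguments, and the server turns that into a Playwright call.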

The thing that surprised me earliest was how it sees the page. I assumed screenshots, with some vision model squinting at pixels. It isn't, at least not by default. It works off the accessibility tree, the structured, labelled version of the page that screen readers rely on. That one decision explains a lot of what comes next, both the good and the bad.
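
Here's a toy version of that idea, not the real snapshot format. I'm assuming a simplified node shape with just a role, an accessible name, and children, and a hypothetical sign-in page. Notice what isn't in the data at all: no coordinates, no colours, no CSS classes.

```typescript
// Simplified stand-in for an accessibility tree node (an assumption, not the real format).
interface AxNode {
  role: string;
  name: string;
  children?: AxNode[];
}

// Depth-first search for every node matching a role and accessible name.
function findByRole(root: AxNode, role: string, name: string): AxNode[] {
  const matches: AxNode[] = [];
  const stack: AxNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.role === role && node.name === name) matches.push(node);
    for (const child of node.children ?? []) stack.push(child);
  }
  return matches;
}

// Hypothetical snapshot of a sign-in page.
const snapshot: AxNode = {
  role: "main",
  name: "",
  children: [
    { role: "heading", name: "Sign in" },
    {
      role: "form",
      name: "Sign in",
      children: [
        { role: "textbox", name: "Email" },
        { role: "textbox", name: "Password" },
        { role: "button", name: "Log in" },
      ],
    },
  ],
};

const loginButtons = findByRole(snapshot, "button", "Log in");
```

If the page labels its controls properly, a lookup like this has exactly one answer. If it doesn't, there's nothing for the lookup to hold on to.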

So you're not really recording a test anymore. You're describing what you want and letting something else turn that into actions. After years of being deliberate about every selector, that's a strange thing to sit with.

The good

I'll start with what genuinely impressed me, because there was plenty.

Exploring an app got fast. Normally when I land on something unfamiliar I poke around by hand, work out the flows, then start scripting. With MCP in the loop I could just ask it to walk through a feature while I watched. It found steps and edge cases I'd have reached eventually, only quicker. As a way to understand a new system before writing a single real test, it paid for itself straight away.

The accessibility tree approach held up better than I expected. Because it reads structured page data instead of pixels, a layout shift of a few pixels or a colour change doesn't throw it. It reads the page the way the page describes itself. When it found an element, it usually found the right one, by role and label, which is honestly how we're all told we should be writing locators anyway.

It opened the door for people who don't code. I showed it to a manual tester on our team who has never been comfortable writing automation. Within an hour they were driving real flows in plain English. That's not a small thing. So much testing knowledge sits with people who can't translate it into a framework, and this closes part of that gap.

Drafting the boring scaffolding got quicker. Even when I went back to writing proper Playwright code, letting the model map a flow first gave me a decent structure to harden by hand. Less staring at an empty file.
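
As a sketch of what "a decent structure to harden by hand" can mean, here's one way to turn a list of steps into a Playwright spec skeleton. The Step type and both function names are hypothetical, and I'm assuming you've already noted the steps down in this shape yourself. The generated code uses page.goto and getByRole with click and fill, which are standard Playwright calls.

```typescript
// Hypothetical record of steps observed during an exploratory session.
type Step =
  | { kind: "goto"; url: string }
  | { kind: "click"; role: string; name: string }
  | { kind: "fill"; role: string; name: string; value: string };

// Render one step as a line of Playwright code using a role-based locator.
function renderStep(step: Step): string {
  switch (step.kind) {
    case "goto":
      return `await page.goto(${JSON.stringify(step.url)});`;
    case "click":
      return `await page.getByRole(${JSON.stringify(step.role)}, { name: ${JSON.stringify(step.name)} }).click();`;
    case "fill":
      return `await page.getByRole(${JSON.stringify(step.role)}, { name: ${JSON.stringify(step.name)} }).fill(${JSON.stringify(step.value)});`;
  }
}

// Wrap the rendered steps in a test body, leaving assertions for a human to write.
function renderSpec(title: string, steps: Step[]): string {
  const body = steps.map((s) => "  " + renderStep(s)).join("\n");
  return [
    `import { test, expect } from "@playwright/test";`,
    ``,
    `test(${JSON.stringify(title)}, async ({ page }) => {`,
    body,
    `  // TODO: add explicit expect(...) assertions by hand.`,
    `});`,
  ].join("\n");
}

// Hypothetical flow; the URL and control names are placeholders.
const draft = renderSpec("add item to cart", [
  { kind: "goto", url: "https://staging.example.com" },
  { kind: "fill", role: "textbox", name: "Email", value: "test@example.com" },
  { kind: "click", role: "button", name: "Add to cart" },
]);
```

The TODO is deliberate. The steps are the cheap part. The assertions are where I still want my own judgement.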

For a few days I was sold. Then I pushed it on real work and the cracks showed.

The flaky

Here's where I have to be straight with you, because the title promised flaky and the title wasn't bluffing.

It isn't deterministic, and tests are meant to be. Run the same instruction twice and you don't always get the same path. Once it logged in through the main form. Another time it spotted a "continue with email" route I'd forgotten was even there and went that way instead. Both reached the goal. But a test that takes a different road on each run is a test you can't fully trust, because you no longer know exactly what it checked. In automation, "it got there somehow" is not the same as "it did the thing I meant."
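
One cheap way to at least notice this is to log the actions from each run and compare them. A minimal sketch, assuming each run is recorded as a plain list of action strings (the trace format and the step names here are invented for illustration):

```typescript
// Return the index of the first step where two runs differ, or -1 if they are identical.
function firstDivergence(runA: string[], runB: string[]): number {
  const shared = Math.min(runA.length, runB.length);
  for (let i = 0; i < shared; i++) {
    if (runA[i] !== runB[i]) return i;
  }
  // Same prefix: either identical, or one run simply has extra steps.
  return runA.length === runB.length ? -1 : shared;
}

// Two hypothetical runs of the same "log in" instruction.
const mainForm = ["navigate /login", "fill Email", "fill Password", "click Log in"];
const emailRoute = ["navigate /login", "click Continue with email", "fill Email", "click Continue"];

const divergedAt = firstDivergence(mainForm, emailRoute);
```

Both runs can end on the same page and still diverge at step 1. A check like this doesn't make the run deterministic, it just stops the difference from going unnoticed.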

It resolves ambiguity confidently, and sometimes wrongly. When my instruction was vague, it didn't stop to ask. It picked a reading and ran with it, totally sure of itself. "Submit the form" once meant a Save Draft button rather than the real submit, and the run looked perfectly fine while testing the wrong thing. A green result that quietly checked the wrong path is the most dangerous green there is.
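
What I wanted in that moment was for it to refuse to guess. Here's a small sketch of that behaviour, assuming a simplified list of candidate controls and a hypothetical resolveStrict helper. It's similar in spirit to how Playwright locators are strict by default and throw when an action's locator matches more than one element.

```typescript
interface Candidate {
  role: string;
  name: string;
}

// Resolve a pattern to exactly one button, or fail loudly instead of picking one.
function resolveStrict(candidates: Candidate[], wanted: RegExp): Candidate {
  const matches = candidates.filter((c) => c.role === "button" && wanted.test(c.name));
  if (matches.length !== 1) {
    const names = matches.map((m) => m.name).join(", ") || "none";
    throw new Error(`Expected exactly one match for ${wanted}, found ${matches.length}: ${names}`);
  }
  return matches[0];
}

// Hypothetical form with two plausible readings of "submit the form".
const formButtons: Candidate[] = [
  { role: "button", name: "Save Draft" },
  { role: "button", name: "Submit order" },
  { role: "link", name: "Cancel" },
];

// A precise pattern resolves cleanly.
const submit = resolveStrict(formButtons, /^Submit order$/);

// A vague pattern matches both buttons and throws rather than guessing.
let vagueError = "";
try {
  resolveStrict(formButtons, /save|submit/i);
} catch (err) {
  vagueError = (err as Error).message;
}
```

A red result that says "I found two candidates" is far more useful than a green one that quietly picked Save Draft.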

It stalls on the genuinely odd parts. Custom date pickers, drag and drop, canvas elements, anything that isn't a clean, clearly labelled control. These tripped it up far more than they trip up a locator I've tuned by hand for that exact widget. The accessibility tree's strength turns into a weakness the moment the interface stops describing itself properly. And plenty of real apps describe themselves badly.

Cost and speed pile up. Every step means a model deciding what to do next. For a quick bit of exploring, fine. For a full regression run, the latency and the token cost are real, and a plain Playwright suite still wins comfortably on both.
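
The arithmetic is simple enough to sketch. Every number below is a placeholder I picked to show the shape of the calculation, not something I measured, so plug in your own step counts, token usage, and pricing. The estimate also assumes flows run one after another rather than in parallel.

```typescript
interface RunProfile {
  steps: number;               // browser actions in one flow
  tokensPerStep: number;       // prompt plus completion tokens per model decision
  secondsPerStep: number;      // model round trip plus the browser action itself
  usdPerMillionTokens: number; // blended token price
}

// Estimate total tokens, cost, and serial wall-clock time for a number of flows.
function estimate(p: RunProfile, flows: number): { tokens: number; usd: number; minutes: number } {
  const tokens = p.steps * p.tokensPerStep * flows;
  return {
    tokens,
    usd: (tokens / 1_000_000) * p.usdPerMillionTokens,
    minutes: (p.steps * p.secondsPerStep * flows) / 60,
  };
}

// Placeholder values for illustration only, not measurements.
const profile: RunProfile = {
  steps: 12,
  tokensPerStep: 5000,
  secondsPerStep: 4,
  usdPerMillionTokens: 3,
};

const oneFlow = estimate(profile, 1);   // a single exploratory walk-through
const suite = estimate(profile, 200);   // a hypothetical 200-flow regression run
```

With these made-up inputs, one flow is 60,000 tokens and under a minute, while 200 flows come to 12 million tokens, 36 dollars, and 160 minutes run serially. The per-step overhead is easy to ignore once and hard to ignore at suite scale.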

None of this means it's broken. It means it behaves like a sharp assistant improvising, not a deterministic test runner. Different tools, different jobs.

The surprising

A few things I didn't see coming at all.

The bottleneck moved to how well I describe things. I thought the hard part would be setup. It wasn't. The hard part was writing instructions precise enough to get the same behaviour twice. The clearer and tighter my wording, the better it ran. Which, once I sat with it, is just a fuzzier version of the discipline good test design already asks for. The skill didn't vanish. It changed shape.

Watching it work became its own debugging tool. Seeing the model talk through its choices, "the login link wasn't visible, so I opened the menu first," surfaced assumptions about my own app I'd long stopped noticing. A couple of times it exposed genuine clunkiness in the UX, not test problems. I went in to judge a testing tool and came out with notes for the product team.

It made me value my boring old suite more, not less. I'd half expected to walk away feeling replaceable. Instead I came away clearer on what handwritten, deterministic tests are actually for. The predictability I used to take for granted turns out to be the entire point in the places that matter.

So where would I actually use it?

I keep landing on a simple split.

For exploring, prototyping, getting to know a new app, and helping teammates who don't write code contribute, Playwright through MCP is a real addition to the toolbox. I'll keep using it there. It's fast, it's approachable, and it lowers a barrier that's been stubbornly high for years.

For the regression suite that gates a release, though, especially anything I'd have to defend to someone who can stop that release, I still want deterministic tests I wrote and can explain. It's the same conclusion I keep reaching with AI in testing in general. Lean on it where being wrong costs little, and keep a firm hand where being wrong is expensive.

That's not a dig at the tooling. It's just knowing which job you're doing.

Where I've landed, for now

Wiring Playwright into MCP was one of the more interesting things I've done in a while, mostly because it refused to resolve neatly into "great" or "useless." It's both, depending on what you ask of it.

The good is genuinely good. The flaky is genuinely flaky. And the most surprising part was realising the tool didn't reduce the need for testing judgement. It just moved it, from writing the steps to deciding which steps to trust.

I'm still working out exactly where it fits in my week. These are early notes, not a verdict.

If you've wired Playwright into MCP, or any model into a real browser, I'd love to hear where it held up for you and where it fell apart. I get the feeling we're all still mapping the edges of this one.