I had both Playwright and TWD pointed at the same small app for a working day. Same backend, same UI, the same bugs to chase. The point wasn't to declare a winner. It was to notice what each one feels like to live with, especially when an AI agent is in the loop too.
Two different rhythms. Different things they make easy. Different things they make slower. One of them I ended up reaching for much more often.
The first thing you notice: where the tests live
Playwright spawns its own browser. A separate window, a separate context, a separate world. You write tests in Node, run them, watch them happen in a window you can look at but not really use.
TWD lives in the tab you're already developing in. The sidebar sits on top of your app while you work. Click run on a test, the test drives the same page you were just clicking around on, then hands it back to you. You can keep using the app.
That single difference shapes most of what comes after. When you can keep using the app while tests run, you reach for them more often. When tests pull you out of the app into another window, you reach for them less.
The feedback loop
Inside TWD, the cycle is short: write a line, save, click run, watch. If something fails, the failure shows up next to the code that produced it. You're never out of the dev tab.
Playwright's loop feels longer, even when the run itself is fast. There's a separate browser, a terminal, a report, sometimes a trace viewer. Each piece is fine on its own. Together they add up to a context-switching cost.
It isn't that one runs in CI and the other doesn't. Both have a headless CLI runner. Both are fine in CI. The difference is which side of the day each one is designed around. TWD is shaped for the inside of a dev session and treats CI as the export target. Playwright is shaped for the run-the-whole-thing pass and treats local dev as a smaller slice of that.
The debugging story (and why it matters more now)
Both runners are good at telling you why something failed. They tell you in different shapes.
When a TWD test fails, the error is usually about something specific. "Couldn't find this label." "The request payload didn't match." "This element wasn't visible." It's the vocabulary you used to write the test, applied to what just happened. The diagnosis is the test reading itself back to you, in a sentence or two.
When a Playwright test fails, you get richer artifacts. Screenshots, traces, a structured dump of the accessibility tree. The diagnosis is forensic. It's good at telling you what the browser looked like just before things went wrong.
That forensic detail has a cost that didn't matter much when humans were the only audience. Now that the audience is often an AI agent reading test output as part of a session, the cost shows up. A single Playwright failure can drop kilobytes of locator HTML, class lists, and page snapshots into the context window. The agent reads all of it. The token bill scales accordingly. On a long debugging loop where the agent runs the suite, reads the failure, edits a file, runs the suite again, that footprint stacks up fast.
TWD's failures are one-liners. The label name. The payload diff. The selector that didn't match. Cheap to read, cheap for an agent to act on, almost always enough to point at the line that broke.
If your debugging loop involves an LLM in any way, the per-error footprint stops being a stylistic preference and starts being a budget line.
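To make the budget point concrete, here is a back-of-the-envelope sketch. It is not a real tokenizer: it assumes roughly four characters per token, which is only a rule of thumb, and both failure strings are hypothetical stand-ins sized to show the shape of the difference, not measurements from either tool.

```typescript
// Rough heuristic only: real tokenizers vary, ~4 chars per token is an assumption.
const CHARS_PER_TOKEN = 4;

// Estimate how many tokens a piece of test output costs an agent to read.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Total tokens read if the same failure output lands in the agent's
// context once per iteration of a run / read / edit / rerun loop.
function loopFootprint(failureOutput: string, iterations: number): number {
  return estimateTokens(failureOutput) * iterations;
}

// Hypothetical outputs, not captured from TWD or Playwright.
const oneLiner = "Unable to find label: Email";
const verboseDump = "x".repeat(8000); // stands in for ~8 KB of snapshot text

console.log("one-liner, 10 iterations:", loopFootprint(oneLiner, 10));
console.log("verbose dump, 10 iterations:", loopFootprint(verboseDump, 10));
```

The absolute numbers don't matter much. What matters is that the cost is multiplied by every pass through the loop, so the size of a single failure message is the term worth shrinking.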
What you get without having to build it
This is where the gap shows up the most.
TWD ships with coverage. Run the headless runner and you get a report. It also ships with contract testing: every mock you write gets validated against your OpenAPI spec on every run. You don't wire it up. It just runs.
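If "validating a mock against a spec" sounds abstract, the idea can be sketched in a few lines. This is not TWD's implementation or API. The `Schema` type, the `checkMock` function, and the sample data are all hypothetical, and a real OpenAPI schema is far richer than a flat field-to-type map.

```typescript
// Simplified stand-in for a response schema (hypothetical, not OpenAPI itself).
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Compare a mocked response body to the schema and return one short
// message per mismatch. An empty array means the mock still matches.
function checkMock(mock: Record<string, unknown>, schema: Schema): string[] {
  const problems: string[] = [];
  // Every field the schema declares must exist with the declared type.
  for (const field of Object.keys(schema)) {
    if (!(field in mock)) {
      problems.push(`missing field "${field}"`);
    } else if (typeof mock[field] !== schema[field]) {
      problems.push(
        `field "${field}" should be ${schema[field]}, got ${typeof mock[field]}`
      );
    }
  }
  // Fields the schema does not know about are flagged too.
  for (const field of Object.keys(mock)) {
    if (!(field in schema)) {
      problems.push(`unexpected field "${field}"`);
    }
  }
  return problems;
}

// Hypothetical data: a mock that drifted away from the spec.
const userSchema: Schema = { id: "number", email: "string" };
const staleMock = { id: "42", email: "a@example.com", nickname: "al" };

console.log(checkMock(staleMock, userSchema));
```

The value of having this run on every test pass is that a mock can't quietly drift from the backend it pretends to be.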
Playwright's job is testing. Coverage and contract testing aren't part of that job by design. You can add them. There are packages and recipes for both. None of it is hard, but none of it is free either. You write fixtures, capture data, post-process the output, integrate with reporters. The work isn't huge, but it's the kind of work that quietly gets deprioritized.
This isn't a knock on Playwright. It's a question of which tool defines its scope to include the surrounding tooling, and which one stays narrow on purpose.
How they actually compose
The clean reason to choose Playwright is real cross-browser coverage. If your app has to work identically across Chromium, Firefox, and WebKit, Playwright gives you that natively. TWD runs in whichever browser you're developing in, one engine at a time. That's Playwright's lane. It's a real one. Most apps don't actually live in it, though.
But the framing that earned the most ground for me during the session wasn't "pick one." It was layering them.
- TWD on the inside. Every day. The tab you're already developing in. Component mocks, network mocks, fast feedback. Coverage and contract testing carried into CI as part of the same stack.
- Playwright at the gate. A small set of true black-box smokes that run in real Chromium, Firefox, and WebKit before a deploy. Login flow, checkout, anything that has to behave identically across engines. Half a dozen tests at most.
Most teams don't need 200 Playwright tests. They need 200 TWD tests and 6 Playwright tests. The math gets cheaper, the dev loop stays fast, and the cross-browser worry stays answered.
That's the stack worth running, if you're going to run both.
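For the Playwright half of that split, the gate can be little more than a config with one project per engine. A minimal sketch, assuming the smoke tests live in a `./smoke` directory (a hypothetical path, as is the decision to keep them apart from everything else):

```typescript
// playwright.config.ts
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  // Hypothetical location for the handful of black-box smoke tests.
  testDir: "./smoke",
  // One project per engine, so each smoke test runs three times.
  projects: [
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    { name: "firefox", use: { ...devices["Desktop Firefox"] } },
    { name: "webkit", use: { ...devices["Desktop Safari"] } },
  ],
});
```

With half a dozen specs, that is around eighteen test runs before a deploy, which stays quick enough that nobody is tempted to skip it.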
What the session left me with
Both tools earned their place by the end. Just not in the same place.
TWD was the right hand for everything I wrote and rewrote during the session: tests in the dev tab, instant feedback, errors short enough to read at a glance. The headless mode brought coverage and contract checks into CI without a separate setup. That's a lot of testing surface from one config.
Playwright was the right hand for the cross-engine question. Not for every spec. For the small set that has to behave identically across browsers.
Two tools, two scopes, one healthy boundary between them. That's the shape of it.
The TWD runner is at twd.dev. The repo is at BRIKEV/twd.