I got tired of writing E2E tests, so I built an AI that runs them for me
QA is unavoidable… but painful
Whether you’re building side projects or working professionally, one thing never goes away:
Quality Assurance.
No matter how great your product is, users won’t trust it if it’s full of bugs.
But here’s the reality:
- In side projects, we barely have time to build features — QA gets skipped.
- In companies, QA is mandatory — but it slows everything down.
The cycle becomes:
Fix → QA → Fix → Re-QA → Repeat forever
At some point I started wondering:
Why is ensuring quality still so expensive?
Why is QA still so manual?
The moment everything clicked
While thinking about this problem, I discovered an open-source project called Browser Use.
It lets AI control a real browser.
That’s when the idea hit me:
What if we could write tests in plain English and let AI do the QA?
So I built it.
Meet Test Pilot
Test Pilot is an AI QA tool that runs browser tests for you.
When you run a test, a real browser opens in the cloud (or locally) and the AI performs the QA automatically.
Instead of writing test code, you simply describe what you want to verify.
For example:
- Verify that a user can log in
- Check that a product can be added to the cart
- Ensure the purchase completion screen appears
That’s it.
The AI performs the steps and verifies the results.
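A test like this is essentially a goal plus a handful of plain-English steps. As a rough sketch of how such a test could be modeled (Test Pilot's actual schema isn't public, so every name here is illustrative):

```python
from dataclasses import dataclass, field

# Hypothetical model of a natural-language test: a goal state plus
# plain-English steps. Names are illustrative, not Test Pilot's schema.
@dataclass
class NaturalLanguageTest:
    name: str
    goal: str                                        # state the AI must reach
    steps: list[str] = field(default_factory=list)   # plain-English actions

login_test = NaturalLanguageTest(
    name="user-login",
    goal="The dashboard is visible and shows the user's email",
    steps=[
        "Open the login page",
        "Enter valid credentials and submit",
        "Verify that the dashboard appears",
    ],
)
print(login_test.goal)
```

The point is that the test definition reads like a spec, not like selector-driven code.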
Why natural language?
Because writing tests is expensive.
Test code requires:
- Technical knowledge
- Maintenance
- Constant updates after UI changes
And eventually…
Tests become the thing teams postpone.
Natural language removes that barrier.
You can write tests the same way you write product specs.
This also makes tests accessible to:
- PMs
- Designers
- Non-engineers
Now the whole team can understand what is being tested.
Record once, turn it into QA
We also built a browser extension that records your actions.
Instead of writing tests, you can:
- Open your app
- Click through a flow once
- Save it as a test
No code. No scripting.
Just record and run.
This dramatically lowers the barrier to starting QA.
Run tests in the cloud or locally
Tests can run in:
- Cloud (quick and easy)
- Local Runner (Mac)
Local execution is especially useful when:
- You need logged-in sessions
- Your app is behind a firewall
- You’re testing internal tools
This makes the tool usable from side projects to production teams.
Under the hood
Here’s the high-level stack:
- Frontend: Next.js
- Backend: Python / FastAPI
- Browser automation: Browser Use (Playwright-based)
- Cloud execution: Celery for async test runs
Nothing magical — just technology chosen to reduce the barrier to QA.
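Test Pilot uses Celery for the async runs; as a minimal stdlib-only illustration of the same fire-and-forget pattern (the worker code here is a placeholder, not the real implementation):

```python
# Illustration only: queue several browser-test runs without blocking the
# caller, then collect results. Test Pilot does this with Celery workers;
# this sketch uses a thread pool to show the same shape.
from concurrent.futures import ThreadPoolExecutor

def run_browser_test(test_name: str) -> dict:
    # Placeholder for "launch a browser, let the AI execute the steps".
    return {"test": test_name, "status": "passed"}

executor = ThreadPoolExecutor(max_workers=4)

futures = [executor.submit(run_browser_test, name)
           for name in ["login", "add-to-cart", "checkout"]]
results = [f.result() for f in futures]
print(results)
```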
The real goal
This started as a tool I wanted for myself.
But it grew into the largest product I’ve ever built.
There’s still a lot to improve, and it’s far from perfect.
But if you’ve ever felt that QA is heavy, slow, or constantly postponed…
I hope this can help.
Try it
If this sounds interesting, I’d love your feedback.

Top comments (10)
The test maintenance trap is real. Spent two months last year on a Flutter app where selector changes broke 40% of our E2E suite every sprint. Eventually just disabled half of them to ship faster, which defeats the purpose entirely.
Your approach with AI sounds interesting - how do you handle the balance between test coverage and false positives? I've found that overly smart test systems sometimes pass when they shouldn't, missing real regressions because the AI "figured out" a workaround.
Thanks for the thoughtful comment — this is something we think about a lot.
We try to balance coverage and false positives by defining tests around whether an AI acting as a user can successfully achieve a goal. Each test has a clear goal state, and success is determined by whether that state is actually reached.
If there’s a regression, the AI won’t be able to complete the task. And if the UI becomes confusing or harder to use, the AI tends to fail in the same way a real user would.
That said, the outcome still depends on how actions and assertions are written. We can’t claim this problem is fully solved yet, and we’re actively iterating on prompt design and test authoring to improve the balance.
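The goal-state idea above can be reduced to a predicate over the observed page state. A minimal sketch, assuming a simplified dict representation of the page (in Test Pilot the verdict actually comes from the AI agent, not a hand-written check):

```python
# Illustrative only: success iff every required key/value is present in
# the observed state. Field names are hypothetical.
def goal_reached(page_state: dict, required: dict) -> bool:
    return all(page_state.get(k) == v for k, v in required.items())

observed = {"url": "/dashboard", "user_email": "a@b.com", "toast": None}
required = {"url": "/dashboard", "user_email": "a@b.com"}
print(goal_reached(observed, required))  # True: the goal state was reached
```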
The goal-state approach makes sense. I like that it forces you to think about what success actually looks like rather than just "did this button get clicked."
One thing I'm curious about - when the AI "figures out a workaround" that a human wouldn't take, does that get flagged? Like if the intended flow is broken but the AI finds an alternate path to the goal, that's almost worse than a straight fail because now you have a false pass hiding a UX regression.
Have you seen cases where the AI passed by doing something clever that revealed the test needed better constraints?
We have seen it a few times, although not very often. It tends to happen when there are alternative paths in the product.
For example, the intended flow might be going through the login page, but the AI ends up reaching the dashboard via an existing session or another entry point. If the success condition is only “goal reached,” the test can still pass. When this happens, it’s usually a sign that the success criteria in the test were too weak.
That said, in practice we see far more cases where the AI fails due to interpretation issues than cases where it’s “too clever” and passes. As you mentioned, vague or weak assertions make these false passes more likely.
We definitely don’t consider this problem solved yet — when it happens, we treat it as a signal that the test itself needs to be written more clearly.
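One way to express "stronger success criteria" is to require the intended waypoints in the recorded path, not just the final goal, so an alternate route counts as a flagged result rather than a pass. A hedged sketch with hypothetical page names:

```python
# Sketch: pass only if the goal is reached AND the intended intermediate
# pages were visited, in order. Everything here is illustrative.
def strict_pass(visited: list[str], goal: str, waypoints: list[str]) -> bool:
    if goal not in visited:
        return False
    idx = 0
    for page in visited:
        if idx < len(waypoints) and page == waypoints[idx]:
            idx += 1
    return idx == len(waypoints)

# AI skipped the login page via an existing session: goal reached, but flagged.
print(strict_pass(["/home", "/dashboard"], "/dashboard", ["/login"]))  # False
print(strict_pass(["/login", "/dashboard"], "/dashboard", ["/login"]))  # True
```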
Yeah exactly - it's way easier to say 'make the login work' than write a bunch of assertions about button clicks and form fills. The tricky part is teaching the AI to recognize when it actually succeeded vs just hitting a dead end. I've had cases where it thinks it's done but the app silently failed somewhere. Still beats writing those tests manually though.
the natural language test thing is super smart. I've been there with those brittle Playwright selectors that break every UI tweak - honestly spent more time fixing tests than building features sometimes. your browser recording idea reminds me of how I approach vibe coding now, where I just describe what I want and iterate. curious though, how do you handle flaky tests when the AI misinterprets an instruction or the page loads slow?
Thanks for the comment.
Flakiness is still a work in progress for us. Right now we mainly rely on basic safeguards like waiting a few seconds for pages to load, and we plan to strengthen this with retries and more robust stabilization.
We’ve also found that AI misinterpretation depends heavily on how the test cases are written, sometimes even more than the execution itself. Breaking steps into smaller actions and clearly defining the expected goal state helps reduce ambiguity, and we’re actively improving templates and prompts to make this more reliable.
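The retry-plus-wait safeguard described above is a small pattern on its own. A minimal sketch (names and parameters are illustrative, not Test Pilot internals):

```python
# Illustrative safeguard: retry a flaky step with a short wait between
# attempts, re-raising the last error if every attempt fails.
import time

def with_retries(action, attempts: int = 3, wait_seconds: float = 2.0):
    last_error = None
    for i in range(attempts):
        try:
            return action()
        except Exception as e:   # a timeout or missing element, in practice
            last_error = e
            if i < attempts - 1:
                time.sleep(wait_seconds)  # give the page time to settle
    raise last_error
```

In practice the interesting part is deciding which failures are worth retrying; a real AI-driven regression should still fail after the retries run out.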
Breaking steps into smaller actions is probably the right call. I've noticed with vibe coding that when I write overly clever prompts the AI gets creative in ways I didn't intend - same failure mode you're describing.
The test-case-as-specification thing is interesting though. If your test instructions are good enough to guide the AI reliably, they're basically executable documentation. That's way more valuable than Playwright selectors that only developers can read.
Have you thought about version control for the natural language test definitions? Like tracking when someone changes "click the login button" to "click the blue button on the top right" and whether that breaks things?
That’s exactly how we’re starting to think about it too — once the instructions are reliable enough for the AI to execute, they naturally become a kind of executable documentation.
Versioning the natural-language definitions is definitely something we’ve been thinking about. The moment tests stop being code and become plain language, more people can edit them, which is great — but it also means changes will happen much more often and we need better visibility into them.
The direction we’re interested in is treating each test/workflow like a versioned document, with history and diffs so you can see what changed and when. Being able to correlate wording changes with test outcomes would be really valuable — for example, spotting when a small wording tweak suddenly makes a test flaky or reduces reliability.
It feels like a natural next step once natural-language tests start behaving like a shared spec rather than just automation.
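Surfacing wording changes between two versions of a test is something the stdlib already handles. A small sketch of what a history diff for a natural-language test might look like (the step text is made up):

```python
# Illustration: diff two versions of a natural-language test definition,
# the way a versioned-document history could surface wording changes.
import difflib

old = ["Open the login page", "Click the login button", "Verify the dashboard"]
new = ["Open the login page",
       "Click the blue button on the top right",
       "Verify the dashboard"]

diff = list(difflib.unified_diff(old, new, lineterm=""))
for line in diff:
    print(line)
```

Correlating diffs like this with pass/fail history is what would let you spot a wording tweak that quietly made a test flaky.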
The diffs thing is huge. We ran into this exact problem where someone would tweak the wording on a workflow ("click the button" -> "press the submit button") and suddenly tests would break in weird ways.
Version control for natural language specs makes total sense. Like git blame but for test definitions - you could see exactly when someone changed "login" to "sign in" and correlate that with a 20% drop in success rate.
Honestly the hardest part is probably getting people to understand that a single word change in a natural language test can be just as breaking as changing a function signature. The mental model shift is real.