I got tired of writing E2E tests, so I built an AI that runs them for me
QA is unavoidable… but painful
Whether you’re building side projects or working professionally, one thing never goes away:
Quality Assurance.
No matter how great your product is, users won’t trust it if it’s full of bugs.
But here’s the reality:
- In side projects, we barely have time to build features — QA gets skipped.
- In companies, QA is mandatory — but it slows everything down.
The cycle becomes:
Fix → QA → Fix → Re-QA → Repeat forever
At some point I started wondering:
Why is ensuring quality still so expensive?
Why is QA still so manual?
The moment everything clicked
While thinking about this problem, I discovered an open-source project called Browser Use.
It lets AI control a real browser.
That’s when the idea hit me:
What if we could write tests in plain English and let AI do the QA?
So I built it.
Meet Test Pilot
Test Pilot is an AI QA tool that runs browser tests for you.
When you run a test, a real browser opens in the cloud (or locally) and the AI performs the QA automatically.
Instead of writing test code, you simply describe what you want to verify.
For example:
- Verify that a user can log in
- Check that a product can be added to the cart
- Ensure the purchase completion screen appears
That’s it.
The AI performs the steps and verifies the results.
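A test like this is essentially a goal plus a handful of plain-English steps. As a rough sketch of how such a test could be modeled (Test Pilot's actual schema isn't public, so every name here is illustrative):

```python
from dataclasses import dataclass, field

# Hypothetical model of a natural-language test: a goal state plus
# plain-English steps. Names are illustrative, not Test Pilot's schema.
@dataclass
class NaturalLanguageTest:
    name: str
    goal: str                                        # state the AI must reach
    steps: list[str] = field(default_factory=list)   # plain-English actions

login_test = NaturalLanguageTest(
    name="user-login",
    goal="The dashboard is visible and shows the user's email",
    steps=[
        "Open the login page",
        "Enter valid credentials and submit",
        "Verify that the dashboard appears",
    ],
)
print(login_test.goal)
```

The point is that the test definition reads like a spec, not like selector-driven code.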
Why natural language?
Because writing tests is expensive.
Test code requires:
- Technical knowledge
- Maintenance
- Constant updates after UI changes
And eventually…
Tests become the thing teams postpone.
Natural language removes that barrier.
You can write tests the same way you write product specs.
This also makes tests accessible to:
- PMs
- Designers
- Non-engineers
Now the whole team can understand what is being tested.
Record once, turn it into QA
We also built a browser extension that records your actions.
Instead of writing tests, you can:
- Open your app
- Click through a flow once
- Save it as a test
No code. No scripting.
Just record and run.
This dramatically lowers the barrier to starting QA.
Run tests in the cloud or locally
Tests can run in:
- Cloud (quick and easy)
- Local Runner (Mac)
Local execution is especially useful when:
- You need logged-in sessions
- Your app is behind a firewall
- You’re testing internal tools
This makes the tool usable from side projects to production teams.
Under the hood
Here’s the high-level stack:
- Frontend: Next.js
- Backend: Python / FastAPI
- Browser automation: Browser Use (Playwright-based)
- Cloud execution: Celery for async test runs
Nothing magical — just technology chosen to reduce the barrier to QA.
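Test Pilot uses Celery for the async runs; as a minimal stdlib-only illustration of the same fire-and-forget pattern (the worker code here is a placeholder, not the real implementation):

```python
# Illustration only: queue several browser-test runs without blocking the
# caller, then collect results. Test Pilot does this with Celery workers;
# this sketch uses a thread pool to show the same shape.
from concurrent.futures import ThreadPoolExecutor

def run_browser_test(test_name: str) -> dict:
    # Placeholder for "launch a browser, let the AI execute the steps".
    return {"test": test_name, "status": "passed"}

executor = ThreadPoolExecutor(max_workers=4)

futures = [executor.submit(run_browser_test, name)
           for name in ["login", "add-to-cart", "checkout"]]
results = [f.result() for f in futures]
print(results)
```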
The real goal
This started as a tool I wanted for myself.
But it grew into the largest product I’ve ever built.
There’s still a lot to improve, and it’s far from perfect.
But if you’ve ever felt that QA is heavy, slow, or constantly postponed…
I hope this can help.
Try it
If this sounds interesting, I’d love your feedback.

Top comments (10)
The test maintenance trap is real. Spent two months last year on a Flutter app where selector changes broke 40% of our E2E suite every sprint. Eventually just disabled half of them to ship faster, which defeats the purpose entirely.
Your approach with AI sounds interesting - how do you handle the balance between test coverage and false positives? I've found that overly smart test systems sometimes pass when they shouldn't, missing real regressions because the AI "figured out" a workaround.
Thanks for the thoughtful comment — this is something we think about a lot.
We try to balance coverage and false positives by defining tests around whether an AI acting as a user can successfully achieve a goal. Each test has a clear goal state, and success is determined by whether that state is actually reached.
If there’s a regression, the AI won’t be able to complete the task. And if the UI becomes confusing or harder to use, the AI tends to fail in the same way a real user would.
That said, the outcome still depends on how actions and assertions are written. We can’t claim this problem is fully solved yet, and we’re actively iterating on prompt design and test authoring to improve the balance.
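The goal-state idea above can be reduced to a predicate over the observed page state. A minimal sketch, assuming a simplified dict representation of the page (in Test Pilot the verdict actually comes from the AI agent, not a hand-written check):

```python
# Illustrative only: success iff every required key/value is present in
# the observed state. Field names are hypothetical.
def goal_reached(page_state: dict, required: dict) -> bool:
    return all(page_state.get(k) == v for k, v in required.items())

observed = {"url": "/dashboard", "user_email": "a@b.com", "toast": None}
required = {"url": "/dashboard", "user_email": "a@b.com"}
print(goal_reached(observed, required))  # True: the goal state was reached
```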
The goal-state approach makes sense. I like that it forces you to think about what success actually looks like rather than just "did this button get clicked."
One thing I'm curious about - when the AI "figures out a workaround" that a human wouldn't take, does that get flagged? Like if the intended flow is broken but the AI finds an alternate path to the goal, that's almost worse than a straight fail because now you have a false pass hiding a UX regression.
Have you seen cases where the AI passed by doing something clever that revealed the test needed better constraints?
We have seen it a few times, although not very often. It tends to happen when there are alternative paths in the product.
For example, the intended flow might be going through the login page, but the AI ends up reaching the dashboard via an existing session or another entry point. If the success condition is only “goal reached,” the test can still pass. When this happens, it’s usually a sign that the success criteria in the test were too weak.
That said, in practice we see far more cases where the AI fails due to interpretation issues than cases where it’s “too clever” and passes. As you mentioned, vague or weak assertions make these false passes more likely.
We definitely don’t consider this problem solved yet — when it happens, we treat it as a signal that the test itself needs to be written more clearly.
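One way to express "stronger success criteria" is to require the intended waypoints in the recorded path, not just the final goal, so an alternate route counts as a flagged result rather than a pass. A hedged sketch with hypothetical page names:

```python
# Sketch: pass only if the goal is reached AND the intended intermediate
# pages were visited, in order. Everything here is illustrative.
def strict_pass(visited: list[str], goal: str, waypoints: list[str]) -> bool:
    if goal not in visited:
        return False
    idx = 0
    for page in visited:
        if idx < len(waypoints) and page == waypoints[idx]:
            idx += 1
    return idx == len(waypoints)

# AI skipped the login page via an existing session: goal reached, but flagged.
print(strict_pass(["/home", "/dashboard"], "/dashboard", ["/login"]))  # False
print(strict_pass(["/login", "/dashboard"], "/dashboard", ["/login"]))  # True
```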
Yeah exactly - it's way easier to say 'make the login work' than write a bunch of assertions about button clicks and form fills. The tricky part is teaching the AI to recognize when it actually succeeded vs just hitting a dead end. I've had cases where it thinks it's done but the app silently failed somewhere. Still beats writing those tests manually though.
the natural language test thing is super smart. I've been there with those brittle Playwright selectors that break every UI tweak - honestly spent more time fixing tests than building features sometimes. your browser recording idea reminds me of how I approach vibe coding now, where I just describe what I want and iterate. curious though, how do you handle flaky tests when the AI misinterprets an instruction or the page loads slow?
Thanks for the comment.
Flakiness is still a work in progress for us. Right now we mainly rely on basic safeguards like waiting a few seconds for pages to load, and we plan to strengthen this with retries and more robust stabilization.
We’ve also found that AI misinterpretation depends heavily on how the test cases are written, sometimes even more than the execution itself. Breaking steps into smaller actions and clearly defining the expected goal state helps reduce ambiguity, and we’re actively improving templates and prompts to make this more reliable.
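The retry-plus-wait safeguard described above is a small pattern on its own. A minimal sketch (names and parameters are illustrative, not Test Pilot internals):

```python
# Illustrative safeguard: retry a flaky step with a short wait between
# attempts, re-raising the last error if every attempt fails.
import time

def with_retries(action, attempts: int = 3, wait_seconds: float = 2.0):
    last_error = None
    for i in range(attempts):
        try:
            return action()
        except Exception as e:   # a timeout or missing element, in practice
            last_error = e
            if i < attempts - 1:
                time.sleep(wait_seconds)  # give the page time to settle
    raise last_error
```

In practice the interesting part is deciding which failures are worth retrying; a real AI-driven regression should still fail after the retries run out.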
Breaking steps into smaller actions is probably the right call. I've noticed with vibe coding that when I write overly clever prompts the AI gets creative in ways I didn't intend - same failure mode you're describing.
The test-case-as-specification thing is interesting though. If your test instructions are good enough to guide the AI reliably, they're basically executable documentation. That's way more valuable than Playwright selectors that only developers can read.
Have you thought about version control for the natural language test definitions? Like tracking when someone changes "click the login button" to "click the blue button on the top right" and whether that breaks things?
That’s exactly how we’re starting to think about it too — once the instructions are reliable enough for the AI to execute, they naturally become a kind of executable documentation.
Versioning the natural-language definitions is definitely something we’ve been thinking about. The moment tests stop being code and become plain language, more people can edit them, which is great — but it also means changes will happen much more often and we need better visibility into them.
The direction we’re interested in is treating each test/workflow like a versioned document, with history and diffs so you can see what changed and when. Being able to correlate wording changes with test outcomes would be really valuable — for example, spotting when a small wording tweak suddenly makes a test flaky or reduces reliability.
It feels like a natural next step once natural-language tests start behaving like a shared spec rather than just automation.
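Surfacing wording changes between two versions of a test is something the stdlib already handles. A small sketch of what a history diff for a natural-language test might look like (the step text is made up):

```python
# Illustration: diff two versions of a natural-language test definition,
# the way a versioned-document history could surface wording changes.
import difflib

old = ["Open the login page", "Click the login button", "Verify the dashboard"]
new = ["Open the login page",
       "Click the blue button on the top right",
       "Verify the dashboard"]

diff = list(difflib.unified_diff(old, new, lineterm=""))
for line in diff:
    print(line)
```

Correlating diffs like this with pass/fail history is what would let you spot a wording tweak that quietly made a test flaky.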
The diffs thing is huge. We ran into this exact problem where someone would tweak the wording on a workflow ("click the button" -> "press the submit button") and suddenly tests would break in weird ways.
Version control for natural language specs makes total sense. Like git blame but for test definitions - you could see exactly when someone changed "login" to "sign in" and correlate that with a 20% drop in success rate.
Honestly the hardest part is probably getting people to understand that a single word change in a natural language test can be just as breaking as changing a function signature. The mental model shift is real.