
Brian Mello

How to Test Your AI-Built App Without Writing a Single Test

You opened Cursor. You typed "build me a booking app." Forty-five minutes later, you have something that runs. The login works. The calendar mostly works. You ship it.

Then a friend tries it and the date picker goes blank on iOS. Another user finds that hitting back after a failed payment leaves the form locked. Someone else can't sign up because their email has a plus sign in it.

Welcome to the gap nobody talks about in the vibe coding era: AI-built apps are easier to ship than ever, and exactly as buggy as you'd expect from code you didn't fully read. Traditional testing — the unit tests, the integration tests, the Selenium suites — assumes you have time, expertise, and patience to write them. Vibe coders have none of those. So the apps go out untested.

This post is about the shift that's quietly happening: AI testing tools that act like real users, find real bugs in your AI-built app, and never ask you to write a line of test code.

Why traditional testing fails vibe coders

Selenium is from 2004. Cypress and Playwright are better, but the workflow hasn't really changed: you write a script that says click this, type that, assert this. Then your AI rebuilds the navbar and your selectors break. You spend an afternoon fixing tests instead of shipping features.
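For contrast, here's roughly what that scripted workflow looks like. This is a hypothetical Playwright test for the booking app from the intro (the URL and selectors are invented for illustration):

```typescript
// A typical scripted UI test: every selector is a bet that the markup won't change.
import { test, expect } from '@playwright/test';

test('user can book a class', async ({ page }) => {
  await page.goto('https://testing.example.com');

  await page.click('.navbar .book-button');        // breaks if the AI renames the class
  await page.fill('#email', 'user@example.com');   // breaks if the id changes
  await page.fill('#date', '2025-06-01');
  await page.click('button[type="submit"]');

  await expect(page.locator('.confirmation')).toBeVisible();
});
```

The test itself is fine. The problem is everything around it: installing the framework, keeping the selectors in sync with markup you didn't write, and the fact that it only ever checks the one path you thought to script.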

The friction is bad enough for full-time engineers. For someone who built their app in Bolt or Lovable over a weekend, it's a hard no. You're not going to become a QA engineer. You're going to ship and hope.

There is a third option, and it turns out it's pretty obvious in hindsight: have the AI do the testing too.

The shift: AI agents that test like users

The new generation of testing tools doesn't ask you to describe what to test. It asks for a URL.

You point the tool at your live app. An agent opens it, looks at the page, and behaves like a curious human. It clicks things. It fills in forms. It tries weird inputs. It tries to break the flow. It does the things you'd do if you sat down to test your own app, except it doesn't get bored after the third happy path.

This is fundamentally different from script-based testing. Scripts only test what you told them to test. An agent explores. It can find the dead end you didn't know existed because you never thought to script the click that gets you there.

The closest analogy is hiring a junior QA contractor who's actually thorough — except this one shows up in thirty seconds and costs less than a sandwich.

What "no scripts" actually means

When I say no scripts, I mean it literally. No selectors. No fixtures. No mocks. No test framework to install. The mental model is more like:

  • Here's my app: testing.example.com
  • Here's a sentence about what it does: a booking flow for a yoga studio
  • Find the bugs

That's the entire input. You spend more time writing the sentence than configuring anything else.

The output is a verdict. Not a failed test name and a stack trace — a plain-English description of what's broken, why it matters, and how to reproduce it. Screenshots included.

For a vibe coder, this is the whole point. You don't want to learn the testing tool. You want to know if your app is shippable.

Three AIs walk into a courtroom

Here's where it gets interesting. A single AI agent tests your app and tells you it found three bugs. How do you know it's right? AI agents hallucinate. They confidently report "the login button doesn't work" when actually they just couldn't find it because a cookie banner was in the way.

The fix is the same fix that's working everywhere else AI gets used in production: you ask more than one model, and you make them justify themselves.

In 2ndOpinion's testing product, three different agents test your app independently. Then they cross-examine each other's findings, courtroom style. Did Agent A really see this bug, or did it misread the page? Can Agent B reproduce it? Does Agent C agree that the failure mode is what A says it is?

What you get back is a verdict with confidence levels. The bugs that all three agents independently found are almost certainly real. The ones only one agent flagged usually aren't. This cuts the false-positive rate dramatically and saves you from chasing ghosts.
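To make the idea concrete, here's a minimal sketch of consensus scoring in TypeScript. This is not 2ndOpinion's actual algorithm, just an illustration of the principle: a finding reported independently by more agents earns more confidence.

```typescript
// Sketch only: assumes findings have already been matched across agents
// (e.g. a judge model deciding that two descriptions refer to the same bug).
type Finding = { agent: 'A' | 'B' | 'C'; summary: string };

function confidence(matchedReports: Finding[], totalAgents: number): 'high' | 'medium' | 'low' {
  const distinctAgents = new Set(matchedReports.map(r => r.agent)).size;
  if (distinctAgents === totalAgents) return 'high';  // everyone saw it: almost certainly real
  if (distinctAgents > 1) return 'medium';            // corroborated, worth a look
  return 'low';                                       // one agent, no reproduction: likely a ghost
}

const plusSignBug: Finding[] = [
  { agent: 'A', summary: 'Signup silently fails for emails containing a plus sign' },
  { agent: 'B', summary: 'Plus-addressed email rejected with no error message' },
  { agent: 'C', summary: 'Submit button stops responding for a+b@example.com' },
];

console.log(confidence(plusSignBug, 3)); // "high"
```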

If you've ever been burned by an AI tool that confidently lied to you, this is the cure. Make them argue. The truth tends to survive the argument.

A typical workflow

Here's what testing an AI-built app looks like when you remove the scripting:

You finish a feature in Cursor, v0, Lovable, Replit — wherever you build. You deploy it. You paste the URL into your testing tool. You add a one-line description of what the app does and which flow you care about. You hit go.

A few minutes later, you have a list of issues. Not "test_login_button_failed at line 47." Something like: "If a user enters an email with a plus sign, the signup form silently fails. No error message appears, the button just stops responding. Reproduced in Chrome and Safari."

You take that to your AI coding tool. You paste the bug. You ask for a fix. You redeploy. You re-run the test. You ship.

The total cycle is maybe twenty minutes. Compare that to writing a single Cypress test from scratch, which also takes about twenty minutes and leaves you with exactly one happy path covered when you finish.

What this catches that you'd miss

The category of bugs that AI testing finds reliably is the one vibe coders most often ship by accident.

Edge cases in input handling. The plus sign in the email. The apostrophe in the last name. The phone number with a country code.

Broken back buttons and refresh behavior. The state that doesn't persist. The form that reposts on refresh. The "session expired" page that has no way out.

Mobile-specific weirdness. The viewport that doesn't scroll. The keyboard that covers the submit button. The autofocus that fights with the iOS keyboard.

Auth flows that work for the happy path and explode otherwise. Wrong password. Expired link. Already-registered email. OAuth cancellation halfway through.
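To make the first of those concrete: the plus-sign signup bug is very often nothing more than an email regex that's too strict. A hypothetical example of the kind of validation an AI assistant happily generates:

```typescript
// Overly strict: no "+" allowed in the local part, so plus-addressed emails are rejected.
const strict = /^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;
strict.test('ana+yoga@example.com');   // false: a perfectly valid address, rejected

// Looser sanity check: confirm the rough shape and let the mail server decide the rest.
const lenient = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
lenient.test('ana+yoga@example.com');  // true
```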

These are the bugs your friends find on Twitter the day after you launch. They're also the ones a single AI rarely catches reliably, which is why the multi-agent cross-examination matters.

Where this is heading

Two things are going to happen over the next year. First, this kind of testing becomes the default. Pasting a URL and getting a verdict will feel as obvious as pasting an error into ChatGPT. Second, the tools that survive will be the ones that handle disagreement honestly — that show you when their agents argued, who won, and why.

The vibe coder workflow has been bottlenecked on testing for two years. The unblocking is happening right now, and it doesn't involve learning Selenium.

If you've built something in Cursor, Bolt, or Lovable and you're nervous about shipping it: that's a reasonable feeling, and the tools to act on it finally exist.


If you want to try this on something you've built, 2ndOpinion Testing is the macOS desktop app I'm building for exactly this. Paste a URL, get a verdict. No scripts, no selectors, no test framework to learn.
