
Brian Mello

I Let Three AIs Argue About My Vibe-Coded App — Here's What They Caught

I built a small side project in Cursor over a weekend. Login, dashboard, a couple of forms, a Stripe-style checkout flow. The kind of thing that feels done. Clicking around, everything works. The vibes are immaculate.

So I did the responsible adult thing: I shipped it.

It broke in three places within 48 hours. None of the breaks were in code I had written by hand. They were in code an AI had generated that I had skimmed, nodded at, and moved on from.

That's the trap of vibe coding. The AI is fluent. You're fluent at reading what the AI made. Neither of you is the kind of pedantic loser who notices that the "Cancel" button on the checkout modal actually submits the form on mobile Safari because someone forgot to add type="button" somewhere three components deep.

This is the story of the second app I built, where I tried something different. I let three AI testing agents argue about my app before I shipped it. They caught seven things I would have missed. They also disagreed with each other in ways that, weirdly, made me trust the result more.

The setup

The app was a simple expense-splitting tool — Splitwise but uglier and free. Built in Cursor, deployed on Vercel, total dev time around eight hours spread across two evenings. By the end I had:

  • Email/password signup
  • A "create group" flow
  • An "add expense" form with split logic
  • A settle-up view

Standard vibe-coded SaaS skeleton. Worked on my machine. Looked fine on my phone.

Instead of clicking around for an hour and calling it good, I pointed 2ndOpinion Testing at the URL. The pitch on the box is "AI agents test your app like real users, then cross-examine each other's findings." I'd seen the demo. This was the first time I'd used it on something I actually cared about.

What "three AIs arguing" actually looks like

The product runs three different model-backed agents against your app concurrently. Each one explores independently — clicking, typing, navigating like a confused new user who has never seen the thing before. They each file a report on what's broken or weird.

Then comes the part that earns the courtroom metaphor in the marketing: the agents cross-examine each other's findings. Agent A claims the signup form is broken. Agent B says they signed up just fine. The system makes them reproduce, defend, or retract.

You don't end up with three separate reports to read. You end up with one verdict: here's what's actually wrong, here's what one agent thought was wrong but couldn't reproduce, here's what all three independently flagged.

Reading the final verdict felt like reading the minutes of a deposition. In a good way.

The seven things they caught

I'll walk through them in increasing order of "ouch, I should have caught that."

1. The signup email field accepted "test" as a valid email. All three agents flagged this. Front-end validation was just required, no type="email". Cursor had generated a form with the bare minimum and I hadn't tightened it. Five-second fix. Would have looked terrible the first time a real user mistyped.
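
For the record, the fix really is about five seconds. A minimal sketch, with the field written from memory rather than the app's actual markup:

```tsx
// Illustrative field, not the app's exact markup. type="email" turns on the
// browser's built-in format check, so "test" gets rejected before submit ever
// fires; `required` on its own only checks that the field isn't empty.
export function EmailField() {
  return <input name="email" type="email" required autoComplete="email" />;
}
```

Server-side validation still matters, of course; this just stops the obvious typo at the door.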

2. The "Add Expense" form let you submit $0. Two of three agents tried it, both succeeded, both filed it. The third agent said "this is probably intentional, some groups track zero-dollar IOUs." The system made them argue about it. They settled on "probably a bug, ask the developer." It was a bug.

3. The settle-up calculation rounded wrong on three-way splits. $10 split three ways became $3.33 + $3.33 + $3.33, which is $9.99. Someone was always going to be a penny off. One agent caught it by splitting a coffee three ways and noticing the totals didn't reconcile. The other two had only tested two-way splits.
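
A standard way to fix it, sketched in isolation from the app's real code: do the math in integer cents and hand the leftover pennies out one at a time, so the shares always sum back to the total.

```ts
// Split an amount (in cents) across n people so the shares always reconcile.
// splitEvenlyCents(1000, 3) -> [334, 333, 333]  ($3.34 + $3.33 + $3.33 = $10.00)
function splitEvenlyCents(totalCents: number, people: number): number[] {
  const base = Math.floor(totalCents / people);
  const remainder = totalCents % people;
  // The first `remainder` people each absorb one extra cent.
  return Array.from({ length: people }, (_, i) => (i < remainder ? base + 1 : base));
}
```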

4. Pressing Enter in the "group name" field submitted the form before I'd added any members. Only one agent caught this — the others were filling forms by clicking the submit button like polite humans. The one that pressed Enter found a half-broken state where the group existed but had no members and couldn't be edited.
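
The underlying lesson: Enter fires the form's onSubmit exactly like the button does, so the guard has to live in the handler, not in the button's disabled state. A hedged sketch with made-up names, not the app's real code:

```tsx
import type { FormEvent } from "react";

type GroupDraft = { name: string; members: string[] };

// Returns an error message to show, or null once the group has been created.
// Because Enter triggers onSubmit too, the "needs at least one member" rule
// is enforced here rather than by disabling the submit button.
export function handleCreateGroup(
  e: FormEvent<HTMLFormElement>,
  draft: GroupDraft,
  createGroup: (draft: GroupDraft) => void
): string | null {
  e.preventDefault();
  if (draft.members.length === 0) {
    return "Add at least one member before creating the group.";
  }
  createGroup(draft);
  return null;
}
```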

5. The mobile nav menu didn't close after navigating. Two agents flagged it. Classic AI-generated React component thing. The menu had open/close state, but route changes didn't reset it.
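
The fix is the usual one-line effect. A sketch assuming a Next.js App Router setup (hence usePathname); with react-router the same idea hangs off useLocation instead:

```tsx
"use client";

import { useEffect, useState, type ReactNode } from "react";
import { usePathname } from "next/navigation";

// Illustrative menu, not the app's real nav: the point is the effect that
// closes the menu whenever the route changes, not just when the toggle is clicked.
export function MobileNav({ children }: { children: ReactNode }) {
  const [open, setOpen] = useState(false);
  const pathname = usePathname();

  useEffect(() => {
    setOpen(false); // any navigation collapses the menu
  }, [pathname]);

  return (
    <nav>
      <button type="button" onClick={() => setOpen((o) => !o)}>
        Menu
      </button>
      {open && <div>{children}</div>}
    </nav>
  );
}
```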

6. The password reset email link 404'd. I had not, in fact, set up the password reset route. The "Forgot password?" link went to /reset-password which did not exist. I had written the link before writing the page and never come back to it. One agent found this by clicking every link on the login screen. Embarrassing.

7. The Stripe-style checkout for the (currently mocked) "Pro" tier accepted submissions but didn't go anywhere. I had stubbed out the Pro upgrade flow and forgotten about it. The button looked real. The page it led to was a 404.

Seven real things. None of them catastrophic, all of them the kind of thing that, on a launch day with twenty people poking at your app, accumulates into "this product feels janky."

The part I didn't expect: the disagreements

The disagreements are what convinced me this approach actually works. Here are two:

Was the signup flow too slow? One agent flagged the signup as "slow, took 4 seconds." The other two said it felt normal. The system made the first agent show its work. Turned out it had been testing on a throttled connection it had picked up from somewhere in its state, and the other two hadn't. The finding got retracted. If I had just had one agent, I'd have gone hunting for a phantom performance problem.

Was the "delete group" confirmation modal confusing? Two agents thought the wording was unclear. The third said it was fine. The argument ended with "this is subjective, flagging for human review." That's the right answer. The tool wasn't pretending to be sure when it wasn't.

I have used single-AI testing tools before. They sound confident about everything, including the wrong things. Watching three agents disagree and then resolve felt much closer to the experience of having three different humans review a PR. Some things were unanimous. Some things were noise. The noise got filtered before it got to me.

What I'd tell another vibe coder

A few things, in order of how often I've now had to repeat them to friends:

You don't need to learn Playwright. You don't need to write Cypress specs. You don't even need to know what "end-to-end testing" is in the traditional sense. If you built your app in Bolt, Lovable, v0, or Replit, the testing tool you want is the same kind of thing — point it at a URL, let it figure out what to do.

You do need to test before you ship, not after. The temptation when you've spent a weekend vibing with an AI is to deploy on Sunday night, post on X, and hope. Resist. A 20-minute pre-flight on a Sunday afternoon catches the seven things that would have been a soft launch disaster.

You should care about the disagreements more than the agreements. If your testing tool always sounds 100% confident, it's lying to you. Real bugs aren't unanimous. The interesting findings are the ones where one agent saw something and the others didn't — and you get told whether the holdout was right.

Try it

If you've vibe-coded anything in the last month and it's sitting in a Vercel deployment waiting for you to feel brave enough to share the link, I'd run it through this before you do.

Try 2ndOpinion Testing →

You paste a URL. Three AIs argue about it. You ship with fewer surprises. That's the whole product.

The Splitwise-but-uglier app is still up, by the way. Seven fewer embarrassments than it would have had. I'll take it.
