The Agent Reviewed Its Own Code and Passed Itself. It Was Wrong.

#ai #programming #cli #vibecoding

I'm a self-taught solo dev. I vibe-code every day, and for a long time the same
quiet worry followed me around: the agent hands me code, it looks clean, the tests
pass — and I have no idea what I just let into my project.

It took me a while to see why that worry never went away. Most developers learn to
review code from someone — a senior leaning over their shoulder saying "this'll
bite you in a month." As a self-taught solo dev, I never had that. I learned to
write code. Nobody taught me what to check.

So I did the only thing I could: I asked. I put the question to people with more
miles on the clock than me, and the answers reshaped how I think about verifying AI
code. This is what they taught me — and the experiment it led to, which went wrong
in the most instructive way possible.

"Tests pass" was never the reassurance I treated it as

The first thing that landed: an agent tests what it assumed, not what reality
throws at it. The code runs fine on the happy path, the tests are green, and then
someone feeds it an empty string, a malformed page, an unexpected null — and it
falls apart. The agent skips the defensive thinking that only comes from actually
debugging things in production.

That reframed the whole problem. Green tests don't mean the code is correct. They
mean the code does what the agent thought it should do. Those are very different
claims, and the gap between them is where I kept getting burned.

You can't verify what you don't understand

The second thing was harder to swallow. The more code I let the agent write, the
less I understood what came out — and the worse I got at checking it. Verification
and comprehension are the same muscle. If I don't grasp what the code does, reading
every line is just theater.

One reviewer put the fix simply: before judging the lines, judge the shape. Have
the agent produce a systems-level overview — what calls what, where the boundaries
are — and verify that the structure makes sense before you ever look at
implementation. I'd been generating these as diagrams and still getting lost; the
real unlock was keeping the map as text, so I could hand it to a second model and
ask whether it actually matched the code.

Which points at the deepest problem of all.

An agent can't catch its own blind spots

Here's the thing nobody says out loud: when the same agent writes the code and
writes the tests, it's not verifying anything. It's confirming its own assumptions.
The tests pass because they test the same things the agent already believed. The
blind spot in the code and the blind spot in the test are the same blind spot.

The reviewers' answer was consistent: the check has to come from outside the
author. A fresh model, a separate pass, something that doesn't share the original's
assumptions. Don't let the thing that wrote the code be the only thing that grades it.

So I built exactly that. A red-team step for my orchestrator — a command that takes
the work the agent just finished and forces it into a different role: not the proud
author, but an independent reviewer whose job is to break it.

What I actually told the agent to do

The prompt is the whole trick, and it's blunt on purpose:

Switch into the role of an independent reviewer who did NOT write this code. Your
job is not to confirm it works — it's to find what breaks it. Assume there's a bug.

Then four concrete passes, straight from what the seniors drilled into me:

Unhappy path. Empty, malformed, unexpected input. Null, timeout, race. Show me a specific input that knocks it over.
Silent assumptions. Where does it assume a type, a shape, a state without checking? Where can failures cascade quietly instead of failing loud?
Premature complexity. Is there a layer solving a problem that doesn't exist yet? AI loves architecture that looks clean while serving no one.
Tests. If they exist — do they exercise failure, or just the happy path? What is NOT covered?

And one rule that matters more than the rest: don't tell me "looks good." If you
genuinely find nothing, list specifically what you checked and how. A confident
"looks solid" is the exact thing I'm trying to escape.

The part where it went wrong

Here's where it gets good, and a little absurd.

The agent built the red-team command. Then I pointed the command at the code that
had just produced it. It immediately tore into its own work — listed what it had
done wrong, what it had to fix. The agent admitted the problems and fixed them.

Win, right? Self-correcting AI. Except I ran it a second time.

The second pass found more. Things the first "fix everything" round had missed —
because the agent, even while critiquing itself, was still working from the same
assumptions that put the bugs there in the first place. One adversarial pass didn't
purge the blind spots. It just surfaced the ones the agent could already see.

That's the lesson, and it's sharper than anything I could have written on purpose:
giving an agent a tool to criticize itself doesn't remove its blind spots — it
only reveals the ones it was already capable of seeing. Verification isn't a step
you run once and tick off. The author grading its own work, even adversarially, is
still the author.

Why the step is optional, not a rule

I almost made adversarial review a global rule — run it on everything, always. I'm
glad I didn't.

All of this costs tokens. More scrutiny means more output, more passes, more model
time — and that runs directly against the thing I care about, which is keeping the
context I send lean. Verification has a price, and the price is real. If I red-teamed
every trivial change, I'd be drowning in self-reflection over things that don't
warrant it.

So the step is opt-in. You spend the scrutiny where the stakes justify it — input
handling, parsing, anything touching the outside world — and you skip it where it's
noise. That trade-off, honestly, is the part I'm least done thinking about. Verifying
costs context, and context isn't free. I haven't solved that tension. I've just
decided to pay it on purpose, where it counts, instead of everywhere by reflex.

What I'd tell the version of me from six months ago

The fix was never about writing better code, or even better prompts. It was about
accepting that the agent and I have different jobs. It writes. I decide what's true.
And deciding what's true means refusing to take the agent's word for it — especially
when the agent is grading itself.

I still don't have a senior leaning over my shoulder. But I've learned to build the
shoulder out of separate passes, fresh models, and a stubborn refusal to trust a
green checkmark. If you're self-taught and that quiet worry follows you around too —
that's not a gap in your skill. It's the start of the right instinct.