I build a SaaS QA platform. When AI coding assistants became good enough to trust with real work, I leaned in — and my velocity went up noticeably. Less boilerplate, faster feature implementation, fewer rabbit holes on syntax I'd forgotten.
But something shifted that I didn't expect. I wasn't writing less code and saving time. I was writing less code and reviewing more instead.
For a while I thought that was fine. Then one evening I pushed a feature involving Google SSO and suddenly users couldn't log in. Not "the Google button doesn't work" — I mean the entire sign-in and logout flow was broken. Silent. No obvious error.
I dug in. The AI had been working on routing middleware to serve pre-rendered pages for SEO. Reasonable task. What it also did, without flagging it, was intercept every GET request that looked like a browser navigation — including /auth/google. The OAuth redirect came back from Google, hit the middleware, got served the SPA shell instead, and the session was never established.
The fix was one line. The damage could have been significant if it had reached more users.
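A minimal sketch of the failure mode, with hypothetical names (the post doesn't show the actual middleware): a prerender layer that decides per request whether to serve the static SEO shell. The buggy predicate treats every browser-looking GET as a page navigation, which swallows the OAuth callback; the fix is one extra exclusion.

```typescript
// Hypothetical sketch of the bug: prerender middleware deciding
// whether a GET should be answered with the static SEO shell.
function looksLikeNavigation(path: string, accept: string): boolean {
  // Buggy predicate: any HTML-accepting GET outside /api/ is
  // treated as a page navigation -- including /auth/google.
  return accept.includes("text/html") && !path.startsWith("/api/");
}

// The one-line fix: let auth routes through so the OAuth redirect
// reaches the real handler and the session gets established.
function shouldServePrerendered(path: string, accept: string): boolean {
  return looksLikeNavigation(path, accept) && !path.startsWith("/auth/");
}
```

With the buggy predicate alone, `GET /auth/google` with `Accept: text/html` is served the SPA shell and the session is never created; the fixed version lets the callback reach its handler while still prerendering ordinary pages.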
Here's what stayed with me: the feature it was building worked perfectly. The tests I ran against the new functionality passed. What broke was something adjacent — code I didn't ask it to touch, changed in a way that was internally logical but contextually wrong.
That's the real lesson about working with AI on production code. It doesn't lack skill. It lacks awareness of consequences outside the task boundary. It will solve the problem you gave it and not think twice about what it quietly rearranged to get there.
So now I review differently. I don't just check that the feature works. I read the diff the way I'd read a pull request from a smart junior developer who might not know what they don't know — looking for what changed that I didn't ask to change.
The trade-off is real: I write far less code than before, but I spend more time in review. Whether that's a net win depends on how careful you are. If you test properly and treat every AI output as a PR that needs approval, the productivity gain holds. If you trust the result because the happy path works, you're accumulating invisible risk.
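One way to mechanize that review habit is a scope check over the diff. This helper is my own illustration, not something from the post: feed it the output of `git diff --name-only` plus the paths the task was actually scoped to, and read everything it returns line by line.

```typescript
// Hypothetical review aid: surface every changed file that falls
// outside the paths the task was supposed to touch.
function outOfScopeChanges(changedFiles: string[], taskScope: string[]): string[] {
  return changedFiles.filter(
    (file) => !taskScope.some((scope) => file.startsWith(scope))
  );
}
```

In the SSO incident above, a middleware task that also touched an auth file would have shown up immediately as an out-of-scope change worth reading twice.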
I've caught CORS middleware returning 500 responses with full stack traces to bot traffic. A VAT number whose slash was HTML-entity-encoded before being stored in the database. A soft-delete flag silently stripped from a user object because the context mapper didn't know about the new field.
Each one was a sensible decision in isolation. Each one was wrong in context.
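The mapper bug has a shape worth showing. This is a sketch with hypothetical field names, not the actual code: an explicit allowlist mapper is internally logical (map only what you know), but it silently drops any field added after it was written.

```typescript
// Hypothetical sketch of the context-mapper bug: an explicit
// field allowlist that predates the soft-delete flag.
interface UserRecord {
  id: string;
  email: string;
  deletedAt: string | null; // added after the mapper was written
}

function toContext(user: UserRecord): Record<string, unknown> {
  // Sensible in isolation: map only known fields.
  // Wrong in context: deletedAt silently vanishes.
  return { id: user.id, email: user.email };
}
```

Downstream code checking `deletedAt` sees `undefined` and happily treats a deleted user as active. Nothing errors; nothing logs.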
AI didn't make me a worse engineer. But it did make me a much more deliberate reviewer. That wasn't in the marketing material.
Do you have similar situations — where AI solved the task but broke something it wasn't supposed to touch? How do you review AI-generated code differently now?
Top comments (6)
The "hard way" is usually trusting one model too much. What changed my approach: requiring multiple models to agree before acting on a finding. Single-model confidence is an illusion — consensus is where the signal actually lives.
Interesting approach - using consensus as a confidence filter. I've been thinking about this differently: rather than multiple models agreeing, I focus on narrowing the task boundary so there's less room for silent side effects. Curious whether you apply the multi-model approach before or after the change is made?
Great distinction — narrowing the task boundary is a solid way to reduce the surface area for errors. To answer your question: I use multi-model consensus BEFORE the change ships, as a pre-commit validation layer. The idea: if 3 of 4 models flag the same issue independently, confidence is high enough to block the PR. If only one flags it, it's likely noise or model-specific bias and gets filtered out automatically. Your approach and mine are actually complementary: the combination reduces both false positives and silent side effects significantly. I've been building this into NexaVerify — happy to share results if you're curious.
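The voting scheme the commenter describes can be sketched in a few lines. The thresholds and return shape here are my assumption, not NexaVerify's actual API:

```typescript
// Hypothetical sketch of the consensus filter: one boolean per
// model, true if that model flagged the finding.
type Verdict = "block" | "noise" | "needs-review";

function classifyFinding(flags: boolean[], blockAt = 3): Verdict {
  const votes = flags.filter(Boolean).length;
  if (votes >= blockAt) return "block"; // high-confidence consensus
  if (votes <= 1) return "noise";       // likely model-specific bias
  return "needs-review";                // ambiguous: a human decides
}
```

So 3 of 4 models agreeing blocks the PR, a single flag is filtered as noise, and anything in between falls to human review.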
I’ve had AI “fix” one thing and quietly break auth/cookies elsewhere.
Now I always review diffs like a junior PR + test flows I didn’t touch… that’s where the real bugs hide.
Exactly — the diff is the real deliverable, not the feature. If you're not reading every line that changed, you're not actually reviewing.