Why I ask three models before I trust one
A few weeks ago I realized I was paying for per-use AI review calls, and the cost was making me skip checks on the work that most deserved them.
The dollar amount was small, but it changed how often I ran the check. I skipped it on exactly the sort of annoying work that needed a second pass: a rule change, a migration, a new background job, a piece of logic I was just confident enough to stop questioning.
I fixed the workflow so those routine reviews are easier to run. The implementation is boring: the same prompt goes to three models, the raw answers get saved, and I look at where they split.
One answer can make a bad assumption sound normal
I used to paste a decision into one strong model and ask whether it held up. That caught obvious contradictions. It did not reliably catch the place where my own description had left an assumption doing hidden work.
The model would usually find the most charitable version of what I meant. Sometimes that was helpful. Sometimes it meant I got a fluent explanation for why my half-written rule was reasonable.
The example that changed the habit was a knowledge-base rule I almost made permanent. I had watched one model lose track of a running count across a long input, so I wrote the failure down too broadly: do not trust models for cumulative counting over long context. One reviewer agreed immediately.
Another model pushed back with a narrower rule: rerun the count under the exact model and prompt before making it policy. That was enough to make me retest the case before writing a standing instruction.
That is the job I want from a review. I want it to point at the sentence where I got too broad before it turns into code.
What I run
My default council uses three models. Each gets the same question and answers independently:
- one GPT-class model
- one mid-tier Claude
- one top-tier Claude
I ask each for a verdict and the reasoning behind it. Two models can both say "ship" for completely different reasons, and that mismatch is where I slow down.
I do not count votes; I read the path each model took and use any disagreement as a constraint check.
Manual checks stay in the chat UIs and local CLI tools I already have open. Scheduled jobs and product workflows use API calls, with logs and metering attached.
Three has been enough for my use. With five, I usually waited longer and got two more versions of points already on the table. There are exceptions, but for daily engineering decisions three is still light enough that I run it while git diff is empty.
The part I read first
When all three answers say the same obvious thing, I skim that part and move on.
The useful part is the split. In one review, two models were comfortable with an automated watchlist because the ranker looked reasonable. The third focused on the acceptance test I had written: the list was capped at eight names, but I had also demanded zero misses. Those two requirements could not both survive a busy market day. That was the paragraph I needed.
This has changed how I write prompts too. I no longer ask, "is this good?" I ask what could fail and which assumption would make the answer wrong. For each review I include the intended change, the rollback path if I know it, and the acceptance test I think proves it, then paste that at the top of the request.
Where the council is the wrong tool
I do not use this when I need a brand-new idea. A small panel tends to converge, and when I am genuinely stuck I usually need exploration instead. I also skip it for one-off decisions where the blast radius is tiny.
It is useful for the daily middle, where a rule might become policy or a migration has enough blast radius that I want the rollback attacked first. By the time I decide, I have a short list of objections in front of me.
Fillip Kosorukov is a solo founder who builds software with AI assistance and writes about workflows that hold up in production work. More at fillipkosorukov.net ยท ORCID 0009-0004-1430-2859.
Top comments (0)