One model is a guess. Three that agree is a plan.

#ai #llm #productivity #agents

Why I shipped multi-model consensus as a plugin, plus two quieter tools that keep agents honest.

Updated (28.5.2026): /consensus is now a 3-stage loop. Stage 2 adds a blind cross-review where each external rates the others' answers, anonymized, before Claude adjudicates. Pattern from karpathy/llm-council.

Updated (26.5.2026): Grok (xAI) joins GPT and Gemini as the third external provider, Gemini 3 is the default via Google's Antigravity CLI, two new experts (Researcher and Debugger) bring the count to seven, reviews are severity-graded, and /consensus now forces Claude to commit a blind verdict in the transcript before dispatching - arbiter-mediated, not pure democracy.

Ask one model to plan something hard - a migration, a refactor, a cutover - and you get a fluent, confident answer. Fluency is not correctness. A single model is articulate and alone, and being alone is the problem: nothing in the loop disagrees with it, so it rationalizes its first guess into a plan.

The expensive failures with coding agents are almost never syntax. They are plans that read well and were wrong: the wrong abstraction, the missed blast radius, the migration step that bricks state. You find out three hours into execution, not at review time.

I have been running the fix for the last few months - on Terraform modules, and on the everyday work of running compliance.tf, not only on code. It is not a bigger model. It is making models disagree on purpose and then forcing them to resolve it. That is what consensus does, and it is the reason the agent-plugins repo exists. Two other tools ship with it; they are narrower, and I will get to them.

One model is a guess

A single model samples one distribution with no adversary in the room. Two independent models rarely make the same mistake on a plan. Where they diverge is, almost exactly, the risky part of the plan - the assumption nobody checked. A consensus loop turns that disagreement from noise into a signal you can act on.

What `/consensus` actually does

/consensus runs GPT (via Codex), Gemini 3 (via Antigravity), and Grok (via the xAI API) against the same artifact, with Claude as the arbiter. Each round has three stages.

Stage 1 - parallel verdicts. Claude posts its own verdict (APPROVE / REQUEST CHANGES / REJECT) into the transcript first, blind, before any external sees the work. The pre-commitment sits there in writing, so Claude's judgment cannot drift later to match what the others say. Then GPT, Gemini, and Grok review the artifact in parallel, each in a fresh thread, single-shot. None sees another's review.

Stage 2 - blind cross-review. Each external then rates the OTHER externals' answers, identity stripped best-effort. Votes of "not viable" become candidate critical issues the arbiter has to weigh. This catches the case where Stage 1 looks like agreement but is really three reviewers each rationalizing past the same hole. Pattern adapted from karpathy/llm-council. Stage 2 fires every round 1, and after that only when Stage 1 disagreed or the previous Stage 2 surfaced an accepted issue.

Stage 3 - arbiter adjudication. Claude reconciles the Stage 1 verdicts, the Stage 2 candidate issues, and its own blind verdict. Every objection is accepted, dismissed with a recorded reason, or deferred. Claude revises the artifact and the loop runs again, up to five rounds. It converges only when every responding external approves and Claude's pre-committed verdict agrees, with reasons on both sides where it walked back. If the group cannot agree, it says so plainly instead of faking it.

Click for the detailed diagram with bias guards and per-model flow.

Independence is not only about which model. It is sharper when each reviewer wears a different hat. claude-delegator ships seven expert profiles - Architect, Plan Reviewer, Scope Analyst, Code Reviewer, Security Analyst, Researcher, and Debugger.

Combine the axes. A Security Analyst on Gemini and an Architect on GPT fight about different things than one model reviewing twice. Different profiles catch different categories of mistake.

For a migration plan I run the Plan Reviewer broadly and add a Security Analyst pass on top. "Is this safe to run" and "is this the right shape" get argued by separate reviewers, not averaged into one bland verdict.

There is a second kind of contamination worth naming: me. In consensus, every round sends the reviewers the same artifact text, cold, in a fresh thread. They never see my triage, my running verdict, or how I framed the previous round - only the artifact and bounded round metadata. My judgment is applied after they report, not baked into what they receive.

The ask-* commands are the opposite by design: the expert gets exactly the prompt I assembled from the conversation - what I chose to hand it. Fast for a second opinion, but the input is mine, not independent. consensus keeps the input independent and pays for it in extra rounds.

It does not have to be a plan. The loop runs on anything you can put in text - a design, a runbook, a decision memo, a spec. Plans are simply where I reach for it most, and where it converges fastest. The looser and fuzzier the input, the more rounds it takes to agree, so non-plan runs tend to run longer - worth it when the answer matters, overkill for a quick lookup. For that, single-shot ask-gpt, ask-gemini, ask-grok, and ask-all (which fans out to all three in parallel) are right there. consensus is for when it has to be right.

Nothing about this is Terraform, or even code. That generality is why it is the headline.

The quieter two

Same release, narrower scope:

code-intelligence - a language-agnostic skill. Agents grab text grep when they should ask the language server, and silently swap tools when one is missing, then report "found all references." This encodes search precedence: the language server (LSP) for symbols, rg/ripgrep for exact text, an embedding/semantic grep (such as mgrep) for fuzzy discovery - and a hard rule to disclose any substitution on the first line of the reply. You learn the moment coverage drops.
terraform-skill - routes a Terraform/OpenTofu request to its real failure mode (identity churn, blast radius, state corruption) before emitting HCL. It is terraform-ls aware: it knows the language server has no rename provider, so it runs the safe manual reference workflow instead of a blind find-replace. It is approaching 2,000 GitHub stars - the part of this post I am quietly proud of.

Those two are discipline for the agent's hands. Consensus is discipline for its judgment, and judgment generalizes further.

A note on claude-delegator

I did not invent the delegation layer. claude-delegator is a fork of Jarrod Watts' original (MIT, upstream currently quiet), fully based on his design - I kept the structure and the license.

What I added is what months of daily use exposed: a Gemini bridge that wraps Google's Antigravity CLI (agy) with auto-gemini-3 as the routing default and recovers an answer the CLI flushed to disk after a soft timeout instead of failing the call; a fresh Grok bridge over the xAI API that is advisory-only but reads attached files via the xAI Files API (with TTL-based cleanup); two more experts (Researcher and Debugger) on top of the original five; severity-graded reviews so three parallel reports merge cleanly; and a hardened /consensus loop where Claude pre-commits a blind verdict before any external sees the artifact, with a Stage 2 blind cross-review on top (adapted from karpathy/llm-council). Plus the bundled ask-gpt / ask-gemini / ask-grok / ask-all / consensus commands, so the workflow ships with the plugin instead of living in my dotfiles. The seven expert prompts borrow from oh-my-openagent and PAL; both credited in the README.

Credit for the foundation is his; the bug-fixing scar tissue is mine.

Why a plugin, not a blog post

One honest caveat, because I have hit it myself: skills are model-triggered, which makes them soft. Packaging this as a plugin improves reuse and discoverability. It does not guarantee the agent obeys every time - hard enforcement (a real pre-tool gate) is a separate, still-open problem.

I keep finding and fixing bugs in all three. That constant repair is the only reason I trust them enough to write this - and I would rather say so than oversell the fix.

The takeaway

Stop accepting the first confident plan from one model. Make them argue, and only move when they stop. The release is at github.com/antonbabenko/agent-plugins; consensus ships in claude-delegator from the same marketplace. The cheapest review is the one that happens before you execute.

Top comments (3)

Harjot Singh • May 31

One model is a guess, three that agree is a plan is a great line, and the mechanism behind it is real: independent models cross-checking reduces the confident-but-wrong failure that a single model gives you no signal about, one model's hallucination is exposed when two others don't share it. The whole value, though, lives in the word independently, and that's the part worth stressing. Consensus only buys you something if the errors are uncorrelated, and three models that share a training lineage or get the same prompt framing will often agree on the same wrong answer, giving you false confidence that's worse than a lone guess because now it looks corroborated. So the design that makes this work is deliberate diversity (different model families, varied prompts/temperatures) plus a real adjudication path for the disagreement case, because disagreement is the most informative outcome and how you resolve it (tie-break model, escalate to a human, abstain) is where the reliability actually comes from. The cost flip side is nice too: three cheaper models voting can beat one expensive model on reliability per dollar, as long as they're genuinely independent. Agreement is only evidence when the voters can fail differently. That diversity-makes-consensus-meaningful instinct is core to how I think about verification in Moonshift. When your three disagree, what breaks the tie, a designated adjudicator model, or do you surface it for a human?

Anton Babenko AWS Heroes • Jun 3

Exactly. Agreement only counts when the voters can fail differently, which is why Deliberation makes you pick the panel by hand instead of running three cousins.

The tie-break is layered: the arbiter writes down its verdict before seeing anyone else, models cross-review with names stripped, and a disagreement triggers another round rather than an instant winner. If the room still can't agree, it returns UNRESOLVED instead of faking a number. That's the human's cue.

I answered your question better in the post than I can here: dev.to/aws-heroes/meet-deliberatio...