DEV Community

Cover image for One model is a guess. Three that agree is a plan.
Anton Babenko for AWS Heroes

Posted on

One model is a guess. Three that agree is a plan.

Why I shipped multi-model consensus as a plugin, plus two quieter tools that keep agents honest.

Ask one model to plan something hard - a migration, a refactor, a cutover - and you get a fluent, confident answer. Fluency is not correctness. A single model is articulate and alone, and being alone is the problem: nothing in the loop disagrees with it, so it rationalizes its first guess into a plan.

The expensive failures with coding agents are almost never syntax. They are plans that read well and were wrong: the wrong abstraction, the missed blast radius, the migration step that bricks state. You find out three hours into execution, not at review time.

I have been running the fix for the last few months - on Terraform modules, and on the everyday work of running compliance.tf, not only on code. It is not a bigger model. It is making models disagree on purpose and then forcing them to resolve it. That is what consensus does, and it is the reason the agent-plugins repo exists. Two other tools ship with it; they are narrower, and I will get to them.

One model is a guess

A single model samples one distribution with no adversary in the room. Two independent models rarely make the same mistake on a plan. Where they diverge is, almost exactly, the risky part of the plan - the assumption nobody checked. A consensus loop turns that disagreement from noise into a signal you can act on.

What consensus actually does

/claude-delegator:consensus runs GPT (via Codex), Gemini, and Claude over the same artifact. Each critiques it independently, and independence here is literal: a fresh thread per call, single-shot per round, no shared provider memory. None of them sees another's review. If one model wanders off on a bad tangent, it cannot drag the other two with it - the disagreement stays honest, which is the entire point.

Independence is not only about which model. It is sharper when each reviewer wears a different hat. claude-delegator ships five expert profiles - Architect, Plan Reviewer, Scope Analyst, Code Reviewer, Security Analyst.

Combine the axes. A Security Analyst on Gemini and an Architect on GPT fight about different things than one model reviewing twice. Different weights catch different mistakes; different profiles catch different categories of mistake.

For a migration plan I run the Plan Reviewer broadly and add a Security Analyst pass on top. "Is this safe to run" and "is this the right shape" get argued by separate reviewers, not averaged into one bland verdict.

Claude then triages every objection - accept it, dismiss it with a recorded reason, or defer it - revises the artifact, and sends the new version back out. It loops, up to five rounds, and stops only when all three sign off. If they cannot, it reports the unresolved disagreement plainly instead of faking agreement.

There is a second kind of contamination worth naming: me. In consensus, every round sends both reviewers the same artifact text, cold, in a fresh thread. They never see my triage, my running verdict, or how I framed the previous round - only the artifact and bounded round metadata. My judgment is applied after they report, not baked into what they receive.

The ask-* commands are the opposite by design: the expert gets exactly the prompt I assembled from the conversation - what I chose to hand it. Fast for a second opinion, but the input is mine, not independent. consensus keeps the input independent and pays for it in extra rounds.

It does not have to be a plan. The loop runs on anything you can put in text - a design, a runbook, a decision memo, a spec. Plans are simply where I reach for it most, and where it converges fastest. The looser and fuzzier the input, the more rounds it takes to agree, so non-plan runs tend to run longer - worth it when the answer matters, overkill for a quick lookup. For that, single-shot ask-gpt, ask-gemini, and ask-both are right there. consensus is for when it has to be right.

Nothing about this is Terraform, or even code. That generality is why it is the headline.

The quieter two

Same release, narrower scope:

  • code-intelligence - a language-agnostic skill. Agents grab text grep when they should ask the language server, and silently swap tools when one is missing, then report "found all references." This encodes search precedence: the language server (LSP) for symbols, rg/ripgrep for exact text, an embedding/semantic grep (such as mgrep) for fuzzy discovery - and a hard rule to disclose any substitution on the first line of the reply. You learn the moment coverage drops.
  • terraform-skill - routes a Terraform/OpenTofu request to its real failure mode (identity churn, blast radius, state corruption) before emitting HCL. It is terraform-ls aware: it knows the language server has no rename provider, so it runs the safe manual reference workflow instead of a blind find-replace. It is approaching 2,000 GitHub stars - the part of this post I am quietly proud of.

Those two are discipline for the agent's hands. Consensus is discipline for its judgment, and judgment generalizes further.

A note on claude-delegator

I did not invent the delegation layer. claude-delegator is a fork of Jarrod Watts' original (MIT, upstream currently quiet), fully based on his design - I kept the structure and the license.

What I added is what months of daily use exposed: a hardened Gemini bridge that recovers an answer the CLI flushed to disk after a soft timeout instead of failing the call, automatic retry on the Gemini trusted-directory check, sturdier JSON-RPC parsing, and a GEMINI_DEFAULT_MODEL override. Plus the bundled ask-gpt / ask-gemini / ask-both / consensus commands, so the workflow ships with the plugin instead of living in my dotfiles.

Credit for the foundation is his; the bug-fixing scar tissue is mine.

Why a plugin, not a blog post

One honest caveat, because I have hit it myself: skills are model-triggered, which makes them soft. Packaging this as a plugin improves reuse and discoverability. It does not guarantee the agent obeys every time - hard enforcement (a real pre-tool gate) is a separate, still-open problem.

I keep finding and fixing bugs in all three. That constant repair is the only reason I trust them enough to write this - and I would rather say so than oversell the fix.

The takeaway

Stop accepting the first confident plan from one model. Make them argue, and only move when they stop. The release is at github.com/antonbabenko/agent-plugins; consensus ships in claude-delegator from the same marketplace. The cheapest review is the one that happens before you execute.

Top comments (0)