DEV Community: Anton Babenko

Meet Deliberation: 400+ models is easy, knowing which ones earn a place is hard.

Anton Babenko — Wed, 03 Jun 2026 08:54:04 +0000

A follow-up: how a three-model consensus tool grew into a configurable, measurable panel - and why I now make every model prove it pays for its slot.

It is open source and ready to use today: github.com/antonbabenko/deliberation. If you only do one thing, star the repo and install it in your own agent - it takes about two minutes, and the install section below has the exact command for Claude Code, Codex, Cursor, Kiro, and OpenCode.

Here is the part that surprises people: you are probably already paying to access a few models. A ChatGPT subscription includes Codex which allows you to use models like gpt-5.5 and gpt-5.3-codex from Codex CLI. A Claude subscription includes Claude Code. So you can wire GPT and Claude to review each other right now, at no extra cost, and add Gemini, Grok, or any OpenRouter model when you want a third opinion.

A few weeks ago I wrote that one model is a guess and three that agree is a plan. I still believe it. But it skipped the next problem, the one I hit the moment the trick worked: once you can ask three models, you can ask thirty seven. And more voices is not the same as more signal. A slow model that always agrees with a faster one is not a second opinion. It is a bill.

So the project grew up, and it got a name: Deliberation.

The word fits. Deliberation is the slow, careful weighing of a decision - the thing a jury or a council does before it returns a verdict. That is exactly the job here: a panel of models that weighs one artifact and argues its way to a call, instead of one model blurting the first thing that sounds right. It answers a different question than the first post did. The first post asked should you make models argue? This one is about which models, under what rules, and how do you know any of it was worth the wait?

It still runs the everyday work of compliance.tf, same as before. Most of what follows came from that: not from theory, from watching real runs and asking why one took four minutes to tell me what another said in twenty seconds.

A few words first

If you are new to this, here are the only terms you need. Plain version, with examples.

Agent host. The tool you already code in: Claude Code, the Codex CLI, Cursor, Kiro, OpenCode. Deliberation plugs into all of them and behaves the same way in each.
MCP. A standard plug that lets your agent host talk to outside tools. Deliberation is one MCP server, so any host that speaks MCP can use it. You install it once; you do not write glue code.
Provider / model. A provider is a company (OpenAI, Google, xAI, OpenRouter). A model is one brain from that provider (GPT, Gemini, Grok, or any of OpenRouter's 400-plus, like Qwen, DeepSeek, or Kimi).
Panel. The set of models you ask at once. You choose who is on it. Example: a panel of GPT + Gemini + Grok.
Expert (or persona). A hat a model wears for one job: Architect, Security Analyst, Code Reviewer, Plan Reviewer, and three more. The same model reviews differently depending on the hat.
Arbiter. The one who reads everybody's answers and makes the call. It can be a model, or it can be your own agent (you).

That is the whole vocabulary. The rest is just rules for how the panel talks.

Install it now

Deliberation is one MCP server with native packaging for each host. Set only the providers you use - a missing key just turns that one provider off. Full per-host guides: public-docs/hosts.

Claude Code: add the marketplace, install the plugin, run setup.

  /plugin marketplace add antonbabenko/agent-plugins
  /plugin install deliberation
  /deliberation:setup

Codex CLI: codex plugin marketplace add antonbabenko/deliberation, then install deliberation from /plugins.
Cursor: drop in the rule file and use the one-click MCP deeplink (guide).
Kiro: "Import power from GitHub" (guide).
OpenCode: add the opencode.json MCP snippet (guide).

Credentials come from your environment: GPT uses your Codex/OpenAI login (run codex login), Gemini signs in once through its new Antigravity CLI (run agy), Grok reads XAI_API_KEY, OpenRouter reads OPENROUTER_API_KEY. Start with GPT and Claude, since you likely already pay for both.

What actually changed

Three things.

It is one MCP server now, not a Claude-only plugin. It works in Claude Code, Cursor, Codex, Kiro, OpenCode - anything that speaks MCP. Your primary agent stays in charge and calls the panel when a decision is worth a second look.

The panel opened up. It used to be three fixed externals: GPT, Gemini, Grok. Those are still the built-in voices, but you can now add any of OpenRouter's 400-plus models (including Qwen, DeepSeek, Kimi, and others) as named records in a config file. You pick them. You say which ones join a quick fan-out, which ones sit on the consensus panel, and which expert hat each one is allowed to wear - not all are equally good for all tasks.

And the whole thing learned to measure itself. That last part is the real subject of this post.

I will walk through it as five categories. Each one has a switch you flip in config, and each one votes - or refuses to - in a specific way.

Quick one-offs: just ask

Before the heavy machinery, the everyday move. Most of the time you do not want a five-round debate - you want a fast second opinion from someone who is not the model you have been talking to all session.

So you just ask, in plain words: "Ask Grok and Gemini whether this retry loop can deadlock." Your agent routes that to the right voices and brings back the answers. Or use the explicit commands when you know who you want:

/ask-gpt does this IAM policy grant more than it should?
/ask-grok poke holes in this rollback plan
/ask-gemini summarize the risk in this migration in 5 bullets

These are single-shot and advisory: one question, one answer, no loop, no context contamination from your session. The exact same commands work in Claude Code, Codex, Cursor, Kiro, and OpenCode - the server is the same everywhere, so a prompt you learn in one host transfers to the rest unchanged.

And none of this is a separate ritual you have to schedule. Call /ask-all or /consensus at any point in the work - while you are still scoping a feature, halfway through writing a plan, or in the middle of a /grill-me session when you want a real outside voice in the room instead of arguing with only yourself. The panel is available the whole time, not just at the review at the end. The earlier you pull in a dissenting model, the cheaper the disagreement is to act on.

1. Fan-out with no cross-talk

The next cheapest move: ask several models the same question at once and read the answers side by side. Nobody sees anyone else's reply, so you get real independence instead of an echo.

Two examples where this works the best:

/ask-all can this Terraform state migration be rolled back safely, and what breaks if it cannot?
/ask-all what are the failure modes of switching this table's primary key on a live database?

/ask-all runs the parallel version. panel tells you exactly who would answer for your current config before you spend a token. ask-one lets you fire each member yourself, so each answer lands on screen as it finishes instead of waiting on the slowest one.

Switches: routing.maxFanout caps how many OpenRouter models join (default 3), and each model record has an askAll flag to opt in or out.

There is no voting here. It is parallel sampling. You and your agent (Claude Code, Codex, or anything else) get all the answers and decide.

2. The consensus loop, where the voting lives

This is the heart of it, and it is a state machine, not a vibe. One round goes:

Blind pre-commit. The arbiter writes down its own verdict - approve, request changes, or reject - before any other model sees the work. In writing, first. So its judgment cannot quietly drift to match the crowd later.
Parallel peer review. The panel reviews the same artifact, each in a fresh thread, none seeing another's review.
Blind cross-review. Each model then rates the others' answers with names stripped off (to avoid bias and deference). A "not viable" vote becomes a candidate problem the arbiter has to deal with. This catches the case where everyone looked like they agreed but were each walking past the same hole. (Pattern borrowed from karpathy/llm-council.)
Adjudication. The arbiter goes through every objection and accepts it, dismisses it with a written reason, or defers it. Then it revises the artifact and the round runs again.

It converges only when every responding model approves and the arbiter's pre-committed verdict agrees with them. If they cannot get there inside the round cap, it returns UNRESOLVED and says so. It does not fake a number to look finished.

Click for the detailed diagram with the bias guards and per-model flow.

The answers are not free text the arbiter has to guess at. Every voice returns a structured opinion, and the engine parses it into fixed fields:

a recommendation (the actual call),
a confidence label, so a weak "maybe" does not count the same as a firm "yes",
dissent points, assumptions, and tradeoffs as separate lists, so a disagreement is attached to a reason instead of buried in prose.

Reviews add a verdict (approve / request changes / reject) and a list of critical issues, each sorted into a closed six-category taxonomy - so "three reviewers, nine objections" collapses into a clean, deduplicated set the arbiter can act on. Models are asked to emit this as JSON; the parser is best-effort and never throws, so if a model returns slightly malformed JSON or plain text, the content is salvaged and tagged with how it was read (clean parse vs recovered). Structured where it can be, never brittle.

Switches: consensus.arbiter (auto, the host model, a named provider, or a dedicated model record), consensus.blindVote (add the blind pre-vote on the synthesis path), consensus.maxRounds (default 5, capped at 50), and a per-model consensus flag for who sits on the panel.

So there are four distinct votes in play: the arbiter's blind pre-commit, each peer's independent verdict, the anonymized cross-review rating, and the final convergence check. They are doing different jobs on purpose.

One honest note, since a reviewer pushed me on it: that convergence rule is a heuristic, not a proof. Unanimous approval means nobody in the room objected. It does not mean the answer is right. More on that below.

3. Synthesis, when there is no verdict to give

Not everything is approve-or-reject. "Which of these two designs should we pick?" has no yes or no. Flip synthesizeAlways: true and instead of the loop you get a single arbiter pass that reads every opinion and writes one combined answer - free text, no verdict, no rounds. Use the loop for go/no-go calls. Use synthesis for open questions. Same panel, two shapes.

4. Two drivers, one rulebook

The loop logic lives in one place - a single state machine - and two things drive it.

consensus runs the whole loop server-side with a model as the arbiter and hands you back the result in one call. consensus-step lets your own agent be the arbiter and drive the loop one step at a time, so every move shows up in your transcript where you can see it.

Two entry points is more surface area than one, and they do not behave identically - one is visible step by step, one is a single call. The win is that the rules (how rounds count, when it converges) are written once and shared, so the two paths cannot drift apart on the part that matters. That was worth the extra work.

5. Make the panel stay accountable

In a parallel fan-out, your wall-clock time is the slowest model, not the average. One slow member that rarely says anything new sets the clock for everyone. You will not notice by feel. You need numbers.

So there is an opt-in debug log. Turn on debug.enabled and it writes one line per model call and per round: latency (p50, p95, max), token counts, error rate, the reasoning effort used. It never records your prompts, the responses, or the issue text - only the timing and the outcome of each vote.

Turn on sessions.persist and runs are saved too, so you can ask how often a given model's verdict matched the final call - and, just as useful, how often it was the lone voice that caught something the others missed, or the lone voice that was simply wrong. A model that always agrees adds cost; a model that disagrees and is usually right is the one you keep.

Then an analyze tool reads both back and tells you, in plain terms, who is slow, who is redundant, and which config line to change. From my own panel last week: one model sat at a 200-second p95 while another finished in 15 (Grok 4.3 is often the fastest). analyze flagged it and suggested the one-line switch to drop it from the default fan-out. I had been waiting on that model for almost a week without realizing it.

There is also a small dedup cache, so an identical advisory question inside one session returns instantly instead of paying twice. Well, in tool-heavy work the prompts vary, so it hits less than you would hope.

This is the whole shift from the first post. Consensus was the idea. Making the panel prove it earns its seat is the engineering!

The parts I will not oversell

Multi-model review has real failure modes, and a post that hides them deserves the distrust it gets. So:

Agreement is not truth. Three models can be confidently, unanimously wrong - especially when they were trained on overlapping data and share the same blind spot. Consensus lowers the odds of a lone model's odd mistake. It does not discover facts. Treat it as risk reduction, not a truth oracle.
Pick models that actually differ. A panel of three close cousins is one model agreeing with itself in three accents. The value comes from genuinely different models arguing, which is exactly why the config lets you choose them by hand.
UNRESOLVED is a feature. When the room cannot agree, the honest output is "we could not." That is a signal to slow down and look, not a bug to smooth out. If you wire this into CI, decide on purpose what a deadlock means there - it should probably stop the line, not wave it through.
It is slow, and that is fine for the right job. A full consensus loop can take minutes. Run it on plans, designs, and reviews in the background. For a fast second opinion, use single-shot fan-out or synthesis. Do not put a five-round loop in an interactive path.
Mind where the text goes. Fanning a compliance artifact out to 400 third-party models is a data-boundary decision, not a free upgrade. The config defaults the long tail to off for exactly this reason. Turn models on deliberately.

None of this is specific to Terraform, AWS, or even to code. The loop runs on anything you can put in text - a plan, a runbook, a decision memo. That generality is the point. It just happens to have been built running a compliance product, which is a good place to learn that "looks done" and "is done" are different claims.

The takeaway

The first lesson was: do not trust one confident model, make a few argue. The second one is harder and more useful: adding voices is easy, and most of them will not earn their seat. Measure the panel like you would measure any other service or people, and cut what does not pay.

Deliberation is open source, it works in the agent you already use, and most of the models are ones you already pay for. It ships in the agent-plugins marketplace and as a standalone MCP server on npm (@antonbabenko/deliberation-mcp). Source, and a star button I would genuinely appreciate, are at github.com/antonbabenko/deliberation. Install it, point GPT and Claude at each other, and let me know where it breaks. The cheapest review still happens before you execute - just make sure every reviewer in the room is worth the wait and the cost.

One model is a guess. Three that agree is a plan.

Anton Babenko — Mon, 18 May 2026 08:04:46 +0000

Why I shipped multi-model consensus as a plugin, plus two quieter tools that keep agents honest.

Updated (28.5.2026): /consensus is now a 3-stage loop. Stage 2 adds a blind cross-review where each external rates the others' answers, anonymized, before Claude adjudicates. Pattern from karpathy/llm-council.

Updated (26.5.2026): Grok (xAI) joins GPT and Gemini as the third external provider, Gemini 3 is the default via Google's Antigravity CLI, two new experts (Researcher and Debugger) bring the count to seven, reviews are severity-graded, and /consensus now forces Claude to commit a blind verdict in the transcript before dispatching - arbiter-mediated, not pure democracy.

Ask one model to plan something hard - a migration, a refactor, a cutover - and you get a fluent, confident answer. Fluency is not correctness. A single model is articulate and alone, and being alone is the problem: nothing in the loop disagrees with it, so it rationalizes its first guess into a plan.

The expensive failures with coding agents are almost never syntax. They are plans that read well and were wrong: the wrong abstraction, the missed blast radius, the migration step that bricks state. You find out three hours into execution, not at review time.

I have been running the fix for the last few months - on Terraform modules, and on the everyday work of running compliance.tf, not only on code. It is not a bigger model. It is making models disagree on purpose and then forcing them to resolve it. That is what consensus does, and it is the reason the agent-plugins repo exists. Two other tools ship with it; they are narrower, and I will get to them.

One model is a guess

A single model samples one distribution with no adversary in the room. Two independent models rarely make the same mistake on a plan. Where they diverge is, almost exactly, the risky part of the plan - the assumption nobody checked. A consensus loop turns that disagreement from noise into a signal you can act on.

What `/consensus` actually does

/consensus runs GPT (via Codex), Gemini 3 (via Antigravity), and Grok (via the xAI API) against the same artifact, with Claude as the arbiter. Each round has three stages.

Stage 1 - parallel verdicts. Claude posts its own verdict (APPROVE / REQUEST CHANGES / REJECT) into the transcript first, blind, before any external sees the work. The pre-commitment sits there in writing, so Claude's judgment cannot drift later to match what the others say. Then GPT, Gemini, and Grok review the artifact in parallel, each in a fresh thread, single-shot. None sees another's review.

Stage 2 - blind cross-review. Each external then rates the OTHER externals' answers, identity stripped best-effort. Votes of "not viable" become candidate critical issues the arbiter has to weigh. This catches the case where Stage 1 looks like agreement but is really three reviewers each rationalizing past the same hole. Pattern adapted from karpathy/llm-council. Stage 2 fires every round 1, and after that only when Stage 1 disagreed or the previous Stage 2 surfaced an accepted issue.

Stage 3 - arbiter adjudication. Claude reconciles the Stage 1 verdicts, the Stage 2 candidate issues, and its own blind verdict. Every objection is accepted, dismissed with a recorded reason, or deferred. Claude revises the artifact and the loop runs again, up to five rounds. It converges only when every responding external approves and Claude's pre-committed verdict agrees, with reasons on both sides where it walked back. If the group cannot agree, it says so plainly instead of faking it.

Click for the detailed diagram with bias guards and per-model flow.

Independence is not only about which model. It is sharper when each reviewer wears a different hat. claude-delegator ships seven expert profiles - Architect, Plan Reviewer, Scope Analyst, Code Reviewer, Security Analyst, Researcher, and Debugger.

Combine the axes. A Security Analyst on Gemini and an Architect on GPT fight about different things than one model reviewing twice. Different profiles catch different categories of mistake.

For a migration plan I run the Plan Reviewer broadly and add a Security Analyst pass on top. "Is this safe to run" and "is this the right shape" get argued by separate reviewers, not averaged into one bland verdict.

There is a second kind of contamination worth naming: me. In consensus, every round sends the reviewers the same artifact text, cold, in a fresh thread. They never see my triage, my running verdict, or how I framed the previous round - only the artifact and bounded round metadata. My judgment is applied after they report, not baked into what they receive.

The ask-* commands are the opposite by design: the expert gets exactly the prompt I assembled from the conversation - what I chose to hand it. Fast for a second opinion, but the input is mine, not independent. consensus keeps the input independent and pays for it in extra rounds.

It does not have to be a plan. The loop runs on anything you can put in text - a design, a runbook, a decision memo, a spec. Plans are simply where I reach for it most, and where it converges fastest. The looser and fuzzier the input, the more rounds it takes to agree, so non-plan runs tend to run longer - worth it when the answer matters, overkill for a quick lookup. For that, single-shot ask-gpt, ask-gemini, ask-grok, and ask-all (which fans out to all three in parallel) are right there. consensus is for when it has to be right.

Nothing about this is Terraform, or even code. That generality is why it is the headline.

The quieter two

Same release, narrower scope:

code-intelligence - a language-agnostic skill. Agents grab text grep when they should ask the language server, and silently swap tools when one is missing, then report "found all references." This encodes search precedence: the language server (LSP) for symbols, rg/ripgrep for exact text, an embedding/semantic grep (such as mgrep) for fuzzy discovery - and a hard rule to disclose any substitution on the first line of the reply. You learn the moment coverage drops.
terraform-skill - routes a Terraform/OpenTofu request to its real failure mode (identity churn, blast radius, state corruption) before emitting HCL. It is terraform-ls aware: it knows the language server has no rename provider, so it runs the safe manual reference workflow instead of a blind find-replace. It is approaching 2,000 GitHub stars - the part of this post I am quietly proud of.

Those two are discipline for the agent's hands. Consensus is discipline for its judgment, and judgment generalizes further.

A note on claude-delegator

I did not invent the delegation layer. claude-delegator is a fork of Jarrod Watts' original (MIT, upstream currently quiet), fully based on his design - I kept the structure and the license.

What I added is what months of daily use exposed: a Gemini bridge that wraps Google's Antigravity CLI (agy) with auto-gemini-3 as the routing default and recovers an answer the CLI flushed to disk after a soft timeout instead of failing the call; a fresh Grok bridge over the xAI API that is advisory-only but reads attached files via the xAI Files API (with TTL-based cleanup); two more experts (Researcher and Debugger) on top of the original five; severity-graded reviews so three parallel reports merge cleanly; and a hardened /consensus loop where Claude pre-commits a blind verdict before any external sees the artifact, with a Stage 2 blind cross-review on top (adapted from karpathy/llm-council). Plus the bundled ask-gpt / ask-gemini / ask-grok / ask-all / consensus commands, so the workflow ships with the plugin instead of living in my dotfiles. The seven expert prompts borrow from oh-my-openagent and PAL; both credited in the README.

Credit for the foundation is his; the bug-fixing scar tissue is mine.

Why a plugin, not a blog post

One honest caveat, because I have hit it myself: skills are model-triggered, which makes them soft. Packaging this as a plugin improves reuse and discoverability. It does not guarantee the agent obeys every time - hard enforcement (a real pre-tool gate) is a separate, still-open problem.

I keep finding and fixing bugs in all three. That constant repair is the only reason I trust them enough to write this - and I would rather say so than oversell the fix.

The takeaway

Stop accepting the first confident plan from one model. Make them argue, and only move when they stop. The release is at github.com/antonbabenko/agent-plugins; consensus ships in claude-delegator from the same marketplace. The cheapest review is the one that happens before you execute.

DEV Community: Anton Babenko

Meet Deliberation: 400+ models is easy, knowing which ones earn a place is hard.

A few words first

Install it now

What actually changed

Quick one-offs: just ask

1. Fan-out with no cross-talk

2. The consensus loop, where the voting lives

3. Synthesis, when there is no verdict to give

4. Two drivers, one rulebook

5. Make the panel stay accountable

The parts I will not oversell

The takeaway

One model is a guess. Three that agree is a plan.

One model is a guess

What /consensus actually does

The quieter two

A note on claude-delegator

Why a plugin, not a blog post

The takeaway

What `/consensus` actually does