Anton Babenko for AWS Heroes

Posted on Jun 3

Meet Deliberation: 400+ models is easy, knowing which ones earn a place is hard.

#ai #llm #council #deliberation

A follow-up: how a three-model consensus tool grew into a configurable, measurable panel - and why I now make every model prove it pays for its slot.

It is open source and ready to use today: github.com/antonbabenko/deliberation. If you only do one thing, star the repo and install it in your own agent - it takes about two minutes, and the install section below has the exact command for Claude Code, Codex, Cursor, Kiro, and OpenCode.

Here is the part that surprises people: you are probably already paying for access to a few models. A ChatGPT subscription includes Codex, which lets you use models like gpt-5.5 and gpt-5.3-codex from the Codex CLI. A Claude subscription includes Claude Code. So you can wire GPT and Claude to review each other right now, at no extra cost, and add Gemini, Grok, or any OpenRouter model when you want a third opinion.

A few weeks ago I wrote that one model is a guess and three that agree is a plan. I still believe it. But it skipped the next problem, the one I hit the moment the trick worked: once you can ask three models, you can ask thirty-seven. And more voices is not the same as more signal. A slow model that always agrees with a faster one is not a second opinion. It is a bill.

So the project grew up, and it got a name: Deliberation.

The word fits. Deliberation is the slow, careful weighing of a decision - the thing a jury or a council does before it returns a verdict. That is exactly the job here: a panel of models that weighs one artifact and argues its way to a call, instead of one model blurting the first thing that sounds right. It answers a different question than the first post did. The first post asked: should you make models argue? This one asks: which models, under what rules, and how do you know any of it was worth the wait?

It still runs the everyday work of compliance.tf, same as before. Most of what follows came from that: not from theory, from watching real runs and asking why one took four minutes to tell me what another said in twenty seconds.

A few words first

If you are new to this, here are the only terms you need. Plain version, with examples.

Agent host. The tool you already code in: Claude Code, the Codex CLI, Cursor, Kiro, OpenCode. Deliberation plugs into all of them and behaves the same way in each.
MCP. A standard plug that lets your agent host talk to outside tools. Deliberation is one MCP server, so any host that speaks MCP can use it. You install it once; you do not write glue code.
Provider / model. A provider is a company (OpenAI, Google, xAI, OpenRouter). A model is one brain from that provider (GPT, Gemini, Grok, or any of OpenRouter's 400-plus, like Qwen, DeepSeek, or Kimi).
Panel. The set of models you ask at once. You choose who is on it. Example: a panel of GPT + Gemini + Grok.
Expert (or persona). A hat a model wears for one job: Architect, Security Analyst, Code Reviewer, Plan Reviewer, and three more. The same model reviews differently depending on the hat.
Arbiter. The one who reads everybody's answers and makes the call. It can be a model, or it can be your own agent (you).

That is the whole vocabulary. The rest is just rules for how the panel talks.

Install it now

Deliberation is one MCP server with native packaging for each host. Set only the providers you use - a missing key just turns that one provider off. Full per-host guides: public-docs/hosts.

Claude Code: add the marketplace, install the plugin, run setup.

  /plugin marketplace add antonbabenko/agent-plugins
  /plugin install deliberation
  /deliberation:setup

Codex CLI: codex plugin marketplace add antonbabenko/deliberation, then install deliberation from /plugins.
Cursor: drop in the rule file and use the one-click MCP deeplink (guide).
Kiro: "Import power from GitHub" (guide).
OpenCode: add the opencode.json MCP snippet (guide).

Credentials come from your environment: GPT uses your Codex/OpenAI login (run codex login), Gemini signs in once through its new Antigravity CLI (run agy), Grok reads XAI_API_KEY, OpenRouter reads OPENROUTER_API_KEY. Start with GPT and Claude, since you likely already pay for both.

What actually changed

Three things.

It is one MCP server now, not a Claude-only plugin. It works in Claude Code, Cursor, Codex, Kiro, OpenCode - anything that speaks MCP. Your primary agent stays in charge and calls the panel when a decision is worth a second look.

The panel opened up. It used to be three fixed externals: GPT, Gemini, Grok. Those are still the built-in voices, but you can now add any of OpenRouter's 400-plus models (including Qwen, DeepSeek, Kimi, and others) as named records in a config file. You pick them. You say which ones join a quick fan-out, which ones sit on the consensus panel, and which expert hat each one is allowed to wear - not all are equally good for all tasks.

And the whole thing learned to measure itself. That last part is the real subject of this post.

I will walk through it as five categories. Each one has a switch you flip in config, and each one votes - or refuses to - in a specific way.

Quick one-offs: just ask

Before the heavy machinery, the everyday move. Most of the time you do not want a five-round debate - you want a fast second opinion from someone who is not the model you have been talking to all session.

So you just ask, in plain words: "Ask Grok and Gemini whether this retry loop can deadlock." Your agent routes that to the right voices and brings back the answers. Or use the explicit commands when you know who you want:

/ask-gpt does this IAM policy grant more than it should?
/ask-grok poke holes in this rollback plan
/ask-gemini summarize the risk in this migration in 5 bullets

These are single-shot and advisory: one question, one answer, no loop, no context contamination from your session. The exact same commands work in Claude Code, Codex, Cursor, Kiro, and OpenCode - the server is the same everywhere, so a prompt you learn in one host transfers to the rest unchanged.

And none of this is a separate ritual you have to schedule. Call /ask-all or /consensus at any point in the work - while you are still scoping a feature, halfway through writing a plan, or in the middle of a /grill-me session when you want a real outside voice in the room instead of arguing with only yourself. The panel is available the whole time, not just at the review at the end. The earlier you pull in a dissenting model, the cheaper the disagreement is to act on.

1. Fan-out with no cross-talk

The next cheapest move: ask several models the same question at once and read the answers side by side. Nobody sees anyone else's reply, so you get real independence instead of an echo.

Two examples where this works best:

/ask-all can this Terraform state migration be rolled back safely, and what breaks if it cannot?
/ask-all what are the failure modes of switching this table's primary key on a live database?

/ask-all runs the parallel version. panel tells you exactly who would answer for your current config before you spend a token. ask-one lets you fire each member yourself, so each answer lands on screen as it finishes instead of waiting on the slowest one.

Switches: routing.maxFanout caps how many OpenRouter models join (default 3), and each model record has an askAll flag to opt in or out.

There is no voting here. It is parallel sampling. You and your agent (Claude Code, Codex, or anything else) get all the answers and decide.
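
The fan-out described here can be sketched in a few lines of Python. This is a minimal illustration, not Deliberation's actual code: the `ModelRecord` fields, the `max_fanout` argument, and the `fake_ask` coroutine are hypothetical stand-ins for the askAll flag, routing.maxFanout, and a real provider call. The sketch also assumes the built-in voices are never counted against the cap.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ModelRecord:
    name: str
    ask_all: bool = True    # hypothetical stand-in for the per-model askAll flag
    builtin: bool = False   # assumption: built-in voices are not counted against the cap


def select_panel(records, max_fanout=3):
    """Keep opted-in models; cap only the non-built-in (OpenRouter-style) ones."""
    opted_in = [r for r in records if r.ask_all]
    builtin = [r for r in opted_in if r.builtin]
    extra = [r for r in opted_in if not r.builtin][:max_fanout]
    return builtin + extra


async def fake_ask(record, question):
    # Placeholder for a real provider call; each call only ever sees the question.
    await asyncio.sleep(0)
    return f"{record.name}: answer to {question!r}"


async def ask_all(records, question, max_fanout=3):
    panel = select_panel(records, max_fanout)
    # gather runs the calls concurrently; no call receives another call's reply.
    answers = await asyncio.gather(*(fake_ask(r, question) for r in panel))
    return dict(zip((r.name for r in panel), answers))
```

Because every call receives only the question, independence is structural: there is no code path by which one answer could reach another model.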

2. The consensus loop, where the voting lives

This is the heart of it, and it is a state machine, not a vibe. One round goes:

Blind pre-commit. The arbiter writes down its own verdict - approve, request changes, or reject - before any other model sees the work. In writing, first. So its judgment cannot quietly drift to match the crowd later.
Parallel peer review. The panel reviews the same artifact, each in a fresh thread, none seeing another's review.
Blind cross-review. Each model then rates the others' answers with names stripped off (to avoid bias and deference). A "not viable" vote becomes a candidate problem the arbiter has to deal with. This catches the case where everyone looked like they agreed but were each walking past the same hole. (Pattern borrowed from karpathy/llm-council.)
Adjudication. The arbiter goes through every objection and accepts it, dismisses it with a written reason, or defers it. Then it revises the artifact and the round runs again.
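
The blind cross-review step depends on stripping names before anyone rates anything. A minimal Python sketch of that idea follows. The label format, the function names, and the rating strings are assumptions for illustration, not the project's real data model.

```python
import random


def anonymize(reviews, seed=None):
    """Replace reviewer names with neutral labels in shuffled order.

    reviews maps model name -> review text. Returns (labelled, key), where
    key maps label -> model name and stays with the engine, never the raters.
    """
    rng = random.Random(seed)
    names = list(reviews)
    rng.shuffle(names)
    labelled, key = {}, {}
    for index, name in enumerate(names):
        label = f"Response {chr(ord('A') + index)}"
        labelled[label] = reviews[name]
        key[label] = name
    return labelled, key


def candidate_problems(ratings, key):
    """Turn every 'not viable' rating into a candidate problem for the arbiter."""
    return [
        {"target": key[label], "raised_by": rater}
        for rater, votes in ratings.items()
        for label, vote in votes.items()
        if vote == "not viable"
    ]
```

Shuffling before labelling matters: without it, "Response A" would always be the first model in the config, and the label would leak the name.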

It converges only when every responding model approves and the arbiter's pre-committed verdict agrees with them. If they cannot get there inside the round cap, it returns UNRESOLVED and says so. It does not fake a number to look finished.
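
The convergence rule is small enough to write down. The Python sketch below shows the rule as this post states it, with hypothetical names throughout. Two details are assumptions: a model that did not respond is represented as None, and a round with no responders never converges.

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REQUEST_CHANGES = "request changes"
    REJECT = "reject"


def converged(arbiter_precommit, peer_verdicts):
    """True only if every responding peer approves and the arbiter's blind verdict agrees.

    peer_verdicts maps model name -> Verdict, or None when the model did not respond.
    """
    responding = [v for v in peer_verdicts.values() if v is not None]
    if not responding:
        return False  # assumption: an empty room cannot converge
    return arbiter_precommit is Verdict.APPROVE and all(
        v is Verdict.APPROVE for v in responding
    )


def run_loop(rounds, max_rounds=5):
    """rounds: iterable of (arbiter_precommit, peer_verdicts), one tuple per round."""
    max_rounds = max(1, min(max_rounds, 50))  # consensus.maxRounds: default 5, cap 50
    for number, (precommit, peers) in enumerate(rounds, start=1):
        if number > max_rounds:
            break
        if converged(precommit, peers):
            return ("CONVERGED", number)
    # No fake score: running out of rounds is reported as such.
    return ("UNRESOLVED", None)
```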


The answers are not free text the arbiter has to guess at. Every voice returns a structured opinion, and the engine parses it into fixed fields:

a recommendation (the actual call),
a confidence label, so a weak "maybe" does not count the same as a firm "yes",
dissent points, assumptions, and tradeoffs as separate lists, so a disagreement is attached to a reason instead of buried in prose.

Reviews add a verdict (approve / request changes / reject) and a list of critical issues, each sorted into a closed six-category taxonomy - so "three reviewers, nine objections" collapses into a clean, deduplicated set the arbiter can act on. Models are asked to emit this as JSON; the parser is best-effort and never throws, so if a model returns slightly malformed JSON or plain text, the content is salvaged and tagged with how it was read (clean parse vs recovered). Structured where it can be, never brittle.
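
A best-effort parser of this kind might look like the Python sketch below. It is an illustration under assumptions, not the engine's real parser: the field names (`verdict`, `critical_issues`, `text`) and the `parse` tag values are hypothetical, and the keyword fallback is one possible salvage strategy among many.

```python
import json
import re

# Checked in this order so "request changes" is never mistaken for a bare keyword.
VERDICTS = ("request changes", "approve", "reject")


def parse_review(raw):
    """Best-effort parse that never raises; tags how the text was read."""
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return {**data, "parse": "clean"}
    except (TypeError, ValueError):
        pass
    text = raw if isinstance(raw, str) else str(raw)
    # Second chance: a JSON object embedded in prose or inside a code fence.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict):
                return {**data, "parse": "recovered"}
        except ValueError:
            pass
    # Last resort: keep the text and look for a verdict keyword.
    lowered = text.lower()
    verdict = next((v for v in VERDICTS if v in lowered), None)
    return {"parse": "recovered", "verdict": verdict, "text": text}
```

The tag is the important part. A recovered verdict can still count, but the arbiter knows to treat it with less confidence than a clean parse.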

Switches: consensus.arbiter (auto, the host model, a named provider, or a dedicated model record), consensus.blindVote (add the blind pre-vote on the synthesis path), consensus.maxRounds (default 5, capped at 50), and a per-model consensus flag for who sits on the panel.

So there are four distinct votes in play: the arbiter's blind pre-commit, each peer's independent verdict, the anonymized cross-review rating, and the final convergence check. They are doing different jobs on purpose.

One honest note, since a reviewer pushed me on it: that convergence rule is a heuristic, not a proof. Unanimous approval means nobody in the room objected. It does not mean the answer is right. More on that below.

3. Synthesis, when there is no verdict to give

Not everything is approve-or-reject. "Which of these two designs should we pick?" has no yes or no. Flip synthesizeAlways: true and instead of the loop you get a single arbiter pass that reads every opinion and writes one combined answer - free text, no verdict, no rounds. Use the loop for go/no-go calls. Use synthesis for open questions. Same panel, two shapes.

4. Two drivers, one rulebook

The loop logic lives in one place - a single state machine - and two things drive it.

consensus runs the whole loop server-side with a model as the arbiter and hands you back the result in one call. consensus-step lets your own agent be the arbiter and drive the loop one step at a time, so every move shows up in your transcript where you can see it.

Two entry points mean more surface area than one, and they do not behave identically - one is visible step by step, the other is a single call. The win is that the rules (how rounds count, when it converges) are written once and shared, so the two paths cannot drift apart on the part that matters. That was worth the extra work.
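
One way to picture "two drivers, one rulebook" is a single transition function that both entry points call. The Python sketch below is a simplified model with hypothetical phase names and function names. It is not the project's state machine, and it assumes the convergence decision is made at the end of adjudication.

```python
PHASES = ["precommit", "peer_review", "cross_review", "adjudicate"]
TERMINAL = ("converged", "unresolved")


def next_state(state, approved=False, max_rounds=5):
    """Single source of truth for how phases advance and when the loop ends."""
    phase, round_no = state
    if phase in TERMINAL:
        return state
    index = PHASES.index(phase)
    if index < len(PHASES) - 1:
        return (PHASES[index + 1], round_no)
    # End of a round: converge, start another round, or give up at the cap.
    if approved:
        return ("converged", round_no)
    if round_no >= max_rounds:
        return ("unresolved", round_no)
    return ("precommit", round_no + 1)


def run_to_end(approvals, max_rounds=5):
    """Driver 1: run the whole loop in one call, like a server-side run."""
    state = ("precommit", 1)
    approvals = iter(approvals)
    while state[0] not in TERMINAL:
        approved = next(approvals, False) if state[0] == "adjudicate" else False
        state = next_state(state, approved, max_rounds)
    return state


def step_once(state, approved=False, max_rounds=5):
    """Driver 2: the caller advances one step at a time and sees each state."""
    return next_state(state, approved, max_rounds)
```

Neither driver contains a rule of its own. Round counting and termination live only in `next_state`, so changing the cap or the convergence condition changes both paths at once.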

5. Make the panel stay accountable

In a parallel fan-out, your wall-clock time is the slowest model, not the average. One slow member that rarely says anything new sets the clock for everyone. You will not notice by feel. You need numbers.

So there is an opt-in debug log. Turn on debug.enabled and it writes one line per model call and per round: latency (p50, p95, max), token counts, error rate, the reasoning effort used. It never records your prompts, the responses, or the issue text - only the timing and the outcome of each vote.
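
The latency figures are simple to compute once the per-call timings exist. Here is a minimal Python sketch using nearest-rank percentiles. The function names and the input shape are assumptions, and the debug log's real format may differ.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_by_model):
    """latencies_by_model maps model name -> list of call latencies in seconds."""
    return {
        name: {"p50": percentile(xs, 50), "p95": percentile(xs, 95), "max": max(xs)}
        for name, xs in latencies_by_model.items()
    }


def fanout_wall_clock(latencies):
    """A parallel fan-out finishes when its slowest member does."""
    return max(latencies.values())
```

With made-up timings, `fanout_wall_clock({"fast": 15, "slow": 200})` returns 200: the fast model's 15 seconds never shows up in what you actually wait for.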

Turn on sessions.persist and runs are saved too, so you can ask how often a given model's verdict matched the final call - and, just as useful, how often it was the lone voice that caught something the others missed, or the lone voice that was simply wrong. A model that always agrees adds cost; a model that disagrees and is usually right is the one you keep.
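
The "does this model earn its seat" question reduces to a few counters over saved runs. The Python sketch below assumes a hypothetical record shape (`final` plus a `verdicts` map). It also treats "right" as "matched the final call", which is a proxy rather than ground truth.

```python
def seat_report(runs, model):
    """Summarize how one model's verdicts relate to the final call.

    runs: list of dicts such as
        {"final": "approve", "verdicts": {"gpt": "approve", "grok": "reject"}}
    Returns None when the model never voted.
    """
    voted = [r for r in runs if model in r["verdicts"]]
    if not voted:
        return None
    agree = lone_right = lone_wrong = 0
    for run in voted:
        mine = run["verdicts"][model]
        others = [v for name, v in run["verdicts"].items() if name != model]
        # A lone voice disagrees with every other panel member in that run.
        is_lone = bool(others) and all(v != mine for v in others)
        matched_final = mine == run["final"]
        if matched_final:
            agree += 1
        if is_lone and matched_final:
            lone_right += 1
        elif is_lone:
            lone_wrong += 1
    return {
        "agreement_rate": agree / len(voted),
        "lone_right": lone_right,
        "lone_wrong": lone_wrong,
    }
```

A model with an agreement rate near 1.0 and zero lone-right calls is the "bill" case. A model with a lower agreement rate and several lone-right calls is the one worth its latency.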

Then an analyze tool reads both back and tells you, in plain terms, who is slow, who is redundant, and which config line to change. From my own panel last week: one model sat at a 200-second p95 while another finished in 15 (Grok 4.3 is often the fastest). analyze flagged it and suggested the one-line switch to drop it from the default fan-out. I had been waiting on that model for almost a week without realizing it.

There is also a small dedup cache, so an identical advisory question inside one session returns instantly instead of paying twice. In tool-heavy work the prompts vary, though, so it hits less often than you would hope.
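
A session-scoped dedup cache can be as simple as a dictionary keyed by a hash of the model and prompt. The Python sketch below is an assumption about the general shape, with a hypothetical class name and `call` signature. It also shows why the hit rate stays low: the key is an exact match, so a single extra character is a miss.

```python
import hashlib


class SessionDedupCache:
    """Cache answers for identical (model, prompt) pairs within one session."""

    def __init__(self):
        self._answers = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Exact-match key: any change to the prompt produces a different hash.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def ask(self, model, prompt, call):
        """Return the cached answer, or invoke call(model, prompt) and store it."""
        key = self._key(model, prompt)
        if key in self._answers:
            self.hits += 1
            return self._answers[key]
        self.misses += 1
        answer = call(model, prompt)
        self._answers[key] = answer
        return answer
```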

This is the whole shift from the first post. Consensus was the idea. Making the panel prove it earns its seat is the engineering!

The parts I will not oversell

Multi-model review has real failure modes, and a post that hides them deserves the distrust it gets. So:

Agreement is not truth. Three models can be confidently, unanimously wrong - especially when they were trained on overlapping data and share the same blind spot. Consensus lowers the odds of a lone model's odd mistake. It does not discover facts. Treat it as risk reduction, not a truth oracle.
Pick models that actually differ. A panel of three close cousins is one model agreeing with itself in three accents. The value comes from genuinely different models arguing, which is exactly why the config lets you choose them by hand.
UNRESOLVED is a feature. When the room cannot agree, the honest output is "we could not." That is a signal to slow down and look, not a bug to smooth out. If you wire this into CI, decide on purpose what a deadlock means there - it should probably stop the line, not wave it through.
It is slow, and that is fine for the right job. A full consensus loop can take minutes. Run it on plans, designs, and reviews in the background. For a fast second opinion, use single-shot fan-out or synthesis. Do not put a five-round loop in an interactive path.
Mind where the text goes. Fanning a compliance artifact out to 400 third-party models is a data-boundary decision, not a free upgrade. The config defaults the long tail to off for exactly this reason. Turn models on deliberately.

None of this is specific to Terraform, AWS, or even to code. The loop runs on anything you can put in text - a plan, a runbook, a decision memo. That generality is the point. It just happens to have been built running a compliance product, which is a good place to learn that "looks done" and "is done" are different claims.

The takeaway

The first lesson was: do not trust one confident model, make a few argue. The second one is harder and more useful: adding voices is easy, and most of them will not earn their seat. Measure the panel the way you would measure any other service or team, and cut what does not pay.

Deliberation is open source, it works in the agent you already use, and most of the models are ones you already pay for. It ships in the agent-plugins marketplace and as a standalone MCP server on npm (@antonbabenko/deliberation-mcp). Source, and a star button I would genuinely appreciate, are at github.com/antonbabenko/deliberation. Install it, point GPT and Claude at each other, and let me know where it breaks. The cheapest review still happens before you execute - just make sure every reviewer in the room is worth the wait and the cost.
