Why I Built a Tool That Makes Two AIs Argue With Each Other
I use AI models heavily — Claude, Codex, Gemini — and I kept noticing the same pattern: one AI gives me a confident answer, I ship it, and a problem I didn't think of shows up later.
The fix wasn't to find a better model. It was to stop asking one model.
When I started routing the same question to two models independently, then comparing their answers, the quality of my decisions improved noticeably. Not because one model was smarter. Because the disagreement itself was informative.
But doing this manually is painful. You copy the question, paste the context, wait, read two long answers, mentally diff them, paste one answer back to the other model, wait again. On a good day you do this once. Usually you skip it.
So I built WhaleCouncil — a CLI tool that automates the loop.
What it does
council review --models claude-cli,gemini-cli,codex-cli --rounds 2 \
  "Should I use Redis or Postgres for session storage?"
Round 1 — each model answers independently, without seeing the other's response. This matters. If model B reads model A's answer first, it anchors on it. You get a polite agreement, not an independent opinion.
Round 2 — each model sees the other's Round 1 answer and is asked: do you change your position? What do you still disagree with, and why?
Synthesis — a judge model (by default, the same Claude you already have) reads both rounds and extracts the structural diff: what they agree on, where they actually diverge, what fact would resolve the disagreement, and what to do next.
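The round structure is simple enough to sketch in a few lines of Python. This is an illustrative protocol sketch, not WhaleCouncil's actual internals; `run_council` and the `ask` adapter callable are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

def run_council(ask, models, question):
    """Two-round council protocol.

    `ask(model, prompt) -> str` is an adapter callable; in a real
    tool it would shell out to the model's local CLI binary.
    """
    # Round 1: query every model in parallel with NO cross-visibility,
    # so each answer is an independent opinion rather than an echo.
    with ThreadPoolExecutor() as pool:
        round1 = dict(zip(models, pool.map(lambda m: ask(m, question), models)))

    # Round 2: each model sees the others' Round 1 answers and is
    # asked whether (and why) it changes its position.
    round2 = {}
    for m in models:
        others = "\n\n".join(f"{name} answered:\n{ans}"
                             for name, ans in round1.items() if name != m)
        prompt = (f"{question}\n\nThe other models answered:\n\n{others}\n\n"
                  "Do you change your position? What do you still "
                  "disagree with, and why?")
        round2[m] = ask(m, prompt)
    return round1, round2
```

The key design point is that Round 1 prompts contain only the question, never another model's answer, which is what keeps the opinions independent.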
Here's what that looks like on the Redis vs Postgres question:
◆ Round 1
claude-cli ✓ Default: Redis. Session access is high-frequency,
short-TTL, key-value — Redis's home turf. Native
EXPIRE beats a Postgres cleanup job every time...
gemini-cli ✓ It depends on what you already run. If Postgres is
already in the stack, a sessions table with a
partial index on expiry is one less moving part to
operate. Don't add Redis for sessions alone...
codex-cli ✓ Postgres, with an UNLOGGED TABLE for the session
rows. You skip a new service, get SQL for
debugging, and UNLOGGED removes the WAL cost that
makes people reach for Redis in the first place...
⚡ Disagreements
▸ Greenfield default: claude-cli defaults to Redis on access
pattern alone. gemini-cli and codex-cli both default to
Postgres to avoid operating a second datastore. Same facts,
opposite tiebreaker.
▸ Postgres table design: gemini-cli recommends a normal table
+ partial index + cleanup job. codex-cli recommends UNLOGGED
TABLE for speed. UNLOGGED wipes on crash — which silently
logs every user out, a cost codex-cli didn't price in.
◆ Round 2
claude-cli ✓ After seeing the others: I'll qualify "default
Redis." If you already run Postgres and not Redis,
a sessions table is the right call — adding Redis
for sessions alone is premature...
gemini-cli ✓ Position unchanged, but I want to flag codex-cli's
UNLOGGED suggestion: a crash logging out every
active user is an availability incident, not a
perf optimization. Use a logged table...
codex-cli ✓ Conceding the UNLOGGED point — crash-on-restart
wiping all sessions is worse than the WAL cost I
was trying to avoid. Logged table + partial index.
My Postgres-first stance otherwise stands...
→ Synthesis
Converged on: don't add a datastore you don't already run. If
Postgres is in the stack and Redis isn't, a logged sessions
table is the right default. All three rejected UNLOGGED TABLE
by Round 2 (codex-cli conceded).
Remaining: claude-cli still prefers Redis purely on the access
pattern; gemini-cli/codex-cli weight operational cost higher.
This resolves to one question: what is expected session QPS?
Next step: answer (a) do you already run Redis, (b) expected
QPS, (c) do you need forced logout/revocation?
Three different model families came in with different tiebreakers. Round 2 surfaced that one of them had an internal inconsistency (recommending UNLOGGED TABLE, which silently logs every user out on a crash). The others pushed back, and it corrected itself after seeing their answers.
That's not something a single model run gives you.
One AI vs. two AIs is a qualitative difference
When you ask one model, you get one perspective. It might be wrong. It might be missing an option. It might be confidently hallucinating a detail. You have no way to know from the answer itself.
When you ask two models independently:
- If they agree → you have corroborating evidence
- If they disagree → the disagreement tells you something the answers themselves don't: there's a real tradeoff, an unstated assumption, or a missing input
- If one changes position in Round 2 → the other's argument was stronger
The value isn't in the answers. It's in the diff.
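Extracting that diff is the judge's whole job. Here's a hedged sketch of what assembling a synthesis prompt could look like; `build_judge_prompt` is an illustrative name, not the tool's real prompt:

```python
def build_judge_prompt(question, round1, round2):
    """Assemble the synthesis prompt for the judge model.

    Illustrative only. The notable constraint: the judge is asked
    for the structural diff between answers, not a merged answer.
    """
    transcript = []
    for rnd, answers in (("Round 1", round1), ("Round 2", round2)):
        for model, text in answers.items():
            transcript.append(f"## {rnd}: {model}\n{text}")
    return (
        f"Question under review:\n{question}\n\n"
        + "\n\n".join(transcript)
        + "\n\nRead both rounds and report:\n"
        "1. What all models agree on.\n"
        "2. Where they actually diverge, and the tiebreaker each used.\n"
        "3. What single fact or input would resolve the disagreement.\n"
        "4. The concrete next step for the user.\n"
        "Do not pick a winner; extract the diff."
    )
```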
No API key required
WhaleCouncil works with local subscriptions — if you have Claude Max or a Codex subscription, you already have what you need:
pip install whalecouncil
council init # creates ~/.council.toml
council review --models claude-cli,gemini-cli,codex-cli "your question"
The claude-cli, gemini-cli, and codex-cli adapters drive your local claude, gemini, and codex binaries. No API keys needed — your existing subscriptions work out of the box.
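Driving a local binary like this boils down to a subprocess call. A minimal sketch of what such an adapter could look like (`cli_adapter` is a hypothetical helper; the real adapters also deal with per-tool flags, timeouts, and output parsing, so check each binary's own --help for the right invocation):

```python
import shutil
import subprocess

def cli_adapter(binary, args=()):
    """Build an ask(prompt) function that drives a local CLI binary.

    Illustrative sketch. `args` is whatever the binary needs to run
    one-shot and print a plain-text answer; flag names vary per tool.
    """
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")

    def ask(prompt):
        # Pass the prompt on stdin so shell quoting never mangles it.
        result = subprocess.run(
            [binary, *args], input=prompt,
            capture_output=True, text=True, check=True)
        return result.stdout.strip()

    return ask
```

Because the prompt travels over stdin rather than argv, multi-line questions and pasted code survive intact.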
Works with pipes
# Code review
cat auth.py | council review --models claude-cli,gemini-cli,codex-cli \
"Any security issues?"
# Diff review before merging
git diff main | council review --models claude-cli,gemini-cli,codex-cli \
--rounds 2 "What could go wrong with this change?"
# Save to file
council review --file design.md --output markdown > report.md
The design principle
Agent evaluation should measure work, not words.
WhaleCouncil doesn't try to find the "best" model. It uses disagreement as signal. The point is not to get a better answer — it's to surface the questions you should be asking before you commit to an answer.
Try it
pip install whalecouncil
GitHub: openwhale-labs/whalecouncil
Early days — issues, PRs, and use cases welcome.