
Noetherly

Why I Built a Tool That Makes Two AIs Argue With Each Other


I use AI models heavily — Claude, Codex, Gemini — and I kept noticing the same pattern: one AI gives me a confident answer, I ship it, and a problem I didn't think of shows up later.

The fix wasn't to find a better model. It was to stop asking one model.

When I started routing the same question to two models independently, then comparing their answers, the quality of my decisions improved noticeably. Not because one model was smarter. Because the disagreement itself was informative.

But doing this manually is painful. You copy the question, paste the context, wait, read two long answers, mentally diff them, paste one answer back to the other model, wait again. On a good day you do this once. Usually you skip it.

So I built WhaleCouncil — a CLI tool that automates the loop.


What it does

council review --models claude-cli,gemini-cli,codex-cli --rounds 2 \
  "Should I use Redis or Postgres for session storage?"

Round 1 — each model answers independently, without seeing the others' responses. This matters: if model B reads model A's answer first, it anchors on it. You get polite agreement, not an independent opinion.

Round 2 — each model sees the others' Round 1 answers and is asked: do you change your position? What do you still disagree with, and why?

Synthesis — a judge model (by default, the same Claude you already have) reads both rounds and extracts the structural diff: what they agree on, where they actually diverge, what fact would resolve the disagreement, and what to do next.
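The loop itself is simple. Here's a minimal sketch in Python — `ask()` is a hypothetical stand-in for invoking a model, not WhaleCouncil's internal API:

```python
def ask(model: str, prompt: str) -> str:
    # Hypothetical placeholder: in reality this would call a model.
    return f"[{model}] answer to: {prompt}"

def debate(question: str, models: list[str]) -> dict:
    # Round 1: each model answers independently, so nobody anchors
    # on anyone else's response.
    round1 = {m: ask(m, question) for m in models}

    # Round 2: each model sees the others' Round 1 answers and is
    # asked whether it changes its position.
    round2 = {}
    for m in models:
        others = "\n\n".join(a for o, a in round1.items() if o != m)
        round2[m] = ask(m, f"{question}\n\nOther answers:\n{others}\n"
                           "Do you change your position? What do you "
                           "still disagree with, and why?")

    # Synthesis: a judge model reads both rounds and extracts the diff.
    judge_prompt = (f"Question: {question}\nRound 1: {round1}\n"
                    f"Round 2: {round2}\nWhat do they agree on, where "
                    "do they diverge, and what fact would resolve it?")
    return {"round1": round1, "round2": round2,
            "synthesis": ask(models[0], judge_prompt)}
```

The key structural choice is that Round 1 answers are collected before any cross-pollination happens.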

Here's what that looks like on the Redis vs Postgres question:

◆ Round 1
claude-cli ✓  Default: Redis. Session access is high-frequency, 
              short-TTL, key-value — Redis's home turf. Native 
              EXPIRE beats a Postgres cleanup job every time...

gemini-cli ✓  It depends on what you already run. If Postgres is 
              already in the stack, a sessions table with a 
              partial index on expiry is one less moving part to 
              operate. Don't add Redis for sessions alone...

codex-cli  ✓  Postgres, with an UNLOGGED TABLE for the session 
              rows. You skip a new service, get SQL for 
              debugging, and UNLOGGED removes the WAL cost that 
              makes people reach for Redis in the first place...

⚡ Disagreements
  ▸ Greenfield default: claude-cli defaults to Redis on access 
    pattern alone. gemini-cli and codex-cli both default to 
    Postgres to avoid operating a second datastore. Same facts, 
    opposite tiebreaker.

  ▸ Postgres table design: gemini-cli recommends a normal table 
    + partial index + cleanup job. codex-cli recommends UNLOGGED 
    TABLE for speed. UNLOGGED wipes on crash — which silently 
    logs every user out, a cost codex-cli didn't price in.

◆ Round 2
claude-cli ✓  After seeing the others: I'll qualify "default 
              Redis." If you already run Postgres and not Redis, 
              a sessions table is the right call — adding Redis 
              for sessions alone is premature...

gemini-cli ✓  Position unchanged, but I want to flag codex-cli's 
              UNLOGGED suggestion: a crash logging out every 
              active user is an availability incident, not a 
              perf optimization. Use a logged table...

codex-cli  ✓  Conceding the UNLOGGED point — crash-on-restart 
              wiping all sessions is worse than the WAL cost I 
              was trying to avoid. Logged table + partial index. 
              My Postgres-first stance otherwise stands...

→ Synthesis
  Converged on: don't add a datastore you don't already run. If 
  Postgres is in the stack and Redis isn't, a logged sessions 
  table is the right default. All three rejected UNLOGGED TABLE 
  by Round 2 (codex-cli conceded).

  Remaining: claude-cli still prefers Redis purely on the access 
  pattern; gemini-cli/codex-cli weight operational cost higher. 
  This resolves to one question: what is expected session QPS?

  Next step: answer (a) do you already run Redis, (b) expected 
  QPS, (c) do you need forced logout/revocation?

Three different model families came in with different tiebreakers. Round 2 surfaced that one of them had an internal inconsistency (recommending UNLOGGED TABLE, which silently logs every user out on a crash). The others pushed back, and it corrected itself after seeing their answers.

That's not something a single model run gives you.


One AI vs. two AIs is a qualitative difference

When you ask one model, you get one perspective. It might be wrong. It might be missing an option. It might be confidently hallucinating a detail. You have no way to know from the answer itself.

When you ask two models independently:

  • If they agree → you have corroborating evidence
  • If they disagree → the disagreement tells you something the answers themselves don't: there's a real tradeoff, an unstated assumption, or a missing input
  • If one changes position in Round 2 → the other's argument was stronger
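Encoded as a decision rule, those bullets amount to this (a trivial sketch, just to make the mapping explicit):

```python
def interpret(agree: bool, changed_in_round2: bool) -> str:
    """Map the diff between two independent answers to a signal."""
    if agree:
        return "corroborating evidence"
    if changed_in_round2:
        return "the other's argument was stronger"
    return "real tradeoff, unstated assumption, or missing input"
```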

The value isn't in the answers. It's in the diff.


No API key required

WhaleCouncil works with local subscriptions — if you have Claude Max or a Codex subscription, you already have what you need:

pip install whalecouncil
council init   # creates ~/.council.toml
council review --models claude-cli,gemini-cli,codex-cli "your question"

The claude-cli, gemini-cli, and codex-cli adapters drive your local claude, gemini, and codex binaries. No API keys needed — your existing subscriptions work out of the box.
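An adapter like that can be little more than a subprocess wrapper around the local binary. A rough sketch — the assumption that the binary takes its prompt on stdin is illustrative, not a claim about any specific CLI's flags:

```python
import subprocess

def run_cli_adapter(binary: str, prompt: str, timeout: int = 300) -> str:
    """Drive a local CLI binary by passing the prompt on stdin and
    reading the answer from stdout. Flags and I/O conventions vary
    per tool; this sketch assumes a prompt-on-stdin binary."""
    result = subprocess.run(
        [binary],
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{binary} failed: {result.stderr.strip()}")
    return result.stdout.strip()
```

Because the adapter only shells out to a binary you've already authenticated, the subscription's auth is reused for free.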


Works with pipes

# Code review
cat auth.py | council review --models claude-cli,gemini-cli,codex-cli \
  "Any security issues?"

# Diff review before merging
git diff main | council review --models claude-cli,gemini-cli,codex-cli \
  --rounds 2 "What could go wrong with this change?"

# Save to file
council review --file design.md --output markdown > report.md
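Piped input works the way you'd expect from any Unix tool: read stdin when it isn't a terminal and treat it as context. A sketch of that pattern (illustrative, not WhaleCouncil's source):

```python
import sys

def build_prompt(question: str, stream=None) -> str:
    """Prepend piped context to the question. Reads `stream`
    (default: sys.stdin) only when it is not an interactive
    terminal, so plain `council review "q"` still works."""
    stream = stream if stream is not None else sys.stdin
    if not stream.isatty():
        context = stream.read()
        if context.strip():
            return f"Context:\n{context}\n\nQuestion: {question}"
    return question
```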

The design principle

Agent evaluation should measure work, not words.

WhaleCouncil doesn't try to find the "best" model. It uses disagreement as signal. The point is not to get a better answer — it's to surface the questions you should be asking before you commit to an answer.


Try it

pip install whalecouncil

GitHub: openwhale-labs/whalecouncil

Early days — issues, PRs, and use cases welcome.
