You ask Claude to fix a bug. It nails it. You ask it again with slightly different phrasing. It refactors half your module and breaks three unrelated tests. Same model, same task, different result.
This is the fundamental problem with AI coding today: pass@1 — the chance a single attempt succeeds — is a gamble.
Running the same task multiple times and picking the best result dramatically improves reliability. It's the same principle as ensemble methods in ML — and recent research confirms it works for code generation too, with a caveat: naive consensus can amplify shared mistakes, so the selection method matters as much as the ensemble size. We built thinktank to make this practical. Since thinktank currently runs a single model (Claude), test execution — not consensus — is the primary quality signal.
One command
thinktank run "fix the authentication bypass" -n 5 -t "npm test"
Under the hood:
- N isolated git clones — each agent gets a fully independent copy of your repo
- N parallel Claude Code agents, each solving the task with zero knowledge of the others
- Test verification — your test suite runs in each clone
- Convergence analysis — clusters agents by diff similarity
- Copeland pairwise scoring — ranks agents via head-to-head comparison
- thinktank apply — applies the winner to your working tree
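The fan-out/fan-in shape of that pipeline can be sketched in a few lines. This is an illustrative reconstruction, not thinktank's actual internals — `runAgent` stands in for "clone the repo, run a Claude Code agent, run the tests", and the final sort is a simplified stand-in for the full Copeland ranking:

```typescript
// Hypothetical sketch of the fan-out/fan-in flow (names are illustrative):
// run N agents concurrently, keep the ones whose tests pass, rank survivors.

type AgentResult = { id: number; passed: boolean; diffLines: number };

// Stand-in for "git clone + agent run + test run" in each isolated copy.
async function runAgent(id: number): Promise<AgentResult> {
  // Simulated outcomes: roughly half of attempts fail, mirroring the stats below.
  const passed = id % 2 === 1;
  return { id, passed, diffLines: 40 + id * 5 };
}

async function ensemble(n: number): Promise<AgentResult | undefined> {
  // Agents run concurrently, so wall-clock time is the slowest agent, not the sum.
  const results = await Promise.all(
    Array.from({ length: n }, (_, i) => runAgent(i + 1)),
  );
  // Keep only agents whose suite passed, then prefer the smallest diff
  // (a simplified stand-in for the real Copeland ranking step).
  return results
    .filter((r) => r.passed)
    .sort((a, b) => a.diffLines - b.diffLines)[0];
}

ensemble(5).then((winner) => console.log(winner?.id)); // agents 1, 3, 5 pass; #1 wins
```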
The cost is real: 5 agents = 5× the API tokens. A typical 5-agent run on a medium codebase costs roughly $1-5 depending on task complexity and model (Sonnet is cheaper, Opus is pricier). But they run in parallel, so wall-clock time is the slowest agent, not the sum — typically a few minutes.
"Why not just spawn N agents yourself?" You could. But thinktank handles the parts that are tedious to DIY: git isolation and cleanup, parallel orchestration with timeout/retry, test execution per clone, diff similarity across all pairs, and Copeland scoring. The isolation and the analysis are the product — not the parallelism.
Where it gets interesting: A* pathfinding
We gave 5 agents a grid-based pathfinding challenge: implement A* with their choice of heuristic, data structures, and optimizations. We ran it in both TypeScript and Python.
Python (pytest): 5 agents, 5/5 pass all 7 tests, 71% convergence
TypeScript (node:test): 5 agents, 3/5 pass, 68% convergence
All agents independently chose Manhattan distance and a min-heap priority queue — the textbook approach. But the implementations diverged:
- Agents 1-3: Standard A* in ~38 lines, no heap tiebreaking
- Agents 4-5: A* with a heap tiebreak counter (~46 lines) — prevents pathological node exploration when cells have equal f-scores. Up to 37% fewer nodes explored, same optimal path.
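The tiebreak idea is simple to show in isolation. This is an illustrative sketch, not Agent #4's actual code: each heap entry carries a monotonically increasing sequence number, so nodes with equal f-scores pop in insertion order instead of an arbitrary heap order:

```typescript
// Min-heap whose entries break f-score ties with an insertion counter (FIFO
// among equals) — the technique Agents 4-5 used, sketched from scratch here.

type Entry = { f: number; seq: number; node: string };

class MinHeap {
  private items: Entry[] = [];
  private seq = 0;

  push(f: number, node: string): void {
    // The counter deterministically orders entries with equal f-scores.
    this.items.push({ f, seq: this.seq++, node });
    let i = this.items.length - 1;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (!this.less(i, parent)) break;
      [this.items[i], this.items[parent]] = [this.items[parent], this.items[i]];
      i = parent;
    }
  }

  pop(): string | undefined {
    if (this.items.length === 0) return undefined;
    const top = this.items[0];
    const last = this.items.pop()!;
    if (this.items.length > 0) {
      this.items[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let smallest = i;
        if (l < this.items.length && this.less(l, smallest)) smallest = l;
        if (r < this.items.length && this.less(r, smallest)) smallest = r;
        if (smallest === i) break;
        [this.items[i], this.items[smallest]] = [this.items[smallest], this.items[i]];
        i = smallest;
      }
    }
    return top.node;
  }

  private less(a: number, b: number): boolean {
    const x = this.items[a], y = this.items[b];
    return x.f !== y.f ? x.f < y.f : x.seq < y.seq; // tiebreak on insertion order
  }
}

const h = new MinHeap();
h.push(5, "A"); h.push(5, "B"); h.push(3, "C");
console.log(h.pop(), h.pop(), h.pop()); // C A B — equal f-scores pop FIFO
```

Without the counter, equal-f entries pop in whatever order the heap's internal swaps left them in, which is where the pathological exploration comes from.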
Here's the twist: Copeland scoring recommends Agent #1 — smallest diff, passes all tests, high convergence. By thinktank's reliability criteria, it's the safest pick. But Agent #4's approach is algorithmically superior, and you'd never know it existed with one agent.
This is the deeper value: not just picking a winner, but seeing the design space. Copeland gives you the safe choice. The full ensemble reveals approaches worth stealing.
Want to see why Agent #4's approach was different?
$ thinktank compare 1 4
Comparing Agent #1 vs Agent #4
──────────────────────────────────────────────────────────
Agent #1 (recommended): success | tests: pass | +399/-0 | 1 files
Agent #4: success | tests: pass | +259/-0 | 1 files
Similarity: ████░░░░░░░░░░░░░░░░ 19%
Files changed:
both .../astar-python/test_pathfinding_generated.py
Added lines:
Shared: 58
Only #1: 158
Only #4: 91
Only 19% similarity — both agents wrote valid tests for the same module, but took very different approaches. Apply the winner, or pick a specific agent:
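The 19% figure is consistent with Jaccard similarity over the sets of added lines — one plausible way to compute it (an assumption for illustration, not necessarily thinktank's exact formula):

```typescript
// Jaccard similarity over added-line sets: |A ∩ B| / |A ∪ B|.
// (An assumed metric for illustration, not thinktank's verbatim code.)
function jaccard(shared: number, onlyA: number, onlyB: number): number {
  return shared / (shared + onlyA + onlyB);
}

// Numbers from the compare output above: 58 shared, 158 only in #1, 91 only in #4.
console.log(Math.round(jaccard(58, 158, 91) * 100) + "%"); // 19%
```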
$ thinktank apply
Applying changes from Agent #1...
Changes applied successfully.
Cleaning up clones...
Done.
Review the changes with: git diff
Commit when ready: git add -A && git commit
$ thinktank undo # changed your mind? roll it back
Undo complete — last applied patch has been reversed.
Same pattern, different domain: gradient descent
We ran the same experiment on ML: 5 agents implementing linear regression via batch gradient descent. All five wrote structurally identical code — normalize features, compute gradients, update weights. 76% convergence. But:
| Agent | Epochs | Tests | Result |
|---|---|---|---|
| #1, #3, #4 | 1,000 | 5/7 | fail test_perfect_fit |
| #2 | 2,000 | 6/6 | pass — Copeland pick |
| #5 | 10,000 | 6/6 | pass — over-engineered |
Three agents under-train. One over-trains. The Copeland pick (#2) is the Goldilocks solution — just enough iterations to converge, no wasted compute. The algorithm is identical across all five; the only difference is a hyperparameter choice that only surfaces on edge-case test inputs. It's exactly the kind of subtle difference that an ensemble catches and a single agent misses.
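The epoch-count effect is easy to reproduce in miniature. This sketch uses illustrative data and hyperparameters, not the actual benchmark: batch gradient descent for y = w·x + b, where only the epoch count differs between "agents":

```typescript
// Batch gradient descent on a tiny linear-regression problem (illustrative;
// not the benchmark task). Ground truth: w = 2, b = 1.
function fit(epochs: number, lr = 0.01): { w: number; b: number } {
  const xs = [0, 1, 2, 3, 4];
  const ys = xs.map((x) => 2 * x + 1);
  let w = 0, b = 0;
  for (let e = 0; e < epochs; e++) {
    let dw = 0, db = 0;
    for (let i = 0; i < xs.length; i++) {
      const err = w * xs[i] + b - ys[i];
      dw += (2 / xs.length) * err * xs[i]; // ∂MSE/∂w
      db += (2 / xs.length) * err;        // ∂MSE/∂b
    }
    w -= lr * dw;
    b -= lr * db;
  }
  return { w, b };
}

// An under-trained run stops short of the optimum; a longer run nails it.
const short = fit(100);
const long = fit(10_000);
console.log(Math.abs(short.b - 1) > 0.05);                               // true: bias still off
console.log(Math.abs(long.w - 2) < 1e-4 && Math.abs(long.b - 1) < 1e-4); // true: converged
```

Both runs are "correct" implementations; only a test that checks the fitted values tightly (like test_perfect_fit) exposes the difference.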
How Copeland scoring works
Most "pick the best" systems use weighted scoring — assign points for tests, convergence, diff size, sum them up. We tried this. The weights felt arbitrary, and across 57 runs, weighted scoring disagreed with pairwise methods about a third of the time (Cochran's Q: p<0.0001).
Instead, thinktank uses Copeland's method from social choice theory. Every pair of agents is compared head-to-head on four criteria:
- Tests passed — did the test suite pass?
- Convergence — how many other agents took a similar approach?
- Scope — fewer production files changed = less risk
- Test coverage — did the agent add or update tests?
The agent winning more criteria gets +1, the loser gets −1. No arbitrary point weights. Two independent ranking methods — Copeland and Borda count — agree on the recommendation 84-96% of the time (84% across 74 evaluated runs, 96% on the 53-run subset with stored per-agent scores), while weighted scoring disagrees about a third of the time.
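The pairwise mechanics can be sketched directly from the four criteria above. This is a plausible reading of the method, not thinktank's verbatim code — in particular, how ties within a matchup are handled is an assumption:

```typescript
// Copeland's method over the four criteria described above (a hedged sketch).
type Agent = {
  id: number;
  testsPassed: boolean;  // did the suite pass?
  convergence: number;   // how many peers took a similar approach
  filesChanged: number;  // fewer production files changed = less risk
  addedTests: boolean;   // did the agent add or update tests?
};

// Head-to-head on all four criteria: +1 if a wins more criteria, -1 if b does.
function beats(a: Agent, b: Agent): number {
  let aWins = 0, bWins = 0;
  const bump = (x: number) => (x > 0 ? aWins++ : x < 0 ? bWins++ : 0);
  bump(Number(a.testsPassed) - Number(b.testsPassed));
  bump(a.convergence - b.convergence);
  bump(b.filesChanged - a.filesChanged); // fewer files wins
  bump(Number(a.addedTests) - Number(b.addedTests));
  return aWins > bWins ? 1 : aWins < bWins ? -1 : 0;
}

function copeland(agents: Agent[]): Map<number, number> {
  const score = new Map(agents.map((a) => [a.id, 0]));
  for (let i = 0; i < agents.length; i++)
    for (let j = i + 1; j < agents.length; j++) {
      const r = beats(agents[i], agents[j]);
      score.set(agents[i].id, score.get(agents[i].id)! + r);
      score.set(agents[j].id, score.get(agents[j].id)! - r);
    }
  return score;
}

const demo: Agent[] = [
  { id: 1, testsPassed: false, convergence: 2, filesChanged: 1, addedTests: true },
  { id: 2, testsPassed: true,  convergence: 3, filesChanged: 1, addedTests: true },
  { id: 3, testsPassed: true,  convergence: 1, filesChanged: 4, addedTests: false },
];
const scores = copeland(demo);
const winner = [...scores.entries()].sort((a, b) => b[1] - a[1])[0][0];
console.log(winner); // 2 — wins both of its head-to-head matchups
```

Note there are no weights anywhere: each criterion contributes only a win, a loss, or a tie within a matchup.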
What we learned building this
thinktank was built using thinktank — 80 PRs, ~250 tests, 103 ensemble runs across TypeScript, Python, pathfinding, and ML tasks.
About half of individual agent attempts fail their tests. This sounds bad, but it's the point — in a 5-agent ensemble, you only need one to succeed. If your task is simple enough for pass@1, a single agent is fine.
Correlated failures are the limit. All agents use the same Claude model. When it has a systematic blind spot, all 5 may fail the same way. Multi-model ensembles (Claude + GPT + Gemini) would help — thinktank has a runner abstraction for this but only Claude Code is implemented today.
The sweet spot is medium-complexity tasks. "Fix this bug and add a test," "add rate limiting to this API," "refactor this module" — tasks where the model can succeed but the approach isn't obvious.
Convergence is a confidence signal, not a correctness signal. Tests are the oracle. When all agents converge on the same wrong answer, it's the tests that catch it — not the consensus. (More on this below.)
Don't use it for simple tasks (wasting tokens), without a test suite (the primary oracle), or for tasks that need iterative refinement (each agent starts fresh).
The false oracle problem: ensemble test generation
This was our most expensive lesson. Early in development, a single agent wrote a test asserting a maze's shortest path was 13 steps. The correct answer was 9. That bad test became the oracle for 13+ ensemble runs — every A* implementation looked "broken" when they were all correct.
The fix: use the ensemble to write the tests first.
# Phase 1: generate tests — no test command, just convergence analysis
$ thinktank run "write unit tests for A* pathfinding on a 5x5 grid" -n 3
Convergence
────────────────────────────────────────────────────────────
Agents [1, 3]: ████████████████░░░░ 67%
Agent #1: assert shortestPath(grid) === 9
Agent #2: assert shortestPath(grid) === 13 ← disagrees
Agent #3: assert shortestPath(grid) === 9
Two out of three agents independently computed 9. The disagreement flags the bad oracle before it poisons anything. Apply the majority's tests, then use them as ground truth:
# Phase 2: implement against validated tests
$ thinktank run "implement A* pathfinding" -n 5 -t "npm test"
This two-phase workflow — ensemble tests, then ensemble implementation — is how we avoided false oracles for the rest of development.
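The phase-1 disagreement check boils down to a majority vote over each asserted value. The helper below is a simplified sketch (an assumed function for illustration, not a thinktank API):

```typescript
// Majority vote over agents' asserted values; flags dissenting agents.
function majority<T>(votes: T[]): { value: T; dissenters: number[] } {
  const counts = new Map<T, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  const [value] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  // 1-based indices (like agent numbers) of agents disagreeing with the majority.
  const dissenters = votes
    .map((v, i) => (v === value ? -1 : i + 1))
    .filter((i) => i !== -1);
  return { value, dissenters };
}

// Agents 1-3 asserted shortest-path lengths 9, 13, 9 for the same maze.
console.log(majority([9, 13, 9])); // { value: 9, dissenters: [ 2 ] }
```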
The numbers from 103 runs
We dogfooded thinktank on itself. Here's what the data looks like:
$ thinktank stats
thinktank stats
─────────────────────────────
Total runs: 103
Avg agents/run: 4.3
Avg convergence: 64.8%
Avg test pass rate: 47.5%
$ thinktank stats --passed-only
thinktank stats
─────────────────────────────
Filters: passed-only
Total runs: 57
Avg agents/run: 4.6
Avg convergence: 64.4%
Avg test pass rate: 79.2%
47.5% test pass rate — roughly half of individual agent attempts fail. That's the whole point: in a 5-agent ensemble, you don't need every agent to succeed, you need one. Filter to runs where at least one agent passed and you're at 79.2%.
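Back-of-the-envelope: if a single attempt passes with probability p, and attempts were independent (optimistic — as noted above, failures correlate within one model), the chance that at least one of n agents passes is 1 − (1 − p)^n:

```typescript
// Probability that at least one of n independent attempts succeeds.
// (Independence is an optimistic assumption given single-model correlation.)
const passAtLeastOne = (p: number, n: number) => 1 - (1 - p) ** n;

// With the observed 47.5% single-attempt pass rate and 5 agents:
console.log(passAtLeastOne(0.475, 5).toFixed(2)); // ≈ 0.96
```

The observed 79.2% is well below that independent-attempts ceiling, which is itself evidence that failures are correlated.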
If your pass rate is dropping, your prompts are too vague or your tests are too strict. If convergence is trending down, the tasks are ambiguous — rewrite the prompt before spending more tokens.
$ thinktank evaluate
Scoring Method Evaluation
──────────────────────────────────────────────────────────
Usable runs: 74 (of 103 total)
Run Agents Weighted Copeland Borda Agree?
──────────────────────────────────────────────────────────
#4 5 #1 #1 #1 yes
#10 5 #1 #2 #2 NO
#14 5 #1 #3 #3 NO
...
Agreement Rates
──────────────────────────────
All three agree: 45/74 (61%)
Weighted = Copeland: 53/74 (72%)
Weighted = Borda: 47/74 (64%)
Copeland = Borda: 62/74 (84%)
When Copeland and Borda disagree on a run (16% of the time), that's a signal to manually review with thinktank compare instead of blindly applying. When all three methods agree, apply with confidence. This is how we discovered that weighted scoring was the outlier — and why Copeland became the default.
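For contrast with Copeland's win/loss matchups, here is a sketch of Borda count — a plausible reading of "rank agents per criterion and sum positions", not thinktank's verbatim implementation. Each agent earns, per criterion, one point for every agent it outranks:

```typescript
// Borda count over per-criterion scores (a hedged sketch for comparison).
// scoresPerCriterion[c][a] = agent a's score on criterion c (higher is better).
function borda(scoresPerCriterion: number[][]): number[] {
  const n = scoresPerCriterion[0].length;
  const totals = new Array(n).fill(0);
  for (const scores of scoresPerCriterion)
    for (let a = 0; a < n; a++)
      for (let b = 0; b < n; b++)
        if (scores[a] > scores[b]) totals[a]++; // one point per agent outranked
  return totals;
}

// Three agents, two criteria (e.g. convergence and test count):
console.log(borda([[3, 1, 2], [2, 1, 0]])); // [ 4, 1, 1 ]
```

Because Borda accumulates rank positions while Copeland only counts matchup wins, the two can disagree when an agent wins narrowly against many peers but loses badly to one — which is exactly when a manual review is worth the time.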
We're not the only ones thinking about this
The ensemble idea is in the air. Mozilla AI's Star Chamber fans out code reviews to Claude, GPT, and Gemini in parallel, using consensus tiers and optional debate rounds where anonymized feedback circulates back to all models. Karpathy's llm-council runs a deliberation pipeline — parallel query, peer review, chairman synthesis — for general-purpose tasks. Roundtable orchestrates multiple AI CLI tools through a unified MCP interface. Composio's Agent Orchestrator manages parallel coding agents with git worktrees for task decomposition — different agents working on different sub-tasks. And Aider's Architect Mode pairs two models in complementary roles (planner + editor).
Thinktank is doing something specific: same-task ensemble with true isolation. Every agent gets an independent git clone, solves the identical problem with zero knowledge of the others, and results are ranked by Copeland pairwise scoring — not majority vote, not debate, not model synthesis. The isolation is the point: it's what makes the ensemble math work, and it's why convergence is a meaningful signal rather than an artifact of shared context.
Try it
npm install -g thinktank-ai
thinktank init
thinktank run "your task here" -n 5 -t "npm test"
Requires Claude Code CLI. Works with Anthropic API keys or Amazon Bedrock (pass any model ID starting with anthropic., e.g. --model anthropic.claude-opus-4-6-v1 — AWS credentials from your environment are inherited automatically). MIT licensed. Contributions welcome — especially runners for other AI coding tools.
What's next
thinktank currently runs Claude Code only. The runner interface is designed to be pluggable — adding support for other AI coding tools (OpenCode, Aider, Codex CLI, Gemini CLI) is the highest-priority roadmap item. Multi-tool ensembles would address the single-model diversity limitation and unlock the gains the research says are there.
Beyond that: more algorithmic showcases, a web dashboard for visual diff comparison, and more.
thinktank on GitHub · npm install -g thinktank-ai · Technical report

