Greg B.
We Ran 52 AI Coding Benchmarks. Here's Every Uncomfortable Thing We Found.

TL;DR: The biggest variable in AI-assisted development isn't the model, the tool, or parallelism. It's what you write before the AI starts. A structured brief (CONTRACT.md) reduces cost 54% and raises quality from 5/10 to 9/10. Agent Teams cost 73–124% more with no quality gain. Retry loops degrade quality from 9/10 to 6/10. We validated all of this across 52+ controlled runs and open-sourced the tool.

github.com/UpGPT-ai/upcommander

npm install -g @upgpt/upcommander-cli


Why We Did This

We had just run 25 parallel AI workers across 7 swarms simultaneously and produced 12,500 lines of code across 96 files in 36 minutes. We had no idea what it cost. We hadn't measured quality. We'd just shipped fast.

So we ran a benchmark. Then another. Then 50 more.

What started as "let's figure out if parallel workers are worth it" turned into a set of findings that overturned almost every assumption we started with.


What We Tested

Task types:

  • T3 — Notes CRUD: SQL migration + TypeScript types + 2 API routes + Vitest tests. 3 workers. Small-to-medium.
  • T6 — Notifications system: large greenfield. 8 workers. Complex.
  • T7 — SMS refactor: modifying existing code. Pure edit.

Approaches:

  • V1 — minimal, vague prompts. Workers guess at interfaces and import paths.
  • V2 — CONTRACT.md added: workers get exact interfaces, column names, import paths, SQL conventions upfront.
  • NS — V2 with self-evolution: worker checks its own output and retries if it falls short.
  • NSX — V2 with cross-model verification: Opus reads the worker's output and writes line-level critique before retry.
  • V2O — V2 with a one-shot Opus review pass at the end (no retry loop — just a targeted surgical edit).

Architecture comparisons: Sequential · UpCommander (tmux workers) · Agent Teams (Anthropic native sub-agents)

Independent variables: CONTRACT.md on/off, architecture type, model (Haiku/Sonnet/Opus), grader (Opus/GPT-4o/Gemini).


Finding 1: CONTRACT.md is the entire game

A structured brief before the task — exact TypeScript interfaces, exact column names, exact import paths, SQL conventions, explicit non-goals — made the single largest difference of anything we tested.

2×2 factorial experiment (20 controlled runs):

CONTRACT.md Effect — 2×2 Factorial, N=20

The CONTRACT.md effect: -65% cost, -68% time, quality from 5 to 9/10. Architecture was secondary. Same model, same codebase, just a different document.

What goes in the brief that matters:

## CONTRACT.md

### Interfaces
interface Note {
  id: string;
  user_id: string;
  content: string;
  created_at: string;
}

### Database
Table: platform.notes
Columns: id (uuid), user_id (uuid FK auth.users), content (text), created_at (timestamptz)
SQL conventions: CREATE TABLE IF NOT EXISTS, no DROP POLICY

### Import paths
Types: @/lib/platform/notes/types
Supabase client: @/lib/supabase-server (server components)

### Non-goals
- No pagination in this PR
- No soft delete
- No full-text search

Workers stop exploring and start executing.
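The brief is also cheap to generate programmatically. A minimal sketch of rendering such a brief from a structured spec — the `ContractSpec` field names here are our own invention, not UpCommander's actual schema:

```typescript
// Hypothetical sketch: render a CONTRACT.md from a structured spec.
// Field names (interfaces, table, importPaths, nonGoals) are illustrative,
// not UpCommander's real contract format.
interface ContractSpec {
  interfaces: string[];              // verbatim TypeScript interface blocks
  table: string;                     // fully-qualified table name
  columns: string[];                 // "name (type)" pairs
  sqlConventions: string[];
  importPaths: Record<string, string>;
  nonGoals: string[];
}

function renderContract(spec: ContractSpec): string {
  return [
    "## CONTRACT.md",
    "",
    "### Interfaces",
    ...spec.interfaces,
    "",
    "### Database",
    `Table: ${spec.table}`,
    `Columns: ${spec.columns.join(", ")}`,
    `SQL conventions: ${spec.sqlConventions.join(", ")}`,
    "",
    "### Import paths",
    ...Object.entries(spec.importPaths).map(([k, v]) => `${k}: ${v}`),
    "",
    "### Non-goals",
    ...spec.nonGoals.map((g) => `- ${g}`),
  ].join("\n");
}
```

The point of making it a data structure rather than freeform prose: the generator can't forget a section, and "non-goals" is always explicit.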


Finding 2: Agent Teams cost 73–124% more with zero quality gain

Anthropic markets Agent Teams as a way to parallelize work. Technically true. The data:

Agent Teams vs Sequential — T3 Task

T6 (large task) results:

Agent Teams vs Sequential — T6 Large Task

Every agent loads the full codebase context independently. Three agents = three copies of your 80K-token context. The cache burn dominates. Agent Teams never wins on cost. Sequential + CONTRACT wins cost every time.
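The cache burn is simple arithmetic. A back-of-envelope model — the $/MTok figure below is illustrative, not Anthropic's actual rate card:

```typescript
// Illustrative context-cost model: each parallel agent loads the full
// codebase context independently; a sequential worker loads it once.
function contextCost(contextTokens: number, agents: number, pricePerMTok: number): number {
  return (contextTokens * agents * pricePerMTok) / 1_000_000;
}

const CONTEXT = 80_000; // the 80K-token context from the post
const PRICE = 3;        // $/MTok input, illustrative figure

const sequential = contextCost(CONTEXT, 1, PRICE); // one context load
const agentTeam = contextCost(CONTEXT, 3, PRICE);  // three independent copies
```

Whatever the real prices, the context spend scales linearly with agent count before any useful work happens — that multiplier is what Agent Teams can never win back.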


Finding 3: Retry loops make the output worse

The Tsinghua NLH paper (March 2026) found that self-evolution (acceptance-gated retry loops) was the best-performing harness module in their benchmarks (+4.8 on SWE-bench). We built it.

N=5 on T3 with deliberate traps (wrong import paths, missing exports):

Retry Loops Degrade Quality — N=5

Self-evolution improved acceptance criteria by 1 item but degraded overall quality from 9/10 to 6/10 and cost 2.1× more.

Why? The model doesn't make surgical edits. It regenerates entire files. Fixing a broken import path means rewriting the whole route file — and losing all the CRUD endpoints and tests that were correct the first time. We observed this on every retry attempt across all 3 runs, without exception.

There's also a ceiling: the model cannot see the blindspot it keeps creating. Every run, every retry, stalled at exactly 4/5 ACs. The 5th requirement — the one the model kept failing — never resolved regardless of how specific the hint was.

NS-run-1: Fix import path → regenerates route.ts → loses 3 endpoints → 4/5 ACs, 6/10
NS-run-2: Fix import path → regenerates route.ts → loses 3 endpoints → 4/5 ACs, 6/10
...same pattern, 15 retry attempts across 3 runs
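The failure mode can be simulated in a few lines. This is a toy stand-in for real model calls, assuming (as we observed) a regenerator that rewrites whole files instead of patching:

```typescript
// Toy simulation of an acceptance-gated retry loop whose "model"
// regenerates the whole file on every retry. Not real model calls.
type RouteFile = { endpoints: string[]; importOk: boolean };

// The regenerator fixes the import trap, but rebuilds from scratch
// and only reliably reproduces one endpoint.
function regenerate(_prev: RouteFile): RouteFile {
  return { endpoints: ["GET /notes"], importOk: true };
}

// Crude acceptance-criteria count: valid import + one point per endpoint.
function score(f: RouteFile): number {
  return (f.importOk ? 1 : 0) + f.endpoints.length;
}

let file: RouteFile = {
  endpoints: ["GET /notes", "POST /notes", "PATCH /notes/:id", "DELETE /notes/:id"],
  importOk: false, // the deliberate trap
};

const before = score(file); // all endpoints present, import broken
for (let retry = 0; retry < 3 && !file.importOk; retry++) {
  file = regenerate(file); // import fixed, three endpoints gone
}
const after = score(file);
```

The loop "succeeds" at its gate (the import is fixed) while the overall score drops — exactly the 9/10 → 6/10 pattern above.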

Don't use retry loops for code generation. The architecture is the problem.

This contradicts the Tsinghua finding. We're not claiming they're wrong — different task types, different models, different definition of self-evolution. We observed a specific failure mode on our setup. Replication caveat applies.


Finding 4: Opus one-shot review adds nothing when the contract is good

We tested V2O (V2 + Opus reads the full output and makes surgical edits — not a retry loop, just a targeted one-shot patch):

Clean N=5 retest (full file context, no truncation):

Opus One-Shot Review — Clean N=5 Retest

Zero quality gain. +56% cost. When the CONTRACT.md is well-formed, Sonnet already reaches 9.8/10 — there's nothing for Opus to fix.

The lesson: Write the contract right. Don't retry. Don't add a review pass. The brief is the quality lever.


Finding 5: AST compression cuts tokens 91%

CONTRACT generation for refactoring tasks was expensive — the generator had to read the entire codebase ($0.36 vs $0.15–0.17 for greenfield). We adapted the AST-summary approach from agora-code: tree-sitter parsing, export-only extraction, cached by git blob SHA.

Results on 28 production files:

AST Index Compression — Production Codebase

118x compression. For a large T6 session: baseline $5.45 → $0.85 stacked with CONTRACT.md. Zero quality tradeoff.
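A rough sketch of the export-only extraction step — a regex stand-in for the real tree-sitter parse, with our own function names, just to show the compression idea:

```typescript
// Regex stand-in for tree-sitter: keep only exported declaration
// signatures, drop everything else (bodies, internals, comments).
function extractExports(source: string): string {
  return source
    .split("\n")
    .filter((line) =>
      /^\s*export\s+(const|function|class|interface|type|enum)\b/.test(line)
    )
    .join("\n");
}

// Compression ratio of source size to index size.
function compressionRatio(source: string, index: string): number {
  return source.length / Math.max(index.length, 1);
}
```

The real module parses properly (12 languages via tree-sitter) and caches the result by git blob SHA, so unchanged files cost nothing to re-index — the sketch only shows why the token reduction is so large: implementations dominate file size, but workers mostly need signatures.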


Finding 6: Haiku + CONTRACT ≈ Sonnet + CONTRACT (at 64% less cost)

Does the Stanford Meta-Harness finding hold? (Haiku with an optimized harness beat hand-built Opus harnesses on Terminal-Bench 2.) We tested all three models with identical CONTRACT.md prompts:

Model Comparison — Same CONTRACT.md (N=5 each)

Haiku scores 9.0/10 at 36% of Sonnet's cost. The scaffolding does most of the work. Opus adds 0.2 points at a 69% premium — not justified.

Implication: For boilerplate workers in multi-worker sessions, route Haiku. Route Sonnet only to workers making non-trivial design decisions.
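The routing rule is a one-liner. A sketch with our own task-shape type (not UpCommander's actual recipe format; model identifiers abbreviated):

```typescript
// Hypothetical routing rule from Finding 6: boilerplate workers get
// Haiku; workers making non-trivial design decisions get Sonnet.
type WorkerTask = { designDecisions: boolean };

function routeModel(task: WorkerTask): "claude-haiku" | "claude-sonnet" {
  return task.designDecisions ? "claude-sonnet" : "claude-haiku";
}
```

In a multi-worker session this compounds: if most workers are boilerplate (migrations, types, test scaffolds), most of the session runs at Haiku prices.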


Finding 7: Cross-vendor grading agrees within ±1 point

All quality scoring in this project uses Opus as the grader. We validated this by grading the same 5 V2 outputs with three model families simultaneously:

Cross-Vendor Grading — Same 5 Outputs

Cross-vendor spread: ±1.0 pts. Opus grading is directionally reliable. Gemini is systematically stricter and catches issues the others miss (unused NoteListOptions in the test file) — worth adding to production quality pipelines.
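Agreement here is just the max–min spread per graded output. A sketch (the scores below are illustrative, not the benchmark's raw data):

```typescript
// Per-output spread across graders: max score minus min score.
function spread(scores: number[]): number {
  return Math.max(...scores) - Math.min(...scores);
}

// Graders "agree within ±N points" when no output's spread exceeds N.
function allAgreeWithin(perOutput: number[][], maxSpread: number): boolean {
  return perOutput.every((scores) => spread(scores) <= maxSpread);
}
```

A systematically stricter grader (like Gemini here) shifts the spread without necessarily breaking it — which is why a consistent 1-point offset is still "directionally reliable".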


The Stacked Numbers

All improvements applied to a large T6 session:

Stacked Savings — T6 Session

$5.45 → $0.83. -85%. Same model throughout.
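The headline figure is straightforward arithmetic on the numbers above:

```typescript
// Stacked savings on the T6 session, using the figures from the post.
const baseline = 5.45; // $ per session, no optimizations
const stacked = 0.83;  // $ per session, all optimizations applied

const savingsPct = Math.round(((baseline - stacked) / baseline) * 100);
```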


The Six Rules

  1. Write the CONTRACT first. Always, for any task touching 3+ files. Costs ~$0.15 to generate. Saves 47–54% on the task. Paid back on the first run, every time.

  2. Don't use Agent Teams for cost-sensitive work. 73–124% more expensive. No quality benefit. Consistent across all N=5 runs.

  3. Don't use retry loops. They degrade quality (9→6/10) and cost 2×. The model regenerates whole files when it retries — correct sections disappear. Skip self-evolution entirely.

  4. Don't add an Opus review pass when your contract is good. Sonnet + CONTRACT already hits 9.8/10. Clean N=5 confirmation. Write a better brief instead of paying for a review.

  5. Compress your file context with AST extraction. 91% token reduction, zero quality tradeoff.

  6. Use Haiku for boilerplate workers. 9.0/10 quality at 36% of Sonnet's cost. The scaffolding does the work.


The CLI

# Install
npm install -g @upgpt/upcommander-cli

# Set your key
export ANTHROPIC_API_KEY=sk-ant-...

# Generate contract + run worker
upcommander run "add pagination to the notes API"

# Or: generate contract first, review it, then run
upcommander contract "add pagination to the notes API"
upcommander run "add pagination to the notes API"

# Quality review on specific files (Opus one-shot)
upcommander review src/app/api/notes/route.ts

# Regenerate the codebase index
upcommander index

The repo includes:

  • Contract generator (Sonnet, ~$0.15 per contract)
  • L0/L1/L2 codebase index (118x compression)
  • AST-summary module (tree-sitter, 12 languages)
  • Ephemeral Haiku orchestrator (-96% orchestration cost)
  • Worker recipes (K2_Solo, Pipeline-3Tier)
  • All 52+ benchmark evaluation files in /evaluations

What's Still Open

  1. Human quality review — all scoring is model-on-model. Same-family bias acknowledged. Independent human review pending.
  2. Non-greenfield at scale — all real data is greenfield. Large refactoring at production scale needs its own benchmark series.
  3. OpenRouter multi-model routing — infrastructure exists, benchmarks pending.

Full methodology, raw data, and source

GitHub: github.com/UpGPT-ai/upcommander

npm: npm install -g @upgpt/upcommander-cli

Benchmark data: /evaluations in the repo — all raw JSON, every run

Product page: upgpt.ai/tools/upcommander


Built at UpGPT. Questions, replications, or results that differ from ours: reach out at hello@upgpt.ai
