<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Greg B.</title>
    <description>The latest articles on DEV Community by Greg B. (@greg_c6450ed3e296e7).</description>
    <link>https://dev.to/greg_c6450ed3e296e7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856620%2Faf60e9d7-dbcd-44cc-8de7-28abb14bdcb2.jpg</url>
      <title>DEV Community: Greg B.</title>
      <link>https://dev.to/greg_c6450ed3e296e7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/greg_c6450ed3e296e7"/>
    <language>en</language>
    <item>
      <title>We Ran 52 AI Coding Benchmarks. Here's Every Uncomfortable Thing We Found.</title>
      <dc:creator>Greg B.</dc:creator>
      <pubDate>Tue, 21 Apr 2026 01:08:32 +0000</pubDate>
      <link>https://dev.to/greg_c6450ed3e296e7/we-ran-52-ai-coding-benchmarks-heres-every-uncomfortable-thing-we-found-d72</link>
      <guid>https://dev.to/greg_c6450ed3e296e7/we-ran-52-ai-coding-benchmarks-heres-every-uncomfortable-thing-we-found-d72</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The biggest variable in AI-assisted development isn't the model, the tool, or parallelism. It's what you write before the AI starts. A structured brief (CONTRACT.md) reduces cost 54% and raises quality from 5/10 to 9/10. Agent Teams cost 73–124% more with no quality gain. Retry loops degrade quality from 9/10 to 6/10. We validated all of this across 52+ controlled runs and open-sourced the tool.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/UpGPT-ai/upcommander" rel="noopener noreferrer"&gt;github.com/UpGPT-ai/upcommander&lt;/a&gt;&lt;br&gt;&lt;br&gt;
→ &lt;code&gt;npm install -g @upgpt/upcommander-cli&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why We Did This
&lt;/h2&gt;

&lt;p&gt;We had just run 25 parallel AI workers across 7 swarms simultaneously and produced 12,500 lines of code across 96 files in 36 minutes. We had no idea what it cost. We hadn't measured quality. We'd just shipped fast.&lt;/p&gt;

&lt;p&gt;So we ran a benchmark. Then another. Then 50 more.&lt;/p&gt;

&lt;p&gt;What started as "let's figure out if parallel workers are worth it" turned into a set of findings that overturned almost every assumption we started with.&lt;/p&gt;


&lt;h2&gt;
  
  
  What We Tested
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T3&lt;/strong&gt; — Notes CRUD: SQL migration + TypeScript types + 2 API routes + Vitest tests. 3 workers. Small-to-medium.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T6&lt;/strong&gt; — Notifications system: large greenfield. 8 workers. Complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T7&lt;/strong&gt; — SMS refactor: modifying existing code. Pure edit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;V1&lt;/strong&gt; — minimal, vague prompts. Workers guess at interfaces and import paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V2&lt;/strong&gt; — CONTRACT.md added: workers get exact interfaces, column names, import paths, SQL conventions upfront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NS&lt;/strong&gt; — V2 with self-evolution: worker checks its own output and retries if it falls short.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NSX&lt;/strong&gt; — V2 with cross-model verification: Opus reads the worker's output and writes line-level critique before retry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V2O&lt;/strong&gt; — V2 with a one-shot Opus review pass at the end (no retry loop — just a targeted surgical edit).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture comparisons:&lt;/strong&gt; Sequential · UpCommander (tmux workers) · Agent Teams (Anthropic native sub-agents)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent variables:&lt;/strong&gt; CONTRACT.md on/off, architecture type, model (Haiku/Sonnet/Opus), grader (Opus/GPT-4o/Gemini).&lt;/p&gt;


&lt;h2&gt;
  
  
  Finding 1: CONTRACT.md is the entire game
&lt;/h2&gt;

&lt;p&gt;A structured brief before the task — exact TypeScript interfaces, exact column names, exact import paths, SQL conventions, explicit non-goals — made the single largest difference of anything we tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2×2 factorial experiment (20 controlled runs):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00ka4f8engyuub9x43l7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00ka4f8engyuub9x43l7.png" alt="CONTRACT.md Effect — 2×2 Factorial, N=20" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CONTRACT.md effect: &lt;strong&gt;-65% cost, -68% time, quality from 5 to 9/10&lt;/strong&gt;. Architecture was secondary. Same model, same codebase, just a different document.&lt;/p&gt;

&lt;p&gt;What belongs in the brief:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## CONTRACT.md

### Interfaces
interface Note {
  id: string;
  user_id: string;
  content: string;
  created_at: string;
}

### Database
Table: platform.notes
Columns: id (uuid), user_id (uuid FK auth.users), content (text), created_at (timestamptz)
SQL conventions: CREATE TABLE IF NOT EXISTS, no DROP POLICY

### Import paths
Types: @/lib/platform/notes/types
Supabase client: @/lib/supabase-server (server components)

### Non-goals
- No pagination in this PR
- No soft delete
- No full-text search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Workers stop exploring and start executing.&lt;/p&gt;
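&lt;p&gt;A minimal sketch of how a contract-first prompt might be assembled. The function and type names here are illustrative, not the actual UpCommander API:&lt;/p&gt;

```typescript
// Illustrative sketch: prepend the contract to every worker prompt so the
// worker executes against fixed interfaces instead of exploring the codebase.
// Names here (buildWorkerPrompt, WorkerBrief) are hypothetical.
interface WorkerBrief {
  contract: string; // full CONTRACT.md text
  task: string;     // the worker's slice of the work
}

function buildWorkerPrompt(brief: WorkerBrief): string {
  // The contract goes first, so interfaces, column names, and import
  // paths are already fixed by the time the task description is read.
  return [
    "Follow this contract exactly. Do not invent interfaces or paths.",
    brief.contract,
    "## Your task",
    brief.task,
  ].join("\n\n");
}

const prompt = buildWorkerPrompt({
  contract: "## CONTRACT.md\n### Interfaces\ninterface Note { id: string; }",
  task: "Implement GET /api/notes returning a list of Note objects.",
});
```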




&lt;h2&gt;
  
  
  Finding 2: Agent Teams cost 73–124% more with zero quality gain
&lt;/h2&gt;

&lt;p&gt;Anthropic markets Agent Teams as a way to parallelize work. Technically true. The data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa33ac79rr16t89vpx9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa33ac79rr16t89vpx9u.png" alt="Agent Teams vs Sequential — T3 Task" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T6 (large task) results:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxuy6tlzl2pwmqmz05ru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxuy6tlzl2pwmqmz05ru.png" alt="Agent Teams vs Sequential — T6 Large Task" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every agent loads the full codebase context independently. Three agents = three copies of your 80K-token context. The cache burn dominates. &lt;strong&gt;Agent Teams never wins on cost; Sequential + CONTRACT does, every time.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 3: Retry loops make the output worse
&lt;/h2&gt;

&lt;p&gt;We tested whether self-evolution (acceptance-gated retry loops) could improve the harness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N=5 on T3 with deliberate traps&lt;/strong&gt; (wrong import paths, missing exports):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehyrwd52hd593jep7opx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehyrwd52hd593jep7opx.png" alt="Retry Loops Degrade Quality — N=5" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-evolution improved acceptance criteria by 1 item but &lt;strong&gt;degraded overall quality from 9/10 to 6/10&lt;/strong&gt; and cost 2.1× more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; The model doesn't make surgical edits. It regenerates entire files. Fixing a broken import path means rewriting the whole route file — and losing all the CRUD endpoints and tests that were correct the first time. We observed this on every retry attempt across all 3 runs, without exception.&lt;/p&gt;

&lt;p&gt;There's also a ceiling: the model cannot see the blindspot it keeps creating. Every run, every retry, stalled at exactly 4/5 ACs. The 5th requirement — the one the model kept failing — never resolved regardless of how specific the hint was.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NS-run-1: Fix import path → regenerates route.ts → loses 3 endpoints → 4/5 ACs, 6/10
NS-run-2: Fix import path → regenerates route.ts → loses 3 endpoints → 4/5 ACs, 6/10
...same pattern, 15 retry attempts across 3 runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Don't use retry loops for code generation.&lt;/strong&gt; The architecture is the problem.&lt;/p&gt;

&lt;p&gt;We're not claiming this can never work; with different task types, different models, or a different definition of self-evolution, maybe it can. We observed a specific failure mode on our setup, and the usual replication caveat applies.&lt;/p&gt;
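&lt;p&gt;The failure mode can be sketched in a few lines. This is a toy simulation, not the harness itself: a surgical fix patches only the broken line, while regeneration rebuilds the file from the critique and drops the endpoints it was never told about.&lt;/p&gt;

```typescript
// Toy simulation of the observed failure mode (not the real harness).
// The file under repair: a deliberately broken import plus three
// endpoints that are already correct.
const routeFile = [
  'import { db } from "@/lib/wrong-path"', // the deliberate trap
  "export function GET() {}",
  "export function POST() {}",
  "export function DELETE() {}",
];

// A surgical fix patches only the broken line; everything else survives.
function surgicalFix(lines: string[]): string[] {
  return lines.map((line) =>
    line.includes("wrong-path") ? 'import { db } from "@/lib/db"' : line
  );
}

// Regeneration rebuilds the file around the critique ("fix the import")
// and re-derives the rest, omitting endpoints the critique never mentioned.
function regenerateFromCritique(): string[] {
  return ['import { db } from "@/lib/db"', "export function GET() {}"];
}

const patched = surgicalFix(routeFile);       // import fixed, 3 endpoints kept
const regenerated = regenerateFromCritique(); // import fixed, 2 endpoints lost
```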




&lt;h2&gt;
  
  
  Finding 4: Opus one-shot review adds nothing when the contract is good
&lt;/h2&gt;

&lt;p&gt;We tested V2O (V2 + Opus reads the full output and makes surgical edits — not a retry loop, just a targeted one-shot patch):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean N=5 retest (full file context, no truncation):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6vfm8gx58sut3ckl1ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6vfm8gx58sut3ckl1ju.png" alt="Opus One-Shot Review — Clean N=5 Retest" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zero quality gain. +56% cost. When the CONTRACT.md is well-formed, Sonnet already reaches 9.8/10 — there's nothing for Opus to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Write the contract right. Don't retry. Don't add a review pass. The brief is the quality lever.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 5: AST compression cuts tokens 91%
&lt;/h2&gt;

&lt;p&gt;CONTRACT generation for refactoring tasks was expensive — the generator had to read the entire codebase ($0.36 vs $0.15–0.17 for greenfield). We adapted the AST-summary approach from &lt;a href="https://github.com/thebnbrkr/agora-code" rel="noopener noreferrer"&gt;agora-code&lt;/a&gt;: tree-sitter parsing, export-only extraction, cached by git blob SHA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results on 28 production files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgn3ygn0vo3pb3lxxrcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgn3ygn0vo3pb3lxxrcb.png" alt="AST Index Compression — Production Codebase" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;118x compression.&lt;/strong&gt; For a large T6 session: baseline $5.45 → &lt;strong&gt;$0.85&lt;/strong&gt; stacked with CONTRACT.md. Zero quality tradeoff.&lt;/p&gt;
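&lt;p&gt;The export-only extraction idea can be sketched as follows. This is a simplified stand-in (a line filter instead of tree-sitter, and a string-length approximation of git's byte-length blob header), just to show the shape of the cache-by-content-hash approach:&lt;/p&gt;

```typescript
// Simplified sketch of export-only extraction cached by content hash.
// The real tool uses tree-sitter; a line filter stands in here. blobSha
// approximates git's blob hash (string length, not byte length).
import { createHash } from "node:crypto";

const summaryCache = new Map();

function blobSha(source: string): string {
  const header = "blob " + source.length + "\0";
  return createHash("sha1").update(header + source).digest("hex");
}

function exportSummary(source: string): string {
  const sha = blobSha(source);
  const hit = summaryCache.get(sha);
  if (hit !== undefined) return hit; // unchanged file: zero work
  // Keep only exported declarations; drop bodies and private helpers.
  const summary = source
    .split("\n")
    .filter((line) => line.trimStart().startsWith("export "))
    .map((line) => line.replace(/\{[\s\S]*$/, "{ ... }").trim())
    .join("\n");
  summaryCache.set(sha, summary);
  return summary;
}

const sample =
  "function helper() { return 1; }\n" +
  "export function listNotes(userId: string) { return []; }";
const summary = exportSummary(sample);
```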




&lt;h2&gt;
  
  
  Finding 6: Haiku + CONTRACT ≈ Sonnet + CONTRACT (at 64% less cost)
&lt;/h2&gt;

&lt;p&gt;What if Haiku with an optimized harness could beat hand-built Opus harnesses on Terminal-Bench?  We tested all three models with identical CONTRACT.md prompts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfajl2rlyf5e3ctt1479.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfajl2rlyf5e3ctt1479.png" alt="Model Comparison — Same CONTRACT.md (N=5 each)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Haiku scores 9.0/10 at 36% of Sonnet's cost. &lt;strong&gt;The scaffolding does most of the work.&lt;/strong&gt; Opus adds 0.2 points at a 69% premium — not justified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication:&lt;/strong&gt; For boilerplate workers in multi-worker sessions, route to Haiku. Reserve Sonnet for workers making non-trivial design decisions.&lt;/p&gt;
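&lt;p&gt;As a sketch, such a routing policy is a one-liner. The model names and worker shape here are hypothetical, not the tool's actual configuration:&lt;/p&gt;

```typescript
// Illustrative routing policy (model names and worker shape are
// hypothetical, not the tool's actual configuration).
type Worker = { id: string; makesDesignDecisions: boolean };

function routeModel(worker: Worker): string {
  // With a good CONTRACT.md the scaffolding does most of the work,
  // so the cheaper model is the default.
  return worker.makesDesignDecisions ? "sonnet" : "haiku";
}

const workers: Worker[] = [
  { id: "sql-migration", makesDesignDecisions: false },
  { id: "api-routes", makesDesignDecisions: false },
  { id: "schema-design", makesDesignDecisions: true },
];

// Two boilerplate workers go to Haiku, the design-heavy one to Sonnet.
const routed = workers.map((w) => routeModel(w));
```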




&lt;h2&gt;
  
  
  Finding 7: Cross-vendor grading agrees within ±1 point
&lt;/h2&gt;

&lt;p&gt;All quality scoring in this project uses Opus as the grader. We validated this by grading the same 5 V2 outputs with three model families simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef3vuf9mafy0rzhixim9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef3vuf9mafy0rzhixim9.png" alt="Cross-Vendor Grading — Same 5 Outputs" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cross-vendor spread: &lt;strong&gt;±1.0 pts&lt;/strong&gt;. Opus grading is directionally reliable. Gemini is systematically stricter and catches issues the others miss (unused &lt;code&gt;NoteListOptions&lt;/code&gt; in the test file) — worth adding to production quality pipelines.&lt;/p&gt;
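&lt;p&gt;The agreement check itself is simple: grade the same outputs with each grader and look at the per-output spread (max minus min). The scores below are illustrative, not the run data:&lt;/p&gt;

```typescript
// Per-output grading spread across graders (illustrative scores,
// not the actual run data). Columns: [Opus, GPT-4o, Gemini].
function spread(scores: number[]): number {
  return Math.max(...scores) - Math.min(...scores);
}

const grades = [
  [9, 9, 8],
  [9, 8.5, 8],
  [10, 9.5, 9],
  [9, 9, 9],
  [8, 8.5, 7.5],
];

// Agreement claim: every output's spread stays within 1 point.
const maxSpread = Math.max(...grades.map((g) => spread(g)));
```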




&lt;h2&gt;
  
  
  The Stacked Numbers
&lt;/h2&gt;

&lt;p&gt;All improvements applied to a large T6 session:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuzzftjqovnezkp65lbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuzzftjqovnezkp65lbp.png" alt="Stacked Savings — T6 Session" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$5.45 → $0.83. -85%.&lt;/strong&gt; Same model throughout.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Six Rules
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the CONTRACT first.&lt;/strong&gt; Always, for any task touching 3+ files. Costs ~$0.15 to generate. Saves 47–54% on the task. Paid back on the first run, every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't use Agent Teams&lt;/strong&gt; for cost-sensitive work. 73–124% more expensive, no quality benefit, consistent across N=5 runs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't use retry loops.&lt;/strong&gt; They degrade quality (9→6/10) and cost 2×. The model regenerates whole files when it retries — correct sections disappear. Skip self-evolution entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't add an Opus review pass&lt;/strong&gt; when your contract is good. Sonnet + CONTRACT already hits 9.8/10. Clean N=5 confirmation. Write a better brief instead of paying for a review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compress your file context&lt;/strong&gt; with AST extraction. 91% token reduction, zero quality tradeoff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Haiku for boilerplate workers.&lt;/strong&gt; 9.0/10 quality at 36% of Sonnet's cost (64% less). The scaffolding does the work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The CLI
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @upgpt/upcommander-cli

&lt;span class="c"&gt;# Set your key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...

&lt;span class="c"&gt;# Generate contract + run worker&lt;/span&gt;
upcommander run &lt;span class="s2"&gt;"add pagination to the notes API"&lt;/span&gt;

&lt;span class="c"&gt;# Or: generate contract first, review it, then run&lt;/span&gt;
upcommander contract &lt;span class="s2"&gt;"add pagination to the notes API"&lt;/span&gt;
upcommander run &lt;span class="s2"&gt;"add pagination to the notes API"&lt;/span&gt;

&lt;span class="c"&gt;# Quality review on specific files (Opus one-shot)&lt;/span&gt;
upcommander review src/app/api/notes/route.ts

&lt;span class="c"&gt;# Regenerate the codebase index&lt;/span&gt;
upcommander index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contract generator (Sonnet, ~$0.15 per contract)&lt;/li&gt;
&lt;li&gt;L0/L1/L2 codebase index (118x compression)&lt;/li&gt;
&lt;li&gt;AST-summary module (tree-sitter, 12 languages)&lt;/li&gt;
&lt;li&gt;Ephemeral Haiku orchestrator (-96% orchestration cost)&lt;/li&gt;
&lt;li&gt;Worker recipes (K2_Solo, Pipeline-3Tier)&lt;/li&gt;
&lt;li&gt;All 52+ benchmark evaluation files in &lt;code&gt;/evaluations&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Still Open
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Human quality review&lt;/strong&gt; — all scoring is model-on-model. Same-family bias acknowledged. Independent human review pending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-greenfield at scale&lt;/strong&gt; — all real data is greenfield. Large refactoring at production scale needs its own benchmark series.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter multi-model routing&lt;/strong&gt; — infrastructure exists, benchmarks pending.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Full methodology, raw data, and source
&lt;/h2&gt;

&lt;p&gt;→ &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/UpGPT-ai/upcommander" rel="noopener noreferrer"&gt;github.com/UpGPT-ai/upcommander&lt;/a&gt;&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;npm:&lt;/strong&gt; &lt;code&gt;npm install -g @upgpt/upcommander-cli&lt;/code&gt;&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;Benchmark data:&lt;/strong&gt; &lt;code&gt;/evaluations&lt;/code&gt; in the repo — all raw JSON, every run&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;Product page:&lt;/strong&gt; &lt;a href="https://upgpt.ai/tools/upcommander" rel="noopener noreferrer"&gt;upgpt.ai/tools/upcommander&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built at &lt;a href="https://upgpt.ai" rel="noopener noreferrer"&gt;UpGPT&lt;/a&gt;. For questions, replications, or if your own tests find something different, please reach out: &lt;a href="mailto:hello@upgpt.ai"&gt;hello@upgpt.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
