Model Showdown Round 5 ended with a leaderboard. Sonnet 4.6 won on the rubric. Opus 4.7 placed
second. Qwen 3.5 contributed almost nothing structural. That's the
measurement story.
This is the methodology story — what happened after the scores were
revealed.
The Problem With Picking a Winner
The naive workflow after a bakeoff is: pick the best run, merge it to
main, ship it. Winner takes all.
That's wrong, and Round 5 made it obvious why.
The winning run (Sonnet 4.6) had the best overall rubric score. It also
had a weaker path validator than Opus 4.7, and its orphan-matching logic
would have missed real-world cases that Opus 4.6 caught. The second-place
run (Opus 4.7) had the best validator and the cleanest route structure, but
the worst data source choice — reading from the build-time filesystem
instead of the live GitHub Contents API.
No individual run was what I'd ship. Each one had at least one bad call.
The bakeoff's real output wasn't a winner. It was a map.
When 4 of 4 models made the same design choice, that choice was obviously
right. When they diverged — on validation strictness, on data source, on
UX for destructive actions — that divergence was the signal. Those were the
actual design decisions, the ones worth spending judgment on.
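To make the "map" idea concrete, here is a minimal sketch of how consensus and divergence could be tabulated from per-run notes. The `RunChoices` shape, the `buildDecisionMap` helper, and the sample run and layer names are all hypothetical; Round 5 itself was judged by hand against a rubric, not with this code.

```typescript
// Hypothetical shape: for each run, the approach it took on each design layer.
type RunChoices = Record<string, string>;

interface DecisionMap {
  // Layers where every run picked the same approach.
  consensus: Record<string, string>;
  // Layers where runs diverged: approach -> names of the runs that chose it.
  contested: Record<string, Record<string, string[]>>;
}

function buildDecisionMap(runs: Record<string, RunChoices>): DecisionMap {
  const runNames = Object.keys(runs);
  const byLayer: Record<string, Record<string, string[]>> = {};

  // Group run names by (layer, approach).
  for (const run of runNames) {
    const choices = runs[run];
    for (const layer of Object.keys(choices)) {
      const approach = choices[layer];
      const approaches = byLayer[layer] ?? (byLayer[layer] = {});
      (approaches[approach] ?? (approaches[approach] = [])).push(run);
    }
  }

  const map: DecisionMap = { consensus: {}, contested: {} };
  for (const layer of Object.keys(byLayer)) {
    const approaches = byLayer[layer];
    const names = Object.keys(approaches);
    // A layer that some run did not address at all is treated as contested.
    const unanimous =
      names.length === 1 && approaches[names[0]].length === runNames.length;
    if (unanimous) {
      map.consensus[layer] = names[0];
    } else {
      map.contested[layer] = approaches;
    }
  }
  return map;
}

// Illustrative input only; the run and layer names are made up.
const exampleMap = buildDecisionMap({
  runA: { httpVerb: "DELETE", dataSource: "filesystem" },
  runB: { httpVerb: "DELETE", dataSource: "contents-api" },
  runC: { httpVerb: "DELETE", dataSource: "contents-api" },
});
// exampleMap.consensus -> { httpVerb: "DELETE" }
// exampleMap.contested -> { dataSource: { filesystem: ["runA"], "contents-api": ["runB", "runC"] } }
```

Everything that lands in `contested` is a candidate row for the merge plan; everything in `consensus` can be taken as-is.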
The Three Passes
What emerged from Round 5 is a pattern I've now run twice and would reach
for again on any feature where the design space is unclear:
Pass 1 — Bakeoff. Run N models (I used 4) on the same prompt in
isolated sessions. Judge blind, before you know which branch is which.
Score against a rubric. The output of this pass isn't any of the N
implementations — it's the decision map. You now know which choices are
contested and which are obvious.
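One way to keep the judging blind is to hand the judge shuffled, anonymous labels and keep the label-to-branch key out of sight until scoring is done. The sketch below assumes that workflow; `assignBlindLabels` and the label format are hypothetical, and the post does not say how the Round 5 branches were actually anonymized.

```typescript
interface BlindAssignment {
  // What the judge sees, e.g. "Candidate A".
  labels: string[];
  // Label -> real branch name. Keep this sealed until scoring is finished.
  key: Record<string, string>;
}

// `random` is injectable so the shuffle can be made deterministic in tests.
// Labels run A..Z, so this sketch assumes at most 26 branches.
function assignBlindLabels(
  branches: string[],
  random: () => number = Math.random
): BlindAssignment {
  const shuffled = branches.slice();
  // Fisher-Yates shuffle: swap each position with a random earlier-or-equal one.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  const key: Record<string, string> = {};
  const labels = shuffled.map((branch, index) => {
    const label = "Candidate " + String.fromCharCode(65 + index);
    key[label] = branch;
    return label;
  });
  return { labels, key };
}
```

Score against `labels` only, then open `key` once the rubric totals are written down.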
Pass 2 — Merge. Write down a merge plan before touching any code: for
each contested layer, which run's approach wins and why. Then ask an agent
to compose the merged best-of from those inputs. The merge is strictly
better than any individual bakeoff run because it draws on information none
of the bakeoff contestants had — the scored comparison of all four.
For Round 5 the plan looked like this:
| Layer | Source | Why |
|---|---|---|
| Path validator | Opus 4.7 (Run 1) | Only run with 2-segment enforcement + `..` block + non-empty checks |
| Three-tier orphan match | Opus 4.6 (Run 2) | Only run that noticed exact-match missed real cases like `day-four` |
| Type-narrowed body parsing | Sonnet 4.6 (Run 3) | `typeof body === "object" && "path" in body`, no `as` casts |
| GitHub Contents API | Opus 4.6 / Sonnet 4.6 | Live state vs. build-time filesystem snapshot |
| Confirm-modal UX | Sonnet 4.6 | Best visual polish in the screenshots |
Qwen 3.5 contributed nothing structural to this table. The bakeoff said
"skip this one" clearly enough that there was nothing to debate. That's
useful information too — knowing which pieces to skip is part of the map.
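The first and third rows of that table are easy to sketch together. What follows is an illustration of the two ideas, not the merged code: the function names, the result type, and the exact rejection rules are assumptions, and the `in` narrowing relies on TypeScript 4.9 or later. Note the extra `body !== null` check, since `typeof null` is also `"object"`.

```typescript
type PathCheck =
  | { ok: true; segments: [string, string] }
  | { ok: false; reason: string };

// Narrow an unknown request body to { path: string } without any `as` casts.
function hasStringPath(body: unknown): body is { path: string } {
  return (
    typeof body === "object" &&
    body !== null &&
    "path" in body &&
    typeof body.path === "string"
  );
}

// Assumed rules: exactly two "/"-separated segments, neither empty, neither "..".
function validateTwoSegmentPath(path: string): PathCheck {
  const segments = path.split("/");
  if (segments.length !== 2) {
    return { ok: false, reason: "expected exactly two path segments" };
  }
  const [first, second] = segments;
  if (first.trim() === "" || second.trim() === "") {
    return { ok: false, reason: "path segments must be non-empty" };
  }
  if (first === ".." || second === "..") {
    return { ok: false, reason: "'..' segments are not allowed" };
  }
  return { ok: true, segments: [first, second] };
}

// Hypothetical entry point for a delete handler: parse first, then validate.
function parseDeleteRequest(body: unknown): PathCheck {
  if (!hasStringPath(body)) {
    return { ok: false, reason: "body must be an object with a string path" };
  }
  return validateTwoSegmentPath(body.path);
}
```

With these rules, `parseDeleteRequest({ path: "notes/day-four" })` passes (the `notes` directory is made up for illustration), while `null`, `{ path: 5 }`, `"../secrets"`, `"a/b/c"`, and `"a/"` are all rejected before anything destructive runs.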
The merge was 13 files changed, +990/-9. One TypeScript error caught and
fixed. Build passed first try after that. Opened as a PR with the heritage
table in the description so future reviewers can trace any decision back to
its source run.
Pass 3 — Polish. The merged feature went live. I opened it against
real production data and spotted four things immediately: truncated
directory names with no tooltip, delete buttons invisible on touch devices,
no bulk delete UI despite the API supporting `paths: []`, and an orphaned
section header that would show with count 0 after the lone orphan was
deleted.
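The last item is the kind of bug that is trivial once seen. As a sketch of the fix, assuming a hypothetical `ListSection` shape rather than the feature's real component state, the idea is to drop any section whose item list is empty after a delete:

```typescript
// Hypothetical view model: a titled group of paths shown in the UI.
interface ListSection {
  title: string;
  items: string[];
}

// Only sections that still have something to show get a header.
function visibleSections(sections: ListSection[]): ListSection[] {
  return sections.filter((section) => section.items.length > 0);
}

// Remove one item everywhere, then hide any section it leaves empty.
function removeItem(sections: ListSection[], item: string): ListSection[] {
  return visibleSections(
    sections.map((section) => ({
      title: section.title,
      items: section.items.filter((existing) => existing !== item),
    }))
  );
}
```

Deleting the only entry in an "Orphans" section then removes the header along with it, instead of leaving a heading with a count of 0.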
None of those were predictable before live use. You can't predict friction
from a code review — you observe it. The polish pass had to come after the
merge because the artifact it was polishing didn't exist until then.
The polish was 6 files changed, +265/-54 and about 20 minutes of agent
time.
When to Use It
The pattern has a real cost: the bakeoff is N full agent sessions, each
producing a complete implementation that you won't ship. For Round 5 that
was ~$35 in inference and a few hours of judging.
That's cheap insurance when the feature has any of these properties:
- Destructive verbs. Delete, update, payment, permission change. The cost of getting validation wrong outweighs the cost of the bakeoff.
- Multiple defensible architectures. Where should validation live? What's the data source? How does auth thread through? When you genuinely don't know the right answer, a bakeoff shows you the option space.
- Hard to change later. Database schemas. Public API contracts. Anything that will accumulate callers.
It's overkill for a 20-line UI tweak or a feature with a single obvious
implementation. The signal value of the bakeoff scales with how uncertain
you are about the design.
What I'd Do Differently
Three things I'd change for the next run:
Name the contestant chats before pasting the prompt. All four Round 5
chats showed up as "New Chat" in the Coder API cost summary, which meant
20 minutes of token-volume detective work to figure out which cost belonged
to which run. Five seconds of effort would have prevented that.
Capture per-phase stats. I have clean bakeoff numbers. I don't have
separate merge or polish numbers — they're folded into the judging thread.
A lightweight wrapper script around each phase would make the next
iteration measurable end-to-end.
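A wrapper like that does not need to be elaborate. The sketch below is one possible shape, assuming each phase is started and stopped by hand; `createPhaseLog`, the `PhaseStat` fields, and the injectable clock are all hypothetical names, not an existing tool.

```typescript
interface PhaseExtras {
  filesChanged?: number;
  costUsd?: number;
}

interface PhaseStat extends PhaseExtras {
  phase: string;
  durationMs: number;
}

// `now` is injectable so the log can be tested with a fake clock.
function createPhaseLog(now: () => number = () => Date.now()) {
  const started: Record<string, number> = {};
  const stats: PhaseStat[] = [];

  return {
    // Mark the beginning of a phase such as "bakeoff", "merge", or "polish".
    start(phase: string): void {
      started[phase] = now();
    },
    // Close the phase and record its duration plus any numbers known at that point.
    stop(phase: string, extras: PhaseExtras = {}): PhaseStat {
      const begin = started[phase];
      if (begin === undefined) {
        throw new Error('phase "' + phase + '" was never started');
      }
      delete started[phase];
      const stat: PhaseStat = { phase, durationMs: now() - begin, ...extras };
      stats.push(stat);
      return stat;
    },
    // Copy of everything recorded so far, in completion order.
    report(): PhaseStat[] {
      return stats.slice();
    },
  };
}
```

Calling `start("merge")` before handing the plan to the agent and `stop("merge", { filesChanged: 13 })` once the PR is open would have kept the merge numbers out of the judging thread.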
Write the polish friction items down before fixing them. I noticed four
issues and fixed them in one pass, which collapsed the "observed" list and
the "fixed" list into the same moment. Separating them — even by five
minutes — would have made the "what does live-review surface" lesson
sharper for the writeup. A written list also gives you a moment to decide
that a given item isn't worth fixing at all.
By the Numbers
- 3 phases: Bakeoff (4 parallel attempts), Merge (1 informed pass), Polish (1 live-review pass)
- 4 implementations produced in the bakeoff, 0 shipped to main as-is
- 3 of 4 bakeoff runs contributed at least one structural piece to the merge
- 13 files changed in the merge pass (+990/-9)
- 6 files changed in the polish pass (+265/-54)
- 4 friction items caught in polish that couldn't have been predicted before live use
- $35.56 inference cost for the bakeoff phase
- ~45 min bakeoff (parallel), ~30 min merge, ~20 min polish