
Rob

Originally published at vibescoder.dev

Showdown Thoughts: The Three-Pass Pattern

Model Showdown Round 5 ended with a leaderboard. Sonnet 4.6 won on the
rubric. Opus 4.7 placed second. Qwen 3.5 contributed nothing structural
to the final merge. That's the measurement story.

This is the methodology story — what happened after the scores were
revealed.

The Problem With Picking a Winner

The naive workflow after a bakeoff is: pick the best run, merge it to
main, ship it. Winner takes all.

That's wrong, and Round 5 made it obvious why.

The winning run (Sonnet 4.6) had the best overall rubric score. It also
had a weaker path validator than Opus 4.7, and its orphan-matching logic
would have missed real-world cases that Opus 4.6 caught. The second-place
run (Opus 4.7) had the best validator and the cleanest route structure, but
the worst data source choice — reading from the build-time filesystem
instead of the live GitHub Contents API.

No individual run was what I'd ship. Each one had at least one bad call.
The bakeoff's real output wasn't a winner. It was a map.

When 4 of 4 models made the same design choice, that choice was obviously
right. When they diverged — on validation strictness, on data source, on
UX for destructive actions — that divergence was the signal. Those were the
actual design decisions, the ones worth spending judgment on.

The Three Passes

What emerged from Round 5 is a pattern I've now run twice and would reach
for again on any feature where the design space is unclear:

Pass 1 — Bakeoff. Run N models (I used 4) on the same prompt in
isolated sessions. Judge blind, before you know which branch is which.
Score against a rubric. The output of this pass isn't any of the N
implementations — it's the decision map. You now know which choices are
contested and which are obvious.
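One way to make the decision map concrete is to record, per run, which option it picked for each design decision, then flag the decisions where the runs disagree. The sketch below assumes a hypothetical `RunChoices` shape and made-up run data; none of the names or values come from the Round 5 repo.

```typescript
// Hypothetical input shape: run name -> (design decision -> option chosen).
type RunChoices = Record<string, string>;

interface DecisionEntry {
  decision: string;
  // option -> names of the runs that picked it
  options: Record<string, string[]>;
  // true when the runs did not all pick the same option
  contested: boolean;
}

function buildDecisionMap(runs: Record<string, RunChoices>): DecisionEntry[] {
  const byDecision = new Map<string, Record<string, string[]>>();
  for (const [run, choices] of Object.entries(runs)) {
    for (const [decision, option] of Object.entries(choices)) {
      const options = byDecision.get(decision) ?? {};
      const pickedBy = options[option] ?? [];
      pickedBy.push(run);
      options[option] = pickedBy;
      byDecision.set(decision, options);
    }
  }
  return Array.from(byDecision.entries()).map(([decision, options]) => ({
    decision,
    options,
    contested: Object.keys(options).length > 1,
  }));
}

// Made-up data for illustration only, not the Round 5 results.
const exampleRuns: Record<string, RunChoices> = {
  runA: { dataSource: "live-api", deleteConfirm: "modal" },
  runB: { dataSource: "filesystem", deleteConfirm: "modal" },
  runC: { dataSource: "live-api", deleteConfirm: "modal" },
};

// Only the decisions where the runs diverged need human judgment.
const contestedDecisions = buildDecisionMap(exampleRuns)
  .filter((entry) => entry.contested)
  .map((entry) => entry.decision);
console.log(contestedDecisions);
```

In this toy data every run agrees on `deleteConfirm`, so only `dataSource` comes back as contested. The rubric scores still decide which option wins; the map only says where to look.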

Pass 2 — Merge. Write down a merge plan before touching any code: for
each contested layer, which run's approach wins and why. Then ask an agent
to compose the merged best-of from those inputs. The merge is strictly
better than any individual bakeoff run because it draws on information none
of the bakeoff contestants had — the scored comparison of all four.

For Round 5 the plan looked like this:

| Layer | Source | Why |
| --- | --- | --- |
| Path validator | Opus 4.7 (Run 1) | Only run with 2-segment enforcement + `..` block + non-empty checks |
| Three-tier orphan match | Opus 4.6 (Run 2) | Only run that noticed exact-match missed real cases like `day-four` |
| Type-narrowed body parsing | Sonnet 4.6 (Run 3) | `typeof body === "object" && "path" in body`, no `as` casts |
| GitHub Contents API | Opus 4.6 / Sonnet 4.6 | Live state vs. build-time filesystem snapshot |
| Confirm-modal UX | Sonnet 4.6 | Best visual polish in the screenshots |
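To give a feel for what the path validator row is describing, here is a minimal sketch of the three checks it names: exactly two segments, no `.` or `..` traversal, and no empty segments. It is not the code from Run 1; the function name, result type, and sample paths are assumptions made for illustration.

```typescript
// Hypothetical result type: either the two validated segments or a reason.
type PathCheck =
  | { ok: true; segments: [string, string] }
  | { ok: false; reason: string };

function validateTwoSegmentPath(input: string): PathCheck {
  const segments = input.split("/");

  // 2-segment enforcement: "dir/name" and nothing deeper or shallower.
  if (segments.length !== 2) {
    return { ok: false, reason: "expected exactly two segments" };
  }
  const first = segments[0] ?? "";
  const second = segments[1] ?? "";

  for (const segment of [first, second]) {
    // Non-empty check: rejects "dir/", "/name", and whitespace-only parts.
    if (segment.trim() === "") {
      return { ok: false, reason: "empty segment" };
    }
    // Traversal block: "." and ".." must never reach a delete call.
    if (segment === "." || segment === "..") {
      return { ok: false, reason: "relative segment not allowed" };
    }
  }
  return { ok: true, segments: [first, second] };
}

// Illustrative inputs only.
console.log(validateTwoSegmentPath("series/day-four").ok);
console.log(validateTwoSegmentPath("../secrets").ok);
console.log(validateTwoSegmentPath("series/").ok);
```

Returning a reason instead of a bare boolean is a design choice in this sketch: it lets a destructive endpoint explain why it refused a request.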

Qwen 3.5 contributed nothing structural to this table. The bakeoff said
"skip this one" clearly enough that there was nothing to debate. That's
useful information too — knowing which pieces to skip is part of the map.
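The data-source row in the plan is the difference between asking GitHub what exists right now and trusting whatever was on disk at build time. A rough sketch of the live approach follows, using GitHub's REST endpoint `GET /repos/{owner}/{repo}/contents/{path}`, which returns an array of entries for a directory. The function names, the injected `FetchLike` type, and the sample owner, repo, and directory are assumptions for illustration, not the merged code.

```typescript
// Minimal fetch-shaped dependency so the sketch does not rely on global types.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the Contents API URL, encoding each path segment separately.
function contentsUrl(owner: string, repo: string, dir: string): string {
  const encoded = dir.split("/").map(encodeURIComponent).join("/");
  return `https://api.github.com/repos/${owner}/${repo}/contents/${encoded}`;
}

// Keep only the names of entries GitHub reports as directories.
function directoryNames(payload: unknown): string[] {
  if (!Array.isArray(payload)) return [];
  const names: string[] = [];
  for (const item of payload) {
    if (
      typeof item === "object" &&
      item !== null &&
      "type" in item &&
      "name" in item &&
      item.type === "dir" &&
      typeof item.name === "string"
    ) {
      names.push(item.name);
    }
  }
  return names;
}

// Live listing: reflects the repo as it is now, not as it was at build time.
async function listLiveDirectories(
  fetchImpl: FetchLike,
  owner: string,
  repo: string,
  dir: string,
  token?: string
): Promise<string[]> {
  const headers: Record<string, string> = { Accept: "application/vnd.github+json" };
  if (token) headers.Authorization = `Bearer ${token}`;
  const res = await fetchImpl(contentsUrl(owner, repo, dir), { headers });
  if (!res.ok) throw new Error(`GitHub Contents API returned ${res.status}`);
  return directoryNames(await res.json());
}

// Usage (hypothetical repo): listLiveDirectories(fetch, "someone", "blog", "content/posts")
```

A build-time filesystem read would never see a directory that was added or deleted after the last deploy, which matters for a delete feature.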

The merge changed 13 files (+990/-9). One TypeScript error was caught and
fixed, and the build passed on the first try after that. I opened it as a PR
with the heritage table in the description so future reviewers can trace any
decision back to its source run.
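If the merge plan is kept as data, the heritage table for the PR description can be generated instead of hand-typed. The sketch below is one possible shape for that; the `MergeDecision` interface and `renderHeritageTable` function are hypothetical names, and the two sample rows are copied from the Round 5 plan.

```typescript
// Hypothetical record for one row of the merge plan.
interface MergeDecision {
  layer: string;
  source: string;
  why: string;
}

// Two rows copied from the Round 5 plan; the rest would follow the same shape.
const round5Plan: MergeDecision[] = [
  {
    layer: "Path validator",
    source: "Opus 4.7 (Run 1)",
    why: "Only run with 2-segment enforcement + .. block + non-empty checks",
  },
  {
    layer: "GitHub Contents API",
    source: "Opus 4.6 / Sonnet 4.6",
    why: "Live state vs. build-time filesystem snapshot",
  },
];

// Escape pipes so a cell cannot break the table layout.
function escapeCell(cell: string): string {
  return cell.replace(/\|/g, "\\|");
}

// Render the plan as a Markdown table suitable for a PR description.
function renderHeritageTable(plan: MergeDecision[]): string {
  const header = ["| Layer | Source | Why |", "| --- | --- | --- |"];
  const rows = plan.map(
    (d) => `| ${escapeCell(d.layer)} | ${escapeCell(d.source)} | ${escapeCell(d.why)} |`
  );
  return [...header, ...rows].join("\n");
}

console.log(renderHeritageTable(round5Plan));
```

Writing the plan down as data first also enforces the "plan before code" order: the table exists before the merge agent is asked to do anything.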

Pass 3 — Polish. The merged feature went live. I opened it against
real production data and spotted four things immediately: truncated
directory names with no tooltip, delete buttons invisible on touch devices,
no bulk delete UI despite the API supporting paths: [], and an orphaned
section header that would show with count 0 after the lone orphan was
deleted.
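The bulk-delete gap is easier to picture next to the request parsing. Below is a sketch of type-narrowed body parsing in the style the merge plan credits to Run 3 (`typeof` and `in` checks, no `as` casts), extended to accept either a single `path` or a `paths` array. The function name and the choice to reject an empty `paths` array are assumptions, and the `in` narrowing on `object` relies on TypeScript 4.9 or later.

```typescript
// Returns the list of paths to delete, or null when the body is malformed.
function parseDeleteBody(body: unknown): string[] | null {
  if (typeof body !== "object" || body === null) return null;

  // Prefer the bulk form; fall back to wrapping a single path in an array.
  const candidate =
    "paths" in body ? body.paths : "path" in body ? [body.path] : undefined;

  // Assumption: an empty list is treated as a malformed request.
  if (!Array.isArray(candidate) || candidate.length === 0) return null;

  const paths: string[] = [];
  for (const entry of candidate) {
    // Every entry must be a non-empty string; one bad entry rejects the batch.
    if (typeof entry !== "string" || entry === "") return null;
    paths.push(entry);
  }
  return paths;
}

// Illustrative request bodies only.
console.log(parseDeleteBody({ path: "series/day-four" }));
console.log(parseDeleteBody({ paths: ["series/one", "series/two"] }));
console.log(parseDeleteBody({ paths: [42] }));
```

Rejecting the whole batch on one bad entry is the conservative choice for a destructive verb; a partial delete is harder to reason about than a refused one. Each returned string would still need to go through path validation before anything is deleted.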

None of those were predictable before live use. You can't predict friction
from a code review — you observe it. The polish pass had to come after the
merge because the artifact it was polishing didn't exist until then.

The polish pass changed 6 files (+265/-54) and took about 20 minutes of agent
time.

When to Use It

The pattern has a real cost: the bakeoff is N full agent sessions, each
producing a complete implementation that you won't ship. For Round 5 that
was ~$35 in inference and a few hours of judging.

That's cheap insurance when the feature has any of these properties:

  • Destructive verbs. Delete, update, payment, permission change. The cost of getting validation wrong outweighs the cost of the bakeoff.
  • Multiple defensible architectures. Where should validation live? What's the data source? How does auth thread through? When you genuinely don't know the right answer, a bakeoff shows you the option space.
  • Hard to change later. Database schemas. Public API contracts. Anything that will accumulate callers.

It's overkill for a 20-line UI tweak or a feature with a single obvious
implementation. The signal value of the bakeoff scales with how uncertain
you are about the design.

What I'd Do Differently

Three things I'd change for the next run:

Name the contestant chats before pasting the prompt. All four Round 5
chats showed up as "New Chat" in the Coder API cost summary, which meant
20 minutes of token-volume detective work to figure out which cost belonged
to which run. Five seconds of effort would have prevented that.

Capture per-phase stats. I have clean bakeoff numbers. I don't have
separate merge or polish numbers — they're folded into the judging thread.
A lightweight wrapper script around each phase would make the next
iteration measurable end-to-end.
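A wrapper for this does not need to be elaborate. The sketch below is one possible version: call `startPhase` when a phase begins, call the returned function when it ends, and a record lands in a log. All names here are hypothetical, and the clock is injectable only so the timing can be checked without waiting.

```typescript
// Hypothetical record of one pass through the pattern.
interface PhaseRecord {
  phase: string;
  minutes: number;
  notes: string;
}

// Start timing a phase; the returned function stops the timer and logs it.
function startPhase(
  phase: string,
  log: PhaseRecord[],
  now: () => number = Date.now
): (notes?: string) => PhaseRecord {
  const startedAt = now();
  return (notes = "") => {
    const record: PhaseRecord = {
      phase,
      minutes: (now() - startedAt) / 60_000,
      notes,
    };
    log.push(record);
    return record;
  };
}

// Sum the logged minutes per phase name.
function minutesByPhase(log: PhaseRecord[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const record of log) {
    totals[record.phase] = (totals[record.phase] ?? 0) + record.minutes;
  }
  return totals;
}

// Usage: wrap each pass, then read the totals at the end.
const phaseLog: PhaseRecord[] = [];
const finishMerge = startPhase("merge", phaseLog);
// ... run the merge session here ...
finishMerge("diff stats and cost would go in the notes");
console.log(minutesByPhase(phaseLog));
```

The free-form `notes` field is a placeholder; diff stats and inference cost per phase could become their own fields once it is clear which numbers are worth tracking.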

Write the polish friction items down before fixing them. I noticed four
issues and fixed them in one pass, which collapsed the "observed" list and
the "fixed" list into the same moment. Separating them — even by five
minutes — would have made the "what does live-review surface" lesson
sharper for the writeup. And occasionally you'll notice something that
isn't worth fixing.

By the Numbers

  • 3 phases: Bakeoff (4 parallel attempts), Merge (1 informed pass), Polish (1 live-review pass)
  • 4 implementations produced in the bakeoff, 0 shipped to main as-is
  • 3 of 4 bakeoff runs contributed at least one structural piece to the merge
  • 13 files changed in the merge pass (+990/-9)
  • 6 files changed in the polish pass (+265/-54)
  • 4 friction items caught in polish that couldn't have been predicted before live use
  • ~$35.56 inference cost for the bakeoff phase
  • ~45 min bakeoff (parallel), ~30 min merge, ~20 min polish
