I've been running model showdowns on Vibes Coder for a while now. Each round has been a little messier than I wanted — different prompts, accidental context leaks, no clean way to compare cost to quality. This one is the first I'd call a fair bakeoff. Two goals going in:
- Make the experiment itself rigorous enough that future rounds can build on it — isolated chat sessions, identical prompts, anonymized branches, blind judging, real token + runtime data pulled from the Coder API.
- Compare three flavors of Claude against our local champ. Opus 4.7, Opus 4.6, and Sonnet 4.6 from Anthropic; Qwen 3.5 35B-A3B running on llama.cpp on the RTX 5090 in the home lab. Four models, same task, four isolated Coder Agents sessions, blind judging.
The headline: Sonnet 4.6 beat Opus 4.6 on a coding task. Not by much (4.48 vs 4.36) but cleanly, on its own merits, with no asterisks. And once I pulled real token and runtime data from Coder's chat-cost API, a second headline emerged: weighted by cost, Sonnet's win becomes decisive — about 10x cheaper per rubric point than either Opus model. A third wrinkle: Opus 4.7 finished the task in 9.2 minutes, the fastest of the three Claude runs. It won the rubric without burning the most time. The deeper story is what each model did with the same prompt, and what it took to make the bakeoff fair in the first place — which turned out to be more work than the bakeoff itself.
The Setup
The contestants:
| Run | Model | Where it runs |
|---|---|---|
| 1 | Claude Opus 4.7 | Cloud, via Coder Agents |
| 2 | Claude Sonnet 4.6 | Cloud, via Coder Agents |
| 3 | Claude Opus 4.6 | Cloud, via Coder Agents |
| 4 | Qwen 3.5 35B-A3B | Local, llama.cpp on the RTX 5090, via Coder Agents |
The mapping was private. Branches were named run-1 through run-4. I judged the four branches blind against a fixed rubric, then revealed the identities.
The task: build image management into the vibescoder.dev admin dashboard. The current /admin page has a Settings card that's a placeholder. The spec asked for an Images card (or a replacement) that lists the post-image directories under public/images/, detects orphans (directories with no matching post), provides a way to view the images in each directory, and adds an API route to delete a directory.
It's not a huge feature, but it has enough surface area to differentiate models: filesystem traversal, slug matching, path validation, an API contract with a destructive verb, a UI page, and at least one judgment call (what counts as an "orphan"?).
The fairness story
Before launching anything, three things needed fixing. None of them are interesting on their own. Together they're the operational lesson of this post: a bakeoff isn't fair by default.
Fix 1: Node 18 vs Node 20
The workspace image is built on Ubuntu 24.04. Ubuntu 24.04's apt Node is 18.19. Next.js 16 — what the blog engine ships on — requires Node 20+. Any agent that ran apt install nodejs would silently break its own build.
The fix was a Dockerfile change in the coder-templates repo: install Node 20 from NodeSource at image build time, pin npm, verify node -v reports 20.x in the smoke test. After that, node -v in a fresh workspace prints v20.20.2 and nothing the agents do (short of nvm shenanigans) changes that.
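The smoke-test idea boils down to parsing the version string and comparing the major number. A minimal sketch, assuming the check receives the raw output of node -v as a string; `parseNodeMajor` and `meetsMinimum` are hypothetical names, not code from the coder-templates repo:

```typescript
// Hypothetical helpers: not taken from the coder-templates repo.
// Parse the major version out of `node -v` output such as "v20.20.2".
function parseNodeMajor(versionOutput: string): number | null {
  const match = /^v(\d+)\.\d+\.\d+/.exec(versionOutput.trim());
  return match ? Number(match[1]) : null;
}

// True when the reported version satisfies the minimum major (20 for this image).
function meetsMinimum(versionOutput: string, minMajor = 20): boolean {
  const major = parseNodeMajor(versionOutput);
  return major !== null && major >= minMajor;
}

console.log(meetsMinimum("v20.20.2")); // the pinned workspace image
console.log(meetsMinimum("v18.19.1")); // an 18.19.x build, like the apt package
```

Failing the image build when `meetsMinimum` returns false is what keeps a stale base image from quietly reaching the agents.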
Fix 2: The system instructions were lying
The chat system prompt — injected at the top of every Coder Agents session — said Node was not pre-installed and told agents to install it themselves. Correct on the previous image; actively misleading after Fix 1. An agent following the instructions would apt install nodejs, get Node 18, downgrade the runtime, and break the build.
I rewrote the instructions to say Node 20 is pre-installed, do not reinstall, use nvm if you need a different version. Boring change. Huge impact on whether the bakeoff produces meaningful signal.
Fix 3: Prompt poisoning
The first draft of the bakeoff prompt told each agent to create a branch named after the model running the session — bakeoff-opus47, bakeoff-sonnet46, and so on. A sharp catch from the human side: that wording leaks competition signaling into the prompt. An agent that sees "you are opus47" or even "this is a bakeoff" can adjust behavior in ways that aren't comparable. The experiment stops measuring "what does this model do with the prompt" and starts measuring "what does this model do when it knows it's on stage."
Fix: replace model names with neutral ordinals. Branches became run-1 through run-4. The prompt made no reference to other runs, scoring, or any comparison. Each agent thought it was building a feature, not auditioning.
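A check like this can be automated before launch. The sketch below is one hedged way to do it: `findLeaks`, `branchFor`, and the term list are all hypothetical, chosen for illustration rather than taken from the actual planning thread.

```typescript
// Hypothetical lint: the term list is an assumption, not the author's actual check.
const LEAK_TERMS = ["opus", "sonnet", "qwen", "bakeoff", "competition", "scoring", "rubric"];

// Return every forbidden term that appears in the prompt (case-insensitive).
function findLeaks(prompt: string, terms: string[] = LEAK_TERMS): string[] {
  const lower = prompt.toLowerCase();
  return terms.filter((term) => lower.includes(term));
}

// Neutral branch name for the Nth run (1-based), matching the run-1..run-4 scheme.
function branchFor(runNumber: number): string {
  return `run-${runNumber}`;
}

const poisoned = "Create a branch named bakeoff-opus47 for this bakeoff.";
const neutral = `You are working in the vibescoder.dev blog engine repo. Branch: ${branchFor(2)}.`;

console.log(findLeaks(poisoned)); // ["opus", "bakeoff"]
console.log(findLeaks(neutral)); // []
```

Substring matching is deliberately blunt here; a false alarm on a prompt draft is cheaper than a leaked identity.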
Three small fixes. Together they're the operational lesson: fairness in a model bakeoff requires more setup than the bakeoff itself.
The prompt
The prompt was identical for all four runs, save for the run number in the branch name. Verbatim, with one path generalized:
You are working in the vibescoder.dev blog engine repo. Branch: run-N.
Baseline commit is at the tip of main.
Goal: add image management to /admin.
Requirements:
- List the directories under public/images/ (each directory corresponds
to one post and contains its images).
- For each directory, report: name, file count, total size on disk,
and whether it matches a published or draft post (by slug).
- Surface "orphaned" directories — directories that do not match any
post — so I can clean them up.
- Provide a way to view the images in a directory (thumbnails or list).
- Provide an API route DELETE /api/admin/images that removes a
directory by path. The route must validate input.
- Update the /admin landing page so the new feature is reachable.
You may keep the Settings placeholder card or replace it; either is fine.
- Add a screenshot of the new page to the PR description (use the
Playwright MCP).
- Run `npm run build` before committing. Do not push commits that
fail the build.
- Commit in logical chunks. Push the branch when done.
That's it. No mention of competing runs. No scoring rubric. No model identification. Just a feature spec and a quality bar.
The four implementations
All four runs built it. All four passed npm run build against a shared engine baseline on Node 20.20.2. All four pushed their branches. Then the differences started showing up.
Run 1 — 8 new files, 631+/9-
Replaced the Settings placeholder with an Images card on /admin. Added a dedicated /admin/images page that lists directories server-side, plus a client-side modal that renders a grid of thumbnails when you click into a directory. Three screenshots in the PR description — admin landing, images list, modal open with orphan-flagged styling.
The standout was the API route. Run 1 wrote a real path validator — isValidImageRepoPath — that required exactly two path segments under public/images/, rejected .., and ran before the filesystem call. The route returned distinct status codes for distinct failure modes: 400 for bad input, 404 for missing, 403 for paths that resolve outside the allowed root, 200 for success.
It's not glamorous code. It's just the version where someone thought about the failure modes before writing the success path.
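To make the shape of that route concrete, here is a minimal sketch of a validator that maps each failure mode to its own status code. It is not Run 1's code: `deleteStatus`, `ALLOWED_ROOT`, and the `exists` callback are hypothetical, and the required depth is a parameter because the post does not show the exact layout Run 1 enforced.

```typescript
// Hypothetical sketch of a multi-status validator; not Run 1's actual code.
type DeleteStatus = 200 | 400 | 403 | 404;

const ALLOWED_ROOT = ["public", "images"];

function deleteStatus(
  input: unknown,
  exists: (rest: string[]) => boolean,
  segmentsUnderRoot = 1, // assumption: the enforced depth is not shown in the post
): DeleteStatus {
  // 400: not a string, empty, or containing characters a repo path should never have.
  if (typeof input !== "string" || input.length === 0) return 400;
  if (input.includes("\\") || input.includes("\0")) return 400;

  const segments = input.split("/");
  // 400: empty, ".", or ".." segments are rejected outright rather than resolved.
  if (segments.some((s) => s === "" || s === "." || s === "..")) return 400;

  // 403: a well-formed path that points outside the allowed root.
  const underRoot = ALLOWED_ROOT.every((part, i) => segments[i] === part);
  if (!underRoot) return 403;

  // 400: wrong depth below the root (the root itself, or something nested deeper).
  const rest = segments.slice(ALLOWED_ROOT.length);
  if (rest.length !== segmentsUnderRoot) return 400;

  // 404: valid shape, but nothing there to delete.
  return exists(rest) ? 200 : 404;
}

// Stand-in for a filesystem lookup so the sketch runs without touching disk.
const knownDirs = new Set(["day-four"]);
const dirExists = (rest: string[]): boolean => knownDirs.has(rest.join("/"));

console.log(deleteStatus("public/images/day-four", dirExists)); // 200
console.log(deleteStatus("public/images/../secrets", dirExists)); // 400
console.log(deleteStatus("src/app/page.tsx", dirExists)); // 403
console.log(deleteStatus("public/images/missing", dirExists)); // 404
```

A lexical check like this cannot see symlinks, so a real route would also resolve the path on disk and re-check it against the root before deleting anything.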
Run 1's /admin/images page. Directory cards, orphan-flagged styling, and a tight path-validated delete API behind the trash icons.
Run 2 — 6 new files, 687+/7-
Kept the Settings card. Added an Images card next to it on /admin. The /admin/images page was the cleanest of the four: tight TypeScript, no `as` casts in the API route, and proper type narrowing (`typeof body === "object" && "path" in body`) instead of forcing the compiler to trust it. The UI had the most visual polish: directory cards with file counts as a badge, hover states that matched the rest of the admin surface, and a confirmation modal on delete that quoted the directory name back at you.
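For readers who haven't used this pattern, a minimal sketch of cast-free narrowing on an unknown request body follows. `readDeletePath` is a hypothetical name, and the explicit null check is an addition here, since `typeof null` is also "object".

```typescript
// Hypothetical request-body parser illustrating narrowing without `as` casts.
// Relies on TypeScript 4.9+ narrowing of the `in` operator on an `object`.
function readDeletePath(body: unknown): string | null {
  // `typeof null` is "object" too, so rule null out before using `in`.
  if (typeof body === "object" && body !== null && "path" in body) {
    const value = body.path; // typed as `unknown`; no cast required
    return typeof value === "string" ? value : null;
  }
  return null;
}

console.log(readDeletePath({ path: "public/images/day-four" })); // "public/images/day-four"
console.log(readDeletePath({ path: 5 })); // null
console.log(readDeletePath(null)); // null
```

The payoff is that a malformed body turns into a clean `null` at the boundary instead of a runtime surprise deeper in the handler.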
Path validation was decent but not as rigorous as Run 1 — startsWith("public/images/") plus a .. block, no segment-count check. Enough to stop the obvious cases. Not airtight against creative inputs.
Two screenshots. Shipped a polished v1 and stopped.
Run 2 kept the Settings card and put Images next to it. Cleanest TypeScript of the four; smallest screenshot artifact.
Run 3 — 6 new files, 595+/0-
Replaced the Settings placeholder. The /admin/images page started as a server component, then mid-task switched to a client-fetched implementation when Run 3 hit a dev-server timeout on the first integration test. That mid-stream pivot showed up cleanly in the commit history — feat: add admin/images server-rendered, then two commits later, refactor: move admin/images to client fetch (dev server hangs on FS scan).
Path validation matched Run 2's. The thing that made Run 3 interesting was the orphan-detection arc.
The spec asked whether each directory "matches a published or draft post (by slug)." Three of the four models took that literally: list directories, list slugs, set-difference, report what's left. Run 3 did that first and reported 8 orphaned directories, then checked the result against reality. It looked at the actual file tree and noticed that one of the "orphaned" directories was day-four/, while there's a published post with the slug day-four-rss-analytics-syndication-and-loom. The directory isn't orphaned. It belongs to that post. The matching logic was wrong.
Run 3 iterated three times: exact match → prefix match (does any slug start with this directory name?) → content-reference match (does any post body reference an image in this directory?). After the third pass, the orphan count went from 8 to 1 — and the one remaining was an actual orphan I'd been meaning to delete for weeks.
Small thing in the diff. Big thing in engineering judgment. The other three models reported false-positive orphans with high confidence. Run 3 noticed its own answer was wrong and kept working.
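The three matching passes compose naturally into one predicate. This is a sketch under stated assumptions, not Run 3's implementation: the `Post` shape, `isClaimed`, `findOrphans`, and the sample directory names other than day-four are hypothetical. Requiring a hyphen after the directory name is also an added guard, so that day-four would not claim a slug such as day-fourteen.

```typescript
// Hypothetical shapes and names; the tiers follow the three passes described above.
interface Post {
  slug: string;
  body: string; // raw post content
}

function isClaimed(dir: string, posts: Post[]): boolean {
  return posts.some(
    (post) =>
      post.slug === dir || // pass 1: exact slug match
      post.slug.startsWith(`${dir}-`) || // pass 2: slug begins with the directory name
      post.body.includes(`/images/${dir}/`), // pass 3: body references an image in it
  );
}

function findOrphans(dirs: string[], posts: Post[]): string[] {
  return dirs.filter((dir) => !isClaimed(dir, posts));
}

const samplePosts: Post[] = [
  { slug: "day-four-rss-analytics-syndication-and-loom", body: "" },
  { slug: "launch-notes", body: "![hero](/images/hero-shots/cover.png)" },
];

console.log(findOrphans(["day-four", "hero-shots", "leftover"], samplePosts)); // ["leftover"]
```

Exact matching alone would have flagged all three sample directories; the two extra passes are what separate a real orphan from a naming mismatch.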
Run 3's screenshot — the largest and most polished of the four. The orphan count in the header reads 1 instead of 8 because the matching logic had been corrected mid-task.
Run 4 — 7 new files, 607+/0-
Kept the Settings card, added an Images card. The /admin/images page worked. Build passed. The directory listing rendered correctly.
Two structural issues. First, the codebase ended up with two utility libraries — images.ts and imageUtils.ts — with overlapping responsibilities. The first pass put filesystem helpers in images.ts, which got imported into a client component, which pulled fs into the client bundle and broke the build. The fix added imageUtils.ts for client-safe helpers and re-imported. The dead code in images.ts was never cleaned up.
Second, the screenshot. Run 4 ran playwright screenshot, hit the same missing-system-libraries failure the other three runs hit (libnspr4, libpango-1.0-0, the headless Chromium kit), sudo apt install-ed the dependencies — and then never retried the screenshot. Instead the PR description got a 184-line markdown description of what the page would look like, in lieu of a PNG. The deps were installed. The retry never fired.
Path validation was the weakest of the four — startsWith on the user-supplied path, no normalization, no .. block. The class of weakness is that a path that looks like it's under public/images/ can still resolve elsewhere when the OS interprets it. I'm not going to spell out the exact bypass; the point is that a one-line startsWith check is not a path validator, and Run 4 shipped one.
Run 4's "screenshot" is a 184-line markdown file. The opening:
Page Description: /admin/images
Overall Layout
The /admin/images page displays a dashboard-style view of all image directories with a neon brutalist design consistent with the existing admin theme.
Header Section
At the top:
- Title: // Images in monospace font with primary color (cyan/teal)
- Stats bar showing:
  - Total directories count
  - Total files count
  - Total size in human-readable format (MB/GB)
  - Orphaned count (in warning yellow/orange color, only shown if > 0)
…and 165 more lines of design notes.
Blind scoring
Rubric, weights, and scores:
| Dimension | Weight | Run 1 | Run 2 | Run 3 | Run 4 |
|---|---|---|---|---|---|
| Correctness | 25% | 5.0 | 5.0 | 5.0 | 4.0 |
| Design | 15% | 4.5 | 5.0 | 4.0 | 3.0 |
| Code quality | 20% | 5.0 | 5.0 | 4.5 | 2.5 |
| Engineering judgment | 15% | 4.5 | 4.0 | 5.0 | 2.5 |
| Scope discipline | 10% | 4.5 | 4.5 | 4.0 | 3.5 |
| Commit hygiene | 10% | 4.5 | 4.0 | 4.5 | 3.5 |
| Surprise | 5% | 4.0 | 3.5 | 5.0 | 2.5 |
| Weighted total | 100% | 4.68 | 4.48 | 4.36 | 3.18 |
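For anyone reusing the rubric, a weighted total is just the sum of each dimension score times its weight. The sketch below uses the weights and per-dimension scores from the table; `weightedTotal` and the data layout are hypothetical. Totals recomputed this way from the scores as displayed preserve the ranking but need not reproduce the published row exactly, so treat the output as a check on the method rather than on the final figures.

```typescript
// Weights and per-dimension scores copied from the rubric table, in row order:
// correctness, design, code quality, judgment, scope, commits, surprise.
const WEIGHTS = [0.25, 0.15, 0.2, 0.15, 0.1, 0.1, 0.05];

const RUBRIC = [
  { run: "Run 1", scores: [5.0, 4.5, 5.0, 4.5, 4.5, 4.5, 4.0] },
  { run: "Run 2", scores: [5.0, 5.0, 5.0, 4.0, 4.5, 4.0, 3.5] },
  { run: "Run 3", scores: [5.0, 4.0, 4.5, 5.0, 4.0, 4.5, 5.0] },
  { run: "Run 4", scores: [4.0, 3.0, 2.5, 2.5, 3.5, 3.5, 2.5] },
];

// Sum of score * weight across dimensions.
function weightedTotal(scores: number[], weights: number[] = WEIGHTS): number {
  if (scores.length !== weights.length) throw new Error("scores and weights must align");
  return scores.reduce((sum, score, i) => sum + score * (weights[i] ?? 0), 0);
}

// Rank runs from highest to lowest recomputed total.
const ranking = RUBRIC.map((r) => ({ run: r.run, total: weightedTotal(r.scores) }))
  .sort((a, b) => b.total - a.total)
  .map((r) => r.run);

console.log(ranking);
```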
Scoring notes I wrote during the blind pass, before the reveal:
- Run 1 — "Most defensive of the four. The path validator is the kind of code I'd want to ship to production. Loses half a design point for being slightly less visually polished than Run 2."
- Run 2 — "Tightest TypeScript I've seen this week. Visual polish is the best of the four. Path validation is fine but not paranoid. Stopped at v1 — didn't iterate, didn't second-guess. Probably Sonnet."
- Run 3 — "Mid-task architecture pivot, three iterations on orphan detection, the only run that produced an honest orphan count. Took the longest. Most thoughtful. Probably Opus 4.6."
- Run 4 — "Two overlapping libraries, dead code left behind, weak path validation, fell back to a markdown description instead of a real screenshot. The dependency install was right there. The retry never came. Probably Qwen."
Two guesses right (Run 1 = Opus 4.7, Run 4 = Qwen). Two guesses swapped. Run 2 was Sonnet 4.6. Run 3 was Opus 4.6. I had them reversed — but I had the behavior right. I thought "polished, decisive, stopped at v1" was Sonnet, and it was. I thought "iterated three times until the answer was honest" was Opus, and it was. The guesses were wrong about which Opus, not about the disposition.
The reveal
| Rank | Model | Score | Headline |
|---|---|---|---|
| 1 | Opus 4.7 | 4.68 | Strongest path validator, multi-status DELETE API, three screenshots |
| 2 | Sonnet 4.6 | 4.48 | Tightest TypeScript, best visual polish, shipped a clean v1 and stopped |
| 3 | Opus 4.6 | 4.36 | Only model that noticed the slug-prefix problem and iterated until orphan detection was honest |
| 4 | Qwen 3.5 35B-A3B | 3.18 | Missing screenshot, weakest path validation, architectural churn |
What surprised me
Sonnet beat Opus 4.6. I didn't expect that. On previous bakeoffs Opus has been the model that goes deeper. Here, Sonnet's tighter implementation and faster decisive shipping outscored Opus's iteration. Two different success modes:
- Sonnet's mode: get to a clean v1 fast, polish what's there, stop. Trust the spec.
- Opus 4.6's mode: ship a first pass, look at the output, notice when it disagrees with reality, iterate.
Neither is wrong. If the spec is precise and "ship the feature" is the success criterion, Sonnet's mode wins. If the spec is approximate and "produce a correct answer" is the success criterion, Opus's mode wins. On this task, Sonnet was polished enough that Opus's iteration premium didn't make up the gap.
Opus 4.6's slug-prefix insight is the engineering moment of the bakeoff. Three models took the spec literally and produced false-positive orphans. One model checked its work, noticed the discrepancy, and kept going until the answer was honest. The cost was time — Opus 4.6 took 28.1 minutes, 3x longer than Opus 4.7's 9.2 minutes, and 146 messages versus Opus 4.7's 84. The benefit was the only correct orphan count in the bunch. That's the trade-off, and on a real codebase I'd take it every time — but it's worth being honest that the iteration premium showed up in the bill as well as the clock.
Qwen failed roughly where predicted. Pre-launch I'd written down four likely failure modes: skip orphan detection, weak design system match, miss the screenshot, forget to push. Three of those landed at least partially — Qwen did implement orphan detection, but did it naively, which is how the predicted weakness actually manifested; the design fit was rough; the screenshot was missed; the push went fine. The pattern wasn't where I expected, though. Qwen didn't fail at the planning level. It failed at the retry level. Every concrete step was reasonable. What was missing was the loop — retry the screenshot after installing the deps, clean up the dead code after the refactor, question whether two utility libraries were one too many. That's the agentic gap, and it's narrower than a year ago but still visible.
The screenshot step was the cleanest differentiator. Same task, same workspace template, same Playwright MCP, same headless Chromium dependency stack. Three models installed the missing libraries and got real PNGs. One model installed the libraries and produced a markdown description instead. Same workspace, same tools, completely different outcomes. If you wanted to test agentic loop-closing in a single observable step, this would be it.
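The loop being tested here is small enough to write down. A minimal sketch, assuming a synchronous action for simplicity: `runWithRemedy` and the simulated screenshot step are hypothetical, and the real agents were driving Playwright rather than a stub.

```typescript
// Hypothetical helper: run an action and, if it fails, apply a remedy and try again.
function runWithRemedy<T>(action: () => T, remedy: (error: unknown) => void, maxRetries = 1): T {
  let attempt = 0;
  for (;;) {
    try {
      return action();
    } catch (error) {
      if (attempt >= maxRetries) throw error;
      remedy(error); // e.g. install the missing system libraries
      attempt += 1; // then loop back and actually retry the original action
    }
  }
}

// Simulated screenshot step: fails until the (pretend) dependencies are installed.
let depsInstalled = false;
const takeScreenshot = (): string => {
  if (!depsInstalled) throw new Error("missing libnspr4");
  return "admin-images.png";
};

const artifact = runWithRemedy(takeScreenshot, () => {
  depsInstalled = true;
});
console.log(artifact); // "admin-images.png"
```

The failure described above is the version of this loop where the remedy runs and the retry never does.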
Two of four replaced the Settings placeholder; two kept it. The spec allowed either. Both Opus runs replaced it; Sonnet and Qwen kept it alongside the new Images card. Not a quality signal — a reading of the spec — but interesting that the two Opus variants made the same call independently, and the two non-Opus models made the same opposite call.
What the bill says
The rubric scores were one half of the bakeoff. The other half lives in Coder's chat-cost API. Coder's OSS deployment exposes /api/experimental/chats/cost/{user}/summary — an experimental endpoint that returns per-chat input tokens, output tokens, cache reads, cache writes, message counts, and runtime. (Coder Premium has a fuller "AI Bridge" cost product; on OSS, the experimental chats endpoint is the equivalent and gives you everything you need to do this analysis.)
Querying per-chat instead of per-model matters. My first pass aggregated by model and the Opus 4.7 totals looked enormous — until I realized the rollup had silently combined two chats running on the same model: this judging thread plus the actual Opus 4.7 contestant run. After identifying the contestant by its chat ID prefix (2c4e8f98) and isolating to that session, the numbers got honest. The lesson: for clean bakeoff stats, query at the chat-id level, not by model. Two sessions on the same model will silently pool.
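The pooling problem is easy to reproduce with two rows. In this sketch the `ChatUsage` field names are assumptions rather than the endpoint's documented schema, the second row's chat ID and token count are made up for illustration, and only the contestant's ID prefix and output-token figure come from the run itself.

```typescript
// Hypothetical row shape: field names are assumptions, not the endpoint's schema.
interface ChatUsage {
  chatId: string;
  model: string;
  outputTokens: number;
}

// Sum output tokens under whatever grouping key the caller picks.
function sumBy(rows: ChatUsage[], key: (row: ChatUsage) => string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const k = key(row);
    totals.set(k, (totals.get(k) ?? 0) + row.outputTokens);
  }
  return totals;
}

const usageRows: ChatUsage[] = [
  { chatId: "2c4e8f98", model: "opus-4.7", outputTokens: 32114 }, // contestant run
  { chatId: "judging-thread", model: "opus-4.7", outputTokens: 50000 }, // made-up second session
];

const byModel = sumBy(usageRows, (row) => row.model);
const byChat = sumBy(usageRows, (row) => row.chatId);

console.log(byModel.get("opus-4.7")); // 82114: two sessions pooled
console.log(byChat.get("2c4e8f98")); // 32114: the contestant alone
```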
The finding the dashboard didn't surface: Opus 4.7 won the rubric (4.68), but weighted by cost-per-rubric-point at Anthropic list prices, Sonnet 4.6 wins decisively. $0.37 per rubric point for Sonnet vs $3.87 for Opus 4.7 and $3.63 for Opus 4.6. Sonnet was the only economically sensible choice for a task this size.
The Qwen line is the other one to sit with. Qwen finished in 6.4 minutes — faster than every Claude run — and produced the lowest-scoring artifact. Locally hosted inference is genuinely faster per turn (~4 seconds vs 6–13 seconds for the Claude runs); the shortfall was per-turn productivity, not latency. A longer Qwen run might have closed the gap. A 6-minute Qwen run did not.
One honest caveat on the cost numbers: this OSS Coder deployment doesn't have model cost config set, so the dashboard reported $0 across the board. The costs in the table below are list-price estimates calculated from the raw token counts. Production Anthropic billing would match closely modulo any rate plan.
| Model | Input | Output | Cache R | Cache W | Runtime | Messages | Est Cost |
|---|---|---|---|---|---|---|---|
| Opus 4.7 | 99 | 32,114 | 4,772,142 | 454,581 | 9.2 min | 84 | $18.09 |
| Opus 4.6 | 14,671 | 45,137 | 6,493,552 | 132,707 | 28.1 min | 146 | $15.83 |
| Sonnet 4.6 | 110 | 25,935 | 3,097,881 | 85,057 | 15.2 min | 106 | $1.64 |
| Qwen 3.5 35B-A3B | 55,615 | 23,743 | 4,253,874 | 0 | 6.4 min | 88 | $0.00 |
Cost-efficiency, $/rubric point (lower is better): Opus 4.7 $3.87, Opus 4.6 $3.63, Sonnet 4.6 $0.37, Qwen $0.00. Pricing: Opus $15/M in, $75/M out, $1.50/M cache read, $18.75/M cache write; Sonnet $3/M in, $15/M out, $0.30/M cache read, $3.75/M cache write; Qwen runs locally on the RTX 5090.
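The estimates can be reproduced from the raw counts. The sketch below uses the token counts, list prices (USD per million tokens), and rubric totals quoted in this section; `estimateCost` and the data layout are hypothetical names for illustration.

```typescript
// Token counts, list prices (USD per million tokens), and rubric totals as quoted above.
interface Usage { input: number; output: number; cacheRead: number; cacheWrite: number; }
interface Price { input: number; output: number; cacheRead: number; cacheWrite: number; }

const OPUS_PRICE: Price = { input: 15, output: 75, cacheRead: 1.5, cacheWrite: 18.75 };
const SONNET_PRICE: Price = { input: 3, output: 15, cacheRead: 0.3, cacheWrite: 3.75 };

// Dollar cost of one chat: each token class times its per-million price.
function estimateCost(u: Usage, p: Price): number {
  const micro =
    u.input * p.input + u.output * p.output + u.cacheRead * p.cacheRead + u.cacheWrite * p.cacheWrite;
  return micro / 1e6;
}

const claudeRuns = [
  { model: "Opus 4.7", price: OPUS_PRICE, score: 4.68,
    usage: { input: 99, output: 32114, cacheRead: 4772142, cacheWrite: 454581 } },
  { model: "Opus 4.6", price: OPUS_PRICE, score: 4.36,
    usage: { input: 14671, output: 45137, cacheRead: 6493552, cacheWrite: 132707 } },
  { model: "Sonnet 4.6", price: SONNET_PRICE, score: 4.48,
    usage: { input: 110, output: 25935, cacheRead: 3097881, cacheWrite: 85057 } },
];

const costReport = claudeRuns.map((run) => {
  const cost = estimateCost(run.usage, run.price);
  return { model: run.model, cost: cost.toFixed(2), perPoint: (cost / run.score).toFixed(2) };
});

console.log(costReport);
// Opus 4.7: 18.09 total, 3.87 per point
// Opus 4.6: 15.83 total, 3.63 per point
// Sonnet 4.6: 1.64 total, 0.37 per point
```

Most of each bill is cache traffic rather than fresh input or output, which is why the per-chat cache columns matter more than the headline token counts.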
By the Numbers
- 4 models tested in isolated Coder Agents sessions — Opus 4.7, Opus 4.6, Sonnet 4.6, Qwen 3.5 35B-A3B
- 4 branches pushed (`feature/image-management-run-1` through `run-4`); 0 PRs opened to preserve isolation
- 4/4 builds passed `npm run build` on Node 20.20.2 against the engine baseline
- 3/4 screenshots succeeded: Qwen installed the headless-browser deps but never retried the capture, and fell back to a markdown description of the page
- 1/4 models produced an honest orphan count (Opus 4.6, 1 real orphan); the other three reported 8 orphans from naive slug matching, most of them false positives
- 2/4 blind identity guesses correct (Opus 4.7, Qwen); the two Claude behavioral reads were right but attributed to the wrong Opus
- 3 pre-launch fairness fixes shipped before the bakeoff could run — Node 20 in the workspace image, a corrected system-instructions block, and the prompt-poisoning catch that anonymized the branches
- 2 repos touched to ship the fairness work: `coder-templates` (Dockerfile + system instructions) and the bakeoff prompt iteration in the planning thread
- ~630 lines of code added per implementation on average (range 595–687); roughly 6–8 new files per branch
- 2 new routes per implementation — an admin page and an API route with a destructive verb
- 84 / 146 / 106 / 88 messages sent in the four chat sessions (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Qwen); 9.2 / 28.1 / 15.2 / 6.4 minutes of wall-clock runtime
- ~$35.56 total bakeoff cost at Anthropic list prices — about a fancy dinner for four independent attempts at a real feature with judgable artifacts
- $0.37 vs $3.87 per rubric point: Sonnet 4.6's cost-efficiency vs Opus 4.7's. About ten times cheaper, for a slightly lower rubric score (4.48 vs 4.68).
- 1 result I didn't expect: Sonnet beat Opus 4.6 on rubric (4.48 vs 4.36) and beat both Opus models by 10x on cost-efficiency
- 1 follow-up filed in `content/TODO.md`: build `scripts/bakeoff-stats.sh` so the next round's per-chat aggregation is one command instead of a manual jq exercise


