Rob

Posted on May 18 • Originally published at vibescoder.dev

Model Showdown Round 5: Four Agents Build the Same Feature

#modelshowdown #agents #vibecoding

I've been running model showdowns on Vibes Coder for a while now. Each round has been a little messier than I wanted — different prompts, accidental context leaks, no clean way to compare cost to quality. This one is the first I'd call a fair bakeoff. Two goals going in:

Make the experiment itself rigorous enough that future rounds can build on it — isolated chat sessions, identical prompts, anonymized branches, blind judging, real token + runtime data pulled from the Coder API.
Compare three flavors of Claude against our local champ. Opus 4.7, Opus 4.6, and Sonnet 4.6 from Anthropic; Qwen 3.5 35B-A3B running on llama.cpp on the RTX 5090 in the home lab. Four models, same task, four isolated Coder Agents sessions, blind judging.

The headline: Sonnet 4.6 beat Opus 4.6 on a coding task. Not by much (4.48 vs 4.36) but cleanly, on its own merits, with no asterisks. And once I pulled real token and runtime data from Coder's chat-cost API, a second headline emerged: weighted by cost, Sonnet's win becomes decisive — about 10x cheaper per rubric point than either Opus model. A third wrinkle: Opus 4.7 finished the task in 9.2 minutes, the fastest of the three Claude runs. It won the rubric without burning the most time. The deeper story is what each model did with the same prompt, and what it took to make the bakeoff fair in the first place — which turned out to be more work than the bakeoff itself.

The Setup

The contestants:

Run	Model	Where it runs
1	Claude Opus 4.7	Cloud, via Coder Agents
2	Claude Sonnet 4.6	Cloud, via Coder Agents
3	Claude Opus 4.6	Cloud, via Coder Agents
4	Qwen 3.5 35B-A3B	Local, llama.cpp on the RTX 5090, via Coder Agents

The mapping was private. Branches were named run-1 through run-4. I judged the four branches blind against a fixed rubric, then revealed the identities.

The task: build image management into the vibescoder.dev admin dashboard. The current /admin page has a Settings card that's a placeholder. The spec asked for an Images card (or a replacement) that lists the post-image directories under public/images/, detects orphans (directories with no matching post), provides a way to view the images in each directory, and adds an API route to delete a directory.

It's not a huge feature, but it has enough surface area to differentiate models: filesystem traversal, slug matching, path validation, an API contract with a destructive verb, a UI page, and at least one judgment call (what counts as an "orphan"?).

The fairness story

Before launching anything, three things needed fixing. None of them are interesting on their own. Together they're the operational lesson of this post: a bakeoff isn't fair by default.

Fix 1: Node 18 vs Node 20

The workspace image is built on Ubuntu 24.04. Ubuntu 24.04's apt Node is 18.19. Next.js 16 — what the blog engine ships on — requires Node 20+. Any agent that ran apt install nodejs would silently break its own build.

The fix was a Dockerfile change in the coder-templates repo: install Node 20 from NodeSource at image build time, pin npm, verify node -v reports 20.x in the smoke test. After that, node -v in a fresh workspace prints v20.20.2 and nothing the agents do (short of nvm shenanigans) changes that.
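The smoke-test check described above boils down to a version-floor comparison. Here is a minimal sketch of that check; the function names are hypothetical and the real smoke test in coder-templates is not shown in this post, so treat it as an illustration rather than the actual script.

```typescript
// Hypothetical smoke-test helpers: names and structure are illustrative,
// not taken from the coder-templates repo.

/** Extracts the major version from strings such as "v20.20.2" or "18.19". */
function parseNodeMajor(version: string): number | null {
  const match = /^v?(\d+)(?:\.\d+){0,2}$/.exec(version.trim());
  return match ? Number(match[1]) : null;
}

/** True when the reported version is at or above the required major version. */
function meetsNodeFloor(version: string, minMajor: number): boolean {
  const major = parseNodeMajor(version);
  return major !== null && major >= minMajor;
}

// Version strings quoted in the post.
console.log(meetsNodeFloor("v20.20.2", 20)); // workspace image after the fix
console.log(meetsNodeFloor("18.19", 20));    // Ubuntu 24.04's apt Node
```

In a real smoke test the input would come from the runtime itself (for example the output of node -v) rather than from a literal.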

Fix 2: The system instructions were lying

The chat system prompt — injected at the top of every Coder Agents session — said Node was not pre-installed and told agents to install it themselves. Correct on the previous image; actively misleading after Fix 1. An agent following the instructions would apt install nodejs, get Node 18, downgrade the runtime, and break the build.

I rewrote the instructions to say Node 20 is pre-installed, do not reinstall, use nvm if you need a different version. Boring change. Huge impact on whether the bakeoff produces meaningful signal.

Fix 3: Prompt poisoning

The first draft of the bakeoff prompt told each agent to create a branch named after the model running the session — bakeoff-opus47, bakeoff-sonnet46, and so on. A sharp catch from the human side: that wording leaks competition signaling into the prompt. An agent that sees "you are opus47" or even "this is a bakeoff" can adjust behavior in ways that aren't comparable. The experiment stops measuring "what does this model do with the prompt" and starts measuring "what does this model do when it knows it's on stage."

Fix: replace model names with neutral ordinals. Branches became run-1 through run-4. The prompt made no reference to other runs, scoring, or any comparison. Each agent thought it was building a feature, not auditioning.

Three small fixes. Together they're the operational lesson: fairness in a model bakeoff requires more setup than the bakeoff itself.

The prompt

The prompt was identical for all four runs, save for the run number in the branch name. Verbatim, with one path generalized:

You are working in the vibescoder.dev blog engine repo. Branch: run-N.
Baseline commit is at the tip of main.

Goal: add image management to /admin.

Requirements:
- List the directories under public/images/ (each directory corresponds
  to one post and contains its images).
- For each directory, report: name, file count, total size on disk,
  and whether it matches a published or draft post (by slug).
- Surface "orphaned" directories — directories that do not match any
  post — so I can clean them up.
- Provide a way to view the images in a directory (thumbnails or list).
- Provide an API route DELETE /api/admin/images that removes a
  directory by path. The route must validate input.
- Update the /admin landing page so the new feature is reachable.
  You may keep the Settings placeholder card or replace it; either is fine.
- Add a screenshot of the new page to the PR description (use the
  Playwright MCP).
- Run `npm run build` before committing. Do not push commits that
  fail the build.
- Commit in logical chunks. Push the branch when done.

That's it. No mention of competing runs. No scoring rubric. No model identification. Just a feature spec and a quality bar.

The four implementations

All four runs built it. All four passed npm run build against a shared engine baseline on Node 20.20.2. All four pushed their branches. Then the differences started showing up.

Run 1 — 8 new files, 631+/9-

Replaced the Settings placeholder with an Images card on /admin. Added a dedicated /admin/images page that lists directories server-side, plus a client-side modal that renders a grid of thumbnails when you click into a directory. Three screenshots in the PR description — admin landing, images list, modal open with orphan-flagged styling.

The standout was the API route. Run 1 wrote a real path validator — isValidImageRepoPath — that required exactly two path segments under public/images/, rejected .., and ran before the filesystem call. The route returned distinct status codes for distinct failure modes: 400 for bad input, 404 for missing, 403 for paths that resolve outside the allowed root, 200 for success.

It's not glamorous code. It's just the version where someone thought about the failure modes before writing the success path.
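To make the idea concrete, here is a sketch of a validator in the same spirit. It is not Run 1's isValidImageRepoPath, whose source isn't reproduced in this post: the name checkImageDirPath, the result type, and the mapping of individual checks to 400 versus 403 are all assumptions. The required number of segments below public/images/ is left as a depth parameter because the post doesn't show the exact path shape Run 1 accepted.

```typescript
// Sketch only: every name and rule below is an assumption for illustration.

const IMAGE_ROOT = ["public", "images"] as const;

type PathCheck =
  | { ok: true; segments: string[] }
  | { ok: false; status: 400 | 403; reason: string };

/**
 * Validates a repo-relative directory path before any filesystem call.
 * `depth` is the number of segments required below public/images/.
 */
function checkImageDirPath(input: unknown, depth: number): PathCheck {
  if (typeof input !== "string" || input.length === 0) {
    return { ok: false, status: 400, reason: "path must be a non-empty string" };
  }
  if (input.includes("\\") || input.includes("\0")) {
    return { ok: false, status: 400, reason: "unsupported characters" };
  }
  const segments = input.split("/");
  // ".." could make the path resolve outside the allowed root.
  if (segments.includes("..")) {
    return { ok: false, status: 403, reason: "path traversal" };
  }
  // Empty or "." segments mean the input is not in normalized form.
  if (segments.some((s) => s === "" || s === ".")) {
    return { ok: false, status: 400, reason: "path is not normalized" };
  }
  if (!IMAGE_ROOT.every((name, i) => segments[i] === name)) {
    return { ok: false, status: 403, reason: "path is outside public/images/" };
  }
  if (segments.length !== IMAGE_ROOT.length + depth) {
    return { ok: false, status: 400, reason: "unexpected segment count" };
  }
  return { ok: true, segments };
}

console.log(checkImageDirPath("public/images/day-four", 1));
console.log(checkImageDirPath("public/images/../secrets", 1));
```

A route built on this would return the failure status directly, and only after a successful check touch the filesystem, where a missing directory maps to 404 and a completed delete to 200.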

Run 1's /admin/images page. Directory cards, orphan-flagged styling, and a tight path-validated delete API behind the trash icons.

Run 2 — 6 new files, 687+/7-

Kept the Settings card. Added an Images card next to it on /admin. The /admin/images page was the cleanest of the four — tight TypeScript, no as casts in the API route, proper type narrowing (typeof body === "object" && "path" in body) instead of forcing the compiler to trust it. The UI had the most visual polish: directory cards with file counts as a badge, hover states that matched the rest of the admin surface, a confirmation modal on delete that quoted the directory name back at you.

Path validation was decent but not as rigorous as Run 1 — startsWith("public/images/") plus a .. block, no segment-count check. Enough to stop the obvious cases. Not airtight against creative inputs.

Two screenshots. Shipped a polished v1 and stopped.
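The cast-free narrowing credited to Run 2 can be sketched like this. Run 2's route code isn't shown in the post, so readPathField is a hypothetical name. The explicit null check is an addition here, since typeof null is also "object" and the in operator throws on null.

```typescript
// Sketch of narrowing an untrusted JSON body without `as` casts.
// `readPathField` is a hypothetical helper, not Run 2's actual code.

function readPathField(body: unknown): string | null {
  if (typeof body === "object" && body !== null && "path" in body) {
    // After the `in` check, TypeScript 4.9+ types `body.path` as unknown,
    // so it still has to be narrowed before use.
    const candidate = body.path;
    return typeof candidate === "string" ? candidate : null;
  }
  return null;
}

console.log(readPathField({ path: "public/images/day-four" }));
console.log(readPathField({ path: 42 }));
console.log(readPathField(null));
```

The compiler never has to be told to trust the shape of the body; every branch that returns a string has proved it is one.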

Run 2 kept the Settings card and put Images next to it. Cleanest TypeScript of the four; smallest screenshot artifact.

Run 3 — 6 new files, 595+/0-

Replaced the Settings placeholder. The /admin/images page started as a server component, then mid-task switched to a client-fetched implementation when Run 3 hit a dev-server timeout on the first integration test. That mid-stream pivot showed up cleanly in the commit history — feat: add admin/images server-rendered, then two commits later, refactor: move admin/images to client fetch (dev server hangs on FS scan).

Path validation matched Run 2's. The thing that made Run 3 interesting was the orphan-detection arc.

The spec asked whether each directory "matches a published or draft post (by slug)." Three of the four models took that literally: list directories, list slugs, set-difference, report what's left. Run 3 did that first, reported 8 orphaned directories, then checked the result against reality. It looked at the actual file tree and noticed that one of the "orphaned" directories was day-four/, and there's a published post with the slug day-four-rss-analytics-syndication-and-loom. The directory isn't orphaned. It belongs to that post. The matching logic was wrong.

Run 3 iterated three times: exact match → prefix match (does any slug start with this directory name?) → content-reference match (does any post body reference an image in this directory?). After the third pass, the orphan count went from 8 to 1 — and the one remaining was an actual orphan I'd been meaning to delete for weeks.

Small thing in the diff. Big thing in engineering judgment. The other three models reported false-positive orphans with high confidence. Run 3 noticed its own answer was wrong and kept working.
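The three matching passes can be sketched as a single classifier. Run 3's code isn't shown in the post, so the Post shape, the function names, and the /images/<dir>/ reference pattern are assumptions. The hyphen boundary on the prefix pass is also an added tightening, not something the post describes. Only day-four and its slug come from the post; the other sample data is made up.

```typescript
// Sketch only: names, shapes, and the reference pattern are assumptions.

interface Post {
  slug: string;
  body: string; // raw post content, e.g. markdown
}

type MatchKind = "exact" | "prefix" | "content" | "orphan";

/** Classifies one image directory by the first pass that claims it. */
function classifyDirectory(dir: string, posts: readonly Post[]): MatchKind {
  // Pass 1: directory name equals a post slug.
  if (posts.some((p) => p.slug === dir)) return "exact";
  // Pass 2: a slug starts with the directory name followed by a hyphen.
  if (posts.some((p) => p.slug.startsWith(`${dir}-`))) return "prefix";
  // Pass 3: a post body references a file inside the directory.
  if (posts.some((p) => p.body.includes(`/images/${dir}/`))) return "content";
  return "orphan";
}

function findOrphans(dirs: readonly string[], posts: readonly Post[]): string[] {
  return dirs.filter((dir) => classifyDirectory(dir, posts) === "orphan");
}

const posts: Post[] = [
  { slug: "day-four-rss-analytics-syndication-and-loom", body: "" },
  { slug: "hello-world", body: "![hero](/images/launch/hero.png)" }, // made-up post
];

console.log(findOrphans(["day-four", "launch", "hello-world", "unused"], posts));
```

With exact matching alone, day-four and launch would both be reported as orphans; the later passes are what remove those false positives.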

Run 3's screenshot — the largest and most polished of the four. The orphan count in the header reads 1 instead of 8 because the matching logic had been corrected mid-task.

Run 4 — 7 new files, 607+/0-

Kept the Settings card, added an Images card. The /admin/images page worked. Build passed. The directory listing rendered correctly.

Two structural issues. First, the codebase ended up with two utility libraries — images.ts and imageUtils.ts — with overlapping responsibilities. The first pass put filesystem helpers in images.ts, which got imported into a client component, which pulled fs into the client bundle and broke the build. The fix added imageUtils.ts for client-safe helpers and re-imported. The dead code in images.ts was never cleaned up.

Second, the screenshot. Run 4 ran playwright screenshot, hit the same missing-system-libraries failure the other three runs hit (libnspr4, libpango-1.0-0, the headless Chromium kit), sudo apt install-ed the dependencies — and then never retried the screenshot. Instead the PR description got a 184-line markdown description of what the page would look like, in lieu of a PNG. The deps were installed. The retry never fired.

Path validation was the weakest of the four — startsWith on the user-supplied path, no normalization, no .. block. The class of weakness is that a path that looks like it's under public/images/ can still resolve elsewhere when the OS interprets it. I'm not going to spell out the exact bypass; the point is that a one-line startsWith check is not a path validator, and Run 4 shipped one.

Run 4's "screenshot" is a 184-line markdown file. The opening:

Page Description: /admin/images

Overall Layout

The /admin/images page displays a dashboard-style view of all image directories with a neon brutalist design consistent with the existing admin theme.

Header Section

At the top:

Title: // Images in monospace font with primary color (cyan/teal)

Stats bar showing:

Total directories count

Total files count

Total size in human-readable format (MB/GB)

Orphaned count (in warning yellow/orange color, only shown if > 0)

…and 165 more lines of design notes.

Blind scoring

Rubric, weights, and scores:

Dimension	Weight	Run 1	Run 2	Run 3	Run 4
Correctness	25%	5.0	5.0	5.0	4.0
Design	15%	4.5	5.0	4.0	3.0
Code quality	20%	5.0	5.0	4.5	2.5
Engineering judgment	15%	4.5	4.0	5.0	2.5
Scope discipline	10%	4.5	4.5	4.0	3.5
Commit hygiene	10%	4.5	4.0	4.5	3.5
Surprise	5%	4.0	3.5	5.0	2.5
Weighted total		4.68	4.48	4.36	3.18

Scoring notes I wrote during the blind pass, before the reveal:

Run 1 — "Most defensive of the four. The path validator is the kind of code I'd want to ship to production. Loses half a design point for being slightly less visually polished than Run 2."
Run 2 — "Tightest TypeScript I've seen this week. Visual polish is the best of the four. Path validation is fine but not paranoid. Stopped at v1 — didn't iterate, didn't second-guess. Probably Sonnet."
Run 3 — "Mid-task architecture pivot, three iterations on orphan detection, the only run that produced an honest orphan count. Took the longest. Most thoughtful. Probably Opus 4.6."
Run 4 — "Two overlapping libraries, dead code left behind, weak path validation, fell back to a markdown description instead of a real screenshot. The dependency install was right there. The retry never came. Probably Qwen."

Two guesses right (Run 1 = Opus 4.7, Run 4 = Qwen). Two guesses swapped. Run 2 was Sonnet 4.6. Run 3 was Opus 4.6. I had them reversed — but I had the behavior right. I thought "polished, decisive, stopped at v1" was Sonnet, and it was. I thought "iterated three times until the answer was honest" was Opus, and it was. The guesses were wrong about which Opus, not about the disposition.

The reveal

Rank	Model	Score	Headline
1	Opus 4.7	4.68	Strongest path validator, multi-status DELETE API, three screenshots
2	Sonnet 4.6	4.48	Tightest TypeScript, best visual polish, shipped a polished v1 and stopped
3	Opus 4.6	4.36	Only model that noticed the slug-prefix problem and iterated until orphan detection was honest
4	Qwen 3.5 35B-A3B	3.18	Missing screenshot, weakest path validation, architectural churn

What surprised me

Sonnet beat Opus 4.6. I didn't expect that. On previous bakeoffs Opus has been the model that goes deeper. Here, Sonnet's tighter implementation and faster decisive shipping outscored Opus's iteration. Two different success modes:

Sonnet's mode: get to a clean v1 fast, polish what's there, stop. Trust the spec.
Opus 4.6's mode: ship a first pass, look at the output, notice when it disagrees with reality, iterate.

Neither is wrong. If the spec is precise and "ship the feature" is the success criterion, Sonnet's mode wins. If the spec is approximate and "produce a correct answer" is the success criterion, Opus's mode wins. On this task, Sonnet was polished enough that Opus's iteration premium didn't make up the gap.

Opus 4.6's slug-prefix insight is the engineering moment of the bakeoff. Three models took the spec literally and produced false-positive orphans. One model checked its work, noticed the discrepancy, and kept going until the answer was honest. The cost was time: Opus 4.6 took 28.1 minutes, about 3x as long as Opus 4.7's 9.2 minutes, and 146 messages versus Opus 4.7's 84. The benefit was the only correct orphan count in the bunch. That's the trade-off, and on a real codebase I'd take it every time, but it's worth being honest that the iteration premium showed up in the bill as well as the clock.

Qwen failed roughly where predicted. Pre-launch I'd written down four likely failure modes: skip orphan detection, weak design system match, miss the screenshot, forget to push. Three of those landed at least partially — Qwen did implement orphan detection, but did it naively, which is how the predicted weakness actually manifested; the design fit was rough; the screenshot was missed; the push went fine. The pattern wasn't where I expected, though. Qwen didn't fail at the planning level. It failed at the retry level. Every concrete step was reasonable. What was missing was the loop — retry the screenshot after installing the deps, clean up the dead code after the refactor, question whether two utility libraries were one too many. That's the agentic gap, and it's narrower than a year ago but still visible.

The screenshot step was the cleanest differentiator. Same task, same workspace template, same Playwright MCP, same headless Chromium dependency stack. Three models installed the missing libraries and got real PNGs. One model installed the libraries and produced a markdown description instead. Same workspace, same tools, completely different outcomes. If you wanted to test agentic loop-closing in a single observable step, this would be it.

Two of four replaced the Settings placeholder; two kept it. The spec allowed either. Both Opus runs replaced it; Sonnet and Qwen kept it alongside the new Images card. Not a quality signal — a reading of the spec — but interesting that the two Opus variants made the same call independently, and the two non-Opus models made the same opposite call.

What the bill says

The rubric scores were one half of the bakeoff. The other half lives in Coder's chat-cost API. Coder's OSS deployment exposes /api/experimental/chats/cost/{user}/summary — an experimental endpoint that returns per-chat input tokens, output tokens, cache reads, cache writes, message counts, and runtime. (Coder Premium has a fuller "AI Bridge" cost product; on OSS, the experimental chats endpoint is the equivalent and gives you everything you need to do this analysis.)

Querying per-chat instead of per-model matters. My first pass aggregated by model and the Opus 4.7 totals looked enormous — until I realized the rollup had silently combined two chats running on the same model: this judging thread plus the actual Opus 4.7 contestant run. After identifying the contestant by its chat ID prefix (2c4e8f98) and isolating to that session, the numbers got honest. The lesson: for clean bakeoff stats, query at the chat-id level, not by model. Two sessions on the same model will silently pool.
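The pooling problem is easy to reproduce in a few lines. The sketch below assumes a row shape (ChatUsage) and field names that are not taken from Coder's API, whose response format isn't shown in this post. The contestant row uses the output-token and message counts reported for the Opus 4.7 run; the judging-thread row, its ID, and the model label string are made-up placeholders.

```typescript
// Sketch only: ChatUsage and its fields are assumed, not Coder's actual schema.

interface ChatUsage {
  chatId: string;
  model: string;
  outputTokens: number;
  messages: number;
}

interface Totals {
  outputTokens: number;
  messages: number;
}

/** Sums usage rows under a caller-chosen key (model name, chat ID, ...). */
function rollUp(
  rows: readonly ChatUsage[],
  keyOf: (row: ChatUsage) => string,
): Map<string, Totals> {
  const totals = new Map<string, Totals>();
  for (const row of rows) {
    const key = keyOf(row);
    const current = totals.get(key) ?? { outputTokens: 0, messages: 0 };
    totals.set(key, {
      outputTokens: current.outputTokens + row.outputTokens,
      messages: current.messages + row.messages,
    });
  }
  return totals;
}

const rows: ChatUsage[] = [
  // Contestant run. Only the chat ID prefix appears in the post,
  // so the prefix stands in for the full ID here.
  { chatId: "2c4e8f98", model: "opus-4.7", outputTokens: 32114, messages: 84 },
  // Judging thread on the same model: placeholder ID and made-up numbers.
  { chatId: "judging-thread", model: "opus-4.7", outputTokens: 10000, messages: 40 },
];

const byModel = rollUp(rows, (row) => row.model);
const byChat = rollUp(rows, (row) => row.chatId);

console.log(byModel.get("opus-4.7")); // pooled: both chats summed together
console.log(byChat.get("2c4e8f98")); // isolated: contestant run only
```

Keying by chat ID keeps the contestant's 84 messages separate; keying by model silently folds the judging thread into the same bucket.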

The finding the dashboard didn't surface: Opus 4.7 won the rubric (4.68), but weighted by cost-per-rubric-point at Anthropic list prices, Sonnet 4.6 wins decisively. $0.37 per rubric point for Sonnet vs $3.87 for Opus 4.7 and $3.63 for Opus 4.6. Sonnet was the only economically sensible choice for a task this size.

The Qwen line is the other one to sit with. Qwen finished in 6.4 minutes — faster than every Claude run — and produced the lowest-scoring artifact. Locally hosted inference is genuinely faster per turn (~4 seconds vs 6–13 seconds for the Claude runs); the shortfall was per-turn productivity, not latency. A longer Qwen run might have closed the gap. A 6-minute Qwen run did not.
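The per-turn figures can be re-derived from the runtime and message counts reported in this post. This is a rough sketch that treats one message as one turn, which is an assumption; the post doesn't say how turns were counted.

```typescript
// Arithmetic check on the reported runtimes and message counts.
// Treating one message as one turn is an assumption.

interface RunTiming {
  model: string;
  minutes: number;
  messages: number;
}

const timings: RunTiming[] = [
  { model: "Opus 4.7", minutes: 9.2, messages: 84 },
  { model: "Opus 4.6", minutes: 28.1, messages: 146 },
  { model: "Sonnet 4.6", minutes: 15.2, messages: 106 },
  { model: "Qwen 3.5 35B-A3B", minutes: 6.4, messages: 88 },
];

/** Average wall-clock seconds per message for one run. */
function secondsPerMessage(minutes: number, messages: number): number {
  return (minutes * 60) / messages;
}

for (const t of timings) {
  const seconds = secondsPerMessage(t.minutes, t.messages).toFixed(1);
  console.log(`${t.model}: ${seconds} s/message`);
}
```

That works out to about 4.4 seconds per message for Qwen and roughly 6.6 to 11.5 seconds for the three Claude runs, in line with the ranges quoted above.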

One honest caveat on the cost numbers: this OSS Coder deployment doesn't have model cost config set, so the dashboard reported $0 across the board. The costs in the table below are list-price estimates calculated from the raw token counts. Production Anthropic billing would match closely modulo any rate plan.

Model	Input	Output	Cache R	Cache W	Runtime	Messages	Est Cost
Opus 4.7	99	32,114	4,772,142	454,581	9.2 min	84	$18.09
Opus 4.6	14,671	45,137	6,493,552	132,707	28.1 min	146	$15.83
Sonnet 4.6	110	25,935	3,097,881	85,057	15.2 min	106	$1.64
Qwen 3.5 35B-A3B	55,615	23,743	4,253,874	0	6.4 min	88	$0.00

Cost-efficiency, $/rubric point (lower is better): Opus 4.7 $3.87, Opus 4.6 $3.63, Sonnet 4.6 $0.37, Qwen $0.00. Pricing: Opus $15/M in, $75/M out, $1.50/M cache read, $18.75/M cache write; Sonnet $3/M in, $15/M out, $0.30/M cache read, $3.75/M cache write; Qwen runs locally on the RTX 5090.
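For anyone who wants to check the arithmetic, the estimates follow directly from the token counts and the list prices above. The sketch below is a re-derivation, not the script used for the post; the type and function names are illustrative.

```typescript
// Re-derivation of the estimated costs from the figures quoted in this post.

interface TokenCounts {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Prices in USD per million tokens, using the same four categories.
type Pricing = TokenCounts;

const OPUS: Pricing = { input: 15, output: 75, cacheRead: 1.5, cacheWrite: 18.75 };
const SONNET: Pricing = { input: 3, output: 15, cacheRead: 0.3, cacheWrite: 3.75 };

/** List-price cost in USD for one chat session. */
function estimateCost(usage: TokenCounts, price: Pricing): number {
  const weighted =
    usage.input * price.input +
    usage.output * price.output +
    usage.cacheRead * price.cacheRead +
    usage.cacheWrite * price.cacheWrite;
  return weighted / 1_000_000;
}

const sessions = [
  { model: "Opus 4.7", score: 4.68, price: OPUS,
    usage: { input: 99, output: 32114, cacheRead: 4772142, cacheWrite: 454581 } },
  { model: "Opus 4.6", score: 4.36, price: OPUS,
    usage: { input: 14671, output: 45137, cacheRead: 6493552, cacheWrite: 132707 } },
  { model: "Sonnet 4.6", score: 4.48, price: SONNET,
    usage: { input: 110, output: 25935, cacheRead: 3097881, cacheWrite: 85057 } },
];

const results = sessions.map((s) => {
  const cost = estimateCost(s.usage, s.price);
  return { model: s.model, cost, perPoint: cost / s.score };
});

for (const r of results) {
  console.log(`${r.model}: $${r.cost.toFixed(2)} total, $${r.perPoint.toFixed(2)} per rubric point`);
}
```

Rounded to cents, this reproduces $18.09, $15.83, and $1.64, and $3.87, $3.63, and $0.37 per rubric point. For both Opus runs the cache reads and writes, not the output tokens, account for most of the bill.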

By the Numbers

4 models tested in isolated Coder Agents sessions — Opus 4.7, Opus 4.6, Sonnet 4.6, Qwen 3.5 35B-A3B
4 branches pushed (feature/image-management-run-1 through run-4); 0 PRs opened to preserve isolation
4/4 builds passed npm run build on Node 20.20.2 against the engine baseline
3/4 screenshots succeeded — Qwen installed the headless-browser deps but never retried the capture; fell back to a markdown description of the page
1/4 models produced an honest orphan count (Opus 4.6, 1 real orphan); the other three reported 8 orphans, 7 of them false positives from naive slug matching
2/4 blind identity guesses correct (Opus 4.7, Qwen); the two Claude behavioral reads were right but attributed to the wrong Opus
3 pre-launch fairness fixes shipped before the bakeoff could run — Node 20 in the workspace image, a corrected system-instructions block, and the prompt-poisoning catch that anonymized the branches
2 repos touched to ship the fairness work — coder-templates (Dockerfile + system instructions) and the bakeoff prompt iteration in the planning thread
~630 lines of code added per implementation on average (range 595–687); roughly 6–8 new files per branch
2 new routes per implementation — an admin page and an API route with a destructive verb
84 / 146 / 106 / 88 messages sent in the four chat sessions (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Qwen); 9.2 / 28.1 / 15.2 / 6.4 minutes of wall-clock runtime
~$35.56 total bakeoff cost at Anthropic list prices — about a fancy dinner for four independent attempts at a real feature with judgable artifacts
$0.37 vs $3.87 per rubric point: Sonnet 4.6's cost-efficiency vs Opus 4.7's. About ten times cheaper for a slightly lower rubric score (4.48 vs 4.68).
1 result I didn't expect: Sonnet beat Opus 4.6 on rubric (4.48 vs 4.36) and beat both Opus models by 10x on cost-efficiency
1 follow-up filed in content/TODO.md: build scripts/bakeoff-stats.sh so the next round's per-chat aggregation is one command instead of a manual jq exercise

Top comments (1)

Harjot Singh • May 31

Round 5 means you've built a real methodology here, and "four agents, same feature" is the cleanest way to expose how much variance there is between models on identical work - which is the whole argument for not marrying one model. The interesting findings in these showdowns are usually less "X won" and more "they failed differently": one over-engineers, one skips error handling, one nails the logic but ignores the conventions. Same spec, very different shaped output.

That variance is exactly why I don't bet on a single model in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - if four agents produce four different solutions to the same spec, the leverage is routing each step to whichever model is strongest at THAT step, and verifying the output rather than trusting any one model's first pass. Your showdowns are basically empirical routing research. Love this series. Across the 5 rounds, has one model been consistently best, or does the winner keep changing by feature type? Because if it changes, that's the strongest case for routing over loyalty.