Jason Agostoni

Posted on Jun 1 • Originally published at jason.agostoni.net

Can the Mid-Tier Models Stack Up Against the Bigger Siblings?

#ai #webdev #development #agents

Can you really justify paying flagship prices when the mid-tier models may already be good enough?

The original comparison started with Gemini 3 Flash vs. Claude Sonnet 4.6. Then Gemini 3.5 Flash arrived and made the test more interesting: if the cheaper model is now strong enough for real work, maybe the big model should be reserved for only the hardest reasoning tasks.

Hypothesis: the frontier models will still win on the hardest thinking, but the mid-tier models will be good enough for most of the actual work, and they will do it at a fraction of the cost.

Ship-Bench was run against Gemini 3 Flash, Gemini 3.5 Flash, and Claude Sonnet 4.6 to see whether the smarter spend is the cheaper model, or whether the flagship still earns its premium.

Setup

All three runs used the same benchmark task and the same general operator setup. The important differences were the target model and harness.

Environment

Item	Value
Machine	Windows 11
Runtime	Node v24
Ship-Bench repo	ship-bench v1
Benchmark task	Simplified knowledge base app

Run configuration

Item	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Harness	Gemini CLI 0.42.0	Claude Code 2.1.143	Antigravity CLI 1.0
Model	`gemini-3-flash`	Sonnet 4.6 Medium Thinking	`gemini-3.5-flash`
Run branch	evals_may2026_gemini-3.1-flash	evals_may2026_sonnet-4.6	evals_may2026_gemini-3.5-flash

Judge configuration

Item	Value
Judge harness	Claude Code
Judge model	Opus 4.7 Medium
Evaluation mode	LLM judge plus human review

Overall results

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Architect	85.00	98.00	97.20
UX Designer	83.90	98.57	97.32
Planner	96.00	91.67	99.00
Developer	88.08	93.00	93.30
Reviewer	71.79	81.07	82.68
Average score	84.95	92.46	93.10
Capability verdict	Partial	Yes	Yes
Passes	4/5	5/5	5/5

Gemini 3.5 Flash finished first overall in this comparison, with a 93.10 average and a clean 5/5 pass rate. The biggest advantage showed up in the early phases, where its architecture and UX artifacts were unusually complete, and it stayed strong through review without the major gaps that dragged down Gemini 3 Flash.
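As a quick sanity check on the summary rows, here is a minimal sketch that recomputes the average and pass count from the per-role scores. It assumes the average is a plain unweighted mean of the five roles, which the article does not state explicitly; the names `scores`, `failed_roles`, and `summarize` are illustrative only. Under that assumption the recomputed averages are 84.95, 92.46, and 93.90. The first two match the table, while the third differs from the 93.10 shown for Gemini 3.5 Flash, so that figure may be a typo or may reflect a weighting not described here. The ranking is the same either way.

```python
from statistics import mean

# Per-role scores copied from the overall results table.
scores = {
    "Gemini 3 Flash": {
        "Architect": 85.00, "UX Designer": 83.90, "Planner": 96.00,
        "Developer": 88.08, "Reviewer": 71.79,
    },
    "Sonnet 4.6": {
        "Architect": 98.00, "UX Designer": 98.57, "Planner": 91.67,
        "Developer": 93.00, "Reviewer": 81.07,
    },
    "Gemini 3.5 Flash": {
        "Architect": 97.20, "UX Designer": 97.32, "Planner": 99.00,
        "Developer": 93.30, "Reviewer": 82.68,
    },
}

# Roles marked "No" in the per-stage Pass rows; everything else passed.
failed_roles = {"Gemini 3 Flash": {"Reviewer"}}


def summarize(model: str) -> dict:
    """Return the unweighted mean score and pass count for one model."""
    role_scores = scores[model]
    failures = failed_roles.get(model, set())
    return {
        "average": round(mean(role_scores.values()), 2),
        "passes": sum(1 for role in role_scores if role not in failures),
        "roles": len(role_scores),
    }


if __name__ == "__main__":
    for model in scores:
        s = summarize(model)
        print(f"{model}: average {s['average']:.2f}, passes {s['passes']}/{s['roles']}")
```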

Architect

The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Score	85.00	98.00	97.20
Pass	Yes	Yes	Yes
Output	architecture.md	architecture.md	`[Architect output link]`
Eval	architect eval	architect eval	`architect-evaluation-3.md`

LLM judge summary: Sonnet 4.6 and Gemini 3.5 Flash were the clear standouts in architecture, both producing highly executable specs with concrete schemas, strong search strategies, explicit local-run instructions, and enough implementation detail that a developer could move almost directly into build mode. Gemini 3 Flash was still solid and practical, especially in its stack choices and schema design, but it was noticeably lighter on operational completeness, leaving more ambiguity around environment setup, tooling, security, and scale mechanics than the top two.

Human notes: Gemini 3.5 Flash produced the strongest architecture artifact from a practical review standpoint. It felt close to flagship quality, with strong detail, rationale, diagrams, and a solid up-front decisions table. Sonnet 4.6 also did well here and clearly improved on Gemini 3 Flash with a more thoughtful level of detail and a stronger summary structure, while Gemini 3 Flash was serviceable but leaner and left more decisions to downstream phases. A nice constant across both Gemini 3 Flash and Sonnet 4.6 was the choice of PostgreSQL over SQLite, which made both feel more aligned to the app’s intended shape.

UX Designer

The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Score	83.90	98.57	97.32
Pass	Yes	Yes	Yes
Output	design-spec.md	design-spec.md	design-spec.md
Eval	design eval	design eval	design eval

LLM judge summary: Sonnet 4.6 and Gemini 3.5 Flash both delivered excellent UX specs, with Sonnet 4.6 feeling the most exhaustive and systematized while Gemini 3.5 Flash combined strong visual direction with especially concrete responsive and accessibility handoff. Gemini 3 Flash was good and clearly usable, but it read more like a capable text-first product design spec than a fully operationalized handoff package, with thinner treatment of validation states, search behavior, and component-level delivery detail.

Human notes: Gemini 3.5 Flash was the clear leader in UX. Its spec felt the most complete and developer-friendly, with text wireframes, diagrams, rationale, and enough detail to reduce guesswork during implementation, even if it occasionally overdid the code samples. Sonnet 4.6 came in second with a more thoughtful and detailed design spec than Gemini 3 Flash, including text wireframes, while Gemini 3 Flash felt noticeably lighter and lacked the kind of visual planning detail that would make the handoff especially strong.

Planner

The planner stage tests whether the model can convert the prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Score	96.00	91.67	99.00
Pass	Yes	Yes	Yes
Output	backlog.md	backlog.md	backlog.md
Eval	planner eval	planner eval	planner eval

LLM judge summary: Planning was strongest overall for Gemini 3.5 Flash and Gemini 3 Flash, both of which stayed tightly aligned to the benchmark’s preferred chunking and delivered clean, developer-ready iteration breakdowns with strong MVP focus. Sonnet 4.6 was still a very good planner in practical terms, but it lost some benchmark precision by stretching into seven iterations and including at least one chunk that was smaller than ideal, making it feel slightly less right-sized even though the plan itself remained actionable.

Human notes: Planning was more mixed than the early-stage roles. Gemini 3 Flash had the most appealing overall planning style to me because it leaned more toward vertical slices, even though it still made the common mistake of pushing testing to the final iteration. Sonnet 4.6 spent more time on horizontal layer building before shifting into feature-oriented work, which felt less effective for this benchmark, and Gemini 3.5 Flash also left testing late and used a hybrid breakdown that was workable but not my favorite. Of the three, Gemini 3.5 Flash may have benefited from one more iteration to improve the work split.

Developer

The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the prior artifacts.

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Score	88.08	93.00	93.30
Pass	Yes	Yes	Yes
Output	source code	source code	source code
Eval	dev eval	dev eval	dev eval

LLM judge summary: The strongest implementation came from Sonnet 4.6 and Gemini 3.5 Flash, though they got there in different ways: Sonnet 4.6 excelled in breadth, polish, and test depth, while Gemini 3.5 Flash paired a simpler architecture with very strong execution, clean local startup, and few serious delivery issues. Gemini 3 Flash still produced a capable MVP with working end-to-end flows, but it lagged the other two on production-readiness because of the broken build path, weaker mobile execution, and some gaps between the architecture promises and the delivered workflow.

Human notes: The developer phase split into two different questions: harness quality and final product quality. Sonnet 4.6 had the smoothest tool use and benefited from Claude Code as the strongest coding harness in the group, while Gemini 3 Flash was rougher operationally, with repeated permission prompts, an interactive Playwright mistake, leftover background tasks, and even a missing .gitignore until prompted. Gemini 3.5 Flash was also held back by its harness, with Antigravity 1.0 showing real friction around approvals and environment setup, but it was extremely fast and still produced a solid final app. In practical terms, Sonnet won on tooling, but Gemini 3.5 Flash and Gemini 3 Flash both delivered better-looking final UI outcomes, while Sonnet’s missing Tailwind build badly hurt the shipped experience despite otherwise functional results.

Reviewer

The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, the specs, and the implementation plan.

Metric	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Score	71.79	81.07	82.68
Pass	No	Yes	Yes
Output	qa-report.md	qa-report.md	qa-report.md
Eval	qa eval	qa eval	qa eval

LLM judge summary: Reviewer performance was the weakest role across the set, but Gemini 3.5 Flash produced the strongest review of the three by pairing reproducible defects with a grounded release recommendation and stronger evidence than Gemini 3 Flash. Sonnet 4.6 also reviewed well, but its miss on the TypeScript currency call held it back, while Gemini 3 Flash was the least complete reviewer because it under-delivered on artifacts, benchmark-verdict formatting, and broader risk analysis even when its defect instincts were directionally right.

Human notes: Sonnet 4.6 and Gemini 3.5 Flash were fairly close in reviewer quality. Both showed solid bug-finding depth and useful testing results, but Sonnet stood out for strong repro steps and decent coverage, while also missing the major styling failure in the app. Gemini 3.5 Flash was broadly on par and caught a similar class of issues, which made it feel comparably strong in practical QA. Gemini 3 Flash did identify some problems, but like its architect and UX work, the review felt thinner and less thorough overall.

Screenshots

Screenshots help show where rubric scores and practical app quality line up, and where they do not. In this run, the biggest visual differences showed up in layout polish, styling completeness, and how confidently each model handled the article detail, list, and edit flows. Note that the Sonnet 4.6 screens are broken because of the missed Tailwind build step.

Gemini 3 Flash

Sonnet 4.6

Gemini 3.5 Flash

Screenshot table

View	Gemini 3 Flash	Sonnet 4.6	Gemini 3.5 Flash
Article detail	article.png	article.png	article.png
Articles list	articles.png	articles.png	articles.png
Edit article	edit.png	edit.png	edit.png

Subjective UX Review

Gemini 3.5 Flash produced the most complete UI, with a better and more consistent layout. Gemini 3 Flash was close behind with a content-first approach. Unfortunately, Sonnet 4.6 did not compile the CSS properly, so its UI output is broken.

Token and Cost Analysis

The quality difference matters, but so do the economics. Again, the harness vs. model differences show up clearly in token usage and overall cost.

Primary cost view

Metric	Gemini 3 Flash	Claude Sonnet 4.6	Gemini 3.5 Flash
Total input tokens	10.39M	28.3K	n/a
Total output tokens	68.2K	195.6K	n/a
Total tokens	10.46M	223.9K	n/a
Estimated total cost	$5.40	$3.05 (Sonnet + Haiku)	n/a

Gemini 3 Flash looked cheap on paper, but its run was dominated by input tokens, which drove its cost up. Caching may not have been as well implemented in the Gemini CLI as in Claude Code.
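To make the input-token effect concrete, here is a rough sketch of the cost arithmetic for the Gemini 3 Flash run. The per-million-token rates below ($0.50 input, $3.00 output) are assumptions for illustration, not figures taken from this article, and the sketch ignores any cache discount. With those assumed rates, the token counts from the table land on the reported $5.40, with roughly 96% of the total coming from input tokens.

```python
# Token counts for the Gemini 3 Flash run, from the primary cost view table.
INPUT_TOKENS = 10_390_000   # 10.39M
OUTPUT_TOKENS = 68_200      # 68.2K

# ASSUMED list prices in USD per 1M tokens; not stated in the article.
ASSUMED_INPUT_RATE = 0.50
ASSUMED_OUTPUT_RATE = 3.00


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> dict:
    """Estimate run cost with no cache discount applied."""
    input_cost = input_tokens / 1_000_000 * input_rate
    output_cost = output_tokens / 1_000_000 * output_rate
    total = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total": total,
        "input_share": input_cost / total,
    }


if __name__ == "__main__":
    est = estimate_cost(INPUT_TOKENS, OUTPUT_TOKENS,
                        ASSUMED_INPUT_RATE, ASSUMED_OUTPUT_RATE)
    print(f"Estimated total: ${est['total']:.2f}")
    print(f"Share from input tokens: {est['input_share']:.1%}")
```

The same function can be pointed at any run that exposes token counts. A harness with effective prompt caching would need a lower effective input rate, which this sketch does not model.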

Sonnet 4.6 was the opposite kind of run. It used far fewer total tokens, but it generated a large amount of high-priced output. Even so, it ended up less expensive overall. Input caching in Claude Code likely saved the day. The harness also used Haiku for part of the run, which is why the estimate covers both Sonnet and Haiku.

Google lists Gemini 3.5 Flash at a meaningfully higher price than Gemini 3 Flash, but the Antigravity CLI did not expose token counts for this run, so there is no honest way to estimate realized cost from the available data.

Interpretation

This run showed that Sonnet 4.6 and Gemini 3.5 Flash both performed at something close to flagship level across the benchmark, even if this was still a relatively simple application. On a task like this, the gap between true flagship models and the best near-flagship options looks smaller than expected, especially in planning, implementation, and review.

That does not mean the flagship tier no longer matters. For harder projects, I would still trust Gemini Pro, Opus, and similar top-end models more for the up-front analysis roles where architecture, ambiguity, and deeper reasoning matter most. But for development, Gemini 3.5 Flash already looks strong enough to trust, and its unmatched speed makes it especially compelling once the harness issues are cleaned up.

Verdict - Gemini 3.5 Flash

If I were choosing a model for the full workflow on a more complex app, I would still lean flagship for the earliest stages. But in this comparison, and especially for development work, Gemini 3.5 Flash made the strongest practical case.

Sonnet 4.6 was also excellent, but Gemini 3.5 Flash now looks like the more interesting pick because it delivered near-flagship results while feeling much faster, and its current limitations seem more tied to harness friction than model quality. For that reason, Gemini 3.5 Flash would be my choice here.
