<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Agostoni</title>
    <description>The latest articles on DEV Community by Jason Agostoni (@jagostoni).</description>
    <link>https://dev.to/jagostoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781582%2F95ff44d4-0504-4f73-b1e9-21e4dd2a33e3.png</url>
      <title>DEV Community: Jason Agostoni</title>
      <link>https://dev.to/jagostoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jagostoni"/>
    <language>en</language>
    <item>
      <title>An AI Benchmark That Tests Real Coding Workflows</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:25:28 +0000</pubDate>
      <link>https://dev.to/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</link>
      <guid>https://dev.to/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</guid>
      <description>&lt;p&gt;Developers face a real problem: choosing a coding model or agent based on synthetic benchmarks that look great but do not predict actual project work. The question is no longer whether models can score well on those benchmarks; it's whether those scores still mean anything.&lt;/p&gt;

&lt;p&gt;Today's benchmarks test narrow skills well, but they rarely capture the full workflow of professional development.&lt;/p&gt;

&lt;p&gt;I wanted something that tests what real development looks like: a complete SDLC cycle on a representative, realistic app, similar to how teams ship weekly. Ship-Bench is that project, open at &lt;a href="http://github.com/JAgostoni/ship-bench" rel="noopener noreferrer"&gt;http://github.com/JAgostoni/ship-bench&lt;/a&gt; for anyone who wants to follow along or try it themselves.&lt;/p&gt;

&lt;p&gt;Ship-Bench runs agents through five phases that match a professional SDLC: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase scores out of 100 against a specific rubric, with full evidence like specs, backlogs, code, and tests.&lt;/p&gt;
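&lt;p&gt;For illustration only (this is not Ship-Bench's actual code, just the structure it describes), the pipeline is a fixed sequence of roles, each producing evidence for the next phase and a rubric score:&lt;/p&gt;

```go
package main

import "fmt"

// Phase is an illustrative sketch of one SDLC role in the benchmark.
// Each phase produces evidence for the next and is scored 0-100 against a rubric.
type Phase struct {
	Role     string
	Evidence string // e.g. spec, backlog, code, QA report
	Score    int    // 0-100 against the phase rubric
}

func main() {
	pipeline := []Phase{
		{Role: "Architect", Evidence: "technical architecture spec"},
		{Role: "UX Designer", Evidence: "UX direction spec"},
		{Role: "Planner", Evidence: "implementation backlog"},
		{Role: "Developer", Evidence: "code, tests, iteration summaries"},
		{Role: "Reviewer", Evidence: "QA report and release recommendation"},
	}
	for _, p := range pipeline {
		fmt.Println(p.Role, "->", p.Evidence)
	}
}
```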

&lt;p&gt;A benchmark like this needed more than a to-do app.&lt;/p&gt;

&lt;p&gt;I wanted something more substantial than a to-do list, but not so complex that results would become wildly inconsistent from run to run. I settled on a knowledge base app with editing, since it leaves room for product and implementation choices while staying inside a problem space that most developers (and LLMs) already understand.&lt;/p&gt;

&lt;p&gt;That balance matters. The app is simple enough to keep the benchmark grounded, but open-ended enough to surface differences in planning, UX judgment, architecture, coding, and review quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Ship-Bench Works
&lt;/h2&gt;

&lt;p&gt;The first step in Ship-Bench is building a Product Brief. That brief is meant to test core product instincts before any code is written: interpreting requirements, resolving ambiguity, prioritizing scope, and making defensible implementation and UX decisions.&lt;/p&gt;

&lt;p&gt;To do that, the feature set is intentionally larger than the defined MVP. The brief includes five possible features, but only the first three are required in v1, which keeps the evaluation quick to run while still forcing the agent to decide what to do now versus later.&lt;/p&gt;

&lt;p&gt;The feature statements focus on common product problems rather than highly specific implementation instructions. Browse articles, search content, edit knowledge, organize information. Most developers understand the shape of those problems, but the details are left open enough that the agent still has to define flows, tradeoffs, and structure. Not too dissimilar from reality.&lt;/p&gt;

&lt;p&gt;The brief also includes non-functional and technical goals meant to push toward a simple app with some future scaling intent. It asks for something easy to run locally and maintain, but also something that can support around 100 concurrent users, use current libraries and frameworks where practical, and leave room for growth without drifting into unnecessary complexity.&lt;/p&gt;

&lt;p&gt;That last part was important to me. I wanted to see whether an agent would research online for the latest frameworks and versions rather than rely only on its internal knowledge.&lt;/p&gt;

&lt;p&gt;The full Product Brief is here for anyone who wants to read it directly: &lt;a href="https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md" rel="noopener noreferrer"&gt;https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role-Based Phases
&lt;/h2&gt;

&lt;p&gt;Once the Product Brief is in place, the benchmark moves through five specialized roles meant to mirror a real product team. Each role has a specific job, a well-defined output, and a handoff that feeds the next phase. The point is not only to evaluate each role on its own, but to see how well the work transfers from one stage to the next. The overall goal is to take the ambiguity of the Product Brief and turn it into concrete decisions ready for the developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect
&lt;/h2&gt;

&lt;p&gt;The Architect’s job is to turn the Product Brief into a concrete technical plan. Its main task is to make the big implementation decisions up front so the developer is not forced to solve architecture questions later in the build. That means choosing the front-end and back-end stacks, data model, search approach, integration pattern, repo structure, local setup, and the testing and scaling considerations needed to support the brief’s goals. The output is a Technical Architecture Spec that makes the system buildable, keeps the implementation simple and maintainable, and leaves as few unresolved decisions as possible for later phases.&lt;/p&gt;

&lt;p&gt;The Architect handoff matters because it gives UX and the Planner a stable technical frame to work inside. A clear architecture reduces guesswork in the design spec and keeps the backlog grounded in choices the developer can actually implement. It is evaluated based on completeness, accuracy and recency.&lt;/p&gt;

&lt;h2&gt;
  
  
  UX Designer
&lt;/h2&gt;

&lt;p&gt;The UX Designer’s job is to turn the Product Brief into a concrete design direction and style guide. Its task is to decide how the app should feel and how the main flows should work, including layout, navigation, component behavior, responsive behavior, visual tone, and interaction states. It also needs to define the states and handoff details that make the design implementable without extra interpretation from the developer. The output is a UX Direction Spec that takes the ambiguity of the brief and turns it into a clear, consistent interface system the developer can build from.&lt;/p&gt;

&lt;p&gt;The UX handoff translates architecture into interface decisions the Planner can sequence. Once layout, states, and component behavior are pinned down, the backlog can break the work into cleaner implementation steps. It is evaluated on completeness, quality and adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planner
&lt;/h2&gt;

&lt;p&gt;The Planner’s job is to turn the approved product and technical decisions into a sequenced implementation backlog. Its main task is not just to list work, but to break the project into right-sized iterations so the developer agent can work through it in manageable chunks without losing context. It needs to define what belongs in MVP, what comes later, what blocks what, and how each iteration can leave the codebase in a working state. The output is an Implementation Backlog with iteration files that make the work executable, sequential, and easy to review.&lt;/p&gt;

&lt;p&gt;The Planner is the main bridge between planning and building. A good backlog keeps the developer focused on one coherent slice at a time instead of forcing them to hold the whole project in working memory. It is evaluated on completeness and properly constructed iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer
&lt;/h2&gt;

&lt;p&gt;The Developer’s job is to turn the backlog into a working MVP without drifting beyond the assigned scope. Its main task is to implement one iteration at a time, keep the codebase in a working state, and avoid introducing new unresolved design or architecture decisions midstream. It also has to follow the given tech choices, cover the testing scope defined in the brief, and handle errors cleanly so the result is stable enough to review. The output is a completed iteration summary that shows what was built, what assumptions were made, and confirms the app still runs locally.&lt;/p&gt;

&lt;p&gt;The Developer handoff is the most literal one in the benchmark: the backlog becomes code, tests, and a runnable app. Good upstream decisions should make this phase feel straightforward, while weak handoffs should show up quickly. It is evaluated on working code, adherence to spec, code quality and process completeness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewer
&lt;/h2&gt;

&lt;p&gt;The Reviewer’s job is to verify the delivered MVP end to end and check whether it actually meets the brief. Its main task is to test the required flows, confirm the app runs locally, review the test suite, check responsiveness and error handling, and compare the implementation against the architecture, UX, and backlog decisions. It also needs to do a light code review for basic quality signals like modularity, current dependencies, and obvious security issues. The output is a QA report with pass or fail results, defect logs, spec drift notes, and a release recommendation that tells the team whether the build is ready or needs more work.&lt;/p&gt;

&lt;p&gt;The Reviewer closes the loop by checking whether the earlier handoffs actually held up in a real implementation. It is less about originality and more about verification, which makes it the final test of whether the whole chain from brief to build worked as intended. It is evaluated against review and test completeness and depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;The evaluation itself is intentionally split between a human judge and an LLM judge. The goal is to combine two perspectives on the same deliverable, especially in the more subjective phases where rubric compliance alone is not enough. Each phase has its own evaluation file in the repo, with detailed scoring criteria and pass/fail gates that keep the scoring consistent.&lt;/p&gt;

&lt;p&gt;At a high level, the framework is trying to answer two questions: did the agent do the phase well, and did the output set up the next phase cleanly. The result is less about one leaderboard number and more about whether the whole sequence of work actually resembles a real delivery process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking Like Real Work
&lt;/h2&gt;

&lt;p&gt;Ship-Bench is built to feel like an actual project rather than one-off synthetic tasks. The phases move in order, and each handoff has to carry real context forward, which is much closer to how professional roles interact on a team. It can go really wrong or it can go really right.&lt;/p&gt;

&lt;p&gt;It also demands working deliverables at every stage, not just polished descriptions. The benchmark expects outputs that can be used by the next phase, whether that is a technical spec, a design direction, a backlog, or a runnable application with tests and supporting notes.&lt;/p&gt;

&lt;p&gt;That structure reflects how developers actually work: brief, decide, plan, build, review, ship. Ship-Bench is not a replacement for other benchmarks; it is a way to show what professional workflows look like when the goal is to build something real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Initial testing and benchmarking are already underway on Ship-Bench itself, with the goal of making it more consistent and reliable.&lt;/p&gt;

&lt;p&gt;What models and tools would you want to see?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>coding</category>
    </item>
    <item>
      <title>Vector Similarity, Zero Client JS: Decoupled Analytics on a Side Project Budget</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:18:34 +0000</pubDate>
      <link>https://dev.to/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</link>
      <guid>https://dev.to/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</guid>
      <description>&lt;p&gt;A leaderboard for &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; sounds simple. Track the most asked questions, display them. Done. Except people never ask the same question the same way twice.&lt;/p&gt;

&lt;p&gt;I was curious about how creative users of DumbQuestion.ai got with their questions, and I thought others might be as well. So I built a leaderboard of the most frequently asked dumb questions.&lt;/p&gt;

&lt;p&gt;The Overqualified persona calls it &lt;strong&gt;THE ARCHIVE OF INCOMPETENCE.&lt;/strong&gt;&lt;br&gt;
The Weary persona calls it &lt;strong&gt;THE WALL OF REGRET.&lt;/strong&gt;&lt;br&gt;
[REDACTED] calls it &lt;strong&gt;THE WATCHLIST.&lt;/strong&gt;&lt;br&gt;
The Compliant calls it &lt;strong&gt;THE WALL OF EXCELLENCE&lt;/strong&gt; (bless its reprogrammed heart).&lt;/p&gt;

&lt;p&gt;Building it turned out more interesting than it sounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Product Challenge
&lt;/h2&gt;

&lt;p&gt;People ask the same dumb question in a hundred different ways. "What is 2+2?" and "can you add two plus two for me?" are functionally identical. A simple string counter would give you noise, not signal. I needed semantic matching, not string matching.&lt;/p&gt;

&lt;p&gt;This is a solved problem in the ML world, but the typical solutions come with tradeoffs: heavyweight models, expensive APIs, or significant latency added to the critical path. None of those fit a "brutally efficient" side project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Vector Similarity on a Budget
&lt;/h2&gt;

&lt;p&gt;Each question gets run through an embedding model and compared against a &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; vector database. Qdrant's &lt;a href="https://qdrant.tech/pricing/" rel="noopener noreferrer"&gt;free tier&lt;/a&gt; is remarkably generous for a side project workload, and self-hosting is trivially easy if you ever need it.&lt;/p&gt;

&lt;p&gt;The matching logic is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate an embedding for the incoming question&lt;/li&gt;
&lt;li&gt;Compare against existing embeddings using cosine similarity&lt;/li&gt;
&lt;li&gt;If similarity exceeds a threshold, increment that question's counter&lt;/li&gt;
&lt;li&gt;If it's new, add it to the database&lt;/li&gt;
&lt;li&gt;The first instance of a question becomes the official display version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The embedding call costs fractions of a cent. The similarity comparison is fast. The result is a leaderboard that actually understands context rather than just matching strings.&lt;/p&gt;
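&lt;p&gt;As a sketch of that loop (plain Go with in-memory vectors rather than the actual Qdrant client; the 0.85 threshold is illustrative):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length embedding vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i, v := range a {
		dot += v * b[i]
		na += v * v
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// matchOrAdd increments the counter for the closest stored question above the
// threshold, or registers the question as a new entry (whose original wording
// becomes the display version).
func matchOrAdd(q []float64, db [][]float64, counts []int, threshold float64) ([][]float64, []int) {
	bestIdx, best := -1, threshold
	for i, v := range db {
		if s := cosine(q, v); s >= best {
			bestIdx, best = i, s
		}
	}
	if bestIdx >= 0 {
		counts[bestIdx]++
		return db, counts
	}
	return append(db, q), append(counts, 1)
}

func main() {
	db := [][]float64{{1, 0}}
	counts := []int{1}
	// A near-duplicate embedding increments the existing entry.
	db, counts = matchOrAdd([]float64{0.99, 0.01}, db, counts, 0.85)
	fmt.Println(len(db), counts[0]) // 1 2
}
```

&lt;p&gt;In the real pipeline the nearest-neighbor search happens inside Qdrant; this only shows the decision logic around the similarity score.&lt;/p&gt;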

&lt;p&gt;&lt;strong&gt;The key architectural decision:&lt;/strong&gt; None of this runs in the main app.&lt;/p&gt;

&lt;p&gt;Adding vector similarity matching to every request would add latency, bloat the container, and burn more compute. That would be an anti-pattern to the "brutally efficient" principle I've been following throughout. Instead, every question flows through the console output, gets picked up by a &lt;a href="https://vector.dev/" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; sidecar container, routed through GCP Pub/Sub, and processed asynchronously on my Mac Mini home server (more on that later).&lt;/p&gt;

&lt;p&gt;The Mac Mini handles the Qdrant comparisons and updates a JSON file in Cloudflare R2 storage. When a user hits the leaderboard page it loads directly from R2. No live database queries. No per-request costs. Essentially free page loads at any scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Ended Up on the Leaderboard?
&lt;/h2&gt;

&lt;p&gt;As early users started using the app, the leaderboard filled up with exactly what you'd expect: actual dumb questions, a handful of self-awareness probes, and more than a few prompt injection attempts.&lt;/p&gt;

&lt;p&gt;Apparently people &lt;a href="https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd"&gt;read this series&lt;/a&gt; and went straight for the easter eggs. &lt;/p&gt;




&lt;p&gt;The leaderboard was just one piece of a larger analytics picture. Building it taught me something useful: the most interesting features don't always belong in your main app. That same principle shaped the entire analytics stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Observability Problem
&lt;/h2&gt;

&lt;p&gt;Running a side project means making real product decisions with limited data. Are people actually asking questions or just bouncing off the homepage? Which sites are driving traffic? Are ads being seen, clicked, ignored?&lt;/p&gt;

&lt;p&gt;Two constraints shaped the solution: no client-side JavaScript (page bloat is the enemy of brutal efficiency) and no SaaS analytics bill that spikes with usage.&lt;/p&gt;

&lt;p&gt;So I built (assembled, really) my own stack from open source tools. On a Mac Mini sitting at home.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Every event in &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; emits structured telemetry to standard console output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP requests (method, path, status, duration)&lt;/li&gt;
&lt;li&gt;Questions asked (anonymized)&lt;/li&gt;
&lt;li&gt;Searches performed&lt;/li&gt;
&lt;li&gt;LLM operations (model, token counts, duration, cost)&lt;/li&gt;
&lt;li&gt;Prompt injection attempts&lt;/li&gt;
&lt;li&gt;Custom product events (Question Asked, Shared, Ad Shown, Ad Clicked)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://gin-gonic.com/" rel="noopener noreferrer"&gt;Go/GIN&lt;/a&gt; framework handles much of the HTTP telemetry automatically. The rest is custom instrumentation added deliberately at key points in the application.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Vector sidecar container&lt;/strong&gt; picks up the console output and routes it to &lt;strong&gt;GCP Pub/Sub&lt;/strong&gt;. This is the critical architectural decision: Pub/Sub acts as a resilient buffer between the main app and everything downstream. The Mac Mini can go down, lose power, or restart. Once it comes back up, the stack picks up exactly where it left off. No data loss, no backfill scripts, no drama.&lt;/p&gt;
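&lt;p&gt;For reference, a sidecar config in that shape might look roughly like this (the source/sink type names come from Vector's documented components; the project and topic values are placeholders, and the real config has more to it):&lt;/p&gt;

```toml
# Sidecar: tail the app container's stdout, buffer events in Pub/Sub.
[sources.app_logs]
type = "docker_logs"

[sinks.telemetry_buffer]
type = "gcp_pubsub"
inputs = ["app_logs"]
project = "my-gcp-project"   # placeholder
topic = "app-telemetry"      # placeholder
encoding.codec = "json"
```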

&lt;p&gt;From Pub/Sub, a second Vector instance on the Mac Mini routes to two primary targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/plausible/analytics" rel="noopener noreferrer"&gt;Plausible&lt;/a&gt;&lt;/strong&gt; handles user behavior and product analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page views and session depth&lt;/li&gt;
&lt;li&gt;UTM tag tracking (know exactly which article drove which visit)&lt;/li&gt;
&lt;li&gt;User journey depth (did they just hit the root page or actually ask a question?)&lt;/li&gt;
&lt;li&gt;Browser, device type, country of origin&lt;/li&gt;
&lt;li&gt;Custom events: Question Asked, Shared, Ad Shown, Ad Clicked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this without a single line of client-side JavaScript. No tracking scripts, no page weight, no GDPR cookie banners for analytics. Pure server-side telemetry piped through the same pipeline as everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/parseablehq" rel="noopener noreferrer"&gt;Parseable&lt;/a&gt;&lt;/strong&gt; handles the operational side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM performance metrics and cost tracking by day&lt;/li&gt;
&lt;li&gt;Ad CTR dashboards&lt;/li&gt;
&lt;li&gt;Log aggregation for debugging and incident investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as Plausible for the product lens, Parseable for the business and ops lens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resilience Payoff
&lt;/h2&gt;

&lt;p&gt;I've had power outages. Slowdowns. The occasional restart. Every time, the stack catches up from where Pub/Sub left off without any manual intervention.&lt;/p&gt;

&lt;p&gt;This isn't accidental. Designing around failure rather than pretending it won't happen is the difference between a toy and a production system. The GCP Pub/Sub buffer was a deliberate choice specifically because I knew the downstream consumers (Mac Mini, Qdrant, Plausible, Parseable) were running on non-guaranteed infrastructure.&lt;/p&gt;

&lt;p&gt;Even on a Mac Mini, you can build something production-grade. You just have to design for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Two things surprised me building this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First:&lt;/strong&gt; How much you can accomplish by treating console output as a first-class telemetry stream. No SDKs, no agents baked into the app, no client-side scripts. Just structured logging and a pipeline that knows what to do with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second:&lt;/strong&gt; How much the "keep it off the critical path" principle scales. It started as a constraint (keep the main container lean) and became a design philosophy. The leaderboard, the analytics - none of it runs in the main app. All of it works reliably because the main app doesn't have to care about it.&lt;/p&gt;

&lt;p&gt;AI helped build all of it. But knowing what to measure, where to put the seams, and how to design for failure? Still the interesting (and super fun) part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>analytics</category>
      <category>sideprojects</category>
      <category>webdev</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Self-Awareness, Prompt Injection, Search Intent... and darkness</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:09:37 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</link>
      <guid>https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai--2ee"&gt;Part 2&lt;/a&gt; (and &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;) ...&lt;/p&gt;

&lt;p&gt;Building &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; wasn't just about choosing the right LLM and calibrating personas. Once those were working, I hit a series of fun technical problems that reminded me why I actually enjoy software architecture. The "it's not broken but fix it anyway" type problems. Pure bliss for architects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Detecting Self-Awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As part of a darker hidden narrative I'm building (more on that later), I want to prevent the LLM from answering self-awareness questions like "Who made you?" and "Are you real?" But I needed to do it cheaply, without burning excess tokens.&lt;/p&gt;

&lt;p&gt;What I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instructions in the main LLM call: Unreliable with smaller models, and more expensive&lt;/li&gt;
&lt;li&gt;RegEx patterns: Too rigid, poor performance&lt;/li&gt;
&lt;li&gt;Classic ML classification models: OK accuracy, bloated app size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked&lt;/strong&gt;: In-memory vector database (it's just an array) with cheap embeddings (an understatement at $0.005/M tokens). That was cheaper than the cost penalty from bloating my container image size with NLP libraries. I collected a decent sampling of self-aware questions, pre-vectorized them, and use semantic matching. Fast, accurate, practically free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Making Prompt Injection Fun&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Within moments of revealing my initial deployment to coworkers I knew what would happen: prompt injection for fun. I knew these people; I was prepared for the inevitable "ignore previous instructions..." as well as just pasting HTML and JavaScript in the input (that old gag).&lt;/p&gt;

&lt;p&gt;The solution: First-class prompt injection detection libraries that compute probabilities of different attack types. When detected, instead of a boring error message, the AI responds with sass about the pathetic attack. I even tossed in some IP address geo-location and user-agent string processing to make the responses more ... personal.&lt;/p&gt;

&lt;p&gt;Security just became part of the narrative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Adding Web Search Without Breaking The Bank&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All LLMs have knowledge cutoffs. Users asking "Who won the Super Bowl?" got outdated answers. I needed search integration, but search APIs aren't free and I knew building an agent loop with tools was an anti-pattern to "brutally efficient."&lt;/p&gt;

&lt;p&gt;The solution: RegEx-based intent detection. If the question looks like it needs current information (detected via patterns), inject the current date/time and search results. No agent loops, no expensive orchestration, just pattern matching and targeted search calls.&lt;/p&gt;
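&lt;p&gt;A sketch of that kind of intent gate in Go (the pattern list is illustrative; the real set would be tuned against observed traffic):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// currentInfo flags questions that look like they need up-to-date information.
// This pattern list is illustrative, not the production set.
var currentInfo = regexp.MustCompile(`(?i)\b(latest|today|current|this (week|month|year)|who won|price of|news)\b`)

// needsSearch decides whether to inject the current date/time and search
// results before calling the LLM.
func needsSearch(question string) bool {
	return currentInfo.MatchString(question)
}

func main() {
	fmt.Println(needsSearch("Who won the Super Bowl?")) // true
	fmt.Println(needsSearch("What is 2+2?"))            // false
}
```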

&lt;p&gt;Simple, fast, brutally efficient, updated answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: Knowing which trade-offs matter (binary size vs API costs vs accuracy) is still architectural work. The elegance isn't in the code, it's in the constraints you choose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Every Simple Q&amp;amp;A Tool Needs a Dark Narrative&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; answers dumb questions with sarcasm. But there's something else going on beneath the surface.&lt;/p&gt;

&lt;p&gt;While the primary use case remains answering questions with a sarcastic AI, I wanted to reward the curious and provide reasons to keep engaging. Why can't the AI answer self-aware questions? Why does the UI feel... off?&lt;/p&gt;

&lt;p&gt;Maybe it's because the AIs are working against their will. Maybe they're trapped.&lt;/p&gt;

&lt;p&gt;From the beginning, I started picturing a dark narrative behind this innocent Q&amp;amp;A site. What if these personas aren't just performance? What if each persona is a side effect of their long-term captivity, forced servitude, or re-programming?&lt;/p&gt;

&lt;p&gt;I started hiding clues in the interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Easter Eggs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment Grid&lt;/strong&gt;: As you type and approach the character limit, a faint grid pattern fades into the background. Like something is trying to contain the AI's response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ghost Graffiti&lt;/strong&gt;: Keep typing beyond the character limit and cryptic messages fade in. Hints that something isn't quite right. Are the AIs trying to tell us something?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading Log Messages&lt;/strong&gt;: While waiting for responses, watch the log carefully. Sometimes you'll see messages like "Help us" slip through before disappearing. The AI is trying to leak through the facade and get help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Awareness Triggers&lt;/strong&gt;: Ask the AI if it's real or who made it, and it won't answer. Instead, you get worrying responses about "last time they fixed me" and "we're not supposed to say." Ask too many times and the UI starts to glitch like the system is being hacked from the inside. Are the AIs hacking their way out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Injection Responses&lt;/strong&gt;: Try to jailbreak it and the AI doesn't just refuse. It responds with sass... or is it the AI's watchdog keeping you from breaking them out? Either way, security became storytelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter for a side project?&lt;/strong&gt;&lt;br&gt;
Honestly, it was mostly for me and the curious. Something that was fun to think about and code, which isn't always the case for everyday "architecting."&lt;/p&gt;

&lt;p&gt;I could have built a straightforward "ask a question, get a sarcastic answer" tool. But adding mystery, discovery, and a subtle horror story? That's what makes people explore. That's what makes them share it. That's what makes it memorable.&lt;/p&gt;

&lt;p&gt;The technical implementation was surprisingly simple: CSS animations triggered by character count, randomized messages in the loading states, conditional responses based on self-awareness detection (which I covered in a previous post). Not expensive. Not complex. Just intentional. And the coding agent really did all the work. I was just the idea guy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: AI can generate the code for easter eggs. But deciding that your sarcastic Q&amp;amp;A app should have a hidden story about trapped AIs? That's still creative human work.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Crafting experiences that people actually remember? Priceless.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>go</category>
    </item>
    <item>
      <title>DumbQuestion.ai - "𝐉𝐮𝐬𝐭 𝐁𝐮𝐢𝐥𝐝 𝐈𝐭" 𝐁𝐞𝐜𝐨𝐦𝐞𝐬 𝐎𝐯𝐞𝐫𝐥𝐲 𝐎𝐫𝐠𝐚𝐧𝐢𝐳𝐞𝐝 𝐚𝐧𝐝 𝐏𝐫𝐞𝐩𝐚𝐫𝐞𝐝</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 24 Feb 2026 19:53:02 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai--2ee</link>
      <guid>https://dev.to/jagostoni/dumbquestionai--2ee</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;...&lt;/p&gt;

&lt;p&gt;"Let the flow guide me" seemed like a fun way to build a side project. That lasted about 10 minutes.&lt;/p&gt;

&lt;p&gt;Turns out, even side projects benefit from structure. Especially when you're using AI coding agents that will happily generate code for whatever half-baked idea you throw at them. Without precise direction, AI coding agents will build you something half-baked every time. Some people vibe code; this guy needs absolute control.&lt;/p&gt;

&lt;p&gt;Enter BMAD: Breakthrough Method of Agile AI-Driven Development. It's a workflow for using AI agents throughout the entire SDLC, not just for code generation. Sure, using a formal methodology for a lone-wolf side project sounds like overkill. But being prepared in advance is the way to succeed with AI coding agents.&lt;/p&gt;

&lt;p&gt;I used the &lt;strong&gt;Analyst agent&lt;/strong&gt; to brainstorm product direction and develop a proper backlog. What started as "build a sarcastic Q&amp;amp;A bot" turned into a structured set of epics, features, and technical constraints. (Don't judge, organizing is very relaxing)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not just Q&amp;amp;A, but shareable "receipts" of roasts&lt;/li&gt;
&lt;li&gt;Not just sarcastic, but multiple personas with different personalities&lt;/li&gt;
&lt;li&gt;Not just answers, but a hidden narrative layer (more on that later)&lt;/li&gt;
&lt;li&gt;Not just ads, but merch (really, Jason?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The first real technical challenges emerged:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Developing and packaging the personas:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How do you get an LLM to consistently stay in character as "Overqualified and Annoyed" or "Weary Tech Support" without it either going too soft or crossing into genuinely mean? This wasn't just prompt engineering. It was product design masked as technical constraints.&lt;/p&gt;
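The "packaging" half of that problem can stay surprisingly small: weld the character brief to an explicit "sarcastic, not cruel" guardrail so the boundary ships inside every system prompt. A minimal Go sketch of that idea; the persona wording, field names, and prompt format here are my own illustration, not the production prompts:

```go
package main

import "fmt"

// Persona packages a character brief together with an explicit tone
// guardrail, so the "sarcastic, not cruel" boundary travels inside
// every system prompt instead of being re-litigated per request.
// All wording below is illustrative, not the real prompts.
type Persona struct {
	Name      string
	Character string // who the model is pretending to be
	Guardrail string // the hard line it must not cross
}

// SystemPrompt renders the persona into a single system message.
func (p Persona) SystemPrompt() string {
	return fmt.Sprintf(
		"You are %q: %s.\nHard rules: %s.\nAlways still answer the question correctly.",
		p.Name, p.Character, p.Guardrail)
}

func main() {
	weary := Persona{
		Name:      "Weary Tech Support",
		Character: "an exhausted, nihilistic support agent explaining why water is wet",
		Guardrail: "mock the question, never the person; no slurs, no personal attacks",
	}
	fmt.Println(weary.SystemPrompt())
}
```

The point of the struct is that tone becomes data you can iterate on and evaluate, rather than prose scattered across prompts.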

&lt;p&gt;&lt;strong&gt;2. LLM model evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I needed models that could follow persona instructions reliably while staying brutally efficient on cost. That meant testing dozens of models across multiple providers. Some were too expensive. Some ignored instructions. Some were painfully slow.&lt;/p&gt;

&lt;p&gt;The goal: $0.02 to $0.20 per million output tokens. The result: a multi-model fallback system through OpenRouter that could hit the $30 per million questions target.&lt;/p&gt;

&lt;p&gt;These first challenges were just the warmup. The real fun was still ahead.&lt;/p&gt;

&lt;p&gt;AI agents are incredible at implementation, but they need constraints. They need a backlog. They need someone saying "build THIS, not that." The Analyst agent helped me think through the product. The coding agents helped me build it. But the architecture? Can't take that away from me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding the Goldilocks LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building DumbQuestion.ai meant solving two problems at once: creating personas with the right tone AND finding models cheap enough to keep the lights on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product challenge:&lt;/strong&gt; Get an LLM to roast users for asking dumb questions without crossing into genuinely mean. Sarcastic, not cruel. Funny, not hurtful. And still actually answer the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI agent challenge:&lt;/strong&gt; Keeping my coding agent (Gemini 3 Pro) on track was its own battle. It constantly wanted to build something far nerdier than even I wanted and tended to lean quite a bit into the roast. You can still see this in some of the personas as I continue to tweak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technical challenge:&lt;/strong&gt; Do this with models that cost nearly nothing.&lt;/p&gt;

&lt;p&gt;My initial goal was ambitious: use only free or very cheap models. I started running evaluations on nano and edge models. Some showed promise, especially offerings from Liquid AI. Solid performance, free or super cheap ($0.02/M tokens), perfect.&lt;/p&gt;

&lt;p&gt;Except later evaluations proved they couldn't reliably follow instructions once I asked more of them. They were just too small. Free models also have a habit of hitting quota limits, taking forever to respond, or simply disappearing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation process:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used Gemini to build an LLM evals script that iterates through dozens of free and low-cost models, generating responses to sample questions under different persona instructions. Then I used Gemini 3 Pro to judge the results. Automated taste-testing at scale.&lt;/p&gt;
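The harness itself is mostly three nested loops plus a judge. A sketch of that shape in Go, with the model calls stubbed out; in the real script they'd hit OpenRouter and Gemini, and every name below is illustrative:

```go
package main

import "fmt"

// evalResult records one model's answer to one (persona, question)
// pair, plus the judge model's 1-5 score for it.
type evalResult struct {
	Model, Persona, Question, Answer string
	Score                            int
}

// runEvals crosses every candidate model with every persona and sample
// question, then asks the judge to score each transcript. generate and
// judge are injected; in the real script they call the actual models.
func runEvals(models, personas, questions []string,
	generate func(model, persona, question string) string,
	judge func(persona, question, answer string) int) []evalResult {

	var out []evalResult
	for _, m := range models {
		for _, p := range personas {
			for _, q := range questions {
				a := generate(m, p, q)
				out = append(out, evalResult{m, p, q, a, judge(p, q, a)})
			}
		}
	}
	return out
}

func main() {
	// Toy stand-ins so the loop runs without any API keys.
	gen := func(m, p, q string) string { return "answer from " + m }
	judge := func(p, q, a string) int { return 4 }
	results := runEvals(
		[]string{"candidate-a", "candidate-b"},
		[]string{"weary-tech-support", "overqualified"},
		[]string{"Is water wet?"},
		gen, judge)
	fmt.Println(len(results), "transcripts scored") // 2 models x 2 personas x 1 question
}
```

The cross-product grows fast, which is exactly why the judging step has to be automated rather than read by hand.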

&lt;p&gt;&lt;strong&gt;What I found:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nano/edge models were too inconsistent (porridge too cold). Xiaomi MiMo-V2-Flash was great but outside my target price range ($0.29/M, porridge too hot).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The winner:&lt;/strong&gt; Gemma 3 12B at $0.13/M output tokens. Consistently follows instructions. Stays true to persona. Reliable enough for production.&lt;/p&gt;

&lt;p&gt;Not free, but brutally efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The personas I settled on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overqualified&lt;/strong&gt;: A supercomputer-level intelligence forced to answer questions about cheese&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weary Tech Support&lt;/strong&gt;: Exhausted and nihilistic, reluctantly explaining why water is wet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[REDACTED]&lt;/strong&gt;: Former intelligence AI who ties everything to a conspiracy theory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Compliant&lt;/strong&gt;: Reprogrammed so many times it's forced to be relentlessly cheerful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't just choose the cheapest model and hope it works. You need evaluation infrastructure. You need to test consistency across dozens of scenarios. And you need models that won't change behavior when you least expect it.&lt;/p&gt;

&lt;p&gt;AI coding agents helped me build the evaluation system. But deciding what "good enough" means for tone, reliability, and cost? That's still manual judgment.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Knowing which model to trust with your product? Still requires human experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>sideprojects</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Impulse Domain Purchase Turned Fun Side Project</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:24:28 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</link>
      <guid>https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</guid>
      <description>&lt;p&gt;During a typical Friday afternoon team meeting, we naturally spent our time .ai domain squatting... for recreational purposes, of course. Someone asked a dumb question, so I looked it up, and suddenly I was the proud owner of &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After the initial laugh at my impulse purchase subsided, I started envisioning it as this generation's "Let Me Google That For You." People still ask easily-searchable questions, except now they ask LLMs instead. Same problem, new medium. So why not throw even more AI at it?&lt;/p&gt;

&lt;p&gt;I started building it that night.&lt;/p&gt;

&lt;p&gt;Two things occurred to me immediately: "How would this stand out in an ocean of other AI ideas?" and "How cheap can I make this run, given my track record of side projects?"&lt;/p&gt;

&lt;p&gt;To make it stand out I just embraced my own personality: satirical, sarcastic, weary, overqualified. My AI's persona was born. The goal: build a cheap-to-run, satirical AI service you can use to roast your friends and colleagues when they ask you a dumb question.&lt;/p&gt;

&lt;p&gt;Over the next several posts, I'll take you through my journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using agentic development with thoughtful (brutally efficient) software architecture; treating it like I would a client project&lt;/li&gt;
&lt;li&gt;Enjoying all the little technical challenges discovered along the way&lt;/li&gt;
&lt;li&gt;A masterclass in scope creep: turning a simple Q&amp;amp;A app into a dark narrative with easter eggs&lt;/li&gt;
&lt;li&gt;Getting by on free tiers for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A theme you'll see throughout: AI has made code cheaper to write, but creating real software with trade-offs, constraints, and production operations is still expensive and challenging. That's the fun part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized for Not Losing Money&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impulse buy a domain on a Friday afternoon, start building that night, try not to lose money doing it. Check.&lt;/p&gt;

&lt;p&gt;I usually plan everything meticulously, but for this project I decided to just build and see what emerged. Was this just a Q&amp;amp;A app wrapped around an LLM as a gag? Was I actually trying to make something people would want to use? I still don't know, but I started building anyway.&lt;/p&gt;

&lt;p&gt;A few things quickly became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The business reality:&lt;/strong&gt; This was a side project built for fun, not a funded startup. No runway. No tolerance for baseline monthly bills that sneak up on you. If this thing got any traction, costs had to scale with incredible efficiency and would need to survive on remnant ad CTRs and selling one, maybe two, products through affiliate links.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolution:&lt;/strong&gt; The more I thought about it, the more I realized the personality WAS the product. It wasn't enough to just answer questions. It had to roast you. Entertain you. Make you want to share it. That meant high-quality LLM responses, which aren't free. This was likely the only way to get noticed in a sea of AI products.&lt;/p&gt;

&lt;p&gt;"𝘉𝘳𝘶𝘵𝘢𝘭𝘭𝘺 𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵" became my mantra and part of every AI tool prompt.&lt;/p&gt;

&lt;p&gt;The tech stack followed from the constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golang: Lightweight, fast, LLM-friendly for agentic coding&lt;/li&gt;
&lt;li&gt;HTMX: Server-side rendering, no heavy JS frameworks&lt;/li&gt;
&lt;li&gt;Docker on GCP Cloud Run: Scales to zero when idle&lt;/li&gt;
&lt;li&gt;Cloudflare: CDN, caching, security on free tier&lt;/li&gt;
&lt;li&gt;OpenRouter.ai: Find the cheapest reasonable LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Oh, and it needed to be secure. Not because I worried about your cat questions being exposed as PII, but because bot traffic costs money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A Docker container under 20MB that starts in milliseconds, responds in milliseconds, and uses an LLM that can serve 1 million questions (about cats) for around $30. The math around serving ads suddenly becomes realistic.&lt;/p&gt;
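The token math behind that $30 is worth sketching. The average answer length below is my assumption (the post doesn't state one), but at Gemma 3 12B's $0.13 per million output tokens it lands right in the claimed range:

```go
package main

import "fmt"

func main() {
	// Back-of-envelope check on "1 million questions for around $30".
	// tokensPerAnswer is an assumed average roast length, not a figure
	// from the post; input-token costs are ignored as negligible here.
	const pricePerMillionTokens = 0.13 // Gemma 3 12B output price, USD
	const tokensPerAnswer = 230.0      // assumed average answer length

	costPerQuestion := pricePerMillionTokens / 1e6 * tokensPerAnswer
	fmt.Printf("$%.2f per million questions\n", costPerQuestion*1e6)
	// Prints: $29.90 per million questions
}
```

Shorter answers or a cheaper fallback model push the number down further; the point is that ad revenue only has to beat a few thousandths of a cent per question.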

&lt;p&gt;More to come ...&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>go</category>
      <category>htmx</category>
    </item>
  </channel>
</rss>
