Your Claude Code bill hit $340 this month. You switched to Sonnet 4 because everyone said it was faster. But nobody posted the actual numbers. A developer in Tokyo ran a month-long verification on exactly this — and the results contradict the consensus.
This week I found a Qiita post (Japan's largest developer community) that benchmarks four Claude models in Claude Code across real tasks. The author ran structured tests for 30 days, tracking token usage, response quality, and cost per task type. In a community where most posts are hot takes, this is the methodology many Western devs skip entirely.
Here's what they found — and what it means for your workflow.
The Japanese Approach to AI Tool Verification
Western devs tend to treat model selection as tribal knowledge: "I use Sonnet 4 because it feels snappier." Japanese dev culture flips this. The 検証メモ (kenshou memo — verification notes) format is a discipline: you document your testing methodology, state your hypothesis, run trials, and report results with enough specificity that someone else can reproduce it.
This Qiita post follows that format precisely. The author tested four models:
- Claude Opus 4 — highest capability, highest cost
- Claude Sonnet 4 — balanced performance (Western consensus pick)
- Claude Haiku — fast, cheaper, "good enough"
- A lesser-known model for specific task types — I'll explain why this matters
Each model was tested across five task categories: code generation, refactoring, debugging, documentation, and architectural advice. The metrics tracked:
- Tokens consumed per task
- Round-trip latency
- Post-generation revision rate (how often the output needed corrections)
- Subjective quality score (1-5)
The author used a structured prompt template across all tests to eliminate prompt variance. This matters — most "comparison" posts change prompts between models, making the data worthless.
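To make the setup concrete, here is a minimal sketch of how one trial and a fixed prompt template could be recorded. Everything in it is an assumption for illustration: the field names, the `Trial` class, the template wording, and `build_prompt` are hypothetical and do not come from the Qiita post.

```python
from dataclasses import dataclass
from string import Template

# Hypothetical fixed template: only the task fields vary between runs,
# never the surrounding wording, so every model sees the same prompt shape.
PROMPT_TEMPLATE = Template(
    "Task category: $category\n"
    "Task: $task\n"
    "Constraints: $constraints\n"
)


@dataclass(frozen=True)
class Trial:
    """One model run on one task (field names are illustrative)."""
    model: str             # your own label, e.g. "sonnet"
    category: str          # one of your task categories
    tokens: int            # tokens consumed for the task
    latency_s: float       # round-trip latency in seconds
    needed_revision: bool  # did the first output need corrections?
    quality: int           # subjective score, 1-5

    def __post_init__(self):
        # Reject scores outside the 1-5 scale so bad rows fail early.
        if not 1 <= self.quality <= 5:
            raise ValueError("quality must be between 1 and 5")


def build_prompt(category: str, task: str, constraints: str = "none") -> str:
    """Fill the fixed template; identical inputs always give identical prompts."""
    return PROMPT_TEMPLATE.substitute(
        category=category, task=task, constraints=constraints
    )
```

Logging one `Trial` per run keeps all four metrics side by side, and routing every prompt through `build_prompt` is one simple way to keep wording constant across models.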
What the Data Actually Shows
The findings that contradict conventional wisdom:
Sonnet 4 isn't always the sweet spot. For code generation tasks under 200 tokens, Haiku matched Sonnet 4's output quality in 73% of cases — at roughly 40% of the token cost. The consensus pick is optimized for capability, not cost efficiency at small task sizes.
Opus 4 earns its cost on architectural decisions. The author tracked "revision rate" — how often the first output required follow-up corrections. For architectural advice, Opus 4's revision rate was 12% versus Sonnet 4's 31%. At scale, those extra rounds compound fast.
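A rough way to see how those rounds add up, under a strong simplifying assumption that is mine and not the author's: every output, including each follow-up, independently needs another correction with the same probability. Under that assumption the expected number of generations per task is 1 / (1 - revision rate). The function name `expected_rounds` is hypothetical.

```python
def expected_rounds(revision_rate: float) -> float:
    """Expected generations per task, ASSUMING each output independently
    needs another revision with probability `revision_rate`."""
    if not 0 <= revision_rate < 1:
        raise ValueError("revision_rate must be in [0, 1)")
    return 1 / (1 - revision_rate)


# Revision rates reported in the post for architectural advice.
opus_rounds = expected_rounds(0.12)    # about 1.14 generations per task
sonnet_rounds = expected_rounds(0.31)  # about 1.45 generations per task

# Extra generations per 1,000 architectural tasks under this assumption.
extra_per_1000 = (sonnet_rounds - opus_rounds) * 1000  # about 313
```

This counts rounds only. It ignores the per-token price gap between the two models, so it shows where the extra work comes from, not which model is cheaper overall.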
The surprising winner for debugging: A model the Western community largely overlooks. For bug isolation tasks (not fix generation, just identifying the likely cause), it outperformed Sonnet 4 with a 28% lower token cost per successful diagnosis.
The True Cost Nobody Talks About
Here's the part that hits hardest: context switching has a cognitive tax that no one measures.
When you switch models mid-project, you're not just comparing outputs — you're recalibrating your mental model of how the AI "thinks." Sonnet 4 takes different approaches than Opus 4. Haiku has different failure modes. If you're switching based on task type (which this verification suggests you should), you're paying a switching cost every time.
The author's conclusion: the ideal workflow isn't model-per-task. It's model-per-complexity-tier, where you pre-assign tasks to models based on estimated complexity, not reactive switching.
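A model-per-complexity-tier setup could be sketched as a lookup that is decided once, before work starts. The tier names, the `estimate_tier` heuristic, and the model labels below are all hypothetical placeholders, not the author's assignments and not real API model identifiers. The 200-token cutoff simply reuses the small-task threshold mentioned earlier.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    STANDARD = "standard"
    COMPLEX = "complex"


# Hypothetical assignment table: replace with whatever your own data supports.
TIER_TO_MODEL = {
    Tier.SMALL: "haiku",
    Tier.STANDARD: "sonnet",
    Tier.COMPLEX: "opus",
}


def estimate_tier(expected_output_tokens: int, touches_architecture: bool) -> Tier:
    """Crude up-front complexity estimate (an illustrative heuristic only)."""
    if touches_architecture:
        return Tier.COMPLEX
    if expected_output_tokens < 200:
        return Tier.SMALL
    return Tier.STANDARD


def pick_model(expected_output_tokens: int, touches_architecture: bool) -> str:
    """Resolve the pre-assigned model for a task before starting it."""
    return TIER_TO_MODEL[estimate_tier(expected_output_tokens, touches_architecture)]
```

The point of the table is that the decision happens once per tier, so nobody has to re-decide the model in the middle of a task.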
The Skeptical Take
I want to push back on one assumption in this analysis: the "quality score" metric.
The author admits it was subjective: a 1-5 rating per output. For code generation, quality is measurable (does it compile? does it pass tests?). But for "architectural advice" and "documentation," subjectivity creeps in. The model that "feels" smarter might just be more verbose, and verbose output scores higher on vibe checks.
My rule: always test quality against a specific, measurable outcome, not a feeling. If the output required zero revisions on a compilable task, that's a hard data point. If it "seemed high quality," that's noise.
A Framework, Not a Prescription
Don't copy the author's model assignments. Their results are specific to their task mix, codebase, and team norms. What you should copy is their verification methodology:
- Pick 3-5 task categories that represent 80% of your Claude Code usage
- Set a consistent prompt template (no ad-hoc tweaking between tests)
- Track tokens consumed AND revision rate per output
- Run for at least 2 weeks to average out good/bad days
- Calculate cost-per-successful-task, not just cost-per-model
The Qiita post gave me a framework, not an answer sheet. That's the right way to use verification notes.
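The last step of the list, cost-per-successful-task, can be sketched as total spend per model divided by the number of tasks that model completed successfully. This is one reasonable definition, not necessarily the one the author used. The function name, the tuple layout, and the prices in the usage note are hypothetical; plug in your own logs and current pricing.

```python
from collections import defaultdict


def cost_per_successful_task(trials, price_per_1k_tokens):
    """Return {model: total spend / number of successful tasks}.

    trials: iterable of (model, tokens, succeeded) tuples.
    price_per_1k_tokens: {model: price per 1,000 tokens}, values assumed.
    Failed tasks still count toward spend, which is the whole point.
    """
    spend = defaultdict(float)
    wins = defaultdict(int)
    for model, tokens, succeeded in trials:
        spend[model] += tokens / 1000 * price_per_1k_tokens[model]
        if succeeded:
            wins[model] += 1
    # A model with spend but no successes gets infinite cost per success.
    return {
        model: (spend[model] / wins[model] if wins[model] else float("inf"))
        for model in spend
    }
```

For example, with made-up prices of 0.01 and 0.02 per 1,000 tokens, a model that burns 2,000 tokens for one success can end up costlier per success than a pricier model that succeeds in 500 tokens.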
Survival Checklist
- Audit your last month's Claude Code tasks — categorize them by complexity. If 60%+ are under 200 tokens, you're probably overpaying with Sonnet 4.
- Run a 2-week comparison on your top 3 task types. Track tokens and revision rate. The data will surprise you.
- Set model assignments by tier before you start, not during. Reactive switching adds cognitive overhead that can outweigh the token savings.
- Test one "off-brand" model quarterly — the Western consensus isn't always right, and the edges of the model roster are where cost savings hide.
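The audit in the first checklist item comes down to one ratio: the share of last month's tasks that fall under the small-task threshold. A minimal sketch, assuming you can export a per-task token count from your own logs (the function name and the sample numbers are hypothetical):

```python
def share_under(token_counts, threshold=200):
    """Fraction of tasks whose token count is strictly below `threshold`."""
    if not token_counts:
        return 0.0
    small = sum(1 for tokens in token_counts if tokens < threshold)
    return small / len(token_counts)


# Made-up sample: three of five tasks are under 200 tokens.
sample = [50, 150, 199, 200, 800]
small_share = share_under(sample)
overpaying_candidate = small_share >= 0.6  # the checklist's 60% rule of thumb
```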
What's your take?
Have you benchmarked different models in your AI coding workflow? What's the cost-quality trade-off you've measured? Drop a comment below — I respond to every one.
The Qiita verification notes are here if you want to read the original methodology in full: https://qiita.com/KNR109/items/aaa3ce165cb4efdabd18
Verification notes on Claude Code model switching from Japanese developer KNR109 on Qiita — benchmarking 4 models across 5 task categories with structured methodology.