The safe default for AI API calls is to route everything to the best model you can afford. I did it for months: Opus for every request -- commit messages, variable lookups, SQL optimization, architecture. All at roughly $0.17 per call.
We ran the numbers to see if that assumption holds.
Ten real developer tasks. Real API calls. Real billing. We sent each task through three setups: Komilion frugal tier, Komilion balanced tier, and Claude Opus 4.6 called directly via the Anthropic API.
The result: the balanced tier beat Opus on 6 of 10 tasks at half the cost. Frugal delivered 97% of Opus quality at 1.6% of the cost.
The Setup
3 configurations, 10 tasks:
| Tier | Model selection | Cost/task |
|---|---|---|
| Frugal | Cheapest capable model, auto-selected | ~$0.003 avg |
| Balanced | Mid-tier, optimized for developer tasks | ~$0.08 |
| Opus 4.6 direct | Claude Opus 4.6 via Anthropic API directly | ~$0.17 |
10 tasks from real developer work: code generation, debugging, explanation, SQL optimization, architecture design, commit messages, and more.
Judge: Hermione -- a Gemini 2.5 Flash LLM judge scoring each response head-to-head. 3 runs per comparison to reduce variance. Scores are head-to-head relative ratings, not absolute quality measures.
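The aggregation step itself is simple. A minimal sketch of how repeated judge runs can be combined into one head-to-head verdict (the function and field names here are my own illustration, not part of the actual pipeline):

```python
from statistics import mean

def head_to_head(judge_scores: list[tuple[float, float]]) -> dict:
    """Aggregate repeated judge runs for one task.

    Each tuple is (challenger_score, opus_score) from one judge run;
    averaging over runs reduces single-run variance.
    """
    challenger = mean(s[0] for s in judge_scores)
    opus = mean(s[1] for s in judge_scores)
    return {
        "challenger_avg": round(challenger, 2),
        "opus_avg": round(opus, 2),
        "challenger_wins": challenger > opus,
    }

# Three judge runs for one task: challenger vs Opus
result = head_to_head([(9.0, 8.5), (8.5, 8.5), (8.5, 8.0)])
```

Averaging over three runs is a variance reduction, not a guarantee; a single LLM-judge run on a close comparison can flip either way.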
Results
| Tier | Avg Score | Beat Opus | Cost/task | vs Opus cost |
|---|---|---|---|---|
| Balanced | 8.7/10 | 6 of 10 | $0.08 | 53% cheaper |
| Frugal | 8.3/10 | 3 of 10 | ~$0.003 | 98% cheaper (56x lower) |
| Opus 4.6 direct | 8.6/10 | -- | $0.17 | baseline |
Finding 1: Balanced beats Opus on most developer tasks
This was the finding I did not expect.
Balanced averaged 8.7/10. It beat Opus on 6 of 10 tasks. At $0.08/task vs $0.17 for Opus, that is better quality at half the cost.
For well-defined developer tasks -- write this function, debug this code, optimize this query -- the balanced tier routes to Sonnet-class models highly tuned for exactly this type of work. The judge consistently scored them at or above Opus on tasks with clear success criteria.
A specific example: Task 9. Frugal scored 8.67 and balanced scored 8.67, while Opus scored 8.33 and 7.67 across its two head-to-head comparisons. A task requiring real technical depth -- and both cheaper tiers outscored the frontier model. The same pattern appeared repeatedly across the 10-task run.
Where Opus still wins: Task 10 tells the other story. Opus scored 9.0. Balanced scored 8.0. Frugal scored 7.0. For complex, open-ended problems where output breadth and multi-step reasoning visibly matter, Opus produces noticeably more thorough results. The judge valued that. It is a real gap -- on a narrower set of tasks than most developers assume.
The tasks where Opus won cluster around a recognizable pattern: SQL optimization, unit test generation, REST API design. Tasks where the output has architectural depth, must satisfy multiple simultaneous constraints, or requires anticipating edge cases across a broad surface. On those, the frontier model earns its price tag. On the other 6 of 10 tasks, the balanced tier matched or outperformed it.
Finding 2: Frugal delivers 97% of Opus quality at 1.6% of the cost
Frugal averaged 8.3/10. It won 3 of 10 head-to-heads.
At $0.003/task vs $0.17 for Opus, frugal delivers 97% of Opus quality at 56x lower cost. The tasks frugal handles best make up the majority of most developers' API traffic: commit messages, short explanations, summarization, quick lookups, simple code generation.
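Both headline ratios fall straight out of the results table (the quoted 56x presumably comes from the unrounded per-task costs, since the rounded figures give closer to 57x):

```python
frugal_cost, opus_cost = 0.003, 0.17    # avg $/task from the results table
frugal_score, opus_score = 8.3, 8.6     # avg judge scores from the results table

quality_ratio = frugal_score / opus_score  # ~0.97 -> "97% of Opus quality"
cost_multiple = opus_cost / frugal_cost    # ~57x from rounded costs -> "~56x lower cost"
```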
The tasks where frugal struggles -- complex open-ended problems -- are real. For those, route to balanced or accept the Opus cost selectively.
The honest conclusion
Balanced is the better default for most developer workloads.
8.7/10 avg, 6 of 10 wins against Opus, $0.08/task. If you are routing everything to Opus, you are paying $0.17/call for results the balanced tier matches or beats on 60% of tasks.
Frugal is the cost optimizer for simple-task volume. 97% of Opus quality. 1.6% of the cost.
And on a specific subset of complex open-ended tasks, Opus still wins. That is not a bug -- it is the whole argument for intelligent routing. Know your task distribution. Route accordingly.
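"Route accordingly" can be sketched in a few lines. A toy router using the per-task costs measured above; the task-to-tier mapping is my own illustration, not Komilion's actual routing logic:

```python
# Avg per-task costs from the benchmark tables above
TIER_COST = {"frugal": 0.003, "balanced": 0.08, "opus": 0.17}

# Illustrative policy: simple high-volume tasks go frugal, well-defined
# dev tasks go balanced, open-ended multi-constraint work goes to Opus.
ROUTE = {
    "commit_message": "frugal",
    "summarization": "frugal",
    "code_generation": "balanced",
    "debugging": "balanced",
    "sql_optimization": "opus",
    "architecture_design": "opus",
}

def blended_cost(task_counts: dict[str, int]) -> float:
    """Average cost per call under the routing policy."""
    total = sum(task_counts.values())
    spend = sum(TIER_COST[ROUTE[t]] * n for t, n in task_counts.items())
    return spend / total

# A day that is mostly simple tasks with a few hard ones
day = {"commit_message": 50, "code_generation": 30, "architecture_design": 5}
routed = blended_cost(day)   # ~$0.04/call vs $0.17/call all-Opus
```

With that distribution the blended cost lands around $0.04 per call, roughly a quarter of sending everything to Opus, while the handful of genuinely hard tasks still get the frontier model.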
How to try it
OpenAI SDK compatible. The only change is the base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://www.komilion.com/api/v1",
    api_key="ck_your_key",  # free at komilion.com, no card
)

# Balanced -- recommended default based on this benchmark
response = client.chat.completions.create(
    model="neo-mode/balanced",
    messages=[{"role": "user", "content": "your prompt"}],
)

print(response.model_extra["komilion"]["brainModel"])
print(response.model_extra["komilion"]["cost"])
```
Works with Cline, Cursor, Roo Code, Continue, any OpenAI-compatible client.
All 30 responses from this benchmark run are published unedited at komilion.com/compare-v2 -- every response, every judge verdict, JSON download available.
Komilion is live on Product Hunt today: https://www.producthunt.com/posts/komilion -- if this was useful, an upvote takes 30 seconds.
Data from real API calls, February 2026. Phase 4 run: 10 tasks x 3 configurations = 30 calls. Judge: Hermione (Gemini 2.5 Flash), 3 runs per comparison.