When I tell developers "it automatically picks the cheapest model," the first question is always: how?
Here's the actual architecture.
The core problem
Every AI API call has a cost. The cost is determined by two things: which model handles it, and how many tokens are involved.
Opus 4.6 costs ~15× more per token than Gemini Flash. For a commit message or "what does this function return?" — you're paying 15× too much. For a 500-line architectural review — Opus is the right tool.
The routing problem is: classify each request quickly enough that the classification overhead doesn't eat your savings, then map it to the right model.
Layer 1: Regex fast-path (<5ms)
The first pass is a regex classifier that runs in a few milliseconds. It looks for explicit signals in the request:
Simple patterns (routes to frugal tier):
- Requests under ~100 tokens with common question patterns
- Commit message / changelog requests
- Single-line completions
- "What does X do?" / "Explain this variable" patterns
- Documentation / comment generation
Complex patterns (routes to premium tier):
- Multi-file refactoring language ("across the codebase", "all occurrences")
- Architecture or design requests ("design a system", "how should we structure")
- Debug-from-scratch requests with large context
- Agent-style requests with multiple steps
About 60% of requests hit a clear regex match and skip further classification. The cost of this step is effectively zero: no API calls, and only a few milliseconds of local latency.
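A minimal sketch of what such a fast-path could look like, assuming a handful of illustrative patterns. The pattern lists, the `regex_route` and `approx_tokens` names, the 4-characters-per-token estimate, and the choice to check complex patterns first are all assumptions for illustration, not the production rule set.

```python
import re
from typing import Optional

# Hypothetical patterns; the real rule set is not published in this post.
SIMPLE_PATTERNS = [
    re.compile(r"\b(commit message|changelog)\b", re.I),
    re.compile(r"^\s*what does .+ (do|return)\??\s*$", re.I),
    re.compile(r"\bexplain this (variable|function)\b", re.I),
]
COMPLEX_PATTERNS = [
    re.compile(r"\b(across the codebase|all occurrences)\b", re.I),
    re.compile(r"\b(design a system|how should we structure)\b", re.I),
]


def approx_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


def regex_route(request: str) -> Optional[str]:
    """Return 'premium', 'frugal', or None when no pattern matches clearly."""
    if any(p.search(request) for p in COMPLEX_PATTERNS):
        return "premium"
    short = approx_tokens(request) < 100
    if short and any(p.search(request) for p in SIMPLE_PATTERNS):
        return "frugal"
    return None  # ambiguous: hand off to the next layer
```

A `None` result is the signal that the request falls through to the slower classifier; everything else is decided locally without a network call.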
Layer 2: LLM micro-classifier (~100-200ms, ~$0.00008/call)
Ambiguous requests — "fix this" with a 200-line code block, or "improve this" — don't match regex cleanly. These go to a micro-classifier: a small, cheap model (Gemini Flash Lite) with a structured prompt.
The classifier prompt returns JSON:
{
  "complexity": "simple|moderate|complex",
  "task_type": "code_review|debugging|generation|explanation|other",
  "confidence": 0.85
}
This classification call costs $0.00008, about 1/70th of the cost of a typical frugal-tier request. Even if the classifier runs on every call, the overhead is negligible.
Below a confidence threshold, the request escalates to the next tier rather than risk a low-quality response.
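The escalation rule can be sketched as follows. The 0.7 threshold, the complexity-to-tier mapping, and the fallback to premium on a malformed reply are assumptions; the post does not state the real values.

```python
import json

TIERS = ["frugal", "balanced", "premium"]
# Assumed mapping from classifier output to tier.
COMPLEXITY_TO_TIER = {"simple": "frugal", "moderate": "balanced", "complex": "premium"}
CONFIDENCE_THRESHOLD = 0.7  # hypothetical value


def tier_from_classifier(raw_json: str) -> str:
    """Map the classifier's JSON reply to a tier, escalating when unsure."""
    try:
        result = json.loads(raw_json)
        tier = COMPLEXITY_TO_TIER[result["complexity"]]
        confidence = float(result["confidence"])
    except (ValueError, KeyError, TypeError):
        # Malformed or unexpected reply: use the safest tier (an assumption).
        return "premium"
    if confidence < CONFIDENCE_THRESHOLD:
        # Escalate one tier, capped at premium.
        return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
    return tier
```

Escalating by one tier rather than jumping straight to premium keeps the cost of an uncertain classification bounded.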
Layer 3: Model selection (deterministic)
Once complexity is known, model selection is deterministic — it's a lookup, not another API call.
The model pool is scored weekly from LMArena Elo rankings and Artificial Analysis benchmarks, weighted by:
- Quality score for the task type
- Cost per token
- Speed (tokens/second)
- Availability (uptime over the past 7 days)
For each tier, there's a ranked list. The top-scoring available model wins.
frugal: [gemini-3-flash, deepseek-v3.2, llama-4-maverick, ...]
balanced: [claude-sonnet-4-6, gemini-3-pro, ...]
premium: [claude-opus-4-6, ...]
The list is updated weekly from benchmark data, reviewed, then deployed. If a model degrades in benchmarks or has uptime issues, it drops in the ranking.
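As a rough illustration of the ranking step, here is a weighted score over the four factors listed above. The weights, the 0-to-1 normalisation, and the placeholder model names and numbers are all made up for the sketch; the actual scoring formula is not given in this post.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class ModelStats:
    name: str
    quality: float  # 0..1, normalised benchmark score for the task type
    cost: float     # 0..1, normalised cost per token (higher = pricier)
    speed: float    # 0..1, normalised tokens/second
    uptime: float   # 0..1, availability over the past 7 days


# Hypothetical weights; they sum to 1.0.
WEIGHTS = {"quality": 0.5, "cost": 0.2, "speed": 0.15, "uptime": 0.15}


def score(m: ModelStats) -> float:
    """Higher is better; cost is inverted so cheaper models score higher."""
    return (WEIGHTS["quality"] * m.quality
            + WEIGHTS["cost"] * (1.0 - m.cost)
            + WEIGHTS["speed"] * m.speed
            + WEIGHTS["uptime"] * m.uptime)


def rank(models: Iterable[ModelStats]) -> List[ModelStats]:
    """Build the per-tier ranked list (done offline, e.g. weekly)."""
    return sorted(models, key=score, reverse=True)


def pick(ranked: List[ModelStats], available: Set[str]) -> Optional[str]:
    """Request-time lookup: first ranked model that is currently available."""
    for m in ranked:
        if m.name in available:
            return m.name
    return None


# Placeholder data, not real benchmark figures.
ranked = rank([
    ModelStats("model-a", quality=0.9, cost=0.8, speed=0.5, uptime=0.99),
    ModelStats("model-b", quality=0.7, cost=0.1, speed=0.9, uptime=0.95),
])
```

The split matters: `rank` runs when the list is rebuilt, while `pick` is the only part on the request path, which is why selection needs no extra API call.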
Layer 4: Provider failover
The winning model goes to its provider API (via OpenRouter for most, direct for some). If the call fails — 429, 500, timeout — the next model in the ranking takes over, silently.
From the caller's perspective: one API call, one response. The failover is invisible.
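A failover loop along these lines might look like the sketch below. `ProviderError` and the `call_model` callback are hypothetical stand-ins for whatever HTTP client wraps the provider; no real provider SDK is assumed.

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error for retryable failures (429, 500, timeout)."""


def call_with_failover(
    ranked_models: Sequence[str],
    prompt: str,
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in ranking order; return (model_used, response)."""
    last_error: Optional[Exception] = None
    for model in ranked_models:
        try:
            return model, call_model(model, prompt)
        except ProviderError as exc:
            last_error = exc  # remember the failure, move to the next model
    raise RuntimeError("all models in the tier failed") from last_error
```

Returning the model name alongside the response is what lets the routing metadata report which model actually answered, even after a silent failover.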
What the response includes
Every response includes routing metadata in data["komilion"]:
{
  "komilion": {
    "cost": 0.000063,
    "latencyMs": 1829,
    "neo": {
      "mode": "direct",
      "brainModel": "google/gemini-3-flash-preview"
    }
  }
}
brainModel tells you exactly which model handled the request. cost is the actual dollar amount charged. There's no black box — you can log every call and audit routing decisions.
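For auditing, the metadata can be flattened into a log record. This sketch assumes the response body has already been parsed into a dict named `data` and uses only the fields shown above; the `routing_record` helper and its output keys are illustrative names.

```python
import json

# The sample payload from this post, standing in for a parsed API response.
data = json.loads("""
{
  "komilion": {
    "cost": 0.000063,
    "latencyMs": 1829,
    "neo": {"mode": "direct", "brainModel": "google/gemini-3-flash-preview"}
  }
}
""")


def routing_record(data: dict) -> dict:
    """Extract the routing fields, tolerating missing keys."""
    meta = data.get("komilion", {})
    neo = meta.get("neo", {})
    return {
        "model": neo.get("brainModel"),
        "mode": neo.get("mode"),
        "cost_usd": meta.get("cost"),
        "latency_ms": meta.get("latencyMs"),
    }


print(json.dumps(routing_record(data)))
```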
The "wrong tier" problem
What happens when the classifier routes a complex request to a cheap model?
The response quality degrades. The user notices and re-runs on premium. Net cost: ~$0.006 (frugal attempt) + $0.55 (premium) = $0.556, about 1% more than premium alone. The cost downside of a misroute is small, and the upside applies to the 70% of calls that route correctly.
The classifier accuracy on unambiguous requests (simple questions, formatting tasks) is effectively 100%. The ambiguous middle — "improve this function" — is where misroutes happen, and those are also the cheapest to correct.
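The arithmetic generalises to a simple expected-cost model. This sketch reuses the two per-request figures quoted above and assumes every misrouted request is retried exactly once on premium; the 30% misroute rate is only an illustrative input.

```python
FRUGAL_COST = 0.006   # per-request figure quoted in this post
PREMIUM_COST = 0.55   # per-request figure quoted in this post


def expected_cost(misroute_rate: float) -> float:
    """Average cost when a frugal attempt is retried on premium on a misroute."""
    return FRUGAL_COST + misroute_rate * PREMIUM_COST


def break_even_misroute_rate() -> float:
    """Misroute rate at which frugal-first costs the same as always-premium."""
    return (PREMIUM_COST - FRUGAL_COST) / PREMIUM_COST


print(round(expected_cost(0.3), 3))           # 0.171
print(round(break_even_misroute_rate(), 3))   # 0.989
```

Under these assumptions, frugal-first stays cheaper on average unless nearly every request is misrouted. The model covers dollars only; it ignores the time a user spends on a retry.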
OpenRouter vs. Komilion
The most common question: "isn't this just OpenRouter?"
OpenRouter is a model marketplace. You specify the model, they route the call to the right provider. You choose; they deliver.
Komilion sits on top of OpenRouter (and direct providers). You specify a tier (frugal, balanced, premium). The classifier chooses the model. The routing is the product.
You can replicate this manually: classify every request yourself, maintain a scored model list, implement failover logic. That's roughly what the Oracle does. The question is whether you want to maintain that infrastructure or just make API calls.
What it doesn't do
No routing system is perfect. A few honest limitations:
Context-blind: The classifier sees the request text, not your repo, your history, or your workflow. "Fix this" without context is genuinely ambiguous.
No learning: Routing decisions are deterministic from the model list, not learned from your usage patterns. There's no personalization yet.
Latency overhead: The LLM micro-classifier adds ~100-200ms. For interactive use this is rarely noticeable; if you need sub-100ms overhead, requesting the frugal or premium tier directly avoids it.
The open questions
The routing architecture has solved the easy problem: obvious task type → right model. The hard problem — learning from outcome quality, personalizing to user patterns, handling deeply ambiguous requests — is on the roadmap.
For now: 70% of requests route correctly on the first try. For most developer workloads, that means significant cost reduction with no manual effort.
Benchmark data to verify this yourself: komilion.com/compare — same prompts through all three tiers, outputs and costs side by side.