This is the technical deep dive into how Komilion routes AI requests across 400+ models. I'm sharing the architecture, the trade-offs, and the things that broke along the way.
If you just want to save money on AI APIs, read my cost comparison guide. This post is for people who want to understand the engineering.
The Architecture
+------------------------------+
|     Incoming API Request     |
|   (OpenAI SDK compatible)    |
+---------------+--------------+
                |
+---------------v--------------+
|      LAYER 1: Fast Path      |
|       Regex Classifier       |
|     < 5ms - catches ~60%     |
+---------------+--------------+
                |
           Classified?
           /        \
         YES         NO
          |           |
          |   +-------v------------------+
          |   | LAYER 2: LLM Classifier  |
          |   |       Gemini Flash       |
          |   |  ~200-400ms - ambiguous  |
          |   +-------+------------------+
          |           |
          +-----+-----+
                |
+---------------v--------------+
|    LAYER 3: Model Scoring    |
|  Benchmark-based selection   |
|    Deterministic - < 1ms     |
+---------------+--------------+
                |
+---------------v--------------+
|   LAYER 4: Provider Router   |
|  Failover - Load balancing   |
|        Health checks         |
+---------------+--------------+
                |
+---------------v--------------+
|        Model Provider        |
|  (OpenAI/Anthropic/Google)   |
+---------------+--------------+
                |
+---------------v--------------+
|  Response + Cost Metadata    |
|     komilion.cost field      |
+------------------------------+
Let me walk through each layer.
Layer 1: The Regex Fast Path
This is the simplest and most impactful layer: pattern matching on the prompt text to identify obvious task types.
| Task Category    | Pattern Examples                       | Model Tier |
|------------------|----------------------------------------|------------|
| translation      | "translate", "to french", "to spanish" | Simple     |
| summarization    | "summarize", "tldr", "in brief"        | Simple     |
| formatting       | "format as", "convert to json/csv"     | Simple     |
| spelling/grammar | "spell check", "fix grammar"           | Simple     |
| simple Q&A       | short question with "?"                | Simple     |
| code formatting  | "prettify", "indent", "lint"           | Simple     |
| research         | "research", "comprehensive analysis"   | Complex    |
| architecture     | "architect", "design system"           | Complex    |
| long input       | > 500 words in prompt                  | Complex    |
| multi-step       | "step by step", "first...then"         | Complex    |
| everything else  | (no pattern match)                     | Medium     |
Why regex first?
Three reasons:
- Speed. Regex matching takes <5ms. Zero API calls, zero network latency. For 60% of requests, the routing overhead is negligible.
- Cost. No LLM classifier call needed. That's $0.00002 saved per request — which adds up at scale.
- Determinism. Regex always gives the same answer for the same input. No stochastic variation in routing.
The trade-off: Regex is brittle. "Please translate my complex legal document from English to Mandarin while preserving all nuances of international contract law" matches "translate" and gets routed to Simple. That's wrong — it needs a frontier model.
We handle this with a confidence score. If the regex match hits but the prompt has signals of complexity (length, technical vocabulary, multiple instructions), it gets escalated to Layer 2.
Layer 2: The LLM Classifier
For the ~40% of requests that don't hit the fast path, we use a cheap LLM (Gemini Flash) to classify the task.
The classifier prompt (simplified):
Classify this user prompt into exactly one category:
- simple: translations, Q&A, formatting, summaries, spelling
- coding: code generation, debugging, review
- reasoning: math, logic, multi-step analysis
- creative: writing, brainstorming, content creation
- research: deep analysis, paper review, comprehensive reports
Also rate complexity: low, medium, high
Prompt: "{user_prompt}"
Respond in JSON: {"category": "...", "complexity": "..."}
Why Gemini Flash? It's the cheapest classifier option at ~$0.00002 per call. At 10K classifications/month, that's $0.20/month. Negligible.
Latency: 200-400ms for the classification call. This is the main routing overhead for ambiguous queries. For most API use cases, this is smaller than the model inference time itself (which ranges from 500ms to 10+ seconds for complex tasks).
Accuracy: The classifier correctly identifies task category ~85% of the time. The remaining 15% are edge cases where the category is genuinely ambiguous (e.g., "explain quantum computing" — is it simple Q&A or deep research?). For these, the balanced tier handles it gracefully.
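The round-trip around that classifier might look like the sketch below. The actual Gemini call is elided; what matters is strict validation of the JSON coming back, with a safe mid-tier default when the classifier returns garbage (the default tuple here is illustrative):

```python
import json

CATEGORIES = {"simple", "coding", "reasoning", "creative", "research"}
COMPLEXITIES = {"low", "medium", "high"}

CLASSIFIER_TEMPLATE = """Classify this user prompt into exactly one category:
- simple: translations, Q&A, formatting, summaries, spelling
- coding: code generation, debugging, review
- reasoning: math, logic, multi-step analysis
- creative: writing, brainstorming, content creation
- research: deep analysis, paper review, comprehensive reports
Also rate complexity: low, medium, high
Prompt: "{user_prompt}"
Respond in JSON: {{"category": "...", "complexity": "..."}}"""

def build_classifier_prompt(user_prompt: str) -> str:
    return CLASSIFIER_TEMPLATE.format(user_prompt=user_prompt)

def parse_classification(raw: str) -> tuple[str, str]:
    """Validate the model's JSON; fall back to a middle-of-the-road default."""
    try:
        data = json.loads(raw)
        category, complexity = data["category"], data["complexity"]
        if category in CATEGORIES and complexity in COMPLEXITIES:
            return category, complexity
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return "simple", "medium"  # illustrative fallback for a misbehaving classifier
```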
Layer 3: Benchmark-Based Model Scoring
Once we know the task category and complexity, we need to pick the specific model. This is done deterministically using published benchmark data.
Data sources:
- LMArena ELO scores — crowdsourced model quality rankings
- Artificial Analysis — quality, speed, and price indices
- Provider-published benchmarks — MMLU, HumanEval, MATH, etc.
The scoring function (simplified):
For each candidate model:
score = (quality_weight * normalized_quality_score)
+ (speed_weight * normalized_speed_score)
- (cost_weight * normalized_cost_score)
where weights are determined by the user's tier:
frugal: quality=0.3, speed=0.2, cost=0.5
balanced: quality=0.4, speed=0.3, cost=0.3
premium: quality=0.6, speed=0.2, cost=0.2
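In Python, the scoring step above might look like this. The tier weights mirror the post; the model-stat dictionaries are a made-up shape, assuming each benchmark source has already been normalized to [0, 1]:

```python
# Tier weights from the post: frugal penalizes cost hardest,
# premium weights quality hardest.
TIER_WEIGHTS = {
    "frugal":   {"quality": 0.3, "speed": 0.2, "cost": 0.5},
    "balanced": {"quality": 0.4, "speed": 0.3, "cost": 0.3},
    "premium":  {"quality": 0.6, "speed": 0.2, "cost": 0.2},
}

def score(model: dict, tier: str) -> float:
    """Higher is better; quality and speed add, cost subtracts."""
    w = TIER_WEIGHTS[tier]
    return (w["quality"] * model["quality"]
            + w["speed"] * model["speed"]
            - w["cost"] * model["cost"])

def pick_model(candidates: list[dict], tier: str) -> dict:
    """Deterministic: same candidates and tier always give the same pick."""
    return max(candidates, key=lambda m: score(m, tier))
```

Because the function is a plain weighted sum, every routing decision can be replayed and audited after the fact, which is the whole point of choosing this over an ML router.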
Why benchmarks instead of ML?
I considered training a routing model (like Martian does). The advantages of ML-based routing are real — per-query optimization that learns from actual outcomes, adapting to user patterns.
But the disadvantages were disqualifying for a solo dev:
- Training data. You need thousands of labeled examples per category. I didn't have them at launch.
- Maintenance. The model needs retraining as new models launch and old ones change.
- Debugging. When an ML router makes a bad decision, it's hard to explain why. With benchmark-based scoring, the decision is deterministic and auditable.
- Cost. Training and hosting a routing model is an ongoing expense.
Benchmark-based scoring captures ~80% of the optimization value with 1% of the complexity. For a bootstrapped launch, that trade-off made sense.
Layer 4: Provider Router and Failover
The selected model goes through a provider router that handles:
Health checking. We monitor each provider's API status. If Anthropic returns 529 (overloaded), we know immediately.
Automatic failover. If the selected model is unavailable, we fall back to the next-best option in the same tier. The user never sees a 500 error.
Rate limit management. If we're hitting rate limits on a provider, we spread traffic across alternative providers that host the same model.
Latency routing. For real-time applications, we can route to the provider instance with the lowest current latency (Gemini via different regions, for example).
Failover chain example:
Selected: Claude Sonnet 4.5 (Anthropic)
-> Anthropic returns 529
-> Fallback 1: GPT-4o (OpenAI) - similar quality tier
-> OpenAI healthy -> route to GPT-4o
-> Response includes: model_used="gpt-4o", original_model="claude-sonnet-4.5"
The user's code doesn't need to handle any of this. They call neo-mode/balanced and get a response, regardless of provider status.
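The failover walk itself can be sketched as a few lines: take an ordered, same-tier chain and return the first healthy option, recording both the model actually used and the one originally selected. The chain entries and health set here are illustrative:

```python
def route_with_failover(chain: list[dict], healthy: set[str]) -> dict:
    """chain is ordered best-first; each entry has 'model' and 'provider'."""
    selected = chain[0]  # what Layer 3 actually picked
    for option in chain:
        if option["provider"] in healthy:
            return {
                "model_used": option["model"],
                "original_model": selected["model"],
            }
    raise RuntimeError("all providers in the failover chain are down")
```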
The Three Modes, Technically
Neo Mode
Request -> Layer 1 -> (Layer 2 if needed) -> Layer 3 -> Layer 4 -> Response
Full routing pipeline. Each request gets analyzed individually and routed to the best model at the best price for the user's tier (frugal/balanced/premium).
Pinned Mode
Request -> Skip Layer 1-3 -> Layer 4 (with pinned model) -> Response
User specifies a model (e.g., anthropic/claude-sonnet-4-5). We route directly to that model but handle failover and auto-upgrade. When claude-sonnet-5-0 launches, Pinned Mode automatically upgrades within the Anthropic Sonnet family.
Usage Analytics
Every API call is logged with cost, model, tier, and latency. The dashboard surfaces which task types cost the most so you can tune your tier selection.
What Went Wrong (And What I Learned)
Problem 1: The "Simple" Query That Wasn't
Early version routed "translate this legal contract from English to Mandarin" to a Flash model because it matched "translate". The result was garbage — Flash models can't handle specialized translation with legal terminology.
Fix: Added complexity signals to the regex classifier. If a "translate" query has more than 200 words of input, mentions specialized domains (legal, medical, technical), or includes qualifiers (preserve nuance, maintain tone), it escalates to Layer 2.
Problem 2: Classifier Latency Spikes
The Gemini Flash classifier sometimes takes 800ms+ instead of 200ms. This happens during peak load.
Fix: Added a timeout. If the classifier doesn't respond in 500ms, we fall back to the "medium" category — which routes to a Pro-tier model. Slightly more expensive than optimal, but better than 800ms of latency.
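The timeout guard is a one-liner in asyncio. A sketch, assuming the classifier is an async callable; the 500ms default matches the post, and "medium" is the fallback category:

```python
import asyncio

async def classify_with_timeout(classify, prompt: str, timeout: float = 0.5):
    """Run the LLM classifier, but cap the latency it can add."""
    try:
        return await asyncio.wait_for(classify(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Pro-tier routing: slightly pricier than optimal, but no latency spike.
        return "medium"
```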
Problem 3: Benchmark Data Staleness
When a new model launches, it takes 1-2 weeks for LMArena and Artificial Analysis to publish benchmark scores. During that window, the scoring engine can't evaluate the new model.
Fix: For new models without benchmark data, we use the previous model in the family as a proxy. When Claude Opus 4.6 launched, we initially used Opus 4.1 scores until Opus 4.6 benchmarks were published (within a week).
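A sketch of that staleness fallback, assuming a hand-maintained family mapping (FAMILY_PROXY and the benchmark-dict shape are illustrative):

```python
# Map a freshly launched model to its predecessor in the same family.
FAMILY_PROXY = {"claude-opus-4.6": "claude-opus-4.1"}

def get_scores(model: str, benchmarks: dict) -> dict:
    """Real scores if published; otherwise the family predecessor's scores."""
    if model in benchmarks:
        return benchmarks[model]
    proxy = FAMILY_PROXY.get(model)
    if proxy and proxy in benchmarks:
        return benchmarks[proxy]
    raise KeyError(f"no benchmark data or family proxy for {model}")
```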
Problem 4: Cost Estimation vs. Reality
Token counts vary. A query we estimate at 100 tokens might use 150. This means the cost field in our response is an estimate, not exact.
Fix: We moved to post-hoc cost calculation. The komilion.cost field now reflects the actual cost based on the provider's reported token usage, not our estimate. Slight delay (we get the real count from the provider response), but accurate.
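Post-hoc cost is then simple arithmetic over the provider's reported usage. A sketch with per-million-token prices (the rates passed in are illustrative, not real pricing):

```python
def actual_cost(usage: dict, price_in_per_m: float, price_out_per_m: float) -> float:
    """Compute spend from the provider's real token counts, not an estimate."""
    return (usage["prompt_tokens"] * price_in_per_m
            + usage["completion_tokens"] * price_out_per_m) / 1_000_000
```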
Performance Numbers
| Metric | Value |
|---|---|
| Regex classification | < 5ms |
| LLM classification | 200-400ms (p50: 250ms) |
| Model scoring | < 1ms |
| Provider routing | < 2ms |
| Total overhead (fast path) | < 8ms |
| Total overhead (LLM path) | 250-450ms |
| Requests hitting fast path | ~60% |
For context, model inference itself takes 500ms-15s depending on the task and model. On the fast path, routing overhead is well under 1% of total response time; on the LLM path it's a bigger share of short requests, but stays small relative to long-running tasks.
The Stack
For those interested in what powers this:
Frontend: Next.js 14 (App Router)
Hosting: Vercel
Database: Neon PostgreSQL (via Prisma)
Cache/Queue: Upstash Redis
Auth: NextAuth.js
Email: Resend
Model Provider: OpenRouter (as our primary upstream)
Monitoring: Vercel Analytics + custom dashboard
DNS: Cloudflare
Total hosting cost: ~$20/month at current scale. The beauty of serverless — costs scale with usage.
What's Next
The routing engine is v3. Here's what I'm working toward:
- Streaming cost estimation. Show estimated cost before the response completes, so users can abort expensive queries.
- User-specific routing. Learn from each user's patterns and adjust model selection over time.
- Custom routing rules. Let users define their own regex patterns and model mappings.
- Batch API support. Route batch requests with 50% cost savings.
- Embeddings routing. Not just chat completions — route embedding requests to the cheapest provider.
Build Your Own or Use Komilion
If this architecture interests you, I've open-sourced the core classifier (50 lines of Python) in my first blog post. That gets you 60-70% of the savings.
For the full system — benchmark integration, failover, multiple modes, cost tracking — that's what Komilion is. One line of code to integrate:
from openai import OpenAI

client = OpenAI(
    base_url="https://www.komilion.com/api/v1",
    api_key="ck_your_key",
)
Free credits, no credit card: komilion.com
Questions about the architecture? Find me on Twitter @BannerRobi10895 or open an issue on our docs.
This is part of the AI Cost Optimization series. Previous posts: Cost Savings Guide | Model Pricing Guide | Founder Story