I track my own AI spend across three projects. In March, the line item that grew fastest was not Claude or GPT calls. It was my Cursor seat plus my Copilot seat plus the Anthropic API I was hitting from a personal CLI. Three subscriptions, three meters, and the same Opus tokens billed twice because Cursor was sending the same context to its own backend that I was already passing through to Anthropic directly.
The wrappers do not advertise this. The router code is not their product. The product is the convenience of not thinking about which model handles which prompt. You pay the orchestration tax in margin baked into the seat price.
I got tired of paying it. So I wrote the router. It is 200 lines of TypeScript. My April bill came in 41% under March on roughly the same volume of work.
TL;DR
| Model | Input $/M tokens | Output $/M tokens | Best for |
|---|---|---|---|
| Haiku 4.5 | 0.80 | 4.00 | Lookups, classification, typo fixes |
| Sonnet 4.6 | 3.00 | 15.00 | Default coding, refactors, code review |
| Opus 4.7 | 5.00 | 25.00 | Multi-step planning, architecture |
| GPT-5 mini | 0.50 | 2.00 | Cheap classification, embeddings prep |
The 41% saving came from one thing: stopping Sonnet from handling tasks that Haiku could finish at roughly a quarter of the cost. Most coding queries are lookups dressed up as questions. Route by intent, not by habit.
1. The orchestration tax is real
Every wrapper makes the same trade. It picks a model for you, prepends a system prompt you cannot edit, and holds a context window you cannot inspect. In return, you do not have to think.
The cost of not thinking shows up two ways:
- The wrapper calls the most expensive model that fits its SLA, because that is what makes the demo look good
- The wrapper bills you for context it sent on your behalf, including its own system prompt and tool definitions
I logged 30 days of Cursor usage against the Anthropic dashboard. Cursor was sending an average of 8,400 input tokens per chat turn. My direct API calls for the same chats averaged 1,900. The 6,500 token delta is Cursor's frame, plus indexing context, plus its agent scaffolding. Useful, but not free.
When you build the router yourself, you choose what to send. That is the whole game.
2. The 200 line router
Here is the file. Drop it in a project, give it your API keys, and it picks a model per request based on rules you control.
```typescript
// router.ts
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

type Intent = "trivial" | "code" | "plan" | "embed";

interface RouteRule {
  match: (prompt: string) => boolean;
  intent: Intent;
}

interface ModelConfig {
  provider: "anthropic" | "openai";
  model: string;
  maxTokens: number;
}

// One model per intent. Swap models or token ceilings here without touching the rules.
const ROUTES: Record<Intent, ModelConfig> = {
  trivial: { provider: "anthropic", model: "claude-haiku-4-5-20251001", maxTokens: 1024 },
  code: { provider: "anthropic", model: "claude-sonnet-4-6", maxTokens: 4096 },
  plan: { provider: "anthropic", model: "claude-opus-4-7", maxTokens: 8192 },
  embed: { provider: "openai", model: "gpt-5-mini", maxTokens: 512 },
};

// Checked in order; the first matching rule wins.
const RULES: RouteRule[] = [
  { intent: "trivial", match: (p) => p.length < 200 && /\?$/.test(p.trim()) },
  { intent: "trivial", match: (p) => /^(what is|define|fix typo|rename)/i.test(p) },
  { intent: "plan", match: (p) => /(refactor|design|architect|migrate|plan)/i.test(p) },
  { intent: "code", match: (p) => /(`{3}|function |class |const |let )/i.test(p) },
  { intent: "embed", match: (p) => p.startsWith("CLASSIFY:") },
];

function pickIntent(prompt: string): Intent {
  for (const rule of RULES) {
    if (rule.match(prompt)) return rule.intent;
  }
  return "code"; // conservative default: a capable model, not the cheapest one
}

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export interface RouteResult {
  text: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// USD per million tokens, mirroring the table above.
const PRICING: Record<string, { in: number; out: number }> = {
  "claude-haiku-4-5-20251001": { in: 0.8, out: 4 },
  "claude-sonnet-4-6": { in: 3, out: 15 },
  "claude-opus-4-7": { in: 5, out: 25 },
  "gpt-5-mini": { in: 0.5, out: 2 },
};

function priceCall(model: string, inTok: number, outTok: number): number {
  const p = PRICING[model];
  if (!p) return 0;
  return (inTok * p.in + outTok * p.out) / 1_000_000;
}

export async function route(prompt: string): Promise<RouteResult> {
  const intent = pickIntent(prompt);
  const cfg = ROUTES[intent];

  if (cfg.provider === "anthropic") {
    const r = await anthropic.messages.create({
      model: cfg.model,
      max_tokens: cfg.maxTokens,
      messages: [{ role: "user", content: prompt }],
    });
    const text = r.content
      .filter((b) => b.type === "text")
      .map((b) => (b as { text: string }).text)
      .join("");
    return {
      text,
      model: cfg.model,
      inputTokens: r.usage.input_tokens,
      outputTokens: r.usage.output_tokens,
      costUsd: priceCall(cfg.model, r.usage.input_tokens, r.usage.output_tokens),
    };
  }

  const r = await openai.chat.completions.create({
    model: cfg.model,
    // gpt-5 family models take max_completion_tokens rather than the older max_tokens.
    max_completion_tokens: cfg.maxTokens,
    messages: [{ role: "user", content: prompt }],
  });
  const usage = r.usage ?? { prompt_tokens: 0, completion_tokens: 0 };
  return {
    text: r.choices[0]?.message?.content ?? "",
    model: cfg.model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    costUsd: priceCall(cfg.model, usage.prompt_tokens, usage.completion_tokens),
  };
}
```
That is it. Two providers, four intents, five rules, and a cost calculator. Use it like this:
```typescript
import { route } from "./router";

const out = await route("rename this function from getUser to fetchUser");
console.log(out.model, out.costUsd.toFixed(5));
// claude-haiku-4-5-20251001 0.00012
```
The rules are deliberately dumb. Length plus regex covers maybe 70% of routing decisions correctly. For the other 30%, override with a prefix:
```plaintext
await route("[force:opus] design a permissions model for ...");
```
Add a one-liner to pickIntent to read the prefix. I left it out to keep the example tight.
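If you want the override, here is a minimal sketch of that check, meant to sit in router.ts next to pickIntent. The [force:...] grammar and the model-to-intent mapping are my own choices, not part of the file above:

```typescript
// Hypothetical prefix override: "[force:opus] ..." pins the request to an intent.
const FORCE_RE = /^\[force:(haiku|sonnet|opus)\]\s*/i;

function pickIntentWithOverride(prompt: string): Intent {
  const m = prompt.match(FORCE_RE);
  if (!m) return pickIntent(prompt);
  const forced = m[1].toLowerCase();
  if (forced === "haiku") return "trivial";
  if (forced === "opus") return "plan";
  return "code"; // "sonnet" maps to the default coding tier
}
```

route would also want to strip the prefix before sending, so the model never sees the bracket.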
3. Routing rules that actually work
The naive approach is to send a tiny classifier call to a cheap model and have it pick the route. That sounds smart and costs more than it saves, because every request now eats two API calls. The cost of pickIntent must be zero.
Five regex rules cover most of my workload:
- Short and ends in a question mark: trivial
- Starts with "what is", "define", "fix typo", "rename": trivial
- Contains "refactor", "design", "architect", "migrate", "plan": plan
- Contains code fence or function keyword: code
- Starts with "CLASSIFY:" prefix: embed (cheap classifier)
Default to code. A wrong route from trivial to code costs maybe 4x more on that one request. A wrong route from code to opus costs 1.6x. Neither is a disaster. The bug to avoid is sending Haiku a multi-step plan it cannot hold context for, so default conservatively.
I also log every miss. After two weeks I had a small CSV of "this prompt routed to X but should have been Y". I added two regex rules and the miss rate dropped from 8% to under 2%.
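The logging itself is nothing clever. A sketch of what I mean, where logMiss and misses.csv are placeholder names rather than anything from the router above:

```typescript
// miss-log.ts: append one row per misroute so bad rules leave a paper trail.
import { appendFileSync } from "node:fs";

export function logMiss(prompt: string, routedTo: string, shouldHaveBeen: string): void {
  // JSON.stringify quotes the prompt so embedded commas and newlines do not break the CSV.
  const row = [JSON.stringify(prompt.slice(0, 200)), routedTo, shouldHaveBeen].join(",");
  appendFileSync("misses.csv", row + "\n");
}
```

Call it whenever an answer makes you reach for a bigger model. The CSV is the input to the next round of rules.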
4. The 41% number, broken down
March bill, no router:
| Source | Calls | Spend |
|---|---|---|
| Cursor seat | n/a | $20 |
| Copilot seat | n/a | $10 |
| Anthropic direct | 4,200 | $87 |
| OpenAI direct | 800 | $14 |
| Total | 5,000 | $131 |
April bill, with router (cancelled Cursor, kept Copilot for IDE inline only):
| Source | Calls | Spend |
|---|---|---|
| Cursor seat | n/a | $0 |
| Copilot seat | n/a | $10 |
| Anthropic via router | 5,100 | $54 |
| OpenAI via router | 1,400 | $13 |
| Total | 6,500 | $77 |
That is 41% lower on 30% more total calls. The router shifted 62% of calls onto Haiku, which was eating workloads Sonnet had been handling. Average cost per call dropped from $0.024 to $0.013.
The Cursor cancel did the headline saving. The router did the smaller, repeating, compounding saving. Both come from the same idea: the wrapper is hiding decisions you could make better yourself.
5. What this does not do
This is not an agent framework. It does not stream. It does not retry. It does not cache. It does not handle rate limits. It does not do tool use. It does not know about your codebase.
Adding any of those takes work. Streaming is two changes. Caching with Anthropic prompt caching is a cache_control marker on the blocks you want reused. Retry with exponential backoff is 20 lines. Tool use requires schema plumbing you would write anyway.
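For the retry piece, a sketch of roughly what those 20 lines look like wrapped around route; the attempt count and delays are arbitrary:

```typescript
// retry.ts: exponential backoff around route() (a sketch, not part of the router above).
import { route, type RouteResult } from "./router";

export async function routeWithRetry(prompt: string, maxAttempts = 4): Promise<RouteResult> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await route(prompt);
    } catch (err) {
      lastError = err;
      // 500ms, 1s, 2s, ... plus jitter so concurrent callers do not retry in lockstep.
      const delayMs = 500 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```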
If you need all of that, use a real framework. If you want to stop paying the orchestration tax on 80% of your calls, the 200 lines above will do it. Add the rest as you actually hit each problem.
Conclusion
Wrappers exist because routing AI calls is annoying. It is also the highest leverage thing you can own in your own code. The 200 lines above are not a moat. They are a Tuesday afternoon. The reason to write them is that you cannot improve a bill you cannot see.
What is your current ratio of cheap model to expensive model calls? If you do not know, that is the first thing to fix. Wire up cost logging before you wire up the router. The numbers will surprise you.
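A sketch of the kind of cost logging I mean, hung off route here for brevity; the same shape works wrapped around a raw SDK call, and routeLogged and costs.jsonl are placeholder names:

```typescript
// cost-log.ts: one JSON line per call, so the bill is visible before the month ends.
import { appendFileSync } from "node:fs";
import { route, type RouteResult } from "./router";

export async function routeLogged(prompt: string): Promise<RouteResult> {
  const result = await route(prompt);
  appendFileSync(
    "costs.jsonl",
    JSON.stringify({
      ts: new Date().toISOString(),
      model: result.model,
      inputTokens: result.inputTokens,
      outputTokens: result.outputTokens,
      costUsd: result.costUsd,
    }) + "\n"
  );
  return result;
}
```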
GDS K S · thegdsks.com · building Glincker · follow on X @thegdsks
The orchestration tax is the part of the AI bill that does not show up on the pricing page.
Top comments (6)
That’s impressive 🔥
Simple optimization can save more money than scaling bigger infrastructure sometimes.
the "deliberately dumb" framing is the right one. with regex you get debuggable, deterministic classification, where every misroute has a fixable signature instead of being some opaque LLM-router whim. honestly the 70% number kind of undersells what's happening: code-tier as the default-on-miss means most of the 30% misroutes fall back to a model that can still answer, just less efficiently. the failure mode i'd watch for is a genuinely complex plan-tier task getting misrouted to haiku and producing a confident but wrong answer that needs a full retry. that's where the savings disappear. seen that pattern in your april data?
The 8,400 vs 1,900 input token comparison is the number that sells the whole post. Most people don't realize how much context the wrapper is sending on their behalf.
I run a similar pattern in production but with one difference: my router isn't regex-based, it's a Haiku call. You mentioned this costs more than it saves because of the double API call. In my case it's worth it because the classification does more than pick a model. It extracts intent, sentiment, complexity, and chain context from the user's message in one pass, and that metadata drives the entire downstream pipeline (which tools to load, which data to prefetch, which response format to use). So the "classifier" call isn't wasted, it's doing real work that the main model would otherwise spend tokens figuring out.
The miss rate logging is the underrated part of this post. Most people build the router and never measure whether it's routing correctly. Your CSV approach is simple and it's exactly what caught my 30% of edge cases too. I log every intent classification with the model's confidence score. Anything below 0.85 gets flagged for review. After two weeks of that, the misroutes dropped to nearly zero.
One thing I'd add: the cost saving is real but the accuracy improvement matters more. Haiku answering a complex multi-step question gives you a confident wrong answer. Opus answering a simple lookup wastes money and adds 3 seconds of latency. The router protects you from both failure modes. Cost is the metric you measure. Accuracy is the metric that matters.
The 41% bill drop is a good number, but the underlying issue is more interesting — the double-billing on context tokens. Cursor and Copilot both proxy to the same model provider with their own system prompts on top of yours. The agent sees the full expanded context and charges per token. You see the invoice from both subscriptions.
The router fixes this by being the single orchestration layer. But 200 lines means you're probably doing model selection by simple heuristic. What's the routing logic — temperature-based task type, or something more learned?
This is exactly the kind of optimization that gets overlooked. Most devs just pick one model and stick with it regardless of task complexity.
The routing logic makes total sense — simple completions don't need the same model as deep reasoning tasks. Did you find a reliable way to classify task complexity at inference time, or do you use a heuristic based on prompt length / token count?
I'm curious how it handles edge cases where a "simple" task turns out to need more reasoning than expected.
The 8,400 vs 1,900 token comparison is eye-opening — I knew wrappers added overhead but never quantified it that precisely. The fact that most of the savings came from shifting 62% of calls to Haiku is a great reminder that most "AI" workloads are actually pretty simple lookups.
I've been thinking about a similar routing approach for a side project. One question: how do you handle the edge case where a prompt looks trivial by your regex rules but turns out to need deeper reasoning? Do you have a fallback or retry mechanism, or do you just accept the occasional wrong route as a known trade-off worth making?