I track my own AI spend across three projects. In March, the line item that grew fastest was not Claude or GPT calls. It was my Cursor seat plus my...
For further actions, you may consider blocking this person and/or reporting abuse
Thatβs impressive π₯
Simple optimization can save more money than scaling bigger infrastructure sometimes.
???
The 8,400 vs 1,900 input token comparison is the number that sells the whole post. Most people don't realize how much context the wrapper is sending on their behalf.
I run a similar pattern in production but with one difference: my router isn't regex-based, it's a Haiku call. You mentioned this costs more than it saves because of the double API call. In my case it's worth it because the classification does more than pick a model. It extracts intent, sentiment, complexity, and chain context from the user's message in one pass, and that metadata drives the entire downstream pipeline (which tools to load, which data to prefetch, which response format to use). So the "classifier" call isn't wasted, it's doing real work that the main model would otherwise spend tokens figuring out.
The miss rate logging is the underrated part of this post. Most people build the router and never measure whether it's routing correctly. Your CSV approach is simple and it's exactly what caught my 30% of edge cases too. I log every intent classification with the model's confidence score. Anything below 0.85 gets flagged for review. After two weeks of that, the misroutes dropped to nearly zero.
One thing I'd add: the cost saving is real but the accuracy improvement matters more. Haiku answering a complex multi-step question gives you a confident wrong answer. Opus answering a simple lookup wastes money and adds 3 seconds of latency. The router protects you from both failure modes. Cost is the metric you measure. Accuracy is the metric that matters.
The orchestration tax is absolutely real, and your cost savings are impressive. For us, building health AI, routing goes beyond cost-efficiency; it's about accuracy and contextual relevance. When a user says \"mujhe bukhar hai\" (Hindi for \"I have a fever\") via voice, a router must not just detect intent but also route to models best equipped...
the "deliberately dumb" framing is the right one. with regex you get debuggable, deterministic classification, where every misroute has a fixable signature instead of being some opaque LLM-router whim. honestly the 70% number kind of undersells what's happening: code-tier as the default-on-miss means most of the 30% misroutes fall back to a model that can still answer, just less efficiently. the failure mode i'd watch for is a genuinely complex plan-tier task getting misrouted to haiku and producing a confident but wrong answer that needs a full retry. that's where the savings disappear. seen that pattern in your april data?
The 41% bill drop is a good number, but the underlying issue is more interesting β the double-billing on context tokens. Cursor and Copilot both proxy to the same model provider with their own system prompts on top of yours. The agent sees the full expanded context and charges per token. You see the invoice from both subscriptions.
The router fixes this by being the single orchestration layer. But 200 lines means you're probably doing model selection by simple heuristic. What's the routing logic β temperature-based task type, or something more learned?
The thing I keep relearning with router setups is that the cost win shows up on the bill weeks before it shows up in any quality metric. We ran a similar split small model for retrieval pre-pass and rewrite, mid model for actual reasoning, large only for hard escalations. Bill dropped about 38% over a month. What I did not predict is that latency variance also collapsed because small models served almost all the warm-cache hits. The trap is letting your router become an org chart. Once the routing logic encodes who-owns-what, swapping models becomes a political negotiation, not an experiment.
The "cost of pickIntent must be zero" line is the whole article in seven words. We tried LLM-based routing first (Haiku-as-classifier in front of Sonnet/Opus) and the classifier cost ate ~30% of the savings before you even got to misroute correction β regex wins on debuggability and on the bill.
One refinement that helped us push past the ~2% misroute floor: instead of a single regex pass, we let the tool surface the user is hitting bias the tier. A request that needs file-edit tools almost always wants Sonnet minimum even if the prompt looks trivial; a pure Q&A endpoint can stay on Haiku even when "refactor" appears in the text. Intent β§ tool-context beats intent alone.
Also worth saying out loud: the 6,500-token wrapper-overhead delta you measured against Cursor is the real story here. Most teams blame the model price when their problem is that they're shipping a small novel of scaffolding with every turn.
The 6,500 token delta between Cursor and direct API is the real killer. That's not just overhead β that's Cursor's entire system prompt, indexing context, and agent scaffolding being billed to you every single turn. The wrapper's convenience fee is hiding in plain sight. You're paying for them to send context you already have. The router fixes that by letting you choose what to send.
This is exactly the kind of optimization that gets overlooked. Most devs just pick one model and stick with it regardless of task complexity.
The routing logic makes total sense β simple completions don't need the same model as deep reasoning tasks. Did you find a reliable way to classify task complexity at inference time, or do you use a heuristic based on prompt length / token count?
I'm curious how it handles edge cases where a "simple" task turns out to need more reasoning than expected.
The 8,400 vs 1,900 token comparison is eye-opening β I knew wrappers added overhead but never quantified it that precisely. The fact that most of the savings came from shifting 62% of calls to Haiku is a great reminder that most "AI" workloads are actually pretty simple lookups.
I've been thinking about a similar routing approach for a side project. One question: how do you handle the edge case where a prompt looks trivial by your regex rules but turns out to need deeper reasoning? Do you have a fallback or retry mechanism, or do you just accept the occasional wrong route as a known trade-off worth making?
Routing is the unglamorous lever. Mine isn't 200 lines but the principle held most calls don't need the strongest model, you just default there because comparison is cheaper than measurement. The thing I had to add later was a rolling sample of 5 percent of traffic that always hits the strongest model so the router has fresh ground truth to drift against. Without that, the savings hide a slow regression in answer quality. 41 percent is plausible; 41 percent that's still 41 percent six months later is the harder number.