Cutting our LLM bill ~80% with model routing: the actual cost math

#ai #llm #architecture #api

Most teams I talk to run every LLM call through one frontier model, then act surprised when the invoice shows up. We did the same thing for a while. The fix that actually moved our bill was boring: route each request to the cheapest model that can still do the job. Here is the math and how we set it up.

The price spread is bigger than people assume

If you line up current API pricing across providers, the gap between the cheapest budget models and the top frontier models is roughly 50x per token. Output tokens also cost more than input tokens, usually 4-6x more, which matters a lot if your app generates long responses.
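To make the output multiplier concrete, here is a minimal sketch of how much of a single request's cost comes from output tokens. The function name and the 5x ratio are illustrative assumptions (the middle of the range above), not any provider's actual price list.

```python
def output_cost_share(input_tokens: int, output_tokens: int,
                      output_to_input_ratio: float) -> float:
    """Fraction of a request's token cost that comes from output tokens.

    Prices are expressed relative to the input price, so the input price
    itself cancels out and only the output/input ratio matters.
    """
    input_cost = input_tokens * 1.0
    output_cost = output_tokens * output_to_input_ratio
    return output_cost / (input_cost + output_cost)


# Assumed 5x output/input price ratio; token counts are an example request shape.
share = output_cost_share(input_tokens=500, output_tokens=800,
                          output_to_input_ratio=5.0)
print(f"{share:.1%} of this request's cost is output tokens")  # prints 88.9% ...
```

For a request shaped like this, almost nine tenths of the spend is output, so the output price is the number to compare across models.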

So the question is not "which model is best." It is "which model is good enough for this request, at what cost." For a support reply, a classification, or a short summary, a mid-tier model often produces output you cannot distinguish from the frontier one in a blind test. You are paying frontier prices for work a cheaper model finishes fine.

What routing looks like in practice

The pattern is simple:

Classify the incoming task (intent, complexity, how much it would cost if it went wrong).
Pick the cheapest model that clears the quality bar for that class.
Fall back to a stronger model if a confidence check or validation step fails.
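
The three steps above can be sketched roughly as follows. Everything named here is a placeholder for illustration: the tier names, the class-to-tier table, the keyword classifier, and the `call_model` and `is_valid` callables are assumptions, not a real provider SDK or our production setup.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model tiers, ordered cheapest first.
TIERS = ["budget-model", "mid-model", "frontier-model"]

# Hypothetical table: index of the cheapest tier that cleared the quality bar
# for each task class. In practice this comes out of your evals.
MIN_TIER_FOR_CLASS = {"classification": 0, "summary": 1, "support_reply": 1, "legal": 2}


@dataclass
class Result:
    model: str
    text: str
    escalations: int  # how many times we stepped up to a stronger tier


def classify(prompt: str) -> str:
    """Toy keyword classifier; a real one would be rules or a small model."""
    lowered = prompt.lower()
    if "contract" in lowered or "liability" in lowered:
        return "legal"
    if lowered.startswith("summarize"):
        return "summary"
    if lowered.startswith("label"):
        return "classification"
    return "support_reply"


def route(prompt: str,
          call_model: Callable[[str, str], str],
          is_valid: Callable[[str], bool]) -> Result:
    """Start at the cheapest acceptable tier; escalate while validation fails."""
    task_class = classify(prompt)
    # Unknown classes default to the strongest tier.
    start = MIN_TIER_FOR_CLASS.get(task_class, len(TIERS) - 1)
    text = ""
    for escalations, model in enumerate(TIERS[start:]):
        text = call_model(model, prompt)
        if is_valid(text):
            return Result(model, text, escalations)
    # Even the strongest tier failed validation; return its output and let
    # the caller decide what to do.
    return Result(TIERS[-1], text, len(TIERS) - 1 - start)


# Usage with a stub in place of real API calls: the mid tier returns an
# empty string here, so the request escalates once to the frontier tier.
def fake_call(model: str, prompt: str) -> str:
    return "" if model == "mid-model" else f"[{model}] ok"


result = route("Summarize this ticket thread", fake_call, is_valid=bool)
print(result.model, result.escalations)  # prints: frontier-model 1
```

Keeping the escalation count on the result makes it cheap to log, which is what you need later to tell whether the thresholds are set sensibly.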

A rough example from our own traffic. Say a workflow does 1M requests a month, averaging 500 input tokens and 800 output tokens:

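Everything on a frontier model: you are paying frontier rates on all 500M input tokens and, more painfully, frontier output rates on all 800M output tokens.
Route ~70% of requests (the simple classes) to a mid-tier model at a fraction of the per-token cost and keep the hard 30% on frontier: the blended cost drops sharply, and in our case it landed around 80% lower than the all-frontier baseline. One caveat on the arithmetic: whatever stays on frontier sets a floor on the bill. If every request used the same number of tokens, a 70/30 split could save at most 70%, so a result near 80% is only reachable when the routed classes carry more than 80% of the baseline token spend.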
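
Here is the same workload as a small calculation you can rerun with your own numbers. The per-token prices are assumptions picked for illustration (a mid-tier model at one tenth of the frontier price), not quotes from any provider, and the sketch assumes every request has the same token counts regardless of class.

```python
def monthly_cost(requests: float, in_tok: int, out_tok: int,
                 in_price: float, out_price: float) -> float:
    """Monthly cost in USD; prices are USD per 1M tokens."""
    return requests * (in_tok * in_price + out_tok * out_price) / 1_000_000


# ASSUMED prices in USD per 1M tokens, for illustration only.
FRONTIER = {"in_price": 3.00, "out_price": 15.00}
MID = {"in_price": 0.30, "out_price": 1.50}

# Workload from the example: 1M requests, 500 input and 800 output tokens each.
REQUESTS, IN_TOK, OUT_TOK = 1_000_000, 500, 800

baseline = monthly_cost(REQUESTS, IN_TOK, OUT_TOK, **FRONTIER)


def blended(routed_share: float) -> float:
    """Cost when routed_share of requests go to MID and the rest stay on FRONTIER."""
    routed = monthly_cost(REQUESTS * routed_share, IN_TOK, OUT_TOK, **MID)
    kept = monthly_cost(REQUESTS * (1 - routed_share), IN_TOK, OUT_TOK, **FRONTIER)
    return routed + kept


print(f"all frontier: ${baseline:,.0f}/month")
for share in (0.70, 0.90):
    cost = blended(share)
    print(f"routed {share:.0%}: ${cost:,.0f}/month, "
          f"{1 - cost / baseline:.0%} below baseline")
```

Under these assumed prices the baseline is $13,500 a month, a 70/30 split by request count comes out about 63% lower, and it takes roughly 90% of traffic routed to get past 80%. The result is very sensitive to the token mix: if your hard requests are shorter or rarer in token terms than your easy ones, the same request split saves more.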

The savings are not magic. They come from the fact that most production traffic is not hard, and the price curve between "good enough" and "best" is steep.

The parts that bite you

Routing is not free to run. A few things I would not skip:

An eval harness. You need to measure quality per task class before and after you move it to a cheaper model, or you are guessing. Without this you will either over-route and ship worse output, or under-route and keep overpaying.
A real fallback. If the cheap model returns low confidence or fails a schema check, escalate to a stronger one. The escalation rate tells you whether your routing thresholds are set right.
Latency, not just cost. Sometimes the cheaper model is also faster, sometimes not. Track both.
Know when not to route. High-stakes output (legal, medical, anything a human acts on directly) is where you keep the strong model and eat the cost. Routing is for the long tail of ordinary requests, not the 1% that has to be right.
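
As a rough sketch of what the eval and escalation checks above can look like together, the function below groups scored samples by task class and flags which classes look safe to move to the cheaper model. The record fields, the scoring scale (0 to 1), and both thresholds are hypothetical; pick values that match how you actually grade output.

```python
from collections import defaultdict
from statistics import mean


def routable_classes(samples, max_quality_drop=0.02, max_escalation_rate=0.15):
    """Summarize eval results per task class.

    Each sample is a dict with hypothetical keys:
      task_class      - label from your classifier
      frontier_score  - quality score of the frontier model's output (0..1)
      cheap_score     - quality score of the cheaper model's output (0..1)
      escalated       - True if the cheap output failed validation
    """
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample["task_class"]].append(sample)

    report = {}
    for task_class, rows in by_class.items():
        drop = (mean(r["frontier_score"] for r in rows)
                - mean(r["cheap_score"] for r in rows))
        escalation_rate = mean(1.0 if r["escalated"] else 0.0 for r in rows)
        report[task_class] = {
            "quality_drop": drop,
            "escalation_rate": escalation_rate,
            # Route only if quality holds AND the fallback rarely fires.
            "route_to_cheap": (drop <= max_quality_drop
                               and escalation_rate <= max_escalation_rate),
        }
    return report


# Made-up sample data, purely to show the shape of the input and output.
samples = [
    {"task_class": "summary", "frontier_score": 0.90, "cheap_score": 0.89, "escalated": False},
    {"task_class": "summary", "frontier_score": 0.92, "cheap_score": 0.91, "escalated": False},
    {"task_class": "legal", "frontier_score": 0.95, "cheap_score": 0.70, "escalated": True},
]
for task_class, stats in routable_classes(samples).items():
    print(task_class, stats["route_to_cheap"])
```

A class with a high escalation rate is paying for two calls on many requests, so it can end up costing more than leaving it on the stronger model. Adding a latency field to the same report is a natural extension.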

Build vs buy

You can build this yourself with a classifier in front of a few provider SDKs, plus the eval and fallback logic above. It is a reasonable weekend prototype and a real project to run in production.

The other option is a gateway that sits in front of the providers and does the routing for you. That is the part of the problem I work on day to day at Coworker, where the LLM gateway routes each task across OpenAI, Anthropic, Google, and open models and connects to the tools a request actually needs. Either way, the lever is the same: stop sending easy work to expensive models.

If you just want to sanity-check your own spend before changing anything, we put the per-model 2026 pricing into a free LLM cost calculator so you can plug in your token volumes and see the spread for yourself.

The takeaway

The single biggest AI cost win for most teams is not a smaller context window or a prompt tweak. It is admitting that most requests do not need your best model, then routing accordingly. Measure quality per task class, set a fallback, and let price do the rest.

What are you routing on in production: task complexity, intent, something else? Curious how other people are drawing the line.