Stop hand-picking an LLM per request: a practical case for auto-routing

#ai #api #llm #webdev

Most LLM features ship with the model name hardcoded. You picked it once — usually the strongest one you could justify — and now every request, trivial or gnarly, hits the same expensive model. The easy ones overpay; if you down-picked to save money, the hard ones quietly degrade. You're paying the frontier price for "reformat this list," or shipping a weak answer on "find the bug in this trace."

Routing per request fixes the mismatch: classify each request's difficulty, then send it to the cheapest model in your quality tier that can actually handle it.

What "difficulty" can mean in practice

You don't need a research model to route well. Cheap, legible signals get you most of the way:

Input shape and length. A 30-token reformat is not a 6,000-token "reason over this codebase" request.
Task type keywords / structure. Extraction and classification skew easy; multi-step reasoning, code debugging, and math skew hard.
A tiny classifier up front. A small, fast model can score difficulty in a few milliseconds and a fraction of a cent, then hand off to the right tier.

The router doesn't have to be perfect. It has to be better than a single hardcoded choice — which is a low bar, because a hardcoded choice is wrong for half your distribution by construction.

Where routing misfires (and why you must plan for it)

Anyone who's run this in production will tell you the failure modes are the interesting part:

Deceptively short hard prompts. "Prove this is NP-complete" is 5 tokens and very hard. Length-only routing sends it to a weak model and you get a confident-wrong answer.
Misclassification is silent. A bad route doesn't error — it just returns a worse answer. Without eval, you won't see it.
Tier boundaries are fuzzy. Requests near the easy/hard line will flip routes run-to-run, making behavior feel non-deterministic.

So routing is not "set and forget." It needs guardrails.

Keeping it safe

Route within a quality floor, never below it. Let the user pick a tier; the router chooses within it, so the worst case is still acceptable. Never silently drop to a model the user wouldn't accept.
Bias toward the stronger model on uncertainty. When the difficulty score is ambiguous, round up. A few cents of overspend beats a wrong answer.
Make routing decisions observable. Log which model served each request and why. You can't debug or tune a router you can't see.
Eval the routes, not just the models. Periodically check that easy-routed requests would've gotten the same answer from the strong model, and that hard-routed ones aren't degrading.

What it buys you

When it's tuned, you stop overpaying on the (usually majority) easy traffic without down-grading the hard tail — and you stop hand-maintaining a model choice that drifts out of date every time providers ship something new. The routing logic lives in one place instead of smeared across feature code.

Takeaway

Hardcoding one model per feature optimizes for nothing — it's a coin flip that's wrong for half your request distribution. Difficulty-based routing within a quality tier, with an "round up when unsure" bias and real observability, is a better default. I've built this into a small OpenAI-compatible gateway called Modelis (send model: "auto", it routes within your tier and bills a flat per-call price, free tier) at modelishub.com — but you can build the same idea yourself with a small front classifier. I'd love to hear the nastiest "short prompt, secretly hard" example you've hit — those are the routing killers.