Here is something most LLM tutorials quietly ignore. Every request in your app gets sent to the same model, at the same per-token price, with the same latency profile. It does not matter if someone asks "what is the capital of France?" or "summarise this 40-page legal contract". Same model. Same price. Same wait.
That is not how you would design anything else in your stack. You would not spin up a GPU instance to serve a static HTML page. So why are we treating all LLM queries as equally complex?
The answer is usually that routing feels complicated. But it genuinely isn't. This article shows you a simple pattern that takes about 20 lines of Python to implement and can cut your API spend significantly while keeping quality where it matters.
The idea
Not all queries are equal. Some are simple lookups or rewrites that a small, fast, cheap model handles just fine. Others actually need the reasoning depth of a frontier model. The router pattern sits in front of your LLM calls and makes that decision before any tokens are generated.
User Query
    |
    v
[ Router: classify complexity ]
    |
    |--- simple ---> cheap model (e.g. Haiku, Gemini Flash, Llama 3.1 8B)
    |
    |--- complex --> capable model (e.g. Sonnet, GPT-4o, Gemini Pro)
Simple. But the impact compounds fast at scale.
A basic implementation
Let's build this in Python. We will use the Anthropic SDK, but the same pattern works with any provider.
First, install the SDK:
pip install anthropic
Now the router itself. The simplest version classifies a query before routing it:
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

CHEAP_MODEL = "claude-haiku-4-5-20251001"
CAPABLE_MODEL = "claude-sonnet-4-6"


def classify_query(query: str) -> str:
    """Returns 'simple' or 'complex'."""
    response = client.messages.create(
        model=CHEAP_MODEL,
        max_tokens=10,  # one word is all we need back
        system=(
            "You classify user queries. "
            "Reply with only one word: simple or complex. "
            "Simple = factual lookups, rewrites, short summaries. "
            "Complex = reasoning, analysis, multi-step tasks, long documents."
        ),
        messages=[{"role": "user", "content": query}],
    )
    # Tolerate stray whitespace, capitalisation, or a trailing full stop.
    label = response.content[0].text.strip().lower().rstrip(".")
    # Anything unexpected goes to the capable model: overspending on one
    # query is safer than under-answering it.
    return label if label in ("simple", "complex") else "complex"


def routed_call(query: str) -> str:
    complexity = classify_query(query)
    model = CAPABLE_MODEL if complexity == "complex" else CHEAP_MODEL
    print(f"[router] complexity={complexity}, model={model}")
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
Then you call it like this:
answer = routed_call("What is the boiling point of water?")
print(answer)

answer = routed_call(
    "Analyse the trade-offs between microservices and monoliths "
    "for a team of five engineers building a B2B SaaS product."
)
print(answer)
Output:
[router] complexity=simple, model=claude-haiku-4-5-20251001
100 degrees Celsius at standard atmospheric pressure.
[router] complexity=complex, model=claude-sonnet-4-6
For a team of five engineers building a B2B SaaS product...
Why this works
The classification call itself is cheap. You are using the fast model to decide which model to use, and it reads the query but writes back a single word, so even the routing decision costs very little. If 60 to 70 percent of your queries are genuinely simple (a plausible share for many apps, but one to check against your own traffic), you are running the expensive model only when it actually earns its cost.
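To make that concrete, here is a back-of-the-envelope sketch of the arithmetic. The per-query costs are made-up placeholders, not real provider pricing, and `blended_cost` is a hypothetical helper, so plug in your own numbers before drawing conclusions.

```python
def blended_cost(
    n_queries: int,
    simple_share: float,
    cheap_cost: float,
    capable_cost: float,
    classify_cost: float,
) -> float:
    """Estimated spend with routing. Every query pays for the classifier call."""
    simple = n_queries * simple_share
    complex_ = n_queries - simple
    return (
        n_queries * classify_cost
        + simple * cheap_cost
        + complex_ * capable_cost
    )


# Hypothetical average cost per query in dollars, for illustration only.
N = 100_000
baseline = N * 0.010  # everything goes to the capable model
routed = blended_cost(
    N,
    simple_share=0.60,
    cheap_cost=0.001,
    capable_cost=0.010,
    classify_cost=0.0002,
)
savings = 1 - routed / baseline
print(f"baseline=${baseline:,.2f} routed=${routed:,.2f} savings={savings:.0%}")
```

Under these invented numbers routing roughly halves the bill. The classifier is the small fixed term every query pays, which is why it needs to stay cheap.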
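There is a second benefit people overlook. Latency. The small models are significantly faster, so for the simple majority of your queries, users usually get a response much quicker, even after the extra classification round trip. Complex queries pay that round trip on top of the capable model's own latency, which is the price of routing. Perceived speed matters more than most people realise for product quality.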
Making the router smarter without overengineering it
The binary simple/complex split is a starting point. You can extend it in a few ways without the router becoming its own maintenance burden.
One pattern is keyword-based pre-screening before you even make an API call:
SIMPLE_SIGNALS = (
    "what is", "define", "translate", "fix the typo",
    "rewrite this", "summarise in one sentence",
)


# The `str | None` annotation needs Python 3.10 or newer.
def fast_prescreen(query: str) -> str | None:
    """Returns 'simple' immediately if we can tell from keywords alone."""
    q = query.strip().lower()
    # str.startswith accepts a tuple of prefixes.
    if q.startswith(SIMPLE_SIGNALS):
        return "simple"
    return None  # unclear, fall through to the classifier


def routed_call_v2(query: str) -> str:
    # Only pay for the classifier when the prescreen cannot decide.
    complexity = fast_prescreen(query) or classify_query(query)
    model = CAPABLE_MODEL if complexity == "complex" else CHEAP_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
The prescreen avoids the classification API call entirely for obviously simple queries. For everything else it falls back to the classifier. Two layers, each only doing what is necessary.
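One caveat: prefix matching is crude. "What is the best database architecture for a multi-tenant product?" also starts with "what is". Here is a minimal sketch of one possible guard, assuming that short queries are more likely to be genuinely simple: skip the prescreen for anything over a length threshold. `guarded_prescreen` and `MAX_PRESCREEN_CHARS` are hypothetical names, and 120 characters is an arbitrary starting point, not a tested value.

```python
from typing import Optional

SIMPLE_SIGNALS = (
    "what is", "define", "translate", "fix the typo",
    "rewrite this", "summarise in one sentence",
)

# Hypothetical threshold: tune it against your own routing logs.
MAX_PRESCREEN_CHARS = 120


def guarded_prescreen(query: str) -> Optional[str]:
    """Returns 'simple' only for short queries that start with a known signal."""
    q = query.strip().lower()
    if len(q) > MAX_PRESCREEN_CHARS:
        return None  # too long to trust a keyword match, let the classifier decide
    if q.startswith(SIMPLE_SIGNALS):
        return "simple"
    return None
```

This drops in wherever `fast_prescreen` is called. Long queries simply fall through to the classifier, so the worst case is one extra cheap API call.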
What this pattern does not solve
Worth being honest about the limits.
The classifier is not perfect. Some queries look complex but aren't, and vice versa. You will want to log which model handled which queries and spot-check the routing decisions early on. If you find the classifier is consistently wrong on a category of queries, add examples of that category to the system prompt or build a small lookup table that routes those patterns directly.
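The logging and spot-checking can start very small. The sketch below appends each routing decision to a JSON Lines file and pulls a random sample for manual review. The file name and both function names are hypothetical, and storing raw queries may capture personal data, so adapt it to your own privacy requirements.

```python
import json
import random
import time
from pathlib import Path

LOG_PATH = Path("routing_log.jsonl")  # hypothetical location


def log_routing_decision(
    query: str, complexity: str, model: str, path: Path = LOG_PATH
) -> None:
    """Append one routing decision as a single JSON line."""
    record = {
        "ts": time.time(),
        "query": query,
        "complexity": complexity,
        "model": model,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def sample_for_review(path: Path = LOG_PATH, k: int = 20, seed=None) -> list:
    """Return up to k randomly chosen decisions to check by hand."""
    with path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

Calling `log_routing_decision(query, complexity, model)` right after the routing decision is enough to start building a picture of your real query distribution.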
Also, this pattern assumes you have two tiers that are meaningfully different in cost and capability. If you are only using one provider and all their models are similarly priced, the benefit is mostly latency, not cost.
The bigger point
The routing pattern is one example of a broader principle that is becoming clearer as LLM applications mature. The model is not the product. The system around the model is the product.
Choosing which model to call, when to retry, how to validate the output, how to handle failures gracefully: these are engineering decisions that compound over time. A well-designed system with average models will outperform a poorly designed system with great models almost every time.
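As one illustration of those decisions, here is a provider-agnostic sketch that tries models in order and escalates when a call raises or the output fails a validation check. `call_with_escalation` is a hypothetical helper, `call_model` is any function you supply that takes a model name and a query, and the default validity check (non-empty text) is a placeholder for whatever your app actually needs.

```python
from typing import Callable, Optional, Sequence


def call_with_escalation(
    call_model: Callable[[str, str], str],
    query: str,
    models: Sequence[str],
    is_valid: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> str:
    """Try each model in order; move on after an exception or invalid output."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            answer = call_model(model, query)
        except Exception as exc:  # in real code, catch your SDK's specific errors
            last_error = exc
            continue
        if is_valid(answer):
            return answer
    raise RuntimeError("all models failed or returned invalid output") from last_error
```

Passing the models cheapest-first means the capable model is only reached when the cheap one errors out or produces something unusable.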
The teams shipping reliable AI products in 2026 are not the ones with access to the best models. They are the ones who stopped treating LLM calls as black boxes and started treating them like any other network dependency worth engineering around.
Start with the router. See what your query distribution actually looks like. Then decide what to optimise next.