- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You look at last month's LLM spend and the line item that hurts is not the hard cases. It is the easy ones. The "hi", the "thanks", the "what's my balance" that you are paying flagship-tier prices to handle because you wired the whole product to a single model ID and never gave it anywhere else to go. Every easy request rides the same expensive lane as the hard ones, and the bill reflects that.
Routing fixes this. Not load-balancing across providers, not failover, not a feature-flag dance. Routing in the sense your CDN routes traffic: each request gets sent to the cheapest model that can answer it correctly, and the rest go to the strong model. On a B2B customer-support workload with a heavy short-question tail, sending the bottom 30% of traffic (by complexity) to a smaller model can cut spend by roughly a third without moving quality on the eval set. Treat the numbers in this post as illustrative starting targets you tune to your own evals. The interesting part is how you decide which 30%.
Four router patterns show up over and over. They have different cost-to-build, different blast radius when wrong, and different ceilings on how much traffic they can divert. Pick the wrong one and you either underperform or you ship a quality regression that takes a week to spot.
The four router patterns
In rough order of complexity to build:
- Length cutoff. If len(input) < N, send to the cheap model. Else flagship.
- Intent classifier. A tiny model (or even a regex) labels each request, and a static map sends each label to a model.
- Embedding-clustering router. Embed the request, look up the nearest cluster, route by cluster's historical pass rate on the cheap model.
- Cascading. Try the cheap model first. If its confidence (or a judge's verdict) is below a threshold, fall back to the flagship and serve that answer instead.
Length cutoff is a one-liner. Cascading needs evals, telemetry, and a confidence signal it can trust. The bigger the workload, the further down the list it pays to go.
Pattern 1: input-length cutoff
The cheapest router. One conditional. The premise: a 30-token question is rarely a hard one, and a 6,000-token question with three attached PDFs almost certainly is. The premise is mostly true, which is enough.
HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-5"

def pick_model(messages: list[dict]) -> str:
    input_chars = sum(
        len(m.get("content", "")) for m in messages
    )
    if input_chars < 400:
        return HAIKU
    return SONNET
Four hundred characters is roughly a paragraph. Below that, the request is probably a greeting, an acknowledgement, a one-line lookup. Above that, you are into territory where the strong model earns the price difference.
The reason this works for some workloads: input-length distribution on most B2B chat products is heavy-tailed. The mode is short. If the head of your distribution is fat — and on most workloads it is — a length cutoff alone moves real volume to the cheap lane without touching anything else.
The reason it breaks: short does not always mean easy. "Reverse a binary tree in Rust" is short and hard. "Translate this ten-thousand-word doc into bullet points" is long and easy. Length only proxies difficulty. On a workload where the proxy holds, you ship the one-liner. On one where it doesn't, you climb the list.
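Whether the proxy holds is measurable before you write the conditional. A quick sketch, assuming sample_requests is a list of raw request strings pulled from your own logs (the name and the 400-character cutoff are placeholders to swap for your own):

import numpy as np

# sample_requests: a recent sample of real request texts from your logs.
lengths = np.array([len(r) for r in sample_requests])

# Share of traffic a 400-character cutoff would send to the cheap lane,
# and where the cutoff would need to sit to divert 30% of traffic.
short_share = float((lengths < 400).mean())
cutoff_for_30pct = float(np.percentile(lengths, 30))

print(f"share under 400 chars: {short_share:.0%}")
print(f"cutoff that diverts 30%: {cutoff_for_30pct:.0f} chars")

If short_share already lands in the 30–40% range, Pattern 1 is worth a canary; if it is single digits, skip straight to Pattern 2.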
Pattern 2: intent classifier as a router
The next rung. Train (or, more often, write) a tiny classifier that labels each request before it touches the LLM. The label is the routing key.
A workable classifier on a support workload is rarely a fine-tuned model. It is a prompt to a fast small model (Haiku, GPT-4o-mini, a 3B local model) that returns a single label from a fixed set.
INTENTS = [
    "greeting",
    "account_lookup",
    "billing_question",
    "technical_debug",
    "complex_analysis",
]
CLASSIFIER_PROMPT = """
Label the user request with exactly one of:
greeting, account_lookup, billing_question,
technical_debug, complex_analysis.
If unsure between two labels, pick the one
that implies a harder request.
Reply with the label only. No punctuation.
Request:
{request}
""".strip()
def classify(client, request: str) -> str:
    resp = client.messages.create(
        model=HAIKU,
        max_tokens=8,
        temperature=0,
        messages=[{
            "role": "user",
            "content": CLASSIFIER_PROMPT.format(
                request=request
            ),
        }],
    )
    label = resp.content[0].text.strip().lower()
    if label not in INTENTS:
        return "complex_analysis"
    return label
The map from label to model is data, not code:
ROUTING_TABLE = {
    "greeting": HAIKU,
    "account_lookup": HAIKU,
    "billing_question": HAIKU,
    "technical_debug": SONNET,
    "complex_analysis": SONNET,
}

def pick_model(client, request: str) -> str:
    return ROUTING_TABLE[classify(client, request)]
You pay one cheap classifier call per request, then route. The classifier itself is cheap enough that the math works as long as the cheap-lane model is at least 5× cheaper than the strong one. Based on Anthropic's and OpenAI's public per-token pricing as of 2026, both Haiku-vs-Sonnet and GPT-4o-mini-vs-GPT-4o clear that bar comfortably; verify the live numbers on each provider's pricing page before you commit to a router design.
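To sanity-check that 5× bar for your own pair of models, the per-request arithmetic fits in a few lines. A sketch with illustrative relative costs — the ratios below are placeholders, not anyone's published prices:

# Costs expressed relative to one flagship call = 1.0. Illustrative only;
# plug in your own per-token prices and average request sizes.
STRONG_COST = 1.00      # flagship answer
CHEAP_COST = 0.10       # cheap-lane answer
CLASSIFIER_COST = 0.02  # tiny classifier call, few output tokens

def routed_cost(cheap_share: float) -> float:
    # Every request pays the classifier, then exactly one model call.
    return (
        CLASSIFIER_COST
        + cheap_share * CHEAP_COST
        + (1 - cheap_share) * STRONG_COST
    )

print(routed_cost(0.30))  # ~0.75 vs 1.00 for all-flagship

The closer the cheap and strong prices sit, the more the flat classifier overhead eats the savings; that is where the 5× rule of thumb comes from.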
The ceiling on this pattern is the classifier's accuracy. A label miss in either direction has a different cost. Routing a hard request to the cheap lane gives you a wrong answer; routing an easy request to the flagship costs you the savings on that request. Both are real, but the first one is the one users notice. Bias the prompt toward "complex_analysis" on ambiguity. Better to overpay than to underdeliver.
Pattern 3: embedding-cluster router
A step further. Instead of asking a model to label every request, embed the request and look up its nearest cluster in a precomputed map. Each cluster has a historical pass rate on the cheap model, measured offline by replaying past traffic and judging the answers. If the cluster's pass rate is high, route to cheap. If it is low, route to flagship.
The expensive part is the offline labelling: you replay a sample of past requests through both models, judge each pair, and aggregate by cluster.
import numpy as np

# Loaded at startup. Built offline:
# - embed N past requests
# - cluster (KMeans, HDBSCAN, whatever)
# - for each cluster, replay through HAIKU + SONNET
# - judge with a stronger judge model
# - record cluster -> haiku_pass_rate
CLUSTER_CENTROIDS: np.ndarray  # shape (k, dim)
CLUSTER_PASS_RATE: np.ndarray  # shape (k,)

PASS_THRESHOLD = 0.92

def embed(client, text: str) -> np.ndarray:
    # client here is an OpenAI-style embeddings client.
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding)

def pick_model(client, request: str) -> str:
    vec = embed(client, request)
    # The embeddings come back unit-norm, so a dot product against
    # normalised centroids is a cosine similarity.
    sims = CLUSTER_CENTROIDS @ vec
    cluster = int(np.argmax(sims))
    if CLUSTER_PASS_RATE[cluster] >= PASS_THRESHOLD:
        return HAIKU
    return SONNET
The runtime cost is one embedding call (cheap) plus a dot product. The build cost is a one-off offline run that produces the centroids and the per-cluster pass rate.
The pattern shines on workloads where intent is too coarse a signal. A "billing_question" might be safe on Haiku for one customer's billing schema and unsafe for another. Clustering captures the second axis without you naming it.
It breaks when the input distribution drifts. Yesterday's clusters describe last quarter's traffic. Either re-cluster on a schedule (weekly is usually enough) or attach a "below pass-threshold drift" alert to the offline pipeline.
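A rough sketch of that offline build, assuming scikit-learn's KMeans plus two hypothetical helpers you would write against your own stack — embed_batch for batch embeddings and judge_pass for the replay-and-judge step on the cheap model:

import numpy as np
from sklearn.cluster import KMeans

def build_cluster_map(past_requests: list[str], k: int = 50):
    # 1. Embed a sample of past traffic and normalise it.
    vectors = embed_batch(past_requests)  # shape (n, dim), hypothetical helper
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    # 2. Cluster it.
    km = KMeans(n_clusters=k, n_init="auto").fit(vectors)

    # 3. Replay each request through the cheap model and judge the answer;
    #    judge_pass (hypothetical) returns True when the cheap answer passes.
    passed = np.array([judge_pass(r) for r in past_requests])

    # 4. Per-cluster pass rate on the cheap model.
    pass_rate = np.array([
        passed[km.labels_ == c].mean() if (km.labels_ == c).any() else 0.0
        for c in range(k)
    ])
    return km.cluster_centers_, pass_rate

Run it on a schedule and diff the pass rates against last week's; a cluster whose pass rate falls through the threshold is your drift alert.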
Pattern 4: cascading
The strongest pattern. Try the cheap model first. If you cannot trust the answer, fall back to the strong one.
The hard part is "trust." You need a confidence signal. Three options, in order of how much they cost you:
- The cheap model's own self-rated confidence. Ask it to score its answer 0–10 in the same call. Cheap, but the model is biased toward overconfidence.
- A second cheap model as a judge. Pass the question and the cheap model's answer to a tiny judge prompt: "is this answer likely correct?" You pay for two cheap calls instead of one and the answerer's bias does not leak in.
- A rule-based check (JSON-schema validation on a tool call, regex-match on an expected output shape, hash-match against a known-good RAG answer) costs nothing at runtime and only applies when the output is structured.
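For structured outputs, the third option can be as small as a parse-and-check. A minimal sketch, assuming the cheap model was asked for a JSON object with a known set of required fields (the field names are placeholders):

import json

REQUIRED_FIELDS = {"account_id", "balance", "currency"}  # placeholder schema

def confident_structured(answer: str) -> bool:
    # Rule-based confidence: the answer must parse as JSON and carry
    # every required field. No extra model call at runtime.
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed.keys())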
A workable cascading wrapper:
def cascade(client, request: str) -> dict:
    cheap = client.messages.create(
        model=HAIKU,
        max_tokens=512,
        temperature=0,
        messages=[{"role": "user", "content": request}],
    )
    cheap_answer = cheap.content[0].text
    if confident(client, request, cheap_answer):
        return {
            "model": HAIKU,
            "text": cheap_answer,
            "fellback": False,
        }
    strong = client.messages.create(
        model=SONNET,
        max_tokens=1024,
        temperature=0,
        messages=[{"role": "user", "content": request}],
    )
    return {
        "model": SONNET,
        "text": strong.content[0].text,
        "fellback": True,
    }
confident is the part you tune. A judge-call version that returns a boolean:
JUDGE_PROMPT = """
A user asked the question below. An assistant
gave the answer. Reply YES if the answer is
likely correct, complete, and free of obvious
errors. Reply NO otherwise. One word.
Question: {q}
Answer: {a}
""".strip()
def confident(client, q: str, a: str) -> bool:
    resp = client.messages.create(
        model=HAIKU,
        max_tokens=4,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(q=q, a=a),
        }],
    )
    return resp.content[0].text.strip().upper().startswith("Y")
The cost math depends on the fallback rate. If the cheap model can answer 70% of requests acceptably, your spend is 0.7 × cheap + 0.3 × (cheap + strong). That simplifies to cheap + 0.3 × strong, plus a judge call per request if you use one. Per Anthropic's public pricing as of 2026, Haiku sits at roughly a tenth of Sonnet's per-token rate (check the live numbers before you commit), which puts the cascading bill at around 40% of an all-flagship workload. As the fallback rate climbs, the margin shrinks: at a 10× price gap the strict break-even with all-flagship sits near a 90% fallback rate (a little lower once the judge call is counted), but well before that you are paying both models and double the latency on most requests, and the cheap-first dance stops being worth the complexity.
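The same arithmetic as a sketch, with the price ratios left as numbers you replace from the live pricing page:

# Costs relative to one flagship call = 1.0. The ratios are illustrative.
STRONG = 1.00
CHEAP = 0.10   # cheap answer attempt
JUDGE = 0.05   # judge call on the cheap answer

def cascade_cost(fallback_rate: float) -> float:
    # Every request pays the cheap attempt plus the judge;
    # the fallback share also pays the flagship.
    return CHEAP + JUDGE + fallback_rate * STRONG

print(cascade_cost(0.30))                 # ~0.45 of an all-flagship bill
print((STRONG - CHEAP - JUDGE) / STRONG)  # break-even fallback rate, 0.85 here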
That break-even point is what you watch on the dashboard.
The eval that decides whether you ship
None of these patterns are safe to ship without a quality check, and the check is the same regardless of which router you pick. You need a comparison-judge eval that runs both lanes on the same input set.
The shape:
- Pick a representative sample of past requests. A few thousand is usually enough; a few hundred if traffic is small.
- Run each request through both the flagship-only path and the routed path.
- Pass each pair (flagship-path-answer, routed-path-answer) to a stronger judge model. Ask for a verdict: "tied", "routed-path better", "routed-path worse".
- Aggregate. If the routed path wins or ties on, say, 95% of cases, you can ship. If it loses on more than 5%, your router is wrong somewhere — usually the threshold.
Pick the threshold the product can defend. A 1% regression on a billing answer is not the same as a 1% regression on a greeting. Weight the judge eval by category if the categories carry different weight in production.
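A minimal sketch of that comparison loop, using the same Anthropic-style client as the snippets above and assuming you have already collected (question, flagship-path answer, routed-path answer) triples; the prompt wording and verdict labels are illustrative:

PAIR_JUDGE_PROMPT = """
A user asked the question below. Two answers follow.
Reply with exactly one of: TIE, B_BETTER, B_WORSE —
judging answer B (the routed path) against answer A
(the flagship-only path).
Question: {q}
Answer A: {a}
Answer B: {b}
""".strip()

def judge_pair(client, q: str, flagship_answer: str, routed_answer: str) -> str:
    resp = client.messages.create(
        model=SONNET,  # judge with the stronger model
        max_tokens=8,
        temperature=0,
        messages=[{
            "role": "user",
            "content": PAIR_JUDGE_PROMPT.format(
                q=q, a=flagship_answer, b=routed_answer
            ),
        }],
    )
    return resp.content[0].text.strip().upper()

def regression_rate(client, samples: list[dict]) -> float:
    # samples: [{"q": ..., "flagship": ..., "routed": ...}, ...]
    verdicts = [
        judge_pair(client, s["q"], s["flagship"], s["routed"])
        for s in samples
    ]
    return verdicts.count("B_WORSE") / len(verdicts)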
What to instrument before you ship
Five numbers, on a dashboard, before any of these patterns counts as shipped:
- Diversion rate. Share of requests that went to the cheap model. The number you are trying to push up.
- Fallback rate (cascading only). Share of cheap-lane attempts that fell back to flagship. If it climbs, your threshold is too tight or the cheap model regressed.
- Cost per logical request. (cheap_cost + strong_cost) / requests. The number that pays for the work.
- Quality regression rate. Share of routed-path answers that the offline judge marks as worse than the flagship-only path. Run weekly on fresh traffic; alert if it drifts past the threshold you committed to.
- Latency p95 by lane. Cheap should be faster than strong; if it isn't, the routing overhead is eating the win.
The LLM Observability Pocket Guide covers the OTel GenAI semantic-convention attributes that make this dashboard cheap to build (gen_ai.request.model, gen_ai.usage.*, and a routing.lane custom attribute that lets every query group cleanly by which model handled the call).
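Tagging the span is a few lines with the OpenTelemetry API, assuming a tracer provider is already configured for the service; the span name and the routing.lane values are conventions you pick, not part of the spec:

from opentelemetry import trace

tracer = trace.get_tracer("router")

def routed_call(client, request: str) -> str:
    model = pick_model(client, request)
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("routing.lane", "cheap" if model == HAIKU else "strong")
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": request}],
        )
        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
        return resp.content[0].text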
Where the math actually lands
A note on the headline. "Send 30% of traffic to a cheaper model" is a starting target. The destination depends on your workload: the right number is the largest share you can divert before quality regressions show up on the eval. For a workload with a fat head (support chat, intent classification, simple document Q&A), that share often climbs to 50–60%. Workloads of mostly hard requests — code generation, multi-step reasoning, long-context analysis — sit at 10% or zero. The patterns above are how you find the number; the number is yours.
The other note: cheap-and-strong is a two-tier framing for clarity. In practice, three-tier (Haiku, Sonnet, Opus) or even four-tier routing pays off once your traffic is large enough to justify the operational overhead. The same patterns apply, with one extra threshold per tier.
What to do with this on Monday
Pick the cheapest pattern that fits. If your input-length distribution has a fat short head and you have not started, ship Pattern 1 behind a 5% canary today. Watch the quality regression number for a week.
If length is not a clean signal (most chat workloads), ship Pattern 2. Three to five intent labels is the usual sweet spot; you do not need a fine-tuned classifier.
If your traffic is large enough that Patterns 1 and 2 leave money on the table, build the offline replay-and-judge pipeline first, then layer Pattern 3 or 4 on top. The pipeline is the load-bearing part. Without it, every router decision is a guess and every cost win is on borrowed time.
The savings compound with the rest of the cost stack: response caching for repeats, prompt caching for shared prefixes, and summarising long sessions before the next turn runs. Routing is the layer that decides which model gets the request in the first place. Get it right and the other layers pay off on a smaller, smarter base.
If this was useful
The LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team covers the offline eval pipeline this post leans on: how to score routed-path answers against a judge, why routing.lane is the attribute the dashboard hangs off, and the wiring that catches a quality regression before the bill does. The multi-model attribution material pairs directly with the cascading pattern: which spans to record, which token-class fields to sum, and how to keep the day-over-day diversion-rate graph honest when traffic shape moves.
