Ravi Patel

Posted on Jun 10 • Originally published at ssimplifi.com

Model routing by task type: the savings math, the classifier overhead, and the A/B that proves it

#llm #routing #taskclassifier #costoptimization

The case for task-type routing reduces to one observation: no single LLM dominates the cost-quality frontier across all workloads, so paying frontier prices for tasks a small model handles competently is structural waste. Most production applications run on a single model because that's the default for simplicity, and the savings from routing — typically 40-60% of total LLM cost, no quality regression — sit unrealised in plain sight. This post walks through the math: per-task savings arithmetic, the classifier overhead (it's negligible — 5-20ms vs model calls that take 500-2000ms), and the A/B framework that proves quality didn't regress when you flipped the routing on. Written for engineers actively designing or evaluating a routing layer.

The parent guide LLM cost reduction covers all 14 cost-reduction techniques; this article goes deep on technique #3 (routing) specifically.

The price gap that creates the wedge

The relevant fact about the LLM model catalog in 2026 is the size of the per-tier price gap.

Tier	Example models	Approx input price ($/M tokens)	Approx output price ($/M tokens)
Small / fast	GPT-5.4-mini, Claude Haiku 4.5, Gemini 3 Flash, Groq Llama 8B	$0.05–$1.00	$0.08–$5.00
Mid	Mistral Medium 3.5, DeepSeek V4-Flash	$0.50–$1.50	$2.00–$7.50
Large	GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro, DeepSeek V4-Pro	$1.74–$3.00	$3.48–$15.00
Frontier	Claude Opus 4.7, GPT-5.5	$5.00–$15.00	$25.00–$75.00

The gap from small to frontier is roughly 20-100x depending on which models you compare. Sending a "simple Q&A" task to a frontier model when a small model would have produced an equivalent answer means paying 20-100x more for the same outcome. Multiply that across a meaningful production volume and the dollar number is real.

The wedge isn't that frontier models are bad. It's that simple tasks don't need frontier capability — and small models handle simple tasks well. The job of task-type routing is to send each request to the right tier for its actual complexity.

What "task type" actually means

The taxonomy that works in production is small. Most working systems use four categories:

simple — direct Q&A, extraction, formatting, classification, translation. The model isn't reasoning; it's retrieving or transforming.
code — code generation, code review, code explanation, debugging. Specialised models (code-focused fine-tunes) often outperform general models in this category at lower cost.
reasoning — multi-step logical inference, math, planning, analysis. The category where frontier models earn their price.
complex — long-context analysis, multi-document synthesis, intricate research. Frontier territory; long-context-specialised models also fit here.

Some teams add categories (e.g. creative for content generation, conversational for open-ended chat). Most production deployments stop at 4-6 categories because:

More categories make the classifier less reliable
More categories make the routing table harder to maintain
The 4 above capture roughly 90%+ of variation that matters for routing decisions

The pillar guide LLM cost reduction and the glossary task-type routing cover the taxonomy framing in more depth.

The per-task savings arithmetic

Walk through a concrete example. Suppose your application receives 50,000 requests per day with the following task-type mix:

Task type	% of traffic	Single-model cost (all-GPT-5.4)	Routed cost	Saving
simple	50%	25K req × $0.0125/req = $313/day	25K × $0.00375/req (gpt-5.4-mini) = $94/day	$219/day
code	20%	10K req × $0.0125 = $125/day	10K × $0.0075/req (codestral) = $75/day	$50/day
reasoning	20%	10K req × $0.0125 = $125/day	10K × $0.0125/req (gpt-5.4, no swap) = $125/day	$0/day
complex	10%	5K req × $0.0125 = $63/day	5K × $0.0125/req (gpt-5.4) = $63/day	$0/day
Total	100%	$625/day	$356/day	$269/day (43% saving)

The traffic mix and per-request costs above are illustrative, industry-typical figures rather than measured production data.

The savings concentrate in the simple-task slice by design. That's the slice where the price gap between the mini model and the single-model default is largest, and where small models handle the task well enough that quality regression is minimal or zero. The reasoning and complex slices stay on the larger model because that's where its price is earned; they contribute no savings in this example because the model choice doesn't change.

The total impact depends on the task mix. Workloads heavy on simple tasks (~70% simple) see the largest absolute savings; workloads dominated by reasoning (~70% reasoning) see less because routing has less to optimise. Most production workloads land somewhere in the middle, with 40-60% routing-driven savings as the typical band.
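The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch using the illustrative figures above (the simple-task cost is taken as $0.00375 per request, which matches the $94/day in the table); the names `TASK_MIX` and `daily_costs` are made up for this example.

```python
# Illustrative figures from the table above; swap in your own mix and costs.
DAILY_REQUESTS = 50_000
BASELINE_COST_PER_REQ = 0.0125  # dollars per request on the single-model setup

# task type -> (share of traffic, routed cost per request in dollars)
TASK_MIX = {
    "simple": (0.50, 0.00375),
    "code": (0.20, 0.0075),
    "reasoning": (0.20, 0.0125),
    "complex": (0.10, 0.0125),
}


def daily_costs(requests, baseline_cost, mix):
    """Return (single_model_cost, routed_cost) in dollars per day."""
    baseline = requests * baseline_cost
    routed = sum(requests * share * cost for share, cost in mix.values())
    return baseline, routed


baseline, routed = daily_costs(DAILY_REQUESTS, BASELINE_COST_PER_REQ, TASK_MIX)
saving = baseline - routed
print(
    f"single-model ${baseline:.2f}/day, routed ${routed:.2f}/day, "
    f"saving ${saving:.2f}/day ({saving / baseline:.0%})"
)
```

Changing the shares in `TASK_MIX` shows how quickly the saving shrinks as the reasoning and complex slices grow.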

The classifier overhead (it's negligible)

The argument against routing is usually "the classifier adds latency and cost." Let's quantify it.

Classifier compute cost. A task-type classifier is typically a small fine-tuned model (an 8B-parameter Llama or a similar mini-LM) or an embedding-based similarity score against a labelled corpus. Per-classification cost is roughly $0.00005-$0.0002, or 5-20 cents per thousand classifications. Against model calls that cost 0.1-5 cents each, that works out to roughly 0.4-1.6% of a 1.25-cent call (the per-request cost in the example above) and under 0.5% of a 5-cent call. It only becomes material, at 5-20%, on the very cheapest 0.1-cent calls.

Classifier latency. A small classifier returns in 5-20ms p95, typically running locally or in a sidecar process. Compare that to model calls that take 500-2000ms p95. The classifier overhead is at most about 4% of the request latency budget (20ms against a 500ms call); against the routing savings, that's a clean trade.

Classifier accuracy. Production classifiers running on the 4-category taxonomy land around 88-93% top-1 accuracy on broad-domain traffic. The bulk of errors are adjacent (simple/code or reasoning/complex boundaries), and the routing-table picks for adjacent categories are usually close enough that an adjacent-category error costs little.

The math is one-directional. The classifier costs cents and milliseconds; the routing savings are dollars and seconds. The objection to routing on "overhead" grounds doesn't survive contact with the numbers.
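To check the overhead claim against your own numbers, a sketch like the following is enough. The inputs are assumptions picked from inside the ranges quoted above ($0.0001 per classification, a $0.0125 model call, 10ms against 1000ms); `overhead_share` is a hypothetical helper name.

```python
def overhead_share(classifier_value: float, call_value: float) -> float:
    """Classifier overhead as a fraction of the total (classifier + model call)."""
    return classifier_value / (classifier_value + call_value)


# Cost: a $0.0001 classification in front of a $0.0125 model call.
cost_share = overhead_share(0.0001, 0.0125)
# Latency: a 10 ms classification in front of a 1000 ms model call.
latency_share = overhead_share(10, 1000)

print(f"cost overhead {cost_share:.2%}, latency overhead {latency_share:.2%}")
```

Both shares come out just under 1% for these inputs; rerun with your cheapest and slowest calls to see the edges of the range.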

Where routing goes wrong (and how to prevent it)

The three failure modes you actually have to design for:

Failure mode 1 — Quality regression on edge-of-category tasks

The classifier's job is to pick the right category most of the time. The job of the system around the classifier is to detect when it picked wrong and route differently next time.

The detection mechanism: capture feedback signals per request. Thumbs-down rate, rating distribution, ticket-volume tied to specific responses. When a feature's thumbs-down rate spikes after routing rolled out, audit the cases — usually a specific task type that's been miscategorised.

The fix: route the affected task type to a higher-tier model. Either via an explicit override rule ("requests matching pattern X always route to gpt-5-4") or by retraining the classifier on the surfaced edge cases.

The discipline that keeps this working is closed-loop feedback. Routing without feedback monitoring drifts; routing with it stays calibrated.
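One way to make the closed loop concrete is to compare thumbs-down rates per task type before and after the rollout. The sketch below assumes feedback events arrive as `(task_type, thumbs_down)` pairs and uses a hypothetical alert threshold of two percentage points; neither is prescribed by any particular gateway.

```python
from collections import Counter


def thumbs_down_rates(events):
    """events: iterable of (task_type, thumbs_down) pairs. Returns rate per task type."""
    total, down = Counter(), Counter()
    for task_type, thumbs_down in events:
        total[task_type] += 1
        down[task_type] += int(thumbs_down)
    return {task_type: down[task_type] / total[task_type] for task_type in total}


def regressed_task_types(before, after, max_increase=0.02):
    """Task types whose thumbs-down rate rose by more than max_increase (absolute)."""
    rates_before = thumbs_down_rates(before)
    rates_after = thumbs_down_rates(after)
    return sorted(
        task_type
        for task_type, rate in rates_after.items()
        if rate - rates_before.get(task_type, 0.0) > max_increase
    )
```

The task types this returns are the candidates for an override rule or for extra labelled examples in the next classifier retrain.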

Failure mode 2 — Classifier drift as task mix evolves

A classifier trained on Q1 2026 traffic may not generalise to Q3 2026 traffic. As your user base expands, your feature set grows, or your application use case evolves, the distribution of incoming requests shifts. The classifier's training distribution diverges from the production distribution; accuracy drops; routing decisions get worse.

The mitigation: retrain the classifier on a quarterly cadence using a sampled set of recent production requests with human-labelled task types. Most production deployments rebuild the classifier roughly every 90 days; some bump to monthly if drift is rapid.
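A lightweight drift check can run between scheduled retrains: score the classifier against a freshly labelled sample and flag it when accuracy falls below a floor. This is a sketch; the 0.85 floor is an assumption borrowed from the accuracy range discussed in the FAQ, and the function names are hypothetical.

```python
def top1_accuracy(predicted, labelled):
    """Share of requests where the classifier's label matches the human label."""
    if not labelled or len(predicted) != len(labelled):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == h for p, h in zip(predicted, labelled))
    return matches / len(labelled)


def needs_retraining(predicted, labelled, floor=0.85):
    """True when accuracy on the recent labelled sample has dropped below the floor."""
    return top1_accuracy(predicted, labelled) < floor
```

Running this monthly on a few hundred labelled requests gives an early signal without waiting for the quarterly rebuild.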

Failure mode 3 — Routing-table staleness as model catalogs evolve

The right model for "simple" tasks in early 2026 may not be the right model in mid-2027 because new models launch (cheaper, faster, or higher-quality). A routing table written against the Q1 2026 catalog gets stale as the catalog expands.

The mitigation: benchmark the catalog quarterly. Run a representative prompt set through every model in your routing table; score quality + latency + cost; recalibrate the (task_type, mode) routing-table cells against the current data. The bench is real work (3-5 days of effort per quarter) but it's the only way the routing table stays competitive.
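The recalibration step can be sketched as "cheapest model that clears a quality floor" per task type. The selection rule, the `BenchRow` shape, and the placeholder model names below are assumptions for illustration, not a description of any vendor's benchmark pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchRow:
    model: str
    task_type: str
    quality: float       # 0..1 score from your own evaluation set
    cost_per_req: float  # dollars per request on that set


def pick_model(rows, task_type, quality_floor):
    """Cheapest model at or above the quality floor; best-quality model if none qualifies."""
    in_scope = [row for row in rows if row.task_type == task_type]
    if not in_scope:
        raise ValueError(f"no benchmark rows for task type {task_type!r}")
    qualified = [row for row in in_scope if row.quality >= quality_floor]
    if qualified:
        return min(qualified, key=lambda row: row.cost_per_req).model
    return max(in_scope, key=lambda row: row.quality).model
```

A per-mode floor (lower for eco, higher for sport) would turn this into one call per routing-table cell.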

Prism re-benchmarks quarterly; the v1.7-A benchmark (May 2026) is the most recent calibration of our 23-model catalog.

The A/B framework that proves quality didn't regress

Before routing rolls out to 100% of traffic, you need to know that quality didn't regress on the slices being routed away from the previous default. The framework:

Phase 1 — Shadow routing (1-2 weeks). Route 100% of requests through the routing logic as if it were rolled out but actually dispatch to the existing single-model setup. The routing decisions don't affect production behaviour; you're just collecting per-request "would-have-routed-to-X" labels. Use these labels to spot-check the classifier's accuracy on your specific traffic.

Phase 2 — Canary deployment (1 week). Roll out routing on 5-10% of production traffic. Monitor:

Per-task-type quality signals (thumbs ratio, ratings, customer-reported issues)
Latency distribution (especially p95/p99 — small models should be faster, not slower)
Cost-per-feature dashboard (you should see the bill drop on the routed slice)

If quality signals stay flat or improve, proceed. If they degrade on a specific task type, hold or route that specific task type back to the previous default while you investigate.

Phase 3 — Gradual rollout (2-4 weeks). Increase routing coverage by 10-20% per week. Continue monitoring. Pause if signals degrade; back off the specific failing slice if needed.

Phase 4 — 100% with monitoring (ongoing). Routing applied to all eligible traffic. Quality and cost signals remain on the dashboard. Quarterly review of the routing table against current model catalog + accumulated production feedback.

The total rollout cycle is roughly four to seven weeks before the 100% cutover: long enough to gather meaningful signal at each phase, short enough that the savings start landing within a quarter. Skipping phases is the most common implementation mistake; teams that flip routing on for 100% of traffic on day one often have to roll back when they hit a quality regression and don't know which task type caused it.
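Phase 1 (shadow routing) is the easiest phase to get wrong in code, so here is a minimal sketch. It assumes the routing decision is available as a callable `route_fn` and that shadow records go to an in-memory list; in practice that would be a log stream or table. `DEFAULT_MODEL` and all names here are hypothetical.

```python
import logging

logger = logging.getLogger("shadow_routing")

DEFAULT_MODEL = "gpt-5-4"  # assumption: the existing single-model default


def shadow_route(request_id, user_message, route_fn, shadow_log):
    """Record what routing would have chosen, but always dispatch to the default."""
    try:
        would_route_to = route_fn(user_message)
    except Exception:
        # Shadow logic must never break production traffic.
        logger.exception("shadow routing failed for request %s", request_id)
        would_route_to = None
    shadow_log.append(
        {
            "request_id": request_id,
            "would_route_to": would_route_to,
            "dispatched_to": DEFAULT_MODEL,
        }
    )
    return DEFAULT_MODEL
```

Because the function always returns the default, production behaviour is unchanged while the log accumulates the labels needed for the accuracy spot-check.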

What this looks like in code

The shape of routing logic in production. This pattern works whether you're using a gateway (Prism, Portkey, LiteLLM) or building it yourself:

# Sketch of a basic routing layer. The classifier is passed in as a callable.
from typing import Callable, Mapping

# The routing table, calibrated from benchmark data
ROUTING_TABLE = {
    "simple":    {"eco": "groq-llama-8b",   "balanced": "groq-llama-8b",       "sport": "claude-opus-4-7"},
    "code":      {"eco": "codestral",       "balanced": "codestral",           "sport": "mistral-medium-3-5"},
    "reasoning": {"eco": "groq-llama-8b",   "balanced": "groq-qwen-32b",       "sport": "claude-opus-4-7"},
    "complex":   {"eco": "groq-llama-70b",  "balanced": "gpt-5-4",             "sport": "gemini-3-pro"},
}

# Models a per-project override may pin: here, every model named in the table
MODEL_CATALOG = {model for modes in ROUTING_TABLE.values() for model in modes.values()}


def route_request(user_message: str, context: Mapping, classify: Callable[[str], str]) -> str:
    # 1. Classify the request: "simple" | "code" | "reasoning" | "complex"
    task_type = classify(user_message)

    # 2. Pick mode based on caller intent (passed as header or config)
    mode = context.get("mode", "balanced")  # "eco" | "balanced" | "sport"

    # 3. Look up the routing table
    model = ROUTING_TABLE[task_type][mode]

    # 4. Apply per-project overrides (if any), ignoring models outside the catalog
    override = context.get("model_override")
    if override in MODEL_CATALOG:
        model = override

    return model

The table above is Prism's current production routing table (v1.7-A P6 calibration). The cells map (task_type × mode) to a specific model based on measured quality + cost data from the v1.7-A benchmark. Pro+ accounts can override per-project via the X-Prism-Model-Prefer header; the default mode is balanced.

How Prism implements routing

Prism's router combines:

Mode declaration via X-Prism-Mode header — eco / balanced / sport. Default: balanced.
Classifier — a small fine-tuned model that runs in the API process at ~10ms p95. Returns one of simple / code / reasoning / complex per request.
Routing table — the 4×3 grid above, calibrated quarterly from a 23-model benchmark across 8 providers.
Override — X-Prism-Model-Prefer header pins a specific model on Pro+ accounts when the caller wants direct control.
Failover: if the chosen model's provider is unhealthy, the router fails over to an equivalent model on a different provider (capability-tier match).
Speculative parallel routing on sport mode — fires two providers in parallel and takes the first response, hedging p99 latency under provider degradation. Pro+ only.

The full mechanics are covered in the task-type routing and multi-provider failover glossary entries.
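The speculative parallel idea (fire two providers, take the first good response) can be sketched with the standard library's `asyncio`. This illustrates the general technique only, not Prism's implementation; `first_response` and the provider callables are hypothetical.

```python
import asyncio


async def first_response(calls):
    """calls: dict of provider name -> zero-argument coroutine function.

    Returns (name, result) from the first provider that succeeds and
    cancels whatever is still in flight.
    """
    tasks = {asyncio.create_task(fn()): name for name, fn in calls.items()}
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return tasks[task], task.result()
        raise RuntimeError("all providers failed")
    finally:
        for task in pending:
            task.cancel()
```

The trade-off is cost: every hedged request pays for two calls whenever both providers complete, which is why this kind of hedging is usually reserved for latency-sensitive traffic.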

Decision framework

If you're deploying task-type routing on a production workload:

Quantify your task mix. Sample 100-1000 recent requests; manually label by task type; compute the percentages. The savings depend on the mix.
Pick the model per (task, mode) cell. Use a benchmark — there's no shortcut. Prism's quarterly bench is one input; Hugging Face's open-LLM leaderboards are another.
Wire the classifier. A fine-tuned 8B model running locally or in a sidecar is the typical pattern. Off-the-shelf classifiers (e.g. zero-shot category classifiers via a small LLM) work for prototyping.
Capture feedback signals per request. Thumbs-down + rating + comment + feedback ID correlation. The closed-loop monitoring is what keeps routing calibrated.
Roll out via the 4-phase A/B framework above. Don't flip 100% on day one.
Plan for quarterly recalibration. Both classifier retraining and routing-table benchmark refresh.

Routing is the highest-effort of the top-5 cost-reduction techniques (~2-3 days for the basic version, weeks for the full closed-loop discipline), but it's also the largest structural lever. The math is favourable: 40-60% savings on total LLM cost is the typical production band, and the engineering work compounds, because once the discipline is in place it stays in place.

Where to go next

For the broader cost-reduction context: LLM cost reduction playbook (all 14 techniques) and LLM cost reduction ranked by ROI (the top 5).

For routing-specific deep dives: task-type routing glossary, LLM routing glossary, multi-provider failover glossary, speculative parallel routing glossary.

For modelling routing impact on your workload: model routing recommender — input your task mix + cost preference and see Prism's recommended config.

FAQ

Do I need a classifier at all, or can I use deterministic rules?

For ~80% of production cases, hand-coded rules work surprisingly well. "If the request contains code blocks, route to a code-specialised model" + "if the request is over 8K input tokens, route to long-context" captures most of the win without ML infrastructure. The case for a classifier shows up when the rule set grows beyond ~10 rules and starts conflicting, or when the task distribution is too varied for hand-coded heuristics. Most mature production deployments combine both: explicit rules for known cases + classifier for the rest.
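The two rules quoted above fit in a dozen lines. This sketch makes several assumptions: tokens are estimated as characters divided by four, long inputs are mapped to the complex category, and the code-detection regex is a rough heuristic. Returning `None` means "no rule matched, defer to the classifier".

```python
import re

CODE_FENCE = "`" * 3  # a Markdown code fence inside the request
CODE_HINT = re.compile(r"^\s*(def |class |import |#include|function )", re.MULTILINE)
LONG_CONTEXT_TOKENS = 8_000


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about four characters per token)."""
    return len(text) // 4


def route_by_rules(message: str):
    """Return a task type from hand-coded rules, or None to defer to a classifier."""
    if estimate_tokens(message) > LONG_CONTEXT_TOKENS:
        return "complex"
    if CODE_FENCE in message or CODE_HINT.search(message):
        return "code"
    return None
```

Ordering matters once rules overlap: here a very long message that also contains code goes to the long-context path, and that choice should be made deliberately.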

How accurate does the classifier need to be?

Roughly 85-90% top-1 accuracy is enough for the routing math to work. The savings on the correctly-routed 85% dominate the noise from the misrouted 15%. Below 80% accuracy, the misrouting starts costing real money + quality; above 95%, the marginal accuracy gains don't change the savings significantly.

What models are best for simple-task routing?

In mid-2026: GPT-5.4-mini ($0.75 input + $4.50 output per M tokens), Claude Haiku 4.5 ($1.00 + $5.00), Gemini 3 Flash ($0.30 + $2.50), Groq Llama 8B ($0.05 + $0.08). The exact ranking depends on your specific workload — benchmark against your actual prompts before committing.

Does routing add latency?

Marginally. The classifier adds 5-20ms p95. Small models (the routing targets for simple tasks) are typically faster than the frontier models they replace — net latency often improves. The argument against routing on latency grounds is usually wrong on the data.

What about routing across providers vs within a provider?

Both work. Within-provider routing (route between GPT-5.4-mini and GPT-5.4 on OpenAI) captures the per-tier price gap with single-vendor simplicity. Across-provider routing (route GPT-5.4-mini for simple, Claude Sonnet 4.6 for reasoning) captures additional capability optimisations + multi-provider resilience. Most production deployments end up across-provider because the cost/quality frontier varies by category — Anthropic excels at reasoning; OpenAI's mini tier is the gold standard for cheap small-task work; Groq is fast for high-throughput simple traffic.

How do I A/B test the quality of routing changes?

Hold-out group at the project/user level. Route 95% of traffic through the new routing; 5% stays on the old single-model behaviour. Compare per-task-type quality signals (thumbs ratio, customer-reported issues) between the two groups over 1-2 weeks. If the routed group's quality is flat or improved, expand. If it regresses on a task type, investigate that specific slice.
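A user-level hold-out needs two pieces: a stable assignment and a comparison. The sketch below hashes the user ID into a bucket for the assignment and uses a one-sided two-proportion z-test on thumbs-down rates for the comparison. The salt, the 1.645 threshold (one-sided 5% level), and the function names are assumptions; use whatever statistical test your team already trusts.

```python
import hashlib
import math


def in_holdout(user_id: str, holdout_percent: int = 5, salt: str = "routing-holdout") -> bool:
    """Deterministically place a user in the hold-out group (same answer every request)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < holdout_percent


def two_proportion_z(down_a: int, n_a: int, down_b: int, n_b: int) -> float:
    """z-score for the difference between thumbs-down rates of groups a and b."""
    pooled = (down_a + down_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0
    return (down_a / n_a - down_b / n_b) / std_err


def routed_regressed(routed_down, routed_n, holdout_down, holdout_n, z_threshold=1.645):
    """True when the routed group's thumbs-down rate is significantly higher."""
    return two_proportion_z(routed_down, routed_n, holdout_down, holdout_n) > z_threshold
```

Run the comparison per task type rather than on the aggregate, so a regression in one slice is not averaged away by the others.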

Will routing make my application fragile (single-point-of-failure on the classifier)?

The classifier should fail-open — if classification fails (e.g. exception, timeout), default to a sensible model (typically the balanced-mode large model). The routing decision is an optimisation, not a hard requirement; the application keeps working even when the classifier doesn't.
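Fail-open behaviour is a small wrapper around the routing call. This sketch assumes the router is a synchronous callable, uses a 50ms budget, and falls back to `gpt-5-4` as a stand-in for "the balanced-mode large model"; all three are placeholders to adjust.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

FALLBACK_MODEL = "gpt-5-4"  # assumption: your balanced-mode large model
_executor = ThreadPoolExecutor(max_workers=4)


def route_fail_open(message, route_fn, timeout_s=0.05):
    """Return route_fn(message), or the fallback model on timeout or any error."""
    future = _executor.submit(route_fn, message)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running will finish
        return FALLBACK_MODEL
    except Exception:
        return FALLBACK_MODEL
```

It is worth counting how often the fallback path fires: a rising rate is an operational signal about the classifier, even though users never see the failure.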

How does routing interact with caching?

Routing happens at request time before the cache; the cache fingerprint includes the (resolved) model name, so different routings produce different cache keys. The wedges stack — routing reduces the cost of cache misses; caching avoids many requests from reaching the routing decision at all. Most production deployments run both, in this order: cache lookup first (Layer 1 + Layer 2); on miss, run the classifier + router; dispatch to the chosen model.

The routing-savings math is one of the most predictable cost wins in LLM operations. Combined with provider-native caching and exact-match caching (techniques #1 and #2 in the ranked cluster), the cumulative bill cut is typically 50-70% on production workloads. Model your specific shape via the savings calculator and the routing recommender.
