Neilton Rocha


LLM Routing: How to Cut AI Infrastructure Costs by 70% Without Losing Quality

Running everything on frontier models is an operational mistake. Here is the routing architecture that reduced cost per task from $8.20 to $2.44 in production.

  • GPT-5.5 costs 34x more than DeepSeek V4-Pro. 95% of your queries do not need frontier.
  • Routing (upfront decision) and cascading (confidence-based fallback) solve different problems. Production uses both.
  • ESKOM.ai went from $8.20 to $2.44 per completed task. Same quality. 70% cost reduction.

Every week someone tells me the same thing: "I tested an AI agent and it was useless. LLMs are overrated."

My answer: you sent a cardiac surgeon to put on a bandage and did not give him the patient notes.

The model is not the problem. The problem is that everything runs on frontier with zero selection logic. Here is what that costs at scale:

Model               Cost per 1M tokens   Multiplier
DeepSeek V4-Pro     $0.435               1x
GPT-4o-mini         $1.50                3.4x
Claude Sonnet 4.5   $5.00                11.5x
GPT-5.5             $15.00               34.5x
Claude Opus 4.7     $26.00               59.8x

A well-designed router pushes 95% of traffic to the cheap tier and reserves frontier for the 5% that actually needs it.
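
Back-of-the-envelope math on that split, using the GPT-5.5 row as the frontier tier; the rates come from the table above, the 95/5 split is illustrative.

cheap_rate = 0.435      # DeepSeek V4-Pro, $ per 1M tokens
frontier_rate = 15.00   # GPT-5.5, $ per 1M tokens

blended = 0.95 * cheap_rate + 0.05 * frontier_rate
print(f"Blended rate: ${blended:.2f} per 1M tokens")                   # ~$1.16
print(f"Savings vs. all-frontier: {1 - blended / frontier_rate:.0%}")  # ~92%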

Routing vs. Cascading

These are two distinct strategies. Mixing them up is the most common architecture mistake I see.

Routing is an upfront decision. A classifier evaluates the query and picks a tier before any LLM call is made. One decision, one execution, no fallback.

query = "Extract the cost values from document X"
tier = classifier.predict(query)  # returns "simple"
response = router.call(tier, query)  # DeepSeek, $0.435/1M

Use it for structured, well-defined workloads: extraction, classification, fixed-template generation. The trade-off is that a wrong classification has no recovery path.

Cascading starts at the cheapest model and escalates based on a confidence score. If the output confidence falls below a threshold, it retries at the next tier automatically.

response = deepseek.call(query)

if response.confidence < 0.70:
    response = sonnet.call(query)

# Worst case: $0.435 + $5.00 = $5.435 per 1M tokens, vs. $26/1M going straight to Opus

Use it for unpredictable workloads: open-ended analysis, financial or legal reasoning, anything where query difficulty varies widely.

The latency trade-off is real. Sequential calls at 100ms each stack up. If P95 exceeds 500ms, rethink.

Production systems use both. Routing for structured flows, cascading for analytical ones.

The 3-layer architecture

Request
   |
[Semantic Cache] -- hit --> Response (zero cost)
   | miss
[Intent Classifier] (0.5B model, ~5ms)
   |
   |-- Simple     --> DeepSeek V4-Pro   ($0.435/1M)
   |-- Medium     --> GPT-4o-mini       ($1.50/1M)
   |-- Critical   --> GPT-5.5 / Opus    ($15-26/1M)
                         ^
                  [Confidence Gate]
                  confidence < 0.70: escalate

Layer 1, semantic cache: before any classification, check if this query was already answered. For B2C products with repetitive queries, a 30-40% hit rate is realistic. Marginal cost: zero.
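
A minimal sketch of that layer, using sentence-transformers purely as an example embedding model and an in-memory list as the store; a production version sits on a vector database, but the lookup logic is the same.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here

cache = []                    # list of (normalized embedding, stored answer)
SIMILARITY_THRESHOLD = 0.92   # tune on real traffic: too low returns wrong answers

def cache_lookup(query: str):
    q = encoder.encode(query, normalize_embeddings=True)
    for emb, answer in cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:
            return answer     # hit: zero LLM cost
    return None               # miss: fall through to the intent classifier

def cache_store(query: str, answer: str):
    cache.append((encoder.encode(query, normalize_embeddings=True), answer))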

Layer 2, intent classifier: a small model (0.5B parameters) trained on your actual workload, not on generic benchmarks. Running locally with vLLM, latency is under 5ms. Cost is roughly $0.20/hour of GPU.

Layer 3, confidence gate: each response returns a score. Below 0.70, escalate. Above 0.85, trust it. Financial and legal domains bypass the gate and go straight to frontier.
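
Wiring those rules together, a sketch of the gate. classify_intent and call_model are placeholders for whichever classifier and LLM client you run, and how you derive the confidence score (logprobs, a verifier model) is up to you; the thresholds and the finance/legal override follow the rules above.

from dataclasses import dataclass

@dataclass
class RoutedResponse:
    text: str
    confidence: float          # however you compute it: logprobs, verifier model, etc.

def classify_intent(query: str) -> str:
    raise NotImplementedError  # placeholder: your local 0.5B classifier

def call_model(model: str, query: str) -> RoutedResponse:
    raise NotImplementedError  # placeholder: your LLM client for that tier

TIER_MODELS = {
    "simple":   "deepseek-v4-pro",
    "medium":   "gpt-4o-mini",
    "critical": "claude-opus-4",
}
ESCALATION = {"simple": "medium", "medium": "critical"}
FRONTIER_DOMAINS = {"finance", "legal"}   # bypass the gate, go straight to frontier

def route(query: str, domain: str) -> RoutedResponse:
    if domain in FRONTIER_DOMAINS:
        return call_model(TIER_MODELS["critical"], query)

    tier = classify_intent(query)
    response = call_model(TIER_MODELS[tier], query)

    # Confidence gate: below 0.70, escalate one tier and retry.
    while response.confidence < 0.70 and tier in ESCALATION:
        tier = ESCALATION[tier]
        response = call_model(TIER_MODELS[tier], query)
    return response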


Five mistakes that break this in production

No observability on routing decisions. If you are not logging the classifier score, selected tier, and final confidence for every query, you will not know when calibration drifts. The system degrades silently.
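
A minimal version of that logging: one JSON line per request. The field names are a suggestion; what matters is that every routing decision leaves a record you can query when calibration drifts.

import json, time, uuid

def log_routing_decision(query: str, classifier_score: float, tier: str,
                         model: str, confidence: float, escalated: bool,
                         cost_usd: float, latency_ms: float) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query_chars": len(query),     # avoid logging raw text if it is sensitive
        "classifier_score": classifier_score,
        "tier": tier,
        "model": model,
        "confidence": confidence,
        "escalated": escalated,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))          # or ship it to your log pipeline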

Single provider dependency. If DeepSeek goes down, your cheap tier goes with it. Configure a same-tier fallback on a different provider.
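
One way to do this with plain litellm.completion: try the primary cheap-tier provider, and on any provider error retry the same tier through a second provider before touching the more expensive tiers. The model strings here are illustrative; use whichever two providers actually serve your cheap tier.

from litellm import completion

# Same tier, two providers. Escalate tiers on low confidence, not on outages.
CHEAP_TIER = ["deepseek/deepseek-chat", "openrouter/deepseek/deepseek-chat"]

def call_cheap_tier(messages: list[dict]):
    last_error = None
    for model in CHEAP_TIER:
        try:
            return completion(model=model, messages=messages)
        except Exception as err:    # provider down, rate limited, timed out...
            last_error = err
    raise last_error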

Tail miscalibration. Overall accuracy of 94% sounds good. The 6% it gets wrong is concentrated in exactly the rare, high-stakes queries your classifier has the least training data for. Oversample the tail when you validate.

Cascade latency stacking. Three sequential calls at 100ms each add up to 300ms. Sometimes paying $26/1M directly for Opus is cheaper than what that extra latency costs you in conversion rate.

Thresholds set by intuition. Run an A/B test. Compare 0.65 vs. 0.75 for one week. Measure escalation rate, average quality score, and cost per task. The optimal point is specific to your workload.
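
The test itself can be as simple as hashing a request id into two arms and recording which threshold was applied; a sketch, assuming you already log escalations, quality score, and cost per request as described above.

import hashlib

THRESHOLDS = {"A": 0.65, "B": 0.75}

def assign_arm(request_id: str) -> str:
    # Deterministic 50/50 split, so retries of the same request stay in one arm.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 2
    return "A" if bucket == 0 else "B"

def escalation_threshold(request_id: str) -> float:
    return THRESHOLDS[assign_arm(request_id)]

# After a week, compare per arm: escalation rate, mean quality score, cost per task.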


Real case: ESKOM.ai

ESKOM.ai runs an agent stack for energy data processing. Before and after implementing the architecture above:

Metric               Before         After
Query distribution   100% GPT-4.5   70% DeepSeek / 25% mid / 5% frontier
Cost per task        $8.20          $2.44
Escalation rate      N/A            2.8%
P95 latency          250ms          180ms
Quality score        4.1/5          4.2/5

At 30,000 tasks per month, that is $27,000 saved in the first month. Annualized: approximately $324,000.

An escalation rate of 2.8% means the classifier was well calibrated: 97.2% of queries stayed at the initial tier. If yours is above 5%, retrain.


How to implement this week

LiteLLM handles routing and fallback across 100+ models, configured either through a YAML file or a few lines of Python:

pip install litellm
from litellm import Router

router = Router(model_list=[
    {"model_name": "tier-simple",   "litellm_params": {"model": "deepseek/deepseek-v4-pro"}},
    {"model_name": "tier-medium",   "litellm_params": {"model": "gpt-4o-mini"}},
    {"model_name": "tier-frontier", "litellm_params": {"model": "claude-opus-4"}},
])
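
Calling it is then an OpenAI-style completion addressed by tier name instead of by model; a minimal example against the router defined above.

response = router.completion(
    model="tier-simple",   # in practice, the classifier's output picks this name
    messages=[{"role": "user", "content": "Extract the cost values from document X"}],
)
print(response.choices[0].message.content)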

RouteLLM (Berkeley, arXiv:2410.13765) provides router models you can calibrate on your own query history. Their published benchmark: 85% of queries routed to the cheap tier while maintaining 95% of frontier quality.

vLLM lets you run the intent classifier locally for sub-5ms latency and full query privacy:

pip install vllm
vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype auto
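
With that server running, the classifier call is a standard OpenAI-compatible request against localhost; the prompt and label set below are illustrative, and in practice you would fine-tune the model on labeled queries from your own workload.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def classify_intent(query: str) -> str:
    result = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        messages=[
            {"role": "system", "content": "Classify the query as exactly one word: simple, medium, or critical."},
            {"role": "user", "content": query},
        ],
        max_tokens=5,
        temperature=0,
    )
    label = result.choices[0].message.content.strip().lower()
    return label if label in {"simple", "medium", "critical"} else "medium"  # safe default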

Four-week rollout:

Week 1: LiteLLM with 3 tiers + structured logging
Week 2: Confidence gate + domain overrides (finance and legal to frontier)
Week 3: Empirical threshold calibration via A/B test
Week 4: Monitor cost per task, escalation rate, quality score

Target at week 4: cost per task down at least 40%. If not, the classifier needs more domain-specific training data.

Most teams still assume the competitive edge is access to the best model. That is exactly backwards.

The defensible moat in AI infrastructure is not which model you have access to. Every competitor has API access to the same frontier models you do. The moat is how efficiently you decide which model to use for each specific task.

Organizations building this routing layer today operate with a 10-30x structural cost advantage over everyone still sending everything to frontier. That gap compounds: as usage scales, teams without a router pay a growing multiple for the same output quality.

The question is not "which LLM is best." The question is "which LLM is best for this query, right now, given confidence and latency constraints."


If this was useful:

Drop a comment with the escalation rate you are currently seeing in production, or the tier distribution you are targeting. I read every one.

If you are setting up LLM routing for the first time and hit a wall on classifier calibration, tell me your domain and workload profile. Worth a follow-up post if enough people are stuck on the same step.


References: RouteLLM · LiteLLM · arXiv:2410.13765
