Running GPT-4o on every task is like hiring a senior engineer to sort your inbox.
Most ML teams wire all inference calls to the same frontier model and call it "safe." It's not safe. It's a budget leak.
Here's the math that changed how I build pipelines:
A typical customer support system has two dominant task types — classification ("is this billing or technical?") and structured extraction ("pull the order ID"). Together they account for ~60% of inference calls.
Neither needs chain-of-thought reasoning. Neither benefits from a 200B+ parameter model pondering an order number.
Yet both get routed to GPT-4o by default.
I benchmarked this directly. On a 1,000-sample extraction task from financial documents:
- Quantized Llama-3 70B (Q4_K_M): F1 = 0.91, ~$0.003/request
- GPT-4o: F1 = 0.94, ~$0.12/request
That's a 40x cost difference for a 3-point F1 gap. In most production systems, 0.91 F1 is more than sufficient.
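The tradeoff is clearer if you divide cost by accuracy instead of comparing them separately. A minimal sketch, using the benchmark figures above and treating F1 as a rough proxy for the fraction of correct answers (a simplifying assumption):

```python
def cost_per_correct(cost_per_request: float, f1: float) -> float:
    """Approximate cost per correct answer, using F1 as a correctness proxy."""
    return cost_per_request / f1

llama_cpc = cost_per_correct(0.003, 0.91)   # quantized Llama-3 70B figures from the benchmark
gpt4o_cpc = cost_per_correct(0.12, 0.94)    # GPT-4o figures from the benchmark

print(f"Llama:  ${llama_cpc:.4f}/correct answer")
print(f"GPT-4o: ${gpt4o_cpc:.4f}/correct answer")
print(f"Ratio:  {gpt4o_cpc / llama_cpc:.0f}x")  # still ~39x in the cheaper model's favor
```

Even after penalizing the cheaper model for its lower F1, the gap barely moves: roughly $0.0033 vs. $0.1277 per correct answer.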
## The 5-Node Decision Tree Framework
The framework I use now is a 5-node decision tree that routes tasks based on four signals:
- Input token count (< 500?)
- Output determinism (JSON/enum expected?)
- Reasoning depth score (1–5 scale)
- Latency SLA (< 200ms P95?)
```python
def route_task(prompt: str, output_schema: dict | None, latency_sla_ms: int) -> str:
    """
    Returns the model tier to use for a given task.
    Tiers: 'tier1' | 'tier2' | 'tier3'
    """
    token_count = estimate_tokens(prompt)            # lightweight tokenizer
    reasoning_depth = score_reasoning_depth(prompt)  # keyword + heuristic classifier
    is_structured = output_schema is not None
    is_latency_sensitive = latency_sla_ms < 200

    if token_count < 500 and is_structured and reasoning_depth <= 2:
        return "tier1"  # Haiku / quantized Llama — ~$0.003/request
    if reasoning_depth <= 3 and not is_latency_sensitive:
        return "tier2"  # Mid-tier — ~$0.01–0.03/request
    return "tier3"      # Frontier model only — ~$0.10–0.15/request
```
## The 5 Task Classes
### Tier 1 — Classification & Tool Execution
Models: Haiku / quantized Llama (Q4_K_M)
- Binary or multi-class classification
- Structured extraction (JSON, enums)
- Tool call routing in agentic pipelines
- Cost: ~$0.003/request
```json
{
  "task": "extract_order_id",
  "tier": "tier1",
  "model": "claude-haiku-3",
  "output_schema": {
    "order_id": "string",
    "customer_id": "string",
    "issue_type": "billing | technical | shipping | other"
  }
}
```
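A downstream consumer still has to check that Tier 1 output actually conforms. A minimal sketch of a validator for the schema above; note that the `"billing | technical | shipping | other"` notation is this post's informal shorthand rather than JSON Schema, so the enum is hand-expanded here:

```python
ISSUE_TYPES = {"billing", "technical", "shipping", "other"}

def validate_extraction(result: dict) -> bool:
    """Check the required string fields and the issue_type enum."""
    return (
        isinstance(result.get("order_id"), str)
        and isinstance(result.get("customer_id"), str)
        and result.get("issue_type") in ISSUE_TYPES
    )

print(validate_extraction({"order_id": "A-1001", "customer_id": "C-42", "issue_type": "billing"}))  # True
print(validate_extraction({"order_id": "A-1001", "issue_type": "refund"}))  # False
```

Failed validations are a useful routing signal in their own right: a Tier 1 model that repeatedly misses the schema is a candidate for escalation.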
### Tier 2 — Summarization & Transformation
Models: Mid-tier (e.g., GPT-4o-mini, Haiku with larger context)
- Document summarization
- Format conversion
- Translation
- Cost: ~$0.01–0.03/request
### Tier 3 — Multi-step Reasoning
Models: Frontier only (GPT-4o, Claude Sonnet, Gemini 1.5 Pro)
- Complex analysis requiring chain-of-thought
- Code generation with debugging
- Multi-document synthesis
- Cost: ~$0.10–0.15/request
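The three classes can live in a small registry that the router and any cost dashboard share. A sketch using the illustrative figures from this post; the model names and per-request costs are placeholders, not live pricing:

```python
# Tier registry mirroring the task classes above. Tier 2 cost uses the
# midpoint of the $0.01–0.03 range quoted in the post.
TIERS = {
    "tier1": {"models": ["claude-haiku-3", "llama-3-70b-q4_k_m"], "cost_per_request": 0.003},
    "tier2": {"models": ["gpt-4o-mini"], "cost_per_request": 0.02},
    "tier3": {"models": ["gpt-4o", "claude-sonnet", "gemini-1.5-pro"], "cost_per_request": 0.125},
}

def estimated_cost(tier: str, n_requests: int) -> float:
    """Projected spend for n_requests routed to a given tier."""
    return TIERS[tier]["cost_per_request"] * n_requests

print(f"${estimated_cost('tier1', 1000):.2f}")  # $3.00 for 1k Tier 1 calls
```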
## The Routing Classifier
The routing classifier itself runs on a Haiku-class model. Its cost is roughly 0.1% of the savings it generates. It pays for itself on the first routed request.
The classifier evaluates:
- Token count of the incoming prompt
- Presence of structured output schema
- Keyword signals for reasoning depth
- Latency requirements from the request metadata
```python
REASONING_KEYWORDS = [
    "analyze", "compare", "synthesize", "debug", "explain why",
    "step by step", "chain of thought", "evaluate", "critique",
]

def score_reasoning_depth(prompt: str) -> int:
    """
    Returns a 1–5 reasoning depth score.
    1 = pure classification/extraction
    5 = deep multi-step reasoning required
    """
    prompt_lower = prompt.lower()
    keyword_hits = sum(1 for kw in REASONING_KEYWORDS if kw in prompt_lower)
    token_count = estimate_tokens(prompt)

    base_score = 1
    base_score += min(keyword_hits, 2)            # max +2 from keywords
    base_score += 1 if token_count > 1000 else 0  # long prompts skew complex
    base_score += 1 if token_count > 3000 else 0  # very long = almost certainly tier3
    return min(base_score, 5)
```
## Real Production Numbers
One data point from our agentic pipeline at QEval: routing a 10-step ReAct loop — frontier model for planning, Haiku for tool execution — cut the cost per loop from $1.47 to $0.18, with an accuracy delta under 3%.
```
# Before routing: all steps on GPT-4o
#   10 steps × ~$0.147/step = $1.47/loop
# After routing:
#   2 planning steps × $0.12 = $0.24
#   8 tool steps × $0.003    = $0.024
#   1 routing call × $0.003  = $0.003
#   Total = $0.267 → real-world measured: $0.18 with caching
```
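The same arithmetic as a reusable function, with the per-step costs above as defaults (hypothetical figures, not live pricing):

```python
def loop_cost(planning_steps: int, tool_steps: int,
              planning_cost: float = 0.12, tool_cost: float = 0.003,
              routing_cost: float = 0.003) -> float:
    """Projected cost per routed agent loop, before caching effects."""
    return planning_steps * planning_cost + tool_steps * tool_cost + routing_cost

before = 10 * 0.147                              # all steps on the frontier model
after = loop_cost(planning_steps=2, tool_steps=8)
print(f"before=${before:.2f} after=${after:.3f} savings={1 - after / before:.0%}")
# before=$1.47 after=$0.267 savings=82%
```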
The mental shift that matters: stop optimizing cost-per-token. Optimize cost-per-correct-answer.
## Implementation Checklist
- [ ] Audit your top 5 inference call types by volume
- [ ] Score each on reasoning depth (1–5)
- [ ] Identify which are classification/extraction (Tier 1 candidates)
- [ ] Build a lightweight routing classifier
- [ ] A/B test Tier 1 model vs frontier on your actual data
- [ ] Measure F1 delta — if < 5 points, route to Tier 1
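For the last two checklist items, a minimal sketch of the F1-delta gate, assuming you have true-positive/false-positive/false-negative counts from your A/B run (the counts below are illustrative, chosen to reproduce the benchmark F1 scores):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

tier1_f1 = f1(tp=910, fp=90, fn=90)     # illustrative counts -> 0.91
frontier_f1 = f1(tp=940, fp=60, fn=60)  # illustrative counts -> 0.94
delta_points = (frontier_f1 - tier1_f1) * 100

verdict = "route to tier1" if delta_points < 5 else "keep frontier"
print(f"F1 delta: {delta_points:.1f} points -> {verdict}")
```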
If you're building routing logic for agentic pipelines or wrestling with inference cost at scale, I'd love to compare notes — find me on LinkedIn. I share production AI/ML architecture insights regularly, and I'm always curious what thresholds and signals others are using in their own routing classifiers.