The Problem: One Model for Everything
Here's what a typical Claude Code agent loop looks like under the hood:
```text
User prompt → Claude Sonnet (classify intent) → Claude Sonnet (retrieve context)
→ Claude Sonnet (summarize retrieved docs) → Claude Sonnet (generate response)
→ Claude Sonnet (format output)
```
Five calls. Each one hitting Sonnet. At Claude Sonnet pricing (roughly $3/MTok input, $15/MTok output as of this writing), a moderately complex agent task with 10K input tokens and 2K output tokens per call costs:
5 calls × (10K × $0.003 + 2K × $0.015) = 5 × ($0.030 + $0.030) = $0.30 per task run
That sounds small. Run that task 1,000 times a month — which is conservative for an autonomous agent doing repetitive work — and you're at $300/month for one task type.
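The arithmetic above is easy to sanity-check in a few lines. This is an illustrative sketch, not part of any SDK: `call_cost` is a made-up helper, and the rates are the Sonnet list prices quoted above.

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00  # $/MTok, list pricing quoted above

def call_cost(in_tokens: int, out_tokens: int) -> float:
    """Cost of one API call at Sonnet pricing."""
    return in_tokens / 1e6 * SONNET_IN + out_tokens / 1e6 * SONNET_OUT

per_run = 5 * call_cost(10_000, 2_000)  # 5 Sonnet calls per task run
monthly = 1_000 * per_run               # 1,000 runs a month
```

`per_run` comes out to $0.30 and `monthly` to $300, matching the figures above.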
Now look at what most of those calls actually need:
- Classify intent: Takes a string, returns a category. This is a pattern-matching problem.
- Retrieve context: String similarity search. No synthesis required.
- Summarize retrieved docs: Compression of existing text. No novel reasoning.
- Generate response: This one actually needs intelligence.
- Format output: String transformation. Deterministic.
Four of the five calls don't need Sonnet, and two of them (classify intent, format output) don't need any API call at all — a local model running at zero marginal cost handles them fine.
That's the routing opportunity.
The Routing Principle
Before dispatching a subtask to any model, answer three questions:
1. Does this require judgment or just processing?
Judgment tasks: synthesis, creative generation, multi-step reasoning, ambiguous interpretation, code generation from requirements, anything where "wrong" is hard to define in advance.
Processing tasks: classification into fixed categories, text compression/summarization, format conversion, extraction of named entities, boolean routing decisions.
Judgment → Tier 2 minimum. Processing → Tier 0 or Tier 1 viable.
2. Does it need to be right on the first attempt, or can it retry cheaply?
Some subtasks sit on the critical path. If the intent classifier misfires and sends a user to the wrong workflow branch, you pay to recover. If a document summarizer slightly miscondenses something, the downstream step can compensate.
High-stakes, no-retry → Tier 1 minimum. Low-stakes, recoverable → Tier 0 viable.
3. What's the token budget for this step?
Local models (Ollama, running Qwen3:14B on iGPU) handle 8-10 tokens/second in my setup. That's fine for 500-token classification tasks. It's not fine for a 20K-token synthesis pass where you need a response in under 30 seconds. Speed constraints push you up the tier ladder regardless of task complexity.
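That speed constraint is simple to encode as a gate before choosing Tier 0. A sketch: `TOKS_PER_SEC` uses the midpoint of the 8-10 tok/s figure above, and the function name is my own.

```python
TOKS_PER_SEC = 9.0  # midpoint of the 8-10 tok/s local throughput above

def local_latency_ok(expected_output_tokens: int, budget_seconds: float) -> bool:
    """Rough check: can local inference finish the output within the budget?"""
    return expected_output_tokens / TOKS_PER_SEC <= budget_seconds

local_latency_ok(150, 30.0)    # short classification answer: fits
local_latency_ok(2_000, 30.0)  # synthesis-sized output: does not
```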
The decision tree:
```text
Is this a synthesis/reasoning/generation task?
├── Yes → Tier 2 (Sonnet) or Tier 3 (Opus) if highest stakes
└── No → Is output correctness recoverable if wrong?
    ├── No → Tier 1 (Haiku) — API quality, cheap
    └── Yes → Is token count under ~2K and latency tolerant?
        ├── Yes → Tier 0 (Ollama local) — zero API cost
        └── No → Tier 1 (Haiku)
```
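The same tree reads naturally as a pure function. This is a sketch with argument names of my own choosing; tier numbers follow the ladder used throughout (0 local, 1 Haiku, 2 Sonnet, 3 Opus).

```python
def pick_tier(needs_synthesis: bool, highest_stakes: bool,
              recoverable: bool, small_and_latency_tolerant: bool) -> int:
    """Mirror of the decision tree: 0=local, 1=Haiku, 2=Sonnet, 3=Opus."""
    if needs_synthesis:
        return 3 if highest_stakes else 2
    if not recoverable:
        return 1
    return 0 if small_and_latency_tolerant else 1
```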
Implementation
Here's the router as a standalone module. The classify() function takes a task description string and returns a tier integer. get_model() maps that tier to a model identifier.
```python
# model_router.py
from enum import IntEnum
import re


class Tier(IntEnum):
    LOCAL = 0   # Ollama — zero API cost
    HAIKU = 1   # Claude Haiku 4.5 — cheap, API quality
    SONNET = 2  # Claude Sonnet — primary work
    OPUS = 3    # Claude Opus — highest stakes only


TIER_MODELS = {
    Tier.LOCAL: "ollama:qwen3:14b",
    Tier.HAIKU: "claude-haiku-4-5",
    Tier.SONNET: "claude-sonnet-4-5",
    Tier.OPUS: "claude-opus-4-5",
}

# Task patterns that signal each tier.
# Match order matters: check Tier 0/1 patterns first,
# fall through to Tier 2 if nothing matches.
LOCAL_PATTERNS = [
    r"\bclassif(y|ication|ier)\b",
    r"\broute\b.*\btask\b",
    r"\bsummariz(e|ation)\b",
    r"\bextract\b.*(entity|entities|field|fields|name|date|number)",
    r"\bformat\b.*(output|json|markdown|csv)",
    r"\bparse\b.*(string|text|input)",
    r"\bis this (about|related to|a)\b",
    r"\bcategori(ze|zation)\b",
    r"\bdetect\b.*(intent|topic|language|sentiment)",
    r"\btranslate\b.*(format|schema)",
]

HAIKU_PATTERNS = [
    r"\bvalidat(e|ion)\b",
    r"\bcheck\b.*(schema|format|constraint|rule)",
    r"\bfilter\b",
    r"\brank\b.*(list|candidates|results)",
    r"\bscore\b",
    r"\byes.{0,10}no\b",  # binary decisions
    r"\btrue.{0,10}false\b",
    r"\bshould (i|we|this)\b",
]

OPUS_PATTERNS = [
    r"\bcritical\b",
    r"\bhigh.?stakes\b",
    r"\birreversible\b",
    r"\bproduction (deploy|release|launch)\b",
    r"\bsecurity (audit|review|analysis)\b",
    r"\blegal\b",
    r"\barchitect(ure)? decision\b",
]


def classify(task: str) -> Tier:
    """
    Classify a task description string and return the appropriate model tier.

    Conservative by default: unknown tasks get Tier 2 (Sonnet).
    """
    task_lower = task.lower().strip()

    # Check Opus patterns first — these override everything
    for pattern in OPUS_PATTERNS:
        if re.search(pattern, task_lower):
            return Tier.OPUS

    # Check if task clearly fits Local tier
    local_matches = sum(
        1 for p in LOCAL_PATTERNS if re.search(p, task_lower)
    )
    if local_matches >= 1 and len(task_lower) < 500:
        return Tier.LOCAL

    # Check Haiku tier
    for pattern in HAIKU_PATTERNS:
        if re.search(pattern, task_lower):
            return Tier.HAIKU

    # Default: Sonnet
    return Tier.SONNET


def get_model(tier: Tier) -> str:
    """Return the model identifier for the given tier."""
    return TIER_MODELS[tier]


def route(task: str) -> tuple[Tier, str]:
    """Convenience wrapper: classify + return (tier, model_id)."""
    tier = classify(task)
    return tier, get_model(tier)
```
Injecting this into a Claude Code script:
If you're running Claude Code in script mode (claude -p), you typically don't call the API directly — Claude Code handles the model. But if you're orchestrating sub-agent calls via the Anthropic SDK directly (which is common when you have a Claude Code agent spinning up subordinate tasks), the router drops in cleanly:
```python
# agent_loop.py
import anthropic
import httpx

from model_router import route, Tier

client = anthropic.Anthropic()


def run_subtask(task_description: str, prompt: str) -> str:
    tier, model = route(task_description)

    # Tier 0: local inference via Ollama (no Anthropic API call)
    if tier == Tier.LOCAL:
        return run_ollama(model.replace("ollama:", ""), prompt)

    # Tiers 1-3: Anthropic API
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def run_ollama(model_name: str, prompt: str) -> str:
    """Call local Ollama endpoint directly."""
    resp = httpx.post(
        "http://localhost:11434/api/generate",
        json={"model": model_name, "prompt": prompt, "stream": False},
        timeout=60.0,
    )
    return resp.json()["response"]
```
Integrating with a Claude Code tool definition:
If your agent uses Claude Code's native tool calling, you can route at the tool dispatch layer:
```python
# In your tool handler
from model_router import classify, get_model, Tier

TOOL_TIER_OVERRIDES = {
    "classify_intent": Tier.LOCAL,
    "summarize_document": Tier.LOCAL,
    "extract_fields": Tier.LOCAL,
    "validate_schema": Tier.HAIKU,
    "rank_candidates": Tier.HAIKU,
    "generate_code": Tier.SONNET,
    "synthesize_findings": Tier.SONNET,
    "review_security": Tier.OPUS,
}


def dispatch_tool(tool_name: str, tool_input: dict) -> str:
    # Use hard-coded override if known, otherwise classify from tool_name
    if tool_name in TOOL_TIER_OVERRIDES:
        tier = TOOL_TIER_OVERRIDES[tool_name]
    else:
        tier = classify(tool_name + " " + str(tool_input))
    model = get_model(tier)
    # ... dispatch to appropriate model
```
Real Numbers
Here's the actual breakdown from my autonomous agent infrastructure, running a mix of ClawMart listing maintenance, content generation, and ACE license delivery tasks over a 30-day period.
Before routing — all tasks on Sonnet:
| Task type | Calls/day | Avg tokens (in/out) | Daily cost |
|---|---|---|---|
| Intent classification | 120 | 800 / 50 | $0.32 |
| Document summarization | 40 | 3,200 / 400 | $0.44 |
| Field extraction | 80 | 600 / 120 | $0.20 |
| Schema validation | 60 | 400 / 80 | $0.13 |
| Content generation | 15 | 2,000 / 1,500 | $0.29 |
| Code synthesis | 10 | 4,000 / 2,000 | $0.42 |
| Total | 325 | — | $1.80/day ($54/mo) |
After routing:
| Task type | Tier | Daily cost |
|---|---|---|
| Intent classification | 0 (Ollama) | $0.00 |
| Document summarization | 0 (Ollama) | $0.00 |
| Field extraction | 0 (Ollama) | $0.00 |
| Schema validation | 1 (Haiku) | ~$0.004 |
| Content generation | 2 (Sonnet) | $0.29 |
| Code synthesis | 2 (Sonnet) | $0.42 |
| Total | — | ~$0.71/day ($21/mo) |
That's a 61% reduction. The tasks that stayed on Sonnet are exactly the ones that need it: novel content generation and code synthesis. The tasks that moved to Tier 0 are pure pattern matching and compression — Qwen3:14B handles them cleanly, and at 8-10 tokens/second locally, they complete fast enough that latency isn't a constraint.
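The headline percentage follows directly from the two daily totals in the tables above:

```python
before, after = 1.80, 0.71    # daily cost before and after routing
savings = 1 - after / before  # fraction of daily spend eliminated
```

`savings` evaluates to about 0.61, i.e. the 61% reduction quoted.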
A few observations from running this in production:
- Classification accuracy on Tier 0 is high for constrained tasks. When the output space is a small fixed set of categories, Qwen3:14B makes fewer errors than you'd expect. The failure mode is ambiguous prompts, not model capability.
- Haiku 4.5 is underused by most teams. It's genuinely capable for structured validation and ranking tasks at roughly a third of Sonnet's price per input token. Most teams skip straight to Sonnet out of habit.
- The routing classifier itself costs almost nothing. My `classify()` function is pure regex — no model call, zero latency, zero cost. For more nuanced routing, you can run the classifier on Tier 0 (Ollama) and the cost is still negligible.
- Retry budgets matter. I give Tier 0 tasks two retries before escalating to Tier 1. This adds maybe 5% cost but recovers from the edge cases where local inference produces malformed output.
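The retry budget described above can be sketched as a generic escalation loop. This is a hypothetical helper, not part of the router module: `run_at_tier` and `is_malformed` are callables you'd supply for your own stack.

```python
def run_with_escalation(run_at_tier, is_malformed, prompt,
                        start_tier=0, max_tier=3, retries=2):
    """Retry the starting tier, then climb one tier per failure."""
    for tier in range(start_tier, max_tier + 1):
        # Extra attempts only at the starting (cheapest) tier
        attempts = retries + 1 if tier == start_tier else 1
        for _ in range(attempts):
            out = run_at_tier(tier, prompt)
            if not is_malformed(out):
                return tier, out
    raise RuntimeError("all tiers exhausted")
```

With `retries=2`, a malformed Tier 0 output gets two more local attempts before the task escalates to Tier 1, matching the budget described above.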
What Breaks Without This
The failure mode I see most often in unrouted agents isn't cost — it's the Sonnet context window filling up with low-value intermediate processing. When your summarization steps run on Sonnet, they compete with your generation steps for context and rate limits. Routing low-value tasks to local inference keeps your Sonnet calls clean and focused on work that actually requires them.
The second failure mode is rate limit exhaustion. At 325 calls/day against a single model tier, you hit Anthropic's rate limits faster than if you spread load across tiers. Tier distribution is rate limit distribution.
The Packaged Framework
The routing logic above is a simplified version of what I built and use in production. The full framework includes:
- Pre-trained classifiers for 40+ task types with confidence scores
- Cost tracking that logs actual spend per task type to a local SQLite DB
- A dashboard that shows cost breakdown and tier distribution over time
- Retry logic with automatic tier escalation on failure
- Integration examples for Claude Code scripts, Anthropic SDK, and LangChain
The full Token Cost Intelligence skill is available on ClawMart: Token Cost Intelligence — OpenClaw Optimization Framework ($29).
If you're running any Claude Code agents at scale — even moderate scale — the routing framework pays for itself in the first day of usage.
W. Kyle Million (K¹) builds autonomous AI infrastructure at IntuiTek¹. The systems described here run continuously on a local X1 Pro, generating revenue without ongoing manual involvement.