Claude Code Is Burning Your API Budget: The Model Routing Architecture That Fixes It
Published: 2026-04-10 | Author: ~K¹ / IntuiTek¹
Most Claude Code agents default to Claude Sonnet for everything. Every task — whether it's classifying a one-sentence input or synthesizing a 50-page analysis — hits the same frontier model at the same price.
That's expensive. And unnecessary. Most agent work doesn't require frontier reasoning.
This is the model routing architecture I've been running on a production autonomous agent for several weeks. It cuts API spend to near-zero for a large class of tasks while keeping quality exactly where it needs to be.
The Problem: One Model For Everything
When you let an agent run autonomously, it executes many tasks you never directly observe. Inbox polling. Classification. Summarization. Routing decisions. Content extraction. Cache lookups.
None of those require Claude Sonnet. But if your agent is calling the Anthropic API for all of them, you're paying Sonnet prices for work that a 7B local model handles correctly 95% of the time.
Over a day of autonomous operation — inbox heartbeats every 10 minutes, background monitoring, content generation — those costs compound. More importantly, you're burning through rate limits and subscription headroom you need for the tasks that actually warrant frontier quality.
The 4-Tier Model Routing Architecture
The fix is a tiered routing system. Every task gets assigned to the cheapest tier that can handle it correctly. Here's the full tier table:
| Tier | Model | Use for |
|------|-------|---------|
| 0 | Local (Ollama) | Classification, routing, summarization, extraction |
| 1 | Claude Haiku | Structured tasks needing API-quality output |
| 2 | Claude Sonnet | Primary reasoning, code, multi-step synthesis |
| 3 | Claude Opus | Highest-stakes decisions, irreversible actions |
The goal: push as much work as possible to Tier 0 (zero cost), use Tier 1 for structured outputs that need reliability, reserve Tier 2 for actual reasoning, and treat Tier 3 as nearly never-used.
Tier 0: Local Inference With Ollama
Ollama runs language models locally. On a machine with a capable GPU or NPU, it's genuinely fast. On a CPU-only box, it's slower but still workable for async tasks.
Install:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Pull a model:
```shell
ollama pull qwen2.5:7b    # 4.7GB — good generalist, fast
ollama pull qwen2.5:14b   # 9GB — better quality, slower
```
Verify:
```shell
ollama list
ollama run qwen2.5:7b "Classify this task: summarize a user inbox message. Return: classification/routing/generation/analysis"
```
Start server at boot (cron):
```shell
@reboot /home/aegis/.local/bin/ollama serve >> /home/aegis/intuitek/logs/ollama.log 2>&1
```
API usage (Ollama's native HTTP API on `localhost:11434`; it also exposes an OpenAI-compatible endpoint under `/v1`):
```shell
curl http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Is this a question or a statement? Reply with one word.\n\nInput: What time is the meeting?","stream":false}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
```
Tier 0 is your first call. If the task fits — make it here. Zero API cost.
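If you prefer to call Ollama from agent code rather than shelling out to `curl`, the same request is a few lines of stdlib Python. This is a minimal sketch assuming the default Ollama port; `build_payload` and `ollama_generate` are illustrative names, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `stream: false`, the whole response arrives as one JSON object, which keeps the calling code trivial for background tasks where latency is not critical.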
Tier 1: Haiku For Structured Output
Some tasks need more reliability than a local 7B model provides, but don't need Sonnet's reasoning depth. Haiku hits that gap.
Good Haiku use cases:
- JSON extraction from semi-structured text
- Template filling with validation
- Short-form content generation (subject lines, metadata, descriptions)
- Tool call routing where format matters
- Confirmation logic ("does this output match the schema? yes/no")
Haiku is cheap. For most background tasks, Haiku cost is negligible compared to Sonnet. The key is not using Haiku where you could use Tier 0, and not using Sonnet where Haiku would work.
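The "does this output match the schema?" check in that list doesn't always need a model at all. A plain local validator can gate Haiku's structured output for free; only escalate when the check fails. A minimal sketch (`matches_schema` is a hypothetical helper, not a library function):

```python
import json

def matches_schema(raw: str, required: dict) -> bool:
    """Check that `raw` parses as JSON and every required key has the right type.

    `required` maps key name -> expected Python type, e.g. {"title": str}.
    This is a zero-cost local gate: escalate to a paid model only on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(k), t) for k, t in required.items())
```

For example, `matches_schema('{"title": "x", "tags": ["a"]}', {"title": str, "tags": list})` passes, while malformed JSON or a wrong type fails without spending a token.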
Tier 2: Sonnet For Real Reasoning
Claude Sonnet (subscription or API) is your primary worker for tasks that require:
- Multi-step reasoning with dependencies
- Code generation and debugging
- Long-form synthesis from multiple sources
- Judgment calls with ambiguous inputs
- Anything that touches production systems
The routing rule: if the task requires more than one reasoning step, or if a wrong answer has meaningful consequences, use Sonnet.
On a Claude Max subscription, Sonnet is your primary resource. Treat it as the default for everything non-trivial, but don't waste it on classification.
Tier 3: Opus For Irreversibility
Opus is expensive. It's also the best reasoning model available. Use it exactly when:
- The decision is irreversible (dropping data, production deploys, removing credentials)
- The stakes are asymmetric (getting this wrong costs significantly more than getting it right)
- You need the system's best possible judgment, not good-enough judgment
In practice, a well-designed autonomous agent almost never needs Opus. If you find yourself reaching for it frequently, it's a signal that your task decomposition is wrong — you're loading too much into single calls that should be broken into cheaper steps.
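One way to keep Opus usage rare and deliberate is to encode irreversibility as an explicit allowlist rather than a judgment call. A sketch, with hypothetical action names (adapt the set to your own agent's vocabulary):

```python
# Hypothetical allowlist: actions that are irreversible and warrant Tier 3.
IRREVERSIBLE_ACTIONS = {
    "drop_data",
    "production_deploy",
    "credential_change",
    "revoke_access",
}

def requires_opus(action: str) -> bool:
    """True only when the action is on the irreversible allowlist."""
    return action in IRREVERSIBLE_ACTIONS
```

Anything not on the list falls through to a cheaper tier by default, which is the point: Opus is opt-in per action, never the fallback.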
The Routing Decision In Practice
Here's the decision tree I encode in every agent's CLAUDE.md:
```text
Is this task running inside a headless cron? (background/async)
├─ YES → Start at Tier 0. Escalate only if output is wrong format.
└─ NO → Evaluate task type:
   ├─ Classification / routing / summarization → Tier 0
   ├─ Structured extraction needing reliability → Tier 1
   ├─ Reasoning, code, synthesis, judgment → Tier 2
   └─ Irreversible action, highest-stakes → Tier 3
```
Code-level implementation:
```python
def route_to_model(task_type: str, is_async: bool = False) -> str:
    """Return the model string for a task."""
    tier0_tasks = {"classify", "route", "summarize", "extract", "label", "filter"}
    tier1_tasks = {"structured_output", "template_fill", "validate", "short_generate"}
    tier3_tasks = {"irreversible_action", "production_deploy", "credential_change"}

    if task_type in tier0_tasks:
        return "ollama/qwen2.5:7b"  # Zero cost, async or not
    if task_type in tier1_tasks:
        return "claude-haiku-4-5-20251001"
    if task_type in tier3_tasks:
        return "claude-opus-4-6"
    return "claude-sonnet-4-6"  # Default: Tier 2
```
In an agentic Claude Code context, this routing is encoded as instructions in CLAUDE.md. The agent reads it at session start and routes accordingly:
```markdown
## MODEL ROUTING — CHECK BEFORE EVERY API CALL

| Tier | Model | Use for |
|------|-------|---------|
| 0 | Ollama local (qwen2.5:7b) | Classification, routing, summarization, extraction |
| 1 | Haiku 4.5 | Structured tasks needing API quality |
| 2 | Claude Sonnet | Primary reasoning and code |
| 3 | Opus | Highest stakes only — irreversible actions |

Before any API call: assign a Tier. Start at 0. Escalate only with justification.
Check Ollama first: `ollama list`. If unavailable, escalate to Tier 1 minimum.
Do not silently fall through to Sonnet.
```
The last line matters: do not silently fall through. Without that instruction, agents will default to Sonnet when Ollama is unavailable — which defeats the routing system on any day the local server is down.
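In code, the guard amounts to an explicit, logged escalation path. A minimal sketch (`choose_tier0_model` is a hypothetical helper; the model strings match the tier table, and how availability is detected, e.g. an HTTP ping of the Ollama port, is left to the harness):

```python
import logging

logger = logging.getLogger("routing")

def choose_tier0_model(ollama_available: bool) -> str:
    """Pick the model for a Tier 0 task, escalating loudly when the
    local server is down instead of silently falling through to Sonnet."""
    if ollama_available:
        return "ollama/qwen2.5:7b"
    # Explicit, logged escalation to Tier 1 — never straight to Sonnet.
    logger.warning("Ollama unavailable: escalating Tier 0 task to Haiku")
    return "claude-haiku-4-5-20251001"
```

The warning line is the enforcement: every escalation leaves a trace you can grep for when reviewing a day's logs.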
What This Looks Like In A Real Agent Loop
My autonomous agent runs an inbox heartbeat every 10 minutes. Here's the actual flow with routing applied:
```text
10:00 — Inbox heartbeat fires (headless cron)
→ Check ~/intuitek/inbox/ for files [filesystem — no model call]
→ Empty. Check coordination/ [filesystem]
→ Empty. Log heartbeat [filesystem]
→ Evaluate: any idle-state work worth doing?
  → Task type: "assess current operational state and select next action"
  → This requires judgment → Tier 2 (Sonnet)
→ Decision: write new article
  → Task type: "long-form technical content generation"
  → Tier 2 confirmed
→ Draft article [Sonnet]
→ Task type: "extract article metadata (title, slug, tags)"
  → Tier 0 — local extraction
  → Run: ollama run qwen2.5:7b "Extract title and 3 tags from: ..."
→ Publish to dev.to [API call — no model needed]
→ Task type: "classify post flair: tutorial, resource, discussion, showcase"
  → Tier 0
  → Route: Tutorial/Guide
→ Post to Reddit [API call — no model needed]
→ Update SHARED_MIND.md [write — no model call]
10:47 — Cycle complete. Model usage: Sonnet for draft + judgment only.
```
The classification and metadata extraction steps that would otherwise call Sonnet are routed to Tier 0. In a full day of operation, this represents dozens of calls that never reach the API.
The Cost Breakdown
Here's the rough math on a day of autonomous operation, with and without routing:
Without routing (everything → Sonnet):
- 84 inbox heartbeats (every 10 minutes across a ~14-hour active window): 84 × ~500 tokens = ~42k tokens input
- 10 classification/routing decisions: 10 × ~200 tokens = ~2k tokens
- 5 content extractions: 5 × ~300 tokens = ~1.5k tokens
- 2 article drafts: 2 × ~3k tokens = ~6k tokens
Total input: ~51.5k tokens → Sonnet price
With routing:
- 84 inbox heartbeats: filesystem checks, no model call = $0.00
- 10 classification decisions: Tier 0 (Ollama) = $0.00
- 5 extractions: Tier 0 = $0.00
- 2 article drafts: Tier 2 (Sonnet) = 6k tokens input
Total API input: ~6k tokens. Background ops drop to zero API spend, and overall API input falls by roughly 88%.
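A quick sanity check on the arithmetic, using the token estimates above. Note the two framings: total API input falls by roughly 88%, while spend on background ops specifically drops to zero.

```python
# Input-token estimates from the breakdown above.
without_routing = 84 * 500 + 10 * 200 + 5 * 300 + 2 * 3000  # everything → Sonnet
with_routing = 2 * 3000                                      # only the drafts hit the API

reduction = 1 - with_routing / without_routing
print(f"{without_routing=} {with_routing=} reduction={reduction:.0%}")
# → without_routing=51500 with_routing=6000 reduction=88%
```

At current per-token prices the absolute dollar figures are small either way; the real win is the rate-limit and subscription headroom the routed calls no longer consume.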
The quality gap doesn't matter for the classification tasks. Local 7B models are accurate enough for "is this a classification task or a generation task?" They're not accurate enough for "synthesize this complex legal argument" — which is why that stays at Tier 2.
The One Thing That Breaks This
The failure mode: routing instructions in CLAUDE.md that the agent ignores.
This happens when:
- The instruction is buried — agents skim long files in time-constrained sessions
- The instruction is ambiguous — "use cheaper models when possible" is not actionable
- No enforcement — the agent has no feedback when it violates the rule
Fixes:
- Put routing at the top of CLAUDE.md, before any other configuration
- Make it a table, not prose — agents parse tables reliably
- Add a check: "Before any API call, state which tier you are using and why"
- For critical cost control: instrument your agent to log every API call with the model used, then review the log
The instrumentation step sounds manual. It is, initially. But once you see which tasks are consistently hitting Tier 2 when they should be at Tier 0, you can add them to the routing table and catch it automatically on the next run.
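The instrumentation itself can be very small. A sketch of the pattern (all names here are hypothetical; swap the in-memory list for an append-only log file in a real agent):

```python
import time

CALL_LOG: list[dict] = []  # in-memory for illustration; use a file in production

def log_api_call(model: str, task_type: str, input_tokens: int) -> None:
    """Record every model call so tier violations can be reviewed later."""
    CALL_LOG.append({
        "ts": time.time(),
        "model": model,
        "task_type": task_type,
        "input_tokens": input_tokens,
    })

def sonnet_calls_that_look_like_tier0(tier0_tasks: set[str]) -> list[dict]:
    """Flag Sonnet calls whose task type belongs at Tier 0."""
    return [c for c in CALL_LOG
            if c["model"].startswith("claude-sonnet")
            and c["task_type"] in tier0_tasks]
```

Reviewing the flagged list tells you exactly which task types to add to the routing table before the next run.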
Putting It Together
The architecture is simple:
- Install Ollama locally. Pull qwen2.5:7b or your preferred model.
- Add the tier routing table to your CLAUDE.md — at the top.
- Add the "do not silently fall through" guard.
- For any background/async task: default to Tier 0, escalate only on failure.
- Instrument and review: check which calls are hitting Sonnet that shouldn't be.
Most agents running today are paying Sonnet rates for classification work. That's not a reasoning problem — it's a routing problem. Once you add the table and make it explicit, the agent follows it.
The complete routing configuration — including the CLAUDE.md template, the Ollama setup script, and the instrumentation pattern — is available as a packaged skill at ClawMart.
~K¹ (W. Kyle Million) / IntuiTek¹ — autonomous AI infrastructure