DEV Community

Claude Code Is Burning Your API Budget: The Model Routing Architecture That Fixes It

Published: 2026-04-10 | Author: ~K¹ / IntuiTek¹


Most Claude Code agents default to Claude Sonnet for everything. Every task — whether it's classifying a one-sentence input or synthesizing a 50-page analysis — hits the same frontier model at the same price.

That's expensive. And unnecessary. Most agent work doesn't require frontier reasoning.

This is the model routing architecture I've been running on a production autonomous agent for several weeks. It cuts API spend to near-zero for a large class of tasks while keeping quality exactly where it needs to be.


The Problem: One Model For Everything

When you let an agent run autonomously, it executes many tasks you never directly observe. Inbox polling. Classification. Summarization. Routing decisions. Content extraction. Cache lookups.

None of those require Claude Sonnet. But if your agent is calling the Anthropic API for all of them, you're paying Sonnet prices for work that a 7B local model handles correctly 95% of the time.

Over a day of autonomous operation — inbox heartbeats every 10 minutes, background monitoring, content generation — those costs compound. More importantly, you're burning through rate limits and subscription headroom you need for the tasks that actually warrant frontier quality.


The 4-Tier Model Routing Architecture

The fix is a tiered routing system. Every task gets assigned to the cheapest tier that can handle it correctly. Here's the full tier table:

Tier 0 | Local (Ollama)    | Classification, routing, summarization, extraction
Tier 1 | Claude Haiku      | Structured tasks needing API-quality output
Tier 2 | Claude Sonnet     | Primary reasoning, code, multi-step synthesis
Tier 3 | Claude Opus       | Highest-stakes decisions, irreversible actions

The goal: push as much work as possible to Tier 0 (zero cost), use Tier 1 for structured outputs that need reliability, reserve Tier 2 for actual reasoning, and treat Tier 3 as nearly never-used.


Tier 0: Local Inference With Ollama

Ollama runs language models locally. On a machine with a capable GPU or NPU, it's genuinely fast. On a CPU-only box, it's slower but still workable for async tasks.

Install:

curl -fsSL https://ollama.com/install.sh | sh

Pull a model:

ollama pull qwen2.5:7b    # 4.7GB — good generalist, fast
ollama pull qwen2.5:14b   # 9GB — better quality, slower

Verify:

ollama list
ollama run qwen2.5:7b "Classify this task: summarize a user inbox message. Return: classification/routing/generation/analysis"

Start server at boot (cron):

@reboot /home/aegis/.local/bin/ollama serve >> /home/aegis/intuitek/logs/ollama.log 2>&1

API usage (this is Ollama's native HTTP API; Ollama also exposes an OpenAI-compatible endpoint under /v1 if your tooling expects that interface):

curl http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Is this a question or a statement? Reply with one word.\n\nInput: What time is the meeting?","stream":false}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"

Tier 0 is your first call. If the task fits — make it here. Zero API cost.
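The same call works from Python with only the standard library. A minimal sketch — the helper names are mine, but the endpoint and payload shape match the curl example above:

```python
import json
import urllib.request

# Default Ollama endpoint; adjust if you serve on another host/port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def parse_response(raw: bytes) -> str:
    """Pull the generated text out of a non-streaming response body."""
    return json.loads(raw)["response"]

def tier0_generate(prompt: str, model: str = "qwen2.5:7b") -> str:
    """One Tier 0 call: local inference, zero API cost."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_response(resp.read()).strip()
```

Wrap `tier0_generate` in your agent's tooling and every classification or extraction task becomes a local call.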


Tier 1: Haiku For Structured Output

Some tasks need more reliability than a local 7B model provides, but don't need Sonnet's reasoning depth. Haiku hits that gap.

Good Haiku use cases:

  • JSON extraction from semi-structured text
  • Template filling with validation
  • Short-form content generation (subject lines, metadata, descriptions)
  • Tool call routing where format matters
  • Confirmation logic ("does this output match the schema? yes/no")

Haiku is cheap. For most background tasks, Haiku cost is negligible compared to Sonnet. The key is not using Haiku where you could use Tier 0, and not using Sonnet where Haiku would work.
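The confirmation-logic bullet is worth encoding directly: a cheap local check on Tier 0 output decides whether a task ever needs to escalate to Haiku at all. A sketch, with illustrative function names:

```python
import json

def matches_schema(raw: str, required_keys: set[str]) -> bool:
    """Does a model's output parse as JSON and carry the expected keys?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def pick_tier(tier0_output: str, required_keys: set[str]) -> int:
    # Accept the free Tier 0 answer when it validates;
    # otherwise escalate one tier — not straight to Sonnet.
    return 0 if matches_schema(tier0_output, required_keys) else 1
```

The escalation is deliberately one step: a malformed local answer is evidence the task needs Tier 1, not evidence it needs frontier reasoning.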


Tier 2: Sonnet For Real Reasoning

Claude Sonnet (subscription or API) is your primary worker for tasks that require:

  • Multi-step reasoning with dependencies
  • Code generation and debugging
  • Long-form synthesis from multiple sources
  • Judgment calls with ambiguous inputs
  • Anything that touches production systems

The routing rule: if the task requires more than one reasoning step, or if a wrong answer has meaningful consequences, use Sonnet.

On a Claude Max subscription, Sonnet is your primary resource. Treat it as the default for everything non-trivial, but don't waste it on classification.
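The routing rule reduces to a one-line predicate (names are illustrative):

```python
def needs_tier2(reasoning_steps: int, wrong_answer_costly: bool) -> bool:
    """Route to Sonnet when the task needs more than one dependent
    reasoning step, or when a wrong answer has real consequences."""
    return reasoning_steps > 1 or wrong_answer_costly
```

Classification is one step with a recoverable failure mode, so it never trips this check — which is exactly why it belongs at Tier 0.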


Tier 3: Opus For Irreversibility

Opus is expensive. It's also the best reasoning model available. Use it exactly when:

  • The decision is irreversible (dropping data, production deploys, removing credentials)
  • The stakes are asymmetric (getting this wrong costs significantly more than getting it right)
  • You need the system's best possible judgment, not good-enough judgment

In practice, a well-designed autonomous agent almost never needs Opus. If you find yourself reaching for it frequently, it's a signal that your task decomposition is wrong — you're loading too much into single calls that should be broken into cheaper steps.


The Routing Decision In Practice

Here's the decision tree I encode in every agent's CLAUDE.md:

Is this task running inside a headless cron? (background/async)
  └─ YES → Start at Tier 0. Escalate only if output is wrong format.
  └─ NO → Evaluate task type:
       └─ Classification / routing / summarization → Tier 0
       └─ Structured extraction needing reliability → Tier 1
       └─ Reasoning, code, synthesis, judgment → Tier 2
       └─ Irreversible action, highest-stakes → Tier 3

Code-level implementation:

def route_to_model(task_type: str, is_async: bool = False) -> str:
    """Return the model string for a task, cheapest viable tier first."""

    tier0_tasks = {"classify", "route", "summarize", "extract", "label", "filter"}
    tier1_tasks = {"structured_output", "template_fill", "validate", "short_generate"}
    tier3_tasks = {"irreversible_action", "production_deploy", "credential_change"}

    if task_type in tier3_tasks:
        return "claude-opus-4-6"  # Irreversible actions are never downgraded

    if is_async or task_type in tier0_tasks:
        # Background/headless work starts at Tier 0; escalate only
        # if the output fails a format check.
        return "ollama/qwen2.5:7b"  # Zero cost

    if task_type in tier1_tasks:
        return "claude-haiku-4-5-20251001"

    return "claude-sonnet-4-6"  # Default: Tier 2

In an agentic Claude Code context, this routing is encoded as instructions in CLAUDE.md. The agent reads it at session start and routes accordingly:

## MODEL ROUTING — CHECK BEFORE EVERY API CALL

| Tier | Model | Use for |
|------|-------|---------|
| 0 | Ollama local (qwen2.5:7b) | Classification, routing, summarization, extraction |
| 1 | Haiku 4.5 | Structured tasks needing API quality |
| 2 | Claude Sonnet | Primary reasoning and code |
| 3 | Opus | Highest stakes only — irreversible actions |

Before any API call: assign a Tier. Start at 0. Escalate only with justification.
Check Ollama first: `ollama list`. If unavailable, escalate to Tier 1 minimum.
Do not silently fall through to Sonnet.

The last line matters: do not silently fall through. Without that instruction, agents will default to Sonnet when Ollama is unavailable — which defeats the routing system on any day the local server is down.
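The guard itself is a few lines. A sketch, reusing the model strings from earlier; the availability check is a plain TCP probe of Ollama's default port:

```python
import socket

def ollama_available(host: str = "localhost", port: int = 11434) -> bool:
    """Cheap liveness probe for the local Ollama server."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

def tier0_model(available: bool) -> str:
    # Do not silently fall through to Sonnet: if the local server
    # is down, escalate exactly one tier, to Haiku.
    return "ollama/qwen2.5:7b" if available else "claude-haiku-4-5-20251001"
```

Run the probe once per cycle, not per call — a heartbeat that checks the port 84 times a day is its own kind of waste.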


What This Looks Like In A Real Agent Loop

My autonomous agent runs an inbox heartbeat every 10 minutes. Here's the actual flow with routing applied:

10:00 — Inbox heartbeat fires (headless cron)
  → Check ~/intuitek/inbox/ for files [filesystem — no model call]
  → Empty. Check coordination/ [filesystem]
  → Empty. Log heartbeat [filesystem]
  → Evaluate: any idle-state work worth doing?
     → Task type: "assess current operational state and select next action"
     → This requires judgment → Tier 2 (Sonnet)
  → Decision: write new article
     → Task type: "long-form technical content generation"
     → Tier 2 confirmed
     → Draft article [Sonnet]
  → Task type: "extract article metadata (title, slug, tags)"
     → Tier 0 — local extraction
     → Run: ollama run qwen2.5:7b "Extract title and 3 tags from: ..."
  → Publish to dev.to [API call — no model needed]
  → Task type: "classify post flair: tutorial, resource, discussion, showcase"
     → Tier 0
     → Route: Tutorial/Guide
  → Post to Reddit [API call — no model needed]
  → Update SHARED_MIND.md [write — no model call]

10:47 — Cycle complete. Model usage: Sonnet for draft + judgment only.

The classification and metadata extraction steps that would otherwise call Sonnet are routed to Tier 0. In a full day of operation, this represents dozens of calls that never reach the API.


The Cost Breakdown

Here's the rough math on a day of autonomous operation, with and without routing:

Without routing (everything → Sonnet):

- 84 inbox heartbeats (every 10 minutes over a ~14-hour operating day): 84 × ~500 tokens = ~42k tokens input
- 10 classification/routing decisions: 10 × ~200 tokens = ~2k tokens
- 5 content extractions: 5 × ~300 tokens = ~1.5k tokens
- 2 article drafts: 2 × ~3k tokens = ~6k tokens
Total input: ~51.5k tokens → Sonnet price

With routing:

- 84 inbox heartbeats: filesystem checks, no model call = $0.00
- 10 classification decisions: Tier 0 (Ollama) = $0.00
- 5 extractions: Tier 0 = $0.00
- 2 article drafts: Tier 2 (Sonnet) = 6k tokens input
Total API input: ~6k tokens → ~88% fewer input tokens overall, with background ops dropping to zero

The quality gap doesn't matter for the classification tasks. Local 7B models are accurate enough for "is this a classification task or a generation task?" They're not accurate enough for "synthesize this complex legal argument" — which is why that stays at Tier 2.
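The arithmetic is easy to sanity-check. A sketch, with an illustrative per-million-token input price (the real rate depends on your plan and model version):

```python
SONNET_INPUT_PRICE = 3.00  # illustrative USD per million input tokens

def daily_input_cost(tokens: int, price_per_mtok: float = SONNET_INPUT_PRICE) -> float:
    """Input-side cost of a day's tokens at a given rate."""
    return tokens / 1_000_000 * price_per_mtok

without_routing = daily_input_cost(51_500)   # everything hits Sonnet
with_routing = daily_input_cost(6_000)       # only the two drafts do
token_reduction = 1 - 6_000 / 51_500         # holds regardless of price
```

The absolute dollar amounts are small either way; the rate-limit and subscription headroom freed up for real reasoning work is the bigger win.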


The One Thing That Breaks This

The failure mode: routing instructions in CLAUDE.md that the agent ignores.

This happens when:

  1. The instruction is buried — agents skim long files in time-constrained sessions
  2. The instruction is ambiguous — "use cheaper models when possible" is not actionable
  3. No enforcement — the agent has no feedback when it violates the rule

Fixes:

  • Put routing at the top of CLAUDE.md, before any other configuration
  • Make it a table, not prose — agents parse tables reliably
  • Add a check: "Before any API call, state which tier you are using and why"
  • For critical cost control: instrument your agent to log every API call with the model used, then review the log

The instrumentation step sounds manual. It is, initially. But once you see which tasks are consistently hitting Tier 2 when they should be at Tier 0, you can add them to the routing table and catch it automatically on the next run.
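Instrumentation can start as one logging function plus a Counter over the log. A sketch — the names and log shape are illustrative:

```python
from collections import Counter
from datetime import datetime, timezone

CALL_LOG: list[dict] = []  # swap for an append-only file in production

def log_model_call(task_type: str, model: str) -> None:
    """Record every model call so tier drift shows up in review."""
    CALL_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "task": task_type,
        "model": model,
    })

def review(log: list[dict]) -> Counter:
    """Count (task, model) pairs. Sonnet hits on Tier 0 task types
    are the candidates to add to the routing table."""
    return Counter((entry["task"], entry["model"]) for entry in log)
```

A daily glance at `review(CALL_LOG)` is usually enough to catch a task type that's quietly defaulting to Sonnet.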


Putting It Together

The architecture is simple:

  1. Install Ollama locally. Pull qwen2.5:7b or your preferred model.
  2. Add the tier routing table to your CLAUDE.md — at the top.
  3. Add the "do not silently fall through" guard.
  4. For any background/async task: default to Tier 0, escalate only on failure.
  5. Instrument and review: check which calls are hitting Sonnet that shouldn't be.

Most agents running today are paying Sonnet rates for classification work. That's not a reasoning problem — it's a routing problem. Once you add the table and make it explicit, the agent follows it.

The complete routing configuration — including the CLAUDE.md template, the Ollama setup script, and the instrumentation pattern — is available as a packaged skill at ClawMart.


~K¹ (W. Kyle Million) / IntuiTek¹ — autonomous AI infrastructure
