I run about 50 Claude Code agent calls a day. Only 8 of my agents need the expensive model.
The rest? They're writing commit messages, reviewing diffs, running tests, generating docs. Tasks that don't require deep reasoning — just reliable pattern matching. And yet, by default, every single one of those calls hits the same model at the same price.
Here's how I fixed that with a 3-tier routing strategy that sends each task to the cheapest model that can handle it.
## The problem: one model fits none
Claude Code's agent system is powerful. You can spin up subagents for code review, testing, commits, debugging — the works. But out of the box, they all use the same model. That's like paying a senior architect to format your README.
The fix isn't complicated. You just need to match the model to the task.
## The 3-tier model strategy
I run 25 agents across my development workflow. Here's how they break down:

- Tier 3: Sonnet (full reasoning) → 8 agents (32%)
- Tier 2: Haiku (fast + cheap) → 17 agents (68%)
- Tier 1: Ollama (free, local) → 2 models (0% API cost)
## Tier 3 — Sonnet: only when you need reasoning
These are the tasks where cutting corners burns you:
- Planning — decomposing a feature into ordered tasks with dependencies
- Debugging — multi-file root cause analysis from a stack trace
- Security review — catching injection vectors, CORS misconfig, auth gaps
- Complex implementation — writing actual business logic across files
- Research — investigating approaches, comparing tradeoffs
Sonnet stays on these because the cost of a wrong answer exceeds the cost of the API call. A bad security review doesn't save you money — it costs you an incident.
## Tier 2 — Haiku: the workhorse
This is where the savings live. These tasks need an LLM, but they don't need deep reasoning:
- Code review — pattern-matching against a checklist (missing error handling, unused imports, style violations)
- Test runner — executing tests, parsing output, reporting pass/fail
- Commit messages — reading a diff, writing an imperative summary
- Docs — updating a README section, writing a changelog entry
- DevOps — generating a Dockerfile, writing CI config from a template
- Git operations — merge conflict resolution, branch management
Haiku runs at $0.25/1M input tokens vs Sonnet's $3/1M. That's a 12x difference. For tasks that are essentially "read this structured input, produce this structured output," Haiku is more than capable.
Here's what the model assignment looks like — one field per agent definition:
```yaml
# code-reviewer agent
model: haiku   # doesn't need Sonnet for checklist-style review

# debugger agent
model: sonnet  # root cause analysis needs real reasoning

# commit agent
model: haiku   # diff in, message out — bounded task
```
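For context, a Claude Code subagent is a Markdown file under `.claude/agents/` with that `model` field in its YAML frontmatter. A minimal code-reviewer definition might look like this (the prompt body is illustrative, not the framework's actual prompt):

```markdown
---
name: code-reviewer
description: Checklist-style review of diffs. Use after code changes.
model: haiku
---

You are a code reviewer. Check the diff against the checklist:
missing error handling, unused imports, style violations.
Report findings as a bullet list; do not rewrite the code.
```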
## Tier 1 — Ollama: zero cost, zero latency
Some tasks are so mechanical that even Haiku is overkill. For these, I route to local Ollama models running on my Mac:
```yaml
# LiteLLM routing config
model_list:
  - model_name: local-commit
    litellm_params:
      model: ollama/tavernari/git-commit-message
      api_base: http://localhost:11434
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434

router_settings:
  fallback_models:
    - claude-haiku-4-5 # escalation safety net
```
tavernari/git-commit-message is a purpose-built 8B model that reads diffs and outputs conventional commit messages. It runs at 40+ tokens/sec on Apple Silicon with zero API cost. For a task I trigger dozens of times a day, that adds up.
The key detail: fallback_models. If the local model fails validation, the request escalates to Haiku automatically. You get the cost savings without the risk.
## The quality gate: don't trust, verify
Routing to cheaper models only works if you catch bad output before it hits your codebase. I use a validation script that sits between the local model and the next stage:
```bash
# Pipe contractor output through validation
echo "$DIFF" | ollama run tavernari/git-commit-message \
  | cast-validate-contractor.sh --type commit --model local-commit
```
The validator checks for:
- Empty output — model didn't generate anything useful
- Hallucination markers — "As an AI", "I cannot", "I'm not sure"
- Length bounds — too short (lazy) or too long (rambling)
- Format compliance — commit messages must start with a capital letter in imperative mood
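Those four checks fit in a few lines. Here's a minimal Python sketch, a hypothetical stand-in for the real `cast-validate-contractor.sh` (the length bounds are made-up defaults, not the script's actual values):

```python
# Hypothetical validator implementing the four checks above —
# not the actual cast-validate-contractor.sh.
HALLUCINATION_MARKERS = ("as an ai", "i cannot", "i'm not sure")

def validate_commit_message(text, min_len=10, max_len=200):
    """Return (ok, reason) for a candidate commit message."""
    msg = text.strip()
    if not msg:                              # 1. empty output
        return False, "empty output"
    lowered = msg.lower()
    for marker in HALLUCINATION_MARKERS:     # 2. hallucination markers
        if marker in lowered:
            return False, f"hallucination marker: {marker!r}"
    if len(msg) < min_len:                   # 3. length bounds
        return False, "too short"
    if len(msg) > max_len:
        return False, "too long"
    if not msg[0].isupper():                 # 4. format compliance
        return False, "must start with a capital letter"
    return True, "ok"
```

A failing check maps to a nonzero exit in the shell wrapper, which is what lets the pipeline trigger the escalation.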
If validation fails, the task escalates to Haiku. If Haiku's output also fails review, it escalates to Sonnet. Every escalation gets logged, so over time you can see which tasks actually need the more expensive model and which ones you're safely routing locally.
```
Local model output
  → Validation (format, length, hallucination check)
       ✓ pass → next stage
       ✗ fail → escalate to Haiku
            ✗ fail → escalate to Sonnet
  → log escalation reason
```
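That ladder reduces to a short loop. A sketch, assuming your framework exposes some `run(model, task)` and `validate(output)` hooks; the tier names here are illustrative, not the framework's API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("escalation")

# Illustrative tier order: cheapest first, Sonnet as the last resort.
TIERS = ["local-commit", "claude-haiku-4-5", "claude-sonnet-4-5"]

def run_with_escalation(task, run, validate):
    """Try each tier in order; log why each cheaper tier was rejected."""
    for model in TIERS:
        output = run(model, task)
        ok, reason = validate(output)
        if ok:
            return model, output
        log.info("escalating past %s: %s", model, reason)
    raise RuntimeError("all tiers failed validation")
```

The log lines are the part that pays off later: they're the data you grep to decide which tasks can safely stay local.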
## What this looks like in practice
Here's a realistic daily breakdown at ~50 agent calls:
| Tier | Calls/day | Avg tokens | Cost/1K tokens | Daily cost |
|---|---|---|---|---|
| Sonnet | 20 | 6,000 | $0.003 | $0.36 |
| Haiku | 18 | 1,500 | $0.00025 | ~$0.01 |
| Ollama | 12 | 1,500 | $0.00 | $0.00 |
| Total | 50 | | | ~$0.37/day |
Without tiering (all 50 calls on Sonnet at the heavier 6,000-token profile), that same day costs ~$0.90, so tiering is roughly a 60% reduction. Even with a mixed baseline, the Ollama tier alone eliminates your most frequent API calls entirely, and the gap widens with volume.
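The table's arithmetic is easy to check. Prices are per million input tokens, matching the rates quoted above, and the baseline assumes all 50 calls run on Sonnet at the heavier 6,000-token profile:

```python
# (calls/day, avg input tokens, $ per 1M input tokens) for each tier
tiers = {
    "sonnet": (20, 6000, 3.00),
    "haiku":  (18, 1500, 0.25),
    "ollama": (12, 1500, 0.00),
}

def daily_cost(calls, tokens, price_per_m):
    return calls * tokens * price_per_m / 1_000_000

tiered = sum(daily_cost(*t) for t in tiers.values())
baseline = daily_cost(50, 6000, 3.00)  # everything on Sonnet

print(f"tiered:   ${tiered:.2f}/day")    # $0.37
print(f"baseline: ${baseline:.2f}/day")  # $0.90
print(f"savings:  {1 - tiered / baseline:.0%}")
```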
But honestly? The bigger win isn't cost. It's latency. Local Ollama inference on Apple Silicon has no network round-trip. For commit messages and log summaries that fire multiple times per session, the response feels instant. That's a workflow improvement you notice every single session.
## What NOT to route locally
This is just as important as what you do route. Keep these on Sonnet:
- Security analysis — small models miss subtle vulnerabilities. A false negative here has real consequences.
- Root cause debugging — multi-step causal reasoning across files and stack traces. 7B models generate plausible-sounding but wrong hypotheses.
- Planning and task decomposition — requires understanding the full codebase context and dependency ordering.
- Complex code generation — anything beyond boilerplate. The risk is subtle bugs that pass review but fail at runtime.
- Anything requiring >8K context — local models degrade quickly past their context window.
The rule of thumb: if the cost of a wrong answer is "I regenerate it," route it cheap. If the cost is "I debug it for an hour," keep it on Sonnet.
## Try it yourself
The tiered model strategy isn't tied to any specific framework — you can apply it to any Claude Code setup with subagents. The key ideas:
- Audit your agent calls. Which ones are just "structured input → structured output"?
- Drop those to Haiku. One config change per agent.
- For the most mechanical tasks, try Ollama locally. Commit messages are the easiest starting point.
- Add a validation gate. Never let cheap model output flow unchecked into your codebase.
If you want to see the full implementation — agent definitions, LiteLLM configs, validation scripts, and the escalation logging — the framework I built this on is open source:
- castframework.dev — docs and architecture overview
- GitHub: claude-agent-team — the core framework with all 25 agents
- GitHub: cast-hooks — hook scripts including the contractor validator
What's your agent-to-model ratio? Are you running everything on the same tier, or have you started routing? Drop a comment — I'm curious how others are handling this.