I run about 50 Claude Code agent calls a day. Only 8 of my agents need the expensive model.
The rest? They're writing commit messages, reviewing diffs, running tests, generating docs. Tasks that don't require deep reasoning — just reliable pattern matching. And yet, by default, every single one of those calls hits the same model at the same price.
Here's how I fixed that with a 3-tier routing strategy that sends each task to the cheapest model that can handle it.
## The problem: one model fits none
Claude Code's agent system is powerful. You can spin up subagents for code review, testing, commits, debugging — the works. But out of the box, they all use the same model. That's like paying a senior architect to format your README.
The fix isn't complicated. You just need to match the model to the task.
## The 3-tier model strategy
I run 25 agents across my development workflow. Here's how they break down:

- Tier 3: Sonnet (full reasoning) → 8 agents (32%)
- Tier 2: Haiku (fast + cheap) → 17 agents (68%)
- Tier 1: Ollama (free, local) → 2 models (0% API cost)
## Tier 3 — Sonnet: only when you need reasoning
These are the tasks where cutting corners burns you:
- Planning — decomposing a feature into ordered tasks with dependencies
- Debugging — multi-file root cause analysis from a stack trace
- Security review — catching injection vectors, CORS misconfig, auth gaps
- Complex implementation — writing actual business logic across files
- Research — investigating approaches, comparing tradeoffs
Sonnet stays on these because the cost of a wrong answer exceeds the cost of the API call. A bad security review doesn't save you money — it costs you an incident.
## Tier 2 — Haiku: the workhorse
This is where the savings live. These tasks need an LLM, but they don't need deep reasoning:
- Code review — pattern-matching against a checklist (missing error handling, unused imports, style violations)
- Test runner — executing tests, parsing output, reporting pass/fail
- Commit messages — reading a diff, writing an imperative summary
- Docs — updating a README section, writing a changelog entry
- DevOps — generating a Dockerfile, writing CI config from a template
- Git operations — merge conflict resolution, branch management
Haiku runs at $0.25/1M input tokens vs Sonnet's $3/1M. That's a 12x difference. For tasks that are essentially "read this structured input, produce this structured output," Haiku is more than capable.
Here's what the model assignment looks like — one field per agent definition:
```yaml
# code-reviewer agent
model: haiku   # doesn't need Sonnet for checklist-style review

# debugger agent
model: sonnet  # root cause analysis needs real reasoning

# commit agent
model: haiku   # diff in, message out — bounded task
```
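For context, a Claude Code subagent is a Markdown file under `.claude/agents/` with that `model` field in its YAML frontmatter. A minimal code-reviewer definition might look like this (the prompt body is illustrative, not the framework's actual prompt):

```markdown
---
name: code-reviewer
description: Checklist-style review of diffs. Use after code changes.
model: haiku
---

You are a code reviewer. Check the diff against the checklist:
missing error handling, unused imports, style violations.
Report findings as a bullet list; do not rewrite the code.
```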
## Tier 1 — Ollama: zero cost, zero latency
Some tasks are so mechanical that even Haiku is overkill. For these, I route to local Ollama models running on my Mac:
```yaml
# LiteLLM routing config
model_list:
  - model_name: local-commit
    litellm_params:
      model: ollama/tavernari/git-commit-message
      api_base: http://localhost:11434
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434

router_settings:
  fallback_models:
    - claude-haiku-4-5 # escalation safety net
```
tavernari/git-commit-message is a purpose-built 8B model that reads diffs and outputs conventional commit messages. It runs at 40+ tokens/sec on Apple Silicon with zero API cost. For a task I trigger dozens of times a day, that adds up.
The key detail: fallback_models. If the local model fails validation, the request escalates to Haiku automatically. You get the cost savings without the risk.
## The quality gate: don't trust, verify
Routing to cheaper models only works if you catch bad output before it hits your codebase. I use a validation script that sits between the local model and the next stage:
```bash
# Pipe contractor output through validation
echo "$DIFF" | ollama run tavernari/git-commit-message \
  | cast-validate-contractor.sh --type commit --model local-commit
```
The validator checks for:
- Empty output — model didn't generate anything useful
- Hallucination markers — "As an AI", "I cannot", "I'm not sure"
- Length bounds — too short (lazy) or too long (rambling)
- Format compliance — commit messages must start with a capital letter in imperative mood
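Those four checks fit in a few lines. Here's a minimal Python sketch, a hypothetical stand-in for the real `cast-validate-contractor.sh` (the length bounds are made-up defaults, not the script's actual values):

```python
# Hypothetical validator implementing the four checks above —
# not the actual cast-validate-contractor.sh.
HALLUCINATION_MARKERS = ("as an ai", "i cannot", "i'm not sure")

def validate_commit_message(text, min_len=10, max_len=200):
    """Return (ok, reason) for a candidate commit message."""
    msg = text.strip()
    if not msg:                              # 1. empty output
        return False, "empty output"
    lowered = msg.lower()
    for marker in HALLUCINATION_MARKERS:     # 2. hallucination markers
        if marker in lowered:
            return False, f"hallucination marker: {marker!r}"
    if len(msg) < min_len:                   # 3. length bounds
        return False, "too short"
    if len(msg) > max_len:
        return False, "too long"
    if not msg[0].isupper():                 # 4. format compliance
        return False, "must start with a capital letter"
    return True, "ok"
```

A failing check maps to a nonzero exit in the shell wrapper, which is what lets the pipeline trigger the escalation.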
If validation fails, the task escalates to Haiku. If Haiku's output also fails review, it escalates to Sonnet. Every escalation gets logged, so over time you can see which tasks actually need the more expensive model and which ones you're safely routing locally.
```
Local model output
  → Validation (format, length, hallucination check)
       ✓ pass → next stage
       ✗ fail → escalate to Haiku
            ✗ fail → escalate to Sonnet
  → log escalation reason
```
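That ladder reduces to a short loop. A sketch, assuming your framework exposes some `run(model, task)` and `validate(output)` hooks; the tier names here are illustrative, not the framework's API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("escalation")

# Illustrative tier order: cheapest first, Sonnet as the last resort.
TIERS = ["local-commit", "claude-haiku-4-5", "claude-sonnet-4-5"]

def run_with_escalation(task, run, validate):
    """Try each tier in order; log why each cheaper tier was rejected."""
    for model in TIERS:
        output = run(model, task)
        ok, reason = validate(output)
        if ok:
            return model, output
        log.info("escalating past %s: %s", model, reason)
    raise RuntimeError("all tiers failed validation")
```

The log lines are the part that pays off later: they're the data you grep to decide which tasks can safely stay local.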
## What this looks like in practice
Here's a realistic daily breakdown at ~50 agent calls:
| Tier | Calls/day | Avg tokens | Cost/1K tokens | Daily cost |
|---|---|---|---|---|
| Sonnet | 20 | 6,000 | $0.003 | $0.36 |
| Haiku | 18 | 1,500 | $0.00025 | ~$0.01 |
| Ollama | 12 | 1,500 | $0.00 | $0.00 |
| Total | 50 | | | ~$0.37/day |
Without tiering (all 50 calls on Sonnet at the heavier 6,000-token profile), that same day costs ~$0.90, so tiering is roughly a 60% reduction. Even with a mixed baseline, the Ollama tier alone eliminates your most frequent API calls entirely, and the gap widens with volume.
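The table's arithmetic is easy to check. Prices are per million input tokens, matching the rates quoted above, and the baseline assumes all 50 calls run on Sonnet at the heavier 6,000-token profile:

```python
# (calls/day, avg input tokens, $ per 1M input tokens) for each tier
tiers = {
    "sonnet": (20, 6000, 3.00),
    "haiku":  (18, 1500, 0.25),
    "ollama": (12, 1500, 0.00),
}

def daily_cost(calls, tokens, price_per_m):
    return calls * tokens * price_per_m / 1_000_000

tiered = sum(daily_cost(*t) for t in tiers.values())
baseline = daily_cost(50, 6000, 3.00)  # everything on Sonnet

print(f"tiered:   ${tiered:.2f}/day")    # $0.37
print(f"baseline: ${baseline:.2f}/day")  # $0.90
print(f"savings:  {1 - tiered / baseline:.0%}")
```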
But honestly? The bigger win isn't cost. It's latency. Local Ollama inference on Apple Silicon has no network round-trip. For commit messages and log summaries that fire multiple times per session, the response feels instant. That's a workflow improvement you notice every single session.
## What NOT to route locally
This is just as important as what you do route. Keep these on Sonnet:
- Security analysis — small models miss subtle vulnerabilities. A false negative here has real consequences.
- Root cause debugging — multi-step causal reasoning across files and stack traces. 7B models generate plausible-sounding but wrong hypotheses.
- Planning and task decomposition — requires understanding the full codebase context and dependency ordering.
- Complex code generation — anything beyond boilerplate. The risk is subtle bugs that pass review but fail at runtime.
- Anything requiring >8K context — local models degrade quickly past their context window.
The rule of thumb: if the cost of a wrong answer is "I regenerate it," route it cheap. If the cost is "I debug it for an hour," keep it on Sonnet.
## Try it yourself
The tiered model strategy isn't tied to any specific framework — you can apply it to any Claude Code setup with subagents. The key ideas:
- Audit your agent calls. Which ones are just "structured input → structured output"?
- Drop those to Haiku. One config change per agent.
- For the most mechanical tasks, try Ollama locally. Commit messages are the easiest starting point.
- Add a validation gate. Never let cheap model output flow unchecked into your codebase.
If you want to see the full implementation — agent definitions, LiteLLM configs, validation scripts, and the escalation logging — the framework I built this on is open source:
- castframework.dev — docs and architecture overview
- GitHub: claude-agent-team — the core framework with all 25 agents
- GitHub: cast-hooks — hook scripts including the contractor validator
What's your agent-to-model ratio? Are you running everything on the same tier, or have you started routing? Drop a comment — I'm curious how others are handling this.