DEV Community

Valentyn Solomko
Valentyn Solomko

Posted on

AI Agent Cost Optimization — Sonnet Hybrid (Haiku + OpenRouter)

Visualizes the transition from an expensive scheme to a new hybrid architecture and shows the result

Executive summary

Our subagents (.claude/agents/*.md) all run on Sonnet 4.6 by default at $3 / $15 per 1M (input/output). That's the single biggest line on our AI spend. This doc lays out a hybrid architecture where most subagents run on Haiku 4.5 ($1 / $5) and offload heavy generation/analysis to OpenRouter (paid key, already in .env.local) via the existing /fix-review REST plumbing. A handful of high-risk agents stay on Sonnet.

Expected savings at full rollout: ≈ 70–85 % of the current subagent spend, with eq­ui­valent quality for analytical tasks and gated quality for generational tasks.

Context

What exists today:

  • 11 subagents declared in .claude/agents/*.md, all on Sonnet by default.
  • /fix-review skill already delegates 3 model rounds to OpenRouter (DeepSeek V3.2, Qwen3 Coder Next, Grok 4.1 Fast) and only calls Sonnet for the Arbiter round. That's the template.
  • .claude/skills/lib/rest.sh exposes chat_payload + rest_post — provider-agnostic REST helpers, already wired for OpenRouter.
  • CLI agents installed: opencode, cursor-agent, kilo, codex. Each supports headless invocation and --model selection. We used codex exec for the 5 Tenancy v3 ship PRs while Claude's sub-agent quota was exhausted — that confirmed the hybrid pattern works.
  • OPENROUTER_API_KEY is paid — no free-tier rate-limit anxiety.

What's wasteful today:

  • Subagents like code-simplifier, docs-maintainer, pm-issue-writer do small, structured prose tasks on Sonnet. A Gemini Flash Lite at $0.037 / $0.15 would be ~100× cheaper and equally good for that workload.
  • code-generator spends Sonnet tokens on file IO loops that an OpenRouter-backed CLI (Codex / Opencode) can handle for ~10× less.

Provider/model inventory (verified 2026-04-22)

OpenRouter (paid, direct REST via lib/rest.sh)

Model $/1M in $/1M out Best for
deepseek/deepseek-v3.2 0.26 0.38 Code generation, refactor, security review
qwen/qwen3-coder-next 0.15 0.80 TS-heavy code, test generation
x-ai/grok-4.1-fast 0.20 0.50 Fast parallel review vote
google/gemini-2.5-flash 0.075 0.30 Docs, ADRs, prose
google/gemini-2.5-flash-lite 0.037 0.15 Lint classification, small structured calls

Baseline comparison:

  • Claude Sonnet 4.6: $3 / $15
  • Claude Haiku 4.5: $1 / $5

CLIs (useful when we need a built-in tool-use loop with file edits)

CLI Headless command Cheap / free options Notes
opencode opencode run --model X "prompt" opencode/gpt-5-nano, opencode/*-free, or any google/* Good for multi-file tasks
kilo kilo run --model X "message" kilo/kilo-auto/free, kilo/openrouter/free, kilo/x-ai/grok-code-fast-1:optimized:free Thin over OpenRouter
cursor-agent cursor-agent --model X -p "prompt" composer-2-fast (Cursor Pro), gpt-5.3-codex-* Good for large-scale rewrites
codex codex exec -C <dir> -s danger-full-access "prompt" OpenAI gpt-5.4 default Needs -s danger-full-access — bubblewrap sandbox unreliable on this host

Route CLIs through --model openrouter/<id> when we want them to hit the same OR quota.

Target architecture

Three subagent classes, three patterns:

Pattern A — Skill replaces subagent (direct OR REST)

For pure analytical subagents whose input is a small artifact (diff, lint output, hook source) and whose output is a structured text blob (report, suggestion list, simplified file).

parent Claude → Skill (thin bash) → rest_post to OpenRouter → deterministic output
Enter fullscreen mode Exit fullscreen mode

No agentic loop, no Sonnet at all. Example: code-simplifier/simplify skill.

Pattern B — Haiku orchestrator + direct OR REST

For subagents that need a validation gate: generate output with a cheap model, then have a small, cheap reasoner verify the shape, run tests / decide the next step.

Haiku orchestrator subagent
  → OR REST (e.g. qwen3-coder-next) to generate candidate
  → Haiku validates shape + runs verifier
  → success: commit; failure: retry with better prompt
Enter fullscreen mode Exit fullscreen mode

Applies to test-generator, security-reviewer (Haiku + chorus review), static-analysis.

Pattern C — Haiku orchestrator + CLI worker (tool-use loop)

When the task requires real file IO and iterative tool use (multi-file edits, builds, tests in a loop), delegate to a CLI that has a built-in tool-use loop.

Haiku orchestrator subagent
  → codex exec / opencode run (routed through openrouter/deepseek-v3.2)
  → Haiku reviews diff + lint + test output before commit
Enter fullscreen mode Exit fullscreen mode

Applies to code-generator.

Pattern D — Keep Sonnet

For subagents where project-specific conventions in CLAUDE.md matter more than cost, and where mistakes are high-blast-radius.

Applies to bug-fixer, migration-generator, project-devops.

Per-agent decision matrix

Agent Pattern Primary model Validation gate
code-generator C openrouter/deepseek/deepseek-v3.2 via codex exec Haiku 4.5 reviews diff; runs lint + tests
test-generator B openrouter/qwen/qwen3-coder-next Haiku 4.5 checks shape; runs npm run test
code-simplifier A (skill) openrouter/deepseek/deepseek-v3.2 None — output applied directly
security-reviewer B chorus review (3 OR models parallel) Haiku 4.5 synthesizes findings
static-analysis B openrouter/google/gemini-2.5-flash-lite Haiku 4.5 classifies safe vs unsafe fixes
docs-maintainer B openrouter/google/gemini-2.5-flash Haiku 4.5 lints prose against CLAUDE.md style
pm-issue-writer B openrouter/google/gemini-2.5-flash Haiku 4.5 checks OB Base compliance
ci-build-agent B (Haiku-only, no OR) Haiku 4.5 Small context; no delegation needed
bug-fixer D Sonnet 4.6 Iterative debugging needs full context
migration-generator D Sonnet 4.6 RLS / immutability conventions are subtle
project-devops D Sonnet 4.6 SSH / TLS / DB — prod-blast-radius

Migration order

  1. code-simplifier → Pattern A skill. Smallest scope. Baseline measurement.
  2. static-analysis → Pattern B. Similar surface.
  3. docs-maintainer → Pattern B. Prose on Gemini Flash.
  4. pm-issue-writer → Pattern B. Template work.
  5. security-reviewer → Pattern B wrapped around existing chorus review.
  6. test-generator → Pattern B with test execution loop.
  7. code-generator → Pattern C with Codex backend. Largest, highest-risk.
  8. ci-build-agent → Pattern B (Haiku-only).

At each step:

  • Ship one agent transition as its own PR.
  • Measure 3–5 real-task invocations before rolling to the next agent.
  • Metrics: token cost per invocation, success rate (did the output commit cleanly?), human-correction rate (did we have to fix the output manually?).
  • If the success rate drops > 10 % vs the Sonnet baseline, stop and reassess.

Skills with the same treatment

  • /fix-review — already done.
  • /task, /backlog, /pm-issue-writer — Haiku is enough; no OR delegation needed.
  • /ship — add --cheap flag that swaps its code-generator step to Pattern C.
  • /find-bugs — delegate to chorus debug.
  • /review — already uses parallel agents.

Risks

  1. Haiku context awareness is lower than Sonnet. Prompts must inline the most relevant CLAUDE.md / docs/* section rather than assume the agent "knows the codebase". Mitigation: for each agent that moves, expand the system prompt to include the exact conventions it must adhere to (e.g., stale-time rules, RLS patterns).
  2. OpenRouter model drift. A model string that works today may change default behavior next week (e.g., deepseek-v3.2 swaps back-end). Pin models explicitly and record the version in config.yaml / telemetry.
  3. CLI sandboxing quirks. codex exec needed -s danger-full-access this session (bubblewrap failure on this host). Tests in CI may hit similar issues. Each CLI migration needs a --dry-run (smoke) test.
  4. Free vs paid confusion. Free :free tiers have rate limits that silently 429. Migration MUST use paid openrouter/* model IDs and must NOT mix with CLIs in their free default mode (kilo/kilo-auto/free).
  5. Rollback coupling. If a migrated agent regresses and we revert it to Sonnet, we need to keep the Pattern-A skill / Pattern-B wrapper in place for downstream callers. So: don't delete Sonnet-agent definitions until the hybrid has run cleanly for at least a week.
  6. Log attribution. When OR models do the work, our existing .claude/skills/fix-review/telemetry.jsonl records cost per-PR. We need an equivalent file per migrated subagent, or at least a shared one.
  7. Output parsing tolerance. Cheap models return messy output (no JSON mode guarantees across providers). Orchestrators must tolerate trailing prose, markdown fences around JSON, etc.

Non-goals / out of scope

  • Rewriting /fix-review — it already does this pattern.
  • Moving bug-fixer, migration-generator, project-devops to any cheaper model. Their blast radius dominates their spend.
  • Replacing the orchestrator Claude (top-level) with Haiku. The top-level agent needs to read this doc, understand the whole session, and make decisions — Sonnet stays there.
  • Switching CLI worker choice inside a single PR / session. Pick one backend per migration and stick with it.

Verification/acceptance

The overhaul is successful when:

  1. 8 of 11 subagents run on Haiku + OR delegate by default.
  2. Telemetry shows ≥ 50 % cost reduction on subagent spend over a rolling 7-day window.
  3. Human-correction rate (PRs where a sub-agent-produced change had to be manually fixed before merge) stays within 15 % of the Sonnet baseline.
  4. No regressions in security reviews (security-reviewer catches the same class of findings it did on Sonnet — validated by a manual replay of 3 recent PRs).
  5. docs/ai-agent-cost-optimization.md and each migrated agent's frontmatter reflect the new model + pattern.

References

  • .claude/skills/fix-review/config.yaml — provider/model routing template.
  • .claude/skills/lib/rest.sh — REST helpers.
  • .claude/agents/*.md — current subagent definitions.
  • ~/wrk/projects/chorus/chorus/plugins/chorus/scripts/companion.mjs — multi-CLI review pipeline.
  • OpenRouter pricing: https://openrouter.ai/models (snapshot 2026-04-22).

Top comments (0)