Valentyn Solomko

Posted on Apr 27

AI Agent Cost Optimization — Sonnet Hybrid (Haiku + OpenRouter)

#ai #architecture #agents #claude

Executive summary

Our subagents (.claude/agents/*.md) all run on Sonnet 4.6 by default at $3 / $15 per 1M (input/output). That's the single biggest line on our AI spend. This doc lays out a hybrid architecture where most subagents run on Haiku 4.5 ($1 / $5) and offload heavy generation/analysis to OpenRouter (paid key, already in .env.local) via the existing /fix-review REST plumbing. A handful of high-risk agents stay on Sonnet.

Expected savings at full rollout: ≈ 70–85 % of the current subagent spend, with equivalent quality for analytical tasks and gated quality for generational tasks.

Context

What exists today:

11 subagents declared in .claude/agents/*.md, all on Sonnet by default.
/fix-review skill already delegates 3 model rounds to OpenRouter (DeepSeek V3.2, Qwen3 Coder Next, Grok 4.1 Fast) and only calls Sonnet for the Arbiter round. That's the template.
.claude/skills/lib/rest.sh exposes chat_payload + rest_post — provider-agnostic REST helpers, already wired for OpenRouter.
CLI agents installed: opencode, cursor-agent, kilo, codex. Each supports headless invocation and --model selection. We used codex exec for the 5 Tenancy v3 ship PRs while Claude's sub-agent quota was exhausted — that confirmed the hybrid pattern works.
OPENROUTER_API_KEY is paid — no free-tier rate-limit anxiety.

What's wasteful today:

Subagents like code-simplifier, docs-maintainer, pm-issue-writer do small, structured prose tasks on Sonnet. A Gemini Flash Lite at $0.037 / $0.15 would be ~100× cheaper and equally good for that workload.
code-generator spends Sonnet tokens on file IO loops that an OpenRouter-backed CLI (Codex / Opencode) can handle for ~10× less.

Provider/model inventory (verified 2026-04-22)

OpenRouter (paid, direct REST via `lib/rest.sh`)

Model	$/1M in	$/1M out	Best for
`deepseek/deepseek-v3.2`	0.26	0.38	Code generation, refactor, security review
`qwen/qwen3-coder-next`	0.15	0.80	TS-heavy code, test generation
`x-ai/grok-4.1-fast`	0.20	0.50	Fast parallel review vote
`google/gemini-2.5-flash`	0.075	0.30	Docs, ADRs, prose
`google/gemini-2.5-flash-lite`	0.037	0.15	Lint classification, small structured calls

Baseline comparison:

Claude Sonnet 4.6: $3 / $15
Claude Haiku 4.5: $1 / $5

CLIs (useful when we need a built-in tool-use loop with file edits)

CLI	Headless command	Cheap / free options	Notes
`opencode`	`opencode run --model X "prompt"`	`opencode/gpt-5-nano`, `opencode/-free`, or any `google/`	Good for multi-file tasks
`kilo`	`kilo run --model X "message"`	`kilo/kilo-auto/free`, `kilo/openrouter/free`, `kilo/x-ai/grok-code-fast-1:optimized:free`	Thin over OpenRouter
`cursor-agent`	`cursor-agent --model X -p "prompt"`	`composer-2-fast` (Cursor Pro), `gpt-5.3-codex-*`	Good for large-scale rewrites
`codex`	`codex exec -C <dir> -s danger-full-access "prompt"`	OpenAI `gpt-5.4` default	Needs `-s danger-full-access` — bubblewrap sandbox unreliable on this host

Route CLIs through --model openrouter/<id> when we want them to hit the same OR quota.

Target architecture

Three subagent classes, three patterns:

Pattern A — Skill replaces subagent (direct OR REST)

For pure analytical subagents whose input is a small artifact (diff, lint output, hook source) and whose output is a structured text blob (report, suggestion list, simplified file).

parent Claude → Skill (thin bash) → rest_post to OpenRouter → deterministic output

No agentic loop, no Sonnet at all. Example: code-simplifier → /simplify skill.

Pattern B — Haiku orchestrator + direct OR REST

For subagents that need a validation gate: generate output with a cheap model, then have a small, cheap reasoner verify the shape, run tests / decide the next step.

Haiku orchestrator subagent
  → OR REST (e.g. qwen3-coder-next) to generate candidate
  → Haiku validates shape + runs verifier
  → success: commit; failure: retry with better prompt

Applies to test-generator, security-reviewer (Haiku + chorus review), static-analysis.

Pattern C — Haiku orchestrator + CLI worker (tool-use loop)

When the task requires real file IO and iterative tool use (multi-file edits, builds, tests in a loop), delegate to a CLI that has a built-in tool-use loop.

Haiku orchestrator subagent
  → codex exec / opencode run (routed through openrouter/deepseek-v3.2)
  → Haiku reviews diff + lint + test output before commit

Applies to code-generator.

Pattern D — Keep Sonnet

For subagents where project-specific conventions in CLAUDE.md matter more than cost, and where mistakes are high-blast-radius.

Applies to bug-fixer, migration-generator, project-devops.

Per-agent decision matrix

Agent	Pattern	Primary model	Validation gate
`code-generator`	C	`openrouter/deepseek/deepseek-v3.2` via `codex exec`	Haiku 4.5 reviews diff; runs lint + tests
`test-generator`	B	`openrouter/qwen/qwen3-coder-next`	Haiku 4.5 checks shape; runs `npm run test`
`code-simplifier`	A (skill)	`openrouter/deepseek/deepseek-v3.2`	None — output applied directly
`security-reviewer`	B	`chorus review` (3 OR models parallel)	Haiku 4.5 synthesizes findings
`static-analysis`	B	`openrouter/google/gemini-2.5-flash-lite`	Haiku 4.5 classifies safe vs unsafe fixes
`docs-maintainer`	B	`openrouter/google/gemini-2.5-flash`	Haiku 4.5 lints prose against CLAUDE.md style
`pm-issue-writer`	B	`openrouter/google/gemini-2.5-flash`	Haiku 4.5 checks OB Base compliance
`ci-build-agent`	B (Haiku-only, no OR)	Haiku 4.5	Small context; no delegation needed
`bug-fixer`	D	Sonnet 4.6	Iterative debugging needs full context
`migration-generator`	D	Sonnet 4.6	RLS / immutability conventions are subtle
`project-devops`	D	Sonnet 4.6	SSH / TLS / DB — prod-blast-radius

Migration order

code-simplifier → Pattern A skill. Smallest scope. Baseline measurement.
static-analysis → Pattern B. Similar surface.
docs-maintainer → Pattern B. Prose on Gemini Flash.
pm-issue-writer → Pattern B. Template work.
security-reviewer → Pattern B wrapped around existing chorus review.
test-generator → Pattern B with test execution loop.
code-generator → Pattern C with Codex backend. Largest, highest-risk.
ci-build-agent → Pattern B (Haiku-only).

At each step:

Ship one agent transition as its own PR.
Measure 3–5 real-task invocations before rolling to the next agent.
Metrics: token cost per invocation, success rate (did the output commit cleanly?), human-correction rate (did we have to fix the output manually?).
If the success rate drops > 10 % vs the Sonnet baseline, stop and reassess.

Skills with the same treatment

/fix-review — already done.
/task, /backlog, /pm-issue-writer — Haiku is enough; no OR delegation needed.
/ship — add --cheap flag that swaps its code-generator step to Pattern C.
/find-bugs — delegate to chorus debug.
/review — already uses parallel agents.

Risks

Haiku context awareness is lower than Sonnet. Prompts must inline the most relevant CLAUDE.md / docs/* section rather than assume the agent "knows the codebase". Mitigation: for each agent that moves, expand the system prompt to include the exact conventions it must adhere to (e.g., stale-time rules, RLS patterns).
OpenRouter model drift. A model string that works today may change default behavior next week (e.g., deepseek-v3.2 swaps back-end). Pin models explicitly and record the version in config.yaml / telemetry.
CLI sandboxing quirks. codex exec needed -s danger-full-access this session (bubblewrap failure on this host). Tests in CI may hit similar issues. Each CLI migration needs a --dry-run (smoke) test.
Free vs paid confusion. Free :free tiers have rate limits that silently 429. Migration MUST use paid openrouter/* model IDs and must NOT mix with CLIs in their free default mode (kilo/kilo-auto/free).
Rollback coupling. If a migrated agent regresses and we revert it to Sonnet, we need to keep the Pattern-A skill / Pattern-B wrapper in place for downstream callers. So: don't delete Sonnet-agent definitions until the hybrid has run cleanly for at least a week.
Log attribution. When OR models do the work, our existing .claude/skills/fix-review/telemetry.jsonl records cost per-PR. We need an equivalent file per migrated subagent, or at least a shared one.
Output parsing tolerance. Cheap models return messy output (no JSON mode guarantees across providers). Orchestrators must tolerate trailing prose, markdown fences around JSON, etc.

Non-goals / out of scope

Rewriting /fix-review — it already does this pattern.
Moving bug-fixer, migration-generator, project-devops to any cheaper model. Their blast radius dominates their spend.
Replacing the orchestrator Claude (top-level) with Haiku. The top-level agent needs to read this doc, understand the whole session, and make decisions — Sonnet stays there.
Switching CLI worker choice inside a single PR / session. Pick one backend per migration and stick with it.

Verification/acceptance

The overhaul is successful when:

8 of 11 subagents run on Haiku + OR delegate by default.
Telemetry shows ≥ 50 % cost reduction on subagent spend over a rolling 7-day window.
Human-correction rate (PRs where a sub-agent-produced change had to be manually fixed before merge) stays within 15 % of the Sonnet baseline.
No regressions in security reviews (security-reviewer catches the same class of findings it did on Sonnet — validated by a manual replay of 3 recent PRs).
docs/ai-agent-cost-optimization.md and each migrated agent's frontmatter reflect the new model + pattern.

References

.claude/skills/fix-review/config.yaml — provider/model routing template.
.claude/skills/lib/rest.sh — REST helpers.
.claude/agents/*.md — current subagent definitions.
~/wrk/projects/chorus/chorus/plugins/chorus/scripts/companion.mjs — multi-CLI review pipeline.
OpenRouter pricing: https://openrouter.ai/models (snapshot 2026-04-22).

DEV Community

AI Agent Cost Optimization — Sonnet Hybrid (Haiku + OpenRouter)

Executive summary

Context

Provider/model inventory (verified 2026-04-22)

OpenRouter (paid, direct REST via `lib/rest.sh`)

CLIs (useful when we need a built-in tool-use loop with file edits)

Target architecture

Pattern A — Skill replaces subagent (direct OR REST)

Pattern B — Haiku orchestrator + direct OR REST

Pattern C — Haiku orchestrator + CLI worker (tool-use loop)

Pattern D — Keep Sonnet

Per-agent decision matrix

Migration order

Skills with the same treatment

Risks

Non-goals / out of scope

Verification/acceptance

References

Top comments (0)

Executive summary

Context

Provider/model inventory (verified 2026-04-22)

OpenRouter (paid, direct REST via lib/rest.sh)

CLIs (useful when we need a built-in tool-use loop with file edits)

Target architecture

Pattern A — Skill replaces subagent (direct OR REST)

Pattern B — Haiku orchestrator + direct OR REST

Pattern C — Haiku orchestrator + CLI worker (tool-use loop)

Pattern D — Keep Sonnet

Per-agent decision matrix

Migration order

Skills with the same treatment

Risks

Non-goals / out of scope

Verification/acceptance

References

OpenRouter (paid, direct REST via `lib/rest.sh`)