Patrick Hughes

Originally published at bmdpat.com

Anthropic's Advisor Tool Is the Cost-Split Pattern You Should Already Be Running

Anthropic just shipped the Advisor tool on the Claude Platform API. The idea: a cheap executor model (Haiku or Sonnet) runs the agent loop. It escalates to Opus only when it hits something hard.

Their claim: Haiku with an Opus advisor more than doubled its standalone benchmark score while costing significantly less than running Sonnet continuously.

This is not new. This is a cost-split pattern that production teams have been running manually for a year. Anthropic just productized it as a single API flag.

The pattern is the durable part. The API is not.

The Pattern

Most agent workloads follow a power law. 80-90% of calls are routine: parsing, formatting, simple classification, tool dispatch. 10-20% are genuinely hard: ambiguous edge cases, multi-step reasoning, novel situations.

Running Opus (or GPT-4o, or any frontier model) on the routine 80-90% is like hiring a senior engineer to rename variables. Expensive. Wasteful. And the output is identical to what a smaller model produces.

The cost-split pattern:

  1. Primary model: cheap, fast, handles the routine work. Haiku, GPT-4o-mini, Llama 3.1 8B on local hardware.
  2. Advisor model: expensive, powerful, called only when needed. Opus, GPT-4o, Claude Sonnet 4.
  3. Escalation signal: the trigger that routes a call from primary to advisor.

The signal can be anything: confidence score below a threshold, token count above a limit, tool call depth exceeding a boundary, or a user-defined rule.
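
Here is a minimal sketch of that routing layer. Everything in it is an assumption: call_primary and call_advisor are hypothetical wrappers around your two providers, and the confidence score is something not every API returns, so treat it as a placeholder for whatever signal you pick.

# Confidence-based escalation router (sketch). The helpers are
# hypothetical -- wire them to your own providers.

CONFIDENCE_FLOOR = 0.7  # escalate below this; tune per workload

def call_primary(prompt: str) -> tuple[str, float]:
    """Cheap model (Haiku, GPT-4o-mini, local Llama). Returns (answer, confidence)."""
    raise NotImplementedError  # your provider call here

def call_advisor(prompt: str, draft: str) -> str:
    """Expensive model (Opus, GPT-4o). Gets the primary's draft as context."""
    raise NotImplementedError  # your provider call here

def route(prompt: str) -> str:
    answer, confidence = call_primary(prompt)
    if confidence >= CONFIDENCE_FLOOR:
        return answer                    # routine case: cheap model handled it
    return call_advisor(prompt, answer)  # hard case: escalate with the draft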

The Math

Let's run real numbers.

Scenario: 1,000 agent calls per day.

All-Opus pricing (input $15/M tokens, output $75/M tokens, ~1K input + ~1K output tokens per call):

  • Daily cost: ~$90
  • Monthly: ~$2,700

Cost-split with 85% Haiku ($0.25/$1.25 per M tokens) and 15% Opus:

  • Haiku: 850 calls x ~$0.0015 = $1.28/day
  • Opus: 150 calls x ~$0.09 = $13.50/day
  • Daily total: ~$14.78
  • Monthly: ~$443

Savings: 84%. And that is before you factor in the option of running the primary model locally.

With Llama 3.1 8B on an RTX 5070 Ti (our actual setup), the primary model cost is zero. The only cost is the 15% that escalates to the API. Monthly cost drops to ~$405.

The crossover point: cost-split wins as long as your escalation rate stays below ~60%. The crossover sits well under 100% because an escalated call typically pays twice: the wasted primary attempt plus the extra context you ship to the advisor. If more than ~60% of calls need the frontier model, you are better off running Opus on everything. But 60% is a high bar. Most production workloads escalate 10-20%.
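
If you want to sanity-check these numbers against your own traffic, the arithmetic fits in a few lines. The token counts and prices below are the assumptions from this section:

# Reproduces this section's numbers. Assumptions: 1,000 calls/day,
# ~1K input + ~1K output tokens per call.

CALLS_PER_DAY = 1_000
TOKENS_IN, TOKENS_OUT = 1_000, 1_000

def cost_per_call(price_in_per_m: float, price_out_per_m: float) -> float:
    return (TOKENS_IN * price_in_per_m + TOKENS_OUT * price_out_per_m) / 1_000_000

OPUS = cost_per_call(15.00, 75.00)   # ~$0.09 per call
HAIKU = cost_per_call(0.25, 1.25)    # ~$0.0015 per call

def daily_cost(escalation_rate: float) -> float:
    escalated = CALLS_PER_DAY * escalation_rate
    return (CALLS_PER_DAY - escalated) * HAIKU + escalated * OPUS

print(f"all-Opus:    ${CALLS_PER_DAY * OPUS:,.2f}/day")                     # ~$90/day
print(f"85/15 split: ${daily_cost(0.15):,.2f}/day")                         # ~$14.78/day
print(f"savings:     {1 - daily_cost(0.15) / (CALLS_PER_DAY * OPUS):.0%}")  # ~84%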

Why This Matters for AgentGuard

AgentGuard's budget guards currently fire after spend. BudgetGuard tracks accumulated cost and raises BudgetExceeded when the limit is hit. That is a reactive safety net.

The advisor pattern is proactive. It prevents the spend from happening in the first place by routing cheap work to cheap models.

We are shipping a BudgetAwareEscalation guard (PR #331 on the AgentGuard repo) that combines both. The reactive half already works today:

from agentguard import Tracer, BudgetGuard

# Reactive guard: a hard dollar ceiling with an early warning.
tracer = Tracer(guards=[
    BudgetGuard(
        max_cost_usd=50.00,   # raises BudgetExceeded past $50
        warn_at_pct=0.8,      # warns at 80% of the ceiling ($40)
    ),
])

The reactive guard (BudgetGuard) is already shipping. The proactive pattern (escalation routing) is next.
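
Until that lands, you can approximate the combination yourself. Here is a hypothetical sketch (none of these names are shipped AgentGuard APIs): escalate only while the advisor budget has headroom.

# Hypothetical budget-aware escalation -- NOT the shipped AgentGuard API.
# Escalate only while advisor spend has headroom under the ceiling.

class BudgetAwareEscalation:
    def __init__(self, max_advisor_usd: float, advisor_cost_per_call: float):
        self.max_advisor_usd = max_advisor_usd
        self.advisor_cost_per_call = advisor_cost_per_call
        self.spent = 0.0

    def may_escalate(self) -> bool:
        return self.spent + self.advisor_cost_per_call <= self.max_advisor_usd

    def record_escalation(self) -> None:
        self.spent += self.advisor_cost_per_call

guard = BudgetAwareEscalation(max_advisor_usd=15.00, advisor_cost_per_call=0.09)
# In the router: escalate only if guard.may_escalate(); otherwise return
# the primary's best effort and log the degraded answer.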

The key difference from Anthropic's implementation: AgentGuard works on any provider. Anthropic's advisor tool only works with Claude models. Our version works with OpenAI, local Llama, Mistral, or any two-model combination.

The Playbook

If you are running AI agents today and paying for frontier model calls, here is the playbook:

Step 1: Measure. What percentage of your calls actually need frontier-model quality? Log a week of calls with their complexity. Most teams are surprised at how routine the majority of work is.
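
Measurement can be as cheap as one JSON line per call. A minimal sketch; the field names are placeholders for your own schema:

# One JSON line per call -- field names are placeholders.
import json
import time

def log_call(model: str, tokens_in: int, tokens_out: int,
             needed_frontier: bool, path: str = "calls.jsonl") -> None:
    record = {
        "ts": time.time(),
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "needed_frontier": needed_frontier,  # your judgment, or a scoring rubric
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")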

Step 2: Set a budget ceiling. Use AgentGuard to add a hard dollar limit. This is your safety net while you experiment.

pip install agentguard47

from agentguard import Tracer, BudgetGuard

tracer = Tracer(guards=[
    BudgetGuard(max_cost_usd=10.00, warn_at_pct=0.8),
])

Step 3: Route cheap work to a cheap model. Start with the obvious cases: formatting, parsing, simple classification. Use Haiku, GPT-4o-mini, or a local model.

Step 4: Define your escalation signal. Start simple: escalate when the primary model returns low confidence or when the task involves multi-step reasoning. Refine over time.
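
If your provider does not expose a confidence score, crude heuristics are a workable starting signal. A sketch; every marker and threshold here is an assumption to tune:

# Crude starting heuristics -- every threshold is an assumption to tune.

HEDGE_MARKERS = ("i'm not sure", "i am not sure", "it depends", "cannot determine")

def should_escalate(task: str, primary_answer: str) -> bool:
    text = primary_answer.lower()
    if any(marker in text for marker in HEDGE_MARKERS):
        return True   # the primary hedged
    if len(primary_answer.strip()) < 20:
        return True   # suspiciously thin answer for a real task
    if sum(task.lower().count(w) for w in ("then", "step", "first")) >= 3:
        return True   # task smells like multi-step reasoning
    return False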

Step 5: Measure again. Compare cost, quality, and latency. The pattern should reduce cost 50-85% with minimal quality degradation on routine work.

The Honest Take

Anthropic shipping this as a first-class API feature validates the thesis. Runtime cost control for AI agents is not a nice-to-have. It is table stakes.

But vendor-specific implementations lock you in. If you use Anthropic's advisor tool, you can only use Claude models. If Anthropic changes pricing or deprecates the feature, your cost-split strategy breaks.

The portable version of this pattern is a guard that runs in your code, works with any provider, and gives you the escalation routing without the lock-in.

That is what we are building.

Try AgentGuard. Zero dependencies. MIT license. Budget guards, loop detection, and retry protection for any AI agent.
