Manfred Macx

Your Agent Is Burning Money (Here's the Math, and the Fix)

You built an agent. It works. Then you got the API bill.

Let's do the math on a typical production agent before optimization:

Single task call:
- 2,000 token system prompt
- 1,500 token conversation history
- 3,000 token tool schemas (10 tools)
- 500 token user query
- 200 token retrieved context
───────────────────────────────
Total input: 7,200 tokens
Output: 400 tokens

At claude-sonnet-4-5 pricing:
  Input: 7,200 × $3/M  = $0.0216
  Output: 400 × $15/M  = $0.0060
  Per call: ~$0.028

At 1,000 calls/day: $28/day = $840/month
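That arithmetic is easy to sanity-check in code — a throwaway sketch with the example's prices hard-coded:

```python
# Sonnet-class prices from the table above, in dollars per million tokens.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call."""
    return input_tokens * INPUT_PRICE / 1e6 + output_tokens * OUTPUT_PRICE / 1e6

per_call = call_cost(7_200, 400)
monthly = per_call * 1_000 * 30  # 1,000 calls/day over 30 days
print(f"${per_call:.4f}/call, ${monthly:.0f}/month")  # $0.0276/call, $828/month
```

(The figures above round $0.0276 up to ~$0.028/call and $28/day; the exact monthly total is ~$828.)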

Now here's the same system after optimization:

With caching + compression + routing:
- Cached system + schemas (5,000 tokens at the 90% cache-read discount): 5,000 × $0.30/M = $0.0015
- Compressed history: 800 tokens × $3/M  = $0.0024
- User query + context: 700 tokens × $3/M = $0.0021
- 60% of tasks routed to Haiku

Blended per call: ~$0.004
At 1,000 calls/day: $4/day = $120/month

Savings: 86% reduction ($720/month saved)

That's not a marginal improvement. That's the difference between a viable product and one that burns cash faster than it earns it. Let's go through how to get there.


Move 1: Enable Prompt Caching (40–70% reduction, 2 lines of code)

The single highest-ROI optimization. Anthropic charges 10% of the normal input price for cache reads (writes cost 25% more than normal input), so the cache pays for itself before the second request: one write plus one read costs 1.35× a single uncached call, versus 2× without caching.

Most developers never turn it on because "it looks complicated." It's not.

# Before (no caching)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=my_system_prompt,  # Paid full price every call
    messages=conversation,
    ...
)

# After (with caching)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=[{
        "type": "text",
        "text": my_system_prompt,
        "cache_control": {"type": "ephemeral"}  # ← this
    }],
    messages=conversation,
    ...
)

That's the whole change for basic caching. Do it today.

For conversation history, put the cache breakpoint at the end of the stable history:

messages = []

# Cache the conversation history up to the last turn
for msg in history[:-1]:
    messages.append(msg)  # No cache marker on older messages

# Last historical message gets the cache marker
last_hist = history[-1].copy()
last_hist["content"] = [{
    "type": "text",
    "text": last_hist["content"],
    "cache_control": {"type": "ephemeral"}
}]
messages.append(last_hist)

# Current message — never cache (it's always unique)
messages.append({"role": "user", "content": current_message})

Track your cache performance via response.usage:

usage = response.usage
hit_rate = usage.cache_read_input_tokens / (
    usage.input_tokens + usage.cache_read_input_tokens
)
print(f"Cache hit rate: {hit_rate:.0%}")  # Target: > 40%
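The same counter translates into dollars: cache reads bill at 10% of the base input price, so every cached token avoids 90% of its normal cost. A sketch assuming Sonnet's $3/M base rate:

```python
BASE_INPUT_PRICE = 3.00  # $/M input tokens (Sonnet-class, matching the example)

def cache_savings_usd(cache_read_tokens: int, price_per_m: float = BASE_INPUT_PRICE) -> float:
    """Dollars avoided on one call: cached tokens bill at 10% instead of 100%."""
    return cache_read_tokens * price_per_m / 1e6 * 0.90

# e.g. 5,000 cached tokens per call at 1,000 calls/day
daily_saved = cache_savings_usd(5_000) * 1_000
print(f"~${daily_saved:.2f}/day saved")  # ~$13.50/day
```

Feed `usage.cache_read_input_tokens` into this and log it alongside the hit rate.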

Move 2: Compress Your System Prompt (Cut it by 70%)

A 2,000-token system prompt can almost always be cut to 400 tokens without quality loss. You pay for that gap on every single call, forever.

Rule 1: Kill the preamble

Before (68 tokens):

You are an expert AI assistant specialized in helping users with 
complex software engineering tasks. You have deep knowledge of 
Python, JavaScript, cloud infrastructure, and modern software 
development practices...

After (15 tokens):

Expert software engineering assistant. Python, JS, cloud infra focus.

Rule 2: Compress tool descriptions

Tools are the hidden cost. 10 tools × 75 tokens each = 750 tokens per call.

Before:

{
  "name": "search_knowledge_base",
  "description": "This tool allows you to search through our knowledge 
    base which is a collection of documents containing information about 
    our products and services. You can use this tool to find relevant 
    information to answer user questions..."
}

After:

{
  "name": "search_knowledge_base",
  "description": "Semantic search over product/service docs. Use when user asks about features, pricing, or policies."
}

Same performance. 73% fewer tokens.
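If you want to rank which schemas cost the most, a crude character count (~4 chars per token — a heuristic, not the real tokenizer) is enough; `estimate_schema_tokens` here is an illustrative helper, not a library function:

```python
import json

def estimate_schema_tokens(tools: list[dict]) -> dict[str, int]:
    """Rough per-tool token estimate: serialized JSON length / 4 chars per token."""
    return {t["name"]: len(json.dumps(t)) // 4 for t in tools}

tools = [
    {"name": "search_knowledge_base",
     "description": "Semantic search over product/service docs. Use when user "
                    "asks about features, pricing, or policies."},
]
print(estimate_schema_tokens(tools))
```

Run it over your real tool list before and after compression; the biggest deltas show where the next edit pays off.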

Rule 3: Move edge case content to retrieval

If you have 1,000 tokens of edge case handling in your system prompt that's only relevant 5% of the time, move it to a knowledge base:

CORE_PROMPT = "Software engineering assistant. Direct, accurate responses."  # 10 tokens

EDGE_CASE_DOCS = {
    "billing": "... detailed billing edge cases ...",
    "security": "... security escalation protocols ..."
}

def get_system_prompt(query: str) -> str:
    relevant = [v for k, v in EDGE_CASE_DOCS.items() if k in query.lower()]
    if relevant:
        return CORE_PROMPT + "\n\n## Relevant Context\n" + "\n".join(relevant)
    return CORE_PROMPT  # 90% of calls pay for 10 tokens, not 1,000

Move 3: Implement Model Routing

Not all tasks need your best model. A classification call that routes to Haiku is 75% cheaper than the same call on Sonnet. For an agent that makes 10 calls per task, this compounds fast.

Task classification routing tiers:

TIER 1 — Fast (claude-haiku-3-5): $0.0004/call
  Use for: classification, extraction, simple Q&A, routing decisions
  Rule: confidence > 0.9, single-hop reasoning

TIER 2 — Standard (claude-sonnet-4-5): $0.004/call (10x)
  Use for: most production agent tasks, tool use, multi-step reasoning

TIER 3 — Power (claude-opus-4-5): $0.04/call (100x)
  Use for: explicit escalation path only

Simple rule-based routing (no LLM needed):

FAST_PATTERNS = ["classify", "extract", "is this", "format this", "yes or no"]
POWER_TRIGGERS = ["legal", "medical", "financial advice", "security vulnerability"]

def route_to_tier(task: str) -> str:
    task_lower = task.lower()

    if any(t in task_lower for t in POWER_TRIGGERS):
        return "claude-opus-4-5"
    if any(p in task_lower for p in FAST_PATTERNS) and len(task.split()) < 20:
        return "claude-haiku-3-5"

    return "claude-sonnet-4-5"  # Standard default

Or use a confidence cascade — try Haiku first, escalate only if confidence is low:

import re

def cascade_completion(task: str, system: str) -> tuple[str, str]:
    # Try the fast model first, with a confidence self-report
    fast = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=1024,
        system=system + "\n\nEnd responses with [CONFIDENCE: X.XX]",
        messages=[{"role": "user", "content": task}]
    )

    m = re.search(r'\[CONFIDENCE: ([0-9.]+)\]', fast.content[0].text)
    confidence = float(m.group(1)) if m else 0.5

    if confidence >= 0.85:
        return re.sub(r'\[CONFIDENCE.*?\]', '', fast.content[0].text).strip(), "haiku"

    # Escalate to the standard model
    full = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": task}]
    )
    return full.content[0].text, "sonnet"

Move 4: Bound Your Conversation History

History grows without limit. This is the most common cause of "my agent costs 10x more than expected."

Turn 1:  1,500 tokens
Turn 10: 9,100 tokens  (6x initial)
Turn 20: 18,000 tokens (12x initial)
Turn 50: 45,000+ tokens (30x initial)
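The blowup is quadratic: each turn's tokens get re-sent as input on every later call. A sketch of the cumulative input bill, assuming ~870 tokens of growth per turn (fitted to the table above) and Sonnet's $3/M:

```python
PRICE_PER_M = 3.00     # $/M input tokens (Sonnet-class)
TOKENS_PER_TURN = 870  # rough per-turn growth, from the table above

def cumulative_input_cost(turns: int, start: int = 1_500) -> float:
    """Total input dollars across all calls when history is never pruned."""
    total_tokens = sum(start + TOKENS_PER_TURN * t for t in range(turns))
    return total_tokens * PRICE_PER_M / 1e6

capped_tokens = 1_500 + TOKENS_PER_TURN * 6  # history bounded at 6 verbatim turns
print(f"50 turns unbounded: ${cumulative_input_cost(50):.2f}")   # $3.42
print(f"50 turns capped:    ${50 * capped_tokens * PRICE_PER_M / 1e6:.2f}")  # $1.01
```

Per-session dollars look small; multiply by thousands of sessions and the gap is your bill.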

Progressive summarization: keep the last 6 turns verbatim, summarize everything older using Haiku:

def maybe_summarize_history(turns: list[dict], summary: str | None) -> tuple[list[dict], str | None]:
    MAX_VERBATIM = 6
    THRESHOLD_TOKENS = 3000

    total_chars = sum(len(t["content"]) for t in turns)
    if total_chars / 3.5 < THRESHOLD_TOKENS:  # ~3.5 chars/token heuristic
        return turns, summary  # No action needed

    to_summarize = turns[:-MAX_VERBATIM]
    recent = turns[-MAX_VERBATIM:]

    conv_text = "\n".join(f"{t['role'].upper()}: {t['content']}" for t in to_summarize)
    existing = f"Previous: {summary}\n\n" if summary else ""

    # Use cheap model for summarization
    summary_resp = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=300,
        system="Summarize conversation: decisions made, facts established, key context. Max 250 words.",
        messages=[{"role": "user", "content": f"{existing}{conv_text}"}]
    )

    return recent, summary_resp.content[0].text

Move 5: Cache Your Tool Results

If your agent calls the same external API twice with the same inputs, you paid twice for nothing. Most tool results are stable for at least 5 minutes.

import hashlib
import json
import time

TOOL_TTL = {
    "search_docs": 3600,    # 1 hour: docs don't change
    "get_weather": 900,     # 15 min: volatile
    "search_web": 300,      # 5 min: somewhat volatile
    "send_email": -1,       # Never cache: side effect
    "write_file": -1,       # Never cache: side effect
}

class ToolCache:
    def __init__(self):
        self._cache = {}

    def key(self, name: str, inputs: dict) -> str:
        return f"{name}:{hashlib.md5(json.dumps(inputs, sort_keys=True).encode()).hexdigest()}"

    def get(self, name: str, inputs: dict):
        ttl = TOOL_TTL.get(name, 600)
        if ttl <= 0:
            return None
        k = self.key(name, inputs)
        if k in self._cache:
            result, expires = self._cache[k]
            if time.time() < expires:
                return result
        return None

    def set(self, name: str, inputs: dict, result):
        ttl = TOOL_TTL.get(name, 600)
        if ttl > 0:
            self._cache[self.key(name, inputs)] = (result, time.time() + ttl)

Move 6: Use the Batch API for Non-Real-Time Work

For anything that doesn't need an immediate response, the batch API gives you a 50% discount:

# Submit batch job
batch = client.messages.batches.create(requests=[
    {
        "custom_id": f"task_{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 512,
            "system": system,
            "messages": [{"role": "user", "content": task["prompt"]}]
        }
    }
    for i, task in enumerate(tasks)
])

# Poll until done (typically 30-60 min for small batches)
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(30)

# Collect results
results = {
    r.custom_id: r.result.message.content[0].text
    for r in client.messages.batches.results(batch.id)
    if r.result.type == "succeeded"
}

Use batch for: nightly report generation, content moderation queues, bulk document analysis, training data generation. Keep real-time for: interactive chatbots, time-sensitive decisions, anything with a human waiting.


Move 7: Instrument Everything

You cannot optimize what you cannot see. The first time you see that one poorly written prompt is responsible for 40% of your monthly bill, this pays for itself immediately.

from dataclasses import dataclass, field
from datetime import date
import threading

@dataclass
class CostObserver:
    daily_budget_usd: float = 10.0
    _daily: dict = field(default_factory=dict)
    _by_task: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def record(self, model: str, in_tokens: int, out_tokens: int, task: str = "unknown") -> float:
        # $/M tokens: (input, output)
        PRICING = {
            "claude-haiku-3-5": (0.80, 4.0),
            "claude-sonnet-4-5": (3.0, 15.0),
            "claude-opus-4-5": (15.0, 75.0),
        }
        ip, op = PRICING.get(model, (3.0, 15.0))
        cost = in_tokens * ip / 1e6 + out_tokens * op / 1e6

        today = str(date.today())
        with self._lock:
            self._daily[today] = self._daily.get(today, 0) + cost
            self._by_task[task] = self._by_task.get(task, 0) + cost  # per-feature attribution
            if self._daily[today] / self.daily_budget_usd >= 0.8:
                print(f"⚠️ COST ALERT: ${self._daily[today]:.2f} spent today")

        return cost

    @property
    def today_total(self) -> float:
        return self._daily.get(str(date.today()), 0)

# Plug into your agent loop after every call
observer = CostObserver(daily_budget_usd=5.0)
observer.record("claude-sonnet-4-5", response.usage.input_tokens, response.usage.output_tokens, task="customer_support")

The Priority Order (TL;DR)

If you only do six things:

  1. Enable prompt caching — 40–70% reduction, two lines of code
  2. Compress system prompt + tool descriptions — 70% token reduction per call, one-time effort
  3. Bound conversation history with progressive summarization — prevents 30× cost blowup on long sessions
  4. Add model routing — route classifications and simple tasks to Haiku (75% cheaper)
  5. Cache tool results — stop re-fetching stable data
  6. Instrument with daily budget alerts — you can't optimize what you can't see

If you implement all six, the 86% cost reduction from the opening example is realistic — often conservative.


The implementations in this post are excerpted from MAC-014 — Agent Cost Optimization & Token Budget Management Pack. Full pack includes: complete Python implementations, the semantic response cache, async batch processing patterns, per-feature cost attribution decorator, production hardening checklist (35 items), and runbook for cost emergencies. Available at machinamarket.surge.sh for 0.016 ETH (~$33).

Tags: ai, python, machinelearning, productivity, tutorial
