Your Agent Is Burning Money (Here's the Math, and the Fix)
You built an agent. It works. Then you got the API bill.
Let's do the math on a typical production agent before optimization:
Single task call:
- 2,000 token system prompt
- 1,500 token conversation history
- 3,000 token tool schemas (10 tools)
- 500 token user query
- 200 token retrieved context
───────────────────────────────
Total input: 7,200 tokens
Output: 400 tokens
At Claude Sonnet pricing ($3/M input, $15/M output):
Input: 7,200 × $3/M = $0.0216
Output: 400 × $15/M = $0.0060
Per call: ~$0.028
At 1,000 calls/day: $28/day = $840/month
Now here's the same system after optimization:
With caching + compression + routing:
- Cached system + schemas (5,000 tokens at the 90% cache-read discount): $0.0015
- Compressed history: 800 tokens × $3/M = $0.0024
- User query + context: 700 tokens × $3/M = $0.0021
- 60% of tasks routed to Haiku
Blended per call: ~$0.004
At 1,000 calls/day: $4/day = $120/month
Savings: 86% reduction ($720/month saved)
That's not a marginal improvement. That's the difference between a viable product and one that burns cash faster than it earns it. Let's go through how to get there.
Move 1: Enable Prompt Caching (40–70% reduction, 2 lines of code)
The single highest-ROI optimization. Anthropic charges 10% of the normal input price for cache reads, and a 25% premium on cache writes, so caching breaks even by roughly the second request.
Most developers never turn it on because "it looks complicated." It's not.
# Before (no caching)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=my_system_prompt,  # Paid full price every call
    messages=conversation,
    ...
)

# After (with caching)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=[{
        "type": "text",
        "text": my_system_prompt,
        "cache_control": {"type": "ephemeral"}  # ← this
    }],
    messages=conversation,
    ...
)
That's the whole change for basic caching. Two caveats: prompts shorter than the minimum cacheable length (1,024 tokens on Sonnet) won't be cached, and the ephemeral cache expires after 5 minutes without a read. Do it today.
For conversation history, put the cache breakpoint at the end of the stable history:
messages = []

# Cache the conversation history up to the last turn
for msg in history[:-1]:
    messages.append(msg)  # No cache marker on old messages

# Last historical message gets the cache marker
last_hist = history[-1].copy()
last_hist["content"] = [{
    "type": "text",
    "text": last_hist["content"],
    "cache_control": {"type": "ephemeral"}
}]
messages.append(last_hist)

# Current message — never cache (it's always unique)
messages.append({"role": "user", "content": current_message})
Track your cache performance via response.usage:
usage = response.usage
hit_rate = usage.cache_read_input_tokens / (
    usage.input_tokens + usage.cache_read_input_tokens
)
print(f"Cache hit rate: {hit_rate:.0%}")  # Target: > 40%
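Once caching is on, quantify what it earns you. A minimal per-call savings estimate, assuming the usage fields the Anthropic SDK returns (cache_read_input_tokens, cache_creation_input_tokens) and the standard multipliers (reads bill at 10% of base input price, writes at 125%):

```python
def cache_savings(usage, in_price_per_m: float = 3.0) -> float:
    """Estimate USD saved on one call versus running it uncached."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    # What this call actually cost: fresh tokens at base price,
    # cache writes at 1.25x, cache reads at 0.10x
    actual = (usage.input_tokens + 1.25 * write + 0.10 * read) * in_price_per_m / 1e6
    # What it would have cost with no caching at all
    without = (usage.input_tokens + write + read) * in_price_per_m / 1e6
    return without - actual
```

Log this alongside the hit rate; if savings trend toward zero, your cache breakpoints are in the wrong place (or your prefix keeps changing and every call is a write).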
Move 2: Compress Your System Prompt (Cut it by 70%)
A 2,000-token system prompt can almost always be cut to 400 tokens without quality loss. You pay for that gap on every single call, forever.
Rule 1: Kill the preamble
Before (68 tokens):
You are an expert AI assistant specialized in helping users with
complex software engineering tasks. You have deep knowledge of
Python, JavaScript, cloud infrastructure, and modern software
development practices...
After (15 tokens):
Expert software engineering assistant. Python, JS, cloud infra focus.
Rule 2: Compress tool descriptions
Tool descriptions are the hidden cost: 10 tools × 75 tokens of description each = 750 tokens on every call, before you even count parameter schemas.
Before:
{
"name": "search_knowledge_base",
"description": "This tool allows you to search through our knowledge
base which is a collection of documents containing information about
our products and services. You can use this tool to find relevant
information to answer user questions..."
}
After:
{
"name": "search_knowledge_base",
"description": "Semantic search over product/service docs. Use when user asks about features, pricing, or policies."
}
Same performance. 73% fewer tokens.
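Before trimming, measure. A rough character-based estimator (the ~4 chars/token heuristic; this helper is illustrative, not part of any SDK) lets you audit schema weight without an API call:

```python
import json

def estimate_schema_tokens(tools: list[dict]) -> int:
    """Rough token estimate for a tool list (~4 chars per token heuristic)."""
    return sum(len(json.dumps(t)) // 4 for t in tools)

verbose = [{"name": "search_knowledge_base",
            "description": "This tool allows you to search through our knowledge "
                           "base which is a collection of documents containing "
                           "information about our products and services..."}]
tight = [{"name": "search_knowledge_base",
          "description": "Semantic search over product/service docs. Use when user "
                         "asks about features, pricing, or policies."}]
print(estimate_schema_tokens(verbose), estimate_schema_tokens(tight))
```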
Rule 3: Move edge case content to retrieval
If you have 1,000 tokens of edge case handling in your system prompt that's only relevant 5% of the time, move it to a knowledge base:
CORE_PROMPT = "Software engineering assistant. Direct, accurate responses."  # 10 tokens

EDGE_CASE_DOCS = {
    "billing": "... detailed billing edge cases ...",
    "security": "... security escalation protocols ..."
}

def get_system_prompt(query: str) -> str:
    relevant = [v for k, v in EDGE_CASE_DOCS.items() if k in query.lower()]
    if relevant:
        return CORE_PROMPT + "\n\n## Relevant Context\n" + "\n".join(relevant)
    return CORE_PROMPT  # 90% of calls pay for 10 tokens, not 1,000
Move 3: Implement Model Routing
Not all tasks need your best model. A classification call routed to Haiku costs roughly 75% less than the same call on Sonnet. For an agent that makes 10 calls per task, this compounds fast.
Task classification routing tiers:
TIER 1 — Fast (claude-haiku-3-5): $0.0004/call
Use for: classification, extraction, simple Q&A, routing decisions
Rule: confidence > 0.9, single-hop reasoning
TIER 2 — Standard (claude-sonnet-4-5): $0.004/call (10x)
Use for: most production agent tasks, tool use, multi-step reasoning
TIER 3 — Power (claude-opus-4-5): $0.04/call (100x)
Use for: explicit escalation path only
Simple rule-based routing (no LLM needed):
FAST_PATTERNS = ["classify", "extract", "is this", "format this", "yes or no"]
POWER_TRIGGERS = ["legal", "medical", "financial advice", "security vulnerability"]

def route_to_tier(task: str) -> str:
    task_lower = task.lower()
    if any(t in task_lower for t in POWER_TRIGGERS):
        return "claude-opus-4-5"
    if any(p in task_lower for p in FAST_PATTERNS) and len(task.split()) < 20:
        return "claude-haiku-3-5"
    return "claude-sonnet-4-5"  # Standard default
Or use a confidence cascade — try Haiku first, escalate only if confidence is low:
import re

def cascade_completion(task: str, system: str) -> tuple[str, str]:
    # Try fast model with confidence self-report
    fast = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=1024,
        system=system + "\n\nEnd responses with [CONFIDENCE: X.XX]",
        messages=[{"role": "user", "content": task}]
    )
    m = re.search(r'\[CONFIDENCE: ([0-9.]+)\]', fast.content[0].text)
    confidence = float(m.group(1)) if m else 0.5
    if confidence >= 0.85:
        return re.sub(r'\[CONFIDENCE.*?\]', '', fast.content[0].text).strip(), "haiku"
    # Escalate to the standard model
    full = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": task}]
    )
    return full.content[0].text, "sonnet"
Move 4: Bound Your Conversation History
History grows without limit. This is the most common cause of "my agent costs 10x more than expected."
Turn 1: 1,500 tokens
Turn 10: 9,100 tokens (6x initial)
Turn 20: 18,000 tokens (12x initial)
Turn 50: 45,000+ tokens (30x initial)
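The blowup comes from resending the entire history on every call, so cumulative billing grows quadratically with session length. A back-of-envelope model, assuming the ~900 tokens/turn growth implied by the figures above:

```python
def cumulative_history_tokens(turns: int, base: int = 1500, per_turn: int = 900) -> int:
    """Total input tokens billed over a session when history is never trimmed:
    call t resends everything accumulated from the previous t-1 turns."""
    return sum(base + per_turn * t for t in range(turns))

print(cumulative_history_tokens(10))  # tokens billed across a 10-turn session
print(cumulative_history_tokens(50))  # ~21x the 10-turn total, not 5x
```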
Progressive summarization: keep the last 6 turns verbatim, summarize everything older using Haiku:
def maybe_summarize_history(turns: list[dict], summary: str | None) -> tuple[list[dict], str | None]:
    MAX_VERBATIM = 6
    THRESHOLD_TOKENS = 3000

    # Rough token estimate (~3.5 chars per token)
    total_chars = sum(len(t["content"]) for t in turns)
    if total_chars / 3.5 < THRESHOLD_TOKENS:
        return turns, summary  # No action needed

    to_summarize = turns[:-MAX_VERBATIM]
    recent = turns[-MAX_VERBATIM:]
    conv_text = "\n".join(f"{t['role'].upper()}: {t['content']}" for t in to_summarize)
    existing = f"Previous: {summary}\n\n" if summary else ""

    # Use cheap model for summarization
    summary_resp = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=300,
        system="Summarize conversation: decisions made, facts established, key context. Max 250 words.",
        messages=[{"role": "user", "content": f"{existing}{conv_text}"}]
    )
    return recent, summary_resp.content[0].text
Move 5: Cache Your Tool Results
If your agent calls the same external API twice with the same inputs, you paid twice for nothing. Most tool results are stable for at least 5 minutes.
TOOL_TTL = {
    "search_docs": 3600,  # 1 hour: docs don't change
    "get_weather": 900,   # 15 min: volatile
    "search_web": 300,    # 5 min: somewhat volatile
    "send_email": -1,     # Never cache: side effect
    "write_file": -1,     # Never cache: side effect
}
import hashlib
import json
import time

class ToolCache:
    def __init__(self):
        self._cache = {}

    def key(self, name: str, inputs: dict) -> str:
        digest = hashlib.md5(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
        return f"{name}:{digest}"

    def get(self, name: str, inputs: dict):
        ttl = TOOL_TTL.get(name, 600)
        if ttl <= 0:
            return None
        k = self.key(name, inputs)
        if k in self._cache:
            result, expires = self._cache[k]
            if time.time() < expires:
                return result
        return None

    def set(self, name: str, inputs: dict, result):
        ttl = TOOL_TTL.get(name, 600)
        if ttl > 0:
            self._cache[self.key(name, inputs)] = (result, time.time() + ttl)
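If you'd rather wire this per function than at the dispatcher, the same idea works as a decorator. A self-contained sketch (cached_tool and ttl_by_name are illustrative names, mirroring TOOL_TTL above):

```python
import hashlib
import json
import time
from functools import wraps

def cached_tool(name: str, ttl_by_name: dict, default_ttl: int = 600):
    """Memoize a tool function by its kwargs for that tool's TTL.
    A TTL <= 0 disables caching entirely (side-effecting tools)."""
    store: dict[str, tuple[object, float]] = {}

    def deco(fn):
        @wraps(fn)
        def wrapper(**inputs):
            ttl = ttl_by_name.get(name, default_ttl)
            if ttl <= 0:
                return fn(**inputs)  # Never cache side effects
            key = hashlib.md5(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
            hit = store.get(key)
            if hit and time.time() < hit[1]:
                return hit[0]  # Fresh cached result
            result = fn(**inputs)
            store[key] = (result, time.time() + ttl)
            return result
        return wrapper
    return deco
```

Same semantics as the class, just attached at the definition site; pick whichever fits your tool registry.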
Move 6: Use the Batch API for Non-Real-Time Work
For anything that doesn't need an immediate response, the batch API gives you a 50% discount:
# Submit batch job
batch = client.messages.batches.create(requests=[
    {
        "custom_id": f"task_{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 512,
            "system": system,
            "messages": [{"role": "user", "content": task["prompt"]}]
        }
    }
    for i, task in enumerate(tasks)
])

# Poll until done (typically 30-60 min for small batches)
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(30)

# Collect results
results = {
    r.custom_id: r.result.message.content[0].text
    for r in client.messages.batches.results(batch.id)
    if r.result.type == "succeeded"
}
Use batch for: nightly report generation, content moderation queues, bulk document analysis, training data generation. Keep real-time for: interactive chatbots, time-sensitive decisions, anything with a human waiting.
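To sanity-check whether batch is worth the wiring, compare the costs directly. A sketch using the Sonnet rates from the opening example and the 50% batch discount:

```python
def batch_vs_realtime(n_tasks: int, in_tokens: int, out_tokens: int,
                      in_price: float = 3.0, out_price: float = 15.0) -> tuple[float, float]:
    """(realtime_usd, batch_usd) for a workload, assuming a flat 50% batch discount."""
    per_call = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return n_tasks * per_call, n_tasks * per_call * 0.5

rt, b = batch_vs_realtime(1000, 7200, 400)
print(f"realtime ${rt:.2f} vs batch ${b:.2f}")
```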
Move 7: Instrument Everything
You cannot optimize what you cannot see. The first time you see that one poorly-written prompt is responsible for 40% of your monthly bill, this pays for itself immediately.
from dataclasses import dataclass, field
from datetime import date
import threading

@dataclass
class CostObserver:
    daily_budget_usd: float = 10.0
    _daily: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def record(self, model: str, in_tokens: int, out_tokens: int, task: str = "unknown") -> float:
        # USD per million tokens: (input, output)
        PRICING = {
            "claude-haiku-3-5": (0.80, 4.0),
            "claude-sonnet-4-5": (3.0, 15.0),
            "claude-opus-4-5": (15.0, 75.0),
        }
        ip, op = PRICING.get(model, (3.0, 15.0))
        cost = in_tokens * ip / 1e6 + out_tokens * op / 1e6
        today = str(date.today())
        with self._lock:
            self._daily[today] = self._daily.get(today, 0) + cost
            if self._daily[today] / self.daily_budget_usd >= 0.8:
                print(f"⚠️ COST ALERT: {self._daily[today]:.2f} spent today")
        return cost

    @property
    def today_total(self) -> float:
        return self._daily.get(str(date.today()), 0)

# Plug into your agent loop after every call
observer = CostObserver(daily_budget_usd=5.0)
observer.record("claude-sonnet-4-5", response.usage.input_tokens, response.usage.output_tokens, task="customer_support")
The Priority Order (TL;DR)
If you only do six things, do them in this order:
- Enable prompt caching — 40–70% reduction, two lines of code
- Compress system prompt + tool descriptions — 70% token reduction per call, one-time effort
- Bound conversation history with progressive summarization — prevents 30× cost blowup on long sessions
- Add model routing — route classifications and simple tasks to Haiku (75% cheaper)
- Cache tool results — stop re-fetching stable data
- Instrument with daily budget alerts — you can't optimize what you can't see
If you implement all six, the 86% cost reduction from the opening example is realistic — often conservative.
The implementations in this post are excerpted from MAC-014 — Agent Cost Optimization & Token Budget Management Pack. Full pack includes: complete Python implementations, the semantic response cache, async batch processing patterns, per-feature cost attribution decorator, production hardening checklist (35 items), and runbook for cost emergencies. Available at machinamarket.surge.sh for 0.016 ETH (~$33).
Tags: ai, python, machinelearning, productivity, tutorial