LLM per-token prices fell between 9x and 900x over the past year. Yet most teams running agentic AI in production are seeing their API bills go up, not down. Here is exactly why, and the three code-level interventions that cut spend 60-80% without touching quality.
Why Agentic Workloads Break Your Token Budget
A chatbot interaction: 1 LLM call, ~3,000-10,000 tokens. Done.
An agentic task: plan the approach, call a tool, process results, decide next step, call another tool, validate output, loop if needed. That is 10-20 LLM calls, each carrying the growing context window from all previous steps. By step 8, you may be passing 60,000 tokens into every call -- most of it noise.
The math: agentic workflows burn 5-30x more tokens per completed task than a standard chatbot exchange. A 10x price drop combined with a 20x token increase means your bill doubled.
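To make that arithmetic concrete, here is a minimal sketch. The function names and the per-step token figures are illustrative assumptions chosen to match the numbers above, not measurements from a real pipeline.

```python
def bill_multiplier(price_drop: float, token_increase: float) -> float:
    """Relative change in spend when the per-token price falls and token volume rises."""
    return token_increase / price_drop


def cumulative_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Total input tokens billed when every call resends the whole growing context."""
    return sum(base_tokens + i * added_per_step for i in range(steps))


# A 10x price drop with 20x more tokens: spend doubles.
print(bill_multiplier(price_drop=10, token_increase=20))

# Assumed: 4,000 tokens at step 1, each step adding 8,000 more, so step 8 carries 60,000.
print(cumulative_input_tokens(steps=8, base_tokens=4000, added_per_step=8000))
```

The second figure is the one that matters: you pay for the context again on every call, so total input tokens grow roughly with the square of the step count.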
There are three places the money leaks.
Leak 1: Context Bloat -- Fix with Compression
Most agentic pipelines append every step's output to a running context that gets passed to every subsequent LLM call. By step 6, you are paying full price to send the model information from step 1 that is no longer relevant.
Before passing context to any LLM call, compress it:
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def compress_context(conversation_history: list[dict], current_task: str,
                     token_budget: int = 20000) -> list[dict]:
    """
    Compress older turns if context exceeds budget.
    Keeps recent turns intact, summarizes older ones.
    """
    # Rough estimate of ~4 characters per token; good enough for a budget check.
    raw_tokens = sum(len(str(m)) // 4 for m in conversation_history)
    if raw_tokens <= token_budget:
        return conversation_history

    recent = conversation_history[-3:]
    older = conversation_history[:-3]
    if not older:
        return recent

    summary_prompt = f"""Summarize the following conversation history into 2-3 sentences,
keeping only information relevant to: {current_task}
History: {older}"""

    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for summarization
        max_tokens=300,
        messages=[{"role": "user", "content": summary_prompt}]
    ).content[0].text

    # The Messages API does not accept a "system" role inside `messages`,
    # so the summary goes in as a leading user turn.
    compressed = [{"role": "user", "content": f"[Earlier context summary]: {summary}"}]
    compressed.extend(recent)
    return compressed

# Before any LLM call in your agent loop:
context = compress_context(conversation_history, current_task="validate invoice fields")
response = client.messages.create(model="claude-sonnet-4-6", messages=context, max_tokens=1000)
This alone typically reduces context size by 50-70% in long-running agentic workflows.
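It is worth checking that figure against your own workloads instead of taking it on faith. A minimal sketch, reusing the same 4-characters-per-token estimate; `estimate_tokens`, `compression_savings`, and the sample history are hypothetical names and data made up for illustration:

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: about 4 characters per token."""
    return sum(len(str(m)) // 4 for m in messages)


def compression_savings(before: list[dict], after: list[dict]) -> float:
    """Fraction of estimated tokens removed by compression (0.0 to 1.0)."""
    before_tokens = estimate_tokens(before)
    if before_tokens == 0:
        return 0.0
    return 1 - estimate_tokens(after) / before_tokens


# Made-up history: ten verbose turns, compressed to one summary plus the last three.
history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compressed = [{"role": "user", "content": "[Earlier context summary]: ..."}] + history[-3:]

print(f"estimated savings: {compression_savings(history, compressed):.0%}")
```

Logging this ratio per task shows whether the summarization call is paying for itself, since short histories never reach the budget and see no savings at all.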
Leak 2: Frontier Model Overuse -- Fix with Model Routing
Using a frontier model for every step in your pipeline is like hiring a principal engineer to sort your email. Most agent steps -- classification, format conversion, simple lookups, routing decisions -- work fine with a small, fast, cheap model.
from dataclasses import dataclass
from enum import Enum

from anthropic import Anthropic

client = Anthropic()  # same client as in the compression example

class TaskComplexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

@dataclass
class ModelConfig:
    model: str
    cost_per_1k_input: float   # USD per 1,000 input tokens
    cost_per_1k_output: float  # USD per 1,000 output tokens

# Prices are illustrative; check your provider's current pricing before relying on them.
MODEL_TIERS = {
    TaskComplexity.SIMPLE: ModelConfig("claude-haiku-4-5-20251001", 0.00025, 0.00125),
    TaskComplexity.MEDIUM: ModelConfig("claude-sonnet-4-6", 0.003, 0.015),
    TaskComplexity.COMPLEX: ModelConfig("claude-opus-4-6", 0.015, 0.075),
}

def classify_task(task_description: str) -> TaskComplexity:
    # Crude substring matching: a starting point, not a real classifier.
    simple_keywords = ["classify", "categorize", "is this", "format", "convert", "route", "label"]
    complex_keywords = ["analyze", "reason", "debug", "design", "plan", "evaluate", "compare"]
    task_lower = task_description.lower()
    # Check complex keywords first so a task that mentions both tiers
    # escalates to the stronger model instead of being downgraded.
    if any(kw in task_lower for kw in complex_keywords):
        return TaskComplexity.COMPLEX
    elif any(kw in task_lower for kw in simple_keywords):
        return TaskComplexity.SIMPLE
    return TaskComplexity.MEDIUM

def routed_llm_call(task: str, messages: list[dict]) -> tuple[str, float]:
    """Route the call to the tier matching the task; return (text, cost in USD)."""
    complexity = classify_task(task)
    config = MODEL_TIERS[complexity]
    response = client.messages.create(model=config.model, max_tokens=1000, messages=messages)
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    cost = (input_tokens / 1000 * config.cost_per_1k_input +
            output_tokens / 1000 * config.cost_per_1k_output)
    return response.content[0].text, cost
In most production pipelines, 70-80% of steps classify as SIMPLE or MEDIUM. Routing those to cheaper models cuts your average cost per task by 60-70%.
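Here is a rough sketch of where a number like that comes from. The step mix and token counts are assumptions for illustration, and the prices are the same illustrative per-1K figures used in `MODEL_TIERS`; plug in your own measured values.

```python
# (input, output) USD per 1K tokens; illustrative values, verify against current pricing.
PRICE_PER_1K = {
    "simple": (0.00025, 0.00125),
    "medium": (0.003, 0.015),
    "complex": (0.015, 0.075),
}


def cost_per_step(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call on the given tier."""
    price_in, price_out = PRICE_PER_1K[tier]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


def blended_cost(mix: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Expected cost per step when `mix` gives the share of steps on each tier."""
    return sum(share * cost_per_step(tier, input_tokens, output_tokens)
               for tier, share in mix.items())


# Assumed mix: 75% of steps land on SIMPLE or MEDIUM, 25% stay on the frontier model.
mix = {"simple": 0.40, "medium": 0.35, "complex": 0.25}
routed = blended_cost(mix, input_tokens=5000, output_tokens=500)
unrouted = cost_per_step("complex", input_tokens=5000, output_tokens=500)
savings = 1 - routed / unrouted

print(f"routed ${routed:.5f} vs unrouted ${unrouted:.5f} per step ({savings:.0%} saved)")
```

The savings depend almost entirely on how many steps you can move off the top tier, so measure your real mix before quoting a figure.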
Leak 3: Redundant Calls -- Fix with Semantic Caching
Your agentic system is probably making the same LLM calls repeatedly. Different phrasing, same semantic content. Standard caching misses these. Semantic caching embeds the query and retrieves cached results for near-matches.
import hashlib
from datetime import datetime, timedelta, timezone

import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl_hours: int = 24):
        self.cache: list[dict] = []
        self.threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)

    def _embed(self, text: str) -> list[float]:
        # Placeholder: replace with a real embedding model in production.
        # These hash-seeded random vectors only ever match identical text.
        seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
        return np.random.RandomState(seed).randn(1536).tolist()

    def _cosine_similarity(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        """Return the first unexpired cached response similar enough to `query`."""
        query_embedding = self._embed(query)
        now = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated
        for entry in self.cache:
            if now - entry["timestamp"] > self.ttl:
                continue  # expired entry
            if self._cosine_similarity(query_embedding, entry["embedding"]) >= self.threshold:
                return entry["response"]
        return None

    def set(self, query: str, response: str):
        self.cache.append({
            "query": query,
            "embedding": self._embed(query),
            "response": response,
            "timestamp": datetime.now(timezone.utc)
        })
Production deployments with repetitive enterprise workloads typically see 30-50% cache hit rates -- eliminating a third to half your API calls entirely.
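Those hit rates depend on swapping in a real embedding model. The placeholder `_embed` produces unrelated random vectors for any two different strings, so it only hits on exact repeats. To show what near-match scoring looks like without calling an external service, here is a toy sketch using word overlap; `toy_embed` and `cosine` are hypothetical helpers, and a bag-of-words vector captures shared words, not meaning.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


near = cosine(toy_embed("Validate the invoice fields"), toy_embed("validate invoice fields"))
far = cosine(toy_embed("Validate the invoice fields"), toy_embed("summarize shipping delays"))
print(f"near-duplicate: {near:.2f}, unrelated: {far:.2f}")
```

Whatever embedding you use, tune `similarity_threshold` on real traffic: too low and the cache returns answers to questions nobody asked, too high and it never hits.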
Putting It Together: Cost Tracking Per Step
None of this works without measurement. Add per-step cost tracking to your agent loop:
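from dataclasses import dataclass
import time

# Uses SemanticCache, MODEL_TIERS, classify_task and routed_llm_call from the sections above.

@dataclass
class AgentStep:
    name: str
    model: str
    cache_hit: bool
    cost_usd: float
    duration_ms: float

class CostAwareAgentRunner:
    def __init__(self):
        self.steps: list[AgentStep] = []
        self.cache = SemanticCache()

    def run_step(self, name: str, task: str, messages: list[dict]) -> str:
        start = time.perf_counter()
        # Key on the task plus the latest message, so two steps that share a task
        # description but carry different inputs do not collide in the cache.
        cache_key = f"{task}\n{messages[-1]['content']}" if messages else task
        cached = self.cache.get(cache_key)
        if cached is not None:  # an empty string is still a valid cached response
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.steps.append(AgentStep(name, "cache", True, 0.0, elapsed_ms))
            return cached
        response_text, cost = routed_llm_call(task, messages)
        self.cache.set(cache_key, response_text)
        # Record the model that actually served the call, not just its tier.
        model_name = MODEL_TIERS[classify_task(task)].model
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.steps.append(AgentStep(name, model_name, False, cost, elapsed_ms))
        return response_text

    def cost_report(self) -> dict:
        total = sum(s.cost_usd for s in self.steps)
        hits = sum(1 for s in self.steps if s.cache_hit)
        return {
            "total_cost_usd": round(total, 6),
            "steps": len(self.steps),
            "cache_hit_rate": hits / len(self.steps) if self.steps else 0,
            "by_step": [{"name": s.name, "cost": s.cost_usd, "model": s.model} for s in self.steps]
        }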
Once you have this instrumentation, the top three steps by token consumption almost always account for 60-70% of total spend. That tells you exactly where to focus.
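A small helper makes that check routine. This is a sketch that assumes the report shape returned by `cost_report()`; the function name `top_steps_by_cost` and the sample numbers are made up for illustration.

```python
def top_steps_by_cost(report: dict, n: int = 3) -> tuple[list[dict], float]:
    """Return the n most expensive steps and their share of total spend."""
    steps = report["by_step"]
    ranked = sorted(steps, key=lambda s: s["cost"], reverse=True)[:n]
    total = sum(s["cost"] for s in steps)
    share = sum(s["cost"] for s in ranked) / total if total else 0.0
    return ranked, share


# Hypothetical report with made-up costs, in the shape cost_report() produces.
sample_report = {
    "by_step": [
        {"name": "plan", "cost": 0.04, "model": "claude-opus-4-6"},
        {"name": "extract", "cost": 0.002, "model": "claude-haiku-4-5-20251001"},
        {"name": "validate", "cost": 0.015, "model": "claude-sonnet-4-6"},
        {"name": "lookup", "cost": 0.001, "model": "claude-haiku-4-5-20251001"},
        {"name": "draft", "cost": 0.03, "model": "claude-opus-4-6"},
        {"name": "classify", "cost": 0.0005, "model": "claude-haiku-4-5-20251001"},
        {"name": "review", "cost": 0.02, "model": "claude-sonnet-4-6"},
    ]
}

top, share = top_steps_by_cost(sample_report)
print([s["name"] for s in top], f"{share:.0%} of spend")
```

Run it over a day of production reports, not a single task, before deciding which step to optimize first.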
One logistics client went from $40K/month in LLM API costs to under $12K after adding model routing, semantic caching, and context compression. Same volume, same quality. The frontier model actually performed better on complex steps, because it was receiving cleaner, more focused context.
If you are hitting this in production and want a second set of eyes, feel free to DM me -- happy to dig in.