Every month I'd open our cloud billing dashboard and wince. Running AI agents in production at RapidClaw meant our token costs were climbing faster than our revenue. Sound familiar?
After three months of aggressive optimization, we cut our monthly token spend by 73% while actually improving agent response quality. Here's exactly how we did it — no vague advice, just the specific techniques that moved the needle.
## The Problem: Death by a Thousand Tokens
When you're running AI agents that handle real workloads — deployment automation, infrastructure monitoring, code review — every unnecessary token adds up. Our agents were processing ~2M tokens per day across various tasks. At GPT-4-class pricing, that's not pocket change.
The root causes were predictable once we actually measured:
- Bloated system prompts copy-pasted across agents (avg 2,400 tokens each)
- No caching layer — identical queries hitting the LLM every time
- Redundant context stuffed into every request "just in case"
- Wrong model for the job — using frontier models for classification tasks
## Strategy 1: Prompt Compression (Saved ~30%)
The biggest win was the simplest. We audited every system prompt and applied aggressive compression.
```python
# BEFORE: 847 tokens
SYSTEM_PROMPT_BEFORE = """
You are a helpful deployment assistant for our cloud infrastructure.
You should help users deploy their applications to our Kubernetes cluster.
You have access to kubectl commands and can help troubleshoot issues.
When a user asks you to deploy something, you should first check if
the namespace exists, then validate the manifest, then apply it.
You should always be polite and professional in your responses.
You should explain what you're doing at each step.
If something goes wrong, provide clear error messages and suggestions.
Always confirm before making destructive changes.
Remember to check resource limits and quotas before deploying.
"""

# AFTER: 196 tokens
SYSTEM_PROMPT_AFTER = """
Role: K8s deployment agent.
Tools: kubectl
Flow: check namespace → validate manifest → apply
Rules: confirm destructive ops, check resource quotas, explain steps
"""
```
Same behavior, 77% fewer tokens. The key insight: LLMs don't need the verbose instructions we think they do. They need structured, precise constraints.
We built a simple compression pipeline:
```python
import tiktoken

# Plug in your own traffic and pricing numbers here
CALLS_PER_DAY = 1_000       # illustrative
COST_PER_TOKEN = 0.00003    # illustrative per-input-token price

def audit_prompt(prompt: str, model: str = "gpt-4") -> dict:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(prompt)
    # Flag prompts over 500 tokens for review
    return {
        "token_count": len(tokens),
        "needs_review": len(tokens) > 500,
        "estimated_daily_cost": len(tokens) * CALLS_PER_DAY * COST_PER_TOKEN,
    }

# Run this on every agent prompt quarterly
for agent in get_all_agents():
    report = audit_prompt(agent.system_prompt)
    if report["needs_review"]:
        print(f"⚠️ {agent.name}: {report['token_count']} tokens "
              f"(${report['estimated_daily_cost']:.2f}/day)")
```
## Strategy 2: Semantic Caching (Saved ~25%)
This was the highest-ROI engineering investment. We added a semantic similarity cache in front of our LLM calls.
```python
import hashlib

import numpy as np
from redis import Redis

class SemanticCache:
    def __init__(self, redis_url: str, similarity_threshold: float = 0.95):
        self.redis = Redis.from_url(redis_url)
        self.threshold = similarity_threshold

    def get_embedding(self, text: str) -> np.ndarray:
        """Use a cheap embedding model — not the expensive LLM."""
        # text-embedding-3-small costs ~$0.02/1M tokens
        return embed_model.encode(text).astype(np.float32)

    def lookup(self, query: str) -> str | None:
        query_emb = self.get_embedding(query)
        # Check against recent cached queries
        for key in self.redis.scan_iter("cache:emb:*"):
            # Decode with the same dtype used in store(), or the
            # similarity math silently breaks
            cached_emb = np.frombuffer(self.redis.get(key), dtype=np.float32)
            similarity = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
            )
            if similarity >= self.threshold:
                response_key = key.decode().replace("emb:", "resp:")
                response = self.redis.get(response_key)
                return response.decode() if response else None
        return None

    def store(self, query: str, response: str, ttl: int = 3600):
        key_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
        emb = self.get_embedding(query)
        self.redis.setex(f"cache:emb:{key_hash}", ttl, emb.tobytes())
        self.redis.setex(f"cache:resp:{key_hash}", ttl, response)
```
The 0.95 similarity threshold was critical. Too low and you get stale/wrong cached responses. Too high and your cache hit rate tanks. We tuned this per agent type — deployment agents got 0.97 (precision matters), monitoring summarizers got 0.92 (more tolerance for variation).
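The per-agent tuning can live in a small lookup table. A minimal sketch (the threshold values are the ones from our tuning; the agent-type keys are illustrative):

```python
# Per-agent-type similarity thresholds. Higher = stricter matching,
# fewer but safer cache hits.
AGENT_THRESHOLDS = {
    "deployment": 0.97,  # precision matters: a wrong cached answer is costly
    "monitoring": 0.92,  # summaries tolerate more variation
    "default": 0.95,
}

def threshold_for(agent_type: str) -> float:
    """Return the similarity threshold for an agent type."""
    return AGENT_THRESHOLDS.get(agent_type, AGENT_THRESHOLDS["default"])
```

Pass the result into `SemanticCache(redis_url, similarity_threshold=threshold_for(agent_type))` when constructing the cache for each agent.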
Cache hit rates after one week:
- Infrastructure status queries: 67% hit rate
- Deployment validation: 41% hit rate
- Code review suggestions: 12% hit rate (too unique, as expected)
## Strategy 3: Model Routing (Saved ~18%)
Not every task needs a frontier model. We built a lightweight router that directs requests to the cheapest capable model:
```python
MODEL_TIERS = {
    "classification": "gpt-4o-mini",         # $0.15/1M input
    "extraction": "gpt-4o-mini",             # Simple structured output
    "summarization": "gpt-4o",               # Needs nuance
    "reasoning": "gpt-4o",                   # Complex decisions
    "code_generation": "claude-sonnet-4-6",  # Best for code
}

def route_request(task_type: str, complexity_score: float) -> str:
    """Route to cheapest capable model based on task type and complexity."""
    base_model = MODEL_TIERS.get(task_type, "gpt-4o")
    # Override: bump up if complexity is high
    if complexity_score > 0.8 and base_model.endswith("mini"):
        return base_model.replace("-mini", "")
    return base_model
```
We score complexity using a fast heuristic — input length, number of distinct entities, presence of code blocks, and whether the request involves multi-step reasoning. The heuristic itself runs on the cheapest model as a pre-filter.
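A purely local version of that scoring can be sketched like this. This is a simplified stand-in, not our exact scorer: the signals (input length, code blocks, multi-step cues, distinct entities) match the description above, but the weights and regexes are illustrative:

```python
import re

def complexity_score(prompt: str) -> float:
    """Cheap heuristic complexity score in [0, 1].

    Illustrative weights; tune against your own traffic.
    """
    score = 0.0
    # Longer inputs tend to be harder (capped contribution)
    score += min(len(prompt) / 4000, 0.4)
    # Presence of code blocks
    if "```" in prompt:
        score += 0.2
    # Multi-step reasoning cues
    steps = len(re.findall(r"\b(then|after that|finally|step \d+)\b", prompt, re.I))
    score += min(steps * 0.1, 0.3)
    # Distinct capitalized entities as a rough entity count
    entities = len(set(re.findall(r"\b[A-Z][a-zA-Z0-9_-]+\b", prompt)))
    score += min(entities * 0.02, 0.1)
    return min(score, 1.0)
```

The output feeds straight into `route_request(task_type, complexity_score(prompt))`.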
## Strategy 4: Context Window Management
This one's underrated. Instead of dumping the entire conversation history into every request, we implemented a sliding window with smart summarization:
```python
def prepare_context(messages: list, max_tokens: int = 2000) -> list:
    """Keep recent messages verbatim, summarize older ones."""
    recent = messages[-4:]  # Last 2 exchanges verbatim
    older = messages[:-4]
    if not older:
        return recent
    # Summarize older context with a cheap model
    summary = summarize(older, model="gpt-4o-mini")
    return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
```
This alone saved 15-20% on our longer agent conversations without any measurable quality drop.
## Measuring What Matters
None of this works without observability. We track three metrics for every agent:
- Cost per successful task — not just cost per request
- Quality score — automated eval comparing optimized vs. unoptimized outputs
- Latency — cache hits are 50-100x faster than LLM calls
We built a simple dashboard that shows these per agent, per day. When cost-per-task creeps up, we investigate. When quality drops below threshold, we roll back.
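The cost-per-successful-task metric is worth spelling out, because it behaves differently from cost per request. A minimal sketch of the rollup (class and field names are illustrative, not our actual code):

```python
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    """Per-agent daily rollup for the cost-per-task metric."""
    cost_usd: float = 0.0
    tasks_attempted: int = 0
    tasks_succeeded: int = 0

    def record(self, cost: float, success: bool) -> None:
        self.cost_usd += cost
        self.tasks_attempted += 1
        if success:
            self.tasks_succeeded += 1

    @property
    def cost_per_successful_task(self) -> float:
        # Divide by successes, not attempts: failed tasks still burn
        # tokens, so they inflate this number. That is the point.
        if self.tasks_succeeded == 0:
            return float("inf")
        return self.cost_usd / self.tasks_succeeded
```

An agent that gets cheaper per request but fails more often shows up immediately in this metric, where a plain cost-per-request chart would hide it.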
At RapidClaw, we've baked these patterns into our agent deployment pipeline so every new agent starts with sane defaults — compressed prompts, caching enabled, model routing configured. It's not glamorous work, but it's the difference between an AI agent project that's a cost center and one that actually scales.
## The Bottom Line
After implementing all four strategies:
| Metric | Before | After | Change |
|---|---|---|---|
| Daily token spend | ~2M | ~540K | -73% |
| Monthly cost | $1,840 | $497 | -73% |
| Avg response latency | 2.3s | 0.8s | -65% |
| Task success rate | 91% | 94% | +3% |
The latency improvement was an unexpected bonus — cache hits are basically free and instant.
If you're deploying AI agents and haven't optimized token costs yet, start with prompt compression. It's the fastest win with zero infrastructure changes. Then add caching. Then model routing. Each layer compounds on the last.
We're building more of these optimization primitives into the RapidClaw platform — if you're running agents in production and want to stop bleeding money on tokens, check it out.
I'm Tijo, founder of RapidClaw. I write about the unglamorous but critical parts of running AI in production. Follow me for more posts on agent ops, infra, and building startups with AI.