Every month I'd open our cloud billing dashboard and wince. Running AI agents in production at RapidClaw meant our token costs were climbing faster than our revenue. Sound familiar?
After three months of aggressive optimization, we cut our monthly token spend by 73% while actually improving agent response quality. Here's exactly how we did it — no vague advice, just the specific techniques that moved the needle.
## The Problem: Death by a Thousand Tokens
When you're running AI agents that handle real workloads — deployment automation, infrastructure monitoring, code review — every unnecessary token adds up. Our agents were processing ~2M tokens per day across various tasks. At GPT-4-class pricing, that's not pocket change.
The root causes were predictable once we actually measured:
- Bloated system prompts copy-pasted across agents (avg 2,400 tokens each)
- No caching layer — identical queries hitting the LLM every time
- Redundant context stuffed into every request "just in case"
- Wrong model for the job — using frontier models for classification tasks
## Strategy 1: Prompt Compression (Saved ~30%)
The biggest win was the simplest. We audited every system prompt and applied aggressive compression.
```python
# BEFORE: 847 tokens
SYSTEM_PROMPT_BEFORE = """
You are a helpful deployment assistant for our cloud infrastructure.
You should help users deploy their applications to our Kubernetes cluster.
You have access to kubectl commands and can help troubleshoot issues.
When a user asks you to deploy something, you should first check if
the namespace exists, then validate the manifest, then apply it.
You should always be polite and professional in your responses.
You should explain what you're doing at each step.
If something goes wrong, provide clear error messages and suggestions.
Always confirm before making destructive changes.
Remember to check resource limits and quotas before deploying.
"""

# AFTER: 196 tokens
SYSTEM_PROMPT_AFTER = """
Role: K8s deployment agent.
Tools: kubectl
Flow: check namespace → validate manifest → apply
Rules: confirm destructive ops, check resource quotas, explain steps
"""
```
Same behavior, 77% fewer tokens. The key insight: LLMs don't need the verbose instructions we think they do. They need structured, precise constraints.
We built a simple compression pipeline:
```python
import tiktoken

# Plug in your own traffic and pricing numbers here
CALLS_PER_DAY = 1_000       # illustrative
COST_PER_TOKEN = 0.00003    # illustrative per-input-token price

def audit_prompt(prompt: str, model: str = "gpt-4") -> dict:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(prompt)
    # Flag prompts over 500 tokens for review
    return {
        "token_count": len(tokens),
        "needs_review": len(tokens) > 500,
        "estimated_daily_cost": len(tokens) * CALLS_PER_DAY * COST_PER_TOKEN,
    }

# Run this on every agent prompt quarterly
for agent in get_all_agents():
    report = audit_prompt(agent.system_prompt)
    if report["needs_review"]:
        print(f"⚠️ {agent.name}: {report['token_count']} tokens "
              f"(${report['estimated_daily_cost']:.2f}/day)")
```
## Strategy 2: Semantic Caching (Saved ~25%)
This was the highest-ROI engineering investment. We added a semantic similarity cache in front of our LLM calls.
```python
import hashlib

import numpy as np
from redis import Redis

class SemanticCache:
    def __init__(self, redis_url: str, similarity_threshold: float = 0.95):
        self.redis = Redis.from_url(redis_url)
        self.threshold = similarity_threshold

    def get_embedding(self, text: str) -> np.ndarray:
        """Use a cheap embedding model — not the expensive LLM."""
        # text-embedding-3-small costs ~$0.02/1M tokens
        return embed_model.encode(text).astype(np.float32)

    def lookup(self, query: str) -> str | None:
        query_emb = self.get_embedding(query)
        # Check against recent cached queries
        for key in self.redis.scan_iter("cache:emb:*"):
            # Decode with the same dtype used in store(), or the
            # similarity math silently breaks
            cached_emb = np.frombuffer(self.redis.get(key), dtype=np.float32)
            similarity = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
            )
            if similarity >= self.threshold:
                response_key = key.decode().replace("emb:", "resp:")
                response = self.redis.get(response_key)
                return response.decode() if response else None
        return None

    def store(self, query: str, response: str, ttl: int = 3600):
        key_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
        emb = self.get_embedding(query)
        self.redis.setex(f"cache:emb:{key_hash}", ttl, emb.tobytes())
        self.redis.setex(f"cache:resp:{key_hash}", ttl, response)
```
The 0.95 similarity threshold was critical. Too low and you get stale/wrong cached responses. Too high and your cache hit rate tanks. We tuned this per agent type — deployment agents got 0.97 (precision matters), monitoring summarizers got 0.92 (more tolerance for variation).
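The per-agent tuning can live in a small lookup table. A minimal sketch (the threshold values are the ones from our tuning; the agent-type keys are illustrative):

```python
# Per-agent-type similarity thresholds. Higher = stricter matching,
# fewer but safer cache hits.
AGENT_THRESHOLDS = {
    "deployment": 0.97,  # precision matters: a wrong cached answer is costly
    "monitoring": 0.92,  # summaries tolerate more variation
    "default": 0.95,
}

def threshold_for(agent_type: str) -> float:
    """Return the similarity threshold for an agent type."""
    return AGENT_THRESHOLDS.get(agent_type, AGENT_THRESHOLDS["default"])
```

Pass the result into `SemanticCache(redis_url, similarity_threshold=threshold_for(agent_type))` when constructing the cache for each agent.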
Cache hit rates after one week:
- Infrastructure status queries: 67% hit rate
- Deployment validation: 41% hit rate
- Code review suggestions: 12% hit rate (too unique, as expected)
## Strategy 3: Model Routing (Saved ~18%)
Not every task needs a frontier model. We built a lightweight router that directs requests to the cheapest capable model:
```python
MODEL_TIERS = {
    "classification": "gpt-4o-mini",         # $0.15/1M input
    "extraction": "gpt-4o-mini",             # Simple structured output
    "summarization": "gpt-4o",               # Needs nuance
    "reasoning": "gpt-4o",                   # Complex decisions
    "code_generation": "claude-sonnet-4-6",  # Best for code
}

def route_request(task_type: str, complexity_score: float) -> str:
    """Route to cheapest capable model based on task type and complexity."""
    base_model = MODEL_TIERS.get(task_type, "gpt-4o")
    # Override: bump up if complexity is high
    if complexity_score > 0.8 and base_model.endswith("mini"):
        return base_model.replace("-mini", "")
    return base_model
```
We score complexity using a fast heuristic — input length, number of distinct entities, presence of code blocks, and whether the request involves multi-step reasoning. The heuristic itself runs on the cheapest model as a pre-filter.
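A purely local version of that scoring can be sketched like this. This is a simplified stand-in, not our exact scorer: the signals (input length, code blocks, multi-step cues, distinct entities) match the description above, but the weights and regexes are illustrative:

```python
import re

def complexity_score(prompt: str) -> float:
    """Cheap heuristic complexity score in [0, 1].

    Illustrative weights; tune against your own traffic.
    """
    score = 0.0
    # Longer inputs tend to be harder (capped contribution)
    score += min(len(prompt) / 4000, 0.4)
    # Presence of code blocks
    if "```" in prompt:
        score += 0.2
    # Multi-step reasoning cues
    steps = len(re.findall(r"\b(then|after that|finally|step \d+)\b", prompt, re.I))
    score += min(steps * 0.1, 0.3)
    # Distinct capitalized entities as a rough entity count
    entities = len(set(re.findall(r"\b[A-Z][a-zA-Z0-9_-]+\b", prompt)))
    score += min(entities * 0.02, 0.1)
    return min(score, 1.0)
```

The output feeds straight into `route_request(task_type, complexity_score(prompt))`.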
## Strategy 4: Context Window Management
This one's underrated. Instead of dumping the entire conversation history into every request, we implemented a sliding window with smart summarization:
```python
def prepare_context(messages: list, max_tokens: int = 2000) -> list:
    """Keep recent messages verbatim, summarize older ones."""
    recent = messages[-4:]  # Last 2 exchanges verbatim
    older = messages[:-4]
    if not older:
        return recent
    # Summarize older context with a cheap model
    summary = summarize(older, model="gpt-4o-mini")
    return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
```
This alone saved 15-20% on our longer agent conversations without any measurable quality drop.
## Measuring What Matters
None of this works without observability. We track three metrics for every agent:
- Cost per successful task — not just cost per request
- Quality score — automated eval comparing optimized vs. unoptimized outputs
- Latency — cache hits are 50-100x faster than LLM calls
We built a simple dashboard that shows these per agent, per day. When cost-per-task creeps up, we investigate. When quality drops below threshold, we roll back.
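The cost-per-successful-task metric is worth spelling out, because it behaves differently from cost per request. A minimal sketch of the rollup (class and field names are illustrative, not our actual code):

```python
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    """Per-agent daily rollup for the cost-per-task metric."""
    cost_usd: float = 0.0
    tasks_attempted: int = 0
    tasks_succeeded: int = 0

    def record(self, cost: float, success: bool) -> None:
        self.cost_usd += cost
        self.tasks_attempted += 1
        if success:
            self.tasks_succeeded += 1

    @property
    def cost_per_successful_task(self) -> float:
        # Divide by successes, not attempts: failed tasks still burn
        # tokens, so they inflate this number. That is the point.
        if self.tasks_succeeded == 0:
            return float("inf")
        return self.cost_usd / self.tasks_succeeded
```

An agent that gets cheaper per request but fails more often shows up immediately in this metric, where a plain cost-per-request chart would hide it.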
At RapidClaw, we've baked these patterns into our agent deployment pipeline so every new agent starts with sane defaults — compressed prompts, caching enabled, model routing configured. It's not glamorous work, but it's the difference between an AI agent project that's a cost center and one that actually scales.
## The Bottom Line
After implementing all four strategies:
| Metric | Before | After | Change |
|---|---|---|---|
| Daily token spend | ~2M | ~540K | -73% |
| Monthly cost | $1,840 | $497 | -73% |
| Avg response latency | 2.3s | 0.8s | -65% |
| Task success rate | 91% | 94% | +3% |
The latency improvement was an unexpected bonus — cache hits are basically free and instant.
If you're deploying AI agents and haven't optimized token costs yet, start with prompt compression. It's the fastest win with zero infrastructure changes. Then add caching. Then model routing. Each layer compounds on the last.
We're building more of these optimization primitives into the RapidClaw platform — if you're running agents in production and want to stop bleeding money on tokens, check it out.
I'm Tijo, founder of RapidClaw. I write about the unglamorous but critical parts of running AI in production. Follow me for more posts on agent ops, infra, and building startups with AI.