Tijo Gaucher

Token Cost Optimization for AI Agents: 7 Patterns That Cut Our Bill by 73%

Six months ago our monthly LLM bill at RapidClaw hit a number I'd rather not print. We were running production AI agents across customer workloads, and every "let's just add one more tool call" was quietly compounding into a four-figure surprise on the invoice.

I'm Tijo Gaucher, founder of RapidClaw. We build infrastructure for teams who want to ship AI agents without becoming full-time prompt engineers. After spending a quarter obsessing over our own token economics, we cut spend by 73% — without degrading agent quality. Here are the seven patterns that mattered most.

1. Prompt caching is the cheapest 90% win you'll ever ship

If you're sending the same system prompt, tool definitions, or RAG context on every turn, you're paying full freight for tokens the model has already seen. Anthropic, OpenAI, and most major providers now support prompt caching with cache hits priced at roughly 10% of normal input tokens.

```python
# Before: 4,200 input tokens/turn at full price
messages = [
    {"role": "system", "content": LARGE_SYSTEM_PROMPT + TOOL_DEFINITIONS},
    {"role": "user", "content": user_input}
]

# After: same prompt, marked cacheable
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT + TOOL_DEFINITIONS,
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    {"role": "user", "content": user_input}
]
```

One line of config. ~85% cost reduction on the cached portion. There's no excuse not to ship this today.
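A quick back-of-envelope check of that claim (assuming cache reads at ~10% of the normal input price and ignoring the small cache-write surcharge some providers add):

```python
def prompt_cache_saving(tokens_per_turn: int, turns: int,
                        cache_read_discount: float = 0.10) -> float:
    """Fraction of input-token cost saved when a prompt prefix is cached.

    The first turn pays full price (the cache write); every later turn
    pays only the discounted cache-read rate on the same prefix.
    """
    uncached = tokens_per_turn * turns
    cached = tokens_per_turn + (turns - 1) * tokens_per_turn * cache_read_discount
    return 1 - cached / uncached

# A 4,200-token cached prefix over a 12-turn session:
print(f"{prompt_cache_saving(4_200, 12):.1%} saved")
```

That lands around 82.5% for a 12-turn session, in the same ballpark as the ~85% figure above; longer sessions push it closer to the full 90% discount.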

2. Route by complexity, not by habit

Not every task needs your most expensive model. We built a tiny router that classifies incoming agent requests into three buckets and dispatches them to the cheapest model that can plausibly handle the job:

```python
def route_model(task_type: str, context_size: int) -> str:
    if task_type in ("classify", "extract", "format"):
        return "haiku"          # ~$0.25/M input
    if context_size > 50_000 or task_type == "reason":
        return "sonnet"         # ~$3/M input
    return "haiku"              # default to cheap
```

We escalate to the bigger model only when the cheap one returns low confidence or fails validation. Roughly 68% of our agent calls now resolve on the small model. That alone moved the needle more than any other optimization.
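The escalate-on-failure path is a thin wrapper around two model calls. A sketch (here `cheap_call`, `expensive_call`, and `validate` are placeholders for your own model clients and output checks, not RapidClaw's actual runtime):

```python
from typing import Callable, Tuple

def call_with_escalation(
    task: str,
    cheap_call: Callable[[str], str],
    expensive_call: Callable[[str], str],
    validate: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the small model first; pay for the big one only on failure."""
    answer = cheap_call(task)
    if validate(answer):
        return answer, "haiku"
    return expensive_call(task), "sonnet"  # escalate on failed validation
```

The validation step is what makes this safe: a schema check, a confidence threshold, or a regex on the output all work, as long as failures are cheap to detect.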

3. Trim your tool definitions ruthlessly

Tool/function schemas are tokens too. We audited ours and found 11 tools with descriptions averaging 180 tokens each, half of which were redundant explanation the model didn't actually need.

Cut every tool description down to its single most informative sentence. Move worked examples into a separate retrievable doc the agent can fetch only when it needs guidance. We saved ~1,400 tokens per turn just by editing JSON.
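As a hypothetical before/after (the tool name and schema below are made up for illustration), the edit looks like this:

```python
# Before: verbose description full of explanation the model doesn't need
verbose_tool = {
    "name": "search_tickets",
    "description": (
        "This tool lets the agent search the support ticketing system. "
        "Use it whenever the user asks about tickets, for example 'what "
        "tickets are open?' or 'show me Jane's backlog'. It supports "
        "filtering by status, assignee, and free-text query, and returns "
        "a JSON list of matching tickets sorted by last update."
    ),
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}

# After: one informative sentence; worked examples move to a retrievable doc
trimmed_tool = {
    "name": "search_tickets",
    "description": "Search support tickets by status, assignee, or free-text query.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
```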

4. Stop re-feeding the entire conversation history

The naive agent loop ships the full message history on every turn. By turn 12 you're paying for turns 1–11 again. Three things help:

  1. Sliding window — keep only the last N turns verbatim
  2. Summary compaction — once history exceeds a threshold, ask a cheap model to summarize older turns into a 200-token recap
  3. Memory extraction — pull stable facts (user prefs, project IDs, decisions) into a structured memory store, then inject only the relevant rows

```python
def compact_history(messages, threshold=20):
    if len(messages) < threshold:
        return messages
    old, recent = messages[:-10], messages[-10:]
    summary = cheap_summarize(old)
    return [{"role": "system", "content": f"Earlier context: {summary}"}] + recent
```

5. Cap your tool-call loops

The single biggest money pit in agent systems isn't the model — it's the runaway loop. An agent that retries a flaky tool 14 times will quietly burn through more budget than 200 normal sessions.

Hard cap iterations. Add exponential backoff. Surface a clear error to the user instead of letting the model keep paying to re-try. Our default is 8 tool calls per turn with a budget guardrail that aborts the session if input tokens exceed a configured ceiling. You can read more about how we handle this in our agent runtime docs.
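A minimal sketch of that guardrail, omitting the backoff logic (the limits and the `step` callable are illustrative, not our exact runtime):

```python
MAX_TOOL_CALLS = 8              # hard cap per turn
INPUT_TOKEN_CEILING = 150_000   # example per-session budget

class BudgetExceeded(RuntimeError):
    """Raised so the caller surfaces a clear error instead of retrying forever."""

def run_tool_loop(step):
    """step() performs one tool call, returning (result, done, input_tokens)."""
    spent = 0
    for _ in range(MAX_TOOL_CALLS):
        result, done, tokens = step()
        spent += tokens
        if spent > INPUT_TOKEN_CEILING:
            raise BudgetExceeded(f"session spent {spent} input tokens")
        if done:
            return result
    raise BudgetExceeded(f"exceeded {MAX_TOOL_CALLS} tool calls in one turn")
```

The important design choice is that both limits raise instead of silently truncating: a loud failure is cheap, a quiet retry loop is not.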

6. Stream and short-circuit

If your agent's output gets parsed and acted on, you don't need to wait for the full completion. Stream the response and short-circuit as soon as you've got the structured field you need. We saved roughly 22% of output tokens on long-form generations by stopping early when a <done> sentinel was emitted.

```python
buffer = ""
async with client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        buffer += text
        if "<done>" in buffer:
            break  # stop paying for more tokens
```

7. Self-host the cheap stuff

Not every step in an agent pipeline needs a frontier model. Embeddings, classification, reranking, simple extraction — these run beautifully on small open models you can deploy on a single GPU box for a fixed monthly cost.

We moved embeddings and intent classification onto a self-hosted setup and the marginal cost dropped to effectively zero. The frontier model still handles the hard reasoning, but the surrounding plumbing now runs on infrastructure we control. If you're curious how we deploy and scale these, we wrote up the full architecture on the RapidClaw blog.
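Whether self-hosting pays off is simple break-even arithmetic. A sketch with illustrative numbers (the GPU and per-call prices below are assumptions, not quotes):

```python
import math

def breakeven_calls_per_month(gpu_monthly_usd: float,
                              api_usd_per_1k_calls: float) -> int:
    """Monthly call volume above which a fixed-cost GPU box beats per-call API pricing."""
    return math.ceil(gpu_monthly_usd / api_usd_per_1k_calls * 1000)

# e.g. a ~$600/month single-GPU box vs ~$0.40 per 1,000 embedding calls:
print(breakeven_calls_per_month(600, 0.40))  # 1500000
```

Below the break-even volume, the API is cheaper and simpler; above it, the fixed-cost box wins and keeps winning as traffic grows.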

The numbers

Stacked together, here's what each pattern contributed to our 73% cut:

| Pattern | Contribution |
| --- | --- |
| Prompt caching | 31% |
| Model routing | 19% |
| Self-hosting plumbing | 11% |
| History compaction | 6% |
| Tool definition trim | 3% |
| Loop caps + budget guard | 2% |
| Stream short-circuit | 1% |

The lesson isn't that any single trick is magical — it's that token economics is additive. Five mediocre optimizations beat one heroic one, and they're far easier to ship.

What I'd do first if I were starting over

If I had to rebuild this from scratch tomorrow with one week to optimize, I'd ship in this order: prompt caching → loop caps → model routing → history compaction. Those four alone get you to roughly 60% savings and require no infrastructure changes.

Everything else is polish.


If you're building production agents and want a runtime that bakes these patterns in by default, that's exactly what we're building at RapidClaw. I'd love to hear how you're handling token economics in your own stack — drop a comment or hit me up.

— Tijo
