Tijo Gaucher

Token Cost Optimization for AI Agents: 7 Patterns That Cut Our Bill by 73%

Six months ago our monthly LLM bill at RapidClaw hit a number I'd rather not print. We were running production AI agents across customer workloads, and every "let's just add one more tool call" was quietly compounding into a four-figure surprise on the invoice.

I'm Tijo Gaucher, founder of RapidClaw. We build infrastructure for teams who want to ship AI agents without becoming full-time prompt engineers. After spending a quarter obsessing over our own token economics, we cut spend by 73% — without degrading agent quality. Here are the seven patterns that mattered most.

1. Prompt caching is the cheapest 90% win you'll ever ship

If you're sending the same system prompt, tool definitions, or RAG context on every turn, you're paying full freight for tokens the model has already seen. Anthropic, OpenAI, and most major providers now support prompt caching with cache hits priced at roughly 10% of normal input tokens.

```python
# Before: 4,200 input tokens/turn at full price
messages = [
    {"role": "system", "content": LARGE_SYSTEM_PROMPT + TOOL_DEFINITIONS},
    {"role": "user", "content": user_input}
]

# After: same prompt, marked cacheable
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT + TOOL_DEFINITIONS,
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    {"role": "user", "content": user_input}
]
```

One line of config. ~85% cost reduction on the cached portion. There's no excuse not to ship this today.
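A quick back-of-envelope check of that claim (assuming cache reads at ~10% of the normal input price and ignoring the small cache-write surcharge some providers add):

```python
def prompt_cache_saving(tokens_per_turn: int, turns: int,
                        cache_read_discount: float = 0.10) -> float:
    """Fraction of input-token cost saved when a prompt prefix is cached.

    The first turn pays full price (the cache write); every later turn
    pays only the discounted cache-read rate on the same prefix.
    """
    uncached = tokens_per_turn * turns
    cached = tokens_per_turn + (turns - 1) * tokens_per_turn * cache_read_discount
    return 1 - cached / uncached

# A 4,200-token cached prefix over a 12-turn session:
print(f"{prompt_cache_saving(4_200, 12):.1%} saved")
```

That lands around 82.5% for a 12-turn session, in the same ballpark as the ~85% figure above; longer sessions push it closer to the full 90% discount.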

2. Route by complexity, not by habit

Not every task needs your most expensive model. We built a tiny router that classifies incoming agent requests into three buckets and dispatches them to the cheapest model that can plausibly handle the job:

```python
def route_model(task_type: str, context_size: int) -> str:
    if task_type in ("classify", "extract", "format"):
        return "haiku"          # ~$0.25/M input
    if context_size > 50_000 or task_type == "reason":
        return "sonnet"         # ~$3/M input
    return "haiku"              # default to cheap
```

We escalate to the bigger model only when the cheap one returns low confidence or fails validation. Roughly 68% of our agent calls now resolve on the small model. That alone moved the needle more than any other optimization.
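The escalate-on-failure path is a thin wrapper around two model calls. A sketch (here `cheap_call`, `expensive_call`, and `validate` are placeholders for your own model clients and output checks, not RapidClaw's actual runtime):

```python
from typing import Callable, Tuple

def call_with_escalation(
    task: str,
    cheap_call: Callable[[str], str],
    expensive_call: Callable[[str], str],
    validate: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the small model first; pay for the big one only on failure."""
    answer = cheap_call(task)
    if validate(answer):
        return answer, "haiku"
    return expensive_call(task), "sonnet"  # escalate on failed validation
```

The validation step is what makes this safe: a schema check, a confidence threshold, or a regex on the output all work, as long as failures are cheap to detect.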

3. Trim your tool definitions ruthlessly

Tool/function schemas are tokens too. We audited ours and found 11 tools with descriptions averaging 180 tokens each, half of which were redundant explanation the model didn't actually need.

Cut every tool description down to its single most informative sentence. Move worked examples into a separate retrievable doc the agent can fetch only when it needs guidance. We saved ~1,400 tokens per turn just by editing JSON.
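As a hypothetical before/after (the tool name and schema below are made up for illustration), the edit looks like this:

```python
# Before: verbose description full of explanation the model doesn't need
verbose_tool = {
    "name": "search_tickets",
    "description": (
        "This tool lets the agent search the support ticketing system. "
        "Use it whenever the user asks about tickets, for example 'what "
        "tickets are open?' or 'show me Jane's backlog'. It supports "
        "filtering by status, assignee, and free-text query, and returns "
        "a JSON list of matching tickets sorted by last update."
    ),
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}

# After: one informative sentence; worked examples move to a retrievable doc
trimmed_tool = {
    "name": "search_tickets",
    "description": "Search support tickets by status, assignee, or free-text query.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
```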

4. Stop re-feeding the entire conversation history

The naive agent loop ships the full message history on every turn. By turn 12 you're paying for turns 1–11 again. Three things help:

  1. Sliding window — keep only the last N turns verbatim
  2. Summary compaction — once history exceeds a threshold, ask a cheap model to summarize older turns into a 200-token recap
  3. Memory extraction — pull stable facts (user prefs, project IDs, decisions) into a structured memory store, then inject only the relevant rows

```python
def compact_history(messages, threshold=20):
    if len(messages) < threshold:
        return messages
    old, recent = messages[:-10], messages[-10:]
    summary = cheap_summarize(old)
    return [{"role": "system", "content": f"Earlier context: {summary}"}] + recent
```

5. Cap your tool-call loops

The single biggest money pit in agent systems isn't the model — it's the runaway loop. An agent that retries a flaky tool 14 times will quietly burn through more budget than 200 normal sessions.

Hard cap iterations. Add exponential backoff. Surface a clear error to the user instead of letting the model keep paying to re-try. Our default is 8 tool calls per turn with a budget guardrail that aborts the session if input tokens exceed a configured ceiling. You can read more about how we handle this in our agent runtime docs.
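A minimal sketch of that guardrail, omitting the backoff logic (the limits and the `step` callable are illustrative, not our exact runtime):

```python
MAX_TOOL_CALLS = 8              # hard cap per turn
INPUT_TOKEN_CEILING = 150_000   # example per-session budget

class BudgetExceeded(RuntimeError):
    """Raised so the caller surfaces a clear error instead of retrying forever."""

def run_tool_loop(step):
    """step() performs one tool call, returning (result, done, input_tokens)."""
    spent = 0
    for _ in range(MAX_TOOL_CALLS):
        result, done, tokens = step()
        spent += tokens
        if spent > INPUT_TOKEN_CEILING:
            raise BudgetExceeded(f"session spent {spent} input tokens")
        if done:
            return result
    raise BudgetExceeded(f"exceeded {MAX_TOOL_CALLS} tool calls in one turn")
```

The important design choice is that both limits raise instead of silently truncating: a loud failure is cheap, a quiet retry loop is not.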

6. Stream and short-circuit

If your agent's output gets parsed and acted on, you don't need to wait for the full completion. Stream the response and short-circuit as soon as you've got the structured field you need. We saved roughly 22% of output tokens on long-form generations by stopping early when a <done> sentinel was emitted.

```python
buffer = ""
async with client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        buffer += text
        if "<done>" in buffer:
            break  # stop paying for more tokens
```

7. Self-host the cheap stuff

Not every step in an agent pipeline needs a frontier model. Embeddings, classification, reranking, simple extraction — these run beautifully on small open models you can deploy on a single GPU box for a fixed monthly cost.

We moved embeddings and intent classification onto a self-hosted setup and the marginal cost dropped to effectively zero. The frontier model still handles the hard reasoning, but the surrounding plumbing now runs on infrastructure we control. If you're curious how we deploy and scale these, we wrote up the full architecture on the RapidClaw blog.
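Whether self-hosting pays off is simple break-even arithmetic. A sketch with illustrative numbers (the GPU and per-call prices below are assumptions, not quotes):

```python
import math

def breakeven_calls_per_month(gpu_monthly_usd: float,
                              api_usd_per_1k_calls: float) -> int:
    """Monthly call volume above which a fixed-cost GPU box beats per-call API pricing."""
    return math.ceil(gpu_monthly_usd / api_usd_per_1k_calls * 1000)

# e.g. a ~$600/month single-GPU box vs ~$0.40 per 1,000 embedding calls:
print(breakeven_calls_per_month(600, 0.40))  # 1500000
```

Below the break-even volume, the API is cheaper and simpler; above it, the fixed-cost box wins and keeps winning as traffic grows.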

The numbers

Stacked together, here's what each pattern contributed to our 73% cut:

| Pattern | Contribution |
| --- | --- |
| Prompt caching | 31% |
| Model routing | 19% |
| Self-hosting plumbing | 11% |
| History compaction | 6% |
| Tool definition trim | 3% |
| Loop caps + budget guard | 2% |
| Stream short-circuit | 1% |

The lesson isn't that any single trick is magical — it's that token economics is additive. Five mediocre optimizations beat one heroic one, and they're far easier to ship.

What I'd do first if I were starting over

If I had to rebuild this from scratch tomorrow with one week to optimize, I'd ship in this order: prompt caching → loop caps → model routing → history compaction. Those four alone get you to roughly 60% savings and require no infrastructure changes.

Everything else is polish.


If you're building production agents and want a runtime that bakes these patterns in by default, that's exactly what we're building at RapidClaw. I'd love to hear how you're handling token economics in your own stack — drop a comment or hit me up.

— Tijo
