AI agents burn tokens fast. Learn the practical strategies that reduced our OpenClaw agent costs by 86% without sacrificing capability, from context pruning to tiered model routing.
Originally published at clamper.tech
Running an AI agent is not like chatting with ChatGPT. A single conversation might cost you a fraction of a cent. But an always-on agent that reads files, executes commands, processes emails, and makes decisions? That can burn through $50 to $200 per day if you are not careful. We learned this the hard way. Then we fixed it. Here is exactly how we reduced our OpenClaw agent operating costs by 86% while actually improving performance.
Why AI Agents Are Token Monsters
A chatbot sends a message, gets a response. Simple. An AI agent is different. Every action it takes requires context. It needs to understand what it is doing, why, what tools are available, what happened before, and what the user wants. All of that context is tokens.
Here is what a typical OpenClaw agent session looks like under the hood:
- System prompt: 2,000 to 8,000 tokens (instructions, personality, tool definitions)
- Workspace files: 1,000 to 5,000 tokens (AGENTS.md, SOUL.md, TOOLS.md loaded per turn)
- Conversation history: 500 to 50,000+ tokens (grows with every exchange)
- Tool results: 200 to 10,000 tokens per tool call (file contents, command output, search results)
- Skill definitions: 500 to 3,000 tokens (available skills listed in every request)
A single turn can easily hit 30,000 to 60,000 input tokens. With Claude or GPT-4 class models, that is $0.10 to $0.50 per interaction. Over hundreds of daily interactions (heartbeats, automated tasks, user messages), costs add up fast.
Strategy 1: Tiered Model Routing
This single change made the biggest difference. Not every task needs the most powerful model. A heartbeat check that returns "nothing to do" does not need Claude Opus. A simple file read does not need GPT-4.
The idea is straightforward: route tasks to the cheapest model that can handle them.
# Model routing strategy
# Tier 1: Lightweight tasks (heartbeats, simple lookups, status checks)
# Model: Claude Haiku / GPT-4o Mini
# Cost: ~$0.001 per interaction
# Use for: 70% of all agent interactions
# Tier 2: Standard tasks (email drafts, code review, research)
# Model: Claude Sonnet / GPT-4o
# Cost: ~$0.01-0.03 per interaction
# Use for: 25% of interactions
# Tier 3: Complex tasks (architecture decisions, long-form writing, debugging)
# Model: Claude Opus / GPT-4
# Cost: ~$0.10-0.50 per interaction
# Use for: 5% of interactions
In OpenClaw, you can set the default model in your gateway config and override it per task. Clamper makes this easier with its cost tracking dashboard, so you can see exactly which tasks are burning the most tokens and adjust your routing accordingly.
The math is compelling. If 70% of your agent's interactions are Tier 1 tasks running on a model that costs 20x less than your top-tier model, you save roughly 65% right there.
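Here is a minimal sketch of what a router might look like. The task categories, model names, and cost estimates are illustrative, not OpenClaw's actual config schema:

```python
# Hypothetical tiered router: map task types to the cheapest
# capable model. Names and per-turn cost estimates are examples only.
ROUTES = {
    "heartbeat":     ("claude-haiku", 0.001),   # Tier 1
    "status_check":  ("claude-haiku", 0.001),   # Tier 1
    "email_draft":   ("claude-sonnet", 0.02),   # Tier 2
    "code_review":   ("claude-sonnet", 0.02),   # Tier 2
    "architecture":  ("claude-opus", 0.30),     # Tier 3
    "debugging":     ("claude-opus", 0.30),     # Tier 3
}
DEFAULT = ("claude-sonnet", 0.02)  # unknown tasks get the middle tier

def route(task_type: str) -> str:
    """Return the model assigned to this task type (or the default)."""
    model, _est_cost = ROUTES.get(task_type, DEFAULT)
    return model
```

Defaulting unknown tasks to the middle tier is a deliberate safety margin: misrouting a hard task to the cheapest model costs more in retries than the tokens you save.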
Strategy 2: Context Window Pruning
Every token in your context window costs money on every single turn. If your system prompt is 8,000 tokens and your agent processes 200 messages per day, that is 1.6 million tokens per day just for the system prompt. At Claude Sonnet rates, that is around $4.80 per day for content the model reads but mostly ignores.
Here is how to prune effectively:
Trim Your System Prompt
Audit every line. Does the model actually need this instruction? We found that most system prompts contain 30% to 50% redundant or rarely-used instructions. Move infrequently needed context into files the agent reads on demand instead of loading every turn.
# Before: Everything in AGENTS.md (loaded every turn)
# AGENTS.md: 4,200 tokens with detailed instructions for every scenario
# After: Core instructions in AGENTS.md, details on demand
# AGENTS.md: 1,800 tokens (essentials only)
# docs/deployment-guide.md: loaded when deploying
# docs/email-rules.md: loaded when handling email
# docs/social-media-guide.md: loaded when posting
# Savings: ~2,400 tokens per turn = ~480,000 tokens/day
Conversation History Compression
Conversation history is the silent cost killer. After 20 messages, your history might be 15,000 tokens. After 50 messages, it could be 40,000+ tokens. Most of that is stale context the model does not need.
Effective strategies:
- Rolling window: Keep only the last N messages in context. For most tasks, 10 to 15 messages is enough.
- Summary compression: Periodically summarize older messages into a compact paragraph and replace the full history.
- Tool output truncation: A file read that returned 5,000 tokens does not need to stay in history at full length. Truncate to a summary after processing.
Strategy 3: Smart Skill Loading
OpenClaw skills are powerful, but listing all available skills in every request adds up. If you have 30 installed skills, that skill index alone could be 2,000 to 4,000 tokens per turn.
The fix: load skill descriptions lazily. Instead of dumping every skill definition into the system prompt, provide a lightweight index (skill name + one-line description) and load the full SKILL.md only when the agent decides it needs that skill.
# Lazy skill loading approach
# System prompt includes only:
# - github: GitHub operations via gh CLI
# - weather: Get weather forecasts
# - email: Read and send email via IMAP/SMTP
# (~50 tokens for 10 skills)
# Instead of:
# Full SKILL.md content for each skill
# (~3,000 tokens for 10 skills)
# Agent reads the full SKILL.md only when it decides
# to use that specific skill. 95% of turns, it doesn't
# need any skill details at all.
This is actually how Clamper structures its skill management by default. Skills are indexed with minimal metadata, and full documentation is loaded on demand. The token savings compound over hundreds of daily interactions.
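A lazy loader is only a few lines. The skill table and directory layout below are hypothetical, standing in for however your agent stores its installed skills:

```python
from pathlib import Path

# Illustrative skill table: name -> one-line description.
SKILLS = {
    "github": "GitHub operations via gh CLI",
    "weather": "Get weather forecasts",
    "email": "Read and send email via IMAP/SMTP",
}

def skill_index() -> str:
    """Compact index injected every turn: one line per skill."""
    return "\n".join(f"- {name}: {desc}" for name, desc in SKILLS.items())

def load_skill(name: str, skills_dir: Path = Path("skills")) -> str:
    """Full SKILL.md, read from disk only when the agent picks a skill.
    Assumes a skills/<name>/SKILL.md layout."""
    return (skills_dir / name / "SKILL.md").read_text()
```

Every turn pays for `skill_index()` (tens of tokens); only the rare turn that actually uses a skill pays for `load_skill()`.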
Strategy 4: Batch Operations
Every API call to the model carries the full context window. So if your agent needs to process 10 emails, doing them one at a time means paying for the context 10 times.
Instead, batch related operations into single turns:
# Expensive: One email per turn
# Turn 1: [full context] + "Process email 1" -> response
# Turn 2: [full context] + "Process email 2" -> response
# Turn 3: [full context] + "Process email 3" -> response
# Total: 3x context window cost
# Cheaper: Batch in one turn
# Turn 1: [full context] + "Here are 10 emails. For each,
# categorize as urgent/normal/spam and draft a reply
# if needed." -> all responses in one turn
# Total: 1x context window cost + slightly longer output
This works especially well for heartbeat checks. Instead of checking email, then calendar, then weather in three separate turns, combine them: "Check email, calendar, and weather. Report anything that needs attention." One turn, one context load.
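Building the batched prompt is mostly string assembly. A minimal sketch, with the instruction wording as an example only:

```python
def batch_prompt(emails: list[str]) -> str:
    """Fold N emails into one prompt so the context window is paid
    for once instead of N times."""
    numbered = "\n\n".join(
        f"Email {i}:\n{body}" for i, body in enumerate(emails, start=1)
    )
    return (
        "For each email below, categorize it as urgent/normal/spam "
        "and draft a reply if needed.\n\n" + numbered
    )
```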
Strategy 5: Subagent Isolation
This is an underrated cost optimization. When your main agent spawns a subagent for a complex task, that subagent starts with a clean, minimal context. It does not carry the main agent's full conversation history, memory files, or unrelated skill definitions.
A subagent for a coding task only needs:
- The specific task description (200 to 500 tokens)
- Relevant file contents (loaded on demand)
- Tool definitions for tools it actually uses
Compare that to the main agent, which might be carrying 40,000 tokens of conversation history, memory files, and 30 skill definitions. The subagent does the same work at a fraction of the context cost.
# Main agent context: ~45,000 tokens
# Processing a coding task in main session:
# 45,000 input tokens x 5 turns = 225,000 tokens total
# Same task as subagent:
# Subagent context: ~8,000 tokens
# 8,000 input tokens x 5 turns = 40,000 tokens total
# Savings: 82% fewer input tokens for the same result
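Conceptually, spawning a subagent means building a fresh context from scratch rather than forking the main session. A sketch, with the message shape and field names assumed for illustration:

```python
def subagent_context(task: str, files: dict[str, str],
                     tools: list[str]) -> list[dict]:
    """Minimal context for a spawned subagent: just the task, the
    relevant files, and the tools it actually needs. No main-session
    history, memory files, or unused skill definitions."""
    system = (
        "You are a focused subagent. Complete the task below.\n"
        "Available tools: " + ", ".join(tools)
    )
    file_blobs = "\n\n".join(
        f"--- {path} ---\n{content}" for path, content in files.items()
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Task: {task}\n\n{file_blobs}"},
    ]
```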
Strategy 6: Output Token Management
Input tokens get all the attention, but output tokens cost 3x to 5x more per token on most models. An agent that writes 2,000 token responses when 200 tokens would do is burning money.
Practical fixes:
- Set tone in system prompt: "Be concise. Use bullet points. Skip preamble." This alone can cut output tokens by 40%.
- Structured output: When the agent generates data (not prose), request JSON or structured formats. They are more compact than natural language.
- Avoid redundant confirmation: An agent that says "I have successfully completed the task of reading the file and here is what I found" is wasting tokens. "File contents:" is enough.
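The prose-vs-structured gap is easy to see with a toy comparison. Both strings below are invented examples carrying the same three facts:

```python
import json

# The verbose confirmation an untuned agent tends to produce:
prose = ("I have successfully completed the task of categorizing the "
         "email. It appears to be an urgent message from your manager "
         "regarding the quarterly report deadline.")

# The same facts as compact JSON (separators strip whitespace):
structured = json.dumps(
    {"category": "urgent", "sender": "manager",
     "topic": "quarterly report deadline"},
    separators=(",", ":"),
)

print(len(prose), len(structured))  # JSON is roughly half the size
```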
Strategy 7: Caching and Memory Architecture
If your agent looks up the same information repeatedly, you are paying for it every time. A proper memory system prevents redundant lookups.
The three-layer memory approach works well here:
- Layer 1 (Daily notes): Raw session logs. Written once, read occasionally. Low token cost.
- Layer 2 (Knowledge graph): Extracted facts and patterns. Compact, searchable. Read frequently but small.
- Layer 3 (Indexed knowledge): Full searchable archive. Queried on demand, not loaded into context.
The key insight: Layer 2 acts as a cache. Instead of re-reading a 3,000-token daily note to find one fact, the agent reads a 50-token entry from the knowledge layer. Over hundreds of lookups, this saves enormous token volume. Clamper's workspace scaffolding sets up this exact three-layer structure automatically when you run npm install -g clamper-ai and initialize your workspace.
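The cache behavior of Layer 2 can be sketched in a few lines. The class name and storage shape are illustrative, not Clamper's actual knowledge-graph format:

```python
class FactCache:
    """Layer-2 sketch: check the compact fact store before paying to
    re-read a full daily note. Structure is illustrative only."""

    def __init__(self):
        self.facts: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, key: str, load_note) -> str:
        """Return a cached fact, or fall back to the expensive
        note-loading callable and cache the result."""
        if key in self.facts:
            self.hits += 1           # ~50-token read
            return self.facts[key]
        self.misses += 1
        value = load_note(key)       # ~3,000-token re-read
        self.facts[key] = value
        return value
```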
The Numbers: Before and After
Here is our actual cost breakdown before and after applying these strategies to a production OpenClaw agent running 24/7:
Before Optimization
- Daily interactions: ~250 (heartbeats + automated tasks + user messages)
- Average input tokens per turn: 42,000
- Average output tokens per turn: 1,800
- Model: Claude Sonnet for everything
- Daily input cost: ~$31.50
- Daily output cost: ~$6.75
- Total daily cost: ~$38.25
- Monthly cost: ~$1,147
After Optimization
- Daily interactions: ~220 (batching reduced total turns)
- Tier 1 (Haiku, 70%): 154 turns x 6,000 avg tokens = $0.14
- Tier 2 (Sonnet, 25%): 55 turns x 18,000 avg tokens = $2.97
- Tier 3 (Opus, 5%): 11 turns x 35,000 avg tokens = $1.16
- Output costs across all tiers: ~$0.95
- Total daily cost: ~$5.22
- Monthly cost: ~$157
That is an 86% reduction. And honestly, the agent performs better after optimization because the tiered routing means it uses the right model for each task instead of overkill on everything.
Quick Wins You Can Apply Today
If you only have 30 minutes, do these three things:
Switch heartbeats to a lightweight model. If your agent runs periodic heartbeat checks, route them to Haiku or GPT-4o Mini. This alone can save 40% to 60% of your total costs since heartbeats often make up the majority of agent interactions.
Audit your system prompt. Copy your AGENTS.md, SOUL.md, and TOOLS.md into a token counter. If the total is over 3,000 tokens, start trimming. Move detailed instructions into on-demand files.
Batch your heartbeat checks. Instead of separate checks for email, calendar, and notifications, combine them into a single heartbeat turn. Three checks for the price of one context load.
Monitoring Your Costs
You cannot optimize what you do not measure. Track these metrics:
- Tokens per turn (input and output): Your primary cost driver. Track the average and the outliers.
- Turns per day: Are you making unnecessary API calls? Can any be batched or eliminated?
- Cost per task type: Which categories of work cost the most? That is where optimization has the biggest impact.
- Model utilization: What percentage of turns are running on each model tier? Aim for 60% or more on your cheapest tier.
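If you are rolling your own tracking, a per-task-type accumulator is enough to start. A minimal sketch; the price table (dollars per million tokens) is an example you would fill in from your provider's pricing page:

```python
from collections import defaultdict

class CostTracker:
    """Accumulate turns and dollar cost per task type.
    prices: model -> (input $/M tokens, output $/M tokens)."""

    def __init__(self, prices: dict[str, tuple[float, float]]):
        self.prices = prices
        self.totals = defaultdict(lambda: {"turns": 0, "cost": 0.0})

    def record(self, task_type: str, model: str,
               tokens_in: int, tokens_out: int) -> float:
        """Log one turn; return its cost in dollars."""
        price_in, price_out = self.prices[model]
        cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
        entry = self.totals[task_type]
        entry["turns"] += 1
        entry["cost"] += cost
        return cost
```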
If you are using Clamper, the dashboard at clamper.tech gives you a real-time view of these metrics. Seeing your actual token consumption broken down by task type is often the wake-up call that motivates optimization.
Common Mistakes to Avoid
- Over-optimizing on quality: Do not route complex coding tasks to a cheap model just to save money. The debugging time from bad code costs more than the token savings. Match the model to the task difficulty.
- Ignoring output tokens: Input tokens are cheaper per token, but verbose agents generate massive output. A concise system prompt instruction pays for itself many times over.
- Not measuring first: Do not guess where your costs are. Measure for a week, then optimize the biggest line items. You might be surprised where the tokens are going.
- Premature truncation: Aggressively truncating context can break agent reasoning. Start conservative, then tighten gradually while monitoring task success rates.
The Bottom Line
Running an AI agent does not have to cost hundreds of dollars per month. With tiered model routing, context pruning, batching, subagent isolation, and proper memory architecture, you can run a capable 24/7 agent for under $5 per day.
The strategies here are not theoretical. They are the exact optimizations running in production on agents managed with Clamper and OpenClaw. Start with the quick wins (model routing and heartbeat batching), measure the impact, then work through the more involved optimizations as needed.
Your agent should be saving you time and money, not draining your API budget. Optimize smart, and it will.
Start optimizing your agent today. Clamper gives you cost tracking, workspace scaffolding, and memory management out of the box.
npm install -g clamper-ai
Learn more at clamper.tech