We measured what 10 tools × 1,000 calls/day actually costs. Here's the data.
Posted to r/ClaudeAI · r/LocalLLaMA · Hacker News
When you build an AI agent, you give it tools. Search the web. Read a file. Call an API. Query a database.
Each tool needs a description — a JSON block that tells the model what the tool does and what parameters it takes. Here's what a single tool looks like:
{
"name": "search_web",
"description": "Search the web for recent information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
},
"max_results": {
"type": "integer",
"description": "Maximum number of results to return"
}
},
"required": ["query"]
}
}
That single definition is about 80 tokens.
If your agent has 10 tools, you're sending ~800–1,200 tokens of tool definitions on every API call. Not once. Every call.
The actual numbers
We ran 1,000 simulated agent sessions across four agent sizes. Pricing at Claude Sonnet 4 input ($3 / 1M tokens).
| Tools | Tokens / call | 1k calls/day | Cost / month | Cost / year |
|---|---|---|---|---|
| 5 | ~600 | 600k tok/day | $54 | $657 |
| 10 | ~1,200 | 1.2M tok/day | $108 | $1,314 |
| 20 | ~2,400 | 2.4M tok/day | $216 | $2,628 |
| 50 | ~6,000 | 6M tok/day | $540 | $6,570 |
At 10k calls/day (not unusual for a production agent), multiply those numbers by 10.
Why this doesn't go away
The obvious answer is: Anthropic has prompt caching. Use that.
Prompt caching helps, but:
- Cached input tokens are still billed — at 10% of normal price. Not free.
- Cache TTL is 5 minutes. If your sessions are longer than 5 minutes apart, you pay full price.
- Cache invalidates on any change. If you add a tool, update a description, or rotate an API key in a tool — full price again.
So even with caching, you're paying for tool tokens. And most agents don't have caching set up at all.
What we built
Promptolian is a compression layer that sits between your code and any LLM API. You call it once at startup — everything else stays unchanged. It intercepts every API call, compresses what it can, and forwards the request. No proxy, no routing change, no new infrastructure.
It has three independent compression layers:
Layer 1 — Prompt compression
Replaces verbose patterns with compact equivalents before the text reaches the model. "You are an expert Python developer. Please write a function..." becomes "§EXP py developer. ACT write FN...". Runs locally in under 1ms. ~20% savings on typical prompts.
Layer 2 — Context engine
As a conversation grows, old turns get expensive. Promptolian summarises older messages and keeps only the most relevant recent turns — using a layout that works with how LLMs weight context. Up to 52.9% savings on long sessions.
Layer 3 — Tool schema compiler
This is the one that surprised us. It works in two phases:
Call 1 — compact DSL
Instead of the full JSON, the model receives a function-signature format:
search_web(query: str, max_results: int = 10) # Search the web for recent information
read_file(path: str, encoding: str = utf-8) # Read a local file
call_api(url: str, method: GET|POST, body: str) # HTTP request
Same information. About 40 tokens instead of 120. ~69% smaller.
Call 2 onward — reference only
The model already saw the full definitions on call 1. They're in the conversation context. From call 2, you can send:
TOOLS:[search_web,read_file,call_api]
~3 tokens. 97% smaller.
The model understands this because the definitions are in its context window from the previous turn. It knows what search_web does. You don't need to re-explain it.
All three layers are deterministic — no LLM calls, no data sent anywhere, sub-millisecond latency. The tool is open source and self-hostable.
Benchmark results across 20 prompt types
We ran our prompt compression layer against 20 real-world prompts (system prompts, user instructions, domain-specific text):
| Tier | Median CR | Mean CR | Range |
|---|---|---|---|
| Standard | 20.2% | 23.6% | 10–50% |
| Pro | 21.9% | 24.3% | 10–50% |
| Developer | 21.9% | 24.3% | 10–50% |
Verbose prompts (filler words, hedging language) compress 30–36%. Technical system prompts compress less (10–15%) because they're already dense. Short prompts can hit 40–50% but the absolute saving is smaller.
100% fact preservation across all 41 runs — numbers, file paths, named entities came through unchanged every time.
Combined savings: a real example
Agent setup: 10 tools, 2,000 calls/day, average 800-token system prompt, 5-turn sessions.
Without Promptolian:
- Tool schemas: 1,200 tok × 2,000 = 2.4M tok/day
- System prompt: 800 tok × 2,000 = 1.6M tok/day
- Total: 4M tok/day = ~$360/month
With Promptolian (session avg):
- Tool schemas: ~84 tok × 2,000 = 168k tok/day (93% saved)
- System prompt: ~620 tok × 2,000 = 1.24M tok/day (22% saved)
- Total: 1.41M tok/day = ~$127/month
Monthly saving: ~$233. Annual: ~$2,800. On a $19/month tool.
How to try it
# Install
pip install promptolian
# One line to compress every Anthropic call
from promptolian import patch_anthropic
patch_anthropic()
# Your existing code unchanged
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
system="You are an expert Python developer...", # compressed automatically
messages=[...],
max_tokens=1000,
)
# Check savings
from promptolian import get_stats
print(get_stats().summary())
# → 47 calls · 18,432 tok saved · 22.1% CR
For Claude Code users:
promptolian mcp install # adds to ~/.claude/settings.json
# restart Claude Code — done
Tool schema compression via the API:
curl -X POST https://api.promptolian.com/compress-tools \
-H "Content-Type: application/json" \
-d '{"tools": [...], "session_id": "my-session-1"}'
What we didn't solve
Being honest:
- The 3-token trick only works when definitions are in context. If you're running very long sessions where old turns get truncated, turn-2+ savings shrink.
- Prompt compression is rule-based, not neural. It works well on verbose/instructional text. Technical dense text compresses less.
- No OpenAI Responses API support yet — just the Chat Completions endpoint.
- The self-hosted server requires the repo — pip install alone doesn't bundle the full engine yet.
Open questions we'd love feedback on
- What's your typical tool count per agent?
- Do you use prompt caching today? Does it actually hit in practice?
- Would you pay for usage-based pricing (per token saved) vs flat monthly?
The full benchmark methodology and raw data are at promptolian.com/benchmarks.
Source: github.com/Maurizio-L/promptolian-public
Built by Maurizio Lospi — maurizio.lospi@gmail.com. Feedback welcome — especially if your numbers look different from mine.
Top comments (0)