Maurizio-L

Posted on May 23 • Originally published at promptolian.com

We measured what 10 tools 1,000 calls/day actually costs in AI agents

#ai #python #productivity #machinelearning

We measured what 10 tools × 1,000 calls/day actually costs. Here's the data.

Posted to r/ClaudeAI · r/LocalLLaMA · Hacker News

When you build an AI agent, you give it tools. Search the web. Read a file. Call an API. Query a database.

Each tool needs a description — a JSON block that tells the model what the tool does and what parameters it takes. Here's what a single tool looks like:

{
  "name": "search_web",
  "description": "Search the web for recent information",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query"
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum number of results to return"
      }
    },
    "required": ["query"]
  }
}

That single definition is about 80 tokens.

If your agent has 10 tools, you're sending ~800–1,200 tokens of tool definitions on every API call. Not once. Every call.

The actual numbers

We ran 1,000 simulated agent sessions across four agent sizes. Pricing at Claude Sonnet 4 input ($3 / 1M tokens).

Tools	Tokens / call	1k calls/day	Cost / month	Cost / year
5	~600	600k tok/day	$54	$657
10	~1,200	1.2M tok/day	$108	$1,314
20	~2,400	2.4M tok/day	$216	$2,628
50	~6,000	6M tok/day	$540	$6,570

At 10k calls/day (not unusual for a production agent), multiply those numbers by 10.

Why this doesn't go away

The obvious answer is: Anthropic has prompt caching. Use that.

Prompt caching helps, but:

Cached input tokens are still billed — at 10% of normal price. Not free.
Cache TTL is 5 minutes. If your sessions are longer than 5 minutes apart, you pay full price.
Cache invalidates on any change. If you add a tool, update a description, or rotate an API key in a tool — full price again.

So even with caching, you're paying for tool tokens. And most agents don't have caching set up at all.

What we built

Promptolian is a compression layer that sits between your code and any LLM API. You call it once at startup — everything else stays unchanged. It intercepts every API call, compresses what it can, and forwards the request. No proxy, no routing change, no new infrastructure.

It has three independent compression layers:

Layer 1 — Prompt compression
Replaces verbose patterns with compact equivalents before the text reaches the model. "You are an expert Python developer. Please write a function..." becomes "§EXP py developer. ACT write FN...". Runs locally in under 1ms. ~20% savings on typical prompts.

Layer 2 — Context engine
As a conversation grows, old turns get expensive. Promptolian summarises older messages and keeps only the most relevant recent turns — using a layout that works with how LLMs weight context. Up to 52.9% savings on long sessions.

Layer 3 — Tool schema compiler

This is the one that surprised us. It works in two phases:

Call 1 — compact DSL

Instead of the full JSON, the model receives a function-signature format:

search_web(query: str, max_results: int = 10)  # Search the web for recent information
read_file(path: str, encoding: str = utf-8)    # Read a local file
call_api(url: str, method: GET|POST, body: str)  # HTTP request

Same information. About 40 tokens instead of 120. ~69% smaller.

Call 2 onward — reference only

The model already saw the full definitions on call 1. They're in the conversation context. From call 2, you can send:

TOOLS:[search_web,read_file,call_api]

~3 tokens. 97% smaller.

The model understands this because the definitions are in its context window from the previous turn. It knows what search_web does. You don't need to re-explain it.

All three layers are deterministic — no LLM calls, no data sent anywhere, sub-millisecond latency. The tool is open source and self-hostable.

Benchmark results across 20 prompt types

We ran our prompt compression layer against 20 real-world prompts (system prompts, user instructions, domain-specific text):

Tier	Median CR	Mean CR	Range
Standard	20.2%	23.6%	10–50%
Pro	21.9%	24.3%	10–50%
Developer	21.9%	24.3%	10–50%

Verbose prompts (filler words, hedging language) compress 30–36%. Technical system prompts compress less (10–15%) because they're already dense. Short prompts can hit 40–50% but the absolute saving is smaller.

100% fact preservation across all 41 runs — numbers, file paths, named entities came through unchanged every time.

Combined savings: a real example

Agent setup: 10 tools, 2,000 calls/day, average 800-token system prompt, 5-turn sessions.

Without Promptolian:

Tool schemas: 1,200 tok × 2,000 = 2.4M tok/day
System prompt: 800 tok × 2,000 = 1.6M tok/day
Total: 4M tok/day = ~$360/month

With Promptolian (session avg):

Tool schemas: ~84 tok × 2,000 = 168k tok/day (93% saved)
System prompt: ~620 tok × 2,000 = 1.24M tok/day (22% saved)
Total: 1.41M tok/day = ~$127/month

Monthly saving: ~$233. Annual: ~$2,800. On a $19/month tool.

How to try it

# Install
pip install promptolian

# One line to compress every Anthropic call
from promptolian import patch_anthropic
patch_anthropic()

# Your existing code unchanged
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    system="You are an expert Python developer...",  # compressed automatically
    messages=[...],
    max_tokens=1000,
)

# Check savings
from promptolian import get_stats
print(get_stats().summary())
# → 47 calls · 18,432 tok saved · 22.1% CR

For Claude Code users:

promptolian mcp install   # adds to ~/.claude/settings.json
# restart Claude Code — done

Tool schema compression via the API:

curl -X POST https://api.promptolian.com/compress-tools \
  -H "Content-Type: application/json" \
  -d '{"tools": [...], "session_id": "my-session-1"}'

What we didn't solve

Being honest:

The 3-token trick only works when definitions are in context. If you're running very long sessions where old turns get truncated, turn-2+ savings shrink.
Prompt compression is rule-based, not neural. It works well on verbose/instructional text. Technical dense text compresses less.
No OpenAI Responses API support yet — just the Chat Completions endpoint.
The self-hosted server requires the repo — pip install alone doesn't bundle the full engine yet.