You ship your agent. It works. Then the bill arrives.
You have a total token count from the provider dashboard. You have no idea which users, which prompts, or which tool calls drove it. You're flying blind.
This post shows how to track cost at the request level, not just the account level. You'll know your p50, p95, and p99 cost per request. You'll know which users cost 10x the average. You'll know whether your prompt cache is actually working.
What You Need to Measure
Three numbers matter for per-request cost attribution:
- Input tokens. Charged per token. Includes the system prompt, conversation history, and tool schemas.
- Output tokens. More expensive per token than input. Driven by response length and tool use.
- Cache hit ratio. Anthropic charges 10% of the input token price for cache reads. If your system prompt is large and cache hits are low, you're paying full price for every call.
agenttrace captures the first two. cachebench captures the third.
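To make the cache line item concrete, here is a rough calculation using the Sonnet prices listed later in this post. The 50,000-token prompt and the 1,000 calls are illustrative assumptions, not measurements, and the sketch ignores the cache-write surcharge.

```python
# Illustrative arithmetic only: prompt size and call count are assumptions.
INPUT_PER_TOKEN = 3.00 / 1_000_000       # full input price
CACHE_READ_PER_TOKEN = 0.30 / 1_000_000  # 10% of the input price


def prompt_cost(prompt_tokens: int, calls: int, hit_ratio: float) -> float:
    """Cost of re-sending a static prompt, ignoring cache-write surcharges."""
    hits = calls * hit_ratio
    misses = calls - hits
    return prompt_tokens * (hits * CACHE_READ_PER_TOKEN + misses * INPUT_PER_TOKEN)


for ratio in (0.0, 0.2, 0.8):
    print(f"hit ratio {ratio:.0%}: ${prompt_cost(50_000, 1_000, ratio):.2f}")
```

Under these assumptions the same prompt costs about $150 with no cache hits and about $42 at an 80% hit ratio.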
Setting Up agenttrace
agenttrace wraps each LLM call and records the token totals and latency.
from agenttrace import Tracer
import anthropic
client = anthropic.Anthropic()
tracer = Tracer(store_path="~/.myagent/traces.jsonl")
def call_llm(messages, user_id: str, request_id: str):
    with tracer.trace(tags={"user_id": user_id, "request_id": request_id}) as span:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages,
        )
        span.record(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            # Cache fields can be missing or None; normalize both cases to 0.
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            cache_creation_tokens=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
        )
    return response
Each trace is a line in the JSONL file. The file is append-only. Tags let you group by any dimension.
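As a sketch of what "group by any dimension" can look like, the helper below sums one recorded field per tag value using only the standard library. The record layout (recorded fields at the top level, tags under a "tags" key) is an assumption based on how the later snippets in this post read traces. The "feature" tag is invented for illustration.

```python
from collections import defaultdict


def total_by_tag(traces, tag, field):
    """Sum one numeric field per tag value; untagged traces land under 'unknown'."""
    totals = defaultdict(int)
    for trace in traces:
        key = trace.get("tags", {}).get(tag, "unknown")
        totals[key] += trace.get(field) or 0
    return dict(totals)


# Hypothetical records; "feature" is an invented tag for illustration.
sample_traces = [
    {"tags": {"user_id": "u1", "feature": "search"}, "output_tokens": 100},
    {"tags": {"user_id": "u2", "feature": "search"}, "output_tokens": 200},
    {"tags": {"user_id": "u1", "feature": "summarize"}, "output_tokens": 900},
]

print(total_by_tag(sample_traces, "feature", "output_tokens"))
print(total_by_tag(sample_traces, "user_id", "output_tokens"))
```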
Computing USD Cost
Anthropic's pricing as of mid-2026 for claude-sonnet-4-6:
- Input: $3.00 per million tokens
- Output: $15.00 per million tokens
- Cache reads: $0.30 per million tokens
- Cache writes: $3.75 per million tokens
PRICING = {
    "claude-sonnet-4-6": {
        "input": 3.00 / 1_000_000,
        "output": 15.00 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
        "cache_write": 3.75 / 1_000_000,
    }
}

def compute_cost(usage, model="claude-sonnet-4-6"):
    p = PRICING[model]
    # Cache fields can be missing or None; "or 0" covers both cases.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * p["input"]
        + usage.output_tokens * p["output"]
        + cache_read * p["cache_read"]
        + cache_write * p["cache_write"]
    )
Store the cost alongside the token counts in your trace:
span.record(
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    cost_usd=compute_cost(response.usage),
)
Analyzing Traces with agenttrace
Once you have traces, pull the report:
from agenttrace import Tracer
import statistics
tracer = Tracer(store_path="~/.myagent/traces.jsonl")
traces = tracer.load()
costs = [t["cost_usd"] for t in traces if "cost_usd" in t]
print(f"Requests: {len(costs)}")
print(f"Total: ${sum(costs):.4f}")
print(f"Mean: ${statistics.mean(costs):.4f}")
print(f"P50: ${statistics.median(costs):.4f}")
print(f"P95: ${statistics.quantiles(costs, n=20)[18]:.4f}")
print(f"P99: ${statistics.quantiles(costs, n=100)[98]:.4f}")
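One caveat: statistics.quantiles interpolates between data points, and on Python versions before 3.13 it raises StatisticsError when given fewer than two values. For small samples, a nearest-rank percentile is a possible alternative. The helper below is a minimal sketch and not part of agenttrace.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so very small q still returns the minimum.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


sample_costs = [0.01, 0.02, 0.02, 0.03, 0.50]
print(f"P50: ${percentile(sample_costs, 50):.4f}")
print(f"P95: ${percentile(sample_costs, 95):.4f}")
print(f"P99 of one request: ${percentile([0.04], 99):.4f}")
```

Nearest-rank always returns a cost that actually occurred, which makes it easy to go find the matching trace.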
Then group by user:
from collections import defaultdict

by_user = defaultdict(list)
for trace in traces:
    uid = trace.get("tags", {}).get("user_id", "unknown")
    if "cost_usd" in trace:
        by_user[uid].append(trace["cost_usd"])

# Sort users by total spend
ranked = sorted(by_user.items(), key=lambda x: sum(x[1]), reverse=True)
print("\nTop 10 users by cost:")
# user_costs avoids overwriting the global costs list from the report above
for uid, user_costs in ranked[:10]:
    print(f"  {uid}: ${sum(user_costs):.4f} across {len(user_costs)} requests")
This immediately tells you if one user drives 40% of your bill. It happens more often than you'd expect.
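To turn that observation into a check, the sketch below flags users who either account for a large share of total spend or whose mean request cost is a multiple of the overall mean. It takes the same user-to-costs mapping built above. The 40% and 10x thresholds are placeholders taken from the examples in this post, not recommendations.

```python
def flag_heavy_users(by_user, share_threshold=0.40, mean_multiple=10.0):
    """Return (user_id, share_of_total, mean_vs_overall) for users over either threshold."""
    total = sum(sum(costs) for costs in by_user.values())
    count = sum(len(costs) for costs in by_user.values())
    if total == 0 or count == 0:
        return []
    overall_mean = total / count
    flagged = []
    for uid, costs in by_user.items():
        if not costs:
            continue
        share = sum(costs) / total
        ratio = (sum(costs) / len(costs)) / overall_mean
        if share >= share_threshold or ratio >= mean_multiple:
            flagged.append((uid, share, ratio))
    # Largest share of the bill first.
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Hypothetical per-user cost lists in USD.
sample_by_user = {"u1": [0.01] * 50, "u2": [0.02] * 20, "u3": [0.60, 0.70]}
for uid, share, ratio in flag_heavy_users(sample_by_user):
    print(f"{uid}: {share:.0%} of spend, {ratio:.1f}x the mean request cost")
```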
Checking Cache Hit Ratio with cachebench
If you're using Anthropic's prompt caching, you need to know whether it's actually working.
from cachebench import CacheBench
bench = CacheBench(store_path="~/.myagent/traces.jsonl")
report = bench.report()
print(f"Cache hit ratio: {report.hit_ratio:.1%}")
print(f"Cache miss ratio: {report.miss_ratio:.1%}")
print(f"Savings from cache: ${report.saved_usd:.4f}")
print(f"Cost without cache: ${report.would_cost_usd:.4f}")
A healthy cache hit ratio for a system with a large static system prompt should be above 80% after warmup. If you're seeing 20%, your cache is being invalidated on most requests.
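If you want to sanity-check the number yourself, one possible definition is the token-weighted ratio of cache reads to all cacheable tokens (reads plus writes). This is an assumption about how to define the ratio, not necessarily how cachebench computes it. The field names match the ones recorded in the tracing snippet earlier.

```python
def token_hit_ratio(traces):
    """Share of cacheable tokens that were served from cache rather than written to it."""
    read = sum(t.get("cache_read_tokens") or 0 for t in traces)
    written = sum(t.get("cache_creation_tokens") or 0 for t in traces)
    cacheable = read + written
    return read / cacheable if cacheable else 0.0


# Hypothetical traces: one cache write during warmup, then four reads.
warm = [{"cache_creation_tokens": 50_000, "cache_read_tokens": 0}]
warm += [{"cache_creation_tokens": 0, "cache_read_tokens": 50_000}] * 4
print(f"Token hit ratio: {token_hit_ratio(warm):.1%}")
```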
Common causes of low cache hit ratio:
- Dynamic content in the system prompt (timestamps, user names, request IDs)
- Inconsistent message ordering between requests
- Not marking the cacheable portion with cache_control
The fix is to separate static from dynamic content:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_CONTEXT,  # ~50k tokens, same every request
                "cache_control": {"type": "ephemeral"},  # mark for caching
            },
            {
                "type": "text",
                "text": f"User {user_id} asks: {user_input}",  # dynamic, not cached
            },
        ],
    }
]
The cached portion must be identical byte-for-byte between requests. Any difference invalidates the cache entry.
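One way to catch accidental drift is to fingerprint the supposedly static text on every request and count how many distinct fingerprints show up. The helper names below are hypothetical. This is a debugging sketch, not a feature of any library mentioned here.

```python
import hashlib
from collections import Counter


def prefix_fingerprint(text: str) -> str:
    """Short, stable hash of the cacheable text; any byte change produces a new value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def count_prefix_variants(prefixes):
    """Map each distinct fingerprint to the number of requests that used it."""
    return Counter(prefix_fingerprint(p) for p in prefixes)


static_text = "You are a support agent. Follow the policies below."
stable = [static_text, static_text, static_text]
drifting = [static_text + " Time: 10:00", static_text + " Time: 10:01"]

print(len(count_prefix_variants(stable)), "variant(s) in the stable prefix")
print(len(count_prefix_variants(drifting)), "variant(s) in the drifting prefix")
```

More than one variant over a short window suggests something dynamic leaked into the cached portion. Recording the fingerprint as a trace tag would let you correlate it with cache misses.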
Setting a Per-Request Budget Gate
Once you know your p95 cost, you can set a pre-flight gate that rejects requests likely to exceed budget.
from llm_cost_cap import CostCap
cap = CostCap(max_usd=0.10) # $0.10 per request
def safe_call(messages, tools):
    estimate = cap.estimate(
        model="claude-sonnet-4-6",
        messages=messages,
        tools=tools,
    )
    if estimate.exceeds_cap:
        raise ValueError(
            f"Estimated cost ${estimate.usd:.4f} exceeds cap ${cap.max_usd:.4f}"
        )
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )
The estimate is based on token count approximation, not an API call. It runs in microseconds.
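For intuition about what a token count approximation can look like, here is a deliberately crude sketch: roughly four characters per token for the serialized input, plus the assumption that the reply uses the full max_tokens. This is not llm-cost-cap's actual method. The constant and the function name are assumptions for illustration.

```python
import json

CHARS_PER_TOKEN = 4               # rough rule of thumb for English text (assumption)
INPUT_PRICE = 3.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # USD per output token


def estimate_cost_usd(messages, tools=None, max_tokens=2048):
    """Pessimistic pre-flight estimate: serialized input size plus a full-length reply."""
    payload = json.dumps({"messages": messages, "tools": tools or []})
    input_tokens = len(payload) // CHARS_PER_TOKEN + 1
    return input_tokens * INPUT_PRICE + max_tokens * OUTPUT_PRICE


short = [{"role": "user", "content": "x" * 4_000}]
long = [{"role": "user", "content": "x" * 400_000}]
print(f"short prompt: ${estimate_cost_usd(short):.4f}")
print(f"long prompt:  ${estimate_cost_usd(long):.4f}")
```

Because no network call is involved, a check like this is cheap enough to run before every request. The trade-off is accuracy, which is why the exact counts in response.usage remain the source of truth afterwards.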
What This Does NOT Do
These tools measure cost after the fact (or approximate it before). They do not block runaway agents in real time. If an agent enters an infinite tool loop and makes 1000 calls before you notice, the traces will show it but won't stop it. For that, pair with tool-loop-guard or agent-deadline.
The USD estimates use hardcoded pricing. If Anthropic changes prices, you update the constants. The libraries do not pull live pricing from the API.
Per-request attribution requires you to pass a user_id or request_id in your trace tags. If you don't tag your traces, you get aggregate numbers only.
Design Notes
Storing traces as JSONL has the same advantage as conversation logs: you can grep and inspect without a running database. When a user reports "why did I get billed so much this week?", you open the trace file, filter by user ID, and sum the costs. It takes 30 seconds.
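That lookup can be sketched with the standard library alone. The record shape (tags.user_id and cost_usd) is assumed from the earlier snippets. The demo writes a throwaway file so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def user_total(path, user_id):
    """Sum cost_usd over every trace line tagged with the given user_id."""
    total = 0.0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in an append-only file
            trace = json.loads(line)
            if trace.get("tags", {}).get("user_id") == user_id:
                total += trace.get("cost_usd", 0.0)
    return total


# Hypothetical trace records written to a temporary JSONL file.
records = [
    {"tags": {"user_id": "u1"}, "cost_usd": 0.02},
    {"tags": {"user_id": "u2"}, "cost_usd": 0.01},
    {"tags": {"user_id": "u1"}, "cost_usd": 0.05},
]
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "traces.jsonl"
    trace_path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    demo_total = user_total(trace_path, "u1")

print(f"u1 total: ${demo_total:.4f}")
```

Restricting the sum to one week would need whatever timestamp field your traces carry, which this sketch does not assume.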
The token count in response.usage is exact. Use it, not the estimate. The estimate is only for pre-flight checks before you make the call.
Cache hit tracking requires reading cache_read_input_tokens from the usage object. Depending on the SDK version and the request, this field can be missing or set to None when no cache reads occurred. Use getattr(usage, "cache_read_input_tokens", 0) or 0 so the arithmetic never sees a missing attribute or a None.
When This Applies
This pattern fits agents that:
- Serve multiple users and need per-user cost attribution
- Have a system prompt larger than 1024 tokens (caching is worthwhile)
- Need to enforce per-request cost limits
- Are running in production and need cost visibility
Skip per-request tracing for:
- Single-user personal tools where you just watch the dashboard
- Batch jobs where aggregate cost is the metric that matters
Quick Start
pip install agenttrace cachebench llm-cost-cap
No shared configuration. Each library reads from your trace JSONL path.
Related Libraries
| Library | What It Does | Language |
|---|---|---|
| agenttrace | Per-run cost + latency tracing with tags | Python |
| cachebench | Prompt cache hit ratio and savings reporting | Python |
| llm-cost-cap | Pre-flight USD cost gate before LLM call | Python |
| token-budget-py | Concurrent token/USD budget across agent runs | Python |
| llm-budget-window | Time-windowed token/USD budget enforcement | Python |
| claude-cost | Anthropic cost computation crate | Rust |
What's Next
Once you have per-request cost data, the next step is anomaly detection. Which requests are more than 3x the mean? Those are usually bugs: agents stuck in a loop, tool results that are unexpectedly large, or prompts that grew without you noticing.
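A first pass at that question needs nothing beyond the trace list. The sketch below flags requests above a multiple of the mean. The record shape is assumed from the earlier snippets, and the sample data is made up.

```python
import statistics


def cost_outliers(traces, multiple=3.0):
    """Return traces whose cost_usd exceeds `multiple` times the mean cost."""
    costs = [t["cost_usd"] for t in traces if "cost_usd" in t]
    if not costs:
        return []
    threshold = multiple * statistics.mean(costs)
    return [t for t in traces if t.get("cost_usd", 0.0) > threshold]


# Hypothetical traces: nine ordinary requests and one expensive one.
sample_traces = [{"tags": {"request_id": f"r{i}"}, "cost_usd": 0.01} for i in range(9)]
sample_traces.append({"tags": {"request_id": "r9"}, "cost_usd": 0.50})

for trace in cost_outliers(sample_traces):
    print(trace["tags"]["request_id"], f"${trace['cost_usd']:.2f}")
```

The mean is itself pulled upward by outliers, so a multiple of the median may be a more stable baseline once your data gets noisy.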
driftvane does drift detection over metric streams. Point it at your per-request cost trace and it will flag when cost distribution shifts outside normal bounds.
For budget enforcement across concurrent agent runs (not just per-request), token-budget-py gives you a shared pool that multiple coroutines can draw from and return to.
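The shared-pool idea can be illustrated with plain asyncio. The class below is a hypothetical sketch of the pattern (reserve before the call, refund what was not spent). It is not token-budget-py's API.

```python
import asyncio


class BudgetPool:
    """Shared USD budget: reserve before a call, refund the unspent part afterwards."""

    def __init__(self, total_usd: float):
        self._remaining = total_usd
        self._lock = asyncio.Lock()

    async def reserve(self, usd: float) -> bool:
        async with self._lock:
            if usd > self._remaining:
                return False
            self._remaining -= usd
            return True

    async def refund(self, usd: float) -> None:
        async with self._lock:
            self._remaining += usd

    @property
    def remaining(self) -> float:
        return self._remaining


async def worker(pool: BudgetPool, reserve_usd: float, actual_usd: float) -> str:
    if not await pool.reserve(reserve_usd):
        return "rejected"
    await asyncio.sleep(0)  # stand-in for the LLM call; yields to other coroutines
    await pool.refund(reserve_usd - actual_usd)
    return "ok"


async def main():
    pool = BudgetPool(total_usd=0.25)
    outcomes = await asyncio.gather(*(worker(pool, 0.10, 0.04) for _ in range(3)))
    return outcomes, pool.remaining


results, remaining = asyncio.run(main())
print(results, f"remaining: ${remaining:.2f}")
```

In this demo the third worker is rejected because the first two reservations are still outstanding when it asks, even though the refunds later free up budget. Reserving pessimistically and refunding afterwards keeps the pool from being overdrawn while calls are in flight.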