Mukunda Rao Katta

Posted on May 25

Track LLM Cost Per User Request in Your Production Agent

#hermeschallenge #ai #python #agents

You ship your agent. It works. Then the bill arrives.

You have a total token count from the provider dashboard. You have no idea which users, which prompts, or which tool calls drove it. You're flying blind.

This post shows how to track cost at the request level, not just the account level. You'll know your p50, p95, and p99 cost per request. You'll know which users cost 10x the average. You'll know whether your prompt cache is actually working.

What You Need to Measure

Three numbers matter for per-request cost attribution:

Input tokens. Charged per token. Includes the system prompt, conversation history, and tool schemas.
Output tokens. More expensive per token than input. Driven by response length and tool use.
Cache hit ratio. Anthropic charges 10% of the input token price for cache reads. If your system prompt is large and cache hits are low, you're paying the full input price, or the higher cache-write price, on almost every call.

agenttrace captures the first two. cachebench captures the third.

Setting Up agenttrace

agenttrace wraps each LLM call and records the token totals and latency.

from agenttrace import Tracer
import anthropic

client = anthropic.Anthropic()
tracer = Tracer(store_path="~/.myagent/traces.jsonl")

def call_llm(messages, user_id: str, request_id: str):
    with tracer.trace(tags={"user_id": user_id, "request_id": request_id}) as span:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages,
        )

        # The cache fields can be missing or None when caching is not in play,
        # so fall back to 0 in both cases.
        span.record(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            cache_creation_tokens=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
        )

    return response

Each trace is a line in the JSONL file. The file is append-only. Tags let you group by any dimension.
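To make the storage format concrete, here is a minimal sketch of an append-only JSONL trace store using only the standard library. The record layout (a tags dict plus flat token fields) is an assumption inferred from how traces are read later in this post; agenttrace's actual on-disk schema may differ, and append_trace and load_traces are hypothetical helpers, not part of the library.

```python
import json
import os
import tempfile


def append_trace(path, record):
    """Append one trace record as a single JSON line (old lines are never rewritten)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_traces(path):
    """Read every non-empty line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record layout: tags plus token counts.
demo_path = os.path.join(tempfile.mkdtemp(), "traces.jsonl")
append_trace(demo_path, {
    "tags": {"user_id": "u1", "request_id": "r1"},
    "input_tokens": 1200,
    "output_tokens": 300,
})
append_trace(demo_path, {
    "tags": {"user_id": "u2", "request_id": "r2"},
    "input_tokens": 800,
    "output_tokens": 150,
})

traces = load_traces(demo_path)
u1_traces = [t for t in traces if t["tags"]["user_id"] == "u1"]
print(len(traces), len(u1_traces))
```

Because every record carries its own tags, grouping by a new dimension later only needs a different filter, not a schema change.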

Computing USD Cost

Anthropic's pricing as of mid-2026 for claude-sonnet-4-6:

Input: $3.00 per million tokens
Output: $15.00 per million tokens
Cache reads: $0.30 per million tokens
Cache writes: $3.75 per million tokens

PRICING = {
    "claude-sonnet-4-6": {
        "input": 3.00 / 1_000_000,
        "output": 15.00 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
        "cache_write": 3.75 / 1_000_000,
    }
}

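def compute_cost(usage, model="claude-sonnet-4-6"):
    """Return the USD cost of one response from its usage object."""
    p = PRICING[model]
    # The cache fields can be missing or None, so normalize both cases to 0.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * p["input"]
        + usage.output_tokens * p["output"]
        + cache_read * p["cache_read"]
        + cache_write * p["cache_write"]
    )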
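A worked example helps show how much the cache-read price matters. The sketch below applies the price list above to two made-up usage objects: one where a 50k-token static prefix is served from cache, and one where the same prefix is billed as plain input. The token counts are illustrative, and SimpleNamespace stands in for the real usage object.

```python
from types import SimpleNamespace

# Per-token prices copied from the price list above (USD).
PRICE = {
    "input": 3.00 / 1_000_000,
    "output": 15.00 / 1_000_000,
    "cache_read": 0.30 / 1_000_000,
    "cache_write": 3.75 / 1_000_000,
}


def cost_usd(usage):
    """Sum the four billed components of one response."""
    return (
        usage.input_tokens * PRICE["input"]
        + usage.output_tokens * PRICE["output"]
        + usage.cache_read_input_tokens * PRICE["cache_read"]
        + usage.cache_creation_input_tokens * PRICE["cache_write"]
    )


# Illustrative request: the 50k-token static prefix is read from cache.
cached = SimpleNamespace(
    input_tokens=12_000,
    output_tokens=800,
    cache_read_input_tokens=50_000,
    cache_creation_input_tokens=0,
)
# Same request with no caching: the prefix is billed as plain input.
uncached = SimpleNamespace(
    input_tokens=62_000,
    output_tokens=800,
    cache_read_input_tokens=0,
    cache_creation_input_tokens=0,
)

print(f"cached:   ${cost_usd(cached):.4f}")    # $0.0630
print(f"uncached: ${cost_usd(uncached):.4f}")  # $0.1980
```

With these numbers the cached request costs roughly a third of the uncached one, which is why the cache hit ratio is worth tracking alongside raw token counts.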

Store the cost alongside the token counts in your trace:

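span.record(
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    # Keep the cache counts so the cache report still has data to work with.
    cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
    cache_creation_tokens=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
    cost_usd=compute_cost(response.usage),
)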

Analyzing Traces with agenttrace

Once you have traces, pull the report:

from agenttrace import Tracer
import statistics

tracer = Tracer(store_path="~/.myagent/traces.jsonl")
traces = tracer.load()

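costs = [t["cost_usd"] for t in traces if "cost_usd" in t]
if len(costs) < 2:
    raise SystemExit("Need at least two traced requests to compute percentiles")

# method="inclusive" keeps percentiles inside the observed range; the default
# "exclusive" method can extrapolate past the maximum on small samples.
p95 = statistics.quantiles(costs, n=20, method="inclusive")[18]
p99 = statistics.quantiles(costs, n=100, method="inclusive")[98]

print(f"Requests: {len(costs)}")
print(f"Total:    ${sum(costs):.4f}")
print(f"Mean:     ${statistics.mean(costs):.4f}")
print(f"P50:      ${statistics.median(costs):.4f}")
print(f"P95:      ${p95:.4f}")
print(f"P99:      ${p99:.4f}")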

Then group by user:

from collections import defaultdict

by_user = defaultdict(list)
for trace in traces:
    uid = trace.get("tags", {}).get("user_id", "unknown")
    if "cost_usd" in trace:
        by_user[uid].append(trace["cost_usd"])

# Sort users by total spend
ranked = sorted(by_user.items(), key=lambda x: sum(x[1]), reverse=True)
print("\nTop 10 users by cost:")
# user_costs avoids shadowing the global costs list from the report above
for uid, user_costs in ranked[:10]:
    print(f"  {uid}: ${sum(user_costs):.4f} across {len(user_costs)} requests")

This immediately tells you if one user drives 40% of your bill. It happens more often than you'd expect.
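To turn that observation into a check, you can compute each user's share of total spend and flag anyone above a threshold. This is a minimal sketch: it assumes trace dicts shaped like the ones in the loop above (a tags dict and a cost_usd field), the sample data is made up, and spend_share is a hypothetical helper.

```python
from collections import defaultdict


def spend_share(traces):
    """Return {user_id: fraction of total spend}, largest first."""
    totals = defaultdict(float)
    for t in traces:
        if "cost_usd" in t:
            uid = t.get("tags", {}).get("user_id", "unknown")
            totals[uid] += t["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {uid: cost / grand_total for uid, cost in ranked}


# Made-up traces for illustration.
sample = [
    {"tags": {"user_id": "alice"}, "cost_usd": 0.40},
    {"tags": {"user_id": "bob"}, "cost_usd": 0.35},
    {"tags": {"user_id": "alice"}, "cost_usd": 0.10},
    {"tags": {"user_id": "carol"}, "cost_usd": 0.15},
]

shares = spend_share(sample)
heavy = [uid for uid, share in shares.items() if share >= 0.40]
print(heavy)  # ['alice']
```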

Checking Cache Hit Ratio with cachebench

If you're using Anthropic's prompt caching, you need to know whether it's actually working.

from cachebench import CacheBench

bench = CacheBench(store_path="~/.myagent/traces.jsonl")
report = bench.report()

print(f"Cache hit ratio:    {report.hit_ratio:.1%}")
print(f"Cache miss ratio:   {report.miss_ratio:.1%}")
print(f"Savings from cache: ${report.saved_usd:.4f}")
print(f"Cost without cache: ${report.would_cost_usd:.4f}")

A healthy cache hit ratio for a system with a large static system prompt should be above 80% after warmup. If you're seeing 20%, your cache is being invalidated on most requests.
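If you want to sanity-check the reported number, you can derive a ratio directly from the token counts in your traces. The sketch below assumes one particular definition, cache-read tokens divided by all cacheable tokens (reads plus writes), and assumes the field names cache_read_tokens and cache_creation_tokens from the span.record call earlier. cachebench may define its ratio differently.

```python
def cache_hit_ratio(traces):
    """Fraction of cacheable prompt tokens that were served from cache."""
    read = sum(t.get("cache_read_tokens", 0) for t in traces)
    written = sum(t.get("cache_creation_tokens", 0) for t in traces)
    cacheable = read + written
    return read / cacheable if cacheable else 0.0


# Made-up traces: one cold request writes the cache, four warm ones read it.
sample = [{"cache_read_tokens": 0, "cache_creation_tokens": 50_000}]
sample += [{"cache_read_tokens": 50_000, "cache_creation_tokens": 0}] * 4

ratio = cache_hit_ratio(sample)
print(f"{ratio:.1%}")  # 80.0%
```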

Common causes of low cache hit ratio:

Dynamic content in the system prompt (timestamps, user names, request IDs)
Inconsistent message ordering between requests
Not marking the cacheable portion with cache_control

The fix is to separate static from dynamic content:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_CONTEXT,  # ~50k tokens, same every request
                "cache_control": {"type": "ephemeral"},  # mark for caching
            },
            {
                "type": "text",
                "text": f"User {user_id} asks: {user_input}",  # dynamic, not cached
            },
        ],
    }
]

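Everything up to and including the block marked with cache_control must be identical byte-for-byte between requests. Any difference in that prefix produces a cache miss, and the request pays the cache-write price again.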
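One way to catch accidental drift is to fingerprint the exact text you send as the cached block and record the fingerprint with each trace. This is a sketch, not a feature of any of the libraries here: prefix_fingerprint is a hypothetical helper, and the context strings are made up.

```python
import hashlib


def prefix_fingerprint(text):
    """Short SHA-256 digest of the exact bytes sent as the cached block."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


static_context = "You are a support agent. Policy document v3 follows..."
drifting_context = static_context + " Generated at 2026-05-25T10:00:00Z"

a = prefix_fingerprint(static_context)
b = prefix_fingerprint(static_context)
c = prefix_fingerprint(drifting_context)

print(a == b)  # True: identical bytes, same fingerprint
print(a == c)  # False: a timestamp crept into the cached block
```

If the fingerprint is stored as a trace tag, a count of distinct fingerprints per day shows quickly whether the supposedly static block is changing.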

Setting a Per-Request Budget Gate

Once you know your p95 cost, you can set a pre-flight gate that rejects requests likely to exceed budget.

from llm_cost_cap import CostCap

cap = CostCap(max_usd=0.10)  # $0.10 per request

def safe_call(messages, tools):
    estimate = cap.estimate(
        model="claude-sonnet-4-6",
        messages=messages,
        tools=tools,
    )

    if estimate.exceeds_cap:
        raise ValueError(
            f"Estimated cost ${estimate.usd:.4f} exceeds cap ${cap.max_usd:.4f}"
        )

    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )

The estimate is based on token count approximation, not an API call. It runs in microseconds.
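To show what a token count approximation can look like, here is a rough sketch using a characters-per-token heuristic. This is not how llm-cost-cap works internally (the post does not say); the 4-characters-per-token ratio is a common rule of thumb for English text and an assumption here, and estimate_cost is a hypothetical helper.

```python
import json

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption
INPUT_PRICE = 3.00 / 1_000_000    # USD per input token, from the pricing section
OUTPUT_PRICE = 15.00 / 1_000_000  # USD per output token, from the pricing section


def estimate_cost(messages, tools=None, max_tokens=2048):
    """Pessimistic estimate: approximate input tokens plus a reply that uses all of max_tokens."""
    payload = json.dumps({"messages": messages, "tools": tools or []})
    input_tokens = len(payload) // CHARS_PER_TOKEN
    return input_tokens * INPUT_PRICE + max_tokens * OUTPUT_PRICE


# A made-up request with roughly 10k tokens of input.
messages = [{"role": "user", "content": "x" * 40_000}]
estimate = estimate_cost(messages)
print(f"${estimate:.4f}", "over cap" if estimate > 0.10 else "under cap")
```

Assuming the full max_tokens for output makes the gate conservative: it may reject a request that would have been cheap, but it should rarely admit one that blows the cap.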

What This Does NOT Do

These tools measure cost after the fact (or approximate it before). They do not block runaway agents in real time. If an agent enters an infinite tool loop and makes 1000 calls before you notice, the traces will show it but won't stop it. For that, pair with tool-loop-guard or agent-deadline.

The USD estimates use hardcoded pricing. If Anthropic changes prices, you update the constants. The libraries do not pull live pricing from the API.

Per-request attribution requires you to pass a user_id or request_id in your trace tags. If you don't tag your traces, you get aggregate numbers only.

Design Notes

Storing traces as JSONL has the same advantage as conversation logs: you can grep and inspect without a running database. When a user reports "why did I get billed so much this week?", you open the trace file, filter by user ID, and sum the costs. It takes 30 seconds.

The token count in response.usage is exact. Use it, not the estimate. The estimate is only for pre-flight checks before you make the call.

Cache hit tracking requires reading cache_read_input_tokens from the usage object. Depending on the SDK version and whether caching is in use, this field may be missing or None. Use getattr(usage, "cache_read_input_tokens", 0) or 0 so you don't crash on requests with no cache reads.

When This Applies

This pattern fits agents that:

Serve multiple users and need per-user cost attribution
Have a system prompt larger than 1024 tokens (caching is worthwhile)
Need to enforce per-request cost limits
Are running in production and need cost visibility

Skip per-request tracing for:

Single-user personal tools where you just watch the dashboard
Batch jobs where aggregate cost is the metric that matters

Quick Start

pip install agenttrace cachebench llm-cost-cap

No shared configuration. Each library reads from your trace JSONL path.

Related Libraries

Library	What It Does	Language
`agenttrace`	Per-run cost + latency tracing with tags	Python
`cachebench`	Prompt cache hit ratio and savings reporting	Python
`llm-cost-cap`	Pre-flight USD cost gate before LLM call	Python
`token-budget-py`	Concurrent token/USD budget across agent runs	Python
`llm-budget-window`	Time-windowed token/USD budget enforcement	Python
`claude-cost`	Anthropic cost computation crate	Rust

What's Next

Once you have per-request cost data, the next step is anomaly detection. Which requests are more than 3x the mean? Those are usually bugs: agents stuck in a loop, tool results that are unexpectedly large, or prompts that grew without you noticing.
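The 3x-the-mean rule is simple enough to sketch directly. The example below assumes trace dicts with a cost_usd field and a request_id tag, as used earlier in this post; the sample data is made up and flag_outliers is a hypothetical helper.

```python
import statistics


def flag_outliers(traces, factor=3.0):
    """Return traces whose cost exceeds `factor` times the mean cost."""
    priced = [t for t in traces if "cost_usd" in t]
    if not priced:
        return []
    threshold = factor * statistics.mean(t["cost_usd"] for t in priced)
    return [t for t in priced if t["cost_usd"] > threshold]


# Made-up traces: nine ordinary requests and one runaway.
sample = [{"tags": {"request_id": f"r{i}"}, "cost_usd": 0.02} for i in range(9)]
sample.append({"tags": {"request_id": "r9"}, "cost_usd": 0.90})

outliers = flag_outliers(sample)
print([t["tags"]["request_id"] for t in outliers])  # ['r9']
```

Note that a single huge outlier also raises the mean, so on small samples a median-based threshold may be more robust.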

driftvane does drift detection over metric streams. Point it at your per-request cost trace and it will flag when cost distribution shifts outside normal bounds.

For budget enforcement across concurrent agent runs (not just per-request), token-budget-py gives you a shared pool that multiple coroutines can draw from and return to.
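The shared-pool idea can be sketched with an asyncio.Lock. This is an illustration of the pattern only, not token-budget-py's API: BudgetPool, reserve, and refund are hypothetical names, and asyncio.sleep(0) stands in for the real LLM call.

```python
import asyncio


class BudgetPool:
    """Shared USD budget that coroutines reserve from and refund to."""

    def __init__(self, total_usd):
        self.remaining = total_usd
        self._lock = asyncio.Lock()

    async def reserve(self, usd):
        """Take `usd` from the pool, or return False if it would overdraw."""
        async with self._lock:
            if usd > self.remaining:
                return False
            self.remaining -= usd
            return True

    async def refund(self, usd):
        """Return the unused part of a reservation."""
        async with self._lock:
            self.remaining += usd


async def run_agent(pool, reserved, actual):
    """Reserve a worst-case amount, do the work, then refund what was not spent."""
    if not await pool.reserve(reserved):
        return "rejected"
    await asyncio.sleep(0)  # stand-in for the real LLM call
    await pool.refund(reserved - actual)
    return "ok"


async def main():
    pool = BudgetPool(total_usd=1.00)
    results = await asyncio.gather(
        run_agent(pool, 0.50, 0.25),
        run_agent(pool, 0.50, 0.25),
        run_agent(pool, 0.50, 0.25),
    )
    return results, pool.remaining


results, remaining = asyncio.run(main())
print(results, remaining)
```

The third run is rejected because the first two have reserved the whole pool before any refund arrives. Reserving the worst case up front and refunding afterwards keeps concurrent runs from overspending together.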
