
Mukunda Rao Katta

Track LLM Cost Per User Request in Your Production Agent

You ship your agent. It works. Then the bill arrives.

You have a total token count from the provider dashboard. You have no idea which users, which prompts, or which tool calls drove it. You're flying blind.

This post shows how to track cost at the request level, not just the account level. You'll know your p50, p95, and p99 cost per request. You'll know which users cost 10x the average. You'll know whether your prompt cache is actually working.


What You Need to Measure

Three numbers matter for per-request cost attribution:

  1. Input tokens. Charged per token. Includes the system prompt, conversation history, and tool schemas.
  2. Output tokens. More expensive per token than input. Driven by response length and tool use.
  3. Cache hit ratio. Anthropic charges 10% of the input token price for cache reads. If your system prompt is large and cache hits are low, you're paying full price (or the higher cache-write price) on most calls.

agenttrace captures the first two. cachebench captures the third.
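To see why the third number matters, here is a rough worked example. It assumes the Sonnet input price of $3.00 per million tokens quoted later in this post and a hypothetical 50,000-token system prompt. `input_cost_usd` is an illustrative helper, not part of any library.

```python
# Illustrative arithmetic only: the price and prompt size are assumptions.
INPUT_PRICE_PER_MTOK = 3.00     # USD per million input tokens
CACHE_READ_MULTIPLIER = 0.10    # cache reads are billed at 10% of the input price

def input_cost_usd(tokens: int, cached: bool) -> float:
    """Cost of `tokens` prompt tokens, at full price or as a cache read."""
    price = INPUT_PRICE_PER_MTOK / 1_000_000
    if cached:
        price *= CACHE_READ_MULTIPLIER
    return tokens * price

prompt_tokens = 50_000
full = input_cost_usd(prompt_tokens, cached=False)
hit = input_cost_usd(prompt_tokens, cached=True)
print(f"Full price per call: ${full:.4f}")
print(f"Cache read per call: ${hit:.4f}")
```

At 1,000 calls a day, that is roughly $150 versus $15 for the prompt alone, which is why a low hit ratio shows up on the bill so quickly.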


Setting Up agenttrace

agenttrace wraps each LLM call and records the token totals and latency.

from agenttrace import Tracer
import anthropic

client = anthropic.Anthropic()
tracer = Tracer(store_path="~/.myagent/traces.jsonl")

def call_llm(messages, user_id: str, request_id: str):
    with tracer.trace(tags={"user_id": user_id, "request_id": request_id}) as span:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages,
        )

        span.record(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            # The cache fields can be missing or None depending on SDK version; "or 0" covers both.
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            cache_creation_tokens=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
        )

    return response

Each trace is a line in the JSONL file. The file is append-only. Tags let you group by any dimension.
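The exact record layout agenttrace writes is not shown in this post, so the following is only a sketch of the append-only JSONL idea using the standard library. The field names (`tags`, `input_tokens`, `output_tokens`) and the helpers `append_trace` and `load_traces` are assumptions for illustration.

```python
import json
import os
import tempfile

def append_trace(path: str, record: dict) -> None:
    """Append one trace as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_traces(path: str) -> list[dict]:
    """Read every non-empty line back as a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo against a temporary file so no real trace file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "traces.jsonl")
append_trace(demo_path, {"tags": {"user_id": "u1", "request_id": "r1"},
                         "input_tokens": 1200, "output_tokens": 300})
append_trace(demo_path, {"tags": {"user_id": "u2", "request_id": "r2"},
                         "input_tokens": 800, "output_tokens": 150})

records = load_traces(demo_path)
u1_records = [r for r in records if r["tags"]["user_id"] == "u1"]
print(len(records), len(u1_records))
```

Because every record carries its own tags, grouping by user, request, or any other dimension is a plain filter over the loaded list.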


Computing USD Cost

Anthropic's pricing as of mid-2026 for claude-sonnet-4-6:

  • Input: $3.00 per million tokens
  • Output: $15.00 per million tokens
  • Cache reads: $0.30 per million tokens
  • Cache writes: $3.75 per million tokens

PRICING = {
    "claude-sonnet-4-6": {
        "input": 3.00 / 1_000_000,
        "output": 15.00 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
        "cache_write": 3.75 / 1_000_000,
    }
}

def compute_cost(usage, model="claude-sonnet-4-6"):
    p = PRICING[model]
    return (
        usage.input_tokens * p["input"]
        + usage.output_tokens * p["output"]
        # "or 0" guards against cache fields that are present but set to None
        + (getattr(usage, "cache_read_input_tokens", 0) or 0) * p["cache_read"]
        + (getattr(usage, "cache_creation_input_tokens", 0) or 0) * p["cache_write"]
    )

Store the cost alongside the token counts in your trace:

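span.record(
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    # keep the cache counts so the cache report further down still has data
    cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
    cache_creation_tokens=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
    cost_usd=compute_cost(response.usage),
)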
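As a quick sanity check of the cost formula, the sketch below feeds a `compute_cost` function a hand-built usage object. `SimpleNamespace` stands in for the SDK's usage object, the token counts are made up, and the `or 0` guard is an assumption covering SDK versions that return `None` for the cache fields.

```python
from types import SimpleNamespace

# Per-token prices taken from the price list above (assumed current).
PRICES = {
    "input": 3.00 / 1_000_000,
    "output": 15.00 / 1_000_000,
    "cache_read": 0.30 / 1_000_000,
    "cache_write": 3.75 / 1_000_000,
}

def compute_cost(usage) -> float:
    """Same formula as above, with missing or None cache fields treated as 0."""
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * PRICES["input"]
        + usage.output_tokens * PRICES["output"]
        + cache_read * PRICES["cache_read"]
        + cache_write * PRICES["cache_write"]
    )

# Hypothetical request: 50,000-token prefix served from cache, 500 fresh input tokens.
cached = SimpleNamespace(input_tokens=500, output_tokens=400,
                         cache_read_input_tokens=50_000,
                         cache_creation_input_tokens=None)
# The same request with no caching: every prompt token is billed as input.
uncached = SimpleNamespace(input_tokens=50_500, output_tokens=400)

cached_cost = compute_cost(cached)
uncached_cost = compute_cost(uncached)
print(f"cached:   ${cached_cost:.4f}")
print(f"uncached: ${uncached_cost:.4f}")
```

With these made-up numbers the cached request costs about $0.02 and the uncached one about $0.16, so the prompt dominates the bill whenever the cache misses.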

Analyzing Traces with agenttrace

Once you have traces, pull the report:

from agenttrace import Tracer
import statistics

tracer = Tracer(store_path="~/.myagent/traces.jsonl")
traces = tracer.load()

costs = [t["cost_usd"] for t in traces if "cost_usd" in t]
print(f"Requests: {len(costs)}")
print(f"Total:    ${sum(costs):.4f}")
print(f"Mean:     ${statistics.mean(costs):.4f}")
print(f"P50:      ${statistics.median(costs):.4f}")
print(f"P95:      ${statistics.quantiles(costs, n=20)[18]:.4f}")
print(f"P99:      ${statistics.quantiles(costs, n=100)[98]:.4f}")
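A note on the indexing: `statistics.quantiles(costs, n=20)` returns 19 cut points, so index 18 is the 95th percentile, and `n=100` with index 98 gives the 99th. The helper below is a small sketch (the name `percentile` is my own) that wraps this indexing and returns `None` for samples too small to summarize, since older Python versions raise on fewer than two data points.

```python
import statistics
from typing import Optional

def percentile(values: list[float], pct: int) -> Optional[float]:
    """Return the pct-th percentile (1..99), or None if there are fewer than 2 values."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    if len(values) < 2:
        return None
    # n=100 yields 99 cut points; the cut point for percentile k sits at index k - 1.
    return statistics.quantiles(values, n=100)[pct - 1]

# Made-up costs: $0.001, $0.002, ..., $0.100
costs = [i / 1000 for i in range(1, 101)]
p50 = percentile(costs, 50)
p95 = percentile(costs, 95)
p99 = percentile(costs, 99)
print(f"P50 ${p50:.4f}  P95 ${p95:.4f}  P99 ${p99:.4f}")
```

With only a handful of requests the tail percentiles are extrapolated from very little data, so treat P99 as meaningful only once you have a few hundred traces.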

Then group by user:

from collections import defaultdict

by_user = defaultdict(list)
for trace in traces:
    uid = trace.get("tags", {}).get("user_id", "unknown")
    if "cost_usd" in trace:
        by_user[uid].append(trace["cost_usd"])

# Sort users by total spend
ranked = sorted(by_user.items(), key=lambda x: sum(x[1]), reverse=True)
print("\nTop 10 users by cost:")
for uid, user_costs in ranked[:10]:  # separate name so the global costs list is not overwritten
    print(f"  {uid}: ${sum(user_costs):.4f} across {len(user_costs)} requests")

This immediately tells you if one user drives 40% of your bill. It happens more often than you'd expect.
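To make that kind of concentration explicit, a short follow-up can turn the per-user totals into shares of the overall bill. This is a sketch over made-up numbers. `spend_shares` is a hypothetical helper, and its input has the same shape as `by_user` in the snippet above.

```python
def spend_shares(by_user: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Return (user_id, total_usd, share_of_total), highest spender first."""
    totals = {uid: sum(costs) for uid, costs in by_user.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return [(uid, 0.0, 0.0) for uid in totals]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(uid, total, total / grand_total) for uid, total in ranked]

# Made-up per-request costs for three users.
demo = {"alice": [0.02, 0.03], "bob": [0.40], "carol": [0.05, 0.50]}
shares = spend_shares(demo)
for uid, total, share in shares:
    print(f"{uid}: ${total:.2f} ({share:.0%} of total)")
```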


Checking Cache Hit Ratio with cachebench

If you're using Anthropic's prompt caching, you need to know whether it's actually working.

from cachebench import CacheBench

bench = CacheBench(store_path="~/.myagent/traces.jsonl")
report = bench.report()

print(f"Cache hit ratio:    {report.hit_ratio:.1%}")
print(f"Cache miss ratio:   {report.miss_ratio:.1%}")
print(f"Savings from cache: ${report.saved_usd:.4f}")
print(f"Cost without cache: ${report.would_cost_usd:.4f}")

A healthy cache hit ratio for a system with a large static system prompt should be above 80% after warmup. If you're seeing 20%, your cache is being invalidated on most requests.
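How cachebench defines the ratio is not spelled out in this post. One plausible definition, used in this sketch, is cache-read tokens divided by all prompt tokens (uncached input plus cache reads plus cache writes). The record keys mirror the `span.record` call earlier and are assumptions about the stored format.

```python
def cache_hit_ratio(traces: list[dict]) -> float:
    """Token-weighted hit ratio: cache reads / all prompt tokens. Assumed definition."""
    read = sum(t.get("cache_read_tokens") or 0 for t in traces)
    write = sum(t.get("cache_creation_tokens") or 0 for t in traces)
    fresh = sum(t.get("input_tokens") or 0 for t in traces)
    total = read + write + fresh
    return read / total if total else 0.0

# Made-up warmup scenario: one cache write followed by three cache reads.
demo_traces = [
    {"input_tokens": 200, "cache_creation_tokens": 50_000, "cache_read_tokens": 0},
    {"input_tokens": 200, "cache_creation_tokens": 0, "cache_read_tokens": 50_000},
    {"input_tokens": 200, "cache_creation_tokens": 0, "cache_read_tokens": 50_000},
    {"input_tokens": 200, "cache_creation_tokens": 0, "cache_read_tokens": 50_000},
]
ratio = cache_hit_ratio(demo_traces)
print(f"Hit ratio: {ratio:.1%}")
```

In this example the ratio is still below 80% after four requests because the initial write counts against it, which is why the figure is best read after warmup.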

Common causes of low cache hit ratio:

  1. Dynamic content in the system prompt (timestamps, user names, request IDs)
  2. Inconsistent message ordering between requests
  3. Not marking the cacheable portion with cache_control

The fix is to separate static from dynamic content:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_CONTEXT,  # ~50k tokens, same every request
                "cache_control": {"type": "ephemeral"},  # mark for caching
            },
            {
                "type": "text",
                "text": f"User {user_id} asks: {user_input}",  # dynamic, not cached
            },
        ],
    }
]

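The cached portion, meaning everything up to and including the block marked with cache_control, must be identical byte-for-byte between requests. Any difference there produces a cache miss and a fresh cache write.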
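One way to catch accidental drift in the cached block is to fingerprint it and record the hash with each trace. This is a sketch. `prefix_fingerprint` and the idea of storing it as a tag are my assumptions, not a feature of agenttrace or cachebench.

```python
import hashlib

def prefix_fingerprint(static_text: str) -> str:
    """Short SHA-256 fingerprint of the exact bytes sent as the cached block."""
    return hashlib.sha256(static_text.encode("utf-8")).hexdigest()[:16]

# Illustrative strings: a stable prompt and one with a timestamp leaked into it.
stable = "You are a support agent. Policy document v3 follows..."
with_timestamp = stable + " Current time: 2026-01-01T00:00:00Z"

same = prefix_fingerprint(stable) == prefix_fingerprint(stable)
drifted = prefix_fingerprint(stable) != prefix_fingerprint(with_timestamp)
print(same, drifted)
```

If a day's traces contain more than a handful of distinct fingerprints, something dynamic has leaked into the block you meant to keep static.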


Setting a Per-Request Budget Gate

Once you know your p95 cost, you can set a pre-flight gate that rejects requests likely to exceed budget.

import anthropic
from llm_cost_cap import CostCap

client = anthropic.Anthropic()
cap = CostCap(max_usd=0.10)  # $0.10 per request

def safe_call(messages, tools):
    estimate = cap.estimate(
        model="claude-sonnet-4-6",
        messages=messages,
        tools=tools,
    )

    if estimate.exceeds_cap:
        raise ValueError(
            f"Estimated cost ${estimate.usd:.4f} exceeds cap ${cap.max_usd:.4f}"
        )

    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )

The estimate comes from a local token-count approximation, not an API call. It runs in microseconds.
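How llm_cost_cap approximates token counts is not shown in this post. A common rough heuristic is about four characters per token for English text. The sketch below uses that rule and treats `max_tokens` as the worst-case output length. The function name, the four-character rule, and the prices are all assumptions.

```python
import json

CHARS_PER_TOKEN = 4                  # rough heuristic, not a real tokenizer
INPUT_PRICE = 3.00 / 1_000_000       # USD per input token (assumed)
OUTPUT_PRICE = 15.00 / 1_000_000     # USD per output token (assumed)

def estimate_cost_usd(messages: list[dict], tools: list[dict], max_tokens: int) -> float:
    """Upper-bound estimate: approximate input tokens plus max_tokens of output."""
    payload_chars = len(json.dumps(messages)) + len(json.dumps(tools))
    approx_input_tokens = payload_chars / CHARS_PER_TOKEN
    return approx_input_tokens * INPUT_PRICE + max_tokens * OUTPUT_PRICE

# Made-up request: about 4,000 characters of user text, no tools.
messages = [{"role": "user", "content": "x" * 4000}]
estimate = estimate_cost_usd(messages, tools=[], max_tokens=2048)
over_cap = estimate > 0.10
print(f"Estimated: ${estimate:.4f}, over $0.10 cap: {over_cap}")
```

For short prompts the output term dominates, so the `max_tokens` setting effectively decides whether a request passes the gate.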


What This Does NOT Do

These tools measure cost after the fact (or approximate it before). They do not block runaway agents in real time. If an agent enters an infinite tool loop and makes 1000 calls before you notice, the traces will show it but won't stop it. For that, pair with tool-loop-guard or agent-deadline.
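tool-loop-guard and agent-deadline are not shown here, but the basic idea of a hard stop can be sketched in a few lines. `CallBudget` below is a hypothetical class, not the API of either library: it counts calls and accumulated cost for one request and raises once either limit is crossed.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a single request goes over its call or spend limit."""

class CallBudget:
    def __init__(self, max_calls: int, max_usd: float) -> None:
        self.max_calls = max_calls
        self.max_usd = max_usd
        self.calls = 0
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one completed LLM call; raise if the request is now over budget."""
        self.calls += 1
        self.spent_usd += cost_usd
        if self.calls > self.max_calls or self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"{self.calls} calls, ${self.spent_usd:.4f} spent"
            )

budget = CallBudget(max_calls=5, max_usd=1.00)
stopped_after = None
try:
    for _ in range(1000):             # stand-in for a runaway tool loop
        budget.charge(cost_usd=0.01)  # pretend each call cost one cent
except BudgetExceeded:
    stopped_after = budget.calls
print(stopped_after)
```

Because `charge` runs after each call, the call that crosses the limit has already been paid for. The guard only prevents the calls that would have followed it.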

The USD estimates use hardcoded pricing. If Anthropic changes prices, you update the constants. The libraries do not pull live pricing from the API.

Per-request attribution requires you to pass a user_id or request_id in your trace tags. If you don't tag your traces, you get aggregate numbers only.


Design Notes

Storing traces as JSONL has the same advantage as conversation logs: you can grep and inspect without a running database. When a user reports "why did I get billed so much this week?", you open the trace file, filter by user ID, and sum the costs. It takes 30 seconds.

The token count in response.usage is exact. Use it, not the estimate. The estimate is only for pre-flight checks before you make the call.

Cache hit tracking requires reading cache_read_input_tokens from the usage object. Depending on the SDK version, this field can be missing or set to None when no cache reads occurred. Use getattr(usage, "cache_read_input_tokens", 0) or 0 so you don't crash on requests with no cache reads.


When This Applies

This pattern fits agents that:

  • Serve multiple users and need per-user cost attribution
  • Have a system prompt larger than 1024 tokens (caching is worthwhile)
  • Need to enforce per-request cost limits
  • Are running in production and need cost visibility

Skip per-request tracing for:

  • Single-user personal tools where you just watch the dashboard
  • Batch jobs where aggregate cost is the metric that matters

Quick Start

pip install agenttrace cachebench llm-cost-cap

No shared configuration. Each library reads from your trace JSONL path.


Related Libraries

| Library | What It Does | Language |
|---|---|---|
| agenttrace | Per-run cost + latency tracing with tags | Python |
| cachebench | Prompt cache hit ratio and savings reporting | Python |
| llm-cost-cap | Pre-flight USD cost gate before LLM call | Python |
| token-budget-py | Concurrent token/USD budget across agent runs | Python |
| llm-budget-window | Time-windowed token/USD budget enforcement | Python |
| claude-cost | Anthropic cost computation crate | Rust |

What's Next

Once you have per-request cost data, the next step is anomaly detection. Which requests are more than 3x the mean? Those are usually bugs: agents stuck in a loop, tool results that are unexpectedly large, or prompts that grew without you noticing.
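Before reaching for a dedicated tool, the 3x-the-mean check can be run directly over the loaded traces. The sketch below assumes records with `cost_usd` and `tags` fields, as in the analysis code earlier. `flag_anomalies` is a hypothetical helper.

```python
import statistics

def flag_anomalies(traces: list[dict], factor: float = 3.0) -> list[dict]:
    """Return traces whose cost is more than `factor` times the mean cost."""
    costs = [t["cost_usd"] for t in traces if "cost_usd" in t]
    if not costs:
        return []
    threshold = factor * statistics.mean(costs)
    return [t for t in traces if t.get("cost_usd", 0.0) > threshold]

# Made-up data: nine ordinary requests and one expensive outlier.
demo_traces = [{"tags": {"request_id": f"r{i}"}, "cost_usd": 0.01} for i in range(9)]
demo_traces.append({"tags": {"request_id": "r9"}, "cost_usd": 0.50})

flagged = flag_anomalies(demo_traces)
for t in flagged:
    print(t["tags"]["request_id"], f"${t['cost_usd']:.2f}")
```

One caveat with this simple rule: large outliers pull the mean up, so a median-based threshold may be a more stable baseline on skewed data.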

driftvane does drift detection over metric streams. Point it at your per-request cost trace and it will flag when cost distribution shifts outside normal bounds.

For budget enforcement across concurrent agent runs (not just per-request), token-budget-py gives you a shared pool that multiple coroutines can draw from and return to.
