The $500 Monthly Bill Problem
You know your monthly bill is $500. Your manager wants to know where it goes.
Is it the summarization feature? The search feature? The agent that handles document uploads? The cheap model you use for classification, or the expensive one for reasoning?
Most teams cannot answer this question. They have one API key. They see one total on the billing dashboard. Everything blurs together.
Cost attribution splits that total by feature, by user, by model, and by prompt template. It turns a mystery number into a breakdown you can act on.
This post shows how to do it with three libraries: agenttrace for tagging runs, cachebench for measuring how much caching helps, and llm-cost-cap for pre-flight cost estimation.
The Attribution Model
Attribution has three steps.
Step 1: Tag every run. Before calling the model, attach metadata to the run: which feature triggered it, which user requested it, which prompt template version was used, which model. This is the agenttrace layer.
Step 2: Measure cache effectiveness. Prompt caching can cut the input cost of a repeated system prompt by 70-90%, but only for the tokens that are actually served from cache. cachebench measures actual cache hit ratios per feature, so you can see whether your caching strategy is working where you think it is.
Step 3: Estimate before you commit. For expensive calls, llm-cost-cap estimates the cost from estimated input and output token counts before the call happens. You can log estimates alongside actuals and find where the estimates are systematically off.
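The arithmetic behind both the estimate and the actual cost is the same: token counts multiplied by a per-million-token price. Here is a minimal sketch of that calculation, using the Sonnet prices from the example configuration in this post. The function name `call_cost_usd` is hypothetical and is not part of any of the three libraries.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one call in USD, given prices per million tokens."""
    return (
        input_tokens * input_price_per_m
        + output_tokens * output_price_per_m
    ) / 1_000_000


# 1,200 input tokens and 300 output tokens at $3 / $15 per million tokens:
# (1200 * 3 + 300 * 15) / 1,000,000 = 0.0081
cost = call_cost_usd(1200, 300, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # prints "$0.0081"
```

Note that output tokens dominate here even though there are four times fewer of them. That is why the expected output length matters so much in a pre-flight estimate.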
Main Code Example
import asyncio
import json

from agenttrace import Tracer, RunRecord
from cachebench import CacheBench, CacheSession
from llm_cost_cap import CostCap, CostEstimate

# One tracer, multiple features share it
tracer = Tracer(
    tag="production",
    output_path="./traces/runs.jsonl",  # append-only JSONL
)

# CacheBench: track cache hit ratio per feature
bench = CacheBench(session_path="./traces/cache.jsonl")

# CostCap: estimate cost before each call, block if over threshold
cost_cap = CostCap(
    model_pricing={
        "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},  # per million tokens
        "gpt-5.4": {"input": 2.5, "output": 10.0},
    },
    hard_cap_usd=0.05,  # refuse any single call estimated over $0.05
    warn_usd=0.02,  # log a warning if estimated over $0.02
)


async def traced_call(
    feature: str,
    user_id: str,
    template_version: str,
    model: str,
    messages: list[dict],
    expected_output_tokens: int = 500,
) -> object:
    """
    Make an LLM call with full cost attribution attached.
    """
    run_id = tracer.start_run(
        metadata={
            "feature": feature,
            "user_id": user_id,
            "template_version": template_version,
            "model": model,
        }
    )

    # Pre-flight cost estimate.
    # Rough heuristic: about 1.3 tokens per whitespace-separated word.
    input_token_estimate = sum(
        len(str(m.get("content", "")).split()) * 1.3
        for m in messages
    )
    estimate: CostEstimate = cost_cap.estimate(
        model=model,
        input_tokens=int(input_token_estimate),
        output_tokens=expected_output_tokens,
    )
    tracer.annotate(run_id, estimated_cost_usd=estimate.total_usd)

    if estimate.blocked:
        tracer.end_run(run_id, error="cost_cap_blocked")
        raise RuntimeError(
            f"Call blocked: estimated ${estimate.total_usd:.4f} exceeds cap."
        )

    # CacheBench: open a session to track cache headers
    with bench.session(feature=feature, model=model) as cache_session:
        # your_llm_client is a placeholder for your own async client call
        response = await your_llm_client(messages)

        # Record actual cache headers from the response
        cache_session.record(
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0),
            cache_write_tokens=getattr(response.usage, "cache_creation_input_tokens", 0),
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )

    # Record actual cost in the trace
    run_record: RunRecord = tracer.end_run(
        run_id,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        model=model,
    )

    # Log estimate vs actual for drift analysis (flag anything over 30% off)
    if abs(run_record.cost_usd - estimate.total_usd) / max(estimate.total_usd, 0.0001) > 0.3:
        print(
            f"[COST_DRIFT] feature={feature} "
            f"estimated=${estimate.total_usd:.4f} "
            f"actual=${run_record.cost_usd:.4f}"
        )

    return response


def aggregate_costs(trace_path: str) -> dict:
    """
    Load the trace JSONL and aggregate cost by feature.
    Returns a dict: {feature: {total_usd, run_count, avg_usd_per_run, p95_usd}}.
    """
    by_feature: dict[str, list[float]] = {}
    with open(trace_path) as f:
        for line in f:
            record = json.loads(line)
            feature = record.get("metadata", {}).get("feature", "unknown")
            cost = record.get("cost_usd", 0.0)
            by_feature.setdefault(feature, []).append(cost)

    result = {}
    for feature, costs in by_feature.items():
        result[feature] = {
            "total_usd": sum(costs),
            "run_count": len(costs),
            "avg_usd_per_run": sum(costs) / len(costs),
            "p95_usd": sorted(costs)[int(len(costs) * 0.95)] if costs else 0,
        }
    return result


async def main():
    # Simulate different features making calls
    await traced_call(
        feature="summarization",
        user_id="user-42",
        template_version="v2.1",
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": "Summarize this document: ..."}],
    )
    await traced_call(
        feature="search",
        user_id="user-17",
        template_version="v1.0",
        model="gpt-5.4",
        messages=[{"role": "user", "content": "Search for Python async patterns."}],
    )

    # After a period, analyze
    costs = aggregate_costs("./traces/runs.jsonl")
    for feature, stats in sorted(costs.items(), key=lambda x: -x[1]["total_usd"]):
        print(
            f"{feature:20s} "
            f"total=${stats['total_usd']:.4f} "
            f"runs={stats['run_count']} "
            f"avg=${stats['avg_usd_per_run']:.4f} "
            f"p95=${stats['p95_usd']:.4f}"
        )

    # Cache effectiveness by feature
    cache_report = bench.report()
    for feature, metrics in cache_report.items():
        savings = metrics["cache_savings_usd"]
        hit_ratio = metrics["hit_ratio"]
        print(f"Cache [{feature}]: hit={hit_ratio:.1%}, saved=${savings:.4f}")


if __name__ == "__main__":
    asyncio.run(main())
The output after a day of traffic:
summarization        total=$1.2340 runs=312 avg=$0.0040 p95=$0.0089
search               total=$0.8820 runs=441 avg=$0.0020 p95=$0.0031
document-upload      total=$0.3100 runs=44 avg=$0.0070 p95=$0.0190
Cache [summarization]: hit=82.0%, saved=$0.3210
Cache [search]: hit=14.0%, saved=$0.0041
Now you know that summarization costs about 40% more in total than search ($1.23 versus $0.88), while search has far lower cache effectiveness (14% hit ratio versus 82%). Those point to two different optimizations: reduce the summarization run count, and restructure the search prompt so that more of it is a shared, cacheable prefix.
What This Does NOT Do
This does not give you real-time dashboards. The trace output is a JSONL file. You analyze it after the fact. For real-time dashboards, you need to push these records to a time-series database or an analytics platform.
It does not split cost by line of code. If your summarization feature calls the model three times per user request, all three calls are tagged with the same feature name. The split is at the feature level, not at the code-path level.
It does not account for embedding costs. If you use a separate embedding model for retrieval, that cost is not tracked here unless you add a separate tracer call for it.
The cost cap does not replace your provider's hard limit. It is a pre-flight estimate. If your estimate is 30% low, the actual call may still exceed what you expected. Set your cap conservatively.
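One way to set the cap conservatively is to shrink it by the drift you have observed. Below is a minimal sketch, assuming "30% low" means the estimate is 70% of the actual cost. The helper `conservative_cap` is hypothetical and not part of llm-cost-cap.

```python
def conservative_cap(target_max_usd: float, underestimate_fraction: float) -> float:
    """
    Return the cap to apply to the *estimate* so that the *actual* cost
    stays under target_max_usd.

    Assumes estimate = actual * (1 - underestimate_fraction), so an
    estimate at the returned cap corresponds to an actual cost at the target.
    """
    if not 0 <= underestimate_fraction < 1:
        raise ValueError("underestimate_fraction must be in [0, 1)")
    return target_max_usd * (1 - underestimate_fraction)


# You want no single call to exceed $0.05, and estimates run 30% low.
cap = conservative_cap(0.05, 0.30)
print(f"hard_cap_usd={cap:.3f}")  # prints "hard_cap_usd=0.035"
```

The drift figure should come from your own estimate-versus-actual logs. A single global number is a starting point. Per-feature drift may differ.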
Design Reasoning
Tag at run creation time, not at analysis time. Some teams add metadata to their logs after the fact using request IDs matched across services. That is fragile. If you forget to log the feature name at call time, you cannot reconstruct it later. agenttrace takes the metadata at start_run and writes it to every record for that run.
Keep estimate vs actual in the same record. Comparing estimate drift tells you whether your token counting strategy is reliable. If your estimates are consistently 50% low for a specific feature, something about that feature's prompt structure is different from what your estimator assumes.
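Here is a rough sketch of that drift check, run over records already loaded from the trace file. It assumes each record carries `estimated_cost_usd` and `cost_usd` as top-level keys next to `metadata`. The real agenttrace record layout may differ, so treat the key names as placeholders.

```python
from collections import defaultdict
from statistics import mean


def drift_by_feature(records: list[dict]) -> dict[str, float]:
    """
    Mean signed relative error of the estimate, per feature.
    Negative means the estimate was low: -0.5 is "50% low".
    """
    errors: dict[str, list[float]] = defaultdict(list)
    for record in records:
        actual = record.get("cost_usd", 0.0)
        estimate = record.get("estimated_cost_usd")
        if estimate is None or actual <= 0:
            continue  # blocked or failed runs have no actual cost to compare
        feature = record.get("metadata", {}).get("feature", "unknown")
        errors[feature].append((estimate - actual) / actual)
    return {feature: mean(values) for feature, values in errors.items()}


# Illustrative records, not real data
records = [
    {"metadata": {"feature": "summarization"}, "estimated_cost_usd": 0.002, "cost_usd": 0.004},
    {"metadata": {"feature": "summarization"}, "estimated_cost_usd": 0.003, "cost_usd": 0.006},
    {"metadata": {"feature": "search"}, "estimated_cost_usd": 0.002, "cost_usd": 0.002},
]
drift = drift_by_feature(records)
for feature, value in drift.items():
    print(f"{feature}: {value:+.0%}")
```

A signed error is used instead of an absolute one so that a consistent bias (always low, always high) stays visible instead of averaging into a generic "error" number.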
Cache hit ratio per feature is not the same as cache hit ratio overall. A feature with a large, stable system prompt will have high hit ratios. A feature that changes the system prompt per request will have near-zero hit ratios. You need per-feature breakdown to see this.
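A small sketch shows how an overall ratio can hide this. It assumes a hit ratio defined as cached-read tokens divided by all input tokens (uncached + cache reads + cache writes). cachebench may define its `hit_ratio` differently, and the record keys here are hypothetical.

```python
from collections import defaultdict


def hit_ratios(calls: list[dict]) -> dict[str, float]:
    """Cache hit ratio per feature, plus an "__overall__" entry."""
    totals = defaultdict(lambda: [0, 0])  # key -> [cache_read, all_input]
    for call in calls:
        all_input = (
            call["input_tokens"]
            + call["cache_read_tokens"]
            + call["cache_write_tokens"]
        )
        for key in (call["feature"], "__overall__"):
            totals[key][0] += call["cache_read_tokens"]
            totals[key][1] += all_input
    return {
        key: (read / total if total else 0.0)
        for key, (read, total) in totals.items()
    }


# Illustrative numbers: summarization reuses a stable system prompt,
# search rebuilds its prompt on every request.
calls = [
    {"feature": "summarization", "input_tokens": 200, "cache_read_tokens": 1800, "cache_write_tokens": 0},
    {"feature": "summarization", "input_tokens": 200, "cache_read_tokens": 1800, "cache_write_tokens": 0},
    {"feature": "search", "input_tokens": 1900, "cache_read_tokens": 0, "cache_write_tokens": 100},
    {"feature": "search", "input_tokens": 1900, "cache_read_tokens": 0, "cache_write_tokens": 100},
]
ratios = hit_ratios(calls)
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.0%}")
```

With these numbers the overall ratio is 45%, which describes neither feature: one sits at 90% and the other at 0%.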
When This Applies
Any agent or LLM-backed service with more than one feature or user type. The moment you have two features, you need attribution. Otherwise you cannot optimize.
Multi-tenant SaaS where you charge users based on usage. You need per-user cost data to calculate your margin and set pricing. agenttrace metadata tagging gives you the raw data.
Teams that want to make a cost vs. quality tradeoff decision. "If we downgrade search to a cheaper model, what do we save?" Attribution tells you what search currently costs. You can estimate the savings from a model swap before making the change.
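The model-swap question can be answered roughly by re-pricing a feature's recorded tokens under another model's rates. The sketch below assumes the trace records expose `input_tokens` and `output_tokens`. The `cheaper-model` entry and its prices are placeholders, not real pricing.

```python
PRICING = {  # USD per million tokens
    "gpt-5.4": {"input": 2.5, "output": 10.0},       # from the example config in this post
    "cheaper-model": {"input": 0.5, "output": 2.0},  # hypothetical placeholder
}


def feature_cost(records: list[dict], feature: str, model: str) -> float:
    """Price one feature's recorded token usage at a given model's rates."""
    price = PRICING[model]
    total = 0.0
    for record in records:
        if record["metadata"]["feature"] != feature:
            continue
        total += (
            record["input_tokens"] * price["input"]
            + record["output_tokens"] * price["output"]
        ) / 1_000_000
    return total


# Illustrative records, not real data
records = [
    {"metadata": {"feature": "search"}, "input_tokens": 1000, "output_tokens": 200},
    {"metadata": {"feature": "search"}, "input_tokens": 3000, "output_tokens": 300},
    {"metadata": {"feature": "summarization"}, "input_tokens": 8000, "output_tokens": 900},
]
current = feature_cost(records, "search", "gpt-5.4")
candidate = feature_cost(records, "search", "cheaper-model")
print(f"current=${current:.4f} candidate=${candidate:.4f} saved=${current - candidate:.4f}")
```

This holds token counts constant across models, which is only an approximation. A different model uses a different tokenizer and may produce longer or shorter outputs, so treat the result as a first estimate to confirm with a small trial.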
This does NOT apply to single-feature single-model setups where there is nothing to attribute against. If everything runs on one model for one purpose, the total bill is the attribution. You do not need extra infrastructure.
Quick-Start Snippet
pip install agenttrace cachebench llm-cost-cap
from agenttrace import Tracer
tracer = Tracer(tag="my-service", output_path="./traces/runs.jsonl")
run_id = tracer.start_run(metadata={"feature": "summarization", "user_id": "u-123"})
# ... make your LLM call ...
tracer.end_run(run_id, input_tokens=1200, output_tokens=300, model="claude-sonnet-4-6")
That is the minimal attribution setup. Add cachebench and llm-cost-cap when you need cache metrics and pre-flight guards.
Siblings
| Library | What it does | When to reach for it |
|---|---|---|
| agenttrace | Tag and record per-run cost, tokens, latency | Any production agent with multiple features |
| cachebench | Measure cache hit ratio and savings per feature | When you use prompt caching and want to prove ROI |
| llm-cost-cap | Estimate and block calls over a cost threshold | Stop runaway single calls before they happen |
| token-budget-py | Session-level token cap | Hard ceiling on total session spend |
| claude-cost | Model pricing calculator | Feed accurate prices into cost_cap and tracer |
| agentsnap | Per-call tool snapshot | Pair with agenttrace for tool-level cost breakdown |
What's Next
The next analysis layer is outlier detection. Most runs cluster around the average cost. A small number of outliers can account for a large share of total spend. Sort the JSONL records by cost_usd descending. Inspect the top 10. Usually, outliers share a structural cause: a prompt that grew too large, a tool that returned too much data, or a loop that ran more turns than expected.
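A minimal sketch of that first pass: rank the records by `cost_usd` and report how much of total spend the top N account for. The records are assumed to be dicts already loaded from the JSONL file (one `json.loads` per line). The `run_id` key is a hypothetical identifier.

```python
def top_outliers(records: list[dict], n: int = 10) -> tuple[list[dict], float]:
    """Return the n most expensive runs and their share of total spend."""
    ranked = sorted(records, key=lambda r: r.get("cost_usd", 0.0), reverse=True)
    top = ranked[:n]
    total = sum(r.get("cost_usd", 0.0) for r in records)
    top_total = sum(r.get("cost_usd", 0.0) for r in top)
    share = top_total / total if total else 0.0
    return top, share


# Illustrative data: nine ordinary runs and one runaway run
records = [{"run_id": i, "cost_usd": 0.01} for i in range(9)]
records.append({"run_id": 9, "cost_usd": 0.41})

top, share = top_outliers(records, n=1)
print(f"top run: {top[0]['run_id']}, share of spend: {share:.0%}")
```

If the top handful of runs carries a large share of spend, inspecting their prompts and tool outputs is usually a better use of time than tuning the average case.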
For teams that need real-time cost tracking, emit a lightweight event for each run using agent-event-bus. Subscribe to the event stream from a Prometheus exporter or a Datadog custom metric. The trace file stays as the archival record. The event stream feeds dashboards.
For per-user billing in a SaaS product, the user_id in metadata is the key. Aggregate monthly totals per user. Compare to your pricing tiers. If a user in your "Starter" tier consistently costs more than they pay, you have a pricing problem that attribution makes visible.
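Here is a sketch of that per-user check. It assumes each record has an ISO-8601 `timestamp` field alongside `metadata.user_id` and `cost_usd`. The timestamp field, tier names, and prices are all hypothetical.

```python
from collections import defaultdict


def monthly_cost_by_user(records: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost_usd per (user_id, "YYYY-MM")."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        user = record.get("metadata", {}).get("user_id", "unknown")
        month = record["timestamp"][:7]  # "2025-01-03T..." -> "2025-01"
        totals[(user, month)] += record.get("cost_usd", 0.0)
    return dict(totals)


def unprofitable(totals, user_tier: dict[str, str], tier_price_usd: dict[str, float]):
    """List (user, month, cost) where LLM cost exceeds the tier's monthly price."""
    flagged = []
    for (user, month), cost in sorted(totals.items()):
        tier = user_tier.get(user)
        if tier is not None and cost > tier_price_usd[tier]:
            flagged.append((user, month, cost))
    return flagged


# Illustrative data, not real usage
records = [
    {"metadata": {"user_id": "u-1"}, "timestamp": "2025-01-03T10:00:00Z", "cost_usd": 7.0},
    {"metadata": {"user_id": "u-1"}, "timestamp": "2025-01-20T09:30:00Z", "cost_usd": 5.0},
    {"metadata": {"user_id": "u-2"}, "timestamp": "2025-01-11T14:00:00Z", "cost_usd": 3.0},
]
totals = monthly_cost_by_user(records)
flagged = unprofitable(
    totals,
    user_tier={"u-1": "Starter", "u-2": "Starter"},
    tier_price_usd={"Starter": 10.0},
)
print(flagged)
```

Comparing against the full tier price is the bluntest possible test. In practice you would compare against the share of the price you are willing to spend on model calls.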