- Book: Observability for LLM Applications — Tracing, Evals, and Shipping AI You Can Trust
- Also by me: Agents in Production — the companion book in The AI Engineer's Library (2-book series)
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The bill lands on the first of the month. Your provider dashboard shows one number for the whole account. Finance forwards it to you with a subject line that ends in a question mark. You know your agent made the calls. You do not know which agent, which feature, or which step inside the loop turned a two-cent task into a forty-cent one.
That gap is not a billing problem. It is an instrumentation problem. The provider only sees API calls. Your traces see the trajectory: which agent ran, what step it was on, which tool it picked, how many tokens each turn burned. If you tag those spans correctly, the monthly bill stops being a mystery and starts being a GROUP BY.
Here is how to get there.
The provider bill is aggregated. Your traces don't have to be.
An agent turn is rarely one model call. You get a decision, you run a tool, you feed the result back, you loop. Every one of those model calls carries token usage. Your provider sums all of them into a single line item and calls it a day.
The fix is to attach a small, consistent set of attributes to every span your agent emits, so that later you can slice cost along any axis you care about: per agent, per feature, per run, per step. The OpenTelemetry GenAI semantic conventions already define most of the names. You just have to set them.
Tag every span with agent, step, and tool
Two levels matter. The invoke_agent span is the parent — it fires once per task and wraps the whole loop. The chat spans underneath carry the token usage. You want identity on the parent and usage on the children.
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def run_agent(task, feature):
    with tracer.start_as_current_span(
        "invoke_agent triage-agent"
    ) as root:
        root.set_attribute(
            "gen_ai.agent.name", "triage-agent"
        )
        root.set_attribute(
            "gen_ai.agent.version", "2.1.0"
        )
        # your own dimension for feature rollups
        root.set_attribute("app.feature", feature)
        # loop() is your agent loop; it emits the
        # chat and tool spans as children of root
        return loop(task)
gen_ai.agent.name is the human label. gen_ai.agent.version is the one you skip on the first pass and regret later — when cost jumps overnight, you want to know whether the agent changed. app.feature is not in the spec; it is your own dimension, and it is the one finance actually asks about.
Now the child span. Each model call sets its step, its model, and its token counts.
def chat_step(client, messages, step):
    with tracer.start_as_current_span(
        "chat claude-sonnet-4-6"
    ) as span:
        span.set_attribute("gen_ai.agent.step", step)
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=messages,
        )
        u = resp.usage
        span.set_attribute(
            "gen_ai.response.model", resp.model
        )
        span.set_attribute(
            "gen_ai.usage.input_tokens",
            u.input_tokens,
        )
        span.set_attribute(
            "gen_ai.usage.output_tokens",
            u.output_tokens,
        )
        # Cache fields can be missing or None when
        # caching is off; fall back to 0 either way.
        span.set_attribute(
            "gen_ai.usage.cache_read_input_tokens",
            getattr(u, "cache_read_input_tokens", 0) or 0,
        )
        span.set_attribute(
            "gen_ai.usage.cache_creation_input_tokens",
            getattr(u, "cache_creation_input_tokens", 0) or 0,
        )
        return resp
Read usage off the response. Never estimate token counts with a local tokenizer; the provider's number is the one you get billed on, and cache reads and system-prompt overhead make the local guess wrong. For Claude, usage.input_tokens and usage.output_tokens come straight off the message. If you use prompt caching, usage also carries cache_read_input_tokens and cache_creation_input_tokens, and input_tokens then covers only the uncached remainder. Record those too: cache reads are billed at a fraction of the input rate and cache writes at a premium, so dropping them understates your cost, and pricing them at the plain input rate misstates it in the other direction.
Tool spans get the same treatment for identity, minus the tokens.
def run_tool(name, args):
    with tracer.start_as_current_span(
        f"execute_tool {name}"
    ) as span:
        span.set_attribute("gen_ai.tool.name", name)
        # TOOLS maps tool names to callables
        return TOOLS[name](**args)
Turn tokens into money
Tokens are the raw signal. Cost is tokens times a per-model rate. Keep the rate table in code, keyed on the response model, and treat it as data that drifts, because provider prices change and a stale table quietly makes every number wrong.
# USD per 1M tokens. Check current rates before trusting.
# cache_write assumes the 5-minute cache tier, priced
# at 1.25x the input rate.
PRICING = {
    "claude-sonnet-4-6": {
        "input": 3.00,
        "output": 15.00,
        "cache_read": 0.30,
        "cache_write": 3.75,
    },
    "claude-haiku-4-5": {
        "input": 1.00,
        "output": 5.00,
        "cache_read": 0.10,
        "cache_write": 1.25,
    },
}

def cost_usd(
    model, in_tok, out_tok, cache_read=0, cache_write=0
):
    # in_tok is the uncached input remainder that
    # Anthropic reports in usage.input_tokens; cache
    # reads and writes are billed separately, so add,
    # don't subtract.
    p = PRICING[model]
    return (
        in_tok * p["input"]
        + cache_read * p["cache_read"]
        + cache_write * p["cache_write"]
        + out_tok * p["output"]
    ) / 1_000_000
You can attach the computed cost to each chat span as gen_ai.usage.cost so the rollup is a sum, not a join against a pricing table at query time. Like app.feature, that attribute is your own addition rather than a name from the spec. Most backends already do this multiplication for you: Langfuse, Phoenix, Braintrust, and LangSmith commonly read gen_ai.usage.* and apply their own pricing table. Setting the attribute yourself gives you a number you control. You can reconcile it against theirs, and it survives a provider price change their table hasn't caught up to. Treat the rates in the table above as illustrative. Check Anthropic's current published pricing before you trust any number that falls out of this.
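One wrinkle with keying the table on the response model: the API may hand back a dated snapshot ID rather than the alias you requested, and a plain dictionary lookup then fails. A minimal sketch of a tolerant lookup follows. The helper name `pricing_key`, the dated ID in the usage lines, and the prefix rule are all assumptions for illustration, not anything the provider guarantees.

```python
def pricing_key(model, table):
    """Return the longest table key that equals `model` or prefixes it.

    Hypothetical helper: assumes dated IDs look like '<alias>-<date>'.
    Returns None for unknown models so the caller can alert instead
    of silently pricing a call at zero.
    """
    matches = [
        key for key in table
        if model == key or model.startswith(key + "-")
    ]
    return max(matches, key=len) if matches else None


# Illustrative keys only; the lookup never reads the values.
table = {"claude-haiku-4-5": {}, "claude-sonnet-4-6": {}}
key = pricing_key("claude-haiku-4-5-20251001", table)
unknown = pricing_key("some-other-model", table)
```

Returning None for an unknown model is deliberate. A missing rate should show up as an alert, not as a zero-dollar span that drags your averages down.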
Roll up per run, per feature, per step
Once the attributes are on the spans, attribution is aggregation. The parent invoke_agent span is what makes it clean — every child belongs to one task, so per-run cost is the sum of gen_ai.usage.cost across the subtree. Backends compute this for you and show it on the root span.
The dimension finance cares about is the feature, and that is where app.feature earns its place. If your span store is queryable (most export to a warehouse or expose an API), the question "what did the refund flow cost last week" is one grouped sum.
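-- app.feature lives on the invoke_agent root span,
-- so join each chat span to its root by trace_id.
-- Assumes one invoke_agent span per trace.
SELECT
  r.attributes['app.feature'] AS feature,
  count(DISTINCT c.trace_id) AS runs,
  sum(c.attributes['gen_ai.usage.cost']) AS usd
FROM spans AS c
JOIN spans AS r
  ON r.trace_id = c.trace_id
  AND r.name LIKE 'invoke_agent %'
WHERE c.name LIKE 'chat %'
  AND c.timestamp >= now() - INTERVAL 7 DAY
GROUP BY feature
ORDER BY usd DESC;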
That single query answers the question the provider dashboard cannot. The dashboard gives you the account total. This gives you per-feature spend, the run count behind it, and the unit cost. Divide usd by runs and you have cost-per-task per feature, which is the number you take into a pricing conversation.
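If your span store exposes an API rather than SQL, the same rollup is a few lines of Python. This is a sketch under assumptions: the span dict shape (`trace_id`, `name`, `attributes`) and the function name `cost_per_feature` are made up for illustration, so adapt them to your export format. Because app.feature is set on the invoke_agent root span, the code first maps each trace to its feature, then sums the chat spans.

```python
from collections import defaultdict


def cost_per_feature(spans):
    """Roll chat-span cost up to the feature on each run's root span."""
    # Pass 1: read the feature off each invoke_agent root span.
    feature_by_trace = {
        s["trace_id"]: s["attributes"].get("app.feature", "unknown")
        for s in spans
        if s["name"].startswith("invoke_agent ")
    }
    # Pass 2: add each chat span's cost to its run's feature.
    usd = defaultdict(float)
    runs = defaultdict(set)
    for s in spans:
        if not s["name"].startswith("chat "):
            continue
        feature = feature_by_trace.get(s["trace_id"], "unknown")
        usd[feature] += s["attributes"].get("gen_ai.usage.cost", 0.0)
        runs[feature].add(s["trace_id"])
    return {
        f: {
            "usd": usd[f],
            "runs": len(runs[f]),
            "usd_per_run": usd[f] / len(runs[f]),
        }
        for f in usd
    }


def _span(trace_id, name, **attrs):
    return {"trace_id": trace_id, "name": name, "attributes": attrs}


# Made-up data: two refund runs and one search run.
example = [
    _span("t1", "invoke_agent triage-agent", **{"app.feature": "refund"}),
    _span("t1", "chat claude-sonnet-4-6", **{"gen_ai.usage.cost": 0.25}),
    _span("t1", "chat claude-sonnet-4-6", **{"gen_ai.usage.cost": 0.5}),
    _span("t2", "invoke_agent triage-agent", **{"app.feature": "refund"}),
    _span("t2", "chat claude-sonnet-4-6", **{"gen_ai.usage.cost": 0.25}),
    _span("t3", "invoke_agent triage-agent", **{"app.feature": "search"}),
    _span("t3", "chat claude-sonnet-4-6", **{"gen_ai.usage.cost": 0.125}),
]
report = cost_per_feature(example)
```

Here `report["refund"]` comes out at 1.0 USD over 2 runs, or 0.5 per run. Chat spans whose trace has no root span land under "unknown", which is worth watching: a growing "unknown" bucket means spans are arriving without their parent.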
Group by gen_ai.agent.step instead of feature and you get the shape of the loop: how cost accumulates turn by turn. In an agent, input tokens grow every step because the context accretes: the transcript from step one rides along into step four. A run that should take three turns and takes twelve does not cost four times as much. It costs more than that, because each later turn carries a fatter prompt.
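The arithmetic behind that is easy to check. The sketch below assumes a simplified model where every turn re-sends a fixed starting prompt plus a constant amount of new transcript. Real runs are lumpier, and the numbers are made up.

```python
def total_input_tokens(turns, base, growth):
    """Input tokens summed over a run where each turn re-sends the
    transcript: turn i (0-based) sends base + i * growth tokens."""
    return sum(base + i * growth for i in range(turns))


# Hypothetical run: 1,000-token starting prompt,
# 1,000 tokens of transcript added per turn.
short_run = total_input_tokens(3, base=1_000, growth=1_000)
long_run = total_input_tokens(12, base=1_000, growth=1_000)
ratio = long_run / short_run
```

With these numbers the three-turn run sends 6,000 input tokens and the twelve-turn run sends 78,000. Four times the turns, thirteen times the input tokens.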
Find the expensive step
Averages hide the runs that hurt. The bill is not driven by the median task; it is driven by the tail — the trajectory that looped, re-searched, re-read a giant document five times, or handed off to a model that was overkill for the job. Attribution per step is how you find it.
Sort your runs by cost, open the most expensive one, and read it top to bottom. The token counts on the chat spans tell the story. A step where input tokens jump from two thousand to twenty thousand is a step that stuffed something huge into context: a full document, an un-truncated tool result, an entire conversation history. A run with fifteen chat spans when the happy path has four is a loop that did not terminate.
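You can let code do the first pass of that reading. A minimal sketch, assuming you have already pulled one run's gen_ai.usage.input_tokens values in step order; the function name `flag_token_jumps` and the 3x threshold are illustrative choices, not a standard.

```python
def flag_token_jumps(input_tokens_by_step, factor=3.0):
    """Return (step, previous, current) for each step whose input
    tokens grew by at least `factor` over the step before it."""
    jumps = []
    for step in range(1, len(input_tokens_by_step)):
        prev = input_tokens_by_step[step - 1]
        cur = input_tokens_by_step[step]
        if prev > 0 and cur >= factor * prev:
            jumps.append((step, prev, cur))
    return jumps


# Made-up run: step 2 pulls a large document into context.
jumps = flag_token_jumps([2_000, 2_400, 20_000, 21_000])
```

For this run the function flags step 2 only, where input jumps from 2,400 to 20,000 tokens. That is the span to open first.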
A rollup by step position makes the tail visible without opening traces one at a time:
-- cost per step position, across all runs
SELECT
attributes['gen_ai.agent.step'] AS step,
count(*) AS calls,
avg(attributes['gen_ai.usage.input_tokens'])
AS avg_in_tokens,
sum(attributes['gen_ai.usage.cost']) AS usd
FROM spans
WHERE name LIKE 'chat %'
GROUP BY step
ORDER BY step;
If cost stays flat across steps, your loops are short and context is managed. If it climbs steeply, either you are looping too long or you are not trimming context between turns, and the fix is on the agent side, not the model side. Once you know it is a specific tool result blowing up the prompt, you truncate that result before it re-enters context, and the whole tail flattens.
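Truncation itself can be very simple. The sketch below keeps the head and tail of an oversized tool result and marks the cut. It budgets in characters as a rough stand-in for tokens, which is fine for trimming but not for billing; the function name `truncate_tool_result`, the default budget, and the marker text are all illustrative.

```python
def truncate_tool_result(text, max_chars=4_000):
    """Keep the head and tail of an oversized tool result and mark
    the cut so the model knows content was dropped."""
    if len(text) <= max_chars:
        return text
    head = max_chars // 2
    tail = max_chars - head
    omitted = len(text) - max_chars
    return (
        text[:head]
        + f"\n[... {omitted} characters omitted ...]\n"
        + text[len(text) - tail:]
    )


short = truncate_tool_result("ok")
clipped = truncate_tool_result("a" * 10_000, max_chars=100)
```

Keeping both ends is a design choice: many tool outputs put the summary at the top and the error or final row at the bottom. If yours do not, a head-only cut is simpler.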
The cheapest optimization is usually a model swap on one step, not the whole agent. If step zero is a routing decision that a small model handles fine, the per-step rollup shows you exactly that span burning Sonnet money on a Haiku-sized job. You would never see it in the account total.
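Before you make the swap, you can size it from the rollup. A rough sketch, assuming token counts stay the same on the smaller model and using the same illustrative rates as the pricing table; the call volume and token counts are hypothetical, and `swap_savings` is a made-up helper.

```python
# USD per 1M tokens; illustrative rates, check current pricing.
RATES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
}


def step_cost(model, in_tok, out_tok):
    """Cost in USD of one call at the given token counts."""
    r = RATES[model]
    return (in_tok * r["input"] + out_tok * r["output"]) / 1_000_000


def swap_savings(calls, in_tok, out_tok, current, candidate):
    """Projected USD saved by moving one step to another model,
    assuming token counts do not change after the swap."""
    before = step_cost(current, in_tok, out_tok)
    after = step_cost(candidate, in_tok, out_tok)
    return calls * (before - after)


# Hypothetical routing step: 100,000 calls a week,
# 2,000 tokens in and 100 tokens out per call.
saved = swap_savings(
    100_000, 2_000, 100,
    current="claude-sonnet-4-6",
    candidate="claude-haiku-4-5",
)
```

With these inputs the step costs $0.0075 per call on the larger model and $0.0025 on the smaller one, so the swap saves about $500 a week. Whether the smaller model still routes correctly is a question for your evals, not for this calculation.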
What the rollup won't tell you
Cost attribution tells you where the money went. It does not tell you whether the money was well spent. A trajectory that costs forty cents and gets the answer right is cheaper than the two-cent one that fails and gets retried by a frustrated user. Token cost is one axis. Whether the agent did the right thing is a separate question, and it needs evals, not accounting. Wire the attribution first, though — you cannot optimize what you cannot attribute, and you cannot attribute what you never tagged.
If you want the full version of this — the complete gen_ai.* attribute set, multi-agent handoff cost rollups, and the sampling rules that keep trace volume affordable — it lives in The AI Engineer's Library. Agents in Production covers building and shipping the agent loop these spans wrap; Observability for LLM Applications covers the tracing, evals, and cost accounting you hang off it. The two are meant to be read together.
