The "stochastic tax" framing (arXiv:2605.27320, this week) splits agent cost into a one-time design debt and a per-run tax (retries, eval/judge calls, guardrail checks, escalations, revalidation). Most dashboards only show the token line. Here's a tiny, runnable way to split the two from OpenTelemetry GenAI spans you're probably already emitting.
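Assume each LLM call is a span carrying gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, a model name, a task_id, and a span_role attribute that you set to one of: primary, retry, judge, guardrail, escalation, revalidation. The code below reads these from one flat dict per span, with the token counts under input_tokens and output_tokens and the model mapped to a pricing tier under model_tier. (If you don't tag roles yet, that's the first fix: you can't attribute a tax you don't label.)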
from collections import defaultdict

# price per 1K tokens (input, output); fill in your real numbers
PRICES = {
    "small": (0.00015, 0.0006),
    "frontier": (0.003, 0.015),
}

TAX_ROLES = {"retry", "judge", "guardrail", "escalation", "revalidation"}


def call_cost(span):
    pin, pout = PRICES[span["model_tier"]]
    return (span["input_tokens"] / 1000) * pin + (span["output_tokens"] / 1000) * pout


def split_by_task(spans):
    token_line = defaultdict(float)  # the "primary" call cost
    tax_line = defaultdict(float)    # everything that exists to keep it in bounds
    for s in spans:
        c = call_cost(s)
        if s["span_role"] in TAX_ROLES:
            tax_line[s["task_id"]] += c
        else:  # primary
            token_line[s["task_id"]] += c
    return token_line, tax_line


def report(spans):
    token_line, tax_line = split_by_task(spans)
    print(f"{'task':<10}{'token$':>10}{'tax$':>10}{'tax/total':>12}")
    for t in sorted(set(token_line) | set(tax_line)):
        tok, tax = token_line[t], tax_line[t]
        ratio = tax / (tok + tax) if (tok + tax) else 0
        print(f"{t:<10}{tok:>10.4f}{tax:>10.4f}{ratio:>11.0%}")
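The functions above read flat keys (input_tokens, output_tokens, model_tier), while exported OTel spans usually carry namespaced attributes. Here is a minimal adapter sketch, assuming each exported span is a plain dict with an "attributes" mapping and that the model name lives in gen_ai.request.model. The MODEL_TIERS table and its model names are hypothetical placeholders, not real models.

```python
# Hypothetical mapping from model name to pricing tier; replace with your own.
MODEL_TIERS = {
    "my-small-model": "small",
    "my-frontier-model": "frontier",
}


def flatten_span(otel_span):
    """Convert one exported span dict into the flat shape the splitter reads."""
    attrs = otel_span["attributes"]
    return {
        "task_id": attrs["task_id"],
        # Untagged spans fall back to "primary", which hides tax.
        # Prefer tagging at the source over relying on this default.
        "span_role": attrs.get("span_role", "primary"),
        # Raises KeyError for an unmapped model, so pricing gaps surface early.
        "model_tier": MODEL_TIERS[attrs["gen_ai.request.model"]],
        "input_tokens": attrs.get("gen_ai.usage.input_tokens", 0),
        "output_tokens": attrs.get("gen_ai.usage.output_tokens", 0),
    }
```

With that in place, something like report([flatten_span(s) for s in exported_spans]) should work, where exported_spans is whatever list your exporter gives you.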
Feed it your exported spans and sort by tax/total. The tasks at the top are where a cheaper model will NOT help — they're tax-dominated (too many retries/escalations), and the fix is removing decisions, not swapping weights. BRANE (arXiv:2605.27361) is the research version of this move: per-query config selection that hit the same accuracy at up to 89% lower cost.
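The report prints tasks in task-id order, so the ranking step is left to you. A small sketch of that sort, assuming you already have the two per-task dicts that split_by_task returns; rank_by_tax_ratio is a name made up for this example and the costs are invented for illustration.

```python
def rank_by_tax_ratio(token_line, tax_line):
    """Return (task_id, token_cost, tax_cost, ratio) rows, highest ratio first."""
    rows = []
    for task in set(token_line) | set(tax_line):
        tok = token_line.get(task, 0.0)
        tax = tax_line.get(task, 0.0)
        total = tok + tax
        ratio = tax / total if total else 0.0
        rows.append((task, tok, tax, ratio))
    # Highest tax share first; ties broken by task id for stable output.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows


# Invented costs, for illustration only.
token_line = {"t1": 0.010, "t2": 0.020}
tax_line = {"t1": 0.030, "t2": 0.005}

for task, tok, tax, ratio in rank_by_tax_ratio(token_line, tax_line):
    print(f"{task:<10}{tok:>10.4f}{tax:>10.4f}{ratio:>11.0%}")
```

Here t1 lands first because three quarters of its spend is tax, even though t2 has the larger primary-call cost.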
Next steps if you want to go further: emit span_role from your agent framework, push these two series to your metrics backend as agent.cost.token and agent.cost.tax, and alert on tax/total crossing a threshold per agent. I'm building this as a module in FerrumDeck (agent control plane); happy to compare span schemas if you're doing the same.
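The alerting step can start as a plain threshold check before it becomes a rule in your metrics backend. A rough sketch, assuming you have already aggregated token and tax cost per agent; tax_rate_breaches is a hypothetical helper and the 0.5 default is an arbitrary placeholder, not a recommended value.

```python
def tax_rate_breaches(costs_by_agent, threshold=0.5):
    """Return {agent: tax_rate} for agents whose tax/total exceeds threshold.

    costs_by_agent maps agent name -> (token_cost, tax_cost).
    """
    breaches = {}
    for agent, (tok, tax) in costs_by_agent.items():
        total = tok + tax
        rate = tax / total if total else 0.0
        if rate > threshold:
            breaches[agent] = rate
    return breaches


# Invented numbers, for illustration only.
costs = {"support-agent": (1.0, 3.0), "search-agent": (3.0, 1.0)}
for agent, rate in tax_rate_breaches(costs).items():
    print(f"ALERT {agent}: tax/total = {rate:.0%}")
```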
Repo / span schema: name it in the comments and I'll share the OTel GenAI attribute set I use.
Top comments (1)
This framing is the unlock for agent cost work. The "stochastic tax" name is going to stick — it's the first time I've seen a name for the cost that only exists to keep the primary call honest.
Two practical things I'd add from running similar splits in production:
1) The taxonomy (primary/retry/judge/guardrail/escalation/revalidation) is right, but the boundaries blur in real traces. A "judge" call that's part of the primary path looks identical to a "judge" call that exists to validate a retry. The split is the intent at the moment of emission, not the structural position in the trace. Worth making the tag assignment manual at the call site (or a tight wrapper) rather than retroactive from a heuristic; retroactive splitting will silently misclassify.
2) Once you have the split, the most useful follow-up is the tax rate as a first-class SLO. "X% of total agent cost is tax" is a single number that an on-call can alert on. When the rate climbs (model drift, prompt regression, new guardrails), the on-call can grep the OTel tags and find the offender in minutes. Without that single number, the cost just looks like "agents are expensive" and nothing changes.
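For point 1, the "tight wrapper" could look roughly like this. It is only a sketch: tagged_call, the sink list, and the shape of the value returned by the LLM call (a dict with model_tier, input_tokens, output_tokens) are all assumptions made for illustration, not any framework's API.

```python
VALID_ROLES = {"primary", "retry", "judge", "guardrail", "escalation", "revalidation"}


def tagged_call(role, task_id, llm_call, *args, sink, **kwargs):
    """Run llm_call and record its usage under an explicit span_role.

    The role is chosen by the caller at the call site, so intent is captured
    when the call is made instead of being guessed from trace structure later.
    """
    if role not in VALID_ROLES:
        raise ValueError(f"unknown span_role: {role!r}")
    result = llm_call(*args, **kwargs)
    sink.append({
        "task_id": task_id,
        "span_role": role,
        "model_tier": result["model_tier"],
        "input_tokens": result["input_tokens"],
        "output_tokens": result["output_tokens"],
    })
    return result


# Stand-in for a real client call; returns fixed, invented usage numbers.
def fake_llm(prompt):
    return {"model_tier": "small", "input_tokens": len(prompt), "output_tokens": 5}


spans = []
tagged_call("primary", "t1", fake_llm, "summarize this", sink=spans)
tagged_call("judge", "t1", fake_llm, "is the summary faithful?", sink=spans)
```

Rejecting unknown roles up front keeps a typo like "retries" from landing in the primary bucket unnoticed.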
The 30-line size is also a real feature — anything that lives inside a vendor's SDK will get forgotten. A tiny splitter you own and can read in a coffee break is the right size for cost observability.
Curious whether the same split works for non-LLM cost in the same agent (vector search, tool calls, sandbox minutes) — those are usually bigger than the LLM cost on real workloads.