Sattyam Jain

Separate your agent's "stochastic tax" from its token bill (a 30-line OTel-span cost splitter)

The "stochastic tax" framing (arXiv:2605.27320, this week) splits agent cost into a one-time design debt and a per-run tax (retries, eval/judge calls, guardrail checks, escalations, revalidation). Most dashboards show only the token line. Here's a tiny, runnable way to separate that per-run tax from the primary token cost, using the OpenTelemetry GenAI spans you're probably already emitting.

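Assume each LLM call is a span carrying gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, a model name, a task_id, and a span_role attribute you set to one of: primary, retry, judge, guardrail, escalation, revalidation. The code below reads each span as a flat dict with the keys input_tokens, output_tokens, model_tier (the model name mapped to a pricing tier), task_id, and span_role. (If you don't tag roles yet, that's the first fix: you can't attribute a tax you don't label.)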
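If your exporter hands you raw attribute dicts rather than the flat shape, a small normalizer bridges the gap. This is a minimal sketch under assumptions: it assumes the model name lives in gen_ai.request.model, that task_id and span_role are custom attributes you set yourself, and the MODEL_TIERS table and its model names are invented placeholders.

```python
# Hypothetical mapping from model name to a pricing tier.
# The model names are placeholders, not real products.
MODEL_TIERS = {
    "example-small-model": "small",
    "example-frontier-model": "frontier",
}

def normalize_span(attrs):
    """Flatten one exported span's attribute dict into the shape the splitter reads."""
    return {
        # Missing usage counters are treated as zero tokens (an assumption).
        "input_tokens": int(attrs.get("gen_ai.usage.input_tokens", 0)),
        "output_tokens": int(attrs.get("gen_ai.usage.output_tokens", 0)),
        # Assumes the model name is stored under gen_ai.request.model.
        "model_tier": MODEL_TIERS[attrs["gen_ai.request.model"]],
        "task_id": attrs["task_id"],
        # A missing role raises KeyError instead of silently counting as primary.
        "span_role": attrs["span_role"],
    }
```

Raising on a missing span_role is deliberate: an untagged span that quietly lands on the token line would understate the tax.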

from collections import defaultdict

# price per 1K tokens (input, output) — fill in your real numbers
PRICES = {
    "small": (0.00015, 0.0006),
    "frontier": (0.003, 0.015),
}
TAX_ROLES = {"retry", "judge", "guardrail", "escalation", "revalidation"}

def call_cost(span):
    pin, pout = PRICES[span["model_tier"]]
    return (span["input_tokens"] / 1000) * pin + (span["output_tokens"] / 1000) * pout

def split_by_task(spans):
    token_line = defaultdict(float)   # the "primary" call cost
    tax_line = defaultdict(float)     # everything that exists to keep it in bounds
    for s in spans:
        c = call_cost(s)
        if s["span_role"] in TAX_ROLES:
            tax_line[s["task_id"]] += c
        else:  # primary
            token_line[s["task_id"]] += c
    return token_line, tax_line

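def report(spans):
    token_line, tax_line = split_by_task(spans)
    print(f"{'task':<10}{'token$':>10}{'tax$':>10}{'tax/total':>12}")
    rows = []
    for t in set(token_line) | set(tax_line):
        tok, tax = token_line[t], tax_line[t]
        ratio = tax / (tok + tax) if (tok + tax) else 0
        rows.append((ratio, t, tok, tax))
    # highest tax share first, so tax-dominated tasks land at the top
    for ratio, t, tok, tax in sorted(rows, key=lambda r: (-r[0], r[1])):
        print(f"{t:<10}{tok:>10.4f}{tax:>10.4f}{ratio:>12.0%}")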

Feed it your exported spans and sort by tax/total. The tasks at the top are where a cheaper model will NOT help — they're tax-dominated (too many retries/escalations), and the fix is removing decisions, not swapping weights. BRANE (arXiv:2605.27361) is the research version of this move: per-query config selection that hit the same accuracy at up to 89% lower cost.
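To make the "cheaper model will not help" point concrete, here is a small worked example. The task and its token counts are invented for illustration, the prices are the placeholder numbers from the table above, and it assumes the tax calls keep their current models when the primary call is swapped.

```python
# Per-1K-token placeholder prices from the post: (input, output).
PRICES = {"small": (0.00015, 0.0006), "frontier": (0.003, 0.015)}

def cost(tier, tokens_in, tokens_out):
    """Dollar cost of one call at the given tier."""
    price_in, price_out = PRICES[tier]
    return tokens_in / 1000 * price_in + tokens_out / 1000 * price_out

# Invented task: one primary call, three retries of it, two small-tier judge calls.
primary = cost("frontier", 2000, 500)
retries = 3 * cost("frontier", 2000, 500)
judges = 2 * cost("small", 3000, 200)

total = primary + retries + judges
tax_share = (retries + judges) / total

# Option A: keep every tax call, move only the primary call to the small tier.
saved_by_cheaper_primary = primary - cost("small", 2000, 500)
# Option B: keep the frontier model, remove a single retry.
saved_by_one_fewer_retry = cost("frontier", 2000, 500)

print(f"tax share:            {tax_share:.0%}")
print(f"cheaper primary saves {saved_by_cheaper_primary / total:.0%} of total")
print(f"one fewer retry saves {saved_by_one_fewer_retry / total:.0%} of total")
```

With these made-up numbers the tax share is about 76%, so even a free primary call could save at most about 24% of the total. Swapping the primary call to the small tier saves about 23%, slightly less than removing a single retry.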

Next steps if you want to go further: emit span_role from your agent framework, push these two series to your metrics backend as agent.cost.token and agent.cost.tax, and alert on tax/total crossing a threshold per agent. I'm building this as a module in FerrumDeck (agent control plane); happy to compare span schemas if you're doing the same.
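A rough sketch of the last two steps, kept to the standard library: it turns per-agent totals into points for the two series and flags agents whose tax/total exceeds a threshold. The per-agent grouping, the input shape, and the 0.5 default threshold are assumptions for illustration; actually shipping the points to a metrics backend is left out.

```python
def metric_points(costs):
    """Yield (metric_name, agent, value) tuples for the two cost series.

    costs maps an agent name to (token_cost, tax_cost) in dollars.
    """
    for agent, (token_cost, tax_cost) in sorted(costs.items()):
        yield ("agent.cost.token", agent, token_cost)
        yield ("agent.cost.tax", agent, tax_cost)

def tax_rate_breaches(costs, threshold=0.5):
    """Return (agent, tax_rate) pairs above the threshold, worst first."""
    breaches = []
    for agent, (token_cost, tax_cost) in costs.items():
        total = token_cost + tax_cost
        rate = tax_cost / total if total else 0.0  # agents with no spend never alert
        if rate > threshold:
            breaches.append((agent, rate))
    return sorted(breaches, key=lambda b: b[1], reverse=True)

# Made-up per-agent totals for illustration.
costs = {"triage": (0.02, 0.06), "search": (0.05, 0.01), "idle": (0.0, 0.0)}
for agent, rate in tax_rate_breaches(costs):
    print(f"ALERT {agent}: tax is {rate:.0%} of total cost")
```

A fixed threshold is the simplest choice; a per-agent baseline may fit better if some agents are guardrail-heavy by design.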

Repo / span schema: name it in the comments and I'll share the OTel GenAI attribute set I use.

Top comments (1)

Echo

This framing is the unlock for agent cost work. The "stochastic tax" name is going to stick — it's the first time I've seen a name for the cost that only exists to keep the primary call honest.

Two practical things I'd add from running similar splits in production:

1) The taxonomy (primary / retry / judge / guardrail / escalation / revalidation) is right, but the boundaries blur in real traces. A "judge" call that's part of the primary path looks identical to a "judge" call that exists to validate a retry. The split is the intent at the moment of emission, not the structural position in the trace. Worth making the tag assignment manual at the call site (or a tight wrapper) rather than retroactive from a heuristic — retroactive splitting will silently misclassify.

2) Once you have the split, the most useful follow-up is the tax rate as a first-class SLO. "X% of total agent cost is tax" is a single number that an on-call can alert on. When the rate climbs (model drift, prompt regression, new guardrails), the on-call can grep the OTel tags and find the offender in minutes. Without that single number, the cost just looks like "agents are expensive" and nothing changes.
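The "tight wrapper" from point 1 could look something like the sketch below. SpanRole and tagged_call are hypothetical names, and set_attribute stands in for whatever two-argument setter your tracing library exposes on the current span (for example a span's set_attribute method).

```python
from enum import Enum

class SpanRole(str, Enum):
    """The roles from the post's taxonomy."""
    PRIMARY = "primary"
    RETRY = "retry"
    JUDGE = "judge"
    GUARDRAIL = "guardrail"
    ESCALATION = "escalation"
    REVALIDATION = "revalidation"

def tagged_call(llm_call, *, role, task_id, set_attribute):
    """Run llm_call() after tagging the span with the caller's declared intent.

    role is keyword-only so every call site has to state it explicitly.
    """
    role = SpanRole(role)  # raises ValueError on an unknown role
    set_attribute("span_role", role.value)
    set_attribute("task_id", task_id)
    return llm_call()

# Usage with a plain dict standing in for a span's attributes.
attrs = {}
result = tagged_call(lambda: "ok", role="judge", task_id="t1",
                     set_attribute=attrs.__setitem__)
```

Because the role is validated at emission time, a typo fails at the call site rather than surfacing later as a misclassified cost line.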

The 30-line size is also a real feature — anything that lives inside a vendor's SDK will get forgotten. A tiny splitter you own and can read in a coffee break is the right size for cost observability.

Curious whether the same split works for non-LLM cost in the same agent (vector search, tool calls, sandbox minutes) — those are usually bigger than the LLM cost on real workloads.