Token-Cost Attribution From Traces: Per-Feature LLM Spend Without a Rewrite

#llm #observability #costs #tutorial

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Me: xgabriel.com | GitHub

Finance asks the question every team eventually hears: "The OpenAI bill was $38,000 last month. What is it paying for?" You open the provider dashboard. It shows you spend by API key and by model. It does not show you that 70% of it came from one summarization feature that three customers use, or that a single enterprise tenant accounts for half the bill.

The provider cannot answer that question. Only you can, because only you know which span was the chat feature and which was the nightly batch job. The good news: if you are already tracing your LLM calls, the data is mostly there. You need two attributes and a query.

You already have the hard part

A typical instrumented LLM call already carries token usage. The OpenTelemetry GenAI conventions give you gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on the span. Most SDK auto-instrumentation sets them for you. That is the expensive part of cost accounting, and you have it.

What you are missing is the dimension to group by. A token count with no feature label rolls up to one number: total spend. To answer "which feature, which tenant," you attach two more attributes at emit time:

app.feature        e.g. "summarize", "chat", "rerank"
app.tenant_id      stable customer/org id

The app.* prefix keeps your custom attributes out of the gen_ai.* namespace, which is reserved for the spec. Datadog, Grafana, Honeycomb, and Langfuse all ingest custom span attributes; what matters is that every service spells the keys the same way, so one query catches them all.

Attach the labels without touching every call site

The instinct is to pass feature down through every function until it reaches the LLM call. That is the rewrite you do not want. Use span attributes set on the active span instead, and set the feature once at the boundary where you know it: the request handler, the job entry point, the route.

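OpenTelemetry context propagates down the call tree, so every LLM span started inside the handler becomes a child of the feature span and shares its trace ID. Span attributes themselves are not inherited: a child span cannot read what its parent set. The label still only needs to be set once, because the child's cost can be tied back to it through the trace.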

from opentelemetry import trace

tracer = trace.get_tracer("app.llm")


def handle_summarize(req, tenant_id):
    with tracer.start_as_current_span(
        "feature.summarize"
    ) as span:
        span.set_attribute("app.feature", "summarize")
        span.set_attribute("app.tenant_id", tenant_id)
        return run_summary(req)  # LLM calls live in here

The LLM call deeper in run_summary does not need to know about the feature. It emits its own span with token usage as usual. At query time you join child spans to the parent's app.feature by trace, or you copy the label down once at emit time:

def emit_llm_span(model, usage, feature, tenant_id):
    with tracer.start_as_current_span("gen_ai.chat") as s:
        s.set_attribute("gen_ai.request.model", model)
        s.set_attribute(
            "gen_ai.usage.input_tokens", usage["in"]
        )
        s.set_attribute(
            "gen_ai.usage.output_tokens", usage["out"]
        )
        s.set_attribute("app.feature", feature)
        s.set_attribute("app.tenant_id", tenant_id)
        cost = usd(model, usage["in"], usage["out"])
        s.set_attribute("app.llm.cost_usd", cost)
        return cost

Copying the label onto the LLM span itself is the pragmatic choice. It costs one attribute write and makes every query a flat group by instead of a trace-level join. With a handful of call sites, pass the label in as an argument, as above. With a wide fleet of features, carry it in context so the emit helper can pick it up without any function signature changing.
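Carrying the label through context could look like the sketch below. It uses Python's standard contextvars module rather than any OpenTelemetry API, so it runs anywhere; OpenTelemetry baggage plays a similar role when the label has to cross service boundaries. The names feature_scope, current_labels, and the "unattributed" default are hypothetical, chosen for illustration.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical context variables; the names are illustrative only.
# The defaults make unlabeled calls visible instead of dropping them.
_feature = contextvars.ContextVar("app_feature", default="unattributed")
_tenant = contextvars.ContextVar("app_tenant_id", default="unknown")


@contextmanager
def feature_scope(feature: str, tenant_id: str):
    """Set the labels at the boundary and restore the old values on exit."""
    feature_token = _feature.set(feature)
    tenant_token = _tenant.set(tenant_id)
    try:
        yield
    finally:
        _tenant.reset(tenant_token)
        _feature.reset(feature_token)


def current_labels() -> dict:
    """Read the labels deep in the call tree with no arguments threaded."""
    return {
        "app.feature": _feature.get(),
        "app.tenant_id": _tenant.get(),
    }


def run_summary() -> dict:
    # Stands in for the LLM call site. A real one would copy these
    # values onto its span with set_attribute.
    return current_labels()


with feature_scope("summarize", "tenant-42"):
    inside = run_summary()

outside = run_summary()
```

Inside the scope the call site sees the handler's labels; outside it falls back to the defaults, which gives unlabeled spend its own row in the report.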

Turn tokens into dollars at emit time

Do not wait for the provider's billing endpoint. It settles too slowly to drive a report and gives you no per-span breakdown. Compute the cost yourself from a price table and store it as app.llm.cost_usd.

import logging

log = logging.getLogger("app.llm.cost")

# USD per 1K tokens (input, output). Example values only,
# not authoritative. Verify against the provider's
# pricing page and date your table.
COSTS = {
    "gpt-4o-2024-11-20": (0.0025, 0.0100),
    "gpt-4o-mini": (0.00015, 0.00060),
    "claude-sonnet-4-5": (0.003, 0.015),
}


def usd(model: str, in_tok: int, out_tok: int) -> float:
    if model not in COSTS:
        # Unpriced model: record zero, but say so loudly.
        log.warning("no price for model %s; cost set to 0", model)
        return 0.0
    cin, cout = COSTS[model]
    return (in_tok / 1000) * cin + (out_tok / 1000) * cout

Two notes. Date the table and treat it as config, not code, so a price change is a one-line edit and not a deploy. And handle the unknown-model case by returning zero plus logging a warning. A model you forgot to price should show up as a visible gap in the report, not a silent miss.
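One way to treat the table as dated config is to keep it in a small JSON document and check its age when it loads. The sketch below is an assumption about how that might look, not a prescribed format: the as_of and usd_per_1k keys, the load_prices and table_age_days helpers, and the date are all hypothetical, and the price is an example value.

```python
import json
from datetime import date

# Hypothetical config payload. In practice this would come from a file
# or a config service. Prices are example values in USD per 1K tokens.
PRICE_CONFIG = """
{
  "as_of": "2025-01-15",
  "usd_per_1k": {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006}
  }
}
"""


def load_prices(raw: str):
    """Parse the config into (as_of date, {model: (input, output)})."""
    doc = json.loads(raw)
    as_of = date.fromisoformat(doc["as_of"])
    table = {
        model: (price["input"], price["output"])
        for model, price in doc["usd_per_1k"].items()
    }
    return as_of, table


def table_age_days(as_of: date, today: date) -> int:
    """How many days old the price table is on the given day."""
    return (today - as_of).days


as_of, table = load_prices(PRICE_CONFIG)
age = table_age_days(as_of, date(2025, 3, 1))
```

An age past whatever threshold you pick is a reasonable trigger for a warning, since a stale table skews every number downstream.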

The per-feature spend report

With app.feature, app.tenant_id, and app.llm.cost_usd on every LLM span, the report is a group by.

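Datadog metrics query: spend per feature over the trailing day, assuming the cost is also emitted as a metric named app.llm.cost_usd and tagged with app.feature: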

sum:app.llm.cost_usd{*} by {app.feature}
  .rollup(sum, 86400)

PromQL: the same report, if you export the cost as a counter through the OTel Collector. Prometheus naming typically turns the dots into underscores and appends _total to counters:

sum by (app_feature) (
  increase(app_llm_cost_usd_total[24h])
)

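SQL: if your spans land in a warehouse, this is the report finance actually wants. The syntax below is ClickHouse; BigQuery and other warehouses need their own map-access and cast functions, and your table and column names may differ: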

SELECT
  attributes['app.feature']   AS feature,
  attributes['app.tenant_id'] AS tenant,
  round(sum(toFloat64(
    attributes['app.llm.cost_usd'])), 2) AS spend_usd,
  count()                     AS calls
FROM otel_spans
WHERE span_name = 'gen_ai.chat'
  AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY feature, tenant
ORDER BY spend_usd DESC

That last query answers the question finance asked. Spend by feature, sliced by tenant, ranked. The summarization feature with three users shows up at the top, and the enterprise tenant eating half the bill has a row with its name on it.

What the report tells you that the bill cannot

A per-feature breakdown changes the conversations you can have.

You can find the feature whose unit economics are upside down — the one costing more per call than the plan it sits behind charges. You can spot a tenant whose usage pattern is abusive or just miscalibrated, and price or rate-limit them on evidence instead of a hunch. You can see which feature a model downgrade would actually save money on, rather than guessing and downgrading the wrong one.

You can also catch a regression. A prompt change that doubled input tokens on one feature is invisible in the total bill if traffic dipped that week. It is obvious in a per-feature trend line. Group your report by day and feature, and a step change in cost-per-call points straight at the deploy that caused it.
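The step-change check can be sketched in a few lines once the rows are out of the warehouse. The rows, the 1.5x threshold, and the helper names cost_per_call and step_changes are all hypothetical; treat this as an illustration of the idea rather than a tuned detector.

```python
from collections import defaultdict

# Hypothetical rows, one per LLM call: (day, feature, cost_usd).
rows = [
    ("2025-03-01", "summarize", 0.010),
    ("2025-03-01", "summarize", 0.012),
    ("2025-03-02", "summarize", 0.022),
    ("2025-03-02", "summarize", 0.024),
    ("2025-03-01", "chat", 0.004),
    ("2025-03-02", "chat", 0.004),
]


def cost_per_call(rows):
    """Average cost per call, keyed by (feature, day)."""
    totals = defaultdict(lambda: [0.0, 0])
    for day, feature, cost in rows:
        acc = totals[(feature, day)]
        acc[0] += cost
        acc[1] += 1
    return {key: total / calls for key, (total, calls) in totals.items()}


def step_changes(per_call, threshold=1.5):
    """Flag (feature, day, ratio) where cost per call grew by at least
    `threshold` times versus the previous day that has data."""
    by_feature = defaultdict(list)
    for (feature, day), value in per_call.items():
        by_feature[feature].append((day, value))

    flagged = []
    for feature, series in by_feature.items():
        series.sort()  # ISO dates sort correctly as strings
        for (_, prev), (day, cur) in zip(series, series[1:]):
            if prev > 0 and cur / prev >= threshold:
                flagged.append((feature, day, round(cur / prev, 2)))
    return flagged


flagged = step_changes(cost_per_call(rows))
```

Here the summarize feature is flagged on the second day because its cost per call roughly doubled, while chat stays flat and is left alone. Dividing by call count is what keeps a traffic dip from hiding the jump.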

The honest limits

Two things this approach will not do, so you size your expectations right.

It will not reconcile to the penny against the provider invoice. Your price table lags real pricing, the provider rounds differently, and cached or batched tokens may be billed at rates your table does not model. Treat the report as a faithful relative breakdown (feature A costs roughly 3x feature B), not an accounting ledger.
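If finance needs the rows to add up to the invoice, one option is to keep the estimated shares and scale them to the billed total. This is a sketch of that proportional allocation under the assumption that estimation error is spread evenly across features, which may not hold when one feature leans heavily on cached or batched tokens. The allocate_invoice helper and the figures are hypothetical.

```python
def allocate_invoice(estimated: dict, invoice_total: float) -> dict:
    """Split the billed total across features by estimated share."""
    estimated_total = sum(estimated.values())
    if estimated_total <= 0:
        raise ValueError("no estimated spend to allocate against")
    return {
        feature: round(invoice_total * cost / estimated_total, 2)
        for feature, cost in estimated.items()
    }


# Hypothetical month: the trace-based estimate came to $1,000,
# while the provider billed $1,100.
estimated = {"summarize": 600.0, "chat": 300.0, "rerank": 100.0}
allocated = allocate_invoice(estimated, 1100.0)
```

The ratios between features stay exactly as the traces reported them; only the scale changes to match the invoice.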

And it only covers calls you instrument. A cron job or a worker that calls the model outside your traced paths is a blind spot. The fix is the same boundary discipline as everywhere else: every entry point that can reach the model sets app.feature before it does. Once that habit holds, the report covers your whole spend and the blind spots close.

None of this is a re-architecture. Two attributes, a price table, and a group by. You can ship it this week and hand finance a real answer the next time they ask what the bill is paying for.

If this was useful

Per-feature cost attribution is one slice of a larger discipline: knowing what your LLM system is doing well enough to bill, alert, and debug it. The LLM Observability Pocket Guide covers the full GenAI attribute set, what a healthy span looks like, and how to keep cost and quality queries from going stale every time a model version rotates.