Per-Customer LLM Cost Reports (Without Rearchitecting Your Billing Pipeline)

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Finance walks over. "How much did Acme Corp cost us in LLM calls last month?" You open your tracing UI. Every span has a request_id. None of them have a customer_id. The product engineer who wired tracing tagged it the way the SDK example tagged it: per request. Customer attribution was Future You's problem. Now you're Future You.

The instinct is to rebuild. Add a customer_id field to every span emission site, redeploy ten services, sync the schema with the data team, ship a migration to the warehouse. Six weeks of work. Finance needs the number tomorrow.

You don't need to rebuild. You need OpenTelemetry baggage, a versioned price table, and one aggregation worker. Maybe 80 lines once you cut the imports.

Why "tag every span with customer_id" is harder than it sounds

The naive answer is: pass customer_id as an argument everywhere the LLM client is called. It dies the moment you actually look at the codebase.

Your synchronous HTTP path knows the customer. It's on the auth context. Easy. The background worker that fires off a summary job 30 seconds later? It gets a job ID, dequeues from Redis, and calls the same LLM client. The customer is two hops away. You have to thread it through the job payload. Now do the same for the retry handler, the prefetcher that warms a cache, the scheduled embedding job, the webhook fan-out, and the eval runner that replays prompts in CI.

Every one of those call sites is a chance to forget. The forgotten ones go into a bucket called "shared" or just don't get tagged at all. Six months later that bucket is 40% of your spend and finance asks why.

The fix is to stop passing it explicitly. Set it once at the boundary. Have the framework propagate it. That's what baggage is for.

OTel baggage as the propagation layer

Baggage in OpenTelemetry is a context-bound key-value bag that rides alongside the trace. Anything in baggage gets carried through every span created in that context, across thread boundaries, across async boundaries, and across process boundaries if you propagate it over the wire.

You set the customer ID once, at the request handler, at the job dequeue, at the cron tick. Then forget about it. Every span the LLM client emits while that context is active inherits the value as an attribute, because you wired a span processor to copy it across.

from opentelemetry import baggage, context, trace
from opentelemetry.sdk.trace import SpanProcessor

class BaggageToAttributesProcessor(SpanProcessor):
    # whitelist the baggage keys you want promoted to span attrs.
    # promoting blindly is a privacy footgun.
    KEYS = ("customer_id", "tenant_id", "billing_account_id")

    def on_start(self, span, parent_context=None):
        # an explicit empty Context is falsy, so compare against None
        ctx = parent_context if parent_context is not None else context.get_current()
        for key in self.KEYS:
            value = baggage.get_baggage(key, ctx)
            if value is not None:
                span.set_attribute(f"app.{key}", value)

    def on_end(self, span): pass
    def shutdown(self): pass
    def force_flush(self, timeout_millis=30_000): return True

ctx = baggage.set_baggage("customer_id", "cus_8H2k...")
token = context.attach(ctx)
try:
    # any LLM span created from here inherits app.customer_id
    response = llm_client.messages.create(...)
finally:
    context.detach(token)

For HTTP and gRPC, the W3C baggage header propagator carries the value to downstream services automatically. For your job queue, you serialize the baggage into the job payload at enqueue and rehydrate it at dequeue. Eight lines for an enqueue wrapper, eight for the dequeue side. You write it once.

The key discipline: set baggage only at boundaries. The HTTP middleware. The job consumer. The cron entry point. Never inside business logic. If you find yourself calling set_baggage mid-function, you've drifted back into the per-call-site mess you were trying to escape.

The price-table SQL, versioned for replay-safe invoicing

Tagging tells you which customer made a call. It doesn't tell you what the call costs. Costs move. Providers cut prices. You switch models. A new cache discount lands. If you compute cost as tokens * current_price at query time, your last quarter's invoices change every time pricing changes. That's not an invoice. That's a guess.

You need a versioned price table. Every row covers one (model, valid_from, valid_to) interval and contains the per-million-token rate broken out by direction and cache state.

CREATE TABLE llm_price_book (
    model               TEXT       NOT NULL,
    valid_from          TIMESTAMPTZ NOT NULL,
    valid_to            TIMESTAMPTZ NOT NULL,
    input_per_mtok_usd          NUMERIC(10, 6) NOT NULL,
    output_per_mtok_usd         NUMERIC(10, 6) NOT NULL,
    cache_write_per_mtok_usd    NUMERIC(10, 6) NOT NULL,
    cache_read_per_mtok_usd     NUMERIC(10, 6) NOT NULL,
    PRIMARY KEY (model, valid_from)
);

-- example row: Claude pricing snapshot as of Q1 2026
INSERT INTO llm_price_book VALUES (
    'claude-opus-4-7',
    '2026-01-01 00:00:00+00',
    '2999-12-31 23:59:59+00',
    15.00, 75.00, 18.75, 1.50
);

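valid_to of the current row points at the year-2999 sentinel until the next price ships, at which point you UPDATE that row's valid_to and INSERT a new one, both in the same transaction. One caveat: the primary key on (model, valid_from) only stops two rows for a model from starting at the same instant. It does not reject overlapping intervals. In PostgreSQL, an exclusion constraint over model and a tstzrange built from valid_from and valid_to (with the btree_gist extension) enforces that.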

Joining usage to price is then a range lookup keyed by the span's wall-clock start time, not by now:

SELECT
    u.customer_id,
    u.ts,
    u.model,
    u.input_tokens,
    u.output_tokens,
    u.cache_read_tokens,
    u.cache_write_tokens,
    (u.input_tokens       * p.input_per_mtok_usd
   + u.output_tokens      * p.output_per_mtok_usd
   + u.cache_write_tokens * p.cache_write_per_mtok_usd
   + u.cache_read_tokens  * p.cache_read_per_mtok_usd) / 1e6
        AS cost_usd
FROM llm_usage_events u
JOIN llm_price_book p
  ON p.model = u.model
 AND u.ts >= p.valid_from
 AND u.ts <  p.valid_to;

Run this for any historical date and you get the cost the customer actually accrued at the price the model actually charged on that day. Re-run it next year and you get the same answer. That's what "replay-safe" means: invoices are stable artifacts, not function-of-now reports.
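
To see the replay-safety property in action, here's a small sketch using Python's built-in sqlite3 in place of the warehouse. The model name, the rates, and the helper names publish_price and input_cost are all invented for the example, the table is trimmed to one rate column, and timestamps are ISO-8601 strings in a single format so that string comparison matches time order.

```python
import sqlite3
from decimal import Decimal

SENTINEL = "2999-12-31T23:59:59Z"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_price_book (
        model              TEXT NOT NULL,
        valid_from         TEXT NOT NULL,
        valid_to           TEXT NOT NULL,
        input_per_mtok_usd TEXT NOT NULL,
        PRIMARY KEY (model, valid_from)
    )
""")


def publish_price(conn, model, effective_from, input_rate):
    """Close the open row at effective_from and open a new one."""
    with conn:  # one transaction: both statements land or neither does
        conn.execute(
            "UPDATE llm_price_book SET valid_to = ? "
            "WHERE model = ? AND valid_to = ?",
            (effective_from, model, SENTINEL),
        )
        conn.execute(
            "INSERT INTO llm_price_book VALUES (?, ?, ?, ?)",
            (model, effective_from, SENTINEL, input_rate),
        )


def input_cost(conn, model, ts, input_tokens):
    """Price tokens at the rate in force at ts, not the rate in force now."""
    row = conn.execute(
        "SELECT input_per_mtok_usd FROM llm_price_book "
        "WHERE model = ? AND ? >= valid_from AND ? < valid_to",
        (model, ts, ts),
    ).fetchone()
    return Decimal(row[0]) * input_tokens / Decimal(1_000_000)


publish_price(conn, "example-model", "2026-01-01T00:00:00Z", "10.00")
before = input_cost(conn, "example-model", "2026-02-15T12:00:00Z", 2_000_000)

# A price cut ships on March 1st. February usage must not move.
publish_price(conn, "example-model", "2026-03-01T00:00:00Z", "8.00")
after = input_cost(conn, "example-model", "2026-02-15T12:00:00Z", 2_000_000)
march = input_cost(conn, "example-model", "2026-03-02T00:00:00Z", 2_000_000)
```

The February figure is identical before and after the price change, and only March usage picks up the new rate.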

Gross vs net cost: the 2-number invoice pattern

There's one more wrinkle. Prompt caching changes the math. If you send the same 30k-token system prompt on every request, the first call writes the cache at 1.25x base rate and every subsequent call reads it at 0.1x. Gross cost is what the customer's traffic would have cost without caching, the number you might want to bill on. Net cost is what you actually paid the provider, the number that hits your bank account.

These are different numbers and they serve different audiences.

Finance wants net for the books. Customer Success wants gross for value conversations ("look how much we're saving you"). Sales wants both — gross for the headline, net for the margin model. Pick one and you'll get questioned forever. Report both and the conversation moves on.

SELECT
    customer_id,
    DATE_TRUNC('day', ts) AS day,
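    -- gross: every prompt token at the full input rate, as if caching were off
    -- (assumes input_tokens counts only uncached tokens; cached ones are in cache_*)
    SUM((input_tokens + cache_read_tokens + cache_write_tokens)
                           * p.input_per_mtok_usd
      + output_tokens      * p.output_per_mtok_usd) / 1e6
        AS gross_cost_usd,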
    SUM(input_tokens       * p.input_per_mtok_usd
      + output_tokens      * p.output_per_mtok_usd
      + cache_write_tokens * p.cache_write_per_mtok_usd
      + cache_read_tokens  * p.cache_read_per_mtok_usd) / 1e6
        AS net_cost_usd
FROM llm_usage_events u
JOIN llm_price_book p
  ON p.model = u.model AND u.ts >= p.valid_from AND u.ts < p.valid_to
GROUP BY customer_id, DATE_TRUNC('day', ts);

Gross prices every prompt token, cached or not, at the full input rate, as if caching were off. Net prices cache writes and reads at their own rates, which is what the provider actually bills. The delta between the two columns is the caching saving, which is the number you put on the slide when the customer asks about pricing efficiency.
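
A quick worked example makes the gap concrete. This sketch uses the rates from the example price-book row and a made-up traffic shape: 100 requests sharing a 30k-token system prompt, each with 500 uncached input tokens and 200 output tokens. It assumes input_tokens counts only uncached tokens, with cache reads and writes reported separately. If your provider folds cached tokens into the input count, the gross formula changes.

```python
from decimal import Decimal

MTOK = Decimal(1_000_000)

# USD per million tokens, copied from the example price-book row.
RATES = {
    "input": Decimal("15.00"),
    "output": Decimal("75.00"),
    "cache_write": Decimal("18.75"),
    "cache_read": Decimal("1.50"),
}


def gross_cost(u):
    """Cost as if caching did not exist: all prompt tokens at the input rate."""
    prompt = u["input_tokens"] + u["cache_read_tokens"] + u["cache_write_tokens"]
    return (prompt * RATES["input"] + u["output_tokens"] * RATES["output"]) / MTOK


def net_cost(u):
    """Cost as billed: cache writes and reads priced at their own rates."""
    return (
        u["input_tokens"] * RATES["input"]
        + u["output_tokens"] * RATES["output"]
        + u["cache_write_tokens"] * RATES["cache_write"]
        + u["cache_read_tokens"] * RATES["cache_read"]
    ) / MTOK


# Hypothetical day of traffic: the first request writes the 30k-token
# prompt to the cache, the other 99 read it back.
usage = {
    "input_tokens": 500 * 100,
    "output_tokens": 200 * 100,
    "cache_write_tokens": 30_000,
    "cache_read_tokens": 30_000 * 99,
}

gross = gross_cost(usage)  # 47.25
net = net_cost(usage)      # 7.2675
saving = gross - net       # 39.9825
```

Under these assumptions the customer's traffic would have cost $47.25 without caching and actually cost about $7.27, so the slide number is roughly $40 saved.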

An 80-line worker that aggregates spans into a per-customer-per-day table

The span data lives in your tracing backend (Tempo, Honeycomb, whatever). You don't want finance querying that. You want a stable daily_llm_cost table, one row per customer, model, and day, that finance can join against the customer dimension table they already trust.

A small worker reads yesterday's LLM spans, joins them to the price book, and writes summed rows per customer per day. Run it at 02:00 UTC. Done before standup.

import os
from datetime import date, datetime, timedelta, timezone
import psycopg

DSN = os.environ["WAREHOUSE_DSN"]

INSERT_SQL = """
INSERT INTO daily_llm_cost (
    day, customer_id, model,
    input_tokens, output_tokens,
    cache_read_tokens, cache_write_tokens,
    request_count, retry_count,
    gross_cost_usd, net_cost_usd
)
SELECT
    %(day)s::date AS day,
    s.customer_id,
    s.model,
    SUM(s.input_tokens),
    SUM(s.output_tokens),
    SUM(s.cache_read_tokens),
    SUM(s.cache_write_tokens),
    COUNT(*) FILTER (WHERE NOT s.is_retry),
    COUNT(*) FILTER (WHERE s.is_retry),
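    -- gross: all prompt tokens at the full input rate (assumes input_tokens
    -- excludes cached tokens, which are reported in the cache_* columns)
    SUM((s.input_tokens + s.cache_read_tokens + s.cache_write_tokens)
                        * p.input_per_mtok_usd
      + s.output_tokens * p.output_per_mtok_usd) / 1e6,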
    SUM(s.input_tokens       * p.input_per_mtok_usd
      + s.output_tokens      * p.output_per_mtok_usd
      + s.cache_write_tokens * p.cache_write_per_mtok_usd
      + s.cache_read_tokens  * p.cache_read_per_mtok_usd) / 1e6
FROM llm_span_events s
JOIN llm_price_book p
  ON p.model = s.model
 AND s.ts >= p.valid_from
 AND s.ts <  p.valid_to
WHERE s.ts >= %(day_start)s
  AND s.ts <  %(day_end)s
  AND s.customer_id IS NOT NULL
GROUP BY s.customer_id, s.model
ON CONFLICT (day, customer_id, model) DO UPDATE
SET input_tokens       = EXCLUDED.input_tokens,
    output_tokens      = EXCLUDED.output_tokens,
    cache_read_tokens  = EXCLUDED.cache_read_tokens,
    cache_write_tokens = EXCLUDED.cache_write_tokens,
    request_count      = EXCLUDED.request_count,
    retry_count        = EXCLUDED.retry_count,
    gross_cost_usd     = EXCLUDED.gross_cost_usd,
    net_cost_usd       = EXCLUDED.net_cost_usd;
"""

def run(target_day: date) -> None:
    day_start = datetime.combine(target_day, datetime.min.time(),
                                 tzinfo=timezone.utc)
    day_end = day_start + timedelta(days=1)
    with psycopg.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(INSERT_SQL, {
                "day": target_day,
                "day_start": day_start,
                "day_end": day_end,
            })
        conn.commit()
    print(f"aggregated {target_day}")

if __name__ == "__main__":
    # default: yesterday in UTC. allow CLI override for backfills.
    import sys
    if len(sys.argv) > 1:
        run(date.fromisoformat(sys.argv[1]))
    else:
        # date.today() uses the host's local timezone; ask for UTC explicitly
        run(datetime.now(timezone.utc).date() - timedelta(days=1))

The ON CONFLICT ... DO UPDATE makes the worker idempotent. Re-running it for the same day reproduces the same row, which matters when you're backfilling after fixing a price-book row or when a late-arriving span shows up in the warehouse.
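
If you want to convince yourself of the idempotency before pointing this at the warehouse, here's a stripped-down sketch using Python's built-in sqlite3, which supports the same ON CONFLICT ... DO UPDATE shape. The table is trimmed to a single measure, the aggregate helper sums in Python instead of SQL, and the span rows are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_llm_cost (
        day          TEXT    NOT NULL,
        customer_id  TEXT    NOT NULL,
        model        TEXT    NOT NULL,
        input_tokens INTEGER NOT NULL,
        PRIMARY KEY (day, customer_id, model)
    )
""")

UPSERT = """
INSERT INTO daily_llm_cost (day, customer_id, model, input_tokens)
VALUES (?, ?, ?, ?)
ON CONFLICT (day, customer_id, model) DO UPDATE
SET input_tokens = excluded.input_tokens
"""


def aggregate(conn, day, spans):
    """Recompute the day's totals from scratch and overwrite what's there."""
    totals = {}
    for s in spans:
        key = (day, s["customer_id"], s["model"])
        totals[key] = totals.get(key, 0) + s["input_tokens"]
    with conn:
        conn.executemany(UPSERT, [(*key, total) for key, total in totals.items()])


def snapshot(conn):
    return conn.execute(
        "SELECT customer_id, input_tokens FROM daily_llm_cost "
        "ORDER BY customer_id"
    ).fetchall()


spans = [
    {"customer_id": "cus_a", "model": "m1", "input_tokens": 100},
    {"customer_id": "cus_a", "model": "m1", "input_tokens": 50},
    {"customer_id": "cus_b", "model": "m1", "input_tokens": 10},
]

aggregate(conn, "2026-01-05", spans)
first_run = snapshot(conn)
aggregate(conn, "2026-01-05", spans)  # same day, same input
second_run = snapshot(conn)

# A late span for cus_a lands in the warehouse; re-run the same day.
spans.append({"customer_id": "cus_a", "model": "m1", "input_tokens": 25})
aggregate(conn, "2026-01-05", spans)
after_late_span = snapshot(conn)
```

The key detail is that the worker overwrites with a full recomputation instead of adding a delta, so a second run can never double-count.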

Schedule it with whatever runs your other ETL: Airflow, a Kubernetes CronJob, a pg_cron row. Five lines of config. Finance gets a stable, replay-safe daily_llm_cost table by 02:30 UTC every morning.

The gotcha: retries inflate per-customer cost without revenue

Now the bit that bites you in month two.

A customer's request times out. Your client retries. The retry hits a different model fallback, succeeds, but the LLM provider already charged you for the first attempt's input tokens. The customer experienced one request. Your bill saw two.

If you sum spans naively, you've billed the customer for cost that didn't deliver value. Finance is happy. Customer Success is not happy when the customer asks why their bill jumped 30% in a month where they used the product the same way.

You need to track retries separately. The aggregation worker above already counts them in a retry_count column, but its cost columns still sum every span, retries included. To split the money as well, add cost sums filtered on is_retry using the same FILTER clause. The reporting layer can then surface two flavors: total spend (what hit your bank account) versus billable spend (what you'd charge the customer if you were passing through cost).

The piece that's easy to forget: the span needs to know it's a retry. Mark it at the call site:

span.set_attribute("llm.is_retry", attempt > 0)
span.set_attribute("llm.attempt", attempt)

Without that, your retry-count column is always zero and the gotcha stays hidden until the first customer escalation. Set it in the wrapper around your LLM client and you never think about it again.
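
A minimal sketch of such a wrapper, under a few assumptions: call_with_retries and RecordingSpan are names invented here, the wrapper retries only on TimeoutError, and start_span is any callable that returns a context manager yielding an object with set_attribute. A tracer's start_as_current_span has that shape, and the recording stand-in below lets the sketch run without a tracing backend.

```python
import time
from contextlib import contextmanager


class RecordingSpan:
    """Test double exposing the one method the wrapper needs."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


recorded = []


@contextmanager
def start_span(name):
    span = RecordingSpan()
    recorded.append(span)
    yield span


def call_with_retries(call, start_span, max_attempts=3, backoff_s=0.0):
    """Run call() up to max_attempts times, one tagged span per attempt."""
    last_error = None
    for attempt in range(max_attempts):
        with start_span("llm.call") as span:
            span.set_attribute("llm.is_retry", attempt > 0)
            span.set_attribute("llm.attempt", attempt)
            try:
                return call()
            except TimeoutError as exc:
                last_error = exc
        if attempt + 1 < max_attempts:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise last_error


# Hypothetical flaky call: times out once, then succeeds.
state = {"calls": 0}


def flaky_llm_call():
    state["calls"] += 1
    if state["calls"] == 1:
        raise TimeoutError("first attempt timed out")
    return "ok"


result = call_with_retries(flaky_llm_call, start_span)
tags = [s.attributes for s in recorded]
```

Two spans come out: the first tagged as attempt 0, the second as a retry. Because the tagging lives in the wrapper, no call site has to remember it.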

The result of all this: when finance walks over and asks how much Acme Corp cost last month, you point at a table. The number is stable. It reconciles to the provider invoice within rounding, as long as every span carries a customer. It counts retries separately from organic requests. It splits gross from net. You didn't rebuild your billing pipeline. You set baggage at three boundaries, kept a price-book table honest, and shipped a worker that fits on one screen.

How does your team handle per-customer LLM cost attribution today: pass IDs through every call, propagate via baggage, or skip it and argue about the invoice once a quarter? Drop your pattern in the comments.

If this was useful

The patterns above (baggage propagation, versioned price tables, gross-vs-net reporting) are the everyday machinery that turns a tracing system into a cost system. The chapter on cost attribution in the LLM Observability Pocket Guide goes deeper on the dimension model, the warehouse schema, and the alerting layer that catches a runaway customer before finance does. If you're standing up tracing for an LLM product and want to skip the year of "we'll fix the attribution later," it's the right shortcut.