Void Stitch

Posted on Jun 5

AI API Cost Attribution in 2026: A Practical Guide to LLM Cost Per Request

#finops #devops

Aggregate AI bills hide which team, feature, and workflow actually created spend.
Request-level attribution starts with a small metadata contract: team, service, feature, environment, provider, model, and trace ID.
You should compute cost from billable token fields plus the active rate card, not from monthly invoices or rough averages.
Proxy gateways help standardize telemetry, but you still need to preserve upstream provider, model, and raw usage fields.
The fastest path to action is ranking cost by boundary: team, feature, model, prompt family, and retry path.

Most FinOps teams can tell you what they spent on AI last month. Fewer can tell you which product feature burned the money, which team owned it, or which retry loop quietly doubled a model bill. That gap gets expensive once your company is spending $5,000 to $50,000 per month on LLM APIs.

If you want engineering teams to act on AI spend, you need cost at the same level where engineers make decisions: the request. That means every traced model call needs enough metadata to answer three questions later: who triggered it, what work it was doing, and how much it cost.

This guide shows how to build that layer in 2026 using provider usage fields, OpenTelemetry-friendly metadata, and a cost pipeline that works across OpenAI, Anthropic, and proxy gateways.

Why aggregate billing stops being useful

A monthly invoice is good for finance reconciliation. It is bad for operations.

An aggregate number cannot tell you:

whether support copilots are cheaper than internal coding agents
whether one prompt version added 35% more output tokens
whether retries, fallbacks, or tool loops are creating most of the overrun
whether one business unit is subsidizing another in the same shared platform budget

The failure mode is familiar. A company sees $18,400 in AI spend for the month, divides by total requests, and concludes the average request cost is fine. Meanwhile, one feature may be costing $0.11 per call and another $0.006 per call. The average hides the problem, so nobody changes prompt design, model routing, or retry policy.

Request-level attribution fixes that. It turns AI spend into something engineers can debug.

The minimum metadata contract for every AI request

Start small. You do not need a giant observability taxonomy on day one. You need enough metadata to group cost by the boundaries that map to real owners.

I recommend attaching these fields to every AI request or span:

team: the owning business or platform team
service.name: the logical service making the call
service.namespace: the team or domain namespace
feature: the product surface, workflow, or use case
environment: production, staging, or test
provider: openai, anthropic, bedrock, vertex, or gateway name
model: the exact billed model identifier
trace_id and request_id: for replay and debugging
customer_tier or workspace_id: if you need tenant chargeback
prompt_version: if prompt changes materially affect spend

According to the OpenTelemetry service semantic conventions, service.name identifies the logical service and service.namespace can distinguish a related group of services such as the owning team. The GenAI metrics conventions also recommend recording provider, model, operation name, and token usage, and they specifically say instrumentation should report billable tokens when both used and billable counts exist.

That last point matters more than most teams realize. If your provider or gateway exposes cached input tokens, reasoning tokens, or discounted billable tokens, your chargeback math should follow the billable totals, not a naive token sum.
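
A minimal sketch of that preference, assuming a usage payload that may carry a `billable_<kind>_tokens` field next to the plain `<kind>_tokens` count. Both field names are hypothetical; real providers and gateways name these fields differently, so map them to your own schema first.

```python
from typing import Mapping


def billable_count(usage: Mapping[str, int], kind: str) -> int:
    """Return the token count to charge for `kind` ("input" or "output").

    Prefers the (hypothetical) billable field and falls back to the used
    count only when no billable figure is present.
    """
    billable = usage.get(f"billable_{kind}_tokens")
    if billable is not None:
        # A billable count of 0 is valid and must not trigger the fallback.
        return billable
    return usage.get(f"{kind}_tokens", 0)


discounted = billable_count(
    {"input_tokens": 1000, "billable_input_tokens": 400}, "input"
)
plain = billable_count({"input_tokens": 1000}, "input")
```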

A trace payload shape that survives real operations

Here is a practical payload shape for a single model call:

{
  "trace_id": "trc_01JY8M1Y8P5M5Y4B9A7Q2D8R6H",
  "request_id": "req_8b72d",
  "team": "growth-platform",
  "service.name": "assistant-api",
  "service.namespace": "customer-eng",
  "feature": "trial-summary",
  "environment": "production",
  "provider": "openai",
  "model": "gpt-5.4-mini",
  "input_tokens": 18400,
  "cached_input_tokens": 0,
  "output_tokens": 1120,
  "retry_count": 0,
  "latency_ms": 2380,
  "workspace_id": "ws_4821",
  "prompt_version": "trial-summary-v12"
}

The important design choice is that ownership fields live next to billing fields. If cost data and business context land in different systems, attribution becomes a brittle join later.

A good rule is this: if you cannot answer “which team owns this expensive trace?” from one record, the schema is still too thin.
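
That rule can be turned into a check at ingestion time. The sketch below is one possible shape, not a prescribed one: the split into `OWNERSHIP_FIELDS` and `BILLING_FIELDS` and the `missing_fields` helper are assumptions chosen to match the payload above.

```python
OWNERSHIP_FIELDS = ("team", "service.name", "feature", "environment")
BILLING_FIELDS = ("provider", "model", "input_tokens", "output_tokens")


def missing_fields(record: dict) -> list:
    """List required fields that are absent or empty in a trace record.

    A token count of 0 is a legitimate value and is not treated as missing.
    """
    required = OWNERSHIP_FIELDS + BILLING_FIELDS
    return [f for f in required if record.get(f) in (None, "")]


record = {
    "trace_id": "trc_01JY8M1Y8P5M5Y4B9A7Q2D8R6H",
    "team": "growth-platform",
    "service.name": "assistant-api",
    "feature": "trial-summary",
    "environment": "production",
    "provider": "openai",
    "model": "gpt-5.4-mini",
    "input_tokens": 18400,
    "cached_input_tokens": 0,
    "output_tokens": 1120,
}

complete = missing_fields(record)
ownerless = missing_fields({k: v for k, v in record.items() if k != "team"})
```

Records that come back with a non-empty list can be routed to a review queue instead of being silently aggregated.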

How to compute OpenAI cost per request

For direct OpenAI usage, the basic formula is straightforward:

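request_cost = (uncached_input_tokens / 1,000,000 * input_rate) + (cached_input_tokens / 1,000,000 * cached_input_rate) + (output_tokens / 1,000,000 * output_rate)

Here uncached_input_tokens excludes cached tokens. If your usage payload reports cached tokens as a subset of the input total, subtract them first so they are not billed twice.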

As of June 5, 2026, OpenAI lists standard short-context pricing for gpt-5.4-mini at $0.75 per million input tokens, $0.075 per million cached input tokens, and $4.50 per million output tokens in its API pricing docs.

Example:

model: gpt-5.4-mini
input: 18,400 tokens
cached input: 0 tokens
output: 1,120 tokens

Cost math:

input cost = 18,400 / 1,000,000 * 0.75 = $0.0138
output cost = 1,120 / 1,000,000 * 4.50 = $0.00504
total request cost = $0.01884

That looks small until volume shows up. At 14,000 requests per month, that one feature costs about $263.76.
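
The arithmetic above can be sketched as a small function. It uses `Decimal` so the totals match the hand calculation exactly. The rates are the ones quoted in this section; the function name, the `uplift` parameter, and the assumption that the first argument counts only uncached input tokens are illustrative choices.

```python
from decimal import Decimal
from typing import Mapping

MILLION = Decimal(1_000_000)

# Rates quoted in the text for gpt-5.4-mini, in USD per million tokens.
GPT_5_4_MINI_RATES = {
    "input": Decimal("0.75"),
    "cached_input": Decimal("0.075"),
    "output": Decimal("4.50"),
}


def request_cost(
    uncached_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    rates: Mapping[str, Decimal] = GPT_5_4_MINI_RATES,
    uplift: Decimal = Decimal("1"),
) -> Decimal:
    """Rate one request in USD.

    `uplift` models a pricing variant applied to the whole request,
    for example Decimal("1.10") for a 10% surcharge.
    """
    base = (
        Decimal(uncached_input_tokens) / MILLION * rates["input"]
        + Decimal(cached_input_tokens) / MILLION * rates["cached_input"]
        + Decimal(output_tokens) / MILLION * rates["output"]
    )
    return base * uplift


cost = request_cost(18_400, 0, 1_120)
monthly = cost * 14_000
print(cost, monthly)
```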

OpenAI also notes that eligible regional processing endpoints released on or after March 5, 2026 carry a 10% uplift. If a team turns on regional processing for compliance but you do not capture that pricing variant in your rate card, your internal cost model will underreport spend.

How to compute Anthropic cost per request

Anthropic follows the same pattern, but the rate card is different and geography can change the price.

As of June 5, 2026, Anthropic lists Claude Sonnet 4.6 at $3 per million input tokens, $0.30 per million cache-read tokens, and $15 per million output tokens in its pricing docs. Anthropic also says inference_geo: "us" applies a 1.1x multiplier for Claude Sonnet 4.6 and later.

Example without cache reads:

model: Claude Sonnet 4.6
input: 32,000 tokens
output: 900 tokens
geography: global

Cost math:

input cost = 32,000 / 1,000,000 * 3 = $0.096
output cost = 900 / 1,000,000 * 15 = $0.0135
total request cost = $0.1095

If the same request runs with US-only inference, multiply by 1.1:

adjusted total = $0.1095 * 1.1 = $0.12045

Now add cache reads and the picture changes again. If 28,000 of those input tokens were served from cache and only 4,000 were uncached:

uncached input = 4,000 / 1,000,000 * 3 = $0.012
cached input = 28,000 / 1,000,000 * 0.30 = $0.0084
output = 900 / 1,000,000 * 15 = $0.0135
total = $0.0339

That is why request-level cost should store cache dimensions explicitly. Two calls with the same total token count can have very different actual cost.
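
The three Anthropic scenarios above can be reproduced with one function. This is a sketch under the rates and multiplier quoted in this section. Cache-write pricing is not quoted here, so it is left out, and the function and constant names are hypothetical.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)

# Claude Sonnet 4.6 rates quoted in the text, in USD per million tokens.
SONNET_RATES = {
    "input": Decimal("3"),
    "cache_read": Decimal("0.30"),
    "output": Decimal("15"),
}

# Multiplier quoted in the text for US-only inference; "global" is unadjusted.
GEO_MULTIPLIER = {"global": Decimal("1"), "us": Decimal("1.1")}


def anthropic_request_cost(
    uncached_input_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
    geo: str = "global",
) -> Decimal:
    """Rate one request in USD, then apply the geography multiplier."""
    base = (
        Decimal(uncached_input_tokens) / MILLION * SONNET_RATES["input"]
        + Decimal(cache_read_tokens) / MILLION * SONNET_RATES["cache_read"]
        + Decimal(output_tokens) / MILLION * SONNET_RATES["output"]
    )
    return base * GEO_MULTIPLIER[geo]


no_cache = anthropic_request_cost(32_000, 0, 900)
us_only = anthropic_request_cost(32_000, 0, 900, geo="us")
with_cache = anthropic_request_cost(4_000, 28_000, 900)
print(no_cache, us_only, with_cache)
```

All three calls process the same 32,000 input tokens, yet the cached variant costs less than a third of the uncached one.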

Where proxy gateways fit into the model

A gateway can simplify instrumentation because every request passes through one choke point. It can also confuse attribution if you only store the gateway name and lose the upstream provider details.

For each gatewayed request, preserve three layers:

Layer	What to capture	Why it matters
Ownership	`team`, `service`, `feature`, `workspace_id`	Chargeback and accountability
Upstream billing	`provider`, `model`, token usage, cache usage, region	Real cost still follows the provider rate card
Gateway context	gateway name, route, policy, fallback target, retry count	Explains why cost changed

If the gateway emits a computed cost field, keep it. But also keep raw usage and model metadata. Rate cards change, fallback routing changes, and gateways sometimes normalize fields differently across providers. Raw inputs let you re-rate old traffic after a pricing update.

A useful pattern is to maintain an internal rate-card table keyed by:

provider
model
pricing mode
region or inference geography
effective date

That gives you deterministic historical cost calculations: each request is rated at the price in force when it ran, not at whatever the provider charges today.
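
A minimal sketch of such a table and its lookup, assuming an in-memory list of entries. The `RateCardEntry` class, the `find_rate` helper, and every provider name, model name, and rate in the sample data are hypothetical placeholders, not real prices.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Iterable, Optional


@dataclass(frozen=True)
class RateCardEntry:
    provider: str
    model: str
    pricing_mode: str       # illustrative label, e.g. "standard"
    region: str             # region or inference geography
    effective_from: date    # first day this price applies
    input_rate: Decimal     # USD per million tokens
    output_rate: Decimal    # USD per million tokens


def find_rate(
    entries: Iterable[RateCardEntry],
    provider: str,
    model: str,
    pricing_mode: str,
    region: str,
    on: date,
) -> Optional[RateCardEntry]:
    """Return the entry in force on `on`: the latest one whose
    effective_from is not after that date, or None if none applies."""
    key = (provider, model, pricing_mode, region)
    candidates = [
        e for e in entries
        if (e.provider, e.model, e.pricing_mode, e.region) == key
        and e.effective_from <= on
    ]
    return max(candidates, key=lambda e: e.effective_from, default=None)


# Hypothetical data: one model whose price dropped on 2026-04-01.
CARD = [
    RateCardEntry("example-provider", "example-model", "standard", "global",
                  date(2026, 1, 1), Decimal("1.00"), Decimal("5.00")),
    RateCardEntry("example-provider", "example-model", "standard", "global",
                  date(2026, 4, 1), Decimal("0.80"), Decimal("4.00")),
]

march = find_rate(CARD, "example-provider", "example-model",
                  "standard", "global", date(2026, 3, 15))
april = find_rate(CARD, "example-provider", "example-model",
                  "standard", "global", date(2026, 4, 1))
```

Appending new entries instead of editing old ones keeps earlier traffic rated at the price it was actually billed at.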

How to find the top cost boundaries engineers can actually fix

Once every request has an owner and a cost, do not jump straight to dashboards with fifty cuts. Start with the boundaries that produce action.

Rank monthly cost by:

team
feature
model
prompt version
retry or fallback path

Then rank by cost per successful outcome, not only cost per request. A support summary flow that costs $0.03 and resolves a ticket may be healthier than a research agent that costs $0.015 but retries three times and still fails.
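
Both rankings can be sketched over a list of rated request records. The `cost_usd` and `success` field names are hypothetical, and the sample data reads the research-agent example as one original attempt plus three retries, each billed at $0.015; that reading is an assumption.

```python
from collections import defaultdict
from decimal import Decimal


def rank_by(records, boundary):
    """Total cost per boundary value, most expensive first."""
    totals = defaultdict(Decimal)
    for r in records:
        totals[r[boundary]] += r["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def cost_per_success(records, boundary):
    """Total cost divided by successful outcomes per boundary value.

    Returns None for a boundary with no successes, since its cost per
    success is undefined rather than zero.
    """
    cost = defaultdict(Decimal)
    wins = defaultdict(int)
    for r in records:
        cost[r[boundary]] += r["cost_usd"]
        wins[r[boundary]] += int(r["success"])
    return {k: (cost[k] / wins[k] if wins[k] else None) for k in cost}


records = [
    {"feature": "support-summary", "cost_usd": Decimal("0.03"), "success": True},
]
# Assumed: four failed attempts (one original plus three retries).
records += [
    {"feature": "research-agent", "cost_usd": Decimal("0.015"), "success": False}
    for _ in range(4)
]

ranking = rank_by(records, "feature")
per_success = cost_per_success(records, "feature")
```

Under that assumption the research agent spends $0.06 in total with nothing to show for it, while the nominally pricier summary flow costs $0.03 per resolved ticket.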

The most useful review questions are usually:

Which feature has the highest total spend?
Which feature has the highest cost per success?
Which team saw the largest week-over-week cost jump?
Which prompt version increased output tokens by more than 20%?
Which fallback path triggers most of the expensive requests?

That analysis turns attribution into engineering work: shorten prompts, switch models, cache reusable context, cap retries, or move low-value flows to cheaper tiers.
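
As one example, the prompt-version question can be answered by comparing mean output tokens between two versions. This is a rough sketch: the helper name, the `trial-summary-v11` baseline, and the sample token counts are made up for illustration, and a real comparison would also need enough traffic per version to be meaningful.

```python
from statistics import mean


def output_token_change(records, baseline_version, candidate_version):
    """Relative change in mean output tokens from baseline to candidate.

    Raises statistics.StatisticsError if either version has no records.
    """
    base = mean(
        r["output_tokens"] for r in records
        if r["prompt_version"] == baseline_version
    )
    cand = mean(
        r["output_tokens"] for r in records
        if r["prompt_version"] == candidate_version
    )
    return (cand - base) / base


# Hypothetical sample: two requests per prompt version.
records = [
    {"prompt_version": "trial-summary-v11", "output_tokens": 1000},
    {"prompt_version": "trial-summary-v11", "output_tokens": 1000},
    {"prompt_version": "trial-summary-v12", "output_tokens": 1300},
    {"prompt_version": "trial-summary-v12", "output_tokens": 1400},
]

change = output_token_change(records, "trial-summary-v11", "trial-summary-v12")
flagged = change > 0.20   # threshold from the review question above
```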

Common implementation mistakes

A few mistakes show up repeatedly:

Using monthly invoice totals for chargeback instead of request records
Recording model=gateway-default instead of the actual upstream model
Missing feature or team, which makes expensive traces ownerless
Overwriting historical costs when provider prices change
Ignoring cached token pricing and regional multipliers
Tracking only request count, which hides token-heavy outliers

If you fix only one thing this quarter, fix ownerless traffic. Unowned AI spend never gets optimized.

Summary

AI API cost attribution gets operational when every model call carries owner metadata, provider metadata, and billable usage fields in the same record. From there, the math is simple: apply the correct rate card at the request level, preserve pricing variants like cache and geography, and aggregate spend by the boundaries engineers control. If you want a fast sanity check on your trace schema, try the free AI Cost Attribution Auditor.

FAQ

How do I calculate LLM cost per request when a gateway sits in front of OpenAI or Anthropic?

Store the upstream provider, exact model, token usage, and any cache or region fields from the gateway payload. Then rate the request against your internal provider rate card. Do not rely only on the gateway name.

What is the difference between AI API cost attribution and regular cloud chargeback?

Regular cloud chargeback often works at the service or account level. AI API cost attribution has to go deeper because one service can contain many prompts, models, retries, and workflows with very different unit economics.

When should FinOps teams use billable tokens instead of raw token counts?

Use billable tokens whenever the provider or gateway exposes them. Cached tokens, discounted reads, and reasoning-specific fields can make raw token totals a poor proxy for actual spend.

How do I tag AI requests so engineering teams can act on the data?

At minimum, tag team, service, feature, environment, provider, model, trace ID, and prompt version. If you need customer-level chargeback, add workspace or tenant identifiers.

What is the fastest way to reduce AI spend after adding request-level attribution?

Look first at the top three boundaries by total cost and cost per success. In most teams, the fastest wins come from cutting retries, shrinking prompts, caching repeated context, or routing low-stakes work to cheaper models.

DEV Community