- Aggregate AI bills hide which team, feature, and workflow actually created spend.
- Request-level attribution starts with a small metadata contract: team, service, feature, environment, provider, model, and trace ID.
- You should compute cost from billable token fields plus the active rate card, not from monthly invoices or rough averages.
- Proxy gateways help standardize telemetry, but you still need to preserve upstream provider, model, and raw usage fields.
- The fastest path to action is ranking cost by boundary: team, feature, model, prompt family, and retry path.
Most FinOps teams can tell you what they spent on AI last month. Fewer can tell you which product feature burned the money, which team owned it, or which retry loop quietly doubled a model bill. That gap gets expensive once your company is spending $5,000 to $50,000 per month on LLM APIs.
If you want engineering teams to act on AI spend, you need cost at the same level where engineers make decisions: the request. That means every traced model call needs enough metadata to answer three questions later: who triggered it, what work it was doing, and how much it cost.
This guide shows how to build that layer in 2026 using provider usage fields, OpenTelemetry-friendly metadata, and a cost pipeline that works across OpenAI, Anthropic, and proxy gateways.
Why aggregate billing stops being useful
A monthly invoice is good for finance reconciliation. It is bad for operations.
An aggregate number cannot tell you:
- whether support copilots are cheaper than internal coding agents
- whether one prompt version added 35% more output tokens
- whether retries, fallbacks, or tool loops are creating most of the overrun
- whether one business unit is subsidizing another in the same shared platform budget
The failure mode is familiar. A company sees $18,400 in AI spend for the month, divides by total requests, and concludes the average request cost is fine. Meanwhile, one feature may be costing $0.11 per call and another $0.006 per call. The average hides the problem, so nobody changes prompt design, model routing, or retry policy.
Request-level attribution fixes that. It turns AI spend into something engineers can debug.
The minimum metadata contract for every AI request
Start small. You do not need a giant observability taxonomy on day one. You need enough metadata to group cost by the boundaries that map to real owners.
I recommend attaching these fields to every AI request or span:
- team: the owning business or platform team
- service.name: the logical service making the call
- service.namespace: the team or domain namespace
- feature: the product surface, workflow, or use case
- environment: production, staging, or test
- provider: openai, anthropic, bedrock, vertex, or gateway name
- model: the exact billed model identifier
- trace_id and request_id: for replay and debugging
- customer_tier or workspace_id: if you need tenant chargeback
- prompt_version: if prompt changes materially affect spend
According to the OpenTelemetry service semantic conventions, service.name identifies the logical service and service.namespace can distinguish a related group of services such as the owning team. The GenAI metrics conventions also recommend recording provider, model, operation name, and token usage, and they specifically say instrumentation should report billable tokens when both used and billable counts exist.
That last point matters more than most teams realize. If your provider or gateway exposes cached input tokens, reasoning tokens, or discounted billable tokens, your chargeback math should follow the billable totals, not a naive token sum.
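As a minimal sketch of how the metadata contract above could be enforced at ingest time, the check below flags records that are missing required fields. The function name, the choice of required fields (taken from the core contract: team, service, feature, environment, provider, model, trace ID), and the rule that blank values count as missing are all assumptions for illustration, not part of any standard.

```python
# Required fields follow the core metadata contract described above.
# The names and the "blank counts as missing" rule are illustrative assumptions.
REQUIRED_FIELDS = (
    "team",
    "service.name",
    "feature",
    "environment",
    "provider",
    "model",
    "trace_id",
)


def missing_contract_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in a trace record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Hypothetical record: no team, and a blank feature.
example = {
    "service.name": "assistant-api",
    "feature": "",
    "environment": "production",
    "provider": "openai",
    "model": "gpt-5.4-mini",
    "trace_id": "trc_01JY8M1Y8P5M5Y4B9A7Q2D8R6H",
}
print(missing_contract_fields(example))
```

Rejecting or quarantining records with a non-empty result at write time is usually cheaper than trying to backfill ownership later.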
A trace payload shape that survives real operations
Here is a practical payload shape for a single model call:
{
"trace_id": "trc_01JY8M1Y8P5M5Y4B9A7Q2D8R6H",
"request_id": "req_8b72d",
"team": "growth-platform",
"service.name": "assistant-api",
"service.namespace": "customer-eng",
"feature": "trial-summary",
"environment": "production",
"provider": "openai",
"model": "gpt-5.4-mini",
"input_tokens": 18400,
"cached_input_tokens": 0,
"output_tokens": 1120,
"retry_count": 0,
"latency_ms": 2380,
"workspace_id": "ws_4821",
"prompt_version": "trial-summary-v12"
}
The important design choice is that ownership fields live next to billing fields. If cost data and business context land in different systems, attribution becomes a brittle join later.
A good rule is this: if you cannot answer "which team owns this expensive trace?" from one record, the schema is still too thin.
How to compute OpenAI cost per request
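For direct OpenAI usage, the basic formula is straightforward, provided input_tokens counts only uncached input. If your usage payload reports cached tokens as a subset of the input total, subtract them first so they are not billed twice:
request_cost = (input_tokens / 1,000,000 * input_rate) + (cached_input_tokens / 1,000,000 * cached_input_rate) + (output_tokens / 1,000,000 * output_rate)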
As of June 5, 2026, OpenAI lists standard short-context pricing for gpt-5.4-mini at $0.75 per million input tokens, $0.075 per million cached input tokens, and $4.50 per million output tokens in its API pricing docs.
Example:
- model: gpt-5.4-mini
- input: 18,400 tokens
- cached input: 0 tokens
- output: 1,120 tokens
Cost math:
- input cost = 18,400 / 1,000,000 * 0.75 = $0.0138
- output cost = 1,120 / 1,000,000 * 4.50 = $0.00504
- total request cost = $0.01884
That looks small until volume shows up. At 14,000 requests per month, that one feature costs about $263.76.
OpenAI also notes that eligible regional processing endpoints released on or after March 5, 2026 carry a 10% uplift. If a team turns on regional processing for compliance but you do not capture that pricing variant in your rate card, your internal cost model will underreport spend.
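The formula and worked example above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function and dictionary names are made up for this example, the rates are the gpt-5.4-mini figures quoted in this article, and input_tokens is assumed to exclude cached tokens. Decimal is used so that small per-request amounts do not pick up floating-point noise.

```python
from decimal import Decimal

PER_MILLION = Decimal(1_000_000)

# USD per million tokens, as quoted in this article for gpt-5.4-mini.
GPT_54_MINI_RATES = {
    "input": Decimal("0.75"),
    "cached_input": Decimal("0.075"),
    "output": Decimal("4.50"),
}


def request_cost(input_tokens, cached_input_tokens, output_tokens,
                 rates=GPT_54_MINI_RATES, uplift=Decimal("1")):
    """Rate one request; input_tokens is assumed to exclude cached tokens."""
    cost = (
        Decimal(input_tokens) / PER_MILLION * rates["input"]
        + Decimal(cached_input_tokens) / PER_MILLION * rates["cached_input"]
        + Decimal(output_tokens) / PER_MILLION * rates["output"]
    )
    # uplift models a pricing variant such as the 10% regional processing uplift.
    return cost * uplift


per_request = request_cost(18_400, 0, 1_120)
monthly = per_request * 14_000
regional = request_cost(18_400, 0, 1_120, uplift=Decimal("1.10"))
print(per_request, monthly, regional)
```

Passing the uplift as an explicit argument keeps the pricing variant visible in the stored record instead of burying it in the rate.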
How to compute Anthropic cost per request
Anthropic follows the same pattern, but the rate card is different and geography can change the price.
As of June 5, 2026, Anthropic lists Claude Sonnet 4.6 at $3 per million input tokens, $0.30 per million cache-read tokens, and $15 per million output tokens in its pricing docs. Anthropic also says inference_geo: "us" applies a 1.1x multiplier for Claude Sonnet 4.6 and later.
Example without cache reads:
- model: Claude Sonnet 4.6
- input: 32,000 tokens
- output: 900 tokens
- geography: global
Cost math:
- input cost = 32,000 / 1,000,000 * 3 = $0.096
- output cost = 900 / 1,000,000 * 15 = $0.0135
- total request cost = $0.1095
If the same request runs with US-only inference, multiply by 1.1:
- adjusted total = $0.12045
Now add cache reads and the picture changes again. If 28,000 of those input tokens were served from cache and only 4,000 were uncached:
- uncached input = 4,000 / 1,000,000 * 3 = $0.012
- cached input = 28,000 / 1,000,000 * 0.30 = $0.0084
- output = 900 / 1,000,000 * 15 = $0.0135
- total = $0.0339
That is why request-level cost should store cache dimensions explicitly. Two calls with the same total token count can have very different actual cost.
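The three Anthropic scenarios above can be reproduced with one small function. This is a sketch, not an official client: the function name and parameter names are hypothetical, and the rates and the 1.1x US multiplier are the figures quoted in this article.

```python
from decimal import Decimal

PER_MILLION = Decimal(1_000_000)

# USD per million tokens, as quoted in this article for Claude Sonnet 4.6.
SONNET_RATES = {
    "input": Decimal("3"),
    "cache_read": Decimal("0.30"),
    "output": Decimal("15"),
}
US_GEO_MULTIPLIER = Decimal("1.1")


def anthropic_request_cost(uncached_input, cache_read, output,
                           geo_multiplier=Decimal("1"), rates=SONNET_RATES):
    """Rate one request with uncached input and cache reads kept separate."""
    base = (
        Decimal(uncached_input) * rates["input"]
        + Decimal(cache_read) * rates["cache_read"]
        + Decimal(output) * rates["output"]
    ) / PER_MILLION
    return base * geo_multiplier


global_cost = anthropic_request_cost(32_000, 0, 900)
us_cost = anthropic_request_cost(32_000, 0, 900, geo_multiplier=US_GEO_MULTIPLIER)
cached_cost = anthropic_request_cost(4_000, 28_000, 900)
print(global_cost, us_cost, cached_cost)
```

All three calls involve the same 32,900 total tokens, yet the rated cost differs by more than 3x between the cached and US-only cases.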
Where proxy gateways fit into the model
A gateway can simplify instrumentation because every request passes through one choke point. It can also confuse attribution if you only store the gateway name and lose the upstream provider details.
For each gatewayed request, preserve three layers:
| Layer | What to capture | Why it matters |
|---|---|---|
| Ownership | team, service, feature, workspace_id | Chargeback and accountability |
| Upstream billing | provider, model, token usage, cache usage, region | Real cost still follows the provider rate card |
| Gateway context | gateway name, route, policy, fallback target, retry count | Explains why cost changed |
If the gateway emits a computed cost field, keep it. But also keep raw usage and model metadata. Rate cards change, fallback routing changes, and gateways sometimes normalize fields differently across providers. Raw inputs let you re-rate old traffic after a pricing update.
A useful pattern is to maintain an internal rate-card table keyed by:
- provider
- model
- pricing mode
- region or inference geography
- effective date
That gives you deterministic historical cost calculations: old traffic is rated at the price that applied when it ran, not at whatever the provider charges today.
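One way to sketch such a rate-card table is a list of dated entries plus a lookup that picks the latest entry effective on the request date. Everything here is illustrative: the class and function names are assumptions, the effective dates are hypothetical, and the older entry uses placeholder rates that exist only to show the date logic. Only the second entry's rates come from this article.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class RateCardEntry:
    provider: str
    model: str
    pricing_mode: str
    region: str
    effective_from: date
    input_rate: Decimal          # USD per million tokens
    cached_input_rate: Decimal
    output_rate: Decimal


def rate_for(entries, provider, model, pricing_mode, region, on):
    """Return the entry with the latest effective_from that is <= the request date."""
    key = (provider, model, pricing_mode, region)
    candidates = [
        e for e in entries
        if (e.provider, e.model, e.pricing_mode, e.region) == key
        and e.effective_from <= on
    ]
    if not candidates:
        raise LookupError(f"no rate card entry for {key} on {on}")
    return max(candidates, key=lambda e: e.effective_from)


RATE_CARD = [
    # Hypothetical older entry with placeholder rates, for illustration only.
    RateCardEntry("openai", "gpt-5.4-mini", "standard", "global",
                  date(2026, 1, 1), Decimal("1.00"), Decimal("0.10"), Decimal("5.00")),
    # Rates quoted in this article; the effective date is an assumption.
    RateCardEntry("openai", "gpt-5.4-mini", "standard", "global",
                  date(2026, 6, 1), Decimal("0.75"), Decimal("0.075"), Decimal("4.50")),
]

current = rate_for(RATE_CARD, "openai", "gpt-5.4-mini", "standard", "global", date(2026, 6, 5))
earlier = rate_for(RATE_CARD, "openai", "gpt-5.4-mini", "standard", "global", date(2026, 3, 15))
print(current.input_rate, earlier.input_rate)
```

Entries are appended, never edited, so re-rating a March request after a June price change still returns the March rate.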
How to find the top cost boundaries engineers can actually fix
Once every request has an owner and a cost, do not jump straight to dashboards with fifty cuts. Start with the boundaries that produce action.
Rank monthly cost by:
- team
- feature
- model
- prompt version
- retry or fallback path
Then rank by cost per successful outcome, not only cost per request. A support summary flow that costs $0.03 and resolves a ticket may be healthier than a research agent that costs $0.015 but retries three times and still fails.
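A rough sketch of that ranking, using the two flows from the paragraph above: the record shape (feature, cost_usd, succeeded) and the function name are assumptions for this example, and a feature with spend but no successes is ranked as infinitely expensive so it surfaces first.

```python
from collections import defaultdict


def cost_per_success(records):
    """Rank features by total cost divided by successful outcomes, worst first."""
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for r in records:
        bucket = totals[r["feature"]]
        bucket["cost"] += r["cost_usd"]
        if r["succeeded"]:
            bucket["successes"] += 1
    ranked = {
        feature: (t["cost"] / t["successes"]) if t["successes"] else float("inf")
        for feature, t in totals.items()
    }
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records: one resolved support summary, and a research agent
# whose initial attempt plus three retries all fail.
records = [
    {"feature": "support-summary", "cost_usd": 0.03, "succeeded": True},
    {"feature": "research-agent", "cost_usd": 0.015, "succeeded": False},
    {"feature": "research-agent", "cost_usd": 0.015, "succeeded": False},
    {"feature": "research-agent", "cost_usd": 0.015, "succeeded": False},
    {"feature": "research-agent", "cost_usd": 0.015, "succeeded": False},
]
ranking = cost_per_success(records)
print(ranking)
```

On a per-request basis the research agent looks half as expensive; on a per-success basis it is the worst item in the list.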
The most useful review questions are usually:
- Which feature has the highest total spend?
- Which feature has the highest cost per success?
- Which team saw the largest week-over-week cost jump?
- Which prompt version increased output tokens by more than 20%?
- Which fallback path triggers most of the expensive requests?
That analysis turns attribution into engineering work: shorten prompts, switch models, cache reusable context, cap retries, or move low-value flows to cheaper tiers.
Common implementation mistakes
A few mistakes show up repeatedly:
- Using monthly invoice totals for chargeback instead of request records
- Recording model=gateway-default instead of the actual upstream model
- Missing feature or team, which makes expensive traces ownerless
- Overwriting historical costs when provider prices change
- Ignoring cached token pricing and regional multipliers
- Tracking only request count, which hides token-heavy outliers
If you fix only one thing this quarter, fix ownerless traffic. Unowned AI spend never gets optimized.
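To track ownerless traffic as a single number, a sketch like the following computes the share of spend with no team or feature attached. The record shape and function name are assumptions, and the sample costs are made up for illustration.

```python
def ownerless_spend_share(records):
    """Return the fraction of total cost on records missing team or feature."""
    total = sum(r["cost_usd"] for r in records)
    if total == 0:
        return 0.0
    unowned = sum(
        r["cost_usd"] for r in records
        if not r.get("team") or not r.get("feature")
    )
    return unowned / total


# Hypothetical sample: one of three records has no team.
sample = [
    {"team": "growth-platform", "feature": "trial-summary", "cost_usd": 0.25},
    {"team": "support", "feature": "ticket-summary", "cost_usd": 0.25},
    {"team": None, "feature": "research-agent", "cost_usd": 0.5},
]
share = ownerless_spend_share(sample)
print(share)
```

Watching this ratio week over week gives a simple signal for whether tagging coverage is improving.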
Summary
AI API cost attribution gets operational when every model call carries owner metadata, provider metadata, and billable usage fields in the same record. From there, the math is simple: apply the correct rate card at the request level, preserve pricing variants like cache and geography, and aggregate spend by the boundaries engineers control. If you want a fast sanity check on your trace schema, try the free AI Cost Attribution Auditor.
FAQ
How do I calculate LLM cost per request when a gateway sits in front of OpenAI or Anthropic?
Store the upstream provider, exact model, token usage, and any cache or region fields from the gateway payload. Then rate the request against your internal provider rate card. Do not rely only on the gateway name.
What is the difference between AI API cost attribution and regular cloud chargeback?
Regular cloud chargeback often works at the service or account level. AI API cost attribution has to go deeper because one service can contain many prompts, models, retries, and workflows with very different unit economics.
When should FinOps teams use billable tokens instead of raw token counts?
Use billable tokens whenever the provider or gateway exposes them. Cached tokens, discounted reads, and reasoning-specific fields can make raw token totals a poor proxy for actual spend.
How do I tag AI requests so engineering teams can act on the data?
At minimum, tag team, service, feature, environment, provider, model, trace ID, and prompt version. If you need customer-level chargeback, add workspace or tenant identifiers.
What is the fastest way to reduce AI spend after adding request-level attribution?
Look first at the top three boundaries by total cost and cost per success. In most teams, the fastest wins come from cutting retries, shrinking prompts, caching repeated context, or routing low-stakes work to cheaper models.