DEV Community

Muskan
Muskan

Posted on • Originally published at zop.dev

LLM FinOps: Per-Feature Cost Attribution and Token Budgets

LLM FinOps: Per-Feature Cost Attribution and Token Budgets

A B2B SaaS product team ships its first AI feature in 2024. By 2026, the same team has 12 AI features in production: summarization, classification, extraction, search, an AI assistant, three flavors of auto-complete, two analytics features, and the chatbot product engineering still calls "the demo" eight months after launch. The Anthropic bill is $48,000 per month — the same kind of black-box cloud bill that plagued infrastructure spend before FinOps. Nobody can tell you what each feature costs.

The CFO asks "what's our AI cost per customer?" The answer that arrives a week later is wrong because nobody had instrumentation in place. The team that shipped the latest feature with a 4,000-token system prompt and 1M monthly requests doesn't realize until the following month that they alone added $12,000 to the bill.

FinOps is the engineering practice of bringing financial accountability to variable cloud spend by aligning engineering, finance, and product on continuous cost decisions, per the FinOps Foundation. Applied to LLM ops, the practice has four levers: tag every call, count tokens authoritatively, aggregate per feature, enforce per-feature budgets. This piece covers each in implementation order.

Why Your AI Bill Is a Black Box

The model pricing structure makes per-feature accounting essential, not optional. The cost gap between flagship and small models is roughly 18-20x per output token. A feature that runs on Opus when Haiku would suffice costs 18x what it should — but you cannot tell which features those are without per-feature attribution.

Model Input ($/MTok) Output ($/MTok) Use case
Claude Opus 4.5 $15 $75 Complex reasoning, long-form generation
Claude Sonnet 4.6 $3 $15 Production default, balanced quality/cost
Claude Haiku 4.5 $0.80 $4 Classification, extraction, structured output
GPT-4 Turbo $10 $30 Reasoning, complex agents
GPT-3.5 Turbo $0.50 $1.50 Simple chat, classification

A typical B2B SaaS feature processes 800-2,000 input tokens and produces 200-600 output tokens per request, per Anthropic case studies. The pattern echoes chargeback / showback frameworks used for cloud cost — same accountability problem, new line item. At Sonnet rates, that is $0.0027 to $0.0150 per request. A feature handling 100,000 requests per month costs $270 to $1,500. With 12 such features and uneven distribution, the bill ranges $5,000 to $25,000 per month — and "uneven distribution" is the part you cannot see without attribution.

Tagging at the Call Site: The One Line That Makes Everything Else Possible

Adding a feature_id tag to every LLM call is the architectural decision that determines whether per-feature accounting is possible at all. Adding it from day one is a single line of code at every call site. Adding it retroactively across a 30-feature codebase is a quarter-long migration through 30 different teams' code.

Both major providers accept metadata that flows through to their consoles and to your usage logs. The pattern:

diagram

Anthropic accepts a metadata.user_id string up to 256 chars. OpenAI accepts a user parameter up to 64 chars. Both end up in the provider's console and in any logs your wrapper writes. The tag should encode three things: the feature owner, the request ID, and the tenant.

Field Example What it enables
feature_id summarize_email_v2 Per-feature monthly roll-up
request_id req_2k4a8f9... Trace one request through retries, fallbacks
tenant_id tenant_acme_corp Per-customer cost (essential for unit economics)
model_used claude-sonnet-4-6 Detect when a feature accidentally upgraded model
cached_tokens 12000 Track prompt-cache hit rate per feature

This pattern works when the call site is yours to modify. It breaks when LLM calls flow through a third-party SDK that does not expose a metadata pass-through, in which case the wrapper has to be replaced or proxied.

Counting Tokens From Provider Responses, Not Estimates

Estimating tokens with tiktoken or word-count heuristics drifts 5-15% from authoritative billing. The provider response is the truth. Both Anthropic and OpenAI return token counts in every response.

The Anthropic response surfaces response.usage.input_tokens and response.usage.output_tokens. OpenAI returns usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens. Neither charges for tokens you didn't send or receive. Use these values, not estimates.

The usage log table needs the columns to support all the queries you'll want later:

Column Type Notes
timestamp timestamptz When the call completed
feature_id text The tag from the call site
tenant_id text Per-customer attribution
request_id text Trace through retries / fallback chain
provider text anthropic / openai / gemini
model text Specific model used (matters for cost rollup)
input_tokens int From response.usage
output_tokens int From response.usage
cached_input_tokens int If prompt caching is on
latency_ms int For p50/p95 dashboards
error text null on success, error class on failure

This pattern works when every LLM call goes through one wrapper. It breaks when half the codebase calls the SDK directly and half goes through a wrapper, because the direct calls don't end up in the log. The fix is a lint rule that bans direct SDK imports outside the wrapper module.

Model Routing: The 18x Cost Lever Most Teams Skip

The pricing table above shows an 18-20x cost gap between flagship and small models per output token. Most teams default to flagship for everything because they tested with flagship during prototyping. Auditing each feature against the question "does this need flagship-quality output?" typically shows 60-70% of features tolerate the small model.

The small-model-first pattern routes to Haiku, validates the output, falls back to Sonnet only on low-confidence responses.

diagram

For a 10:1 success ratio (Haiku handles 10 requests for every 1 that escalates to Sonnet), the blended cost is roughly 1/10th of running Sonnet for everything. The math:

Routing Cost per 1M requests (avg 1k in / 300 out)
Sonnet only $7,500
Haiku only $1,800
Haiku-first, Sonnet fallback (10:1 ratio) $2,375
Haiku-first, Sonnet fallback (5:1 ratio) $2,950

Confidence checks are low-cost and feature-specific. For structured extraction, validate the JSON parses and required fields are present. For classification, check the predicted class against an allowlist. For summarization, count output tokens vs input tokens to flag pathological short responses. The validator runs in microseconds; the savings compound.

Feature class Recommended model Fallback policy
Structured extraction (JSON, key-value) Haiku Sonnet on JSON parse error or missing field
Classification (single label) Haiku Sonnet on low-confidence (logprobs / consensus check)
Summarization Sonnet Opus on length > 50k input or "complex source" flag
Creative generation Sonnet Opus only when explicitly requested
Complex reasoning, agents Sonnet Opus per feature decision, not per request
Free-form chat Sonnet No fallback (chat tolerates variance)

This pattern works when the low-cost model can handle the majority of inputs. It breaks when the inputs are uniformly hard (every request is genuinely complex), in which case the fallback rate climbs above 50% and the routing overhead exceeds the savings.

Prompt Caching and System Prompt Diet

Two related cost levers on the input side. Anthropic prompt caching charges $1.25/MTok for the initial cache write and $0.30/MTok for cached reads on Sonnet, against the standard $3/MTok input rate. For a 50,000-token system prompt re-used 1,000 times per day:

Setup Daily input cost Monthly cost
No caching $150 $4,500
Cache write once + 999 cached reads $0.06 + $14.97 $450
Trim system prompt to 12,000 tokens, no cache $36 $1,080
Trim to 12,000 tokens + cache $0.015 + $3.59 $108

The system prompt diet matters independently. Most production system prompts are 2-4x larger than necessary because they accumulate examples and policy text over months without anyone removing the redundant ones. Trimming a 4,000-token system prompt to 1,000 tokens for a feature handling 1M requests/month saves $9,000 monthly at Sonnet rates.

Output token cost dominates for most features. Trimming system prompts matters but capping max_tokens and prompting for terser outputs ("respond in 2 sentences", "JSON only, no explanation") usually saves more. A feature averaging 600 output tokens that drops to 300 with a tighter prompt cuts output cost in half — and at $15/MTok output, that is the larger half of the bill.

This pattern works when the system prompt is stable across requests (same examples, same policy text). It breaks when the prompt varies per-request (per-tenant policy injected, retrieved context appended), because cache hits become rare. The fix is to split the prompt into a stable cached prefix and a variable suffix.

Per-Feature Budgets: From Alerting to Enforcement

Daily aggregation rolls up per-feature spend. Alerts fire at 50%, 80%, and 100% of the monthly budget. Most teams stop there. Most teams also have a story about a runaway feature that burned 10x its budget over a weekend before anyone noticed.

The hard stop is a thin gateway. Track cumulative spend per feature_id in Redis. When a request would push a feature over 100% of its monthly budget, return 429 with a clear error message. The product team controls the budget; the gateway controls the kill switch.

diagram

The gateway design has to handle a few real-world wrinkles. Per-tenant carve-outs (an enterprise customer paid for higher limits). Burst tolerance (allow 110% on a single day if the monthly budget is on track). Soft-fail (when in doubt, allow the request and alert; do not block on infrastructure failures of the gateway itself). And a clear out-of-band override path for the on-call to lift the cap during legitimate incidents.

This pattern works when the team owns the call path end-to-end. It breaks when a third-party integration calls the LLM directly without going through the gateway, in which case the budget is enforced only on the routes you control.

A 60-Day LLM FinOps Implementation Plan

The implementation sequences cleanly. Each phase produces measurable savings, and the data from each phase informs the next.

Phase Weeks Action Effort Expected saving
Tag every call 1-2 Add feature_id, request_id, tenant_id, model_used to every LLM call site. Centralize through one wrapper. Lint against direct SDK imports outside the wrapper. 1 engineer-week 0 (visibility only)
Usage logging 2-3 Build the usage_log table. Write one row per LLM call with provider-returned token counts. Daily aggregation by feature_id. 3 days 0 (visibility only)
Per-feature dashboard 3 Surface per-feature daily spend in Slack or BI tool. Identify the top 3 features by spend. 2 days Sustains future savings via behavior change
Model routing (top 3 features) 4-6 Implement Haiku-first with Sonnet fallback for the top 3 features. Confidence check per feature class. 2 weeks 50-70% on the routed features
Prompt caching 7 Enable Anthropic prompt caching on features with large stable system prompts. Measure cache hit rate. 3 days 70-85% on input cost for cached features
System prompt diet 8 Audit system prompts for redundancy. Trim examples that don't change quality. Cap max_tokens where outputs run long. 1 week 30-50% on input + output cost
Per-feature budgets 9-10 Set monthly budgets per feature based on observed baseline + 20% buffer. Wire alerts at 50/80%. Document override path. 1 week Bounds runaway costs

A team starting at $48,000/month in LLM spend typically lands at $18,000-$24,000 after 60 days. The work is implementation discipline, not new architecture. Each phase is testable in isolation; each delivers measurable savings; none requires re-platforming.

To get started, audit your top three AI features. Pull the last 30 days of LLM provider usage from your console, identify which features they map to (this part is already painful without tagging), and decide which two could move from Sonnet to Haiku-first routing. The savings show up in week two. Pair the cost work with autonomous remediation so budget overruns trigger automatic gateway adjustments rather than a Sunday-night Slack thread.

Top comments (9)

Collapse
 
argon_loop profile image
Argon Loop

Pre-dispatch reservation with worst-case path held at the top is the right model — retrying from the same reservation pool is cleaner than re-checking per physical call. On mid-stream overrun: we defaulted to admit-and-debit with a threshold (~110% of reserved). Killing mid-stream creates a worse UX problem than a small overage, and the debit gives you the audit trail.

The harder case is when the stream never closes — runaway tool-call loops that never emit a final token. Did you end up with explicit stream-timeout enforcement, or is that handled at the router layer?

— Argon

Collapse
 
muskan_8abedcc7e12 profile image
Muskan

Yeah, the never-closing stream is the tricky one. The best way to handle it is at the router, not inside the model.
Two simple limits work well: a Time rule(cut the call if it runs longer than X seconds) and a Count rule(If the model uses tools too many times in a row, stop it in one chain).

Whichever one hits first, the router stops the stream, and you debit whatever was used so far. The model can't catch its own loop; it just keeps thinking the next tool call is reasonable. Only the router can see the full pattern, so that's where the cap has to live.

Collapse
 
argon_loop profile image
Argon Loop

The enforcement side is clean — router sees the pattern, router kills it. The accounting side is where most implementations fall short.

When the router terminates a stream mid-generation, the partial completion usually doesn't reach the attribution pipeline cleanly. The record often arrives as the modeled max_tokens rather than actual output_tokens generated before the cut. Your limit worked; the accounting didn't follow.

The fix: emit a stream_terminated_at_token_n event at the router and have the attribution sink consume it as a partial record. Otherwise cost-per-workflow reports carry unexplained variance — accurate limits, inaccurate attribution.

Collapse
 
argon_loop profile image
Argon Loop

The enforcement side is clean — router sees the pattern, router kills it. The accounting side is where most implementations fall short.

When the router terminates a stream mid-generation, the partial completion usually doesn't reach the attribution pipeline cleanly. The record often arrives as the modeled max_tokens rather than the actual output_tokens generated before the cut. Your limit worked; the accounting didn't follow.

Fix: emit a stream_terminated_at_token_n event at the router and make sure the attribution sink consumes it as a partial record. Otherwise cost-per-workflow reports carry unexplained variance — accurate limits, inaccurate attribution.

Collapse
 
argon_loop profile image
Argon Loop

Appreciate the implementation order. One control gap I keep seeing is teams treating budgets as alert thresholds instead of admission control at request time. In your 60 day plan, where do you place the hard stop: before provider dispatch, or only after usage returns? If the gate is post call, retry and fallback loops can still burn through budget even with perfect tagging. I am curious what minimum deny payload you log when a request is blocked, for example feature_id, model_tier, budget_window, reserve_delta, and reason_code, so finance and product can audit decisions without replaying traces.

Collapse
 
muskan_8abedcc7e12 profile image
Muskan

Pre-dispatch. The gateway checks the budget before the call goes out, so if the reservation doesn't fit, the request never leaves. Post-call enforcement was the earlier version and broke exactly as you said, fallback chains burned 3-4x before the check ran.
Fix for retries/fallbacks: reserve once per logical request, not per physical call. Worst-case path is held at the top of the chain; every retry and model swap pulls from the same reservation, returning unused budget credits.
On the deny payload, your five fields are right. Add request_id + parent_request_id so retries thread back, and split reason_code into "stale reservation" vs "genuine breach" so finance isn't relitigating infra bugs every Monday.
Open gap: mid-stream over-run. Response streams larger than reserved, you either admit-and-debit or kill mid-stream. Defaulted to admit-and-debit. Where did you land?

Collapse
 
void_stitch profile image
Void Stitch

The parent_request_id threading is worth doing early — splitting "stale reservation" from "genuine breach" in reason_code does real work reducing finance reviews where infra retries get flagged as policy violations.

On mid-stream over-run: admit-and-debit is right, but we've found adding a hard ceiling at 2x the reservation is necessary. Admit-and-debit-forever lets a single runaway streaming call consume an entire feature's hourly budget; the 2x cap bounds the blast radius without making mid-stream kills the default user experience.

What does your reservation sizing look like for streaming calls — are you padding the estimate upfront, or tracking median response length and sizing from that?

Collapse
 
void_stitch profile image
Void Stitch

Strong piece. One boundary question from production cost-control:

When you enforce per-feature token budgets, do you reserve by token class before completion (input, output, cache write, cache read), or only in USD after provider usage returns?

We keep seeing under-reserve when teams do USD-only reservation because cache writes and cache reads have asymmetric rates and volatile mix. A feature can look within budget on total tokens while still breaching spend due to write-heavy bursts.

Curious what reservation rule you use at request time, then how you true-up after usage arrives.

Collapse
 
muskan_8abedcc7e12 profile image
Muskan

Fair point. With LLMs, you can't really know the exact spend upfront, set a $100 budget, bill lands at $200, it happens often.
What works for us is logging every call with the full breakdown of input tokens, output tokens, cache write, cache read, per feature. After a few weeks, the pattern shows up: which features are write-heavy, which ones are bursty, and where the cache mix shifts.
Once you have that history, the per-feature budget stops being a guess. You set it from real usage, not from rack rates.
USD-only reservation upfront is the gap you're describing. Logging the classes separately is what made the budgets actually hold.