The first invoice from a frontier AI provider is rarely the one that surprises a finance team. The fiftieth one is. By the time a company has several AI features in production, the monthly line item has often grown past what anyone budgeted for, and the breakdown has become opaque in a way that traditional software spend rarely is. Nobody on the engineering team can easily explain which feature is consuming which fraction of the bill, or what a twenty-percent reduction would require.
This is not a failure of discipline. It is a structural consequence of how AI systems price, how they consume resources, and how much of their cost is buried in adjacent infrastructure that was not obviously part of the AI stack. Understanding total cost of ownership for AI features is a different exercise than understanding it for SaaS or for internal services, and the finance teams that treat it the same way find themselves without clear levers when cost becomes a problem.
This post is the TCO framework we use with clients when AI spend becomes visible enough that the CFO starts asking questions.
The categories of AI cost
AI spend breaks into five categories. A complete TCO model has to account for all five. Most budgets only see the first.
Direct model costs. The per-token price multiplied by volume. This is the invoice the finance team sees. It is also usually the smallest of the five line items over the full lifecycle of a serious deployment, though it is the only one that gets attention.
Supporting infrastructure. Vector databases, embedding services, orchestration layers, observability tools, caching, queueing, rate limiting. A RAG system is not just a model; it is a small platform. The monthly cost of that platform — including the engineers who keep it running — often exceeds the model invoice.
Data preparation and evaluation. The dataset curation for fine-tuning, the golden sets for evaluation, the human review of samples, the red-team testing before release. These costs are concentrated at the start of a feature’s life and recur every time the underlying model or data changes materially. Teams that skip them pay instead in incidents and rework.
Operational headcount. The engineers who maintain the AI features, the platform team that supports them, the data team that curates inputs, the security team that reviews new capabilities. AI features tend to be staff-intensive in a way that SaaS features are not, because there is no vendor operating them on your behalf. You are the vendor for your own AI systems.
Incident cost. The business impact of a failed or degraded AI system — incorrect customer responses, lost sales, trust damage, regulatory exposure. This is the category that accountants struggle with and that risk teams care about most. It is harder to quantify, but ignoring it is the reason companies that underinvest in evaluation eventually overpay in reputation.
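The five categories above can be captured in a simple monthly model. This is a minimal sketch with illustrative placeholder figures, not benchmarks; the point is that the provider invoice is one field among five.

```python
# A minimal sketch of a five-category monthly TCO model.
# All dollar figures below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class MonthlyTCO:
    direct_model: float      # the provider invoice
    infrastructure: float    # vector DB, orchestration, observability, caching
    data_and_eval: float     # curation, golden sets, red-teaming (amortized)
    headcount: float         # loaded cost of people maintaining the feature
    incident_reserve: float  # expected incident cost (probability x impact)

    def total(self) -> float:
        return (self.direct_model + self.infrastructure +
                self.data_and_eval + self.headcount + self.incident_reserve)

    def invoice_share(self) -> float:
        # Fraction of true TCO that the provider invoice represents.
        return self.direct_model / self.total()

tco = MonthlyTCO(direct_model=12_000, infrastructure=18_000,
                 data_and_eval=6_000, headcount=45_000, incident_reserve=4_000)
print(f"total: ${tco.total():,.0f}, invoice share: {tco.invoice_share():.0%}")
# total: $85,000, invoice share: 14%
```

With these (made-up) numbers the invoice the finance team scrutinizes is about a seventh of the real cost, which is the shape of the problem even if your ratios differ.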
Where the hidden cost usually hides
Four patterns produce most of the cost overruns we see in practice.
Unbounded context growth. A feature launches with short prompts. Over six months, product teams add “helpful” context, system prompts grow, retrieval returns larger chunks, conversation history accumulates. Per-request token consumption doubles or triples without any announced price change. The fix is boring and effective: context budget reviews, where every token in the prompt has to justify itself.
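A context budget review can be as simple as counting tokens per prompt section against a hard ceiling. This sketch uses a rough 4-characters-per-token heuristic; in practice you would substitute your model's actual tokenizer.

```python
# A minimal sketch of a context budget audit, assuming a rough
# 4-chars-per-token heuristic (replace with your tokenizer's real count).
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def audit_context(sections: dict[str, str], budget: int) -> dict[str, int]:
    """Report token usage per prompt section and flag budget overruns."""
    usage = {name: estimate_tokens(text) for name, text in sections.items()}
    total = sum(usage.values())
    if total > budget:
        worst = max(usage, key=usage.get)
        print(f"OVER BUDGET: {total}/{budget} tokens; "
              f"largest section is '{worst}' ({usage[worst]} tokens)")
    return usage

usage = audit_context(
    {"system": "You are a support assistant..." * 40,
     "retrieval": "chunk " * 500,
     "history": "user: hi\nassistant: hello\n" * 30},
    budget=1_000,
)
```

Run in a monthly review, the per-section breakdown makes it obvious which team's "helpful" additions are driving the growth.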
Retries and fallbacks. The invoice reflects tokens billed, not tokens useful. A feature that retries on safety filters, falls back to a more expensive model when the cheap one fails, or re-runs when the output format is invalid is paying for failures in addition to successes. At scale, the multiplier can be thirty or forty percent. Instrument retry rates as a cost signal, not just a reliability one.
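The gap between tokens billed and tokens useful can be expressed as a single multiplier. This is a minimal sketch with illustrative rates; the inputs would come from your gateway's telemetry.

```python
# A minimal sketch of a retry-aware cost multiplier. The rates used
# below are illustrative; wire the real ones to gateway telemetry.
def effective_cost_multiplier(retry_rate: float,
                              fallback_rate: float,
                              fallback_price_ratio: float) -> float:
    """How much more than the 'happy path' you actually pay per request.

    retry_rate: fraction of requests re-run on the same model.
    fallback_rate: fraction escalated to a pricier model after failing.
    fallback_price_ratio: fallback model price / primary model price.
    """
    base = 1.0
    retries = retry_rate * 1.0                  # failed attempt billed anyway
    fallbacks = fallback_rate * fallback_price_ratio
    return base + retries + fallbacks

# 10% retries plus 5% fallbacks to a model at 5x the price:
print(f"{effective_cost_multiplier(0.10, 0.05, 5.0):.2f}x")  # prints 1.35x
```

A 1.35x multiplier is exactly the "thirty or forty percent" surcharge described above, and it never appears as its own line on any invoice.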
Agentic workflows. An agent that takes ten tool calls and three planning rounds to answer a question costs roughly ten times what a single-shot model call would. The answers are often better, which is the point, but teams underestimate the cost multiplier. Tracking cost per user-facing outcome, not cost per model call, is the only way to see this clearly.
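Cost per user-facing outcome for an agent is just the sum of every call it made on the way to the answer. This sketch uses assumed token counts and per-million-token prices to make the multiplier concrete.

```python
# A minimal sketch of cost-per-outcome accounting for an agentic
# workflow. Token counts and prices below are illustrative assumptions.
def call_cost(in_tokens: int, out_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one model call, with prices in $ per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# One user-facing outcome = 3 planning rounds + 10 tool calls.
planning = [call_cost(6_000, 800, 3.0, 15.0) for _ in range(3)]
tools = [call_cost(2_500, 300, 3.0, 15.0) for _ in range(10)]
single_shot = call_cost(3_000, 500, 3.0, 15.0)

agent_outcome = sum(planning) + sum(tools)
print(f"agent: ${agent_outcome:.4f}, single-shot: ${single_shot:.4f}, "
      f"multiplier: {agent_outcome / single_shot:.1f}x")
```

With these assumed figures the agent costs about 13x the single-shot call per outcome, which is invisible if your dashboard only reports cost per model call.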
Over-capable models. The frontier-tier model is necessary for perhaps fifteen percent of the queries a feature handles. The other eighty-five percent could be handled by a mid-tier model for a fraction of the cost. Teams routinely send everything to the frontier tier because it is simpler, then discover six months later that a routing layer would have saved half the bill.
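The routing arithmetic is worth doing explicitly. This sketch mirrors the 15/85 split from the text; the prices are illustrative per-million-token rates, not any provider's actual pricing.

```python
# A minimal sketch of the routing-savings arithmetic. The 15/85 split
# mirrors the text; both prices are illustrative assumptions.
frontier_price, mid_price = 15.0, 1.5   # $ per 1M output tokens (assumed)
frontier_share = 0.15                   # queries that genuinely need frontier

all_frontier = 1.0 * frontier_price
routed = frontier_share * frontier_price + (1 - frontier_share) * mid_price

savings = 1 - routed / all_frontier
print(f"blended: ${routed:.2f}/M vs ${all_frontier:.2f}/M "
      f"-> {savings:.1%} saved")
```

Your actual savings depend on the real price gap and on how accurately the router classifies queries, but even conservative splits usually clear the "half the bill" mark mentioned above.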
The accounting discipline
A TCO model that is actually useful requires a few operational disciplines that most organizations have to build deliberately.
Attribute cost to features, not teams. A provider invoice rolls up to one account. Useful cost analysis requires knowing that feature X consumed forty-seven percent of the spend this quarter, that feature Y is the fastest-growing line item, and that feature Z is the most expensive per customer interaction. This requires tagging every request with feature identifiers at the gateway layer.
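At its core, gateway-level attribution is a ledger keyed by feature tag. This is a minimal sketch; the feature names and costs are hypothetical, and a production version would persist to your metrics store rather than an in-memory dict.

```python
# A minimal sketch of feature-level cost attribution at a gateway.
# Feature names and per-request costs below are hypothetical.
from collections import defaultdict

class CostLedger:
    def __init__(self):
        self.by_feature = defaultdict(float)

    def record(self, feature: str, cost: float) -> None:
        self.by_feature[feature] += cost

    def shares(self) -> dict[str, float]:
        total = sum(self.by_feature.values()) or 1.0
        return {f: c / total for f, c in self.by_feature.items()}

ledger = CostLedger()
# Every request through the gateway carries a feature tag:
ledger.record("support_copilot", 0.021)
ledger.record("doc_summarizer", 0.004)
ledger.record("support_copilot", 0.017)

for feature, share in sorted(ledger.shares().items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {share:.0%} of spend")
```

The mechanism is trivial; the organizational work is making sure no request can reach a provider without a feature tag attached.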
Track cost per outcome, not per call. Cost per model call is a technical metric. Cost per resolved support ticket, cost per approved document, cost per qualified lead — those are business metrics that connect AI spend to value. If you cannot compute these, you cannot tell whether an AI feature is earning its cost.
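The computation itself is a division; the hard part is having both numbers. This sketch uses assumed monthly figures for a support-ticket feature to show the comparison that actually matters to the business.

```python
# A minimal sketch of cost-per-outcome reporting. All figures below
# are illustrative assumptions for a hypothetical support feature.
def cost_per_outcome(total_spend: float, outcomes: int) -> float:
    return total_spend / outcomes if outcomes else float("inf")

spend = 8_400.0               # all-in AI cost attributed to the feature
resolved_tickets = 12_000     # tickets the feature fully resolved

unit_cost = cost_per_outcome(spend, resolved_tickets)
human_cost_per_ticket = 6.50  # assumed loaded cost of a human resolution

print(f"${unit_cost:.2f} per resolved ticket "
      f"vs ${human_cost_per_ticket:.2f} human baseline")
```

Framed this way, the conversation shifts from "the AI bill went up" to "each resolved ticket costs a fraction of the human baseline", which is the conversation a CFO can act on.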
Review monthly, model quarterly, redesign annually. A monthly review catches drift. A quarterly modeling exercise refreshes the TCO against actual usage and renegotiated rates. An annual redesign asks whether the architecture is still right — whether the routing, the model mix, the retrieval strategy, the caching layer are still fit for purpose. Three cadences, three different questions, each useful on its own.
The levers that actually move the number
When cost reduction becomes a priority, the levers with the highest impact tend to be the same across deployments:
Model routing. Sending simple queries to cheap models and hard queries to expensive ones. The easiest way to cut thirty to fifty percent of a frontier bill, with minimal quality impact when implemented carefully.
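Structurally, a routing layer is a classifier in front of a model choice. The heuristic in this sketch is a deliberate placeholder; real routers typically use a cheap classifier model or confidence-based escalation, and the model names here are hypothetical.

```python
# A minimal sketch of a routing layer. The keyword heuristic is a
# placeholder; production routers use a cheap classifier model or
# confidence-based escalation. Model names are hypothetical.
def route(query: str) -> str:
    hard_markers = ("why", "compare", "analyze", "multi-step", "prove")
    looks_hard = len(query) > 400 or any(m in query.lower() for m in hard_markers)
    return "frontier-model" if looks_hard else "mid-tier-model"

assert route("What are your opening hours?") == "mid-tier-model"
assert route("Compare these two contracts and analyze the risk.") == "frontier-model"
```

The quality caveat matters: the router's misclassification rate, not its existence, determines whether the savings come with a user-visible cost.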
Prompt compression. Shorter system prompts, tighter retrieval chunks, deduplicated context. Often removes fifteen to twenty-five percent of input tokens without changing behavior.
Caching. For queries with overlapping contexts or answers, a cache layer with a time-to-live of seconds to minutes. Effective hit ratios vary wildly by workload, but a well-placed cache can remove twenty to sixty percent of calls.
Provider negotiation. Enterprise tiers, committed-use discounts, and regional pricing options that are not advertised. At serious volume, this is a budgeted procurement activity, not a one-time conversation.
The strategic question
AI cost, handled well, is boring operational discipline. Handled badly, it becomes a strategic constraint — the reason a product cannot scale to the next tier of customers, or the reason a promising feature gets shut down for reasons the business never understood. The difference is whether the organization has built the instrumentation and the accounting discipline to see its AI spend the way it sees any other significant cost category.
If the honest answer to “what does our AI cost and why?” is a shrug, the fix is not more budget. It is the instrumentation that makes cost visible before it becomes a problem.