YutoGPT

Posted on Jul 5

AI API Cost Is Not Per Call. It Is Per Workflow.

#ai #saas #llm #architecture

AI API cost is usually forecast at the wrong unit.

Cost per model call matters, but it is not the economic unit of an agentic product. One user-visible action can fan out into retrieval, planner calls, tool calls, validation, retries, fallback models and an audit trail.

The product looks like one feature. The bill looks like a workflow.

The Wrong Question

Most pricing spreadsheets start with a simple number: cost per model call. That works when an AI feature is a single prompt-response exchange. It breaks down when the product becomes agentic.

In an agent workflow, one user-visible action can trigger a chain:

retrieve context;
call a planner model;
call one or more tools;
validate structured output;
retry after a malformed response;
fall back to a stronger model;
summarize the result;
write an audit trail.

The customer clicked one button. The system may have made ten paid decisions.

That is why teams building AI products often feel surprised by their margins. The average model call is not the problem. The tail workflow is the problem.

The common question is:

How much does this model cost per million tokens?

That matters, but it is not enough.

The better question is:

What does this customer outcome cost across the whole workflow?

For example, "answer a support question" is not a model call. It may include retrieval, classification, routing, a draft answer, a policy check, a tool lookup and a handoff. "Generate a report" may include search, extraction, summarization, chart generation and a final review pass.

If you price those as single calls, your spreadsheet looks clean and your gross margin becomes fiction.
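To make the gap concrete, here is a minimal sketch of the arithmetic for one "answer a support question" action that hits a retry and a fallback. Everything in it is an assumption for illustration: the per-million-token prices, the token counts, the step names and the `Call` class are hypothetical, not figures from any real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One paid provider call inside a workflow (hypothetical shape)."""
    step: str
    input_tokens: int
    output_tokens: int
    usd_per_m_in: float   # assumed price per million input tokens
    usd_per_m_out: float  # assumed price per million output tokens

    def cost_usd(self) -> float:
        return (self.input_tokens * self.usd_per_m_in
                + self.output_tokens * self.usd_per_m_out) / 1_000_000


# Assumed (input, output) prices per million tokens; not real price lists.
CHEAP = (0.50, 1.50)
STRONG = (5.00, 15.00)

# One user-visible action on a bad day: the draft is malformed, retried,
# then escalated to the stronger model.
support_answer = [
    Call("classify", 1_000, 50, *CHEAP),
    Call("draft", 6_000, 800, *CHEAP),
    Call("draft_retry", 6_000, 800, *CHEAP),
    Call("draft_fallback", 6_000, 800, *STRONG),
    Call("policy_check", 2_000, 100, *CHEAP),
]


def workflow_cost_usd(calls):
    """Total cost of every paid call behind one user-visible action."""
    return sum(c.cost_usd() for c in calls)


# What the spreadsheet priced: one draft call on the cheap model.
single_call_estimate = support_answer[1].cost_usd()
# What the bill shows for this run of the workflow.
tail_workflow_cost = workflow_cost_usd(support_answer)

print(f"single call: ${single_call_estimate:.4f}")
print(f"workflow:    ${tail_workflow_cost:.4f}")
```

Under these assumed numbers the workflow costs about 12 times the single draft call the spreadsheet priced, and most of that comes from the one fallback call.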

Cost Should Be Attached To The Step

A practical system tracks cost at three levels:

Provider call
Workflow step
User-visible action

Provider-call cost tells you which model was expensive.

Workflow-step cost tells you which part of the product is expensive.

User-action cost tells you whether the business model works.

The third one is the number that matters for pricing. If one "automated research brief" costs 30 cents at the median but 9 dollars at the 99th percentile, the median is not your business risk. The tail is.
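One way to get all three levels from the same data is to tag every provider call with its step and its user action, then roll the same records up by different keys. The sketch below assumes a flat list of cost records; the field names (`action_id`, `step`, `model`, `cost_usd`) and the sample values are hypothetical. The percentile uses the simple nearest-rank method, which is one of several common definitions.

```python
import math
from collections import defaultdict


def roll_up(records, key):
    """Sum cost_usd over records, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical records: one row per provider call.
records = [
    {"action_id": "a1", "step": "retrieve", "model": "small", "cost_usd": 0.01},
    {"action_id": "a1", "step": "draft", "model": "small", "cost_usd": 0.04},
    {"action_id": "a2", "step": "draft", "model": "small", "cost_usd": 0.04},
    {"action_id": "a2", "step": "draft", "model": "large", "cost_usd": 0.40},
]

per_model = roll_up(records, "model")       # which model was expensive
per_step = roll_up(records, "step")         # which part of the product was
per_action = roll_up(records, "action_id")  # whether the pricing holds up

action_costs = list(per_action.values())
median_cost = percentile(action_costs, 50)
tail_cost = percentile(action_costs, 99)
```

With only two actions the percentiles are not meaningful; the point is that the median and the tail come from per-action totals, not from per-call costs.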

Routing Is A Product Decision

Once you think in workflows, model routing changes shape.

Routing is not just "send this prompt to the cheapest provider." It is deciding the minimum sufficient model and context for the current step.

Some steps need a strong model because the cost of being wrong is high. Some steps can use a cheap model because they only classify, extract or format. Some steps should not call a model at all if a deterministic rule is available.

A mature AI API layer should know:

which step is being executed;
what budget remains for the user action;
whether the step is reversible;
whether a fallback is allowed;
what evidence must be written before the next action;
when to stop and return a budget-exhausted state instead of silently continuing.

That is the difference between API access and API operations.
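A minimal sketch of such a routing decision is below. It covers only part of the list above (the step, the remaining budget, and stopping with a budget-exhausted state); reversibility and evidence requirements are left out. The `Step` fields, the `Route` names and the cost estimates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RULE = "deterministic_rule"
    CHEAP_MODEL = "cheap_model"
    STRONG_MODEL = "strong_model"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass(frozen=True)
class Step:
    name: str
    high_risk: bool         # the cost of being wrong is high
    has_rule: bool = False  # a deterministic rule can do the job


# Assumed per-step cost estimates in USD; real values would come from metering.
EST_COST_USD = {
    Route.RULE: 0.0,
    Route.CHEAP_MODEL: 0.005,
    Route.STRONG_MODEL: 0.05,
}


def choose_route(step, remaining_usd):
    """Pick the minimum sufficient route for this step, or stop explicitly."""
    if step.has_rule:
        return Route.RULE  # no model call at all
    wanted = Route.STRONG_MODEL if step.high_risk else Route.CHEAP_MODEL
    if EST_COST_USD[wanted] > remaining_usd:
        # Do not silently downgrade a high-risk step or overspend.
        return Route.BUDGET_EXHAUSTED
    return wanted
```

The design choice worth noting is that a high-risk step never falls through to the cheap model when money runs short. It returns `BUDGET_EXHAUSTED` and lets the product decide what to do next.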

Budget Exhaustion Should Be A Product State

Many teams treat usage limits as infrastructure errors.

They should be product states.

If an agent hits a token budget, tool-call limit or retry ceiling, the system should not behave like something broke. It should return a controlled result:

partial answer with known missing sections;
request for user approval to continue;
degraded cheaper path;
handoff to a human;
queued background run;
explicit "budget exhausted" event.

This is especially important for fixed-price AI products. A flat subscription can work only if the runtime has hard ceilings. Otherwise a few pathological agent runs consume the margin for everyone else.
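As a sketch of what a hard ceiling with a controlled result could look like, the runner below checks a per-action dollar budget before each step and returns a partial result with the missing sections named, rather than raising an error. It models only a dollar ceiling, not token, tool-call or retry limits, and the names `ActionBudget`, `ActionResult` and `run_action` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActionBudget:
    """Dollar ceiling for one user-visible action."""
    max_usd: float
    spent_usd: float = 0.0

    def can_afford(self, est_usd):
        return self.spent_usd + est_usd <= self.max_usd


@dataclass
class ActionResult:
    status: str = "complete"  # or "budget_exhausted"
    sections: dict = field(default_factory=dict)
    missing: list = field(default_factory=list)


def run_action(steps, budget):
    """Run steps in order; each step is (name, estimated_usd, callable)."""
    result = ActionResult()
    for index, (name, est_usd, fn) in enumerate(steps):
        if not budget.can_afford(est_usd):
            # Controlled stop: report what was skipped instead of failing.
            result.status = "budget_exhausted"
            result.missing = [step_name for step_name, _, _ in steps[index:]]
            return result
        result.sections[name] = fn()
        budget.spent_usd += est_usd  # a real system would charge actual cost
    return result


# Usage: three steps at an assumed 4 cents each against a 10-cent ceiling.
steps = [
    ("search", 0.04, lambda: "sources found"),
    ("summary", 0.04, lambda: "draft summary"),
    ("review", 0.04, lambda: "reviewed"),
]
outcome = run_action(steps, ActionBudget(max_usd=0.10))
```

Here the first two steps run, the third is skipped, and the caller receives a `budget_exhausted` status with `review` listed as missing. The product can then ask for approval, queue a background run or hand off to a human.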

The Receipt Matters

For agentic workflows, the cost record should be part of the operating record.

Not just "we spent 41,000 tokens." The useful receipt says:

which user action triggered the spend;
which workflow step consumed it;
which model/provider was used;
whether a retry or fallback happened;
what tool calls were made;
what limit or policy allowed the spend;
what result was produced.

That is what lets product, engineering and finance have the same conversation.

Without this, finance sees a bill, engineering sees logs, product sees user outcomes, and nobody can reconcile them cleanly.
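The receipt fields listed above map naturally onto one structured record per workflow step. The sketch below is one possible shape, not a standard schema: the class name, field names and sample values are assumptions, and a real system would likely add timestamps and identifiers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CostReceipt:
    action_id: str     # which user action triggered the spend
    step: str          # which workflow step consumed it
    provider: str      # which provider was used
    model: str         # which model was used
    tokens: int
    cost_usd: float
    retry: bool        # whether this call was a retry
    fallback: bool     # whether this call was a fallback
    tool_calls: tuple  # names of tools invoked during the step
    policy: str        # which limit or policy allowed the spend
    result: str        # what result was produced

    def to_json(self):
        """Serialize to one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
receipt = CostReceipt(
    action_id="research-brief-123",
    step="draft",
    provider="example-provider",
    model="example-large",
    tokens=41_000,
    cost_usd=0.62,
    retry=False,
    fallback=True,
    tool_calls=("web_search",),
    policy="per_action_ceiling_usd_2",
    result="draft_ok",
)
line = receipt.to_json()
```

Because each line carries the action, the step and the policy together, the same log can be grouped by action for finance, by step for engineering and by result for product.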

What This Changes

When cost is measured per workflow, teams can make better decisions:

price by outcome rather than guesswork;
identify which flows destroy margin;
route low-risk steps to cheaper models;
reserve strong models for high-risk steps;
add approval gates only where the spend is large or the action is irreversible;
sell predictable capacity without pretending usage variance does not exist.

The goal is not to make every AI call cheap. The goal is to make the whole workflow economically legible.

That is where AI infrastructure is going: not just more models, but better control over when, why and how models are used.

The next generation of AI API platforms will not be judged only by how many providers they expose. They will be judged by whether they can answer a harder question:

What did this customer outcome cost, why did it cost that, and what should happen next time?

That is the unit that matters.

Curious how other teams are tracking this: provider call, workflow step, user-visible action, or something else?

Top comments (1)

Alexey Spinov • Jul 5

The three-level split is the right decomposition, and the line about the 99th percentile being the business risk is the part most cost dashboards miss. One thing I would add from watching these tails: the p99 workflow is usually not one expensive step; it is a retry-and-fallback chain firing after a step has already failed. A malformed response triggers a retry, the retry escalates to the stronger model, and the audit trail records every paid decision on a workflow that was already doomed three calls ago.

So per-action attribution is the correct unit, but it reads the meter after the money is spent. It tells you next week which outcome cost 9 dollars. The complementary piece is a per-action budget checked before each step: a token or dollar ceiling on the whole user-visible action, so a runaway retry chain trips the cap instead of quietly minting the tail. Attribution explains the p99; a pre-step budget bounds it.

Your last point is the cheapest call of all: the step that should not call a model gets gated out deterministically before it runs, so it never enters the attribution table in the first place. Good framing of the unit.