Most AI app prototypes look cheap.
Then production happens.
A developer tests an LLM feature with 20 prompts, gets a few good responses, and assumes the cost is manageable. But production cost is not based on one prompt. It is based on:
input tokens
output tokens
requests per user
users per day
retry rate
tool calls
prompt caching
conversation history
model choice
That is where many teams get caught off guard.
The mistake is simple: they estimate the cost of a single API call instead of estimating the cost of the full production workflow.
The basic LLM cost formula
At the simplest level, LLM cost is:
Input cost = input tokens × input price
Output cost = output tokens × output price
Total cost = input cost + output cost
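As code, the formula is trivial. The sketch below uses placeholder prices per million tokens, not any specific provider's rates; substitute your provider's actual pricing.

```python
# Minimal sketch of the basic cost formula.
# Prices are placeholder assumptions, not real provider rates.
INPUT_PRICE_PER_1M = 3.00    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 15.00  # USD per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost

print(call_cost(3_000, 800))  # one request: 3,000 input / 800 output tokens
```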
But real applications are rarely that simple.
A chat application may include:
system prompt
user message
conversation history
retrieved context
tool schemas
intermediate reasoning steps
final response
So the real token budget looks more like:
Total input tokens =
system prompt tokens
+ user prompt tokens
+ conversation history tokens
+ retrieved context tokens
+ tool schema tokens
+ intermediate step tokens
And output tokens may include:
assistant response
tool call arguments
intermediate responses
summaries
structured JSON outputs
That is why cost grows faster than expected.
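A quick way to see it is to add up each component per request. The token counts below are illustrative assumptions, not measurements.

```python
# Illustrative per-request token budget (all numbers are assumptions).
input_budget = {
    "system_prompt": 800,
    "user_prompt": 150,
    "conversation_history": 2_000,
    "retrieved_context": 3_500,
    "tool_schemas": 600,
    "intermediate_steps": 400,
}
output_budget = {
    "assistant_response": 500,
    "tool_call_arguments": 150,
    "structured_json_overhead": 100,
}

total_input = sum(input_budget.values())
total_output = sum(output_budget.values())
print(f"Input tokens per request:  {total_input:,}")
print(f"Output tokens per request: {total_output:,}")
```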
Example: a simple AI assistant
Assume we are building an internal AI assistant.
Each request uses:
Input tokens per request: 3,000
Output tokens per request: 800
Requests per day: 10,000
Daily token volume:
Input tokens/day = 3,000 × 10,000 = 30,000,000
Output tokens/day = 800 × 10,000 = 8,000,000
Monthly token volume (assuming 30 days):
Input tokens/month = 900,000,000
Output tokens/month = 240,000,000
Now multiply that by your provider’s model pricing.
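For example, with hypothetical prices of $3 per million input tokens and $15 per million output tokens (placeholders, not real rates), the monthly bill looks like this:

```python
# Monthly cost for the assistant example above.
# Prices are hypothetical placeholders; plug in your provider's rates.
input_tokens_month = 900_000_000
output_tokens_month = 240_000_000
input_price_per_1m = 3.00    # USD (assumed)
output_price_per_1m = 15.00  # USD (assumed)

input_cost = input_tokens_month / 1_000_000 * input_price_per_1m
output_cost = output_tokens_month / 1_000_000 * output_price_per_1m
print(f"Input:  ${input_cost:,.0f}/month")                 # $2,700
print(f"Output: ${output_cost:,.0f}/month")                # $3,600
print(f"Total:  ${input_cost + output_cost:,.0f}/month")   # $6,300
```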
This is where architecture decisions start to matter.
A larger model may give better results, but if the workload is high-volume, the cost difference can become significant. A smaller model may be good enough for classification, routing, extraction, or summarization tasks.
The hidden cost: output tokens
Developers often focus on input tokens.
That is a mistake.
For most providers, output tokens are priced higher than input tokens. Verbose responses also increase both latency and cost.
For example, this output style:
```json
{
  "answer": "...",
  "reasoning": "...",
  "sources": [...],
  "confidence": "...",
  "follow_up_questions": [...]
}
```
may be useful, but it costs more than:
```json
{
  "answer": "...",
  "sources": [...]
}
```
Structured output is great, but every field has a cost.
For production systems, response design is architecture.
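As a rough illustration, suppose the verbose schema averages 600 output tokens per response and the lean one 200 (assumed numbers). At the request volume from the earlier example, the difference adds up:

```python
# Assumed average output sizes for the two response schemas.
verbose_tokens = 600
lean_tokens = 200
requests_per_day = 10_000
output_price_per_1m = 15.00  # USD per 1M output tokens (assumed)

def monthly_output_cost(tokens_per_request: int) -> float:
    monthly_tokens = tokens_per_request * requests_per_day * 30
    return monthly_tokens / 1_000_000 * output_price_per_1m

saving = monthly_output_cost(verbose_tokens) - monthly_output_cost(lean_tokens)
print(f"Monthly saving from the leaner schema: ${saving:,.0f}")  # $1,800
```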
RAG makes the cost calculation harder
In a Retrieval-Augmented Generation system, every request may include retrieved chunks.
Example:
System prompt: 800 tokens
User question: 100 tokens
Retrieved chunks: 5 chunks × 700 tokens = 3,500 tokens
Conversation history: 1,500 tokens
Output reserve: 800 tokens
Total request size:
```text
800 + 100 + 3,500 + 1,500 = 5,900 input tokens
```
A RAG system is not just a vector database problem. It is also a token budgeting problem.
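A small helper makes the budgeting explicit. Chunk count and chunk size are the assumptions that dominate:

```python
# Per-request input budget for a RAG call (defaults match the example above).
def rag_input_tokens(system=800, question=100, chunks=5,
                     tokens_per_chunk=700, history=1_500) -> int:
    return system + question + chunks * tokens_per_chunk + history

print(rag_input_tokens())           # 5,900 tokens
print(rag_input_tokens(chunks=10))  # 9,400 tokens: doubling the chunk count
```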
Agentic workflows multiply cost
Agentic AI makes cost estimation even more important.
A simple chat request may be one LLM call.
An agent task may involve:
1. Intent classification
2. Planning
3. Tool selection
4. Tool execution
5. Observation
6. Retry or correction
7. Final answer generation
That means one user request can become 5 to 10 LLM calls.
If each step uses different prompts, context, and outputs, the real cost is:
Cost per user task =
planning cost
+ tool-selection cost
+ tool-call argument generation cost
+ observation summarization cost
+ retry cost
+ final answer cost
This is why “cost per API call” is the wrong metric for agents.
The better metric is: cost per completed task
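Here is a sketch of "cost per completed task", summing the calls in an agent loop. The per-step token counts, prices, and retry rate are illustrative assumptions, not measurements of any real agent.

```python
# Illustrative agent loop: each step is one LLM call with its own token profile.
# All token counts, prices, and the retry rate are assumptions for the sketch.
INPUT_PRICE_PER_1M = 3.00    # USD (assumed)
OUTPUT_PRICE_PER_1M = 15.00  # USD (assumed)

steps = [
    # (name, input_tokens, output_tokens)
    ("intent_classification", 1_000, 50),
    ("planning",              2_500, 300),
    ("tool_selection",        2_000, 100),
    ("tool_argument_gen",     1_500, 200),
    ("observation_summary",   3_000, 250),
    ("final_answer",          4_000, 600),
]
retry_rate = 0.10  # 10% of steps repeat on average (assumed)

def step_cost(inp: int, out: int) -> float:
    return inp / 1e6 * INPUT_PRICE_PER_1M + out / 1e6 * OUTPUT_PRICE_PER_1M

cost_per_task = sum(step_cost(i, o) for _, i, o in steps) * (1 + retry_rate)
print(f"Cost per completed task: ${cost_per_task:.4f}")
```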
Prompt caching can change the economics
If your provider supports prompt caching, repeated prompt sections can be cheaper.
Good candidates for caching:
system prompts
policy instructions
tool definitions
schema descriptions
static business rules
large reusable context blocks
But caching only helps when the repeated portion is stable.
Bad candidates for caching:
user-specific data
frequently changing retrieved chunks
dynamic conversation history
real-time tool outputs
So, when estimating cost, separate your input into:
cacheable tokens
non-cacheable tokens
This gives a more realistic estimate.
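One way to fold this into the estimate: split each request's input into a cacheable and a non-cacheable part, then bill cached tokens at a discounted rate. The discount factor and hit rate below are assumptions; check your provider's actual caching pricing.

```python
# Splitting input tokens into cacheable vs non-cacheable portions.
# Discount factor and hit rate are assumptions, not provider-specific numbers.
INPUT_PRICE_PER_1M = 3.00   # USD per 1M input tokens (assumed)
CACHED_DISCOUNT = 0.10      # cached tokens billed at 10% of normal (assumed)

def input_cost(cacheable: int, non_cacheable: int, cache_hit_rate: float) -> float:
    cached = cacheable * cache_hit_rate
    uncached = cacheable * (1 - cache_hit_rate) + non_cacheable
    return (cached * CACHED_DISCOUNT + uncached) / 1_000_000 * INPUT_PRICE_PER_1M

# 2,000 cacheable tokens (system prompt, tool schemas) + 3,900 dynamic tokens.
print(input_cost(2_000, 3_900, cache_hit_rate=0.0))  # no caching
print(input_cost(2_000, 3_900, cache_hit_rate=0.9))  # 90% hit rate
```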
What architects should estimate before production
Before shipping an LLM feature, estimate at least these numbers:
Average input tokens per request
Average output tokens per request
Requests per active user per day
Expected daily active users
Retry rate
Cache hit percentage
Number of LLM calls per workflow
Peak traffic multiplier
Monthly token volume
Cost per user
Cost per business transaction
For enterprise systems, also estimate:
cost by use case
cost by department
cost by model
cost by workflow step
cost by failed request
cost with and without caching
This helps prevent surprises later.
A practical example
Suppose we have:
Daily active users: 2,000
Requests per user per day: 15
Input tokens per request: 4,000
Output tokens per request: 700
LLM calls per workflow: 2
Retry rate: 10%
Total daily workflows:
2,000 × 15 = 30,000 workflows/day
Including two LLM calls per workflow:
30,000 × 2 = 60,000 LLM calls/day
Including 10% retry rate:
60,000 × 1.10 = 66,000 effective calls/day
Daily input tokens:
66,000 × 4,000 = 264,000,000 input tokens/day
Daily output tokens:
66,000 × 700 = 46,200,000 output tokens/day
Monthly estimate (30 days):
Input tokens/month = 7.92 billion
Output tokens/month = 1.386 billion
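Pulling the whole estimate into one function (prices are placeholder assumptions; everything else mirrors the numbers above):

```python
# End-to-end monthly estimate for the example above.
# Prices are placeholder assumptions; the rest mirrors the worked numbers.
def monthly_estimate(daily_users=2_000, requests_per_user=15,
                     calls_per_workflow=2, retry_rate=0.10,
                     input_tokens=4_000, output_tokens=700,
                     input_price_per_1m=3.00, output_price_per_1m=15.00,
                     days=30):
    calls_per_day = daily_users * requests_per_user * calls_per_workflow
    effective_calls = calls_per_day * (1 + retry_rate)      # 66,000/day
    monthly_input = effective_calls * input_tokens * days   # 7.92B
    monthly_output = effective_calls * output_tokens * days # 1.386B
    cost = (monthly_input / 1e6 * input_price_per_1m
            + monthly_output / 1e6 * output_price_per_1m)
    return monthly_input, monthly_output, cost

inp, out, cost = monthly_estimate()
print(f"{inp/1e9:.2f}B input, {out/1e9:.3f}B output, ~${cost:,.0f}/month")
```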
Now the architecture conversation changes.
This is no longer a “small AI feature.” This is a production inference workload.
Ways to reduce LLM cost
The most effective cost controls are usually architectural, not prompt hacks.
- Use smaller models for simple steps
Do not use a frontier model for every task.
Use smaller or cheaper models for:
classification
routing
metadata extraction
summarization
guardrail checks
format validation
Use larger models only where reasoning quality matters.
- Reduce unnecessary context
More context is not always better.
In RAG systems, sending 15 chunks when 5 are enough increases cost and can reduce answer quality.
Control:
chunk count
chunk size
conversation history
tool schema size
metadata verbosity
- Cache stable prompt sections
If the same system prompt, policy text, or tool definitions are reused across requests, caching can reduce cost.
Design prompts with reusable stable sections.
- Summarize long conversations
Instead of sending the full conversation every time, maintain a compact memory summary.
Example:
Full conversation history: 12,000 tokens
Compressed summary: 1,200 tokens
That difference matters at scale.
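At the request volumes from the practical example, that 10x compression on history alone is worth quantifying (price and volume are the same assumptions as before):

```python
# Cost of carrying conversation history, full vs summarized.
# Price and request volume are assumptions carried over from earlier sketches.
input_price_per_1m = 3.00   # USD per 1M input tokens (assumed)
requests_per_day = 30_000

def monthly_history_cost(history_tokens: int) -> float:
    return history_tokens * requests_per_day * 30 / 1_000_000 * input_price_per_1m

print(monthly_history_cost(12_000))  # full history:    $32,400/month
print(monthly_history_cost(1_200))   # compact summary:  $3,240/month
```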
- Measure cost per workflow, not cost per call
A single user action may trigger multiple model calls.
Track:
cost per chat response
cost per document processed
cost per support ticket resolved
cost per SQL answer generated
cost per agent task completed
Business-level cost is more useful than API-level cost.
The real lesson
LLM cost estimation is not just finance work. It is architecture work.
Your cost depends on:
model choice
prompt design
context size
RAG strategy
agent loop design
tool schema size
retry handling
caching strategy
output format
That means developers and architects should estimate cost before shipping, not after the bill arrives.
I created a free calculator for this:
👉 LLM Inference Cost Calculator
It helps estimate:
daily cost
monthly cost
cost per user
cost per workflow
token volume
prompt caching impact
multi-model comparison
You can also explore the broader calculator hub here:
👉 SuperML AI Calculators
Final thought
The future of AI architecture will not be only about choosing the best model.
It will be about choosing the right model, for the right step, with the right context, at the right cost.
That is the difference between a demo and a production AI system.