Most AI app prototypes look cheap.
Then production happens.
A developer tests an LLM feature with 20 prompts, gets a few good responses, and assumes the cost is manageable. But production cost is not based on one prompt. It is based on:
input tokens
output tokens
requests per user
users per day
retry rate
tool calls
prompt caching
conversation history
model choice
That is where many teams get caught off guard.
The mistake is simple: they estimate the cost of a single API call instead of estimating the cost of the full production workflow.
The basic LLM cost formula
At the simplest level, LLM cost is:
Input cost = input tokens × input price
Output cost = output tokens × output price
Total cost = input cost + output cost
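As code, the formula is trivial. The sketch below uses placeholder prices per million tokens, not any specific provider's rates; substitute your provider's actual pricing.

```python
# Minimal sketch of the basic cost formula.
# Prices are placeholder assumptions, not real provider rates.
INPUT_PRICE_PER_1M = 3.00    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 15.00  # USD per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost

print(call_cost(3_000, 800))  # one request: 3,000 input / 800 output tokens
```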
But real applications are rarely that simple.
A chat application may include:
system prompt
user message
conversation history
retrieved context
tool schemas
intermediate reasoning steps
final response
So the real token budget looks more like:
Total input tokens =
system prompt tokens
+ user prompt tokens
+ conversation history tokens
+ retrieved context tokens
+ tool schema tokens
+ intermediate step tokens
And output tokens may include:
assistant response
tool call arguments
intermediate responses
summaries
structured JSON outputs
That is why cost grows faster than expected.
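A quick way to see it is to add up each component per request. The token counts below are illustrative assumptions, not measurements.

```python
# Illustrative per-request token budget (all numbers are assumptions).
input_budget = {
    "system_prompt": 800,
    "user_prompt": 150,
    "conversation_history": 2_000,
    "retrieved_context": 3_500,
    "tool_schemas": 600,
    "intermediate_steps": 400,
}
output_budget = {
    "assistant_response": 500,
    "tool_call_arguments": 150,
    "structured_json_overhead": 100,
}

total_input = sum(input_budget.values())
total_output = sum(output_budget.values())
print(f"Input tokens per request:  {total_input:,}")
print(f"Output tokens per request: {total_output:,}")
```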
Example: a simple AI assistant
Assume we are building an internal AI assistant.
Each request uses:
Input tokens per request: 3,000
Output tokens per request: 800
Requests per day: 10,000
Daily token volume:
Input tokens/day = 3,000 × 10,000 = 30,000,000
Output tokens/day = 800 × 10,000 = 8,000,000
Monthly token volume (assuming 30 days):
Input tokens/month = 900,000,000
Output tokens/month = 240,000,000
Now multiply that by your provider’s model pricing.
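For example, with hypothetical prices of $3 per million input tokens and $15 per million output tokens (placeholders, not real rates), the monthly bill looks like this:

```python
# Monthly cost for the assistant example above.
# Prices are hypothetical placeholders; plug in your provider's rates.
input_tokens_month = 900_000_000
output_tokens_month = 240_000_000
input_price_per_1m = 3.00    # USD (assumed)
output_price_per_1m = 15.00  # USD (assumed)

input_cost = input_tokens_month / 1_000_000 * input_price_per_1m
output_cost = output_tokens_month / 1_000_000 * output_price_per_1m
print(f"Input:  ${input_cost:,.0f}/month")                 # $2,700
print(f"Output: ${output_cost:,.0f}/month")                # $3,600
print(f"Total:  ${input_cost + output_cost:,.0f}/month")   # $6,300
```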
This is where architecture decisions start to matter.
A larger model may give better results, but if the workload is high-volume, the cost difference can become significant. A smaller model may be good enough for classification, routing, extraction, or summarization tasks.
The hidden cost: output tokens
Developers often focus on input tokens.
That is a mistake.
For most providers, output tokens are priced higher than input tokens. Verbose responses also increase both latency and cost.
For example, this output style:
```json
{
  "answer": "...",
  "reasoning": "...",
  "sources": [...],
  "confidence": "...",
  "follow_up_questions": [...]
}
```
may be useful, but it costs more than:
```json
{
  "answer": "...",
  "sources": [...]
}
```
Structured output is great, but every field has a cost.
For production systems, response design is architecture.
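As a rough illustration, suppose the verbose schema averages 600 output tokens per response and the lean one 200 (assumed numbers). At the request volume from the earlier example, the difference adds up:

```python
# Assumed average output sizes for the two response schemas.
verbose_tokens = 600
lean_tokens = 200
requests_per_day = 10_000
output_price_per_1m = 15.00  # USD per 1M output tokens (assumed)

def monthly_output_cost(tokens_per_request: int) -> float:
    monthly_tokens = tokens_per_request * requests_per_day * 30
    return monthly_tokens / 1_000_000 * output_price_per_1m

saving = monthly_output_cost(verbose_tokens) - monthly_output_cost(lean_tokens)
print(f"Monthly saving from the leaner schema: ${saving:,.0f}")  # $1,800
```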
RAG makes the cost calculation harder
In a Retrieval-Augmented Generation system, every request may include retrieved chunks.
Example:
System prompt: 800 tokens
User question: 100 tokens
Retrieved chunks: 5 chunks × 700 tokens = 3,500 tokens
Conversation history: 1,500 tokens
Output reserve: 800 tokens
Total request size:
```text
800 + 100 + 3,500 + 1,500 = 5,900 input tokens
```
A RAG system is not just a vector database problem. It is also a token budgeting problem.
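A small helper makes the budgeting explicit. Chunk count and chunk size are the assumptions that dominate:

```python
# Per-request input budget for a RAG call (defaults match the example above).
def rag_input_tokens(system=800, question=100, chunks=5,
                     tokens_per_chunk=700, history=1_500) -> int:
    return system + question + chunks * tokens_per_chunk + history

print(rag_input_tokens())           # 5,900 tokens
print(rag_input_tokens(chunks=10))  # 9,400 tokens: doubling the chunk count
```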
Agentic workflows multiply cost
Agentic AI makes cost estimation even more important.
A simple chat request may be one LLM call.
An agent task may involve:
1. Intent classification
2. Planning
3. Tool selection
4. Tool execution
5. Observation
6. Retry or correction
7. Final answer generation
That means one user request can become 5 to 10 LLM calls.
If each step uses different prompts, context, and outputs, the real cost is:
Cost per user task =
planning cost
+ tool-selection cost
+ tool-call argument generation cost
+ observation summarization cost
+ retry cost
+ final answer cost
This is why “cost per API call” is the wrong metric for agents.
The better metric is: cost per completed task
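Here is a sketch of "cost per completed task", summing the calls in an agent loop. The per-step token counts, prices, and retry rate are illustrative assumptions, not measurements of any real agent.

```python
# Illustrative agent loop: each step is one LLM call with its own token profile.
# All token counts, prices, and the retry rate are assumptions for the sketch.
INPUT_PRICE_PER_1M = 3.00    # USD (assumed)
OUTPUT_PRICE_PER_1M = 15.00  # USD (assumed)

steps = [
    # (name, input_tokens, output_tokens)
    ("intent_classification", 1_000, 50),
    ("planning",              2_500, 300),
    ("tool_selection",        2_000, 100),
    ("tool_argument_gen",     1_500, 200),
    ("observation_summary",   3_000, 250),
    ("final_answer",          4_000, 600),
]
retry_rate = 0.10  # 10% of steps repeat on average (assumed)

def step_cost(inp: int, out: int) -> float:
    return inp / 1e6 * INPUT_PRICE_PER_1M + out / 1e6 * OUTPUT_PRICE_PER_1M

cost_per_task = sum(step_cost(i, o) for _, i, o in steps) * (1 + retry_rate)
print(f"Cost per completed task: ${cost_per_task:.4f}")
```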
Prompt caching can change the economics
If your provider supports prompt caching, repeated prompt sections can be cheaper.
Good candidates for caching:
system prompts
policy instructions
tool definitions
schema descriptions
static business rules
large reusable context blocks
But caching only helps when the repeated portion is stable.
Bad candidates for caching:
user-specific data
frequently changing retrieved chunks
dynamic conversation history
real-time tool outputs
So, when estimating cost, separate your input into:
cacheable tokens
non-cacheable tokens
This gives a more realistic estimate.
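One way to fold this into the estimate: split each request's input into a cacheable and a non-cacheable part, then bill cached tokens at a discounted rate. The discount factor and hit rate below are assumptions; check your provider's actual caching pricing.

```python
# Splitting input tokens into cacheable vs non-cacheable portions.
# Discount factor and hit rate are assumptions, not provider-specific numbers.
INPUT_PRICE_PER_1M = 3.00   # USD per 1M input tokens (assumed)
CACHED_DISCOUNT = 0.10      # cached tokens billed at 10% of normal (assumed)

def input_cost(cacheable: int, non_cacheable: int, cache_hit_rate: float) -> float:
    cached = cacheable * cache_hit_rate
    uncached = cacheable * (1 - cache_hit_rate) + non_cacheable
    return (cached * CACHED_DISCOUNT + uncached) / 1_000_000 * INPUT_PRICE_PER_1M

# 2,000 cacheable tokens (system prompt, tool schemas) + 3,900 dynamic tokens.
print(input_cost(2_000, 3_900, cache_hit_rate=0.0))  # no caching
print(input_cost(2_000, 3_900, cache_hit_rate=0.9))  # 90% hit rate
```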
What architects should estimate before production
Before shipping an LLM feature, estimate at least these numbers:
Average input tokens per request
Average output tokens per request
Requests per active user per day
Expected daily active users
Retry rate
Cache hit percentage
Number of LLM calls per workflow
Peak traffic multiplier
Monthly token volume
Cost per user
Cost per business transaction
For enterprise systems, also estimate:
cost by use case
cost by department
cost by model
cost by workflow step
cost by failed request
cost with and without caching
This helps prevent surprises later.
A practical example
Suppose we have:
Daily active users: 2,000
Requests per user per day: 15
Input tokens per request: 4,000
Output tokens per request: 700
LLM calls per workflow: 2
Retry rate: 10%
Total daily workflows:
2,000 × 15 = 30,000 workflows/day
Including two LLM calls per workflow:
30,000 × 2 = 60,000 LLM calls/day
Including 10% retry rate:
60,000 × 1.10 = 66,000 effective calls/day
Daily input tokens:
66,000 × 4,000 = 264,000,000 input tokens/day
Daily output tokens:
66,000 × 700 = 46,200,000 output tokens/day
Monthly estimate (30 days):
Input tokens/month = 7.92 billion
Output tokens/month = 1.386 billion
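Pulling the whole estimate into one function (prices are placeholder assumptions; everything else mirrors the numbers above):

```python
# End-to-end monthly estimate for the example above.
# Prices are placeholder assumptions; the rest mirrors the worked numbers.
def monthly_estimate(daily_users=2_000, requests_per_user=15,
                     calls_per_workflow=2, retry_rate=0.10,
                     input_tokens=4_000, output_tokens=700,
                     input_price_per_1m=3.00, output_price_per_1m=15.00,
                     days=30):
    calls_per_day = daily_users * requests_per_user * calls_per_workflow
    effective_calls = calls_per_day * (1 + retry_rate)      # 66,000/day
    monthly_input = effective_calls * input_tokens * days   # 7.92B
    monthly_output = effective_calls * output_tokens * days # 1.386B
    cost = (monthly_input / 1e6 * input_price_per_1m
            + monthly_output / 1e6 * output_price_per_1m)
    return monthly_input, monthly_output, cost

inp, out, cost = monthly_estimate()
print(f"{inp/1e9:.2f}B input, {out/1e9:.3f}B output, ~${cost:,.0f}/month")
```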
Now the architecture conversation changes.
This is no longer a “small AI feature.” This is a production inference workload.
Ways to reduce LLM cost
The most effective cost controls are usually architectural, not prompt hacks.
- Use smaller models for simple steps
Do not use a frontier model for every task.
Use smaller or cheaper models for:
classification
routing
metadata extraction
summarization
guardrail checks
format validation
Use larger models only where reasoning quality matters.
- Reduce unnecessary context
More context is not always better.
In RAG systems, sending 15 chunks when 5 are enough increases cost and can reduce answer quality.
Control:
chunk count
chunk size
conversation history
tool schema size
metadata verbosity
- Cache stable prompt sections
If the same system prompt, policy text, or tool definitions are reused across requests, caching can reduce cost.
Design prompts with reusable stable sections.
- Summarize long conversations
Instead of sending the full conversation every time, maintain a compact memory summary.
Example:
Full conversation history: 12,000 tokens
Compressed summary: 1,200 tokens
That difference matters at scale.
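At the request volumes from the practical example, that 10x compression on history alone is worth quantifying (price and volume are the same assumptions as before):

```python
# Cost of carrying conversation history, full vs summarized.
# Price and request volume are assumptions carried over from earlier sketches.
input_price_per_1m = 3.00   # USD per 1M input tokens (assumed)
requests_per_day = 30_000

def monthly_history_cost(history_tokens: int) -> float:
    return history_tokens * requests_per_day * 30 / 1_000_000 * input_price_per_1m

print(monthly_history_cost(12_000))  # full history:    $32,400/month
print(monthly_history_cost(1_200))   # compact summary:  $3,240/month
```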
- Measure cost per workflow, not cost per call
A single user action may trigger multiple model calls.
Track:
cost per chat response
cost per document processed
cost per support ticket resolved
cost per SQL answer generated
cost per agent task completed
Business-level cost is more useful than API-level cost.
The real lesson
LLM cost estimation is not just finance work. It is architecture work.
Your cost depends on:
model choice
prompt design
context size
RAG strategy
agent loop design
tool schema size
retry handling
caching strategy
output format
That means developers and architects should estimate cost before shipping, not after the bill arrives.
I created a free calculator for this:
👉 LLM Inference Cost Calculator
It helps estimate:
daily cost
monthly cost
cost per user
cost per workflow
token volume
prompt caching impact
multi-model comparison
You can also explore the broader calculator hub here:
👉 SuperML AI Calculators
Final thought
The future of AI architecture will not be only about choosing the best model.
It will be about choosing the right model, for the right step, with the right context, at the right cost.
That is the difference between a demo and a production AI system.