Most teams compare AI APIs by model quality first and price second.
That is backwards once you have real usage.
The line item that matters is usually not "price per token" by itself. It is:
monthly cost ≈ requests
× (avg input tokens × input price per token
 + avg output tokens × output price per token)
× (1 + retry rate)
− cache savings
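That formula is easy to sanity-check in a few lines of Python. Everything below is a sketch with made-up placeholder numbers, not any provider's real pricing:

```python
def monthly_cost(requests, avg_in_tokens, avg_out_tokens,
                 in_price_per_m, out_price_per_m,
                 retry_rate=0.0, cache_savings=0.0):
    """Rough monthly spend estimate. Prices are $ per 1M tokens."""
    per_request = (avg_in_tokens * in_price_per_m
                   + avg_out_tokens * out_price_per_m) / 1_000_000
    # A retry resends the whole request, so scale by (1 + retry rate).
    return requests * per_request * (1 + retry_rate) - cache_savings

# Hypothetical workload: 100k requests/month, 2k input / 500 output tokens,
# $0.50/1M input, $1.50/1M output, 5% retry rate.
print(monthly_cost(100_000, 2_000, 500, 0.50, 1.50, retry_rate=0.05))
# → 183.75 ($/month)
```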
Here are the five numbers I check before choosing a model.
1. Input/output token ratio
Input and output are priced differently on most APIs.
For chatbots, support agents, code review tools, and report generators, output can dominate the bill because the model writes much more than the user sends.
A cheap-input model can still be expensive if its output price is high and your responses are long.
2. Cache hit rate
If your app repeatedly sends the same system prompt, tool schema, policies, or long context, cached input pricing can change the economics.
This matters most for:
- coding assistants
- support bots with large policy context
- RAG apps with repeated instructions
- internal agents with long tool definitions
If you ignore caching, you may overestimate the monthly cost of larger-context models.
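To see how much caching moves the needle, blend the regular and cached input rates by the cache hit rate. The prices below are invented placeholders; the point is the shape of the calculation, assuming cached input is billed at a flat discounted per-token rate:

```python
def input_cost(total_in_tokens, cache_hit_rate,
               in_price_per_m, cached_price_per_m):
    """Blend fresh and cached input pricing. Prices are $ per 1M tokens."""
    cached = total_in_tokens * cache_hit_rate
    fresh = total_in_tokens - cached
    return (fresh * in_price_per_m + cached * cached_price_per_m) / 1_000_000

# 1B input tokens/month; cached input at a quarter of the regular rate.
no_cache = input_cost(1_000_000_000, 0.0, 2.00, 0.50)
with_cache = input_cost(1_000_000_000, 0.8, 2.00, 0.50)
print(no_cache, with_cache)  # → 2000.0 800.0
```

An 80% hit rate cuts the input bill by more than half here, which is why a "more expensive" large-context model can come out cheaper than the headline price suggests.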
3. Retry rate
The cheapest API is not always the cheapest workflow.
If a low-cost model needs retries, validation cleanup, or a second "fix this JSON" pass, the effective cost goes up fast.
Example:
model A: $0.20 per task, 1 pass
model B: $0.08 per task, but 3 passes often needed
Model B looks cheaper on paper but loses in production.
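If you model each pass as an independent attempt with some success rate, the expected number of passes is geometric (1 / success rate), and the comparison above falls out directly. The success rates here are illustrative, not measured:

```python
def effective_cost(cost_per_pass, success_rate):
    """Expected cost per completed task, assuming retry-until-success."""
    expected_passes = 1.0 / success_rate  # geometric distribution
    return cost_per_pass * expected_passes

model_a = effective_cost(0.20, 0.99)   # succeeds almost every pass
model_b = effective_cost(0.08, 1 / 3)  # needs ~3 passes on average
print(round(model_a, 3), round(model_b, 3))  # → 0.202 0.24
```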
4. Latency cost
Latency has a money cost even if the API invoice does not show it.
Slow models can reduce conversion, increase queue time, or force you to run more parallel workers.
For user-facing flows, I usually separate models into:
- realtime/chat UX
- background jobs
- batch/offline processing
Those should not always use the same model.
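In practice that split can be as simple as a routing table keyed by workload tier. The model names below are hypothetical placeholders:

```python
# Route by workload tier instead of using one model everywhere.
# Model names are made up for illustration.
MODEL_BY_TIER = {
    "realtime": "small-fast-model",    # chat UX: latency first
    "background": "mid-tier-model",    # async jobs: balance cost/quality
    "batch": "cheap-batch-model",      # offline: cost first
}

def pick_model(tier: str) -> str:
    return MODEL_BY_TIER[tier]

print(pick_model("realtime"))  # → small-fast-model
```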
5. Monthly volume bands
At low volume, a more expensive model might be fine if it saves engineering time.
At high volume, tiny per-token differences matter.
A difference of $0.50 per million tokens is irrelevant at 10M tokens/month. It is very relevant at 2B tokens/month.
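The arithmetic behind that claim:

```python
def monthly_delta(tokens_per_month, price_diff_per_m):
    """Monthly spend difference for a given $/1M token price gap."""
    return tokens_per_month / 1_000_000 * price_diff_per_m

print(monthly_delta(10_000_000, 0.50))     # → 5.0   ($5/month at 10M tokens)
print(monthly_delta(2_000_000_000, 0.50))  # → 1000.0 ($1,000/month at 2B tokens)
```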
Quick checklist
Before switching models, estimate:
requests/month
avg input tokens/request
avg output tokens/request
cacheable input %
retry/failure rate
latency requirement
Then compare models by workload, not by headline benchmark score.
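The checklist items can be bundled into one small estimator that combines all five numbers. Again, every price and rate below is a placeholder, not real provider pricing:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requests_per_month: int
    avg_in_tokens: int
    avg_out_tokens: int
    cacheable_input_pct: float  # 0.0–1.0
    retry_rate: float           # extra passes per request

def estimate(w, in_price, cached_price, out_price):
    """Rough monthly spend. All prices in $ per 1M tokens; a sketch, not a billing tool."""
    cached_in = w.avg_in_tokens * w.cacheable_input_pct
    fresh_in = w.avg_in_tokens - cached_in
    per_request = (fresh_in * in_price
                   + cached_in * cached_price
                   + w.avg_out_tokens * out_price) / 1_000_000
    return w.requests_per_month * per_request * (1 + w.retry_rate)

w = Workload(500_000, 3_000, 400, 0.7, 0.02)
print(round(estimate(w, 1.00, 0.25, 4.00), 2))  # → 1542.75
```

Run the same `Workload` through each candidate model's prices and compare the totals; that comparison is usually more decisive than any benchmark score.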
I keep a daily-updated pricing table and calculator here if you want current $/1M token numbers across providers:
https://www.aipricing.guru/pricing/
At the moment I’m tracking 89 models across 11 providers, with separate input, cached input, and output pricing where available.