Most teams compare AI APIs by model quality first and price second.
That is backwards once you have real usage.
The line item that matters is usually not "price per token" by itself. It is:
monthly cost ≈ requests
× (avg input tokens × input price per token
 + avg output tokens × output price per token)
× (1 + retry rate)
− cache savings
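That formula is easy to sanity-check in a few lines of Python. Everything below is a sketch with made-up placeholder numbers, not any provider's real pricing:

```python
def monthly_cost(requests, avg_in_tokens, avg_out_tokens,
                 in_price_per_m, out_price_per_m,
                 retry_rate=0.0, cache_savings=0.0):
    """Rough monthly spend estimate. Prices are $ per 1M tokens."""
    per_request = (avg_in_tokens * in_price_per_m
                   + avg_out_tokens * out_price_per_m) / 1_000_000
    # A retry resends the whole request, so scale by (1 + retry rate).
    return requests * per_request * (1 + retry_rate) - cache_savings

# Hypothetical workload: 100k requests/month, 2k input / 500 output tokens,
# $0.50/1M input, $1.50/1M output, 5% retry rate.
print(monthly_cost(100_000, 2_000, 500, 0.50, 1.50, retry_rate=0.05))
# → 183.75 ($/month)
```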
Here are the five numbers I check before choosing a model.
1. Input/output token ratio
Input and output are priced differently on most APIs.
For chatbots, support agents, code review tools, and report generators, output can dominate the bill because the model writes much more than the user sends.
A cheap-input model can still be expensive if its output price is high and your responses are long.
2. Cache hit rate
If your app repeatedly sends the same system prompt, tool schema, policies, or long context, cached input pricing can change the economics.
This matters most for:
- coding assistants
- support bots with large policy context
- RAG apps with repeated instructions
- internal agents with long tool definitions
If you ignore caching, you may overestimate the monthly cost of larger-context models.
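To see how much caching moves the needle, blend the regular and cached input rates by the cache hit rate. The prices below are invented placeholders; the point is the shape of the calculation, assuming cached input is billed at a flat discounted per-token rate:

```python
def input_cost(total_in_tokens, cache_hit_rate,
               in_price_per_m, cached_price_per_m):
    """Blend fresh and cached input pricing. Prices are $ per 1M tokens."""
    cached = total_in_tokens * cache_hit_rate
    fresh = total_in_tokens - cached
    return (fresh * in_price_per_m + cached * cached_price_per_m) / 1_000_000

# 1B input tokens/month; cached input at a quarter of the regular rate.
no_cache = input_cost(1_000_000_000, 0.0, 2.00, 0.50)
with_cache = input_cost(1_000_000_000, 0.8, 2.00, 0.50)
print(no_cache, with_cache)  # → 2000.0 800.0
```

An 80% hit rate cuts the input bill by more than half here, which is why a "more expensive" large-context model can come out cheaper than the headline price suggests.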
3. Retry rate
The cheapest API is not always the cheapest workflow.
If a low-cost model needs retries, validation cleanup, or a second "fix this JSON" pass, the effective cost goes up fast.
Example:
model A: $0.20 per task, 1 pass
model B: $0.08 per task, but 3 passes often needed
Model B looks cheaper on paper but loses in production.
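If you model each pass as an independent attempt with some success rate, the expected number of passes is geometric (1 / success rate), and the comparison above falls out directly. The success rates here are illustrative, not measured:

```python
def effective_cost(cost_per_pass, success_rate):
    """Expected cost per completed task, assuming retry-until-success."""
    expected_passes = 1.0 / success_rate  # geometric distribution
    return cost_per_pass * expected_passes

model_a = effective_cost(0.20, 0.99)   # succeeds almost every pass
model_b = effective_cost(0.08, 1 / 3)  # needs ~3 passes on average
print(round(model_a, 3), round(model_b, 3))  # → 0.202 0.24
```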
4. Latency cost
Latency has a money cost even if the API invoice does not show it.
Slow models can reduce conversion, increase queue time, or force you to run more parallel workers.
For user-facing flows, I usually separate models into:
- realtime/chat UX
- background jobs
- batch/offline processing
Those should not always use the same model.
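In practice that split can be as simple as a routing table keyed by workload tier. The model names below are hypothetical placeholders:

```python
# Route by workload tier instead of using one model everywhere.
# Model names are made up for illustration.
MODEL_BY_TIER = {
    "realtime": "small-fast-model",    # chat UX: latency first
    "background": "mid-tier-model",    # async jobs: balance cost/quality
    "batch": "cheap-batch-model",      # offline: cost first
}

def pick_model(tier: str) -> str:
    return MODEL_BY_TIER[tier]

print(pick_model("realtime"))  # → small-fast-model
```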
5. Monthly volume bands
At low volume, a more expensive model might be fine if it saves engineering time.
At high volume, tiny per-token differences matter.
A difference of $0.50 per million tokens is irrelevant at 10M tokens/month. It is very relevant at 2B tokens/month.
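The arithmetic behind that claim:

```python
def monthly_delta(tokens_per_month, price_diff_per_m):
    """Monthly spend difference for a given $/1M token price gap."""
    return tokens_per_month / 1_000_000 * price_diff_per_m

print(monthly_delta(10_000_000, 0.50))     # → 5.0   ($5/month at 10M tokens)
print(monthly_delta(2_000_000_000, 0.50))  # → 1000.0 ($1,000/month at 2B tokens)
```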
Quick checklist
Before switching models, estimate:
requests/month
avg input tokens/request
avg output tokens/request
cacheable input %
retry/failure rate
latency requirement
Then compare models by workload, not by headline benchmark score.
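The checklist items can be bundled into one small estimator that combines all five numbers. Again, every price and rate below is a placeholder, not real provider pricing:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requests_per_month: int
    avg_in_tokens: int
    avg_out_tokens: int
    cacheable_input_pct: float  # 0.0–1.0
    retry_rate: float           # extra passes per request

def estimate(w, in_price, cached_price, out_price):
    """Rough monthly spend. All prices in $ per 1M tokens; a sketch, not a billing tool."""
    cached_in = w.avg_in_tokens * w.cacheable_input_pct
    fresh_in = w.avg_in_tokens - cached_in
    per_request = (fresh_in * in_price
                   + cached_in * cached_price
                   + w.avg_out_tokens * out_price) / 1_000_000
    return w.requests_per_month * per_request * (1 + w.retry_rate)

w = Workload(500_000, 3_000, 400, 0.7, 0.02)
print(round(estimate(w, 1.00, 0.25, 4.00), 2))  # → 1542.75
```

Run the same `Workload` through each candidate model's prices and compare the totals; that comparison is usually more decisive than any benchmark score.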
I keep a daily-updated pricing table and calculator here if you want current $/1M token numbers across providers:
https://www.aipricing.guru/pricing/
At the moment I’m tracking 89 models across 11 providers, with separate input, cached input, and output pricing where available.