Stop getting surprise per-token LLM bills: a flat-rate, auto-routing API approach

#llm #api #openai #ai

If you ship anything on top of an LLM API, you've probably had this moment: you check the dashboard at the end of the month and the bill is 3x what you modeled. Nothing broke. Usage just... drifted. A few prompts got chattier, one model started "thinking" more, and your per-token math quietly fell apart.

I've been living in that loop, so I want to lay out why per-token pricing is hard to forecast, and a different billing shape that trades some theoretical savings for a number you can actually predict.

Why per-token spend is so hard to model

Per-token billing looks simple — price × tokens — but three things make it slippery in practice:

1. Output length is not yours to control. max_tokens is a hard ceiling, but the model decides how much of that ceiling it actually uses. Two models given the identical prompt can produce wildly different output lengths, and the verbose one costs you more for the same task.

2. "Reasoning" tokens are easy to miss until billed. Some newer models (e.g. OpenAI's o-series) generate internal reasoning tokens that are billed as output tokens but never appear in the response text. The count is reported in the response's usage data, but if you only log the visible answer, your logs show a 200-token answer while the invoice counts the 1,500 tokens it took to get there.

3. Input grows silently. RAG context, longer chat histories, system prompts that accrete over time — input token counts creep up release by release, and without active monitoring nobody notices until the bill does.

Individually these are fine. Together they mean your cost-per-request has a fat tail, and a fat tail is exactly what you can't put in a pricing page or a unit-economics spreadsheet.
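
To make the compounding concrete, here is a minimal sketch of per-request cost under per-token billing. The prices, the token counts, and the names Request and per_token_cost are illustrative assumptions, not any provider's actual rates or API. The sketch also assumes reasoning tokens are charged at the output rate.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int = 0  # charged for, but absent from the answer text


def per_token_cost(req: Request, input_price_per_1k: float,
                   output_price_per_1k: float) -> float:
    """Dollar cost of one request; reasoning tokens use the output rate (assumption)."""
    billed_output = req.visible_output_tokens + req.reasoning_tokens
    return (req.input_tokens * input_price_per_1k
            + billed_output * output_price_per_1k) / 1000


# Hypothetical prices in dollars per 1K tokens.
INPUT_PRICE, OUTPUT_PRICE = 0.001, 0.004

# What the spreadsheet assumed for this task.
modeled = Request(input_tokens=500, visible_output_tokens=200)
# Same task a few releases later: context has grown and the model now "thinks".
actual = Request(input_tokens=1500, visible_output_tokens=200, reasoning_tokens=1300)

modeled_cost = per_token_cost(modeled, INPUT_PRICE, OUTPUT_PRICE)
actual_cost = per_token_cost(actual, INPUT_PRICE, OUTPUT_PRICE)
print(f"modeled ${modeled_cost:.4f}, actual ${actual_cost:.4f}, "
      f"ratio {actual_cost / modeled_cost:.1f}x")
```

The visible answer is 200 tokens in both cases, yet the second request costs several times more. Nothing in the answer text itself would tell you why.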

The alternative: bill a flat price per call

The idea is simple: instead of charging for whatever tokens happened, charge a flat rate per request within a quality tier you pick. The bill becomes calls × tier_price, full stop. Output length, reasoning tokens, which specific model answered — none of it changes what you pay.

You give up something real here (more on that below), but you gain the one thing per-token can't offer: a cost you can forecast before you ship. For a lot of products — fixed-price SaaS features, internal tools with a budget, anything where you quote a customer a number — that predictability is worth more than shaving a few percent off the theoretical optimum.

Concretely: say a request averages 1,000 tokens and a cheap model bills $0.0005/1K. That's ~$0.0005/call, so a flat tier at, say, $0.002/call costs about 4x as much on that uniform workload. But the moment your requests vary (some 200 tokens, some 8,000, some routed to a frontier model), the per-token average climbs and its variance explodes, while the flat price stays put. Flat pricing wins on the spread, not the average.
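
A rough way to see "spread, not average" is to compute both for a uniform and a mixed workload. In this sketch the cheap price and the flat price come from the example above; the frontier price and the mixed workload itself are made-up assumptions chosen for illustration.

```python
from statistics import mean, pstdev

FLAT_PRICE = 0.002       # dollars per call, from the example above
CHEAP_PER_1K = 0.0005    # dollars per 1K tokens, from the example above
FRONTIER_PER_1K = 0.01   # hypothetical frontier-model price


def per_token_costs(workload):
    """workload: list of (total_tokens, price_per_1k) pairs -> per-call costs."""
    return [tokens / 1000 * price for tokens, price in workload]


uniform = [(1000, CHEAP_PER_1K)] * 4
mixed = [
    (200, CHEAP_PER_1K),
    (1000, CHEAP_PER_1K),
    (8000, CHEAP_PER_1K),
    (300, FRONTIER_PER_1K),  # a short but hard request sent to the frontier model
]

for name, workload in (("uniform", uniform), ("mixed", mixed)):
    costs = per_token_costs(workload)
    print(f"{name}: per-token mean ${mean(costs):.4f}, "
          f"stdev ${pstdev(costs):.4f}, max ${max(costs):.4f}; "
          f"flat ${FLAT_PRICE:.4f} per call with zero spread")
```

With these made-up numbers the mixed workload's per-token mean lands close to the flat price, but individual calls range from a small fraction of it to about double it. That per-call range is the part a flat price removes.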

Pairing it with auto-routing

Flat-per-call pricing gets more interesting when you stop hand-picking the model. If every request costs the same regardless of which model serves it, you can let a router classify each request's difficulty and send it to the cheapest model that can actually handle it — a small model for "summarize this", a frontier model for "debug this race condition" — without you wiring up that logic or eating a surprise when a hard request lands on an expensive model.
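
The routing idea can be sketched in a few lines. This is a toy under loud assumptions: the keyword-based classifier, the three model names, and their capability scores and costs are all hypothetical, and a real router would likely use something better than keyword matching. It only illustrates the rule "cheapest model that clears the difficulty bar".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: int       # 1 = simple tasks only, 3 = hardest tasks
    cost_per_call: float  # hypothetical cost to the gateway, in dollars


# Hypothetical model pool.
MODELS = [
    Model("small-model", capability=1, cost_per_call=0.0003),
    Model("mid-model", capability=2, cost_per_call=0.0020),
    Model("frontier-model", capability=3, cost_per_call=0.0200),
]

# Toy difficulty hints; stand-ins for a real classifier.
HARD_HINTS = ("debug", "race condition", "prove")
MEDIUM_HINTS = ("explain", "compare", "refactor")


def classify_difficulty(prompt: str) -> int:
    """Return 1 (easy) to 3 (hard) from crude keyword matching."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return 3
    if any(hint in text for hint in MEDIUM_HINTS):
        return 2
    return 1


def route(prompt: str, models=MODELS) -> Model:
    """Pick the cheapest model whose capability meets the prompt's difficulty."""
    needed = classify_difficulty(prompt)
    capable = [m for m in models if m.capability >= needed]
    return min(capable, key=lambda m: m.cost_per_call)


for prompt in ("summarize this", "debug this race condition"):
    print(f"{prompt!r} -> {route(prompt).name}")
```

The caller pays the same flat price either way; the difference in cost_per_call is the gateway's margin problem, not the developer's.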

The developer-facing contract stays boring on purpose:

curl https://your-gateway/v1/chat/completions \
  -H "content-type: application/json" \
  -H "authorization: Bearer $KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role":"user","content":"Explain quantum entanglement in one sentence."}]
  }'

It's an OpenAI-compatible /chat/completions body, so existing SDKs work by changing base_url and the key. You send one model name (auto) and let routing + a flat tier price do the rest. Near-zero migration is part of the pitch: if it needs a rewrite, nobody switches.
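
For reference, here is the same request built with Python's standard library, as a sketch. The host is the placeholder from the curl example, the key is a dummy value, and build_chat_request and send are hypothetical helper names. An OpenAI SDK client constructed with a custom base_url should produce an equivalent body, but that depends on the gateway actually being compatible.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str,
                       prompt: str) -> urllib.request.Request:
    """Build (but do not send) the same POST as the curl example."""
    body = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a reachable gateway)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request(
    "https://your-gateway/v1",  # placeholder host from the curl example
    "sk-placeholder",           # dummy key; read the real one from the environment
    "Explain quantum entanglement in one sentence.",
)
print(request.get_method(), request.full_url)
```

Calling send(request) is left out on purpose, since the placeholder host does not resolve.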

When flat pricing does NOT pay off (the honest part)

This isn't a free lunch, and pretending otherwise would be dishonest:

High-volume, short, predictable calls. If your requests are tiny and uniform (think: classification of one-line inputs at scale), per-token on a cheap model will almost certainly beat a flat per-call rate. Flat pricing's value is in variance reduction, and you have no variance.
You've already done the FinOps work. If you have tight token budgets, prompt-length guards, and good observability, you've manufactured your own predictability and a flat tier buys you less.
You need features flat tiers cap. Vision inputs, tool/function-calling, huge outputs: flat per-call tiers often bound these (depending on the gateway) to keep the price honest. If you need them, metered pricing fits better.
You have strict latency SLAs. Cost-only routing can pick a cheap-but-slow model. If you're under a hard latency budget, the router needs a latency filter too — extra complexity that eats some of the simplicity win.
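
The latency filter mentioned in the last point could look roughly like this. The candidate names, costs, and p95 latencies are invented for illustration, and capability filtering is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float  # hypothetical provider cost in dollars
    p95_latency_ms: int   # hypothetical measured latency


def pick_model(candidates, latency_budget_ms: int) -> Optional[Candidate]:
    """Cheapest candidate within the latency budget, or None if none qualifies."""
    fast_enough = [c for c in candidates if c.p95_latency_ms <= latency_budget_ms]
    if not fast_enough:
        return None  # caller decides: relax the budget or reject the request
    return min(fast_enough, key=lambda c: c.cost_per_call)


# Hypothetical pool.
CANDIDATES = [
    Candidate("cheap-but-slow", cost_per_call=0.0004, p95_latency_ms=4000),
    Candidate("mid-and-fast", cost_per_call=0.0012, p95_latency_ms=900),
    Candidate("frontier", cost_per_call=0.0150, p95_latency_ms=2500),
]

for budget in (10_000, 1_000):
    print(f"budget {budget} ms -> {pick_model(CANDIDATES, budget).name}")
```

With a loose budget the cheapest model wins; tightening the budget forces a more expensive pick, which is the cost that a latency SLA adds on the gateway side.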

The rule of thumb: flat pricing is insurance against variance. The more your per-request cost jumps around — mixed task difficulty, verbose or reasoning-heavy models, growing context — the more that insurance is worth. The flatter your workload already is, the less you need it.
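
If flat pricing is insurance, the premium can be estimated up front. This sketch reuses the $0.002/call and $0.0005/1K figures from the earlier example; the monthly call volume is a made-up assumption, and the function names are hypothetical.

```python
def break_even_tokens(flat_price_per_call: float,
                      price_per_1k_tokens: float) -> float:
    """Tokens per call at which per-token cost equals the flat price."""
    return flat_price_per_call / price_per_1k_tokens * 1000


def monthly_premium(calls_per_month: int, avg_tokens_per_call: float,
                    flat_price_per_call: float,
                    price_per_1k_tokens: float) -> float:
    """Extra dollars per month paid for flat pricing; negative means flat is cheaper."""
    per_token_cost = avg_tokens_per_call / 1000 * price_per_1k_tokens
    return calls_per_month * (flat_price_per_call - per_token_cost)


FLAT, CHEAP_PER_1K = 0.002, 0.0005  # figures from the earlier example
CALLS = 1_000_000                   # hypothetical monthly volume

print(f"break-even: {break_even_tokens(FLAT, CHEAP_PER_1K):.0f} tokens per call")
print(f"premium at 1,000 tokens/call: "
      f"${monthly_premium(CALLS, 1000, FLAT, CHEAP_PER_1K):,.0f} per month")
```

At these prices the break-even sits at about 4,000 tokens per call, and a million uniform 1,000-token calls would pay roughly $1,500 a month for the flat rate. Whether that premium is worth it depends on how far your real requests stray from the average.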

Takeaway

Per-token billing isn't wrong, but it optimizes for a metric (tokens) that isn't the one you're trying to control (a predictable bill). If you're quoting fixed prices to customers, or you just want your LLM line item to stop surprising you, a flat per-call rate — ideally with auto-routing underneath — is worth a look.

I've been building this approach into a small OpenAI-compatible gateway called Modelis (one key for GPT/Claude/Gemini, auto routing, flat per-plan pricing, free tier to try). If you want to kick the tires, it's at modelishub.com. But the billing argument stands on its own — I'd genuinely like to hear where flat-vs-metered breaks for your workload in the comments.
