There's no "cheapest model." There's a cheapest token shape.

#ai #llm #machinelearning #devops

Every time someone asks how to cut their LLM bill, the first question is "which model is cheapest?"
It's the wrong question. I built a cost simulator to check this properly, and across every scenario I model, the cheapest model is almost always the same tiny one. GPT-5.4 nano wins on raw price basically every time. If that were the whole story, model choice would be trivial and nobody would think about cost at all.
The interesting part isn't which model is cheapest. It's where the money actually goes, and that's driven by the shape of your usage, not the name on the model.

The number you're guessing at controls your bill

Take a customer support scenario. Same everything, and I only change one input: average output length.
At 350 output tokens per response, nano costs about $63/month, and the bill is roughly balanced: input and output are close to even.
Bump output to 1,400 tokens, the kind of thing you'd get if your responses got a little more verbose, and the same scenario jumps to $159/month. Output is now 70% of the bill.
One slider. The number most people wave their hand at ("a few hundred tokens?") just multiplied the cost by about 2.5 and completely changed what's driving it. And output is the expensive token: on most current models it's priced around 6x the input rate. Guessing low on output length is the most expensive mistake in the estimate.
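To make that arithmetic concrete, here is a minimal sketch of "tokens times price" with output priced at 6x input. It is not the simulator's code. The `Shape` class, the `monthly_cost` function, the $0.20 per million input price, the 2,000 input tokens, and the 1,000 requests/day are all placeholder assumptions, so the dollar figures it prints will not match the ones above. What carries over is the pattern: raising only the output length moves both the total and the output share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Token shape of a workload (hypothetical helper, not from the tool)."""
    input_tokens: int       # average input tokens per request
    output_tokens: int      # average output tokens per request
    requests_per_day: int


def monthly_cost(shape, input_price_per_m, output_price_per_m, days=30):
    """Return (total_dollars, output_share) for one month of traffic.

    Prices are dollars per million tokens. Retries and unused context
    are deliberately ignored here.
    """
    requests = shape.requests_per_day * days
    input_cost = requests * shape.input_tokens * input_price_per_m / 1_000_000
    output_cost = requests * shape.output_tokens * output_price_per_m / 1_000_000
    total = input_cost + output_cost
    return total, output_cost / total


# Placeholder prices: only the 6x output/input ratio comes from the article.
INPUT_PRICE = 0.20
OUTPUT_PRICE = INPUT_PRICE * 6

for out_tokens in (350, 1400):
    shape = Shape(input_tokens=2000, output_tokens=out_tokens, requests_per_day=1000)
    total, share = monthly_cost(shape, INPUT_PRICE, OUTPUT_PRICE)
    print(f"{out_tokens} output tokens: ${total:.2f}/month, output share {share:.0%}")
```

Swap in your own prices and shape; the only thing worth trusting in this sketch is the structure of the calculation.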

Same "cheapest model," different driver

Now an agent scenario: 1,200 input tokens, 900 output tokens, 1,500 requests/day. nano comes out at about $111/month, output around 52% of it.
Note what happened: the cheapest model didn't change. It's still nano. But the driver did. Support with long replies was output-dominated. The agent, with heavier input and moderate output, sits closer to balanced, and retries and unused context start showing up as real line items.
That's the whole point. "Support" and "agent" don't have inherent cost profiles. The token shape you plug in does. Two people running the same agent scenario with different output assumptions get different answers about what to optimize.

The stuff you can't see is the stuff that costs you

On the pricier model in that same agent scenario (Gemini 3.5 Flash), two costs stood out that nobody budgets for:

Retries at a 12% rate: about $54/month
Context you're paying for but not using: about $62/month

Wasted context outweighed retries. Neither shows up when you eyeball "tokens times price." Both are real money, every month, quietly.
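One way to put rough numbers on these two line items is sketched below. The accounting is my assumption, not necessarily how the simulator does it: each retry is treated as re-paying a full request (input plus output), and wasted context is treated as a fraction of the input spend. The function name `hidden_costs` and every number in the example are hypothetical.

```python
def hidden_costs(input_cost, output_cost, retry_rate, unused_context_fraction):
    """Estimate two monthly overheads from a base bill.

    Assumptions (not taken from the tool):
    - a retry re-sends the whole request and regenerates the whole reply,
      so it costs the same as an ordinary request;
    - unused context is a fixed fraction of input tokens that the model
      reads but the task never needed.
    """
    retry_cost = retry_rate * (input_cost + output_cost)
    wasted_context_cost = unused_context_fraction * input_cost
    return retry_cost, wasted_context_cost


# Hypothetical base bill: $300 of input and $500 of output per month.
retries, wasted = hidden_costs(
    input_cost=300.0,
    output_cost=500.0,
    retry_rate=0.12,               # the 12% rate from the scenario above
    unused_context_fraction=0.25,  # placeholder guess
)
print(f"retries: ${retries:.2f}/month, unused context: ${wasted:.2f}/month")
```

Which of the two is larger depends entirely on your retry rate, how much of your context goes unused, and how input-heavy your shape is, so measure those rather than reusing these placeholders.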

Model choice is a fixed lever. Shape sets the stakes.

Here's the part that surprised me most. Across both scenarios, at every output setting I tried, the gap between the cheap model and the quality model held at roughly 7x. nano vs Flash: ~7.3x in support, ~7.3x in agent.
So switching models is a fixed multiplier, a known, ~7x lever you can pull once. But your token shape sets the absolute size of the bill you're multiplying. Getting the shape right matters before the model question even becomes interesting.
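A quick sketch of why the multiplier can stay flat while the bill moves: if the pricier model charges the same multiple of the cheap model's price on both input and output, the ratio between the two bills does not depend on the shape at all. That equal-multiple condition is my assumption about why the gap held; if the two models had different output-to-input price ratios, the multiplier would drift with the shape. The prices and the `request_cost` helper below are placeholders, and 7.3 is just the gap quoted above.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request; prices are per token in arbitrary units."""
    return input_tokens * input_price + output_tokens * output_price


# Placeholder cheap-model prices with output at 6x input.
cheap_in, cheap_out = 0.20, 1.20

# Assumption: the pricier model is a uniform multiple of the cheap one.
k = 7.3
pricey_in, pricey_out = cheap_in * k, cheap_out * k

# (input_tokens, output_tokens) shapes; the first two are made up,
# the third uses the agent scenario's 1,200 / 900 split.
shapes = [(2000, 350), (2000, 1400), (1200, 900)]

ratios = [
    request_cost(i, o, pricey_in, pricey_out) / request_cost(i, o, cheap_in, cheap_out)
    for i, o in shapes
]
for (i, o), ratio in zip(shapes, ratios):
    print(f"{i} in / {o} out: pricier model costs {ratio:.2f}x the cheap one")
```

The ratio is the same for every shape, while the absolute cost per request is not. That is the sense in which the model is a fixed lever and the shape sets the stakes.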
The order most people use is backwards. They pick a model first, then get surprised by the bill. The bill was set by the shape they never examined.

Run your own shape

I'm not asking you to trust my numbers. They're my assumptions, and the whole point is that assumptions are where this lives. The useful thing is to run your shape: your real output length, your retry rate, your context usage, and see which driver is actually eating your bill. The simulator flags those drivers for you per scenario, with a dollar estimate on each.
That's the tool: modelindex.io. Pick a scenario, set your tokens, see where the money goes.
I'd genuinely like to know if the drivers it surfaces match what you see in your own production bills. That's the part I'm least sure generalizes, and the thing I'd most like to be told I'm wrong about.