There's no "cheapest model." There's a cheapest token shape.

ModelIndex — Thu, 02 Jul 2026 21:39:01 +0000

Every time someone asks how to cut their LLM bill, the first question is "which model is cheapest?"
It's the wrong question. I built a cost simulator to check this properly, and across every scenario I model, the cheapest model is almost always the same tiny one. GPT-5.4 nano wins on raw price basically every time. If that were the whole story, model choice would be trivial and nobody would think about cost at all.
The interesting part isn't which model is cheapest. It's where the money actually goes, and that's driven by the shape of your usage, not the name on the model.

The number you're guessing at controls your bill

Take a customer support scenario. Same everything, and I only change one input: average output length.
At 350 output tokens per response, nano costs about $63/month, and the bill is roughly balanced: input and output are close to even.
Bump output to 1,400 tokens, the kind of thing you'd get if your responses got a little more verbose, and the same scenario jumps to $159/month. Output is now 70% of the bill.
One slider. The number most people wave their hand at ("a few hundred tokens?") just multiplied the cost by about 2.5 and completely changed what's driving it. And output is the expensive token: on most current models it's priced around 6x the input rate. Guessing low on output length is the most expensive mistake in the estimate.
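Here's a minimal sketch of that sensitivity. The input size, request volume, and prices are hypothetical placeholders (only the 6x output-to-input price ratio comes from the text above), so it won't reproduce the simulator's dollar figures. It just shows how one output-length assumption moves both the total and the driver.

```python
def monthly_cost(input_tokens, output_tokens, requests_per_day,
                 input_price_per_m, output_price_per_m, days=30):
    """Return (total_dollars, output_share) for one month of traffic.

    Prices are dollars per million tokens. Retries and unused context
    are deliberately left out of this sketch.
    """
    requests = requests_per_day * days
    input_cost = requests * input_tokens * input_price_per_m / 1_000_000
    output_cost = requests * output_tokens * output_price_per_m / 1_000_000
    total = input_cost + output_cost
    return total, output_cost / total


# Hypothetical shape and prices; output is priced at 6x the input rate.
for out_tokens in (350, 1400):
    total, share = monthly_cost(
        input_tokens=2100, output_tokens=out_tokens, requests_per_day=1000,
        input_price_per_m=0.10, output_price_per_m=0.60,
    )
    print(f"{out_tokens} output tokens: ${total:.2f}/month, output share {share:.0%}")
```

With these made-up inputs the bill starts out evenly split and ends up output-dominated, purely because of the output-length guess.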

Same "cheapest model," different driver

Now an agent scenario: 1,200 input tokens, 900 output tokens, 1,500 requests/day. nano comes out at about $111/month, with output around 52% of it.
Note what happened: the cheapest model didn't change. It's still nano. But the driver did. Support with long replies was output-dominated. The agent, with heavier input and moderate output, sits closer to balanced, and retries and unused context start showing up as real line items.
That's the whole point. "Support" and "agent" don't have inherent cost profiles. The token shape you plug in does. Two people running the same agent scenario with different output assumptions get different answers about what to optimize.

The stuff you can't see is the stuff that costs you

On the pricier model in that same agent scenario (Gemini 3.5 Flash), two costs stood out that nobody budgets for:

Retries at a 12% rate: about $54/month
Context you're paying for but not using: about $62/month

Wasted context outweighed retries. Neither shows up when you eyeball "tokens times price." Both are real money, every month, quietly.
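One way to make those two line items visible is to split them out explicitly. This is a rough sketch under my own simplifying assumptions, not the simulator's method: every retry re-sends the full request and regenerates the full response, and unused context is a flat fraction of the input tokens. The prices and the 40% unused fraction are hypothetical.

```python
def hidden_costs(input_tokens, output_tokens, requests_per_month,
                 input_price_per_m, output_price_per_m,
                 retry_rate, unused_context_fraction):
    """Split a monthly bill into base, retry, and unused-context dollars.

    Assumptions (illustrative only):
    - a retry costs the same as the original request, and is billed on top
      of the base cost;
    - unused context is a fixed share of input tokens, so its cost is
      already inside "base" rather than added to it.
    """
    input_per_request = input_tokens * input_price_per_m / 1_000_000
    output_per_request = output_tokens * output_price_per_m / 1_000_000
    base = requests_per_month * (input_per_request + output_per_request)
    retries = base * retry_rate
    unused_context = requests_per_month * input_per_request * unused_context_fraction
    return {"base": base, "retries": retries, "unused_context": unused_context}


# Agent shape from the text (1,200 in / 900 out / 1,500 requests a day),
# with hypothetical prices and a hypothetical 40% unused-context share.
breakdown = hidden_costs(
    input_tokens=1200, output_tokens=900, requests_per_month=1500 * 30,
    input_price_per_m=0.50, output_price_per_m=3.00,
    retry_rate=0.12, unused_context_fraction=0.40,
)
for name, dollars in breakdown.items():
    print(f"{name}: ${dollars:.2f}/month")
```

Which of the two comes out larger depends entirely on the retry rate and unused fraction you plug in, so measure both before assuming.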

Model choice is a fixed lever. Shape sets the stakes.

Here's the part that surprised me most. Across both scenarios, at every output setting I tried, the gap between the cheap model and the quality model held at roughly 7x. nano vs Flash: ~7.3x in support, ~7.3x in agent.
So switching models is a fixed multiplier, a known, ~7x lever you can pull once. But your token shape sets the absolute size of the bill you're multiplying. Getting the shape right matters before the model question even becomes interesting.
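One plausible reason the gap stays flat (my assumption, I haven't checked it against the actual price tables): if both models price output at the same multiple of input, the token shape cancels out of the ratio. A small sketch with hypothetical prices:

```python
def cost_per_request(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request; prices are dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical price pairs (input, output). Both use a 6x output/input ratio,
# and QUALITY is 7.3x CHEAP on both sides.
CHEAP = (0.10, 0.60)
QUALITY = (0.73, 4.38)
# A hypothetical model with a different output/input ratio (3x), for contrast.
OTHER = (0.73, 2.19)


def model_gap(input_tokens, output_tokens, cheap=CHEAP, quality=QUALITY):
    """How many times more the pricier model costs for the same token shape."""
    return (cost_per_request(input_tokens, output_tokens, *quality)
            / cost_per_request(input_tokens, output_tokens, *cheap))


for shape in [(2100, 350), (2100, 1400), (1200, 900)]:
    same_ratio = model_gap(*shape)
    different_ratio = model_gap(*shape, quality=OTHER)
    print(f"{shape}: same-ratio gap {same_ratio:.2f}x, different-ratio gap {different_ratio:.2f}x")
```

When the two models share an output/input ratio, the gap is identical for every shape. When they don't, the gap moves with the shape, so the "fixed lever" only holds for pairs of models priced with similar ratios.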
The order most people use is backwards. They pick a model first, then get surprised by the bill. The bill was set by the shape they never examined.

Run your own shape

I'm not asking you to trust my numbers. They're my assumptions, and the whole point is that assumptions are where this lives. The useful thing is to run your shape: your real output length, your retry rate, your context usage, and see which driver is actually eating your bill. The simulator flags the drivers for you per scenario, with a dollar estimate on each.
That's the tool: modelindex.io. Pick a scenario, set your tokens, see where the money goes.
I'd genuinely like to know if the drivers it surfaces match what you see in your own production bills. That's the part I'm least sure generalizes, and the thing I'd most like to be told I'm wrong about.

AI Agents Don’t Scale Like Chatbots

ModelIndex — Thu, 19 Feb 2026 13:35:40 +0000

Originally published on Medium:
https://medium.com/@ravi.myakala/ai-agents-dont-scale-like-chatbots-2434e4fbe321

Most LLM cost estimates use something like:

cost = requests * avg_tokens * price_per_token

That works for chat systems.
It breaks for AI agents.

In multi-step agent systems, cost isn’t driven primarily by request volume — it’s driven by execution depth.

Chat Workloads (Linear Scaling)

A typical chat interaction looks like:

User request
   ↓
LLM
   ↓
Response

cost ≈ requests * tokens_per_request

If traffic doubles, cost doubles.
Predictable. Linear.
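As a quick sanity check, the chat formula above can be written as a function. The traffic, token count, and per-token price below are hypothetical.

```python
def chat_cost(requests, tokens_per_request, price_per_token):
    """Linear chat cost: one LLM call per user request."""
    return requests * tokens_per_request * price_per_token


# Hypothetical numbers: 10,000 requests, 800 tokens each, $0.000001 per token.
baseline = chat_cost(10_000, 800, 0.000001)
doubled = chat_cost(20_000, 800, 0.000001)
print(f"baseline ${baseline:.2f}, doubled traffic ${doubled:.2f}")
```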

Agent Workloads (Internal Multiplication)

Now compare that with a tool-using agent:

User task
   ↓
Reasoning step
   ↓
Tool call
   ↓
Reflection
   ↓
Another tool call
   ↓
More reasoning
   ↓
Final output

A single task can trigger multiple LLM invocations.
This internal expansion is the structural difference.

The Real Agent Cost Model

Instead of:

cost ≈ requests * tokens

Agent systems look more like:

cost ≈ (
    tasks
    * execution_depth
    * tokens_per_step
    * retry_multiplier
    * burst_factor
    * price_per_token
)

Where:
execution_depth = number of reasoning/tool steps per task
retry_multiplier = amplification from tool failures
burst_factor = volatility from uneven task complexity

The dominant driver becomes execution depth, not traffic.
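The formula above translates directly into a small function. This is a sketch, not a calibrated model: the `AgentWorkload` class name and all the numbers in the usage example are illustrative assumptions, and only the factor names come from the formula.

```python
from dataclasses import dataclass, replace


@dataclass
class AgentWorkload:
    tasks: int                      # user tasks per billing period
    execution_depth: float          # reasoning/tool steps per task
    tokens_per_step: float          # average tokens consumed per step
    retry_multiplier: float = 1.0   # 1.0 means no retries
    burst_factor: float = 1.0       # 1.0 means uniform task complexity


def agent_cost(w: AgentWorkload, price_per_token: float) -> float:
    """Multiply out the agent cost factors from the formula above."""
    return (w.tasks * w.execution_depth * w.tokens_per_step
            * w.retry_multiplier * w.burst_factor * price_per_token)


# Hypothetical workload: same task volume, but depth creeps from 3 to 8 steps.
baseline = AgentWorkload(tasks=10_000, execution_depth=3, tokens_per_step=1_500,
                         retry_multiplier=1.1, burst_factor=1.2)
crept = replace(baseline, execution_depth=8)

price = 0.000001  # hypothetical dollars per token
ratio = agent_cost(crept, price) / agent_cost(baseline, price)
print(f"cost grew {ratio:.2f}x with zero traffic growth")
```

Keeping the factors as separate fields makes it easy to vary one at a time and see which one your forecast is most sensitive to.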

Why Teams Underestimate Agent Cost

Common failure points:

Execution Depth Creep: workflows evolve from 3 steps to 6–8 steps over time.
Retry Amplification: tool failures add extra reasoning cycles.
Context Accumulation: memory grows across steps.
Burst Volatility: some tasks expand far deeper than others.

By the time telemetry shows cost drift, the architecture is already deployed.
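Context accumulation is worth a closer look because it makes depth creep worse than linear. Here's a simplified sketch, assuming each step re-reads everything accumulated so far and appends a fixed number of tokens. The function name and the token counts are hypothetical.

```python
def task_input_tokens(depth, base_context, tokens_added_per_step):
    """Total input tokens for one task when every step re-reads all prior context."""
    total = 0
    context = base_context
    for _ in range(depth):
        total += context                  # this step reads the whole context so far
        context += tokens_added_per_step  # and leaves new tokens behind for the next step
    return total


# Hypothetical task: 1,000 tokens of starting context, 500 tokens added per step.
print(task_input_tokens(3, 1000, 500))  # 4500
print(task_input_tokens(6, 1000, 500))  # 13500
```

Under these assumptions, doubling the depth from 3 to 6 steps triples the input tokens per task, which is why a flat tokens-per-step estimate tends to undershoot for deeper workflows.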

A Canonical Agent Scenario

I modeled a canonical multi-step AI agent workload with:

Controlled execution depth
Tool retries
Context accumulation
Burst volatility

Full structural breakdown here:
👉 https://www.modelindex.io/scenarios/ai-agent

The goal isn’t benchmarking models — it’s understanding structural cost behavior before deployment.

Key Takeaway

Chat systems scale with traffic.
Agent systems scale with internal execution depth.
If you’re modeling cost for multi-step workflows, execution depth is the variable you should track first.

Would love to hear how others are forecasting agent cost in production.

DEV Community: ModelIndex