The Price Per Million Tokens Is Lying to You

About 9 months ago, I was building a RAG system. For those who don't know, it's a kind of enhanced memory system for AI agents. One of the agentic flows needed semantic similarity, and I had GPT-4o running it because, well, it was OpenAI's flagship model. Best model, best results, right?

I decided to actually test that assumption. After a few days of systematic testing, I found that a model costing roughly 10x less (GPT-4.1-mini at the time) was giving me equal or better results on that specific task. Not marginally. Noticeably better. On a task I assumed required the most recent, most expensive option.

That experience broke something in how I thought about AI model selection, and I've spent the months since digging into why this happens and how widespread it is.

The pricing page tells you almost nothing.

Every AI provider publishes a price per million tokens. Input tokens, output tokens, maybe a cached rate. Simple enough. But this number is close to meaningless in production because it ignores two things that completely change the math.

First, tokenization. Different models tokenize the same input differently. GPT-5, Claude Sonnet 4.5, Gemini 3.0 Flash: give them the exact same prompt, the exact same input text, and they will produce different token counts. Sometimes the difference is 10-15%. Sometimes it's more. So "price per million tokens" is comparing apples to oranges from the start, because a million tokens from one model does not represent the same amount of work as a million tokens from another.
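As a toy illustration, here is how a tokenizer that produces more tokens can quietly erase a lower headline rate. All rates and token counts below are made up for the example, not real pricing:

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost in dollars for one request."""
    return tokens / 1_000_000 * price_per_million

# Hypothetical numbers: the same prompt, billed as a different number
# of input tokens by each model's tokenizer.
# Model A: $2.50/M input, tokenizes our prompt to 1,000 tokens.
# Model B: $2.20/M input, but its tokenizer produces 15% more tokens.
cost_a = input_cost(1_000, 2.50)
cost_b = input_cost(1_150, 2.20)

print(f"Model A: ${cost_a:.6f}  Model B: ${cost_b:.6f}")
# Model B's headline rate is 12% lower, yet this request
# actually costs slightly MORE on Model B.
```

The only way to know the real ratio for your workload is to run your own prompts through both tokenizers and count.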

Second, and this is the bigger one: output volume. This is where reasoning and chain-of-thought models completely blow up the math. A model like DeepSeek Reasoner, gpt-5.2-pro, or Claude Opus 4.6 will think through a problem step by step, and that thinking generates tokens. Lots of them. Ask two models the same question: one gives you a 200 token answer, the other gives you 3,000 tokens of reasoning plus a 200 token answer. The second model might be cheaper per million tokens and still cost you 5x more on the actual task.
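A rough sketch of that scenario in numbers. Again, every price and token count here is hypothetical, chosen only to mirror the shape of the problem:

```python
def task_cost(in_tok: int, out_tok: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost of one call; prices are per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Model A: pricier per million, terse 200-token answer.
a = task_cost(in_tok=500, out_tok=200, in_price=3.00, out_price=15.00)

# Model B: 3x cheaper per million output tokens, but it emits
# 3,000 reasoning tokens (billed as output) plus the same 200-token answer.
b = task_cost(in_tok=500, out_tok=3_200, in_price=1.00, out_price=5.00)

print(f"A: ${a:.5f} per task, B: ${b:.5f} per task")
# B is the "cheap" model on the pricing page and still costs
# several times more per task, purely because of output volume.
```

Per-task cost, not per-million price, is the number that shows up on your invoice.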

I've seen this over and over. A model that is "10x cheaper" by the pricing page ends up being more expensive in practice because of how it handles the workload. And a model that looks expensive on paper can be cheaper per task because it's efficient with its tokens.

Why generic benchmarks don't help here

The instinct when choosing a model is to check the leaderboards. MMLU, HumanEval, LMArena, LiveBench. These are useful for understanding general capability. But they tell you nothing about your specific use case.

I'm not being contrarian here. This is just the reality of how these models work. The variables are incredibly subtle. The way you phrase a prompt, the structure of your input, even the position of a comma can change which model performs best. A model that scores 92% on MMLU might score 60% on your classification task while a model that scores 85% on MMLU nails it at 95%.

And none of these benchmarks account for cost. You could be using the "best" model on the leaderboard and spending 10x what you need to, because a model three tiers below it handles your specific workload just as well, if not better.

What actually matters in production

If you're running AI in production, or even just evaluating which model to use for a project, the metrics that matter are:

  1. Accuracy on YOUR task. Not a generic benchmark. Your actual prompts, your actual data, your actual expected outputs.

  2. Real token cost. Not price per million, but what the model actually costs you per task, per call, per pipeline run. This includes input tokens (which vary by tokenizer), output tokens (which vary wildly by model behavior), and any reasoning tokens that get billed.

  3. Latency. Time to first token and total completion time. For agentic workflows or user-facing features, this matters as much as cost.

  4. Consistency. Some models give you brilliant output 70% of the time and garbage the other 30%. Others are boringly reliable. For production, boring and reliable wins every time.

The problem is that getting these numbers requires actually running your workload across multiple models. Not once, not with one prompt, but systematically, on a schedule, with enough variation to get statistically meaningful results. Most teams don't do this because it's tedious and time-consuming. They pick the model that "feels right" based on what seems to work and leaderboard rankings, ship it, and never look back.
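A systematic pass does not need heavy tooling to get started. Here is a minimal sketch of such a harness: `call_model` is a hypothetical stand-in for your provider client (it should return the answer and the dollar cost of one call), and exact-match scoring stands in for whatever accuracy metric fits your task:

```python
import statistics
import time

def evaluate(call_model, model: str, cases: list, runs: int = 3) -> dict:
    """Run each (prompt, expected) case several times against one model
    and aggregate the metrics that matter in production."""
    scores, costs, latencies = [], [], []
    for prompt, expected in cases:
        for _ in range(runs):
            t0 = time.perf_counter()
            answer, cost = call_model(model, prompt)
            latencies.append(time.perf_counter() - t0)
            costs.append(cost)
            scores.append(1.0 if answer == expected else 0.0)
    return {
        "accuracy": statistics.mean(scores),
        "cost_per_task": statistics.mean(costs),
        "p50_latency_s": statistics.median(latencies),
        # Low spread across repeated runs = a consistent model.
        "consistency": 1.0 - statistics.pstdev(scores),
    }
```

Run this for each candidate model over the same cases and compare the dictionaries side by side; the repeated runs per case are what surface the "brilliant 70% of the time" failure mode.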

This is how you end up spending $10k/month on API calls when $2k would give you the same output quality.

The real lesson

The AI model market is moving fast. New models every few weeks. Price cuts, capability jumps, new providers entering. The model that was optimal for your use case three months ago might not be optimal today.

The only way to actually know what works best for you is to test it. On your data, with your prompts, measuring the things that matter for your specific situation. Everything else is guessing.

I learned this the hard way when I found out I was overpaying by 10x on a pipeline I assumed needed a flagship model. Since then, I've made it a practice to re-evaluate model selection whenever a significant new release drops. The cost savings and performance improvements make it worth it every single time.

Bio: Marc Kean Paker is the founder of OpenMark, an AI model benchmarking platform designed to move teams away from leaderboard guessing and toward deterministic, cost-aware model selection.
