Every published "tokens per second" number you've used to pick an inference provider is measured on a workload that doesn't exist in your production system. The leaderboard is wrong, and not in a small way — the rankings invert as context length grows, and the model topping the chart at 200 tokens of prompt can be the slowest one at 50k. If you chose your stack based on Artificial Analysis screenshots, there's a decent chance you optimized for the one scenario you'll never run.
The obvious counter is that benchmarks are still directionally useful — that a provider 3x faster on short prompts is probably at least competitive on long ones, and that the leaderboard captures something real about hardware and software investment. That's fair as far as it goes. A team with no benchmark at all is worse off than a team using a misleading one. But "directionally useful" is doing heavy lifting here when the actual direction reverses past some context threshold, and the threshold sits inside the range where most real RAG and agent workloads live.
Here is the mechanical reason. Transformer inference has two phases that look almost nothing like each other. Prefill processes your entire input context in parallel and is compute-bound: it scales with FLOPs and matrix-multiply throughput. Decode generates one token at a time and is memory-bandwidth-bound: it scales with how fast you can stream the model weights and the KV cache from memory. A provider can be world-class at one and mediocre at the other. Groq's LPU is the textbook example: spectacular short-context decode numbers driven by deterministic on-chip SRAM, an advantage that narrows as prefill starts dominating total latency on long inputs. Dense H100 fleets sit at the opposite end of that tradeoff.
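To make the two-phase argument concrete, here is a minimal sketch of a latency model that adds prefill time to decode time. The two providers and all four throughput figures are invented for illustration, not measurements of any real service, and the model deliberately ignores batching, queuing, and the fact that real prefill cost is not perfectly linear in prompt length.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    prefill_tok_per_s: float  # hypothetical prompt-processing rate
    decode_tok_per_s: float   # hypothetical generation rate


def request_latency_s(p: Provider, prompt_tokens: int, output_tokens: int) -> float:
    """Two-phase estimate: time to process the prompt plus time to generate."""
    return prompt_tokens / p.prefill_tok_per_s + output_tokens / p.decode_tok_per_s


# Invented numbers, chosen only to show the shape of the tradeoff.
fast_decode = Provider("fast_decode", prefill_tok_per_s=5_000, decode_tok_per_s=500)
fast_prefill = Provider("fast_prefill", prefill_tok_per_s=20_000, decode_tok_per_s=100)


def faster(prompt_tokens: int, output_tokens: int = 200) -> str:
    """Name of the provider with the lower estimated latency."""
    best = min(
        (fast_decode, fast_prefill),
        key=lambda p: request_latency_s(p, prompt_tokens, output_tokens),
    )
    return best.name


for prompt_tokens in (200, 2_000, 10_000, 50_000):
    a = request_latency_s(fast_decode, prompt_tokens, 200)
    b = request_latency_s(fast_prefill, prompt_tokens, 200)
    print(f"{prompt_tokens:>6} prompt tokens: fast_decode={a:.2f}s "
          f"fast_prefill={b:.2f}s winner={faster(prompt_tokens)}")
```

With these made-up rates the decode-optimized provider wins at 200 prompt tokens and loses at 50,000, and the crossover lands a little past 10,000. The specific threshold means nothing; the point is that a crossover exists whenever one provider leads on prefill and the other on decode.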
The KV cache is where it gets worse. Cache size grows linearly with context length per request, which means that as your prompts get longer, fewer requests fit on a GPU at once, batch sizes collapse, and the throughput each GPU can sustain drops with them. A provider running aggressive batching on short prompts looks fast on the leaderboard precisely because the leaderboard prompt is short. Push it to 32k or 64k tokens at production concurrency and you're measuring a different system. The published number is not a lie, but it is an answer to a question you are not asking.
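The arithmetic behind that collapse is short enough to write down. The sketch below uses the standard per-token KV cache size for a decoder-only transformer (keys plus values, per layer, per KV head). The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 40 GiB cache budget are assumptions picked for round numbers, not the configuration of any particular provider, and the result is an upper bound that ignores allocator overhead and activation memory.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes: int, context_tokens: int,
                            per_token_bytes: int) -> int:
    """How many full-length requests fit in the cache budget at once."""
    return kv_budget_bytes // (context_tokens * per_token_bytes)


# Assumed model shape and memory budget, for illustration only.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # fp16
budget = 40 * 1024**3  # assume 40 GiB is left for KV cache after weights

print(f"KV cache per token: {per_token // 1024} KiB")
for context_tokens in (512, 4_096, 32_768, 65_536):
    fit = max_concurrent_requests(budget, context_tokens, per_token)
    print(f"{context_tokens:>6} tokens of context: up to {fit} concurrent requests")
```

Under these assumptions the same GPU holds 640 requests at 512 tokens of context and 5 at 64k. A scheduler that depends on large batches to hit its headline number has very little to work with at the long end.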
Then layer in speculative decoding, chunked prefill, prefix caching, and whatever else your provider quietly toggled last week. Each one is a win on some context-length-and-concurrency point and a tax on another. Two providers that publish similar headline numbers can have completely different shapes once you plot latency against your real prompt distribution. The rankings don't just shift — they cross each other, sometimes more than once.
The fix is not glamorous. Measure p50 and p95 latency, both time to first token and end to end, at your context length, your output length, your concurrency. Use Artificial Analysis and the LLMPerf repo for a starting frame, then run your own load against your own traffic shape. The hour you spend doing this is the highest-leverage hour in your inference-stack budget, and it is an hour that nobody (not a vendor, not a blog post, not this one) can do for you.
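A rough sketch of what that hour can look like: a small asyncio harness that fires requests at a fixed concurrency and reports p50 and p95 for time to first token and total latency. `fake_stream` is a stand-in that only sleeps; it is hypothetical and should be replaced with a streaming call to whatever client your provider offers. The percentile is a simple nearest-rank version, and the timer starts once a request acquires its concurrency slot, so client-side queueing is excluded.

```python
import asyncio
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile; good enough for a quick load test."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def fake_stream(prompt_tokens, output_tokens):
    """Hypothetical placeholder: swap in your provider's streaming call."""
    await asyncio.sleep(prompt_tokens * 1e-6)  # pretend prefill
    for _ in range(output_tokens):
        await asyncio.sleep(0.0005)  # pretend per-token decode
        yield "tok"


async def timed_request(stream_fn, prompt_tokens, output_tokens, sem):
    """Return (time to first token, total latency) for one request."""
    async with sem:  # cap in-flight requests at the chosen concurrency
        start = time.perf_counter()
        ttft = None
        async for _ in stream_fn(prompt_tokens, output_tokens):
            if ttft is None:
                ttft = time.perf_counter() - start
        return ttft, time.perf_counter() - start


async def run_load(stream_fn, prompt_tokens, output_tokens, concurrency, n_requests):
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(*[
        timed_request(stream_fn, prompt_tokens, output_tokens, sem)
        for _ in range(n_requests)
    ])
    ttfts = [r[0] for r in results]
    totals = [r[1] for r in results]
    return {
        "ttft_p50": percentile(ttfts, 50), "ttft_p95": percentile(ttfts, 95),
        "total_p50": percentile(totals, 50), "total_p95": percentile(totals, 95),
    }


if __name__ == "__main__":
    stats = asyncio.run(run_load(fake_stream, prompt_tokens=32_000,
                                 output_tokens=50, concurrency=4, n_requests=20))
    for name, seconds in stats.items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Run it once per point on your real prompt-length distribution rather than at a single length, and keep the concurrency close to what production sees; a single sequential request tells you little about a batched system.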
The leaderboards aren't going away, and developers will keep citing them in architecture docs because they're easy to screenshot and put in a slide. But if your workload involves long context, real concurrency, or both — which is to say, if it's a workload at all — the fastest model in the headline is rarely the fastest model in production. Stop optimizing for the demo prompt.