DEV Community

Thousand Miles AI


Your model speed benchmark is measuring the wrong thing

Model speed is not a property of the model. It is a property of the model plus your payload size plus your output format plus whether you're constraining decoding. Most published rankings collapse those four axes into one number, and that number is wrong for almost every production workload. If you picked a model based on tokens/sec on someone's leaderboard, you almost certainly picked the wrong one.

The strongest counter is that benchmarks are useful precisely because they normalize away workload variance — you can't compare models if everyone tests on their own prompt. Fair. But normalization that hides the crossover point is worse than no benchmark at all. A leaderboard that ranks Model A above Model B on 200-token completions, when your production workload generates 2,000 tokens of constrained JSON, has actively misled you. The normalization is the bug.

Here is what is actually going on.

There are two latency clocks: time-to-first-token (TTFT) and total generation time. They respond to payload size in opposite ways. At short outputs, TTFT dominates perceived latency, because the user is waiting for the first character. Models built around aggressive speculative decoding win here, because the draft-verification cycle amortizes over a small number of tokens before the user sees anything. At long outputs, total generation time dominates and speculative decoding's edge erodes: draft acceptance rates tend to fall as the sequence grows, so the technique that won the short race can lose the long one.
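If you want to see the two clocks separately, here is a minimal sketch. It assumes your client hands you the response as an iterable of tokens or chunks; `time_stream` and `StreamTiming` are names made up for this example, not part of any SDK.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float    # request start -> first token
    total_s: float   # request start -> last token
    n_tokens: int    # number of chunks seen on the stream


def time_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record both latency clocks.

    Assumption: the request is only issued when iteration begins
    (true for lazy generators). If your client sends the request
    earlier, take the start timestamp before that call instead.
    """
    start = clock()
    first = None
    n_tokens = 0
    for _ in stream:
        if first is None:
            first = clock()  # first token arrived
        n_tokens += 1
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    return StreamTiming(first - start, end - start, n_tokens)


# Usage with a stand-in stream; replace the list with your real streaming call.
timing = time_stream(["Hello", ",", " world"])
print(timing.n_tokens)
```

The `clock` parameter is there only so the function can be tested with a fake clock. In production you would leave it at the default.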

This is why the developer who benchmarked Haiku against a larger model on a chatbot turn and picked Haiku is now confused when Haiku is slower on their 4k-token summarization endpoint. Nothing changed about the model. The workload changed, and the workload was always the thing being measured.
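The crossover is easy to see with arithmetic. The sketch below assumes the simplest possible model, total time = TTFT + output tokens / decode rate, with a constant decode rate. Real decode rates drift with sequence length, and every number here is hypothetical, so treat it as an illustration of the shape rather than a prediction.

```python
from typing import Optional


def total_latency_s(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Simplified model: fixed startup cost plus a constant decode rate."""
    return ttft_s + n_tokens / tokens_per_s


def crossover_tokens(
    ttft_a: float, tps_a: float, ttft_b: float, tps_b: float
) -> Optional[float]:
    """Output length where models A and B take equal total time.

    Returns None when the two lines never cross at a positive length.
    """
    slope_diff = 1 / tps_a - 1 / tps_b
    if slope_diff == 0:
        return None
    n = (ttft_b - ttft_a) / slope_diff
    return n if n > 0 else None


# Hypothetical numbers: A starts fast but decodes slowly, B is the reverse.
A = {"ttft_s": 0.2, "tokens_per_s": 60.0}
B = {"ttft_s": 0.8, "tokens_per_s": 120.0}

for n in (50, 2000):
    a = total_latency_s(A["ttft_s"], A["tokens_per_s"], n)
    b = total_latency_s(B["ttft_s"], B["tokens_per_s"], n)
    print(n, "A wins" if a < b else "B wins")

print(crossover_tokens(0.2, 60.0, 0.8, 120.0))
```

With these made-up inputs, A is ahead at 50 tokens and B is ahead at 2,000, and the two swap places at roughly 72 output tokens. A leaderboard that only tests below that length will rank them in the wrong order for a long-output endpoint.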

Format is the second hidden variable, and it's worse than people think. YAML's indentation and colon syntax inflate token counts relative to the information they encode. Compact JSON (no whitespace) is usually the densest format for structured data. Markdown sits somewhere in the middle and varies wildly by content. Switching the exact same response from YAML to compact JSON can change your effective throughput by 10-30% without touching the model. If your prompt says "respond in YAML" because it looked cleaner in the docs, you are paying for that decision on every request.
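You can check the format cost on your own payloads before trusting anyone's percentage. This is a rough sketch: `compare_formats` is a hypothetical helper, the record is invented, and the default counter is UTF-8 byte length, which is only a placeholder. Pass in your model's real tokenizer as `count_tokens`, because the ranking between formats depends on it.

```python
import json
from typing import Callable, Dict

# Invented sample record; use a real response from your workload instead.
record = {
    "user": {"id": 42, "name": "Ada", "roles": ["admin", "dev"]},
    "active": True,
}

compact_json = json.dumps(record, separators=(",", ":"))
pretty_json = json.dumps(record, indent=2)
# Hand-written YAML for the same record, to avoid a third-party dependency.
yaml_text = (
    "user:\n"
    "  id: 42\n"
    "  name: Ada\n"
    "  roles:\n"
    "    - admin\n"
    "    - dev\n"
    "active: true\n"
)


def byte_length(text: str) -> int:
    """Placeholder counter. NOT a tokenizer; swap in the real one."""
    return len(text.encode("utf-8"))


def compare_formats(
    serializations: Dict[str, str],
    count_tokens: Callable[[str], int] = byte_length,
) -> Dict[str, float]:
    """Return each format's size relative to the smallest one (1.0 = smallest)."""
    counts = {name: count_tokens(text) for name, text in serializations.items()}
    smallest = min(counts.values())
    return {name: count / smallest for name, count in counts.items()}


ratios = compare_formats(
    {"compact_json": compact_json, "pretty_json": pretty_json, "yaml": yaml_text}
)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x")
```

Run it over a sample of real responses rather than one record. The ratio you get is the multiplier you pay on every request for that format choice.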

The third variable is constrained decoding. The moment you turn on a strict JSON schema, tool-call formatting, or any structured-output mode, the model's sampling step has to mask invalid tokens at every position. That mask is per-token compute overhead that scales with schema complexity. It is invisible in the throughput numbers published by labs and third-party benchmarks, because almost no public benchmark separates constrained from unconstrained generation. A model that looks 2x faster than another on an unconstrained leaderboard can be at parity or behind once you turn on a non-trivial schema.
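To make the masking step concrete, here is a toy sketch. The seven-token vocabulary, the hard-coded grammar in `allowed_next`, and the greedy decoder are all invented for illustration. Production systems compile a schema into a much larger automaton, but the per-step shape is the same: compute the allowed set, mask, then pick.

```python
import math
from typing import Callable, List, Set

# Toy vocabulary (assumption; real vocabularies have ~100k entries).
VOCAB = ["{", "}", '"name"', ":", '"Ada"', '"Bob"', "hello"]

# Hypothetical grammar: {"name": <"Ada" or "Bob">}, one token set per position.
GRAMMAR = [{"{"}, {'"name"'}, {":"}, {'"Ada"', '"Bob"'}, {"}"}]


def allowed_next(prefix: List[str]) -> Set[int]:
    """Token ids the grammar permits after `prefix` (empty set = done)."""
    if len(prefix) >= len(GRAMMAR):
        return set()
    return {VOCAB.index(tok) for tok in GRAMMAR[len(prefix)]}


def masked_softmax(logits: List[float], allowed: Set[int]) -> List[float]:
    """Softmax over allowed ids only; disallowed ids get probability 0."""
    peak = max(logits[i] for i in allowed)
    exps = [math.exp(x - peak) if i in allowed else 0.0 for i, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_greedy_decode(logits_fn: Callable[[List[str]], List[float]]) -> List[str]:
    out: List[str] = []
    while True:
        allowed = allowed_next(out)  # extra work on EVERY decoding step
        if not allowed:
            return out
        probs = masked_softmax(logits_fn(out), allowed)
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        out.append(VOCAB[best])


# A stand-in "model" that always wants to say "hello", then "Bob".
def stubborn_logits(prefix: List[str]) -> List[float]:
    return [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 10.0]


print("".join(constrained_greedy_decode(stubborn_logits)))  # {"name":"Bob"}
```

The unconstrained version of this loop has no `allowed_next` call and no mask. That difference, repeated at every token and growing with the grammar, is the overhead an unconstrained benchmark never sees.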

The architectural picture explains why no single model wins every cell of the workload matrix (short versus long output, constrained versus unconstrained decoding). Speculative decoding favors short outputs. MoE routing has a relatively fixed per-token overhead, so MoE models amortize better over long sequences and pull ahead at high token counts. Grouped-query attention reduces KV-cache pressure at large context, helping sustained throughput. These are different architectural bets, and they trade off against each other. There is no model that is the speculatively-decoded short-output king and the MoE long-output king and the constrained-decoding overhead winner. The market has not produced one and probably will not.

Which means the only benchmark that matters is the one you run on your own production-representative payload. Same prompt length distribution. Same output length distribution. Same format. Same schema constraints. Same deployment region. Run it on a sample of real traffic, log p50 and p95 for TTFT and total generation time separately, and only then pick a model. Anything else is cargo-cult evaluation.
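The reporting half of that benchmark fits in a few lines. This sketch assumes you have already collected one `(ttft_seconds, total_seconds)` pair per request from a sample of real traffic. `percentile` and `summarize` are names chosen for this example, and nearest-rank is just one of several reasonable percentile definitions.

```python
import math
from typing import Dict, Sequence, Tuple


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(runs: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """Report p50 and p95 for TTFT and total time as separate metrics."""
    ttfts = [ttft for ttft, _ in runs]
    totals = [total for _, total in runs]
    return {
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p95": percentile(ttfts, 95),
        "total_p50": percentile(totals, 50),
        "total_p95": percentile(totals, 95),
    }


# Synthetic stand-in data; replace with measurements from your own traffic.
runs = [(float(i), float(10 * i)) for i in range(1, 21)]
for name, value in summarize(runs).items():
    print(f"{name}: {value}")
```

Run it once per candidate model and once per configuration you actually ship (format, schema on or off, region), then compare the four numbers side by side instead of a single tokens/sec figure.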

The leaderboard isn't lying to you. It's answering a question you didn't ask.
