How I Measure AI API Speed: A p99 Cloud Architect Guide
I still remember the 3am page. Our chat product had been purring along for weeks, and then suddenly the SLO dashboard turned red. p99 latency had crept from 380ms to over 1.4 seconds, and nobody could figure out why. The answer turned out to be boring: we'd quietly shifted to a "smarter" reasoning model that was silently adding 800ms of internal thinking time before the first token landed on screen.
That incident rewired how I think about AI APIs. Averages lie. Marketing lies. Only p99 tells the truth about what your worst users actually feel. So I started running my own benchmarks, the way I'd benchmark any other piece of infrastructure, and what I found surprised me enough that I keep rewriting this guide every quarter.
These are my 2026 field notes. Fifteen models, two regions, ten iterations each, all routed through Global API at https://global-apis.com/v1. I'm writing this for fellow architects who care about uptime, percentile behavior, and what happens when your traffic doubles at 2am on a Tuesday.
Why Averages Are a Trap
When I sit with a product team, they almost always ask me about average latency. "What's the typical response time?" I push back gently. Averages hide tail behavior, and tail behavior is what causes churn, support tickets, and bad reviews.
For AI APIs specifically, there are two numbers I track religiously:
- TTFT (Time to First Token) — how long until the user sees the first character. This is the number that determines whether your chat app feels responsive.
- Sustained tokens/sec — how fast the rest of the stream arrives. This is the number that determines whether long-form output feels "alive."
If TTFT is over 800ms, you have a UX problem. If sustained tokens/sec drops below 20, you have a perceived quality problem. I optimise both, but I always optimise p99 of TTFT first because that's the metric the monitoring system pages me on at 3am.
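To make the two definitions concrete, here is a minimal sketch that derives both numbers from timestamps. The function name `ttft_and_throughput` and the idea of recording one monotonic-clock reading per received token are assumptions for illustration, not part of any particular SDK.

```python
from typing import Sequence, Tuple


def ttft_and_throughput(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, float]:
    """Return (ttft_ms, sustained_tok_per_sec).

    All timestamps are monotonic-clock readings in seconds,
    e.g. from time.perf_counter().
    """
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: gap between sending the request and the first token arriving
    ttft_ms = (token_times[0] - request_sent) * 1000.0
    # Sustained rate: tokens that arrived after the first one,
    # divided by the time the rest of the stream took
    stream_s = token_times[-1] - token_times[0]
    later_tokens = len(token_times) - 1
    tok_per_sec = later_tokens / stream_s if stream_s > 0 else 0.0
    return ttft_ms, tok_per_sec


# Made-up run: request at t=0.0, first token at 0.2 s, then one token every 20 ms
times = [0.2 + 0.02 * i for i in range(101)]
ttft_ms, rate = ttft_and_throughput(0.0, times)
```

Keeping TTFT out of the throughput window matters: folding the wait for the first token into the rate would make a fast-streaming model with a slow start look slower than it feels once text is flowing.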
How I Actually Measure
I won't pretend my harness is glamorous. It's a Python script with asyncio, a stopwatch class, and a CSV writer. The methodology, though, is what matters:
| Parameter | Value |
|---|---|
| Test Date | May 20, 2026 |
| Regions | US East (Ohio), Asia (Singapore) |
| Prompt | "Explain recursion in 200 words" |
| Output | ~150 tokens |
| Iterations | 10 runs per model per region |
| Streaming | Yes, SSE |
| Endpoint | https://global-apis.com/v1 |
I use the same prompt for everything because the goal isn't to benchmark model quality — that's a separate problem. I want to measure plumbing. How fast does the network handshake resolve? How quickly does the model start emitting tokens? Where does the throughput flatten out?
For each run I capture TTFT (first byte after the prompt), then a sliding-window throughput measurement for the next 150 tokens. I run ten times, throw out the highest and lowest (sorry, mean purists), and average the rest. I separately keep the worst run as my "p99-ish" indicator since ten samples isn't a real p99, but the rank orderings tend to hold up.
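The aggregation step above can be sketched in a few lines. This is an illustration rather than the harness itself: `summarize_ttft` is a hypothetical helper, and the sample values are made up.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_ttft(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Drop the single highest and lowest run, average the rest,
    and keep the worst run as a rough tail indicator."""
    if len(samples_ms) < 3:
        raise ValueError("need at least three runs to trim both ends")
    ordered = sorted(samples_ms)
    trimmed = ordered[1:-1]  # discard min and max
    return {
        "trimmed_mean_ms": mean(trimmed),
        "worst_ms": ordered[-1],  # "p99-ish": not a true p99 with ten samples
    }


# Ten made-up TTFT samples (ms) with one slow outlier
runs = [180, 175, 190, 185, 170, 410, 182, 178, 188, 181]
summary = summarize_ttft(runs)
```

With these samples the trimmed mean stays near the typical run while `worst_ms` preserves the 410 ms outlier, which is the number that would have paged someone.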
The Speed Leaderboard, Reordered
Everyone wants the medal table, so here it is, fastest to slowest. But I'm going to bias it toward what I'd actually deploy in production, so I'm grouping models into TTFT tiers that map to how I'd use them rather than presenting one flat ranking.
The Sprinters (TTFT < 250ms)
| Model | TTFT | tok/s | Provider | $/M Output |
|---|---|---|---|---|
| Step-3.5-Flash | 120ms | 80 | StepFun | $0.15 |
| Qwen3-8B | 150ms | 70 | Qwen | $0.01 |
| DeepSeek V4 Flash | 180ms | 60 | DeepSeek | $0.25 |
| Hunyuan-TurboS | 200ms | 55 | Tencent | $0.28 |
| Doubao-Seed-Lite | 220ms | 50 | ByteDance | $0.40 |
Step-3.5-Flash is the absolute champion for raw speed — 80 tokens per second is genuinely fast, the kind of throughput where text appears faster than a human can read. If your product is a chat interface where users type messages back-to-back, this is the model you want. The 120ms TTFT is borderline magical.
The Workhorses (TTFT 250–500ms)
| Model | TTFT | tok/s | Provider | $/M Output |
|---|---|---|---|---|
| Qwen3-32B | 250ms | 45 | Qwen | $0.28 |
| Hunyuan-Turbo | 280ms | 42 | Tencent | $0.57 |
| GLM-4-32B | 300ms | 38 | Zhipu | $0.56 |
| Qwen3.5-27B | 350ms | 35 | Qwen | $0.19 |
| DeepSeek V4 Pro | 400ms | 30 | DeepSeek | $0.78 |
| MiniMax M2.5 | 450ms | 28 | MiniMax | $1.15 |
This is where I spend most of my actual engineering budget. 250–500ms TTFT is the sweet spot — fast enough that users don't perceive a delay, but the models are smarter than the sprinter tier. Qwen3-32B at 45 tok/s and $0.28/M is my default for most production workloads.
The Heavy Hitters (TTFT 500ms+)
| Model | TTFT | tok/s | Provider | $/M Output |
|---|---|---|---|---|
| GLM-5 | 500ms | 25 | Zhipu | $1.92 |
| Kimi K2.5 | 600ms | 20 | Moonshot | $3.00 |
| DeepSeek-R1 | 800ms | 15 | DeepSeek | $2.50 |
| Qwen3.5-397B | 1200ms | 10 | Qwen | $2.34 |
These models prioritize reasoning over throughput. I use them in batch pipelines, not user-facing paths. The R1, K2.5, and other "thinking" models include internal reasoning time before the first visible token — that 800ms for R1 isn't the model being slow, it's the model thinking. Useful when you need it, brutal when you don't.
Pricing Tiers Through a Capacity Planner's Lens
Capacity planners think in cost-per-1k-requests, not cost-per-million-tokens. Let me translate for you, assuming a 500-token average exchange (200 prompt + 300 completion).
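As a rough sketch of that translation: the tables in this guide list output prices only, so the helper below covers just the 300 completion tokens and understates the full bill by whatever the 200 prompt tokens cost. The function name is hypothetical.

```python
def output_cost_per_1k_requests(
    price_per_m_output: float, completion_tokens: int = 300
) -> float:
    """Dollar cost of the completion side of 1,000 requests.

    Input-token pricing is NOT included because it is not listed in this guide.
    """
    tokens_per_1k_requests = completion_tokens * 1000
    return price_per_m_output * tokens_per_1k_requests / 1_000_000


# Output prices ($/M tokens) taken from the tables in this guide
prices = {"Qwen3-8B": 0.01, "Qwen3-32B": 0.28, "Kimi K2.5": 3.00}
costs = {
    model: round(output_cost_per_1k_requests(price), 4)
    for model, price in prices.items()
}
```

Under these assumptions, 1,000 requests cost about $0.003 of output on Qwen3-8B, $0.084 on Qwen3-32B, and $0.90 on Kimi K2.5.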
Ultra-budget (≤ $0.15/M output)
| Model | tok/s | $/M |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |
Qwen3-8B at $0.01/M is the model I use for spam detection, classification, summarization — anything where I burn a million tokens and barely notice. At seventy tokens per second, it's not the absolute fastest, but the price makes it dominant for high-volume back-office work. Step-3.5-Flash is the high-end of this tier and worth every cent when you want both speed and reasonable quality.
Budget ($0.15–$0.30/M)
| Model | tok/s | $/M |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |
This is where I'd put my money for 80% of user-facing features. DeepSeek V4 Flash gives you GPT-4o-class outputs at 60 tok/s for $0.25/M. That's the magic quadrant — fast, cheap, and good. Hunyuan-TurboS and Qwen3-32B round out the tier with comparable economics but slightly different output profiles.
Mid-range ($0.30–$0.80/M)
| Model | tok/s | $/M |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |
Speed drops here because the models are larger and reason more carefully per token. V4 Pro at 30 tok/s is noticeably slower than V4 Flash, but the output quality on complex reasoning tasks is materially better. I reach for these when the budget supports it and the task is hard.
Premium ($0.80+/M)
| Model | tok/s | $/M |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |
I treat these like GPUs — reserved capacity for specific high-value jobs. They earn their keep when correctness is non-negotiable and latency is secondary. Don't route your chatbot here unless you're charging enterprise prices for it.
Multi-Region Is Not Optional
The single biggest mistake I see teams make is deploying a model in one region and calling it done. AI APIs are global products with global latency profiles, and the geography matters enormously.
I tested from US East and Asia. The deltas tell the whole story:
| Model | US East TTFT | Asia TTFT | Diff |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
Asian-origin models (Qwen, GLM, Kimi) show a 16–20% reduction in TTFT when measured from Asia. DeepSeek V4 Flash has the smallest absolute gap at 30ms, although at roughly 17% of its US East TTFT the relative improvement is in the same range. For a globally deployed product, this means you absolutely need a routing layer that picks the right regional endpoint based on user location.
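For anyone who wants to check the relative figures, here is a small sketch that recomputes them from the table. The helper name is hypothetical; the TTFT values are copied from the table above.

```python
def regional_reduction_pct(us_east_ms: float, asia_ms: float) -> float:
    """Percentage drop in TTFT when measuring from Asia instead of US East."""
    return (us_east_ms - asia_ms) / us_east_ms * 100.0


# (US East TTFT, Asia TTFT) in ms, from the regional table
ttft_by_region = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}

reductions = {
    model: round(regional_reduction_pct(us, asia), 1)
    for model, (us, asia) in ttft_by_region.items()
}
```

Looking at both the absolute and the relative delta is worthwhile: a 30ms gap on a 180ms baseline is proportionally as large as an 80ms gap on a 500ms one.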
This is where Global API's https://global-apis.com/v1 endpoint earns its keep — it handles the geographic routing for me, so I don't have to maintain a region map in my application code.
SLAs and What 99.9% Means for Tokens
Here's where most engineers get confused. A 99.9% uptime SLA on an AI API is about whether you get any response. It says nothing about whether the response is fast. You can have a perfectly "up" API that returns every request in 3 seconds, and your users will still hate you.
What I want is a combined SLO:
- Availability: 99.9% (the API returns a response)
- TTFT p99: < 400ms (the response begins within 400ms 99% of the time)
- Sustained throughput p10: > 30 tok/s (at least 90% of runs sustain 30+ tokens/sec)
If any of these three degrades, I page. The first one is the provider's problem. The second and third are model selection problems, and they're why I re-run these benchmarks every quarter.
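A combined check along these lines might look like the sketch below. It assumes you already collect per-request TTFT and throughput samples for a time window; the function names and the nearest-rank percentile method are my choices for illustration, not a description of any specific monitoring product.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_breaches(
    availability: float,
    ttft_ms: Sequence[float],
    tok_per_sec: Sequence[float],
) -> List[str]:
    """Return the names of the SLO targets that are currently violated."""
    breaches = []
    if availability < 0.999:
        breaches.append("availability")
    if percentile(ttft_ms, 99) >= 400:  # target: p99 TTFT < 400 ms
        breaches.append("ttft_p99")
    if percentile(tok_per_sec, 10) <= 30:  # target: p10 throughput > 30 tok/s
        breaches.append("throughput_p10")
    return breaches


# Made-up window: 2% of requests have a 900 ms TTFT, throughput is healthy
window_ttft = [200] * 98 + [900] * 2
window_rate = [45] * 95 + [12] * 5
breached = slo_breaches(0.9995, window_ttft, window_rate)
```

In this made-up window the average TTFT is only 214ms, yet the p99 target is breached, which is exactly the kind of degradation an average-based alert would miss.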
A model that costs $0.01/M but degrades to 15 tok/s under load will eat your conversion rate. A model that costs $3.00/M but holds 25 tok/s at p99 will quietly make you money. Run the numbers.
Code: My Streaming Wrapper
Here's the actual wrapper I run in production. It handles TTFT measurement, throughput tracking, and graceful fallback to a backup model. It's deliberately simple — I want to read this in six months and still understand it.
```python
import time
import httpx
from dataclasses import dataclass

API_BASE = "https://global-apis.com/v1"


@dataclass
class StreamMetrics:
    ttft_ms: float
    total_tokens: int
    duration_s: float

    @property
    def tok_per_sec(self) -> float:
        return self.total_tokens / self.duration_s if self.duration_s > 0 else 0


async def stream_chat(
    prompt: str,
    model: str = "deepseek-v4-flash",
    fallback_model: str = "qwen
```