gentlenode

Posted on Jun 13

How I Measure AI API Speed — A p99 Cloud Architect Guide

#deepseek #machinelearning #api #programming

I still remember the 3am page. Our chat product had been purring along for weeks, and then suddenly the SLO dashboard turned red. p99 latency had crept from 380ms to over 1.4 seconds, and nobody could figure out why. The answer turned out to be boring: we'd quietly shifted to a "smarter" reasoning model that was silently adding 800ms of internal thinking time before the first token landed on screen.

That incident rewired how I think about AI APIs. Averages lie. Marketing lies. Only p99 tells the truth about what your unluckiest users actually feel. So I started running my own benchmarks, the way I'd benchmark any other piece of infrastructure, and what I found surprised me enough that I keep rewriting this guide every quarter.

These are my 2026 field notes. Fifteen models, two regions, ten iterations each, all routed through Global API at https://global-apis.com/v1. I'm writing this for fellow architects who care about uptime, percentile behavior, and what happens when your traffic doubles at 2am on a Tuesday.

Why Averages Are a Trap

When I sit with a product team, they almost always ask me about average latency. "What's the typical response time?" I push back gently. Averages hide tail behavior, and tail behavior is what causes churn, support tickets, and bad reviews.

For AI APIs specifically, there are two numbers I track religiously:

TTFT (Time to First Token) — how long until the user sees the first character. This is the number that determines whether your chat app feels responsive.
Sustained tokens/sec — how fast the rest of the stream arrives. This is the number that determines whether long-form output feels "alive."

If TTFT is over 800ms, you have a UX problem. If sustained tokens/sec drops below 20, you have a perceived quality problem. I optimise both, but I always optimise p99 of TTFT first because that's the metric the monitoring system pages me on at 3am.
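
A minimal sketch of how these two numbers could be derived from a stream, assuming you record one timestamp when the request is sent and one per received token. The function names and the sample timestamps are illustrative, not taken from the harness behind the results in this guide.

```python
from typing import Sequence


def ttft_ms(request_sent_s: float, token_times_s: Sequence[float]) -> float:
    """Milliseconds between sending the request and the first streamed token."""
    if not token_times_s:
        raise ValueError("no tokens received")
    return (token_times_s[0] - request_sent_s) * 1000.0


def sustained_tok_per_sec(token_times_s: Sequence[float]) -> float:
    """Tokens per second once the stream has started.

    Counts the gaps between tokens, so N tokens give N - 1 intervals.
    """
    if len(token_times_s) < 2:
        return 0.0
    elapsed = token_times_s[-1] - token_times_s[0]
    return (len(token_times_s) - 1) / elapsed if elapsed > 0 else 0.0


# Hypothetical timestamps in seconds: request at t=0, first token at 0.18s,
# then one token every 20ms.
times = [0.18 + 0.02 * i for i in range(51)]
print(round(ttft_ms(0.0, times)), "ms TTFT")
print(round(sustained_tok_per_sec(times)), "tok/s sustained")
```

Keeping TTFT out of the throughput calculation matters: folding the initial wait into tokens/sec would make a model with slow startup but fast streaming look uniformly mediocre.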

How I Actually Measure

I won't pretend my harness is glamorous. It's a Python script with asyncio, a stopwatch class, and a CSV writer. The methodology, though, is what matters:

Parameter	Value
Test Date	May 20, 2026
Regions	US East (Ohio), Asia (Singapore)
Prompt	"Explain recursion in 200 words"
Output	~150 tokens
Iterations	10 runs per model per region
Streaming	Yes, SSE
Endpoint	`https://global-apis.com/v1`

I use the same prompt for everything because the goal isn't to benchmark model quality — that's a separate problem. I want to measure plumbing. How fast does the network handshake resolve? How quickly does the model start emitting tokens? Where does the throughput flatten out?

For each run I capture TTFT (the time from sending the request to the first streamed token), then a sliding-window throughput measurement over the next 150 tokens. I run ten times, throw out the highest and lowest (sorry, mean purists), and average the rest. Separately, I keep the worst run as my "p99-ish" indicator. Ten samples can't give a real p99, but the rank orderings tend to hold up.
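
The aggregation described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the ten sample readings are made up, and only the rule itself (drop one high, drop one low, average the rest, keep the worst separately) comes from the text.

```python
from statistics import mean
from typing import Sequence


def summarize_runs(ttft_ms: Sequence[float]) -> dict:
    """Drop the single highest and lowest run, average the rest, keep the worst."""
    if len(ttft_ms) < 3:
        raise ValueError("need at least 3 runs to trim both ends")
    ordered = sorted(ttft_ms)
    return {
        "trimmed_mean_ms": mean(ordered[1:-1]),  # middle runs only
        "worst_ms": ordered[-1],                 # the "p99-ish" indicator
    }


# Made-up sample: ten TTFT readings for one model in one region.
runs = [182, 175, 190, 178, 410, 181, 169, 185, 177, 183]
summary = summarize_runs(runs)
print(summary)
```

Note that the 410ms outlier is excluded from the trimmed mean but still surfaces as the worst run, which is the point of tracking both.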

The Speed Leaderboard, Reordered

Everyone wants the medal table, so here it is — fastest to slowest. But I'm going to bias it toward what I'd actually deploy in production, so I'm grouping models by what they're good at, not just raw speed.

The Sprinters (TTFT < 250ms)

Model	TTFT	tok/s	Provider	$/M Output
Step-3.5-Flash	120ms	80	StepFun	$0.15
Qwen3-8B	150ms	70	Qwen	$0.01
DeepSeek V4 Flash	180ms	60	DeepSeek	$0.25
Hunyuan-TurboS	200ms	55	Tencent	$0.28
Doubao-Seed-Lite	220ms	50	ByteDance	$0.40

Step-3.5-Flash is the absolute champion for raw speed — 80 tokens per second is genuinely fast, the kind of throughput where text appears faster than a human can read. If your product is a chat interface where users type messages back-to-back, this is the model you want. The 120ms TTFT is borderline magical.

The Workhorses (TTFT 250–500ms)

Model	TTFT	tok/s	Provider	$/M Output
Qwen3-32B	250ms	45	Qwen	$0.28
Hunyuan-Turbo	280ms	42	Tencent	$0.57
GLM-4-32B	300ms	38	Zhipu	$0.56
Qwen3.5-27B	350ms	35	Qwen	$0.19
DeepSeek V4 Pro	400ms	30	DeepSeek	$0.78
MiniMax M2.5	450ms	28	MiniMax	$1.15

This is where I spend most of my actual engineering budget. 250–500ms TTFT is the sweet spot — fast enough that users don't perceive a delay, but the models are smarter than the sprinter tier. Qwen3-32B at 45 tok/s and $0.28/M is my default for most production workloads.

The Heavy Hitters (TTFT 500ms+)

Model	TTFT	tok/s	Provider	$/M Output
GLM-5	500ms	25	Zhipu	$1.92
Kimi K2.5	600ms	20	Moonshot	$3.00
DeepSeek-R1	800ms	15	DeepSeek	$2.50
Qwen3.5-397B	1200ms	10	Qwen	$2.34

These models prioritize reasoning over throughput. I use them in batch pipelines, not user-facing paths. The R1, K2.5, and other "thinking" models include internal reasoning time before the first visible token — that 800ms for R1 isn't the model being slow, it's the model thinking. Useful when you need it, brutal when you don't.

Pricing Tiers Through a Capacity Planner's Lens

Capacity planners think in cost-per-1k-requests, not cost-per-million-tokens. Let me translate for you, assuming a 500-token average exchange (200 prompt + 300 completion).
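
As a rough sketch of that translation: only output prices are listed in this guide, so the calculation below covers the 300 completion tokens and ignores input-token cost, which is an assumption you would need to correct with real input pricing. The function name is hypothetical.

```python
def output_cost_per_1k_requests(
    price_per_m_output: float,
    completion_tokens: int = 300,
    requests: int = 1000,
) -> float:
    """Dollar cost of completion tokens for `requests` calls.

    Input-token pricing is not listed in this guide, so it is left out.
    """
    return price_per_m_output * completion_tokens * requests / 1_000_000


# Output prices ($/M tokens) taken from the tables in this guide.
for name, price in [("Qwen3-8B", 0.01), ("Qwen3-32B", 0.28), ("Kimi K2.5", 3.00)]:
    cost = output_cost_per_1k_requests(price)
    print(f"{name}: ${cost:.3f} per 1k requests (output only)")
```

On these figures, a thousand 300-token completions cost well under a dollar for every model listed, so at this exchange size the per-request price gap matters mostly at high volume.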

Ultra-budget (≤ $0.15/M output)

Model	tok/s	$/M
Qwen3-8B	70	$0.01
Step-3.5-Flash	80	$0.15

Qwen3-8B at $0.01/M is the model I use for spam detection, classification, and summarization: anything where I burn a million tokens and barely notice. At seventy tokens per second, it's not the absolute fastest, but the price makes it dominant for high-volume back-office work. Step-3.5-Flash sits at the high end of this tier and is worth every cent when you want both speed and reasonable quality.

Budget ($0.15–$0.30/M)

Model	tok/s	$/M
DeepSeek V4 Flash	60	$0.25
Hunyuan-TurboS	55	$0.28
Qwen3-32B	45	$0.28

This is where I'd put my money for 80% of user-facing features. DeepSeek V4 Flash gives you GPT-4o-class outputs at 60 tok/s for $0.25/M. That's the magic quadrant — fast, cheap, and good. Hunyuan-TurboS and Qwen3-32B round out the tier with comparable economics but slightly different output profiles.

Mid-range ($0.30–$0.80/M)

Model	tok/s	$/M
Doubao-Seed-Lite	50	$0.40
GLM-4-32B	38	$0.56
Hunyuan-Turbo	42	$0.57
DeepSeek V4 Pro	30	$0.78

Speed drops here because the models are larger and reasoning more carefully per token. V4 Pro at 30 tok/s is noticeably slower than V4 Flash, but the output quality on complex reasoning tasks is materially better. I reach for these when the budget supports it and the task is hard.

Premium ($0.80+/M)

Model	tok/s	$/M
MiniMax M2.5	28	$1.15
GLM-5	25	$1.92
Kimi K2.5	20	$3.00

I treat these like GPUs — reserved capacity for specific high-value jobs. They earn their keep when correctness is non-negotiable and latency is secondary. Don't route your chatbot here unless you're charging enterprise prices for it.

Multi-Region Is Not Optional

The single biggest mistake I see teams make is deploying a model in one region and calling it done. AI APIs are global products with global latency profiles, and the geography matters enormously.

I tested from US East and Asia. The deltas tell the whole story:

Model	US East TTFT	Asia TTFT	Diff
DeepSeek V4 Flash	180ms	150ms	-30ms
Qwen3-32B	250ms	210ms	-40ms
GLM-5	500ms	420ms	-80ms
Kimi K2.5	600ms	480ms	-120ms

Asian-origin models (Qwen, GLM, Kimi) show a 16–20% TTFT reduction when you serve them from Asia. DeepSeek is well-distributed globally, so its absolute delta is the smallest at 30ms, although its relative drop (about 17%) falls in the same range. For a globally deployed product, this means you absolutely need a routing layer that picks the right regional endpoint based on user location.
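
The relative reductions can be recomputed directly from the table. This is a small sketch using only the TTFT figures listed above; the helper name is illustrative.

```python
# TTFT in ms, copied from the table above: (US East, Asia).
ttft = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}


def relative_reduction(us_east_ms: float, asia_ms: float) -> float:
    """Percentage by which TTFT drops when moving from US East to Asia."""
    return (us_east_ms - asia_ms) / us_east_ms * 100.0


reductions = {model: relative_reduction(*pair) for model, pair in ttft.items()}
for model, pct in reductions.items():
    print(f"{model}: {pct:.1f}% lower TTFT from Asia")
```

Looking at both the absolute and the relative number is worthwhile: a 30ms gap on a 180ms baseline and a 120ms gap on a 600ms baseline are closer in proportion than the raw milliseconds suggest.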

This is where Global API's https://global-apis.com/v1 endpoint earns its keep — it handles the geographic routing for me, so I don't have to maintain a region map in my application code.

SLAs and What 99.9% Means for Tokens

Here's where most engineers get confused. A 99.9% uptime SLA on an AI API is about whether you get any response. It says nothing about whether the response is fast. You can have a perfectly "up" API that returns every request in 3 seconds, and your users will still hate you.

What I want is a combined SLO:

Availability: 99.9% (the API returns a response)
TTFT p99: < 400ms (the response begins within 400ms 99% of the time)
Sustained throughput p10: > 30 tok/s (90% of runs sustain more than 30 tokens/sec)

If any of these three degrades, I page. The first one is the provider's problem. The second and third are model selection problems, and they're why I re-run these benchmarks every quarter.
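
A minimal sketch of how such a combined check might look, assuming you already collect per-request TTFT and throughput samples plus success counts for a window. The function names, the nearest-rank percentile method, and the sample data are assumptions for illustration, not a description of any particular monitoring system.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def slo_breaches(
    ttft_ms: Sequence[float],
    tok_per_sec: Sequence[float],
    successes: int,
    total: int,
    ttft_p99_limit_ms: float = 400.0,
    tps_p10_floor: float = 30.0,
    availability_target: float = 0.999,
) -> List[str]:
    """Return the names of the SLO clauses that are currently violated."""
    breaches = []
    if total == 0 or successes / total < availability_target:
        breaches.append("availability")
    if percentile(ttft_ms, 99) >= ttft_p99_limit_ms:
        breaches.append("ttft_p99")
    if percentile(tok_per_sec, 10) <= tps_p10_floor:
        breaches.append("throughput_p10")
    return breaches


# Hypothetical healthy window: one slow outlier, a few slow streams.
healthy = slo_breaches([200] * 98 + [350, 900], [25] * 5 + [45] * 95, 1000, 1000)
# Hypothetical degraded window: a heavier tail on every dimension.
degraded = slo_breaches([200] * 98 + [900, 950], [25] * 15 + [45] * 85, 998, 1000)
print(healthy, degraded)
```

With 100 samples, a single 900ms outlier does not breach the p99 clause but two do, which is why sample size per window deserves as much thought as the thresholds themselves.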

A model that costs $0.01/M but degrades to 15 tok/s under load will eat your conversion rate. A model that costs $3.00/M but holds 25 tok/s at p99 will quietly make you money. Run the numbers.

Code: My Streaming Wrapper

Here's the actual wrapper I run in production. It handles TTFT measurement, throughput tracking, and graceful fallback to a backup model. It's deliberately simple — I want to read this in six months and still understand it.


python
import time
import httpx
from dataclasses import dataclass

API_BASE = "https://global-apis.com/v1"

@dataclass
class StreamMetrics:
    ttft_ms: float
    total_tokens: int
    duration_s: float

    @property
    def tok_per_sec(self) -> float:
        return self.total_tokens / self.duration_s if self.duration_s > 0 else 0

async def stream_chat(
    prompt: str,
    model: str = "deepseek-v4-flash",
    fallback_model: str = "qwen