I Benchmarked 15 AI Models for Speed – Here's What Will Blow Your Mind
So I’m building this little chat app. Nothing fancy, just a side project that might turn into something real. And you know what? Speed was killing me. Not the code – the latency. Every time a user hit send and waited, I could see them bouncing. 100ms extra? Bye bye user. I needed to KNOW which models were actually fast, not just the ones with flashy marketing.
So I did what any indie hacker would do: I spent a weekend benchmarking 15 different models through Global API. I ran them from two regions and measured Time to First Token (TTFT) and tokens per second. And I’m sharing it all here, raw and unfiltered.
TL;DR: DeepSeek V4 Flash is the all-round beast (~60 tok/s, ~180ms TTFT). Step-3.5-Flash is the speed demon at ~80 tok/s. And if you're broke but need speed? Qwen3-8B – $0.01/M output and 70 tok/s. I'm not even joking.
How I Ran the Tests
I wanted real-world results, not some synthetic benchmark. So I used a simple prompt: "Explain recursion in 200 words." Streamed via SSE. Each model got 10 runs, averaged. Here's the setup:
| Parameter | Value |
|---|---|
| Test Date | May 20, 2026 |
| Test Regions | US East (Ohio) and Asia (Singapore) |
| Prompt | "Explain recursion in 200 words" |
| Output Tokens | ~150 per run |
| Iterations | 10, averaged |
| Streaming | Yes (SSE) |
| API | Global API (https://global-apis.com/v1) |
Here's the Python code I used to measure – feel free to steal it:
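```python
import json
import time

import requests


def benchmark_model(model, api_key):
    url = "https://global-apis.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words."}],
        "stream": True,
    }

    start = time.perf_counter()
    response = requests.post(url, json=payload, headers=headers, stream=True, timeout=60)
    response.raise_for_status()

    first_token_time = None
    chunks = 0
    for line in response.iter_lines():
        # SSE data lines look like b"data: {...}"; the stream ends with b"data: [DONE]"
        if not line.startswith(b"data: ") or line[6:] == b"[DONE]":
            continue
        # Assumes OpenAI-style chunks: choices[0].delta.content holds the new text
        choices = json.loads(line[6:]).get("choices") or []
        if not choices or not (choices[0].get("delta") or {}).get("content"):
            continue  # skip role-only or empty chunks
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError(f"{model} returned no content")

    ttft = (first_token_time - start) * 1000  # ms
    # Each content chunk is counted as roughly one token (an approximation).
    # The rate is measured after the first token so TTFT isn't counted twice.
    tok_per_sec = (chunks - 1) / (end - first_token_time) if chunks > 1 else 0.0
    return ttft, tok_per_sec


api_key = "your-api-key-here"
# Example: DeepSeek V4 Flash
ttft, tps = benchmark_model("deepseek-v4-flash", api_key)
print(f"TTFT: {ttft:.0f}ms | Tokens/s: {tps:.1f}")
```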
Super simple. You can drop in any model name from Global API's list.
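The setup table says each model got 10 runs, averaged, so here's a rough sketch of how that averaging could be wrapped around a single-run measurement. The helper name `average_runs` and the fake samples are my own placeholders for illustration; the fake samples only exist so the snippet runs without an API key.

```python
import statistics
from typing import Callable, Tuple


def average_runs(measure: Callable[[], Tuple[float, float]], runs: int = 10) -> Tuple[float, float]:
    """Call `measure` `runs` times and return (mean TTFT in ms, mean tok/s)."""
    results = [measure() for _ in range(runs)]
    mean_ttft = statistics.mean(r[0] for r in results)
    mean_tps = statistics.mean(r[1] for r in results)
    return mean_ttft, mean_tps


# Stand-in measurements so the sketch runs offline (made-up numbers).
# For a real run, pass: lambda: benchmark_model("deepseek-v4-flash", api_key)
fake_samples = iter([(170.0, 62.0), (190.0, 58.0), (180.0, 60.0)])
avg_ttft, avg_tps = average_runs(lambda: next(fake_samples), runs=3)
print(f"TTFT: {avg_ttft:.0f}ms | Tokens/s: {avg_tps:.1f}")
```

A plain mean is the simplest choice; if one run hits a cold start, a median (`statistics.median`) would be less sensitive to that outlier.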
The Speed Ranking (Fastest to Slowest)
Honestly, I was shocked by some of these results. I’ve put them in a table because I’m a data nerd, but I’ll break it down after.
| Rank | Model | TTFT (ms) | Tokens/sec | Provider | $/M Output |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 2 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 3 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 4 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 5 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 6 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |
Note: Reasoning models (R1, K2.5) include internal thinking time before first token – that's why TTFT is high. But they're smart.
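TTFT and tokens/sec can be combined into a rough "how long until the whole reply is on screen" number. This is a back-of-the-envelope sketch, assuming tok/s is the generation rate after the first token and that a reply is about 150 tokens (the output size from the setup table). The function name `response_time_s` is just something I made up for the example.

```python
def response_time_s(ttft_ms: float, tok_per_sec: float, output_tokens: int = 150) -> float:
    """Rough wall-clock seconds for a full reply: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tok_per_sec


# (TTFT ms, tok/s) copied from the ranking table
ranking_sample = {
    "Step-3.5-Flash": (120, 80),
    "DeepSeek V4 Flash": (180, 60),
    "Kimi K2.5": (600, 20),
    "Qwen3.5-397B": (1200, 10),
}

for name, (ttft_ms, tps) in ranking_sample.items():
    print(f"{name}: ~{response_time_s(ttft_ms, tps):.1f}s for 150 tokens")
```

For short replies TTFT dominates; for long ones the tok/s term takes over, which is why both columns matter.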
Speed by Price Tier
Because let's be real – as an indie hacker, I care about speed AND cost. Can't be spending $3 per million tokens on a hobby project.
Ultra-Budget (≤ $0.15/M)
| Model | tok/s | $/M |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |
Qwen3-8B is INSANE. 70 tok/s at literally ONE CENT per million output tokens. For simple stuff – summarization, classification, chatbots that don't need deep reasoning – it's unbeatable. Step-3.5-Flash is the speed king at 80 tok/s, and only $0.15/M. Worth it if you need low latency.
Budget ($0.15–$0.30/M)
| Model | tok/s | $/M |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |
This is the sweet spot. DeepSeek V4 Flash is my go-to: 60 tok/s, 180ms TTFT, and quality that, in my experience, rivals GPT-4o. For $0.25/M. I mean... just use it.
Mid-Range ($0.30–$0.80/M)
| Model | tok/s | $/M |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |
These are bigger models, so speed drops. DeepSeek V4 Pro at 30 tok/s is still decent, but you pay more for quality. Honestly, unless you need the extra reasoning, stick with V4 Flash.
Premium ($0.80+/M)
| Model | tok/s | $/M |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |
These are for when correctness > speed. Like legal document analysis or code generation where a mistake costs you. But at 20 tok/s? Your users will feel it. Use only if you have to.
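To see what the $/M column means for an actual bill, here's a small sketch that turns it into dollars. It only counts output tokens (input pricing isn't in my tables), assumes ~150 tokens per reply as in the test setup, and the 100,000 replies figure is a hypothetical volume, not something I measured.

```python
def output_cost_usd(price_per_million: float, replies: int, tokens_per_reply: int = 150) -> float:
    """Output-token cost in USD for `replies` responses of `tokens_per_reply` tokens each."""
    return price_per_million * replies * tokens_per_reply / 1_000_000


# $/M output prices copied from the tier tables
prices = {
    "Qwen3-8B": 0.01,
    "DeepSeek V4 Flash": 0.25,
    "Kimi K2.5": 3.00,
}

replies = 100_000  # hypothetical volume
for name, price in prices.items():
    print(f"{name}: ${output_cost_usd(price, replies):.2f} for {replies:,} replies")
```

At that volume the gap between tiers is the difference between pocket change and a real line item, which is why I group by price before picking on speed.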
Geographic Latency: Where You Run Matters
I tested from two regions to see the network impact. You'd be surprised how much server location matters.
| Model | US East TTFT | Asia TTFT | Diff (Asia vs. US East) |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
Asian models like Qwen and Kimi are 16-20% faster from Asia. Obvious, right? DeepSeek V4 Flash has the smallest absolute gap (30ms), but in relative terms that's still about 17%, so region matters for it too. If your users are in Asia, consider Qwen3 models or DeepSeek.
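Since the models start from very different baselines, it helps to express the regional gap as a percentage of the US East number. A minimal sketch using the four rows from the table above (the function name `relative_change_pct` is my own):

```python
def relative_change_pct(us_east_ms: float, asia_ms: float) -> float:
    """Percent change in TTFT when calling from Asia instead of US East (negative = faster)."""
    return (asia_ms - us_east_ms) / us_east_ms * 100


# (US East TTFT ms, Asia TTFT ms) copied from the table
regional_ttft = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}

for name, (us_east, asia) in regional_ttft.items():
    print(f"{name}: {asia - us_east:+d}ms ({relative_change_pct(us_east, asia):+.1f}%)")
```

Looking at both the millisecond and percentage views avoids reading too much into a small absolute number on an already-fast model.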
Real-World Impact: TTFT Tells the Story
I built a little chart for myself (sharing it here because why not):
| TTFT | User Perception |
|---|---|
| < 200ms | "Instant" – users stay |
| 200–400ms | "Fast" – acceptable |
| 400–800ms | "Noticeable delay" – you lose some |
| 800ms+ | "Slow" – you lose everyone |
My recommendation: Keep TTFT under 400ms for interactive chat. Use DeepSeek V4 Flash (180ms), Qwen3-8B (150ms), or Step-3.5-Flash (120ms). Your users will thank you.
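If you want to tag your own measurements with these buckets, something like this would do. The table doesn't say which bucket the exact boundaries (200, 400, 800) fall into, so this sketch assumes each boundary belongs to the slower bucket; `perceived_speed` is a made-up helper name.

```python
def perceived_speed(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to the perception buckets from the table."""
    if ttft_ms < 200:
        return "Instant"
    if ttft_ms < 400:
        return "Fast"
    if ttft_ms < 800:
        return "Noticeable delay"
    return "Slow"


# TTFT values copied from the ranking table
for name, ttft_ms in [("Step-3.5-Flash", 120), ("Qwen3-32B", 250), ("Kimi K2.5", 600), ("Qwen3.5-397B", 1200)]:
    print(f"{name}: {ttft_ms}ms -> {perceived_speed(ttft_ms)}")
```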
The Bottom Line
If you're an indie hacker like me, don't overthink this. For most use cases:
- Need speed + quality? DeepSeek V4 Flash. (60 tok/s, $0.25/M)
- Need raw speed on a budget? Qwen3-8B. (70 tok/s, $0.01/M)
- Need to flex with the fastest? Step-3.5-Flash. (80 tok/s, $0.15/M)
- Building a reasoning app? Accept slower TTFT – R1 or K2.5.
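The same decision can be written as a tiny filter: set a TTFT ceiling and a price ceiling, then take the highest tok/s that fits. This is only a sketch over a handful of rows from my ranking table; `Model` and `pick_fastest` are names I invented for the example, and the thresholds in the usage lines are arbitrary.

```python
from typing import List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    ttft_ms: int
    tok_per_sec: int
    usd_per_m_output: float


# A few rows copied from the ranking table
MODELS = [
    Model("Step-3.5-Flash", 120, 80, 0.15),
    Model("Qwen3-8B", 150, 70, 0.01),
    Model("DeepSeek V4 Flash", 180, 60, 0.25),
    Model("DeepSeek V4 Pro", 400, 30, 0.78),
    Model("Kimi K2.5", 600, 20, 3.00),
]


def pick_fastest(models: List[Model], max_ttft_ms: float, max_usd_per_m: float) -> Optional[Model]:
    """Return the highest tok/s model within both limits, or None if nothing fits."""
    candidates = [
        m for m in models
        if m.ttft_ms <= max_ttft_ms and m.usd_per_m_output <= max_usd_per_m
    ]
    return max(candidates, key=lambda m: m.tok_per_sec, default=None)


print(pick_fastest(MODELS, max_ttft_ms=400, max_usd_per_m=0.10))
print(pick_fastest(MODELS, max_ttft_ms=400, max_usd_per_m=1.00))
```

Speed and price are the only inputs here; quality isn't in the table, so a reasoning-heavy app would still need its own judgment call on top.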
I ran all of these tests through Global API – they give you a single endpoint (https://global-apis.com/v1) and you swap model names. Super easy. If you want to run these benchmarks yourself (and you should, because your use case might differ), grab an API key.