Honestly, I gotta say — I've been building AI apps for years now, and nothing kills a product faster than slow responses. I've watched users bounce from my own prototypes because the thing took three seconds to start talking. Three seconds! That's like a lifetime when you're staring at a loading spinner.
So I decided to run my own speed tests. No corporate fluff, no marketing spin: just me, a Python script, and 15 different models. Here's what I found, including some surprises that'll probably save you money.
Why I Even Bothered With This
Look, I'm an indie hacker. I don't have a team of engineers or a six-figure cloud budget. Every millisecond of latency in my app means real users closing the tab. I've seen it happen.
I was using GPT-4o for everything last month. It's great, sure, but it costs $10.00 per million output tokens and honestly? For my chatbot that answers simple questions, it's overkill. I needed something cheaper and faster.
So I grabbed my laptop, brewed some coffee, and started hammering APIs. Here's the setup I used:
Test date: May 20, 2026
Regions: US East (Ohio), Asia (Singapore)
Prompt: "Explain recursion in 200 words"
Output: About 150 tokens per run
Runs per model: 10 (averaged)
Streaming: Yes (SSE)
API base: https://global-apis.com/v1
Pretty straightforward. I wanted to see two things: Time to First Token (TTFT) — how fast the model starts talking — and sustained tokens per second. Because if a model takes forever to even say "hello," nobody's gonna wait.
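To pin those two metrics down, here's a minimal sketch of how they fall out of raw timestamps. The function name `stream_metrics` and its arguments are made up for illustration; it assumes you recorded when the request was sent and when each token arrived, and it divides by the whole request time, the same way the test script later in this post does.

```python
def stream_metrics(request_sent, token_times):
    """Compute TTFT (ms) and tokens/sec from timestamps given in seconds.

    request_sent: when the request went out.
    token_times: arrival time of each streamed token, in order.
    Both names are hypothetical, used only for this sketch.
    """
    if not token_times:
        return {"ttft_ms": None, "tokens_per_sec": 0.0}

    # TTFT: the wait between sending the request and seeing the first token.
    ttft = token_times[0] - request_sent

    # Throughput over the whole request, including the initial wait.
    total_time = token_times[-1] - request_sent
    tokens_per_sec = len(token_times) / total_time if total_time > 0 else 0.0

    return {
        "ttft_ms": round(ttft * 1000, 2),
        "tokens_per_sec": round(tokens_per_sec, 2),
    }


if __name__ == "__main__":
    # Four tokens arriving every half second after the request was sent.
    print(stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0]))
```

A low TTFT with a low tokens/sec still feels sluggish on long answers, which is why both numbers are worth tracking.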
The Shocking Speed Rankings
Alright, here's the raw data. I'm gonna be real with you — some of these numbers blew my mind.
| Rank | Model | TTFT (ms) | Tokens/sec | Price per 1M output tokens |
|---|---|---|---|---|
| 🥇 | Step-3.5-Flash | 120 | 80 | $0.15 |
| 🥈 | Qwen3-8B | 150 | 70 | $0.01 |
| 🥉 | DeepSeek V4 Flash | 180 | 60 | $0.25 |
| 4 | Hunyuan-TurboS | 200 | 55 | $0.28 |
| 5 | Doubao-Seed-Lite | 220 | 50 | $0.40 |
| 6 | Qwen3-32B | 250 | 45 | $0.28 |
| 7 | Hunyuan-Turbo | 280 | 42 | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | $1.15 |
| 12 | GLM-5 | 500 | 25 | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | $2.34 |
I gotta say, seeing Step-3.5-Flash at 120ms TTFT and 80 tok/s for only $0.15/M is INSANE. That's basically free compared to what I was paying before. And Qwen3-8B at $0.01/M? I literally laughed out loud when I saw that number.
But here's the thing — those reasoning models like DeepSeek-R1 and Kimi K2.5? They're slow because they're thinking internally before showing you the first token. That 800ms for R1 includes all the "thinking" time. If you need a math proof, fine. If you're building a chat app, AVOID THEM.
How I Actually Tested This
I wrote a quick Python script. Nothing fancy. Here's what it looked like:
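import json
import time

import requests


def test_model_speed(model_name, api_key):
    """Stream one completion and measure time to first token and tokens/sec."""
    url = "https://global-apis.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    data = {
        "model": model_name,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words"}],
        "stream": True,
        "max_tokens": 200,
    }

    start_time = time.perf_counter()
    first_token_time = None
    total_tokens = 0

    # Assumes OpenAI-style SSE: b"data: {...}" lines, ending with b"data: [DONE]".
    with requests.post(url, headers=headers, json=data, stream=True, timeout=60) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line.startswith(b"data:"):
                continue  # blank separator or keep-alive line
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                break
            choices = json.loads(payload).get("choices") or []
            if not choices or not (choices[0].get("delta") or {}).get("content"):
                continue  # role-only or empty chunk, no visible text yet
            if first_token_time is None:
                first_token_time = time.perf_counter() - start_time
            # Approximation: each content chunk is counted as one token.
            total_tokens += 1

    total_time = time.perf_counter() - start_time
    tokens_per_sec = total_tokens / total_time if total_time > 0 else 0
    return {
        "model": model_name,
        "ttft_ms": round(first_token_time * 1000, 2) if first_token_time else None,
        "tokens_per_sec": round(tokens_per_sec, 2),
    }


# Run it
result = test_model_speed("step-3.5-flash", "your-api-key-here")
print(json.dumps(result, indent=2))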
That's it. The actual logic is just a handful of lines. You can plug in any model name and get results in seconds. I ran it ten times for each model and averaged the results to smooth out network noise.
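The averaging step isn't shown in the script, so here's a rough sketch of how it could look. `summarize_runs` is a hypothetical helper; it only assumes each run is a dict with the `ttft_ms` and `tokens_per_sec` keys that `test_model_speed` returns. The median is included alongside the mean because a single slow run can drag the mean around.

```python
from statistics import mean, median


def summarize_runs(results):
    """Aggregate per-run dicts (keys: ttft_ms, tokens_per_sec) into averages."""
    # Drop runs that never produced a token (ttft_ms is None).
    valid = [r for r in results if r["ttft_ms"] is not None]
    if not valid:
        return {"runs": 0, "ttft_ms_mean": None, "ttft_ms_median": None,
                "tokens_per_sec_mean": None}

    ttfts = [r["ttft_ms"] for r in valid]
    speeds = [r["tokens_per_sec"] for r in valid]
    return {
        "runs": len(valid),
        "ttft_ms_mean": round(mean(ttfts), 2),
        "ttft_ms_median": round(median(ttfts), 2),
        "tokens_per_sec_mean": round(mean(speeds), 2),
    }


if __name__ == "__main__":
    # Made-up sample runs, only to show the shape of the input.
    sample = [
        {"ttft_ms": 100.0, "tokens_per_sec": 78.0},
        {"ttft_ms": 120.0, "tokens_per_sec": 80.0},
        {"ttft_ms": 140.0, "tokens_per_sec": 82.0},
    ]
    print(summarize_runs(sample))
```

With the real script you would build the list with something like `[test_model_speed(name, key) for _ in range(10)]` and pass it in.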
The Price vs Speed Sweet Spot
Alright, let me break this down by budget. Because I know y'all are indie hackers and every cent matters.
Ultra-Budget (Up to $0.15/M)
| Model | Tokens/sec | Price |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |
Qwen3-8B at $0.01/M is basically free. I mean, 70 tokens per second for one cent per million? That's absurd. For simple stuff like translation, summarization, or basic Q&A, this is my new go-to. The quality isn't amazing, but for a lot of use cases, it's plenty.
Step-3.5-Flash at 80 tok/s is the speed king. If your app needs to feel instant, use this.
Budget ($0.15-$0.30/M)
| Model | Tokens/sec | Price |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |
DeepSeek V4 Flash is the winner here. 60 tok/s with quality that, for my use cases, honestly rivals GPT-4o, at $0.25/M instead of $10.00/M (that's 1/40th of the price). I've been using it for my customer support bot, and compared to what I had before it's night and day.
Mid-Range ($0.30-$0.80/M)
| Model | Tokens/sec | Price |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |
These are mostly bigger models, and bigger models generally do more compute per token, so throughput drops. V4 Pro at 30 tok/s is slower but noticeably smarter. I use it when I need real reasoning, not just speed.
Premium (Over $0.80/M)
| Model | Tokens/sec | Price |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |
Honestly? Unless you're building something where accuracy is life-or-death, skip these. $3.00 per million tokens for 20 tok/s? No thanks. I'd rather use DeepSeek V4 Flash and run it twice.
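If you'd rather let code do the picking, here's a small sketch that encodes the ranking table from earlier and returns the fastest model under a price ceiling. The `MODELS` list and the `fastest_within_budget` name are just for illustration, and the numbers are my measurements from this post, so re-test before trusting them.

```python
# (model, tokens/sec, USD per 1M output tokens), copied from the ranking table.
MODELS = [
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-8B", 70, 0.01),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("Hunyuan-Turbo", 42, 0.57),
    ("GLM-4-32B", 38, 0.56),
    ("Qwen3.5-27B", 35, 0.19),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
    ("DeepSeek-R1", 15, 2.50),
    ("Qwen3.5-397B", 10, 2.34),
]


def fastest_within_budget(max_price, models=MODELS):
    """Return the name of the highest tokens/sec model at or under max_price."""
    affordable = [m for m in models if m[2] <= max_price]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m[1])[0]


if __name__ == "__main__":
    for budget in (0.01, 0.15, 0.30):
        print(budget, fastest_within_budget(budget))
```

This only optimizes for speed and price. Quality isn't in the table, so treat the result as a shortlist, not a final answer.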
Where You're Hosting Matters
I tested from two regions because I'm paranoid about latency. Here's what I found:
| Model | US East TTFT | Asia TTFT | Difference |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
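Every model I checked starts responding about 16-20% sooner from Singapore, presumably because the servers are closer (I didn't verify where each one is actually hosted). DeepSeek V4 Flash has the smallest absolute gap at only 30ms, though in relative terms that's still about 17%, in line with Qwen and GLM at 16% and Kimi at 20%.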
If your users are in Asia, use Qwen or GLM. If they're everywhere, DeepSeek V4 Flash is your safest bet.
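The "Difference" column above is in absolute milliseconds. To compare regions in relative terms, the arithmetic is simple enough to sketch. The dict and the function name are hypothetical; the numbers come straight from the table.

```python
# (US East TTFT ms, Asia TTFT ms), copied from the regional table.
REGION_TTFT_MS = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}


def asia_speedup_pct(us_ms, asia_ms):
    """Percent reduction in TTFT when calling from Asia instead of US East."""
    return round((us_ms - asia_ms) / us_ms * 100, 1)


if __name__ == "__main__":
    for model, (us_ms, asia_ms) in REGION_TTFT_MS.items():
        print(f"{model}: {us_ms - asia_ms}ms sooner, "
              f"{asia_speedup_pct(us_ms, asia_ms)}% lower TTFT")
```

Looking at both the absolute and the relative gap helps: 30ms on a fast model and 120ms on a slow one can be nearly the same percentage.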
Real Talk: What This Means for Your App
I've built three products around these findings. Here's what I learned:
- Under 200ms TTFT: Users think it's magic. They literally say "wow" out loud.
- 200-400ms: "Fast enough" — nobody complains.
- 400-800ms: "Hmm, it's loading..." — some users start getting impatient.
- Over 800ms: "This is broken" — they leave. Period.
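Those buckets are easy to turn into a check you can run against your own measurements. This is a sketch only: the function name is made up, and since the list doesn't say which bucket exactly 200, 400, or 800ms belongs to, the boundary handling here is my assumption.

```python
def ttft_feel(ttft_ms):
    """Map a TTFT in milliseconds to the user-reaction buckets listed above.

    Boundary values are an assumption: exactly 200ms counts as "fast enough",
    and exactly 400ms or 800ms falls into the faster of the two buckets.
    """
    if ttft_ms < 200:
        return "magic"
    if ttft_ms <= 400:
        return "fast enough"
    if ttft_ms <= 800:
        return "getting impatient"
    return "feels broken"


if __name__ == "__main__":
    # TTFT values taken from the ranking table.
    for model, ms in [("Step-3.5-Flash", 120), ("DeepSeek V4 Pro", 400),
                      ("Kimi K2.5", 600), ("Qwen3.5-397B", 1200)]:
        print(f"{model}: {ms}ms -> {ttft_feel(ms)}")
```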
For my chat app, I use Step-3.5-Flash for the first response (120ms TTFT, feels instant), then switch to DeepSeek V4 Flash for follow-ups. It's a hybrid approach that costs me pennies.
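Oh, and here's another code snippet. This one shows how to chain the two models in a single reply: stream a short first answer from the fast model, then a longer follow-up from the smarter one: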
import requests


def smart_chat(user_message, api_key):
    """Yield raw SSE lines from the fast model, then from the smarter one."""
    url = "https://global-apis.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    # First response: use speed demon
    first_data = {
        "model": "step-3.5-flash",
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
        "max_tokens": 100,
    }

    # Send first response fast
    response = requests.post(url, headers=headers, json=first_data, stream=True, timeout=60)
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            # Raw SSE line ("data: {...}"); the caller still has to parse it.
            yield line.decode()

    # For deeper reasoning, switch to smart model
    second_data = {
        "model": "deepseek-v4-flash",
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": "Let me think more..."},
        ],
        "stream": True,
        "max_tokens": 300,
    }
    response = requests.post(url, headers=headers, json=second_data, stream=True, timeout=60)
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            yield line.decode()
That's it. Fast initial response, smart follow-up. Costs me like $0.001 per conversation.
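For a sanity check on the cost side, here's a back-of-the-envelope sketch using the `max_tokens` caps from the snippet (100 and 300) and the output prices from the ranking table. It counts output tokens only; input token pricing isn't listed in this post, so it's left out, and the helper name is made up.

```python
def output_cost_usd(tokens, price_per_million):
    """Cost in USD for a number of output tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    # Upper bound per call: 100 tokens from Step-3.5-Flash ($0.15/M)
    # plus 300 tokens from DeepSeek V4 Flash ($0.25/M).
    hybrid = output_cost_usd(100, 0.15) + output_cost_usd(300, 0.25)

    # The same 400 output tokens on GPT-4o at $10.00/M, for comparison.
    gpt4o = output_cost_usd(400, 10.00)

    print(f"hybrid: ${hybrid:.5f} per call")
    print(f"gpt-4o: ${gpt4o:.5f} per call")
```

That works out to about $0.00009 of output per call for the hybrid setup versus about $0.004 on GPT-4o, so even a conversation with several turns plus input tokens plausibly stays around the $0.001 mark.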
My Final Take
Look, I'm not saying you should ditch everything and use these models. But if you're an indie hacker like me, you're probably overpaying for speed you don't need.
The biggest takeaway? DeepSeek V4 Flash is the best all-rounder. 60 tok/s, 180ms TTFT, $0.25/M. It's the sweet spot between speed, quality, and cost.
But if you're building something where every millisecond counts — like a real-time assistant or a game — use Step-3.5-Flash. 80 tok/s at $0.15/M is unbeatable.
And if you're on a shoestring budget, Qwen3-8B at $0.01/M with 70 tok/s is practically free performance.
I'm actually using Global API right now for all my tests, since they support all these models through a single endpoint. If you want to try them yourself, here's the base URL: https://global-apis.com/v1. Grab a key and run my code. It'll take you like 10 minutes to test everything.
Trust me, your users will thank you. Nothing beats that "wow, it's instant" reaction.