Alex Chen

I Benchmarked 15 AI Models for Speed – Here's What Will Blow Your Mind

So I’m building this little chat app. Nothing fancy, just a side project that might turn into something real. And you know what? Speed was killing me. Not the code – the latency. Every time a user hit send and waited, I could see them bouncing. 100ms extra? Bye bye user. I needed to KNOW which models were actually fast, not just the ones with flashy marketing.

So I did what any indie hacker would do: I spent a weekend benchmarking 15 different models through Global API. I ran them from two regions, measured Time to First Token (TTFT) and tokens per second. And I’m sharing it all here, raw and unfiltered.

TL;DR: DeepSeek V4 Flash is the all-round beast (~60 tok/s, ~180ms TTFT). Step-3.5-Flash is the speed demon at ~80 tok/s. And if you're broke but need speed? Qwen3-8B – $0.01/M output and 70 tok/s. I'm not even joking.


How I Ran the Tests

I wanted real-world results, not some synthetic benchmark. So I used a simple prompt: "Explain recursion in 200 words." Streamed via SSE. Each model got 10 runs, averaged. Here's the setup:

| Parameter | Value |
|---|---|
| Test Date | May 20, 2026 |
| Test Regions | US East (Ohio) and Asia (Singapore) |
| Prompt | "Explain recursion in 200 words" |
| Output Tokens | ~150 per run |
| Iterations | 10, averaged |
| Streaming | Yes (SSE) |
| API | Global API (https://global-apis.com/v1) |

Here's the Python code I used to measure – feel free to steal it:

import json
import time

import requests

def benchmark_model(model, api_key):
    url = "https://global-apis.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words."}],
        "stream": True,
    }

    # perf_counter is monotonic, so it is safer than time.time() for short intervals
    start = time.perf_counter()
    response = requests.post(url, json=payload, headers=headers, stream=True, timeout=60)
    response.raise_for_status()  # fail loudly on a bad key or unknown model name
    first_token_time = None
    chunks = []

    for line in response.iter_lines():
        # SSE data lines start with b"data: "; the stream ends with b"data: [DONE]"
        if line.startswith(b"data: ") and line[6:] != b"[DONE]":
            if first_token_time is None:
                first_token_time = time.perf_counter()
            chunks.append(json.loads(line[6:]))

    end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError(f"No streamed data received for {model}")
    ttft = (first_token_time - start) * 1000  # ms
    elapsed = end - start  # whole request, TTFT included
    # Each streamed chunk counts as one token, which is only an approximation
    tok_per_sec = len(chunks) / elapsed
    return ttft, tok_per_sec

api_key = "your-api-key-here"
# Example: DeepSeek V4 Flash
ttft, tps = benchmark_model("deepseek-v4-flash", api_key)
print(f"TTFT: {ttft:.0f}ms | Tokens/s: {tps:.1f}")

Super simple. You can drop any model name from Global API's list.
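
The script above measures a single run, while the numbers in this post are averages over 10. Here is a minimal sketch of how that averaging could look. `average_runs` and the fake results are hypothetical names and values for illustration, not the exact harness behind the tables. It takes any zero-argument callable that returns `(ttft_ms, tok_per_sec)`, so it can wrap `benchmark_model` or a stub.

```python
import statistics
from typing import Callable, Tuple


def average_runs(run_once: Callable[[], Tuple[float, float]], iterations: int = 10) -> dict:
    """Call run_once() repeatedly and summarize TTFT (ms) and tokens/sec."""
    ttfts, speeds = [], []
    for _ in range(iterations):
        ttft_ms, tok_per_sec = run_once()
        ttfts.append(ttft_ms)
        speeds.append(tok_per_sec)
    return {
        "ttft_mean_ms": statistics.mean(ttfts),
        # The median is less sensitive to one slow outlier run than the mean
        "ttft_median_ms": statistics.median(ttfts),
        "tps_mean": statistics.mean(speeds),
    }


if __name__ == "__main__":
    # Made-up stand-in results; a real call would look like
    # average_runs(lambda: benchmark_model("deepseek-v4-flash", api_key))
    fake_results = iter([(170.0, 62.0), (190.0, 58.0), (180.0, 60.0)])
    print(average_runs(lambda: next(fake_results), iterations=3))
```

Reporting the median next to the mean is a cheap way to spot a single cold-start run skewing the average.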


The Speed Ranking (Fastest to Slowest)

Honestly, I was shocked by some of these results. I’ve put them in a table because I’m a data nerd, but I’ll break it down after.

| Rank | Model | TTFT (ms) | Tokens/sec | Provider | $/M Output |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 2 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 3 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 4 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 5 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 6 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |

Note: Reasoning models (R1, K2.5) include internal thinking time before first token – that's why TTFT is high. But they're smart.


Speed by Price Tier

Because let's be real – as an indie hacker, I care about speed AND cost. Can't be spending $3 per million tokens on a hobby project.

Ultra-Budget (≤ $0.15/M)

| Model | tok/s | $/M Output |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |

Qwen3-8B is INSANE. 70 tok/s at literally ONE CENT per million output tokens. For simple stuff – summarization, classification, chatbots that don't need deep reasoning – it's unbeatable. Step-3.5-Flash is the speed king at 80 tok/s, and only $0.15/M. Worth it if you need low latency.

Budget ($0.15–$0.30/M)

| Model | tok/s | $/M Output |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |

This is the sweet spot. DeepSeek V4 Flash is my go-to. 60 tok/s, 180ms TTFT, and quality that rivals GPT-4o. For $0.25/M. I mean... just use it.

Mid-Range ($0.30–$0.80/M)

| Model | tok/s | $/M Output |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |

These are bigger models, so speed drops. DeepSeek V4 Pro at 30 tok/s is still decent, but you pay more for quality. Honestly, unless you need the extra reasoning, stick with V4 Flash.

Premium ($0.80+/M)

| Model | tok/s | $/M Output |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |

These are for when correctness > speed. Like legal document analysis or code generation where a mistake costs you. But at 20 tok/s? Your users will feel it. Use only if you have to.
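
To put the tiers in perspective, here's a rough back-of-the-envelope sketch that turns tok/s and $/M into "seconds per reply" and "dollars per million replies". It assumes the tok/s figure is measured over the whole request, TTFT included, as in the benchmark script, so total time is roughly tokens divided by tok/s. The reply volume is a made-up number for illustration, and the helper names are mine.

```python
# (model, tokens/sec, $ per million output tokens), copied from the tier tables
MODELS = [
    ("Qwen3-8B", 70, 0.01),
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("Kimi K2.5", 20, 3.00),
]


def reply_seconds(output_tokens: int, tok_per_sec: float) -> float:
    """Approximate time to stream a full reply."""
    return output_tokens / tok_per_sec


def cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Output-token cost in dollars at a given $/M price."""
    return output_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    tokens_per_reply = 150   # matches the ~150 output tokens per run in the setup
    replies = 1_000_000      # hypothetical volume, for illustration only
    for name, tps, price in MODELS:
        secs = reply_seconds(tokens_per_reply, tps)
        total = cost_usd(tokens_per_reply * replies, price)
        print(f"{name:<18} {secs:4.1f}s per reply   ${total:,.2f} per {replies:,} replies")
```

Under these assumptions, a 150-token reply takes 7.5 seconds to finish at 20 tok/s, versus under 2 seconds at 80 tok/s. Input tokens are not counted here, so real bills would be higher.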


Geographic Latency: Where You Run Matters

I tested from two regions to see the network impact. You'd be surprised how much server location matters.

| Model | US East TTFT | Asia TTFT | Diff (Asia vs US East) |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |

All four models reach first token 16-20% faster from Asia. Obvious, right? They all come from Asia-based providers. DeepSeek V4 Flash has the smallest absolute gap (30ms), so it feels almost the same everywhere, even though that is still about 17% in relative terms. If your users are in Asia, consider Qwen3 models or DeepSeek.
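
If you want the relative numbers rather than raw milliseconds, the arithmetic is simple. This small sketch uses the TTFT values from the table above; the function name is just one I picked for illustration.

```python
# (model, US East TTFT in ms, Asia TTFT in ms), copied from the table above
REGIONAL_TTFT = [
    ("DeepSeek V4 Flash", 180, 150),
    ("Qwen3-32B", 250, 210),
    ("GLM-5", 500, 420),
    ("Kimi K2.5", 600, 480),
]


def ttft_change_pct(us_east_ms: float, asia_ms: float) -> float:
    """Percentage change in TTFT when calling from Asia instead of US East."""
    return (asia_ms - us_east_ms) / us_east_ms * 100


if __name__ == "__main__":
    for name, us, asia in REGIONAL_TTFT:
        print(f"{name:<18} {asia - us:+5d}ms  {ttft_change_pct(us, asia):+.1f}%")
```

Looking at percentages and milliseconds side by side matters because a 30ms gap on a fast model and a 120ms gap on a slow one can be nearly the same relative change.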


Real-World Impact: TTFT Tells the Story

I built a little chart for myself (sharing it here because why not):

| TTFT | User Perception |
|---|---|
| < 200ms | "Instant" – users stay |
| 200–400ms | "Fast" – acceptable |
| 400–800ms | "Noticeable delay" – you lose some |
| 800ms+ | "Slow" – you lose everyone |

My recommendation: Keep TTFT under 400ms for interactive chat. Use DeepSeek V4 Flash (180ms) or Qwen3-8B (150ms) or Step-3.5-Flash (120ms). Your users will thank you.
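
If you log TTFT in your own app, the chart above can be turned into a tiny classifier. This is a sketch, not a standard: the chart doesn't say which bucket exactly 400ms falls into, so I'm assuming each boundary value goes to the slower bucket (with 800ms counting as "Slow", per the "800ms+" row).

```python
def perceived_speed(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to the perception buckets from the chart."""
    if ttft_ms < 200:
        return "Instant"
    if ttft_ms < 400:   # assumption: exactly 400ms falls into the next bucket
        return "Fast"
    if ttft_ms < 800:
        return "Noticeable delay"
    return "Slow"


if __name__ == "__main__":
    # TTFT values taken from the speed ranking table
    for name, ttft in [("Step-3.5-Flash", 120), ("Qwen3-32B", 250),
                       ("Kimi K2.5", 600), ("DeepSeek-R1", 800)]:
        print(f"{name:<16} {ttft:>5}ms  {perceived_speed(ttft)}")
```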


The Bottom Line

If you're an indie hacker like me, don't overthink this. For most use cases:

  • Need speed + quality? DeepSeek V4 Flash. (60 tok/s, $0.25/M)
  • Need raw speed on a budget? Qwen3-8B. (70 tok/s, $0.01/M)
  • Need to flex with the fastest? Step-3.5-Flash. (80 tok/s, $0.15/M)
  • Building a reasoning app? Accept slower TTFT – R1 or K2.5.
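
The picks above boil down to "fastest model that fits my budget and my latency limit". Here's a rough sketch of that selection using the numbers from the ranking table. `fastest_within` is a hypothetical helper, and the 400ms default is the interactive-chat limit suggested earlier in this post. It ignores quality entirely, so treat it as a first filter only.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    ttft_ms: int
    tok_per_sec: int
    price_per_million: float


# Copied from the speed ranking table
MODELS = [
    Model("Step-3.5-Flash", 120, 80, 0.15),
    Model("Qwen3-8B", 150, 70, 0.01),
    Model("DeepSeek V4 Flash", 180, 60, 0.25),
    Model("Hunyuan-TurboS", 200, 55, 0.28),
    Model("Doubao-Seed-Lite", 220, 50, 0.40),
    Model("Qwen3-32B", 250, 45, 0.28),
    Model("Hunyuan-Turbo", 280, 42, 0.57),
    Model("GLM-4-32B", 300, 38, 0.56),
    Model("Qwen3.5-27B", 350, 35, 0.19),
    Model("DeepSeek V4 Pro", 400, 30, 0.78),
    Model("MiniMax M2.5", 450, 28, 1.15),
    Model("GLM-5", 500, 25, 1.92),
    Model("Kimi K2.5", 600, 20, 3.00),
    Model("DeepSeek-R1", 800, 15, 2.50),
    Model("Qwen3.5-397B", 1200, 10, 2.34),
]


def fastest_within(max_price: float, max_ttft_ms: int = 400) -> Optional[Model]:
    """Return the highest tok/s model within the price and TTFT limits, or None."""
    candidates = [m for m in MODELS
                  if m.price_per_million <= max_price and m.ttft_ms <= max_ttft_ms]
    return max(candidates, key=lambda m: m.tok_per_sec, default=None)


if __name__ == "__main__":
    for budget in (0.05, 0.30, 1.00):
        print(f"<= ${budget:.2f}/M: {fastest_within(budget)}")
```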

I tested all through Global API – they just give you a single endpoint (https://global-apis.com/v1) and you swap model names. Super easy. If you want to run these benchmarks yourself (and you should, because your use case might differ), grab an API key
