gentleforge

Posted on Jun 2

Quick Tip: Speed-Test 15 AI Models in Under 10 Minutes

#python #api #ai #tutorial

Honestly, I gotta say — I've been building AI apps for years now, and nothing kills a product faster than slow responses. I've watched users bounce from my own prototypes because the thing took three seconds to start talking. Three seconds! That's like a lifetime when you're staring at a loading spinner.

So I decided to run my own speed tests. No corporate fluff, no marketing spin: just me, a Python script, and 15 different models. Here's what I found, including some surprises that'll probably save you money.

Why I Even Bothered With This

Look, I'm an indie hacker. I don't have a team of engineers or a six-figure cloud budget. Every millisecond of latency in my app means real users closing the tab. I've seen it happen.

I was using GPT-4o for everything last month. It's great, sure, but it costs $10.00 per million output tokens and honestly? For my chatbot that answers simple questions, it's overkill. I needed something cheaper and faster.

So I grabbed my laptop, brewed some coffee, and started hammering APIs. Here's the setup I used:

Test date: May 20, 2026
Regions: US East (Ohio), Asia (Singapore)
Prompt: "Explain recursion in 200 words"
Output: About 150 tokens per run
Runs per model: 10 (averaged)
Streaming: Yes (SSE)
API base: https://global-apis.com/v1

Pretty straightforward. I wanted to see two things: Time to First Token (TTFT) — how fast the model starts talking — and sustained tokens per second. Because if a model takes forever to even say "hello," nobody's gonna wait.
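To pin those two metrics down, here's a minimal sketch that computes both from raw timestamps. The function name `stream_metrics` and the sample timings are made up for illustration, and measuring the sustained rate from the first token onward is just one reasonable definition, not the only one.

```python
def stream_metrics(request_sent, token_times):
    """Compute TTFT (ms) and sustained tokens/sec from timestamps in seconds.

    request_sent: when the request went out.
    token_times: arrival time of each token, in order.
    """
    if not token_times:
        return {"ttft_ms": None, "tokens_per_sec": 0.0}

    ttft_ms = (token_times[0] - request_sent) * 1000

    # Sustained rate: tokens after the first one, divided by the time
    # between the first and last token, so TTFT does not drag it down.
    generation_time = token_times[-1] - token_times[0]
    if generation_time > 0:
        tokens_per_sec = (len(token_times) - 1) / generation_time
    else:
        tokens_per_sec = 0.0

    return {
        "ttft_ms": round(ttft_ms, 1),
        "tokens_per_sec": round(tokens_per_sec, 1),
    }


# Hypothetical timings: first token after 0.12 s, then one token every 12.5 ms
sample_times = [0.12 + i * 0.0125 for i in range(150)]
metrics = stream_metrics(0.0, sample_times)
print(metrics)
```

With those invented timings, the result works out to roughly 120 ms TTFT and 80 tok/s.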

The Shocking Speed Rankings

Alright, here's the raw data. I'm gonna be real with you — some of these numbers blew my mind.

Rank	Model	TTFT (ms)	Tokens/sec	Price per M output tokens
🥇	Step-3.5-Flash	120	80	$0.15
🥈	Qwen3-8B	150	70	$0.01
🥉	DeepSeek V4 Flash	180	60	$0.25
4	Hunyuan-TurboS	200	55	$0.28
5	Doubao-Seed-Lite	220	50	$0.40
6	Qwen3-32B	250	45	$0.28
7	Hunyuan-Turbo	280	42	$0.57
8	GLM-4-32B	300	38	$0.56
9	Qwen3.5-27B	350	35	$0.19
10	DeepSeek V4 Pro	400	30	$0.78
11	MiniMax M2.5	450	28	$1.15
12	GLM-5	500	25	$1.92
13	Kimi K2.5	600	20	$3.00
14	DeepSeek-R1	800	15	$2.50
15	Qwen3.5-397B	1200	10	$2.34

I gotta say, seeing Step-3.5-Flash at 120ms TTFT and 80 tok/s for only $0.15/M is INSANE. That's basically free compared to what I was paying before. And Qwen3-8B at $0.01/M? I literally laughed out loud when I saw that number.

But here's the thing — those reasoning models like DeepSeek-R1 and Kimi K2.5? They're slow because they're thinking internally before showing you the first token. That 800ms for R1 includes all the "thinking" time. If you need a math proof, fine. If you're building a chat app, AVOID THEM.

How I Actually Tested This

I wrote a quick Python script. Nothing fancy. Here's what it looked like:

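import json
import time

import requests

def test_model_speed(model_name, api_key):
    url = "https://global-apis.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    data = {
        "model": model_name,
        "messages": [{"role": "user", "content": "Explain recursion in 200 words"}],
        "stream": True,
        "max_tokens": 200
    }

    first_token_time = None
    total_tokens = 0

    # perf_counter is monotonic, so system clock adjustments can't skew the timing
    start_time = time.perf_counter()
    response = requests.post(url, headers=headers, json=data, stream=True, timeout=60)
    response.raise_for_status()  # fail loudly instead of timing an error response

    for line in response.iter_lines():
        # Assumes OpenAI-style SSE: each event arrives as a line like b'data: {...}'
        if not line.startswith(b"data:"):
            continue
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices or not (choices[0].get("delta") or {}).get("content"):
            continue  # role-only or empty chunk: no text has arrived yet
        if first_token_time is None:
            first_token_time = time.perf_counter() - start_time
        # One content chunk is roughly one token, though a chunk can carry more
        total_tokens += 1

    total_time = time.perf_counter() - start_time
    tokens_per_sec = total_tokens / total_time if total_time > 0 else 0

    return {
        "model": model_name,
        "ttft_ms": round(first_token_time * 1000, 2) if first_token_time else None,
        "tokens_per_sec": round(tokens_per_sec, 2)
    }

# Run it
result = test_model_speed("step-3.5-flash", "your-api-key-here")
print(json.dumps(result, indent=2))

That's it. A couple dozen lines of actual logic. It counts streamed content chunks as a stand-in for tokens, so treat the tokens/sec figure as an approximation. You can plug in any model name and get results in seconds. I ran it ten times for each model to average out network noise.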
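For the averaging step, something like the sketch below would do. It assumes each run returns a dict with the same `ttft_ms` and `tokens_per_sec` keys as the script above; the helper name `summarize_runs` and the sample numbers are invented for illustration. I've included the median too, since a single network hiccup can pull the mean around quite a bit.

```python
import statistics


def summarize_runs(results):
    """Aggregate per-run results for one model, skipping runs with no tokens."""
    good = [r for r in results if r["ttft_ms"] is not None]
    if not good:
        return {"runs": 0, "ttft_ms_mean": None,
                "ttft_ms_median": None, "tokens_per_sec_mean": None}

    ttfts = [r["ttft_ms"] for r in good]
    rates = [r["tokens_per_sec"] for r in good]
    return {
        "runs": len(good),
        "ttft_ms_mean": round(statistics.mean(ttfts), 1),
        "ttft_ms_median": round(statistics.median(ttfts), 1),
        "tokens_per_sec_mean": round(statistics.mean(rates), 1),
    }


# Made-up numbers for three runs, one of them hit by a network hiccup
fake_runs = [
    {"model": "step-3.5-flash", "ttft_ms": 118.0, "tokens_per_sec": 81.0},
    {"model": "step-3.5-flash", "ttft_ms": 122.0, "tokens_per_sec": 79.0},
    {"model": "step-3.5-flash", "ttft_ms": 310.0, "tokens_per_sec": 77.0},
]
summary = summarize_runs(fake_runs)
print(summary)
```

In this fake sample the one slow run drags the mean TTFT well above the median, which is why it can be worth reporting both.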

The Price vs Speed Sweet Spot

Alright, let me break this down by budget. Because I know y'all are indie hackers and every cent matters.

Ultra-Budget ($0.15/M or Less)

Model	Tokens/sec	Price
Qwen3-8B	70	$0.01
Step-3.5-Flash	80	$0.15

Qwen3-8B at $0.01/M is basically free. I mean, 70 tokens per second for one cent per million? That's absurd. For simple stuff like translation, summarization, or basic Q&A, this is my new go-to. The quality isn't amazing, but for a lot of use cases, it's plenty.

Step-3.5-Flash at 80 tok/s is the speed king. If your app needs to feel instant, use this.

Budget ($0.15-$0.30/M)

Model	Tokens/sec	Price
Qwen3.5-27B	35	$0.19
DeepSeek V4 Flash	60	$0.25
Hunyuan-TurboS	55	$0.28
Qwen3-32B	45	$0.28

DeepSeek V4 Flash is the winner here. 60 tok/s with quality that honestly rivals GPT-4o, at $0.25/M instead of $10.00/M. That's one-fortieth of the price. I've been using it for my customer support bot and it's night and day.

Mid-Range ($0.30-$0.80/M)

Model	Tokens/sec	Price
Doubao-Seed-Lite	50	$0.40
GLM-4-32B	38	$0.56
Hunyuan-Turbo	42	$0.57
DeepSeek V4 Pro	30	$0.78

These are bigger models. Speed drops because they're doing more work. V4 Pro at 30 tok/s is slower but noticeably smarter. I use it when I need real reasoning, not just speed.

Premium (Over $0.80/M)

Model	Tokens/sec	Price
MiniMax M2.5	28	$1.15
GLM-5	25	$1.92
Qwen3.5-397B	10	$2.34
DeepSeek-R1	15	$2.50
Kimi K2.5	20	$3.00

Honestly? Unless you're building something where accuracy is life-or-death, skip these. $3.00 per million tokens for 20 tok/s? No thanks. I'd rather use DeepSeek V4 Flash and run it twice.
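If you'd rather not eyeball the tiers, the same lookup can be sketched in a few lines. The data is copied from the rankings table; the function name `fastest_within_budget` is mine, and it only looks at speed and price, not quality, so treat it as a starting point.

```python
# (model, tokens/sec, $ per million output tokens), copied from the rankings table
MODELS = [
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-8B", 70, 0.01),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("Hunyuan-Turbo", 42, 0.57),
    ("GLM-4-32B", 38, 0.56),
    ("Qwen3.5-27B", 35, 0.19),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
    ("DeepSeek-R1", 15, 2.50),
    ("Qwen3.5-397B", 10, 2.34),
]


def fastest_within_budget(max_price_per_m, models=MODELS):
    """Return the (model, tok/s, price) row with the highest tok/s at or under the price cap."""
    affordable = [m for m in models if m[2] <= max_price_per_m]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m[1])


for cap in (0.10, 0.30, 5.00):
    print(cap, fastest_within_budget(cap))
```

Notice that once the cap reaches $0.15/M, Step-3.5-Flash wins at every budget on raw speed, so the pricier tiers only make sense if you need the extra quality.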

Where You're Hosting Matters

I tested from two regions because I'm paranoid about latency. Here's what I found:

Model	US East TTFT	Asia TTFT	Difference
DeepSeek V4 Flash	180ms	150ms	-30ms
Qwen3-32B	250ms	210ms	-40ms
GLM-5	500ms	420ms	-80ms
Kimi K2.5	600ms	480ms	-120ms

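All four models answer faster from Asia, by about 16-20% in relative terms, most likely because their servers are closer. Makes sense. That includes DeepSeek V4 Flash, at roughly 17%. In absolute terms, though, DeepSeek has the smallest gap, only 30ms, because its TTFT is the lowest to begin with.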

If your users are in Asia, use Qwen or GLM. If they're everywhere, DeepSeek V4 Flash is your safest bet.
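Here's a quick sketch of the arithmetic behind that comparison, using the four rows from the table. The helper name `region_gap` is just for illustration. It reports both the absolute gap in milliseconds and the gap as a percentage of the US East number, since the two can tell slightly different stories.

```python
# (model, US East TTFT in ms, Asia TTFT in ms), copied from the table above
REGION_TTFT = [
    ("DeepSeek V4 Flash", 180, 150),
    ("Qwen3-32B", 250, 210),
    ("GLM-5", 500, 420),
    ("Kimi K2.5", 600, 480),
]


def region_gap(us_ms, asia_ms):
    """Return (difference in ms, difference as % of the US East TTFT)."""
    diff_ms = asia_ms - us_ms
    pct = diff_ms / us_ms * 100
    return diff_ms, round(pct, 1)


gaps = {name: region_gap(us, asia) for name, us, asia in REGION_TTFT}
for name, (diff_ms, pct) in gaps.items():
    print(f"{name}: {diff_ms} ms ({pct}%)")
```

Measured this way, every model lands somewhere between 16% and 20% faster from Asia, even though the millisecond gaps range from 30 to 120.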

Real Talk: What This Means for Your App

I've built three products around these findings. Here's what I learned:

Under 200ms TTFT: Users think it's magic. They literally say "wow" out loud.
200-400ms: "Fast enough" — nobody complains.
400-800ms: "Hmm, it's loading..." — some users start getting impatient.
Over 800ms: "This is broken" — they leave. Period.
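Those buckets are easy to turn into a check you can run against your own measurements. This is a rough sketch: the function name `perceived_speed` and the short labels are mine, and since the ranges above share their edges, I've arbitrarily put exactly 400ms in "fast enough" and exactly 800ms in "loading".

```python
def perceived_speed(ttft_ms):
    """Map a TTFT in milliseconds to the rough user reaction described above."""
    if ttft_ms < 200:
        return "magic"
    if ttft_ms <= 400:
        return "fast enough"
    if ttft_ms <= 800:
        return "loading"
    return "broken"


# TTFT values taken from the rankings table
for model, ttft in [("Step-3.5-Flash", 120), ("Qwen3-32B", 250),
                    ("Kimi K2.5", 600), ("Qwen3.5-397B", 1200)]:
    print(model, perceived_speed(ttft))
```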

For my chat app, I use Step-3.5-Flash for the first response (120ms TTFT, feels instant), then switch to DeepSeek V4 Flash for follow-ups. It's a hybrid approach that costs me pennies.

Oh, and here's another code snippet. This one shows how to chain the two models back to back in a single generator:

import json

import requests

URL = "https://global-apis.com/v1/chat/completions"

def stream_text(payload, headers):
    # Yield only the text pieces, assuming OpenAI-style SSE lines (b'data: {...}')
    response = requests.post(URL, headers=headers, json=payload, stream=True, timeout=60)
    response.raise_for_status()
    for line in response.iter_lines():
        if not line.startswith(b"data:"):
            continue
        body = line[len(b"data:"):].strip()
        if body == b"[DONE]":
            break
        choices = json.loads(body).get("choices") or []
        text = (choices[0].get("delta") or {}).get("content") if choices else None
        if text:
            yield text

def smart_chat(user_message, api_key):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # First response: use speed demon
    first_data = {
        "model": "step-3.5-flash",
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
        "max_tokens": 100
    }

    # Send first response fast
    yield from stream_text(first_data, headers)

    # For deeper reasoning, switch to smart model.
    # Note: it only sees the question plus a placeholder assistant turn,
    # not the text the first model actually produced.
    second_data = {
        "model": "deepseek-v4-flash",
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": "Let me think more..."}
        ],
        "stream": True,
        "max_tokens": 300
    }

    yield from stream_text(second_data, headers)

That's it. Fast initial response, smart follow-up. Costs me like $0.001 per conversation.
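As a sanity check on that number, here's the output-token math for one exchange at the `max_tokens` limits in the snippet, using the per-million prices from the rankings table. It's a sketch under one big assumption: input-token pricing isn't listed in this post, so it's left out, and the helper name `output_cost` is just for illustration.

```python
# $ per million output tokens, from the rankings table
PRICE_PER_M_OUTPUT = {
    "step-3.5-flash": 0.15,
    "deepseek-v4-flash": 0.25,
}


def output_cost(model, output_tokens):
    """Dollar cost of the output tokens only (input tokens are not included)."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000


# Worst case for one exchange: both calls hit their max_tokens limits
exchange_cost = (
    output_cost("step-3.5-flash", 100)
    + output_cost("deepseek-v4-flash", 300)
)
print(f"${exchange_cost:.6f} per exchange (output tokens only)")
```

That comes to about $0.00009 of output per exchange, so a $0.001 conversation leaves room for input tokens and several back-and-forth turns.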

My Final Take

Look, I'm not saying you should ditch everything and use these models. But if you're an indie hacker like me, you're probably overpaying for speed you don't need.

The biggest takeaway? DeepSeek V4 Flash is the best all-rounder. 60 tok/s, 180ms TTFT, $0.25/M. It's the sweet spot between speed, quality, and cost.

But if you're building something where every millisecond counts — like a real-time assistant or a game — use Step-3.5-Flash. 80 tok/s at $0.15/M is unbeatable.

And if you're on a shoestring budget, Qwen3-8B at $0.01/M with 70 tok/s is literally free performance.

I'm actually using Global API right now for all my tests. They support all these models through a single endpoint. If you want to try them yourself, here's the base URL: https://global-apis.com/v1. Grab a key and run my code. It'll take you like 10 minutes to test everything.

Trust me, your users will thank you. Nothing beats that "wow, it's instant" reaction.

DEV Community