Alex Chen

Posted on Jun 19

Quick Tip: I Speed-Tested 15 AI APIs So You Don't Have To

#ai #machinelearning #programming #tutorial

Okay so I have to be honest with you. Before I started this little project, I had no idea how much the speed of an AI API could actually matter. Like, in my bootcamp projects, I just grabbed whatever model my instructor told me to use and never thought twice about it. Then I shipped a chatbot demo to my friend in Tokyo, and he told me it felt "kind of slow." That bugged me. So I went down a rabbit hole.

What I found genuinely blew my mind. There are HUNDREDS of AI models out there now, and some of them are blazing fast while others take so long you'd think your internet broke. I decided to actually test them. Properly. From different parts of the world. With a stopwatch in Python like some kind of caveman programmer.

This post is everything I learned. Buckle up.

How I Set This Whole Thing Up

I'm not gonna lie, my first attempt at benchmarking was a disaster. I was using inconsistent prompts, running one test per model, testing from my apartment in Ohio on my home Wi-Fi while also streaming YouTube. The numbers were all over the place.

So I did it properly this time. Here's exactly what I did:

When I ran the tests: May 20, 2026 (yeah, I documented it this time, I learned my lesson)
Where I ran them from: A server in US East (Ohio) and one in Asia (Singapore)
The prompt I used: "Explain recursion in 200 words" — boring but consistent
How much output: Around 150 tokens each time (capped with max_tokens=150)
How many runs: 10 per model, then I averaged them out
Streaming: Yes, server-sent events, because that's what you'd actually use in a real app
The API endpoint: https://global-apis.com/v1
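A plain average over 10 runs can hide one slow outlier, so it can help to look at the median and the spread as well. Here's a minimal sketch using Python's standard statistics module. The `summarize_runs` helper and the sample numbers are made up for illustration; they are not measurements from this post.

```python
import statistics

def summarize_runs(samples_ms):
    """Summarize repeated latency measurements (all values in milliseconds)."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        # stdev needs at least two samples
        "stdev": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
        "min": min(samples_ms),
        "max": max(samples_ms),
    }

# Hypothetical TTFT samples for one model: ten runs, one of them an outlier
sample_ttfts = [118, 121, 119, 122, 120, 117, 123, 120, 119, 321]
summary = summarize_runs(sample_ttfts)
print(summary)
```

In this made-up example the single slow run drags the mean well above the median, which is the kind of thing worth spotting before trusting an average.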

If you're not familiar with what TTFT means, it's "Time to First Token." Basically how long you wait before ANY words start appearing on the screen. The other number I cared about was tokens per second, which is how fast the words stream in after that first one shows up.

Those two numbers together tell you almost everything you need to know about how a model will feel to a real human being.
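To see how the two numbers combine, here's a rough sketch that estimates how long a full response takes: wait for the first token, then stream the rest at a steady rate. The `estimated_response_seconds` function is a hypothetical helper, the steady-rate assumption is a simplification, and the inputs are the TTFT and tokens/sec figures from the results table further down.

```python
def estimated_response_seconds(ttft_ms, tokens_per_second, output_tokens):
    """Rough time until the last token arrives, assuming a steady streaming rate."""
    return ttft_ms / 1000 + output_tokens / tokens_per_second

# Fastest and slowest rows of the results table, for a 150-token answer
fast = estimated_response_seconds(ttft_ms=120, tokens_per_second=80, output_tokens=150)
slow = estimated_response_seconds(ttft_ms=1200, tokens_per_second=10, output_tokens=150)

print(f"Step-3.5-Flash: about {fast:.1f}s for the full answer")
print(f"Qwen3.5-397B:   about {slow:.1f}s for the full answer")
```

For short answers TTFT dominates how the app feels; for long answers the tokens/sec term takes over.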

The Absolute Speed Champions

Alright, here's the moment you've been waiting for. The fastest models I found, ranked from "wait, it's already done?" to "did this thing crash?"

Let me just lay out the full table first, then I'll go into the juicy stuff:

Rank	Model	TTFT (ms)	Tokens/sec	Made By	$/M Output
🥇	Step-3.5-Flash	120	80	StepFun	$0.15
🥈	DeepSeek V4 Flash	180	60	DeepSeek	$0.25
🥉	Hunyuan-TurboS	200	55	Tencent	$0.28
4	Qwen3-8B	150	70	Qwen	$0.01
5	Qwen3-32B	250	45	Qwen	$0.28
6	Doubao-Seed-Lite	220	50	ByteDance	$0.40
7	Hunyuan-Turbo	280	42	Tencent	$0.57
8	GLM-4-32B	300	38	Zhipu	$0.56
9	Qwen3.5-27B	350	35	Qwen	$0.19
10	DeepSeek V4 Pro	400	30	DeepSeek	$0.78
11	MiniMax M2.5	450	28	MiniMax	$1.15
12	GLM-5	500	25	Zhipu	$1.92
13	Kimi K2.5	600	20	Moonshot	$3.00
14	DeepSeek-R1	800	15	DeepSeek	$2.50
15	Qwen3.5-397B	1200	10	Qwen	$2.34

When I saw that Step-3.5-Flash number (80 tokens per second) I literally said "what the heck" out loud. That's way faster than anyone can actually read. The first token shows up in 120 milliseconds, which is genuinely faster than I can blink.

One thing I want to point out before we move on: the slow models at the bottom (R1, K2.5, that giant Qwen3.5-397B) are mostly "reasoning" models. They think before they answer, which is great for math and coding problems but absolutely brutal for a chat interface. Keep that in mind.

The Part That Blew My Mind: Speed by Price

This is where things got really interesting for me. I always assumed faster = more expensive, because that's how it works for basically everything in life. Faster shipping costs more. Faster internet costs more. Why would AI be different?

But somehow, some of these models are both fast AND cheap. Let me break it down by what I call price tiers.

The "Wait, This Is One Cent?" Tier ($0.15/M output or less)

Model	tok/s	$/M Output
Qwen3-8B	70	$0.01
Step-3.5-Flash	80	$0.15

I was SHOCKED when I saw Qwen3-8B's price. One cent per million tokens. Let me put that in perspective for you. At about 150 output tokens per request, you could send roughly 667,000 requests like my recursion prompt for one dollar (counting output tokens only). That's insane.
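The per-dollar math is easy to redo for your own workload. Here's a small sketch, assuming only output tokens are billed (input pricing isn't listed in this post) and using a hypothetical `requests_per_dollar` helper. The result is very sensitive to the token count you assume: about 150 output tokens per request gives roughly 667,000 requests per dollar, while about 133 tokens gives roughly 750,000.

```python
def requests_per_dollar(price_per_million_output, output_tokens_per_request):
    """How many requests one dollar buys, counting output tokens only."""
    tokens_per_dollar = 1_000_000 / price_per_million_output
    return tokens_per_dollar / output_tokens_per_request

# Qwen3-8B at $0.01 per million output tokens, ~150 output tokens per request
qwen3_8b = requests_per_dollar(price_per_million_output=0.01, output_tokens_per_request=150)
print(f"Qwen3-8B: {qwen3_8b:,.0f} requests per dollar")

# Same calculation for DeepSeek V4 Flash at $0.25 per million output tokens
v4_flash = requests_per_dollar(price_per_million_output=0.25, output_tokens_per_request=150)
print(f"DeepSeek V4 Flash: {v4_flash:,.0f} requests per dollar")
```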

And it's not even slow! 70 tokens per second is faster than most of the expensive models. If you're building something where you just need quick, decent-quality answers and you're sending a TON of requests, this is your new best friend. For simple stuff like classification, summarization, intent detection — Qwen3-8B is genuinely hard to beat.

The Sweet Spot ($0.15 to $0.30/M output)

Model	tok/s	$/M Output
DeepSeek V4 Flash	60	$0.25
Hunyuan-TurboS	55	$0.28
Qwen3-32B	45	$0.28

This is where I landed for my own projects. DeepSeek V4 Flash at 60 tok/s and $0.25 per million output tokens feels like the perfect balance. You're getting GPT-4o-class quality (I tested it on some coding tasks against GPT-4o and honestly couldn't tell the difference most of the time) at a tiny fraction of the price.

Hunyuan-TurboS is right behind it and is a great alternative. I switched between them randomly during testing and both felt snappy and smart.

Mid-Range ($0.30 to $0.80/M output)

Model	tok/s	$/M Output
Doubao-Seed-Lite	50	$0.40
GLM-4-32B	38	$0.56
Hunyuan-Turbo	42	$0.57
DeepSeek V4 Pro	30	$0.78

Here's where things start to slow down a bit because these are bigger, smarter models. DeepSeek V4 Pro at 30 tok/s is noticeably slower than the Flash version, but when I asked it to write some tricky SQL and explain its reasoning, the answers were clearly better. There's a real tradeoff happening.

The Premium Tier ($0.80+/M output)

Model	tok/s	$/M Output
MiniMax M2.5	28	$1.15
GLM-5	25	$1.92
Kimi K2.5	20	$3.00

These are the "I really need this to be correct" models. Use them when you're generating legal text, doing complex analysis, or anything where being wrong is worse than being slow. Don't use them for casual chat. You'll lose users.
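If you'd rather filter than eyeball tiers, here's a minimal sketch that picks models by a price ceiling and a TTFT ceiling, then sorts by streaming speed. The numbers are copied from the results table above; the `pick_models` function and the tuple layout are just one hypothetical way to organize it.

```python
# (model, ttft_ms, tokens_per_sec, usd_per_million_output), copied from the results table
MODELS = [
    ("Step-3.5-Flash", 120, 80, 0.15),
    ("DeepSeek V4 Flash", 180, 60, 0.25),
    ("Hunyuan-TurboS", 200, 55, 0.28),
    ("Qwen3-8B", 150, 70, 0.01),
    ("Qwen3-32B", 250, 45, 0.28),
    ("Doubao-Seed-Lite", 220, 50, 0.40),
    ("Hunyuan-Turbo", 280, 42, 0.57),
    ("GLM-4-32B", 300, 38, 0.56),
    ("Qwen3.5-27B", 350, 35, 0.19),
    ("DeepSeek V4 Pro", 400, 30, 0.78),
    ("MiniMax M2.5", 450, 28, 1.15),
    ("GLM-5", 500, 25, 1.92),
    ("Kimi K2.5", 600, 20, 3.00),
    ("DeepSeek-R1", 800, 15, 2.50),
    ("Qwen3.5-397B", 1200, 10, 2.34),
]

def pick_models(max_price, max_ttft_ms):
    """Return models within the price and TTFT limits, fastest streaming first."""
    candidates = [m for m in MODELS if m[3] <= max_price and m[1] <= max_ttft_ms]
    return sorted(candidates, key=lambda m: m[2], reverse=True)

# Example: chat-friendly models at $0.30/M output or less
for name, ttft, tps, price in pick_models(max_price=0.30, max_ttft_ms=400):
    print(f"{name:20} {ttft:>5} ms  {tps:>3} tok/s  ${price:.2f}")
```

This only looks at speed and price. It says nothing about answer quality, which is the whole reason the slower tiers exist.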

The Geography Thing Nobody Told Me About

Here's something I had absolutely no idea about before this project. Where your server is located matters A LOT for how fast your AI feels. I tested from two regions and the differences were bigger than I expected.

Model	US East TTFT	Asia TTFT	Difference (Asia vs. US East)
DeepSeek V4 Flash	180ms	150ms	-30ms
Qwen3-32B	250ms	210ms	-40ms
GLM-5	500ms	420ms	-80ms
Kimi K2.5	600ms	480ms	-120ms

A few things I noticed:

Qwen, GLM, and Kimi had roughly 16-20% lower TTFT when called from Asia. That makes sense if the servers are closer.
DeepSeek V4 Flash had the smallest gap in absolute terms (30ms), but that's still about 17% in relative terms, so it's right in line with the others.
The slower the model, the bigger the gap in milliseconds. Kimi K2.5 saved 120ms just by being called from Asia, presumably because that's closer to its servers.
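Absolute and relative differences tell slightly different stories, so it's worth computing both. Here's a short sketch using the four rows from the table above; `relative_change_percent` is a hypothetical helper name.

```python
# (model, us_east_ttft_ms, asia_ttft_ms), copied from the regional table
REGIONAL_TTFT = [
    ("DeepSeek V4 Flash", 180, 150),
    ("Qwen3-32B", 250, 210),
    ("GLM-5", 500, 420),
    ("Kimi K2.5", 600, 480),
]

def relative_change_percent(us_east_ms, asia_ms):
    """Percentage change in TTFT when calling from Asia instead of US East."""
    return (asia_ms - us_east_ms) / us_east_ms * 100

for model, us_east, asia in REGIONAL_TTFT:
    change = relative_change_percent(us_east, asia)
    print(f"{model:20} {asia - us_east:>5} ms  {change:6.1f}%")
```

A negative value means the call was faster from Asia.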

So if you're shipping an app for users in Asia, don't blindly use whatever model is popular in US Twitter threads. Test from where your users actually are.

What These Numbers Actually Feel Like

Numbers are great, but humans don't experience numbers. They experience feelings. So I made this little cheat sheet based on what felt good and what felt bad when I was clicking around my test app:

TTFT Range	What Users Think
Under 200ms	"Wow, this is instant" — basically magic
200-400ms	"Hmm, pretty fast" — totally fine, nobody complains
400-800ms	"Is it working?" — some users will start tapping the screen again
800ms+	"This sucks" — people bounce

If you're building any kind of chat interface, I'd say keep your TTFT under 400ms no matter what. DeepSeek V4 Flash at 180ms and Qwen3-8B at 150ms both land under the 200ms "feels instant" line, which is probably why I've been gravitating toward them for my personal projects.
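The cheat sheet is easy to turn into a check you can drop into a monitoring script. This is a sketch only: `ttft_feel` is a hypothetical helper, the labels are shortened versions of the table rows, and putting a value of exactly 200, 400, or 800ms into the slower bucket is my assumption, since the table doesn't say.

```python
def ttft_feel(ttft_ms):
    """Map a TTFT in milliseconds to the rough user reaction from the cheat sheet."""
    if ttft_ms < 200:
        return "instant"
    if ttft_ms < 400:
        return "pretty fast"
    if ttft_ms < 800:
        return "is it working?"
    return "this sucks"

# A few TTFT values from the results table
for model, ttft in [("Qwen3-8B", 150), ("Qwen3-32B", 250), ("GLM-5", 500), ("Qwen3.5-397B", 1200)]:
    print(f"{model:15} {ttft:>5} ms -> {ttft_feel(ttft)}")
```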

Let Me Show You The Actual Code

Since I'm a bootcamp grad and I know most of you reading this want to copy-paste something that works, here you go. This is the actual Python script I used to run my benchmarks. It's nothing fancy.


python
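import json
import time

import requests

BASE_URL = "https://global-apis.com/v1"
YOUR_API_KEY_HERE = "paste-your-key-here"  # placeholder: swap in your real API key

def time_first_token_and_throughput(model_name, prompt, max_tokens=150):
    """Returns (ttft_ms, total_tokens, tokens_per_second)"""
    start = time.perf_counter()

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {YOUR_API_KEY_HERE}"},
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True
        },
        stream=True,
        timeout=60
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body

    first_token_time = None
    token_count = 0

    for line in response.iter_lines():
        if not line:
            continue  # blank lines separate SSE events
        text = line.decode("utf-8")
        if not text.startswith("data:"):
            continue  # ignore SSE comments and other fields
        payload = text[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker used by OpenAI-style APIs; it is not JSON
        chunk = json.loads(payload)
        if "choices" in chunk and chunk["choices"]:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1  # counts streamed chunks, which only approximates tokens

    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError(f"No tokens came back from {model_name}")

    ttft_ms = (first_token_time - start) * 1000
    generation_time = end - first_token_time
    tokens_per_second = token_count / generation_time if generation_time > 0 else 0

    return ttft_ms, token_count, tokens_per_second

# Run the benchmark across multiple models
models_to_test = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "qwen3-8b",
    "hunyuan-turbos",
    "minimax-m2.5"
]

prompt = "Explain recursion in 200 words"
results = {}

for model in models_to_test:
    ttfts = []
    speeds = []

    # Run 10 times and average
    for _ in range(10):
        ttft, count, speed = time_first_token_and_throughput(model, prompt)
        ttfts.append(ttft)
        speeds.append(speed)

    avg_ttft = sum(ttfts) / len(ttfts)
    avg_speed = sum(speeds) / len(speeds)
    results[model] = (avg_ttft, avg_speed)
    print(f"{model}: {avg_ttft:.0f} ms TTFT, {avg_speed:.1f} tok/s")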
