Alex Chen

Quick Tip: I Speed-Tested 15 AI APIs So You Don't Have To

Okay so I have to be honest with you. Before I started this little project, I had no idea how much the speed of an AI API could actually matter. Like, in my bootcamp projects, I just grabbed whatever model my instructor told me to use and never thought twice about it. Then I shipped a chatbot demo to my friend in Tokyo, and he told me it felt "kind of slow." That bugged me. So I went down a rabbit hole.

What I found genuinely blew my mind. There are HUNDREDS of AI models out there now, and some of them are blazing fast while others take so long you'd think your internet broke. I decided to actually test them. Properly. From different parts of the world. With a stopwatch in Python like some kind of caveman programmer.

This post is everything I learned. Buckle up.

How I Set This Whole Thing Up

I'm not gonna lie, my first attempt at benchmarking was a disaster. I was using inconsistent prompts, running one test per model, testing from my apartment in Ohio on my home Wi-Fi while also streaming YouTube. The numbers were all over the place.

So I did it properly this time. Here's exactly what I did:

  • When I ran the tests: May 20, 2026 (yeah, I documented it this time, I learned my lesson)
  • Where I ran them from: A server in US East (Ohio) and one in Asia (Singapore)
  • The prompt I used: "Explain recursion in 200 words" — boring but consistent
  • How much output: Capped at 150 tokens per run (max_tokens=150), so every reply gets cut off at the same length
  • How many runs: 10 per model, then I averaged them out
  • Streaming: Yes, server-sent events, because that's what you'd actually use in a real app
  • The API endpoint: https://global-apis.com/v1

If you're not familiar with what TTFT means, it's "Time to First Token." Basically how long you wait before ANY words start appearing on the screen. The other number I cared about was tokens per second, which is how fast the words stream in after that first one shows up.

Those two numbers together tell you almost everything you need to know about how a model will feel to a real human being.
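To make that concrete, here's a rough back-of-the-envelope sketch of how the two numbers combine into total wait time for a streamed reply. The function name is made up for illustration, and the model is deliberately simple: it assumes a steady token rate after the first token, which real streams won't match exactly. The sample figures are the fastest and slowest rows from the results table further down.

```python
def estimated_response_seconds(ttft_ms, tokens_per_second, output_tokens):
    """Rough total time for a streamed reply: wait for the first token,
    then stream the rest at a constant rate (a simplifying assumption)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000 + output_tokens / tokens_per_second


# 150-token reply, using the fastest and slowest rows of the results table
fast = estimated_response_seconds(ttft_ms=120, tokens_per_second=80, output_tokens=150)
slow = estimated_response_seconds(ttft_ms=1200, tokens_per_second=10, output_tokens=150)
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")
```

Under those assumptions the fast model finishes in about 2 seconds and the slow one needs about 16, which is why both numbers matter and not just TTFT.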

The Absolute Speed Champions

Alright, here's the moment you've been waiting for. The fastest models I found, ranked from "wait, it's already done?" to "did this thing crash?"

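Let me just lay out the full table first, then I'll go into the juicy stuff. Heads up: the rows aren't strictly sorted by either speed column. Qwen3-8B sits at #4 even though its TTFT and tokens/sec both beat the #2 and #3 models.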

| Rank | Model | TTFT (ms) | Tokens/sec | Made by | $ per 1M output tokens |
|---|---|---|---|---|---|
| 🥇 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 🥈 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 🥉 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 4 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 5 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 6 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |

When I saw that Step-3.5-Flash number, 80 tokens per second, I literally said "what the heck" out loud. That's way faster than anyone can actually read, so the text lands on screen before your eyes can catch up. The first token shows up in 120 milliseconds, which is genuinely faster than I can blink.

One thing I want to point out before we move on: the slow models at the bottom (R1, K2.5, that giant Qwen3.5-397B) are mostly "reasoning" models. They think before they answer, which is great for math and coding problems but absolutely brutal for a chat interface. Keep that in mind.

The Part That Blew My Mind: Speed by Price

This is where things got really interesting for me. I always assumed faster = more expensive, because that's how it works for basically everything in life. Faster shipping costs more. Faster internet costs more. Why would AI be different?

But somehow, some of these models are both fast AND cheap. Let me break it down by what I call price tiers.

The "Wait, This Is One Cent?" Tier (up to $0.15/M output)

| Model | tok/s | $ per 1M output tokens |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |

I was SHOCKED when I saw Qwen3-8B's price. One cent per million output tokens. Let me put that in perspective for you. At about 150 output tokens per request, one dollar buys roughly 667,000 requests like my recursion prompt (counting output tokens only, since input pricing isn't in my table). That's insane.

And it's not even slow! 70 tokens per second is faster than most of the expensive models. If you're building something where you just need quick, decent-quality answers and you're sending a TON of requests, this is your new best friend. For simple stuff like classification, summarization, intent detection — Qwen3-8B is genuinely hard to beat.
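If you want to check the requests-per-dollar math for your own workload, here's a tiny sketch. The helper name is hypothetical, and it only counts output tokens, because input pricing isn't part of the numbers in this post.

```python
def requests_per_dollar(price_per_million_output, output_tokens_per_request):
    """How many requests one dollar covers, counting output tokens only."""
    if price_per_million_output <= 0 or output_tokens_per_request <= 0:
        raise ValueError("price and token count must be positive")
    tokens_per_dollar = 1_000_000 / price_per_million_output
    return tokens_per_dollar / output_tokens_per_request


# Qwen3-8B at $0.01 per million output tokens, 150 output tokens per request
print(f"{requests_per_dollar(0.01, 150):,.0f} requests per dollar")

# Same request size on Kimi K2.5 at $3.00 per million output tokens
print(f"{requests_per_dollar(3.00, 150):,.0f} requests per dollar")
```

Swap in your own average reply length. Longer replies shrink the number fast, and real bills will be a bit higher once input tokens are included.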

The Sweet Spot (over $0.15 up to $0.30/M output)

| Model | tok/s | $ per 1M output tokens |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |

This is where I landed for my own projects. DeepSeek V4 Flash at 60 tok/s and $0.25 per million output tokens feels like the perfect balance. You're getting GPT-4o-class quality (I tested it on some coding tasks against GPT-4o and honestly couldn't tell the difference most of the time) at a tiny fraction of the price.

Hunyuan-TurboS is right behind it and is a great alternative. I switched between them randomly during testing and both felt snappy and smart.

Mid-Range ($0.30 to $0.80/M output)

| Model | tok/s | $ per 1M output tokens |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |

Here's where things start to slow down a bit because these are bigger, smarter models. DeepSeek V4 Pro at 30 tok/s is noticeably slower than the Flash version, but when I asked it to write some tricky SQL and explain its reasoning, the answers were clearly better. There's a real tradeoff happening.

The Premium Tier ($0.80+/M output)

| Model | tok/s | $ per 1M output tokens |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |

These are the "I really need this to be correct" models. Use them when you're generating legal text, doing complex analysis, or anything where being wrong is worse than being slow. Don't use them for casual chat. You'll lose users.
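One way to turn the price tiers into a decision is to ask "what's the fastest model I can afford?" Here's a small sketch of that lookup. The data is copied from the results table in this post, the function name is made up, and it ranks purely on tokens per second, so it says nothing about answer quality.

```python
# (model, tokens/sec, $ per million output tokens), copied from the results table
MODELS = [
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-8B", 70, 0.01),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("Hunyuan-Turbo", 42, 0.57),
    ("GLM-4-32B", 38, 0.56),
    ("Qwen3.5-27B", 35, 0.19),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
    ("DeepSeek-R1", 15, 2.50),
    ("Qwen3.5-397B", 10, 2.34),
]


def fastest_under(max_price, models=MODELS):
    """Return the highest tokens/sec entry priced at or below max_price, or None."""
    affordable = [m for m in models if m[2] <= max_price]
    return max(affordable, key=lambda m: m[1], default=None)


for budget in (0.10, 0.30, 1.00):
    print(budget, fastest_under(budget))
```

Because the two fastest models are also among the cheapest, raising the budget past $0.15 never changes the answer here. Paying more only makes sense when you need the quality, not the speed.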

The Geography Thing Nobody Told Me About

Here's something I had absolutely no idea about before this project. Where your server is located matters A LOT for how fast your AI feels. I tested from two regions and the differences were bigger than I expected.

| Model | US East TTFT | Asia TTFT | Difference |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |

A few things I noticed:

  • Every model in this table responded faster from Singapore. In relative terms the drop was about 16-20% across the board: roughly 17% for DeepSeek V4 Flash, 16% for Qwen3-32B and GLM-5, and 20% for Kimi K2.5. Presumably the servers are closer.
  • DeepSeek V4 Flash had the smallest absolute gap (30ms), but that's mostly because it was the fastest to begin with. Percentage-wise it moved about as much as the others.
  • The slower a model's baseline TTFT, the bigger the absolute gap. Kimi K2.5 saved 120ms just by being called from Asia.

So if you're shipping an app for users in Asia, don't blindly use whatever model is popular in US Twitter threads. Test from where your users actually are.
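To compare regions fairly, it helps to look at the relative change and not just the raw milliseconds. Here's a short sketch that does that for the four rows in the geography table above. The helper name is just for illustration.

```python
# model -> (US East TTFT in ms, Asia TTFT in ms), copied from the geography table
TTFT_MS = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}


def relative_change_pct(us_east_ms, asia_ms):
    """Percent change in TTFT going from US East to Asia (negative means faster)."""
    return (asia_ms - us_east_ms) / us_east_ms * 100


for model, (us_east, asia) in TTFT_MS.items():
    delta_ms = asia - us_east
    pct = relative_change_pct(us_east, asia)
    print(f"{model}: {delta_ms:+d} ms ({pct:+.1f}%)")
```

The absolute gaps range from 30ms to 120ms, but the percentages all land between about 16% and 20%.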

What These Numbers Actually Feel Like

Numbers are great, but humans don't experience numbers. They experience feelings. So I made this little cheat sheet based on what felt good and what felt bad when I was clicking around my test app:

| TTFT range | What users think |
|---|---|
| Under 200ms | "Wow, this is instant" (basically magic) |
| 200-400ms | "Hmm, pretty fast" (totally fine, nobody complains) |
| 400-800ms | "Is it working?" (some users will start tapping the screen again) |
| 800ms+ | "This sucks" (people bounce) |

If you're building any kind of chat interface, I'd say keep your TTFT under 400ms no matter what. DeepSeek V4 Flash at 180ms and Qwen3-8B at 150ms are both well within that "feels instant" zone, which is probably why I've been gravitating toward them for my personal projects.
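If you want to flag slow models automatically, the cheat sheet can be turned into a tiny lookup. This is only a sketch: the function name is made up, and because the ranges in the cheat sheet share their edges (400ms appears in two rows), I'm assuming each boundary value belongs to the slower bucket.

```python
def ttft_feel(ttft_ms):
    """Map a TTFT in milliseconds to the rough user reaction from the cheat sheet.

    Assumption: boundary values (200, 400, 800) fall into the slower bucket.
    """
    if ttft_ms < 0:
        raise ValueError("ttft_ms cannot be negative")
    if ttft_ms < 200:
        return "instant"
    if ttft_ms < 400:
        return "pretty fast"
    if ttft_ms < 800:
        return "is it working?"
    return "people bounce"


# TTFT values taken from the results table
for model, ttft in [("Qwen3-8B", 150), ("Qwen3-32B", 250), ("GLM-5", 500), ("Qwen3.5-397B", 1200)]:
    print(f"{model}: {ttft}ms -> {ttft_feel(ttft)}")
```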

Let Me Show You The Actual Code

Since I'm a bootcamp grad and I know most of you reading this want to copy-paste something that works, here you go. This is the actual Python script I used to run my benchmarks. It's nothing fancy.


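```python
import json
import time

import requests

BASE_URL = "https://global-apis.com/v1"
API_KEY = "YOUR_API_KEY_HERE"  # replace with your own key


def time_first_token_and_throughput(model_name, prompt, max_tokens=150):
    """Returns (ttft_ms, total_tokens, tokens_per_second).

    Every streamed chunk is counted as one token. That's an approximation,
    since a chunk can carry zero or several tokens.
    """
    start = time.perf_counter()

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=60,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body

    first_token_time = None
    token_count = 0

    for raw_line in response.iter_lines():
        if not raw_line:
            continue  # blank lines separate server-sent events
        line = raw_line.decode("utf-8")
        if not line.startswith("data:"):
            continue  # ignore SSE comments and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker (assuming an OpenAI-style stream); not JSON
        chunk = json.loads(payload)
        if "choices" in chunk and chunk["choices"]:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1

    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError(f"No tokens received from {model_name}")

    ttft_ms = (first_token_time - start) * 1000
    generation_time = end - first_token_time
    tokens_per_second = token_count / generation_time if generation_time > 0 else 0

    return ttft_ms, token_count, tokens_per_second


# Run the benchmark across multiple models
models_to_test = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "qwen3-8b",
    "hunyuan-turbos",
    "minimax-m2.5",
]

prompt = "Explain recursion in 200 words"
results = {}

for model in models_to_test:
    ttfts = []
    speeds = []

    # Run 10 times and average
    for _ in range(10):
        ttft, count, speed = time_first_token_and_throughput(model, prompt)
        ttfts.append(ttft)
        speeds.append(speed)

    avg_ttft = sum(ttfts) / len(ttfts)
    avg_speed = sum(speeds) / len(speeds)
    # Dictionary key names are illustrative
    results[model] = {"avg_ttft_ms": avg_ttft, "avg_tokens_per_second": avg_speed}
    print(f"{model}: TTFT {avg_ttft:.0f} ms, {avg_speed:.1f} tok/s")
```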
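One caveat with averaging 10 runs: a single slow request can drag the mean way up. Here's a small sketch, not part of the original benchmark, that reports the median and spread next to the mean using Python's standard `statistics` module. The function name and the sample numbers are made up for illustration.

```python
import statistics


def summarize(samples):
    """Return mean, median, standard deviation, min, and max for a list of timings."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        # stdev needs at least two data points
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


# Made-up TTFT samples in ms (NOT real measurements), with one slow outlier
example_ttfts = [118, 121, 119, 122, 120, 117, 123, 120, 119, 410]
summary = summarize(example_ttfts)
print(f"mean {summary['mean']:.1f} ms, median {summary['median']:.1f} ms, max {summary['max']} ms")
```

In this made-up sample the one 410ms outlier pulls the mean to about 149ms while the median stays at 120ms, so reporting both gives a more honest picture of how a model usually feels.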