DEV Community

gentleforge

I Ran 15 AI Models Through a Speed Test and I Can't Believe What I Found

Okay so I'm going to be honest with you — three months ago I didn't know what TTFT stood for. I barely knew how to call an API without copy-pasting from Stack Overflow. But after graduating from a coding bootcamp and building my first chatbot project, I ran into a problem that sent me down a rabbit hole I wasn't prepared for.

The bot was slow. Painfully slow. Users (well, my two friends I forced to test it) said things like "it's loading forever" and "did it crash?" I had no idea why. The model was the popular one. The code was clean. What was going on?

That question led me to benchmark 15 different AI models for speed. And what I found genuinely blew my mind.

Let me walk you through everything — what I tested, what surprised me, and which model you should probably use if you care about not making your users wait.


The Setup: What I Actually Did

Before I get into the results, let me explain how I ran these tests so you know I'm not making stuff up. I used Global API as the endpoint for everything (the base URL is https://global-apis.com/v1), which lets you access a bunch of different model providers through one place. This is huge for someone like me because I didn't want to sign up for 15 different accounts.

Here's what my test looked like:

| What I Did | Details |
|---|---|
| When | May 20, 2026 |
| Where | US East (Ohio) and Asia (Singapore) |
| What I asked | "Explain recursion in 200 words" |
| How long the answers were | About 150 tokens each |
| How many times | 10 runs per model, then I averaged the results |
| Streaming | Yes, using SSE |
| Endpoint | https://global-apis.com/v1 |

I know "150 tokens" sounds technical but basically it's just the length of the response. I picked the same prompt every time so I'd be comparing apples to apples.

The two things I cared about:

  1. TTFT (Time to First Token) — how long until the model starts spitting out its first word
  2. Tokens per second — how fast it keeps going after that

I had no idea these two numbers could be so different from model to model. Same task, same prompt, wildly different results.
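
Averaging the 10 runs is the step that's easiest to gloss over, so here's a minimal sketch of how it could look. The `average_runs` helper, the `fake_measure` stand-in, and the `ttft_ms` / `tokens_per_sec` field names are my own placeholders for illustration, not anything Global API provides.

```python
from statistics import mean


def average_runs(measure, runs=10):
    """Call measure() `runs` times and average the two speed numbers.

    `measure` is any zero-argument function returning a dict like
    {"ttft_ms": 180, "tokens_per_sec": 60.0} (hypothetical field names).
    """
    results = [measure() for _ in range(runs)]
    return {
        "ttft_ms": round(mean(r["ttft_ms"] for r in results)),
        "tokens_per_sec": round(mean(r["tokens_per_sec"] for r in results), 1),
    }


def fake_measure():
    # Stand-in for a real API call so this sketch runs offline.
    return {"ttft_ms": 180, "tokens_per_sec": 60.0}


print(average_runs(fake_measure, runs=10))
```

In a real test you'd pass in a function that actually calls the API. Averaging matters because a single run can be thrown off by one slow network hop.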


The Models I Tested (And Where They Ranked)

Here are all 15 models in ranked order, roughly fastest to slowest (Qwen3-8B sits at #4 even though its raw numbers beat #2 and #3). I'm just going to dump the whole table first so you have the full picture:

| Rank | Model | TTFT (ms) | Tokens/sec | Provider | Cost per million output tokens |
|---|---|---|---|---|---|
| 🥇 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 🥈 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 🥉 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 4 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 5 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 6 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |

I was shocked when I first looked at this. Look at the gap between #1 and #15 — Step-3.5-Flash streams at 80 tokens per second while Qwen3.5-397B crawls along at 10. That's an 8x difference for the same kind of task.

One quick note before we go further: the really slow ones at the bottom (R1, K2.5, and similar "thinking" models) spend time reasoning before they show you anything. So a lot of that TTFT is the model thinking, not network slowness. Still, if you want speed, those aren't your friends.
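
To get a feel for what these numbers mean for a whole reply, here's a back-of-the-envelope sketch. It assumes the total wait is just TTFT plus the time to stream ~150 tokens at the listed rate, which is a simplification (real streams aren't perfectly steady). The function name is my own.

```python
def estimated_total_seconds(ttft_ms, tokens_per_sec, output_tokens=150):
    """Rough wall-clock time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


# (TTFT in ms, tokens/sec), copied from the ranking table above.
models = {
    "Step-3.5-Flash": (120, 80),
    "DeepSeek V4 Flash": (180, 60),
    "Qwen3.5-397B": (1200, 10),
}

for name, (ttft_ms, tps) in models.items():
    seconds = estimated_total_seconds(ttft_ms, tps)
    print(f"{name}: ~{seconds:.1f}s for a 150-token reply")
```

By this estimate the fastest model finishes in about 2 seconds and the slowest needs about 16. Most of that gap comes from tokens per second, not TTFT.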


The Part That Actually Blew My Mind: Qwen3-8B

I have to call this one out separately because I literally said "wait, what?" out loud.

Qwen3-8B sits at #4 in the rankings with 70 tokens per second and a TTFT of just 150ms. The thing that made me do a double-take? It costs $0.01 per million output tokens.

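Let me put that in human terms. If you generated a million tokens of output with this thing, it would cost you a penny. A penny. Tokens are chunks of words rather than whole words, but that's still a huge amount of text, and I paid more for my coffee this morning.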

For a beginner like me building small projects where the response doesn't need to be Nobel Prize-winning quality, this is unreal. It's not the fastest (Step-3.5-Flash edges it out at 80 tok/s), but for the price? Nothing even comes close.
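
If you want to sanity-check the math for your own project, here's a tiny sketch. It only counts output tokens, since those are the prices in my table, and the 10,000-replies scenario is just a made-up example workload.

```python
def output_cost_usd(output_tokens, price_per_million_usd):
    """Cost of generated tokens at a per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload: 10,000 replies of about 150 tokens each.
total_tokens = 10_000 * 150

# Prices per million output tokens, copied from the ranking table.
for name, price in [("Qwen3-8B", 0.01), ("DeepSeek V4 Flash", 0.25), ("Kimi K2.5", 3.00)]:
    print(f"{name}: ${output_cost_usd(total_tokens, price):.3f}")
```

That works out to 1.5 million output tokens, so the bill ranges from a cent and a half on Qwen3-8B to $4.50 on Kimi K2.5. Input tokens are billed too, so treat this as a lower bound.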


Speed by Price Tier (How I Actually Think About It Now)

After staring at the table for a while, I started grouping the models by what they cost. This is how I make sense of it now:

The "Wow That's Cheap" Tier (up to $0.15/M output)

| Model | tok/s | Cost ($/M output) |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |

This is where I live now. For simple stuff — answering FAQs, summarizing short text, generating basic code — these two are basically all you need. If your project is budget-constrained (and whose isn't?), start here.

The "Sweet Spot" Tier (above $0.15, up to $0.30/M output)

| Model | tok/s | Cost ($/M output) |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |

I had no idea this tier existed before I did the tests. The quality is way better than the ultra-budget models, but the speed is still very respectable. DeepSeek V4 Flash is the one I'd recommend if you want GPT-4o-class answers without the GPT-4o price. 60 tok/s is plenty fast for a chat interface.

The "Getting Serious" Tier ($0.30 to $0.80/M output)

| Model | tok/s | Cost ($/M output) |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |

You start seeing bigger models here. They're noticeably slower but the quality jump is real, especially for complicated stuff like multi-step reasoning or long-form writing. DeepSeek V4 Pro at 30 tok/s isn't blazing fast, but the responses are sharper.

The "I Need It To Be Right" Tier ($0.80+/M output)

| Model | tok/s | Cost ($/M output) |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |

These are the heavy hitters. You use these when correctness is the whole point — like generating legal text, complex code, or anything where being wrong is expensive. I don't reach for these often, but when I do, I'm glad they exist.
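
One way to use these tiers in practice is to ask "what's the fastest thing I can afford?" Here's a rough sketch of that lookup. The `fastest_within_budget` helper is my own invention, the numbers are copied from the tier tables above, and it only looks at speed and price, not answer quality.

```python
# (model, tokens/sec, USD per million output tokens), from the tier tables.
MODELS = [
    ("Qwen3-8B", 70, 0.01),
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("GLM-4-32B", 38, 0.56),
    ("Hunyuan-Turbo", 42, 0.57),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
]


def fastest_within_budget(max_price, models=MODELS):
    """Return the highest tokens/sec entry priced at or below max_price, or None."""
    affordable = [m for m in models if m[2] <= max_price]
    return max(affordable, key=lambda m: m[1], default=None)


for budget in (0.05, 0.30, 3.00):
    print(f"budget ${budget:.2f}/M ->", fastest_within_budget(budget))
```

Running this made something click for me: once your budget reaches $0.15/M, the answer is always Step-3.5-Flash. In this data, paying more never buys more speed, so the pricier tiers only make sense if you need their quality.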


Wait, Location Actually Matters?

This was another thing I had no idea about. I ran the same tests from US East and from Singapore, and every model I compared across both regions responded faster from Asia, which makes sense if their servers are closer to Singapore.

| Model | US East TTFT | Asia TTFT | Difference |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |

Look at Kimi K2.5: a 120ms difference depending on where you're calling from. That doesn't sound like a lot until you line it up against the perception bands further down, which are only 200ms wide at the fast end. Kimi K2.5 lands in the "noticeable delay" band from both regions (600ms in the US, 480ms in Asia), but from Asia it gets much closer to the 400ms line where things start to feel fast.

DeepSeek V4 Flash was interesting: it had the smallest gap, just 30ms. My guess is that it's served from more places around the world, but I didn't verify that. If your users are all over the world and you can't pick a region, it looks like a safe bet.


What This Actually Means For Real Apps

Let me translate all this speed stuff into something more human. There's this framework I keep coming back to:

| TTFT Range | What Users Think |
|---|---|
| Under 200ms | "Whoa, that's instant" |
| 200-400ms | "Pretty fast, I'm good" |
| 400-800ms | "Hmm, is it loading?" |
| 800ms+ | "This is broken, I'm leaving" |

I built a chat app using GLM-5 at first (because I thought bigger = better, classic newbie mistake) and the 500ms TTFT was driving me nuts. Switching to DeepSeek V4 Flash cut that to 180ms and my friends stopped complaining. Same kind of answers, way better experience.
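
If you'd like to drop that framework into your own test script, here's a minimal sketch. The function name and the short labels are mine, and since the table doesn't say which band the exact boundary values belong to, I'm assuming 200, 400, and 800 fall into the slower band.

```python
def perceived_speed(ttft_ms):
    """Map a TTFT in milliseconds to the perception bands from the table above."""
    if ttft_ms < 200:
        return "instant"
    if ttft_ms < 400:
        return "fast"
    if ttft_ms < 800:
        return "noticeable delay"
    return "slow"


# TTFT values from the ranking table.
for name, ttft_ms in [("DeepSeek V4 Flash", 180), ("GLM-5", 500), ("Qwen3.5-397B", 1200)]:
    print(f"{name}: {ttft_ms}ms feels {perceived_speed(ttft_ms)}")
```

That's my GLM-5 to DeepSeek V4 Flash switch in one line each: from "noticeable delay" to "instant".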

The TL;DR of the whole post, if you scroll past everything: DeepSeek V4 Flash is the best overall pick at ~60 tok/s and ~180ms TTFT. Step-3.5-Flash is the speed king at ~80 tok/s. Hunyuan-TurboS is your budget-fast winner at $0.28/M.


The Code I Used (Copy This, Seriously)

Here's the Python script I used to test these. It's nothing fancy — I'm still a bootcamp grad, give me a break — but it works:

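```python
import json
import time

import requests

API_URL = "https://global-apis.com/v1/chat/completions"
API_KEY = "your-global-api-key-here"  # replace with your own


def test_model_speed(model_name, prompt="Explain recursion in 200 words"):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": True,
    }

    start = time.perf_counter()
    first_token_at = None
    chunk_count = 0

    with requests.post(
        API_URL, headers=headers, json=payload, stream=True, timeout=60
    ) as response:
        response.raise_for_status()  # fail loudly on a bad key or model name
        for line in response.iter_lines():
            if not line:
                continue
            decoded = line.decode("utf-8")
            if not decoded.startswith("data: ") or decoded == "data: [DONE]":
                continue
            # Assumes OpenAI-style chunks: choices[0].delta.content
            chunk = json.loads(decoded[len("data: "):])
            choices = chunk.get("choices") or [{}]
            if not (choices[0].get("delta") or {}).get("content"):
                continue  # skip role-only or empty chunks
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunk_count += 1

    if first_token_at is None:
        raise RuntimeError(f"No content received from {model_name}")

    # Measure generation speed after the first token so TTFT doesn't drag it down.
    # Each streamed chunk is counted as roughly one token, which is an approximation.
    generation_time = time.perf_counter() - first_token_at
    tokens_per_sec = (chunk_count - 1) / generation_time if generation_time > 0 else 0.0

    return {
        "model": model_name,
        "ttft_ms": round((first_token_at - start) * 1000),
        "tokens_per_sec": round(tokens_per_sec, 1),
    }


# Test it!
result = test_model_speed("deepseek-v4-flash")
print(result)
```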

And here's a quick non-streaming version if you just want to check out the basic API call (this is how I started before I got fancy with streaming):


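```python
import requests

API_URL = "https://global-apis.com/v1/chat/completions"
API_KEY = "your-global-api-key-here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

data = {
    "model": "qwen3-8b",  # cheap and fast, perfect for testing
    "messages": [
        {"role": "user", "content": "Explain recursion in 200 words"}
    ],
    # The original listing cuts off at this value; 200 is assumed here
    # because it matches the streaming script above.
    "max_tokens": 200,
}

# Everything below is an assumed completion of the truncated listing.
response = requests.post(API_URL, headers=headers, json=data, timeout=60)
response.raise_for_status()
# Assumes an OpenAI-style response body.
print(response.json()["choices"][0]["message"]["content"])
```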
