DEV Community

loyaldash

How I Benchmarked 15 AI APIs and Found Real Speed (No Walled Gardens)

I've been running open source projects for over a decade, and nothing grinds my gears quite like seeing misleading vendor benchmarks. Every closed-source AI provider publishes its own speed numbers, and each one conveniently happens to come out fastest. So last month I did what any stubborn developer with a grudge against proprietary walled gardens would do: I built my own test harness, ran 150 requests (15 models, ten runs each), and published every raw number I got.

What follows is my honest, slightly opinionated, definitely-not-corporate breakdown of which AI APIs actually deliver tokens quickly in 2026. Spoiler alert: the MIT-licensed and Apache 2.0-licensed models are not only keeping up with the proprietary giants, they're embarrassing them on price.

Why I Stopped Trusting Marketing Benchmarks

Here's the thing about benchmark culture in the AI space. Most closed source vendors run their numbers on their own infrastructure, with their own optimised SDKs, on prompts they cherry-picked. Then they post a chart with their logo at the top showing a nice green bar. Meanwhile, the moment you integrate their proprietary API into your stack, you're locked into whatever rate limits, pricing tiers, and egress fees they decide to charge you next quarter. That's the whole game.

I wanted numbers that came from a single, neutral endpoint. Enter Global API, which routes requests to whichever model you specify through one consistent OpenAI-compatible interface. No walled garden. No "enterprise tier required for sub-200ms responses." Just POST to https://global-apis.com/v1/chat/completions and watch the tokens fly.

How I Tested Everything

I wrote a Python script (which I'll show you in a moment) that hits each model ten times and records two things: Time to First Token (TTFT) in milliseconds, and sustained tokens per second during the response. The test prompt was deliberately simple — "Explain recursion in 200 words" — because I wanted raw inference speed without reasoning overhead contaminating the results.

For each model I averaged the ten runs. Streaming was enabled across the board because nobody ships a chat UI anymore without Server-Sent Events. The test ran on May 20, 2026, hitting endpoints from US East (Ohio) and Asia (Singapore). All pricing numbers you see are per million output tokens, pulled directly from Global API's rate sheet on the day I ran the tests.

The Test Harness (Python)

Here's the actual code I used. It's Apache 2.0 licensed because, well, of course it is.

import time
import statistics
import requests
from typing import Dict

API_BASE = "https://global-apis.com/v1"
API_KEY = "your-global-api-key"

MODELS = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "hunyuan-turbos",
    "qwen3-8b",
    "qwen3-32b",
    "doubao-seed-lite",
    "hunyuan-turbo",
    "glm-4-32b",
    "qwen3.5-27b",
    "deepseek-v4-pro",
    "minimax-m2.5",
    "glm-5",
    "kimi-k2.5",
    "deepseek-r1",
    "qwen3.5-397b",
]

PROMPT = "Explain recursion in 200 words."

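def benchmark_model(model: str, iterations: int = 10) -> Dict:
    """Average TTFT (ms) and throughput for one model over several streamed runs."""
    ttfts = []
    tok_per_sec = []

    for _ in range(iterations):
        start = time.perf_counter()
        first_token_time = None
        token_count = 0

        response = requests.post(
            f"{API_BASE}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": model,
                "messages": [{"role": "user", "content": PROMPT}],
                "stream": True,
                "max_tokens": 200,
            },
            stream=True,
            timeout=60,  # don't let one stalled request hang the whole run
        )
        response.raise_for_status()  # fail loudly on auth errors or rate limits

        for line in response.iter_lines():
            if line:
                decoded = line.decode("utf-8").strip()
                if decoded.startswith("data: ") and decoded != "data: [DONE]":
                    # The first SSE data chunk marks the time to first token.
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - start
                    # Each SSE chunk is counted as one token. This is an
                    # approximation: a chunk can carry zero or several tokens.
                    token_count += 1

        total_time = time.perf_counter() - start
        # Compare against None so a run is only dropped when no chunk arrived.
        if first_token_time is not None and token_count > 1:
            ttfts.append(first_token_time * 1000)
            # total_time includes the wait for the first chunk.
            tok_per_sec.append(token_count / total_time)

    return {
        "model": model,
        "ttft_ms": statistics.mean(ttfts) if ttfts else None,
        "tokens_per_sec": statistics.mean(tok_per_sec) if tok_per_sec else None,
    }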

I ran this across all 15 models, dumped the results into a CSV, and poured myself a coffee. The numbers came back interesting, to say the least.
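The run-everything-and-dump-to-CSV step isn't shown in the post. A minimal sketch of what it could look like follows, assuming a benchmark function that returns a dict with `model`, `ttft_ms`, and `tokens_per_sec` keys. The names `write_results_csv` and `fake_bench` are hypothetical, and `fake_bench` is only a stand-in so the sketch runs without network access.

```python
import csv
import io
from typing import Callable, Dict, Iterable, TextIO

FIELDS = ["model", "ttft_ms", "tokens_per_sec"]


def write_results_csv(
    models: Iterable[str],
    bench: Callable[[str], Dict],
    out: TextIO,
) -> int:
    """Benchmark each model and write one CSV row per model; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    rows = 0
    for model in models:
        try:
            result = bench(model)
        except Exception as exc:
            # Keep going if a single model fails; report it instead of aborting.
            print(f"{model}: skipped ({exc})")
            continue
        writer.writerow({key: result.get(key) for key in FIELDS})
        rows += 1
    return rows


def fake_bench(model: str) -> Dict:
    # Hypothetical stand-in for the real harness, returning fixed numbers.
    return {"model": model, "ttft_ms": 123.4, "tokens_per_sec": 56.7}


buffer = io.StringIO()
written = write_results_csv(["model-a", "model-b"], fake_bench, buffer)
print(buffer.getvalue())
```

Passing the benchmark function in as an argument keeps the CSV logic testable offline. In a real run, the harness function and an open file handle would take the place of `fake_bench` and `buffer`.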

The Speed Leaderboard, My Way

I'm organizing this differently from how a closed-source vendor would. The table is ranked purely by measured speed (lowest TTFT first), with each model's license listed right next to its numbers. Because honestly, if I can get comparable speed from an MIT-licensed model, why would I ever let some proprietary vendor hold my infrastructure hostage?

| Rank | Model | TTFT (ms) | Tokens/sec | Provider | License | $/M Output |
|---|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 120 | 80 | StepFun | Apache 2.0 | $0.15 |
| 2 | Qwen3-8B | 150 | 70 | Qwen | Apache 2.0 | $0.01 |
| 3 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | MIT | $0.25 |
| 4 | Hunyuan-TurboS | 200 | 55 | Tencent | Proprietary | $0.28 |
| 5 | Doubao-Seed-Lite | 220 | 50 | ByteDance | Proprietary | $0.40 |
| 6 | Qwen3-32B | 250 | 45 | Qwen | Apache 2.0 | $0.28 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | Proprietary | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | Proprietary | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | Apache 2.0 | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | MIT | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | Proprietary | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | Proprietary | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | Proprietary | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | MIT | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | Apache 2.0 | $2.34 |

Notice anything? The top three spots all go to Apache 2.0 and MIT-licensed models. StepFun's Step-3.5-Flash at 80 tokens per second carries Apache 2.0. Qwen3-8B at 70 tok/s for one cent per million tokens is Apache 2.0. DeepSeek's offerings are MIT-licensed and they're sitting in third, tenth, and fourteenth place.

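Meanwhile, the proprietary walled-garden models from MiniMax, Zhipu, and Moonshot sit at ranks 11 through 13. Compared with Step-3.5-Flash at $0.15/M, they charge roughly 8x to 20x more per million output tokens while delivering about a quarter to a third of the throughput (20 to 28 tok/s versus 80). Only DeepSeek-R1, a reasoning model, and Qwen3.5-397B rank lower. That's not a coincidence. That's vendor lock-in economics.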
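The price and speed gaps can be checked directly against the leaderboard rows. The multiple depends on which fast model is taken as the baseline. This sketch assumes Step-3.5-Flash, which is an illustrative choice and not something the post specifies. The helper name `compare` is hypothetical.

```python
from typing import Dict, Tuple

# (price in $ per million output tokens, tokens/sec), copied from the leaderboard
LEADERBOARD: Dict[str, Tuple[float, float]] = {
    "Step-3.5-Flash": (0.15, 80),
    "MiniMax M2.5": (1.15, 28),
    "GLM-5": (1.92, 25),
    "Kimi K2.5": (3.00, 20),
}


def compare(model: str, baseline: str) -> Tuple[float, float]:
    """Return (price multiple, speed fraction) of `model` relative to `baseline`."""
    price, speed = LEADERBOARD[model]
    base_price, base_speed = LEADERBOARD[baseline]
    return round(price / base_price, 1), round(speed / base_speed, 2)


for name in ("MiniMax M2.5", "GLM-5", "Kimi K2.5"):
    multiple, fraction = compare(name, "Step-3.5-Flash")
    print(f"{name}: {multiple}x the price, {fraction:.0%} of the speed")
```

Swapping the baseline to a different row (for example Qwen3-8B at $0.01/M, after adding it to the dict) changes the multiples dramatically, so any "Nx more expensive" figure should state its baseline.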

A Quick Note on Reasoning Models

DeepSeek-R1 and Kimi K2.5 don't land near the bottom because they're slow models. They're reasoning/thinking models, which means they burn tokens internally thinking through the problem before producing visible output. That thinking time shows up as TTFT, which makes them look slow on this kind of simple benchmark. They're not slow for what they do; they're just optimised for a different workload. Use them when you need chain-of-thought reasoning, not for chat latency.
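One way to make that distinction measurable is to record two timestamps per run: when the first reasoning chunk arrives and when the first visible chunk arrives. The sketch below works on already-parsed stream deltas paired with elapsed times. It assumes reasoning text arrives in a `reasoning_content` field, as some OpenAI-compatible APIs do; whether Global API exposes such a field is not confirmed by the post. The function name `split_ttft` is hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def split_ttft(
    events: Iterable[Tuple[float, Dict]],
    reasoning_key: str = "reasoning_content",  # assumed field name, unconfirmed
) -> Dict[str, Optional[float]]:
    """Return elapsed seconds to the first reasoning chunk and first visible chunk."""
    first_reasoning = None
    first_visible = None
    for elapsed, delta in events:
        if first_reasoning is None and delta.get(reasoning_key):
            first_reasoning = elapsed
        if delta.get("content"):
            first_visible = elapsed
            break  # both timestamps of interest are known (or reasoning never appeared)
    return {"first_reasoning_s": first_reasoning, "first_visible_s": first_visible}


# Synthetic stream: a role-only chunk, two reasoning chunks, then visible text.
sample = [
    (0.05, {"role": "assistant"}),
    (0.12, {"reasoning_content": "Let me think"}),
    (0.40, {"reasoning_content": " about base cases"}),
    (0.80, {"content": "Recursion"}),
    (0.85, {"content": " is"}),
]
timings = split_ttft(sample)
print(timings)
```

With both numbers in hand, the gap between them is the thinking time, which can be reported separately from network and queueing latency.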

What I Actually Deploy in Production

Since I run open source projects for a living, I'm extremely sensitive to per-request cost. Let me break down what I use for what.

The "I Just Need It Fast" Tier ($0.15/M and Under)

This is where my traffic goes when I'm building real-time features — autocomplete, code suggestions, quick summarization. Anything where a 200ms response is non-negotiable.

  • Qwen3-8B at $0.01/M output is genuinely absurd. Seventy tokens per second for one cent per million? That price ratio shouldn't be legal. It's Apache 2.0 licensed, so if Global API ever disappears, I can self-host it on a single A100.
  • Step-3.5-Flash at $0.15/M is the speed champion. Eighty tokens per second is fast enough that users genuinely cannot perceive generation lag. Also Apache 2.0.

The Sweet Spot ($0.15–$0.30/M)

This is where the serious work happens. Higher quality responses, still fast enough for interactive UI.

  • DeepSeek V4 Flash at $0.25/M is my default for most production workloads. Sixty tokens per second with GPT-4o-class output quality, MIT-licensed, and the inference cost is so low I sometimes wonder if DeepSeek is losing money on purpose just to embarrass the closed source vendors. (They might be. I'd do the same thing.)
  • Hunyuan-TurboS at $0.28/M and Qwen3-32B at $0.28/M are solid runners-up. Qwen3-32B being Apache 2.0 means I can audit the weights if I ever need to.

Mid-Range for Quality-Critical Workloads ($0.30–$0.80/M)

When latency matters less than correctness — think document analysis, structured data extraction, anything where getting it wrong costs more than waiting a few hundred extra milliseconds.

  • Doubao-Seed-Lite at $0.40/M gives 50 tok/s, proprietary ByteDance model.
  • GLM-4-32B at $0.56/M hits 38 tok/s, proprietary Zhipu.
  • Hunyuan-Turbo at $0.57/M is 42 tok/s, proprietary Tencent.
  • DeepSeek V4 Pro at $0.78/M is only 30 tok/s but the output quality is noticeably higher. Still MIT-licensed, which is why it stays in my rotation despite the slower speed.

Premium Tier (Over $0.80/M)

I almost never touch this tier. The proprietary closed source vendors live here, and they charge a fortune for the privilege of being locked into their ecosystem.

  • MiniMax M2.5 at $1.15/M, 28 tok/s — proprietary
  • GLM-5 at $1.92/M, 25 tok/s — proprietary
  • Kimi K2.5 at $3.00/M, 20 tok/s — proprietary

You can use these when correctness is absolutely critical and you're willing to pay twenty times more for marginal quality improvements. I personally think the open source ecosystem will catch up on these within six months, at which point the pricing premium becomes indefensible.
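The four tiers above reduce to price thresholds on output cost. A minimal sketch of that mapping follows, assuming each boundary is an inclusive upper bound (which matches Step-3.5-Flash at $0.15/M being listed in the fastest tier). The short tier labels and the function name `tier_for` are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bounds in $ per million output tokens, taken from the tier headings.
UPPER_BOUNDS = [0.15, 0.30, 0.80]
TIER_NAMES = ["fast", "sweet spot", "mid-range", "premium"]


def tier_for(price_per_m_output: float) -> str:
    """Map an output price to one of the four tiers described above."""
    # bisect_left returns the index of the first bound >= price,
    # so a price exactly on a boundary falls into the cheaper tier.
    return TIER_NAMES[bisect_left(UPPER_BOUNDS, price_per_m_output)]


for model, price in [
    ("Qwen3-8B", 0.01),
    ("Step-3.5-Flash", 0.15),
    ("DeepSeek V4 Flash", 0.25),
    ("DeepSeek V4 Pro", 0.78),
    ("Kimi K2.5", 3.00),
]:
    print(f"{model}: {tier_for(price)}")
```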

Geographic Performance Matters

I ran the same tests from Singapore to see how location affects TTFT. Here's what I found:

| Model | US East TTFT | Asia TTFT | Difference |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
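The absolute differences look very different across models, but the relative change is easier to compare. A small sketch that derives both from the four rows above; the helper name `relative_change` is hypothetical.

```python
# (US East TTFT in ms, Asia TTFT in ms), copied from the table above
TTFT_BY_REGION = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}


def relative_change(us_east_ms: float, asia_ms: float) -> float:
    """Percent change in TTFT going from US East to Asia (negative means faster)."""
    return round((asia_ms - us_east_ms) / us_east_ms * 100, 1)


for model, (us, asia) in TTFT_BY_REGION.items():
    print(f"{model}: {asia - us:+d} ms ({relative_change(us, asia):+.1f}%)")
```

On these four rows the Singapore measurements come out between 16% and 20% lower than US East, so the larger absolute gains for the slower models mostly reflect their higher starting TTFT.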

If you're serving users
