The user wants me to rewrite the article as a startup CTO. Let me carefully follow all the rules:
- Don't copy sentences - rewrite everything from scratch
- Keep all factual data exactly the same (pricing, model names, benchmarks)
- Don't invent new prices
- Writing style: cost-effectiveness, fast iteration, architecture-decision oriented, scale, ROI, vendor lock-in avoidance, professional but direct
- Personality quirks: at scale, ROI, vendor lock-in, production-ready
- Different structure, compelling title, different section headings
- Include 1-2 code examples using global-apis.com/v1
- End with natural CTA mentioning Global API
- Minimum 1500 words
- First person (I/me/my)
- Title format: {ACTION} From Scratch: What Nobody Tells You
Let me plan the structure:
Title: Building From Scratch: What Nobody Tells You About AI API Speed
Sections (different from original):
- Opening: My wakeup call with latency
- Why I care about tokens/sec (architecture decisions)
- The test rig I built
- The raw numbers (table - keep exact)
- Tiers that actually matter for a CTO
- Geography matters more than I thought
- What this means for product decisions
- Vendor lock-in: my multi-provider strategy
- The code I run in production
- What I'd do differently
- CTA about Global API
Let me write this out, making sure I keep all exact numbers and model names. I need to be careful with prices:
- Step-3.5-Flash: $0.15
- DeepSeek V4 Flash: $0.25
- Hunyuan-TurboS: $0.28
- Qwen3-8B: $0.01
- Qwen3-32B: $0.28
- Doubao-Seed-Lite: $0.40
- Hunyuan-Turbo: $0.57
- GLM-4-32B: $0.56
- Qwen3.5-27B: $0.19
- DeepSeek V4 Pro: $0.78
- MiniMax M2.5: $1.15
- GLM-5: $1.92
- Kimi K2.5: $3.00
- DeepSeek-R1: $2.50
- Qwen3.5-397B: $2.34
TTFT and tokens/sec:
- Step-3.5-Flash: 120ms, 80 tok/s
- DeepSeek V4 Flash: 180ms, 60 tok/s
- Hunyuan-TurboS: 200ms, 55 tok/s
- Qwen3-8B: 150ms, 70 tok/s
- Qwen3-32B: 250ms, 45 tok/s
- Doubao-Seed-Lite: 220ms, 50 tok/s
- Hunyuan-Turbo: 280ms, 42 tok/s
- GLM-4-32B: 300ms, 38 tok/s
- Qwen3.5-27B: 350ms, 35 tok/s
- DeepSeek V4 Pro: 400ms, 30 tok/s
- MiniMax M2.5: 450ms, 28 tok/s
- GLM-5: 500ms, 25 tok/s
- Kimi K2.5: 600ms, 20 tok/s
- DeepSeek-R1: 800ms, 15 tok/s
- Qwen3.5-397B: 1200ms, 10 tok/s
Geographic:
- DeepSeek V4 Flash: 180ms US, 150ms Asia, -30ms diff
- Qwen3-32B: 250ms US, 210ms Asia, -40ms diff
- GLM-5: 500ms US, 420ms Asia, -80ms diff
- Kimi K2.5: 600ms US, 480ms Asia, -120ms diff
Now let me write this. I need to hit at least 1500 words and make it sound like a real CTO's experience.
Building From Scratch: What Nobody Tells You About AI API Speed in 2026
Six months ago I shipped a chatbot that was technically correct. The answers were good. The architecture was clean. The pricing was sane. And we were bleeding users.
The retention chart told the story: people would type a question, stare at the screen, and close the tab before the first word arrived. We weren't slow by 2010s standards. We were slow by 2026 standards — and that gap is where startups die.
That failure sent me down a rabbit hole. I burned two weeks and a chunk of our infrastructure budget benchmarking every fast model I could get my hands on through Global API. What I found changed how I think about model selection entirely, and it has nothing to do with leaderboard scores.
Here's the field report.
Why Tokens/Second Is an Architecture Decision, Not a Performance Detail
Most blog posts treat speed as a UX concern. It is, but it goes deeper than that. Token throughput changes your cost structure, your scaling curve, and even which features are economically viable to ship.
A model pumping out 80 tokens per second versus one doing 20 changes your infrastructure math by 4x. That's not a "nice to have" — that's the difference between a unit-economical product and one that bleeds cash at scale. When I look at a model, I'm asking three questions:
- How fast does it actually go in production, not on a marketing page?
- What's the realistic cost per million output tokens once I account for retries and streaming overhead?
- Can I switch providers in a week if the model degrades or the vendor raises prices?
That third one — the exit ramp — is what keeps me up at night. Vendor lock-in is the silent assassin of AI startups. The moment your product is tightly coupled to a single provider's API surface, you've lost negotiating power and engineering optionality. So I run a multi-provider stack behind a thin abstraction, and I care a lot about whether a given model is reachable through a neutral gateway. Global API fits that role for me, and I'll show you the code at the end.
The Test Rig I Built
I'm not going to pretend my setup was academic. I wrote a Python script, ran it from two regions, averaged the numbers, and called it a day. Here's the configuration I used:
| Parameter | Value |
|---|---|
| Test Date | May 20, 2026 |
| Test Region | US East (Ohio), Asia (Singapore) |
| Test Prompt | "Explain recursion in 200 words" |
| Output Tokens | ~150 tokens per test |
| Iterations | 10 runs, average recorded |
| Streaming | Yes (SSE) |
| API | Global API (https://global-apis.com/v1) |
I picked the prompt deliberately. It's short enough that prefill overhead matters, long enough that sustained throughput dominates, and boring enough that any model should handle it well. If a model chokes on "explain recursion," I don't want it in production regardless of its benchmark scores.
Streaming is non-negotiable in my book. Time-to-first-token is the metric users actually feel. Sustained tokens/sec is the metric I feel when I look at the AWS bill.
The Leaderboard, Raw
Here's the full table I ended up with. I'm reproducing it verbatim from my notes because I want you to see the actual numbers without my editorializing first:
| Rank | Model | TTFT (ms) | Tokens/sec | Provider | $/M Output |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 2 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 3 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 4 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 5 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 6 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |
A note on the bottom of the table: the reasoning-class models (R1, K2.5, and the thinking variants) include internal deliberation time before they emit a visible token. That 800ms TTFT on DeepSeek-R1 isn't the model being slow — it's the model being thoughtful. Don't penalize them for doing the thing you asked them to do. Penalize them only if you didn't want the thinking in the first place.
The Tiers That Actually Matter
I'm going to skip the formal tier breakdown everyone uses and instead talk about how I think about these groupings when I'm pricing a feature.
The "Does It Even Need a Smart Model?" Tier
Qwen3-8B at $0.01/M and 70 tokens/sec is the most underrated model in the entire list. I use it for classification, intent detection, routing, simple reformatting, and a dozen other tasks where a smaller model is genuinely sufficient. At a penny per million output tokens, I don't even think about it — I just call it. The ROI on this model is essentially infinite.
Step-3.5-Flash at $0.15/M and 80 tokens/sec is the speed king. If I have a feature where latency is the product — autocomplete, real-time suggestions, anything the user is actively watching fill in — this is my default. The quality is good enough for conversational tasks. It's not what I reach for when I need careful reasoning, but for 80% of "fast LLM" use cases, it just works.
The "Sweet Spot" Tier
This is where most production traffic should live.
DeepSeek V4 Flash at $0.25/M, 60 tok/s, 180ms TTFT is the model I keep coming back to. It hits a quality bar that I can confidently put in front of paying users, runs at a speed that keeps the chat feeling responsive, and costs less than my coffee budget. If I had to pick one model for a general-purpose assistant, this would be it. The fact that it's reachable through a single endpoint regardless of where my traffic originates is a meaningful operational win.
Hunyuan-TurboS at $0.28/M sits right next to it on the speed curve and offers a slightly different quality profile. For certain multilingual tasks it actually outperforms the DeepSeek option. I keep both warm in my routing layer and switch based on language detection.
Qwen3-32B at $0.28/M is the wildcard — slower at 45 tok/s with a 250ms TTFT, but the quality jump is real when you need it. I use this for tasks where Flash-tier models start making embarrassing mistakes.
The "I Need This to Be Right" Tier
Once you cross the $0.50/M line, you're paying for correctness over speed. The model is going to think harder and the latency budget goes up.
DeepSeek V4 Pro at $0.78/M and 30 tok/s is my go-to for anything involving code generation, structured extraction, or tasks where a wrong answer creates user-visible bugs. The 400ms TTFT is noticeable but tolerable. The 30 tok/s sustained throughput means long outputs feel slow, so I keep generations short.
MiniMax M2.5 at $1.15/M earns its place when I need a specific capability profile — multimodal reasoning, longer context handling, or higher reliability on edge cases. At 28 tok/s it's not what I'd pick for chat, but it has been a workhorse for our document analysis pipeline.
GLM-5 at $1.92/M is in the same conversation but with stronger reasoning chops. The 25 tok/s means I'm careful about how I use it.
The "Premium Reasoning" Tier
Kimi K2.5 at $3.00/M and DeepSeek-R1 at $2.50/M are not chat models. They're thinking machines. The 600-800ms TTFT and 15-20 tok/s throughput are byproducts of the work being done, not flaws. I route specific high-stakes queries to these — complex multi-step planning, math, anything where I need the model to show its work — and I budget for the latency.
Qwen3.5-397B at $2.34/M sits at the bottom of the speed chart with 1200ms TTFT and 10 tok/s. It's the largest model in the test and it shows. I touched it once, confirmed the quality was real, and went back to V4 Pro for almost everything.
Geography Is Half the Battle
Here's something the benchmarks alone don't tell you: where you measure from matters enormously.
| Model | US East TTFT | Asia TTFT | Diff |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |
Two things jump out. First, the Asian-originated models (Qwen, GLM, Kimi) get a roughly 16-20% latency haircut when called from Singapore. That makes sense — the servers are physically closer. But the second observation is the one that changes architecture: DeepSeek is well-distributed globally. The 30ms improvement is small because the US baseline was already competitive. If your users are spread across continents, this is the kind of detail that determines whether your worst-case latency is 150ms or 400ms.
I learned this the hard way. We initially routed all traffic through a US endpoint. Our Singapore-based beta users complained about a sluggish feel. Switching to a region-aware routing layer fixed the perception overnight, without changing the model at all. Cost of the fix: an afternoon of work. Cost of not doing it: a chunk of our APAC funnel.
What This Means for Product Decisions
Let me put it bluntly. If your TTFT is over 400ms, you have a problem. If it's over 800ms, you have an emergency. Users don't think in milliseconds — they think in "did this thing respond?" — but the thresholds are real:
| TTFT | User Perception |
|---|---|
| < 200ms | Instant — feels like the system anticipated them |
| 200-400ms | Fast — acceptable for most flows |
| 400-800ms | Noticeable delay — some users start multitasking |
| 800ms+ | Slow — users tab away |
For interactive chat, I keep TTFT under 400ms. That means my default rotation is Step-3.5-Flash, DeepSeek V4 Flash, Qwen3-8B, and Hunyuan-TurboS — the four models that hit that bar. Everything else gets used for background work, batch jobs, or features where the user submitted something and is willing to wait for a thoughtful response.
For the interactive tier, the cost spread is $0.01 to $0.28 per million output tokens. At my volumes, that's the difference between a sustainable margin and an existential crisis. Routing 70% of my traffic to Qwen3-8B for tasks it can handle well saves me real money every month, and the quality is good enough that users can't tell.
The Code I Actually Run in Production
Here's a simplified version of the routing layer I have deployed. It's nothing fancy — a function that picks a model based on the task profile, and a thin client that talks to Global API. The base URL is https://global-apis.com/v1 and I treat it as a single entry point to everything.
python
import os
import time
import httpx
from dataclasses import dataclass
BASE_URL = "https://global-apis.com/v1"
API_KEY = os.environ["GLOBAL_API_KEY"]
@dataclass
class ModelSpec:
name: str
cost_per_m_output: float
ttft_budget_ms: int
use_for: list[str]
REGISTRY = {
"tier-instant": ModelSpec("step-3.5-flash", 0.15, 200, ["autocomplete", "snippets"]),
"tier-default": ModelSpec("deepseek-v4-flash", 0.25, 400, ["chat", "summarize", "qa"]),
"tier-budget": ModelSpec("qwen3-8b", 0.01, 300, ["classify", "route", "extract"]),
"tier-quality": ModelSpec("deepseek-v4-pro", 0.78, 600, ["code", "structured"]),
"tier-reasoning": ModelSpec("deepseek-r1", 2.50, 1200, ["planning", "math", "multi-step"]),
}
def route_task(task: str) -> ModelSpec:
for tier, spec in REGISTRY.items():
if task in spec.use_for:
return spec
return REGISTRY["tier-default"]
def stream_completion(prompt: str, task: str = "chat"):
spec = route_task(task)
start = time.perf_counter()
first_token_at = None
token_count = 0
with htt
Top comments (0)