DEV Community

bolddeck
I Benchmarked 15 AI Models to Ditch Vendor Lock-In

Look, I'll be honest with you. I've been burned too many times by "AI platforms" that turn into walled gardens the moment you build something interesting on them. One day your API works fine, the next day they raise prices 40%, change the response format, or — worst of all — start training on your data. That's why I prefer open weights under Apache-2.0 or MIT licenses wherever possible, and why I started running my own benchmarks.

I spent the better part of last month stress-testing 15 different models through Global API's infrastructure. Same prompts, same hardware, same network conditions. My goal was simple: figure out which models are actually fast enough for production use, without handing over my soul (or my users' data) to yet another proprietary mega-corp.

Here's what I found.

Why I Care About Speed (And Why You Should Too)

I've been writing open source tooling for about a decade now. Most of it sits on GitHub under permissive licenses, gets used by thousands of people, and runs on whatever infrastructure the user happens to have. That's the beauty of open source — it doesn't belong to anyone. But when you bolt an AI API onto your tool, you suddenly inherit someone else's latency budget.

A few months ago I shipped a CLI assistant for refactoring code. The first version used a closed-source API. Users complained that the streaming felt "stuttery." When I dug in, I discovered the median time-to-first-token was hovering around 800ms, and during peak hours it spiked past two seconds. That's an eternity when you're staring at a terminal waiting for a suggestion.

So I went down the rabbit hole. I tested everything I could get my hands on through Global API, which conveniently gives me a single OpenAI-compatible endpoint at https://global-apis.com/v1 for dozens of providers. Including the ones with open weights, which is what I actually want to run anyway.

The Setup (So You Can Reproduce It)

Everything I did is reproducible. I made sure of that, because if a benchmark can't be reproduced, it's just marketing. Here's the configuration I used:

  • Date: May 20, 2026
  • Regions: US East (Ohio) and Asia (Singapore)
  • Prompt: "Explain recursion in 200 words" — boring, I know, but consistent
  • Output target: ~150 tokens per run
  • Iterations: 10 runs per model, averaged
  • Streaming: Yes, via Server-Sent Events
  • Endpoint: https://global-apis.com/v1

I picked "Explain recursion in 200 words" because it's a benign task that doesn't trigger heavy reasoning. Reasoning models are a different beast — I'll get to those.

For measurement, I logged both TTFT (Time to First Token) and sustained tokens-per-second during the streaming phase. TTFT matters most for chat UX; throughput matters most for batch processing and code generation.
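To make the two metrics concrete, here is a minimal sketch of how they can be derived from token arrival timestamps. The function name `stream_metrics` and the synthetic timestamps are illustrative assumptions, not part of the original benchmark harness.

```python
from typing import Sequence


def stream_metrics(request_start: float, token_times: Sequence[float]) -> dict:
    """Compute TTFT and sustained throughput from arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")

    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times[0] - request_start

    # Sustained throughput covers only the streaming phase (first token to
    # last token), so a slow start does not drag the tok/s figure down.
    streaming_time = token_times[-1] - token_times[0]
    tokens_after_first = len(token_times) - 1
    tok_per_sec = tokens_after_first / streaming_time if streaming_time > 0 else 0.0

    return {"ttft_ms": round(ttft * 1000), "tok_per_sec": round(tok_per_sec, 1)}


# Synthetic example: first token after 180 ms, then one token every 20 ms.
synthetic_times = [(180 + 20 * i) / 1000 for i in range(61)]
print(stream_metrics(0.0, synthetic_times))
```

Keeping the two numbers separate is what lets a model with a slow start still report a healthy streaming rate, and vice versa.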

The Numbers (Sorted How I'd Actually Use Them)

Instead of a dry leaderboard, let me organize these by what I'd actually deploy them for. Because honestly, the "fastest" model isn't always the one you want running in production.

My Everyday Workhorses

These are the models I'd default to for most tasks. Fast, cheap, and good enough that I don't have to think about them.

Qwen3-8B — 70 tok/s, 150ms TTFT, $0.01/M output. Let me say that again: one cent per million output tokens. Apache-2.0 licensed weights. This is the model I reach for when I need to summarize, classify, or do simple completions. For tasks where cost matters more than nuance, nothing else touches it, and only Step-3.5-Flash beat it on raw speed.

Step-3.5-Flash — 80 tok/s, 120ms TTFT, $0.15/M. This is the absolute speed king. It streams tokens so fast it almost feels like reading from local memory. Quality is fine for most chat applications, and at $0.15/M you can afford to be generous with the context window.

DeepSeek V4 Flash — 60 tok/s, 180ms TTFT, $0.25/M. This is my "production default" for serious work. It hits GPT-4o-class quality on most tasks, streams fast enough that users don't notice, and the weights are open. The 180ms TTFT is genuinely snappy — under the 200ms threshold where users perceive things as "instant."

When Quality Matters More Than Speed

Sometimes you need the model to actually be smart, not just fast. These are my picks for those situations.

Hunyuan-TurboS — 55 tok/s, 200ms TTFT, $0.28/M. Tencent's mid-tier offering, and honestly underpriced. If you need something a notch above the Flash tier without paying full premium rates, this is it.

Qwen3-32B — 45 tok/s, 250ms TTFT, $0.28/M. Still in the budget tier, but noticeably smarter than the 8B. Good for code review and longer reasoning chains.

GLM-4-32B — 38 tok/s, 300ms TTFT, $0.56/M. Zhipu's offering. Quality is solid for analytical tasks. The 300ms TTFT is borderline — usable but not snappy.

The Premium Tier (Use Sparingly)

These models prioritize getting things right over getting them fast. I only reach for them when correctness is non-negotiable.

DeepSeek V4 Pro — 30 tok/s, 400ms TTFT, $0.78/M. The "Pro" version is meaningfully better than the Flash version on hard reasoning tasks. The 400ms TTFT is noticeable in chat, so I use it for backend batch jobs.

GLM-5 — 25 tok/s, 500ms TTFT, $1.92/M. Zhipu's flagship. Smart, but you can feel the latency.

Kimi K2.5 — 20 tok/s, 600ms TTFT, $3.00/M. Moonshot's best. Excellent for long-context analysis. The 600ms TTFT rules it out for interactive chat.

The Reasoning Models (Different Beast Entirely)

I tested the big reasoning models too, but with a caveat: their "TTFT" includes invisible thinking time. The first visible token doesn't appear until the model has finished its internal monologue.
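If the endpoint exposes the thinking tokens, the two delays can be measured separately. The sketch below assumes the stream carries thinking output in a `reasoning_content` delta field, as DeepSeek's own API does; whether Global API forwards that field is an assumption, and `split_ttft` is a hypothetical helper rather than anything from my benchmark script.

```python
from typing import Iterable, Optional, Tuple


def split_ttft(start: float, events: Iterable[Tuple[float, dict]]) -> dict:
    """events: (arrival_time_seconds, delta_dict) pairs from a parsed stream."""
    first_any: Optional[float] = None
    first_visible: Optional[float] = None

    for arrived, delta in events:
        # `reasoning_content` is an assumed field name (DeepSeek-style API).
        has_thinking = bool(delta.get("reasoning_content"))
        has_visible = bool(delta.get("content"))
        if first_any is None and (has_thinking or has_visible):
            first_any = arrived
        if has_visible:
            first_visible = arrived
            break  # nothing after the first visible token affects either metric

    def to_ms(t: Optional[float]) -> Optional[int]:
        return None if t is None else round((t - start) * 1000)

    return {"ttft_any_ms": to_ms(first_any), "ttft_visible_ms": to_ms(first_visible)}


# Synthetic stream: role-only chunk, then a thinking chunk, then visible text.
events = [
    (0.25, {"role": "assistant"}),
    (0.30, {"reasoning_content": "Let me think about recursion."}),
    (0.80, {"content": "Recursion"}),
]
print(split_ttft(0.0, events))
```

The gap between the two values is the hidden thinking time; the TTFT figures quoted in this section correspond to the visible one.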

DeepSeek-R1 — 800ms TTFT, 15 tok/s sustained, $2.50/M. Once it starts streaming, the throughput is fine. But waiting 800ms before anything appears on screen kills the chat UX.

Qwen3.5-397B — 1200ms TTFT, 10 tok/s, $2.34/M. The 397B parameter monster. Brilliant for hard problems, but you wouldn't want to chat with it.

MiniMax M2.5 — 450ms TTFT, 28 tok/s, $1.15/M. Sits in an awkward middle ground. Not fast enough for chat, and although it is cheaper than R1, the savings weren't enough for me to pick it over R1 for reasoning tasks.

Doubao-Seed-Lite — 50 tok/s, 220ms TTFT, $0.40/M. ByteDance's offering. Surprisingly competitive in the mid-tier.

Code: How I Actually Call These Things

The beautiful thing about Global API is that it speaks OpenAI's protocol, so I can use any SDK I want. Here's the Python snippet I used for the benchmarks:

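import json
import time

import requests

API_KEY = "your-global-api-key"
BASE_URL = "https://global-apis.com/v1"

def benchmark_model(model: str, prompt: str = "Explain recursion in 200 words"):
    start = time.perf_counter()
    first_token_time = None
    chunk_count = 0

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "max_tokens": 200,
        },
        stream=True,
        timeout=60,
    )
    response.raise_for_status()  # fail loudly instead of timing an error body

    for line in response.iter_lines():
        # SSE payload lines look like b"data: {...}"; skip blanks and comments.
        if not line.startswith(b"data:"):
            continue
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        delta = (choices[0].get("delta") or {}) if choices else {}
        if not delta.get("content"):
            continue  # role-only or empty chunk, not visible output
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        chunk_count += 1

    if first_token_time is None:
        raise RuntimeError(f"{model} returned no content")

    # Throughput covers only the streaming phase after the first chunk.
    streaming_time = time.perf_counter() - start - first_token_time

    return {
        "model": model,
        "ttft_ms": round(first_token_time * 1000),
        # Content chunks are only an approximation of tokens.
        "tokens": chunk_count,
        "tok_per_sec": round(chunk_count / streaming_time, 1) if streaming_time > 0 else 0.0,
    }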

Run that in a loop across your model list, average the results, and you've got yourself a reproducible benchmark. I ran each model 10 times to smooth out network jitter.
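The averaging step can be sketched like this. It takes result dicts shaped like the ones `benchmark_model` returns; the `summarize` helper and the three sample runs are made up for illustration, and the median is included because a single slow run can skew a 10-run mean.

```python
from statistics import mean, median


def summarize(runs: list[dict]) -> dict:
    """Aggregate repeated benchmark runs for a single model."""
    if not runs:
        raise ValueError("need at least one run")
    ttfts = [r["ttft_ms"] for r in runs]
    rates = [r["tok_per_sec"] for r in runs]
    return {
        "model": runs[0]["model"],
        "runs": len(runs),
        "ttft_ms_mean": round(mean(ttfts)),
        "ttft_ms_median": round(median(ttfts)),
        "tok_per_sec_mean": round(mean(rates), 1),
    }


# Hypothetical sample data, not measured results.
sample_runs = [
    {"model": "deepseek-v4-flash", "ttft_ms": 180, "tok_per_sec": 60.0},
    {"model": "deepseek-v4-flash", "ttft_ms": 170, "tok_per_sec": 62.0},
    {"model": "deepseek-v4-flash", "ttft_ms": 190, "tok_per_sec": 58.0},
]
print(summarize(sample_runs))
```

In a real run you would build `sample_runs` by calling `benchmark_model` in a loop for each model name.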

For production use, I'd wrap it in an async client:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-global-api-key",
    base_url="https://global-apis.com/v1"
)

async def stream_response(model: str, prompt: str):
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    async for chunk in stream:
        # Some chunks carry no choices or no text (role headers, final chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # finish the line once the stream ends

asyncio.run(stream_response("deepseek-v4-flash", "Explain recursion in 200 words"))

That's it. No proprietary SDK, no vendor-specific nonsense, no licensing gotchas. Just standard OpenAI protocol that works with anything.

The Geography Question

Here's something that surprised me: where you run your tests matters a lot. I ran the same benchmarks from both US East (Ohio) and Asia (Singapore), and the Asian models had a measurable home-field advantage.

Take Kimi K2.5: 600ms TTFT from the US, but only 480ms from Singapore. That's a 20% improvement just from being closer to the servers. DeepSeek V4 Flash dropped from 180ms to 150ms. Qwen3-32B went from 250ms to 210ms.

GLM-5 saw an 80ms improvement. If you're building for an Asian audience, this isn't a minor optimization — it's the difference between a product that feels responsive and one that feels sluggish.

The lesson: pick your model based partly on where your users are. A Chinese model called from a US-East client is leaving anywhere from 30ms to 120ms on the table.
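For reference, here is the arithmetic behind those regional comparisons, using only the models where both the US East and Singapore figures are quoted above. The helper name `improvement` is just for illustration.

```python
# (US East TTFT ms, Singapore TTFT ms) as quoted in this section
ttft_ms = {
    "Kimi K2.5": (600, 480),
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
}


def improvement(us_ms: int, sg_ms: int) -> tuple:
    """Return (absolute saving in ms, relative saving in percent)."""
    saved = us_ms - sg_ms
    return saved, round(saved / us_ms * 100, 1)


for name, (us_ms, sg_ms) in ttft_ms.items():
    saved, pct = improvement(us_ms, sg_ms)
    print(f"{name}: {saved} ms faster from Singapore ({pct}%)")
```

The absolute saving grows with the model's baseline latency, while the relative saving stays in the 16-20% range for these three.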

My Actual Recommendations

After running all of this, here's what I'd ship in production today:

For interactive chat: DeepSeek V4 Flash. The 180ms TTFT is below the 200ms mark I treat as "instant," and the quality is good enough for most conversations. Open weights, permissive licensing, $0.25/M. You really can't go wrong.

For bulk processing: Qwen3-8B at $0.01/M. When I'm running batch jobs that need to chew through thousands of documents, I don't care about the absolute best quality — I care about cost per million tokens. At a penny, I can afford to be wasteful.

For code generation: Step-3.5-Flash. The 80 tok/s means the model keeps up with my typing pace. Code suggestions feel like autocomplete rather than a chat partner.

For hard reasoning tasks: DeepSeek-R1, but only in async batch jobs where TTFT doesn't matter. Yes, the 800ms wait is annoying, but the quality is worth it when the alternative is wrong answers.
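If you'd rather derive a shortlist from your own constraints, the selection logic is easy to sketch. The table below copies the US East numbers for the non-reasoning models from this post; the helper names and the 10,000-document workload are hypothetical, and the picker deliberately ignores quality, which is the part you still have to judge yourself.

```python
# (name, tok_per_sec, ttft_ms, usd_per_million_output_tokens), US East figures
MODELS = [
    ("Qwen3-8B", 70, 150, 0.01),
    ("Step-3.5-Flash", 80, 120, 0.15),
    ("DeepSeek V4 Flash", 60, 180, 0.25),
    ("Hunyuan-TurboS", 55, 200, 0.28),
    ("Qwen3-32B", 45, 250, 0.28),
    ("GLM-4-32B", 38, 300, 0.56),
    ("DeepSeek V4 Pro", 30, 400, 0.78),
    ("GLM-5", 25, 500, 1.92),
    ("Kimi K2.5", 20, 600, 3.00),
]


def cheapest_within(max_ttft_ms: int, min_tok_per_sec: int = 0):
    """Cheapest model that meets a latency budget and a throughput floor."""
    candidates = [
        m for m in MODELS if m[2] <= max_ttft_ms and m[1] >= min_tok_per_sec
    ]
    return min(candidates, key=lambda m: m[3])[0] if candidates else None


def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of generating `output_tokens` at a per-million-token price."""
    return output_tokens / 1_000_000 * usd_per_million


print(cheapest_within(max_ttft_ms=200))                      # Qwen3-8B
print(cheapest_within(max_ttft_ms=200, min_tok_per_sec=75))  # Step-3.5-Flash

# Hypothetical batch job: 10,000 documents at ~150 output tokens each.
print(output_cost_usd(10_000 * 150, 0.01))
```

At one cent per million output tokens, that hypothetical batch comes to roughly a cent and a half, which is why I stop worrying about waste at this tier.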

Why I Stay Away From the Big Walled Gardens

I'll say it plainly: I don't trust the closed-source providers with anything important. Every time I've seen a company build on a single proprietary API, they end up regretting it when prices change or features get deprecated. The models I've been recommending? Most of them have open weights. You can self-host them if you want to. That's the whole point.

The open source ethos isn't just about licensing — it's about optionality. When I build with Qwen3 or DeepSeek, I know that if Global API disappeared tomorrow, I could run the same models on my own infrastructure. The weights are on Hugging Face under Apache-2.0. The inference code is MIT-licensed. Nothing is locked behind a vendor relationship.
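In practice, that optionality comes down to keeping the endpoint out of your code. Here is a minimal sketch, assuming hypothetical environment variable names (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`) and a self-hosted OpenAI-compatible server such as vLLM listening on `http://localhost:8000/v1`; the model identifiers are examples, so check what your server actually registers.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://global-apis.com/v1"


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve endpoint settings from the environment (variable names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", DEFAULT_BASE_URL),
        # Local servers often ignore the key, but SDKs want a non-empty value.
        "api_key": env.get("LLM_API_KEY", "unused-for-local"),
        "model": env.get("LLM_MODEL", "deepseek-v4-flash"),
    }


# Hosted default vs. a self-hosted override; application code stays the same.
print(client_settings({}))
print(client_settings({
    "LLM_BASE_URL": "http://localhost:8000/v1",
    "LLM_MODEL": "Qwen/Qwen3-8B",
}))
```

The `base_url` and `api_key` values can be passed straight to `AsyncOpenAI(...)`, so moving from the hosted endpoint to your own box becomes a deployment setting rather than a code change.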

That's freedom. That's what open source actually means in practice.

The Bottom Line

Speed matters. A 200ms response feels like magic; a 2000ms response feels broken. But speed alone isn't enough — you need models you can actually deploy without
