DEV Community

eagerspark

I Stress-Tested 4 Chinese LLMs in Production — Here's What Won

Six months ago I was burning through about $14k a month on OpenAI. Then I started poking at the Chinese open-weight ecosystem as a backup. What happened next wasn't a graceful migration — it was me realizing I'd been overpaying for months.

This is the field report. If you're a technical founder, an engineering lead, or anyone making architecture decisions about LLM spend, I want to save you the 200 hours of testing I did. I'm going to walk through DeepSeek, Qwen, Kimi, and GLM — not as benchmarks in a sterile lab, but as production workhorses I actually shipped code against.

All of this was run through Global API's unified endpoint, which I'll touch on at the end because it changed how I think about vendor lock-in entirely.


The $0.01 Question That Started Everything

Our trigger event was dumb. I needed to classify 2 million customer support tickets and route them to the right team. The task was simple — could a model pick from 12 categories reliably? GPT-4o handled it fine, but at our volume, it would've cost about $4,000/month in inference alone. For a classification job. I felt sick.

A friend pinged me: "Have you tried Qwen3-8B?" I hadn't. We wired it up through Global API, ran the same classification, and the total bill came out to roughly $40. Not $4,000. Forty dollars.
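A closed-set classification job like this needs very little code. Below is a minimal sketch, not the production setup: the category names are placeholders (the post does not list the real 12), the `other` fallback is an assumption, and `complete` stands in for any function that sends a prompt to a model and returns its text.

```python
from typing import Callable, Sequence

# Hypothetical category names; the real job used 12 that the post does not list.
CATEGORIES = ("billing", "bug_report", "account_access", "feature_request")
FALLBACK = "other"


def build_prompt(ticket: str, categories: Sequence[str] = CATEGORIES) -> str:
    """Ask for exactly one label from a closed set."""
    options = ", ".join(categories)
    return (
        "Classify the support ticket into exactly one category.\n"
        f"Allowed categories: {options}\n"
        "Reply with the category name only.\n\n"
        f"Ticket: {ticket}"
    )


def parse_label(raw: str, categories: Sequence[str] = CATEGORIES) -> str:
    """Normalize the model's reply; anything outside the set becomes FALLBACK."""
    cleaned = raw.strip().strip(".\"'`").lower().replace(" ", "_")
    return cleaned if cleaned in categories else FALLBACK


def classify(ticket: str, complete: Callable[[str], str]) -> str:
    """`complete` is any prompt -> text function, e.g. a thin wrapper around a chat completions call."""
    return parse_label(complete(build_prompt(ticket)))
```

Keeping the parsing strict matters more with small models: a reply that wanders outside the allowed set lands in the fallback bucket instead of silently creating a thirteenth category.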

That kicked off what I now call "the rotation" — a process where I started running every model I could get my hands on through the same evaluation harness. The four families that consistently rose to the top were DeepSeek, Qwen, Kimi, and GLM. Here's the bottom line up front:

  • If you want pure price-to-performance for English workloads, DeepSeek V4 Flash at $0.25/M output is absurdly good.
  • If you need breadth — multimodal, vision, a dozen model sizes for different jobs — Qwen is the only real answer.
  • If your product lives or dies on reasoning quality, Kimi K2.5 at $3.00/M is worth every cent.
  • If you serve Chinese-language users, GLM-5 at $1.92/M and the smaller GLM-4-9B at $0.01/M are the play.

That last one matters more than people outside of China realize. If you're shipping to Mainland Chinese customers, the difference between a native-trained model and a translated Western model isn't subtle — it's the difference between a product people use and a product they tolerate.
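"The rotation" boils down to running identical cases through every model and comparing scores. The harness itself is not shown in this post, so the following is only a sketch of the idea, assuming exact-match scoring and a `models` dict that maps a name to a prompt-to-text callable (both are illustrative choices).

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def accuracy(complete: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases where the reply matches the expected answer (case-insensitive)."""
    if not cases:
        return 0.0
    hits = sum(
        1
        for prompt, expected in cases
        if complete(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def run_rotation(
    models: Dict[str, Callable[[str], str]], cases: List[Case]
) -> List[Tuple[str, float]]:
    """Score every model on the same cases and return them best first."""
    scores = [(name, accuracy(fn, cases)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Exact match only suits tasks with a single right answer, such as classification. Open-ended output would need a different scorer, but the shape of the loop stays the same.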


The Cheat Sheet I Keep Open in My Browser

Before I get into the war stories, here's the table I have pinned in Notion. These are the numbers I actually quote when my CEO asks "why are we switching models again?"

| Dimension | DeepSeek | Qwen | Kimi | GLM |
| --- | --- | --- | --- | --- |
| Developer | DeepSeek (幻方) | Alibaba (阿里) | Moonshot AI (月之暗面) | Zhipu AI (智谱) |
| Output price range | $0.25-$2.50/M | $0.01-$3.20/M | $3.00-$3.50/M | $0.01-$1.92/M |
| My daily driver | V4 Flash @ $0.25/M | Qwen3-32B @ $0.28/M | K2.5 @ $3.00/M | GLM-5 @ $1.92/M |
| Code generation | Top tier | Strong | Strong | Decent |
| Chinese quality | Strong | Strong | Best in class | Best in class |
| English quality | Best in class | Strong | Strong | Strong |
| Reasoning | Strong | Strong | Best in class | Strong |
| Raw speed | Fastest | Fast | Slower | Fast |
| Vision / multimodal | Limited | Yes (VL, Omni) | No | Yes (GLM-4.6V) |
| Context window | 128K | 128K | 128K | 128K |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |

Notice that last row. This is the part vendors don't tell you: every one of these providers speaks the OpenAI protocol. That means the switching cost between them is basically zero, provided you architect correctly. I'll come back to this.


DeepSeek: The One That Made Me Reconsider My Whole Stack

I want to be honest about my DeepSeek bias. After three months of running it in production, it's now my default for roughly 70% of inference calls. Not because it's the best at everything — it's not — but because at $0.25/M output, V4 Flash hits a sweet spot that I haven't found anywhere else.

The lineup I actually use:

  • DeepSeek V4 Flash — $0.25/M. My default. Coding, content, summarization, the boring 80% of LLM work that powers most apps.
  • V3.2 — $0.38/M. Their newest architecture. Slightly better quality, slightly more expensive. I use it for tasks where I want a touch more polish.
  • V4 Pro — $0.78/M. The "I actually care about this output" tier. Marketing copy, customer-facing emails.
  • R1 (Reasoner) — $2.50/M. Math, logic, anything where getting the wrong answer costs more than the inference. The chain-of-thought reasoning here is genuinely impressive.
  • Coder — $0.25/M. Code-specific tuning, same price as Flash.

What I love: the price-to-performance ratio is bonkers. V4 Flash clocks around 60 tokens/second in our setup, which is competitive with anything I've measured from Western providers. HumanEval and MBPP scores put it in the same conversation as GPT-4o, and frankly our internal evals on coding tasks showed it edging out the more expensive model on a few prompts.

What I don't love: vision is basically absent. If your product needs to look at images, you're not staying on DeepSeek alone. The other thing — and this is more philosophical — is that Chinese-language output from DeepSeek is good, but GLM and Kimi are noticeably more native. If you're building for Chinese consumers, you'll feel the difference.

Here's the actual snippet I use to swap in DeepSeek. I keep this in a config file:

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

def call_deepseek_flash(prompt: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

result = call_deepseek_flash("Explain quantum computing in 100 words")
print(result)

The whole thing took me about six minutes to set up. That OpenAI compatibility is doing a lot of heavy lifting for us.


Qwen: The Model Family I Wish Existed Three Years Ago

Alibaba has built something I genuinely didn't think was possible: a model family where I can find a sensible option at literally every price point. If you've ever been frustrated by the gap between "tiny model that's too dumb" and "big model that costs too much," Qwen solves that.

Here's my actual shortlist from their catalog:

  • Qwen3-8B — $0.01/M. The $0.01 model. I'm still slightly in disbelief that this works at all, but for simple classification, extraction, and routing tasks, it's shockingly competent.
  • Qwen3-32B — $0.28/M. My second-most-used model. The general-purpose workhorse. For tasks where DeepSeek V4 Flash is too risky and I need a bit more reliability.
  • Qwen3-Coder-30B — $0.35/M. Code generation specialist. Good when I'm working on something tricky and want a second opinion.
  • Qwen3-VL-32B — $0.52/M. Vision-language. This is what I reach for when DeepSeek can't help because there's an image involved.
  • Qwen3-Omni-30B — $0.52/M. The one that handles audio, video, and images. I haven't deployed this in production yet, but I'm watching it closely.
  • Qwen3.5-397B — $2.34/M. Enterprise reasoning. When the model needs to actually think hard.

The breadth of the catalog is the real story. I have a routing layer in our backend that picks the cheapest Qwen model that can handle a given task with acceptable quality. For some prompts, that means the $0.01/M 8B. For others, it means the $2.34/M 397B. The economic value of not having to use the same model for everything is hard to overstate.
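One way to express "cheapest model that clears the quality bar" is sketched below. The prices come from the list above; the quality scores are made-up placeholders for whatever your own evals produce, and the names are display names rather than confirmed API model IDs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Candidate:
    name: str                  # display name from the list above, not necessarily the API model ID
    price_per_m: float         # output price in USD per million tokens (from the post)
    quality: Dict[str, float]  # task type -> eval score in [0, 1]; the numbers below are made up


CANDIDATES = [
    Candidate("Qwen3-8B", 0.01, {"classification": 0.93, "summarization": 0.70}),
    Candidate("Qwen3-32B", 0.28, {"classification": 0.96, "summarization": 0.88}),
    Candidate("Qwen3.5-397B", 2.34, {"classification": 0.97, "summarization": 0.95}),
]


def cheapest_capable(
    task: str, min_quality: float, candidates: Iterable[Candidate] = CANDIDATES
) -> Optional[Candidate]:
    """Return the lowest-priced candidate whose score for `task` meets the bar, or None."""
    capable = [c for c in candidates if c.quality.get(task, 0.0) >= min_quality]
    return min(capable, key=lambda c: c.price_per_m) if capable else None
```

Returning `None` when nothing qualifies is deliberate: the caller has to decide whether to lower the bar or leave the Qwen family, rather than silently getting the most expensive model.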

The weakness: the naming. I have a running joke with my team that every time Alibaba announces a new model, I have to spend 20 minutes figuring out how it relates to the previous one. Qwen3 vs Qwen3.5 vs Qwen3.6, the 8B/32B/30B/35B/397B family — it's a lot. I'd pay extra for a clearer versioning scheme.

Qwen's English is also a notch below the best. It's good. It's not DeepSeek-good. If your evals and your users can both detect the difference, it matters. If they can't, save the money.

One pricing note: some of the Qwen3.6 models are priced higher than I'd expect. Qwen3.6-35B at $1/M output feels steep when DeepSeek V4 Pro is $0.78/M and is, in my experience, slightly better. Watch those tiers carefully.
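To see what that price gap means in dollars, the arithmetic is simple enough to script. The sketch below uses the two prices quoted above and a hypothetical volume of 500M output tokens a month; it ignores input-token charges entirely.

```python
def output_cost_usd(output_tokens: int, price_per_m: float) -> float:
    """Output-side cost only; input tokens are billed separately and ignored here."""
    return output_tokens / 1_000_000 * price_per_m


# Hypothetical volume: 500M output tokens a month.
MONTHLY_OUTPUT_TOKENS = 500_000_000

qwen36_35b = output_cost_usd(MONTHLY_OUTPUT_TOKENS, 1.00)  # Qwen3.6-35B at $1/M
v4_pro = output_cost_usd(MONTHLY_OUTPUT_TOKENS, 0.78)      # DeepSeek V4 Pro at $0.78/M
print(
    f"Qwen3.6-35B: ${qwen36_35b:,.2f}  "
    f"V4 Pro: ${v4_pro:,.2f}  "
    f"gap: ${qwen36_35b - v4_pro:,.2f}"
)
```

At that assumed volume the gap is about $110 a month, which is small in absolute terms. The comparison only gets interesting at much higher volumes or when the cheaper model is also the better one.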

Here's a quick Qwen swap, same pattern:

def call_qwen_32b(prompt: str) -> str:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,
    )
    return response.choices[0].message.content

# Used for: general content, summaries, structured extraction
result = call_qwen_32b("Write a Python function to merge two sorted lists")

Kimi: When You Can't Afford to Be Wrong

Kimi is the model I have a love-hate relationship with. I love the quality. I hate the bill.

Kimi doesn't pretend to be cheap. Their whole pitch is reasoning quality, and the pricing reflects that:

  • K2.5 — $3.00/M. The current flagship. Where I go when the answer has to be right.
  • The rest of the family sits in the $3.00-$3.50/M range. There is no "budget Kimi" option.

I use Kimi sparingly. Specifically, I use it for:

  1. Math-heavy reasoning in our analytics product.
  2. Multi-step agentic workflows where a wrong intermediate answer cascades into garbage downstream.
  3. Benchmarking. I always have one Kimi call in my eval suite to anchor what "good reasoning" looks like.

The honest truth: on raw reasoning benchmarks, Kimi is the best of the four. If you're building something where the user will notice if the model gets a hard problem wrong — legal, financial, medical-adjacent, complex code review — K2.5 is the answer. At $3.00/M output, you pay for that quality, but if the alternative is a wrong answer that costs you a customer, the math works out.

The weakness I run into: speed. Kimi is the slowest of the four. For real-time user-facing features where latency matters more than perfect reasoning, I don't reach for Kimi. I also don't use Kimi for cost-sensitive bulk processing — the price just doesn't fit.

I'm not including a Kimi-specific snippet here because, frankly, my Kimi calls go through the same generic client and are selected by my router when the task profile says "reasoning-heavy, cost-tolerant." That's the architecture lesson: don't hardcode a vendor. Let the routing layer pick.
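A task-profile rule of that kind can be very small. This is a sketch under assumptions: the `TaskProfile` fields and the exact rule are hypothetical, and only the two model IDs are taken from elsewhere in this post.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    reasoning_heavy: bool = False
    cost_tolerant: bool = False
    latency_sensitive: bool = False


def pick_model(profile: TaskProfile) -> str:
    """Rule-based selection; the rule mirrors the trade-offs described above."""
    # Kimi only when the task needs the reasoning, can absorb $3.00/M, and can wait.
    if profile.reasoning_heavy and profile.cost_tolerant and not profile.latency_sensitive:
        return "kimi-k2.5"
    # Everything else falls through to the cheap default.
    return "deepseek-v4-flash"
```

The application code only ever describes the task. Which vendor ends up serving it is a detail that lives in one function.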


GLM: The Underdog I Didn't Expect to Recommend

GLM-5 is the model I want to talk about for a second, because I think it gets undersold in Western developer discourse.

Zhipu AI has put together something genuinely good. The lineup:

  • GLM-4-9B — $0.01/M. Yes, another $0.01 model. Pairs nicely with Qwen3-8B as a budget option in my routing layer.
  • GLM-5 — $1.92/M. The flagship. Outstanding at Chinese-language tasks, competitive on English, and has vision via the GLM-4.6V variant.

Where GLM shines: Chinese-language generation. If your product is consumed by Chinese users — actual Mainland Chinese users, not just "we support Unicode" — GLM and Kimi are in a class of their own. DeepSeek and Qwen are good. GLM and Kimi sound like a native speaker wrote it. The difference matters for trust.

The vision support is also worth highlighting. GLM-4.6V handles image understanding, which gives GLM a multimodal story that DeepSeek and Kimi both lack.

My main use case: any feature where a Chinese user reads the output. Customer support replies to Chinese-language tickets, marketing copy for the Chinese market, internal documentation translation that needs to read naturally. I route all of that to GLM-5.
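Routing by reader language can start with a crude character check. The sketch below is a heuristic, not a language detector: it counts characters in the CJK Unified Ideographs block (which also covers Japanese kanji), and the 0.3 threshold and the English default are assumptions.

```python
def cjk_ratio(text: str) -> float:
    """Share of non-space characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(chars)


def model_for_reader(text: str, threshold: float = 0.3) -> str:
    """Route to GLM-5 when the text looks Chinese; otherwise use the English default."""
    return "glm-5" if cjk_ratio(text) >= threshold else "deepseek-v4-flash"
```

If your traffic includes Japanese or mixed-language tickets, a proper language-identification step would be a safer signal than a character range.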

The weakness: English quality is good but not best-in-class. For pure English workloads, DeepSeek V4 Flash and V4 Pro are better values. Also, the ecosystem is smaller — fewer community examples, less Stack Overflow coverage. You'll be reading the docs more often.


The Architecture Lesson That Actually Matters

Here's what I want you to take away from all of this, beyond the model comparisons. The most important decision I made wasn't picking a model. It was picking an abstraction layer.

Every model above — DeepSeek, Qwen, Kimi, GLM — is OpenAI-compatible. They all accept the same chat completions format. They all return the same response structure. The only thing that changes is the model name and the base URL.

That means I can write a single client wrapper, point it at a unified endpoint, and swap models in and out without rewriting application code. Here's roughly what that looks like:


from openai import OpenAI
from typing import Literal

ModelName = Literal[
    "deepseek-v4-flash",
    "deepseek-v4-pro",
    "Qwen/Qwen3-8B",
    "Qwen/Qwen3-32B",
    "kimi-k2.5",
    "glm-5",
]

class ModelRouter:
    ...  # body truncated in the source post
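The snippet above is cut off at the class definition, so the original router is not available. What follows is one possible shape for such a wrapper, not the author's implementation: the `complete` method, the fallback chain, and the default model are all hypothetical. The client is injected, so anything exposing an OpenAI-style `chat.completions.create` will work.

```python
from typing import Any, Optional, Sequence


class ModelRouter:
    """Hypothetical wrapper: one client, many models, optional fallback chain."""

    def __init__(self, client: Any, default_model: str = "deepseek-v4-flash") -> None:
        # `client` is expected to look like openai.OpenAI (chat.completions.create).
        self.client = client
        self.default_model = default_model

    def complete(
        self,
        prompt: str,
        model: Optional[str] = None,
        fallbacks: Sequence[str] = (),
        temperature: float = 0.7,
    ) -> str:
        """Try `model`, then each fallback in order; re-raise the last error if all fail."""
        last_error: Optional[Exception] = None
        for name in (model or self.default_model, *fallbacks):
            try:
                response = self.client.chat.completions.create(
                    model=name,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=temperature,
                )
                return response.choices[0].message.content
            except Exception as exc:  # broad on purpose for the sketch; narrow it in real code
                last_error = exc
        assert last_error is not None  # the loop always runs at least once
        raise last_error


# Usage, assuming the unified endpoint shown earlier in the post:
# router = ModelRouter(OpenAI(api_key="...", base_url="https://global-apis.com/v1"))
# router.complete("Summarize this ticket", model="Qwen/Qwen3-32B", fallbacks=["deepseek-v4-flash"])
```

Because the model is just a string argument, swapping vendors becomes a config change, and the fallback chain gives you a cheap form of redundancy when one provider is having a bad day.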
