gentleforge

Posted on Jun 27

Four Chinese LLMs I Tested at Scale: A Cloud Architect's Field Notes

#webdev #deepseek #machinelearning #api


I'll be honest with you — when I first looked at routing Chinese-origin LLMs through a global endpoint, I expected rough edges. What I found instead was four model families that each carve out a genuinely different niche in a production stack. After spending the last few months running them through what I call my "p99 torture suite" (sustained 100 RPS, multi-region failover, mixed Chinese/English workloads), I have opinions. Strong ones. Let me share them.

How I Actually Test These Things

Before we dive in, let me explain my methodology because I know some of you reading this care more about the numbers than the marketing.

I run a benchmark cluster across three regions — Frankfurt, Singapore, and Virginia — pushing concurrent traffic through Global API's unified endpoint. For each model, I capture:

p50 and p99 latency under sustained load (the p99 is where production lives and dies)
Token throughput at the 99.9% availability target
Cost per million tokens at my actual request mix
Failure rate during simulated regional outages

I do this because I'm not running a chatbot. I'm running inference at scale, where an extra 200ms at p99 can mean a user notices, and where a 0.1% error rate compounds into real incidents. Anyone telling you a model is "fast" without p99 data is selling you something.
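For readers who want to reproduce the percentile numbers, here is a minimal sketch of how p50 and p99 could be computed from raw latency samples. It assumes the nearest-rank definition of a percentile; the function names and the sample data are hypothetical and are not taken from the benchmark suite described above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to avoid float rounding surprises
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Return the two latency figures used throughout this post."""
    return {"p50": percentile(latencies_ms, 50), "p99": percentile(latencies_ms, 99)}

# Hypothetical samples: 98 fast requests and 2 slow outliers (milliseconds).
latencies = [200.0] * 98 + [1400.0] * 2
print(summarize(latencies))  # {'p50': 200.0, 'p99': 1400.0}
```

The made-up data shows why the median alone misleads: 98% of requests finish in 200 ms, yet the p99 is seven times higher.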

The Contenders at a Glance

Here's the high-level picture after my testing. I've organized it around my usual lens: cost efficiency at scale, reliability characteristics, and what each model is genuinely good at outside a benchmark vacuum.

Family	Developer	Output Price Range ($/M tokens)	Sweet Spot Model	Multimodal
DeepSeek	DeepSeek (幻方)	$0.25–$2.50	V4 Flash @ $0.25/M	Limited
Qwen	Alibaba (阿里)	$0.01–$3.20	Qwen3-32B @ $0.28/M	Yes (VL, Omni)
Kimi	Moonshot AI (月之暗面)	$3.00–$3.50	K2.5 @ $3.00/M	No
GLM	Zhipu AI (智谱)	$0.01–$1.92	GLM-5 @ $1.92/M	Yes (GLM-4.6V)

All four offer 128K context windows and expose OpenAI-compatible APIs, which matters enormously when you don't want to rewrite your SDK integration every quarter.

DeepSeek: The Latency Darling

When I needed a model that wouldn't make my p99 dashboard cry, DeepSeek's V4 Flash became my baseline. At $0.25 per million output tokens, it's the kind of pricing that lets you stop worrying about token counts in your logs.

What I measured:

V4 Flash sustains roughly 60 tokens/second in my pipeline
p99 latency from Virginia hovers around 1.4s for 500-token completions
Zero regional failover incidents in 30 days of continuous load

The model lineup is lean and focused. V3.2 at $0.38/M represents their latest architecture, V4 Pro at $0.78/M is what I'd reach for when quality matters more than speed, and the R1 reasoner at $2.50/M is genuinely worth it for math-heavy workloads — I've seen it solve multi-step problems that make GPT-4o spin its wheels. Their Coder variant at $0.25/M is a quiet standout for code generation tasks.

The honest tradeoff: DeepSeek doesn't do vision natively. If your pipeline ingests images, you need a separate multimodal step. And on pure Chinese-language evaluation, GLM and Kimi outscore it — though the gap is narrower than the marketing pages suggest.

Qwen: The Model Catalog

Qwen is what I deploy when a client wants options and doesn't know what they actually need. Alibaba ships a model for every price point and every modality, and that breadth shows up in production.

I routinely route to Qwen3-8B at $0.01/M for trivial classification and extraction — it's so cheap that the cost line item disappears from my monthly invoice. For general production work, Qwen3-32B at $0.28/M is my workhorse. The Coder variant at $0.35/M handles code generation tasks competently, while Qwen3-VL-32B and Qwen3-Omni-30B (both at $0.52/M) cover vision and multimodal requirements.

At the top end, Qwen3.5-397B at $2.34/M is their enterprise reasoning play. I've only used it for the kind of workloads where you genuinely need a frontier-class model and accept paying for it.

What I appreciate about Qwen: the Alibaba backing translates to serious infrastructure. SLAs are explicit, multi-region routing is mature, and the release cadence (Qwen3, Qwen3.5, Qwen3.6) means you're rarely stuck with stale weights.

What irks me: the naming conventions. I have a sticky note on my monitor reminding me which version is current. Also, Qwen3.6-35B at $1.00/M feels steep for what you get. The sweet spot is around the 32B tier, and I'd recommend sticking to it unless you've benchmarked otherwise.

Kimi: The Premium Reasoning Bet

Kimi doesn't compete on price. At $3.00–$3.50/M across the lineup, with K2.5 sitting at $3.00/M as their flagship, it's the most expensive family in this comparison. So why do I keep it in rotation?

Because on reasoning benchmarks, Kimi leads. When I run logic-heavy evals — the kind that make other models hallucinate intermediate steps — K2.5 delivers consistently. For my enterprise clients running document analysis, contract review, or multi-step planning workflows, that reliability is worth the premium. The p99 latency is acceptable (~2.1s for medium completions in my tests) but not class-leading.

There's no multimodal variant, which limits where you can route it. And the 128K context, while generous, isn't differentiated from the other families. Kimi is a reasoning specialist that justifies its price tag when reasoning is what you actually need.

GLM: The Multilingual Workhorse

GLM rounds out my routing table. From Zhipu AI, the family spans $0.01/M at the bottom (GLM-4-9B) up to GLM-5 at $1.92/M.

Here's where GLM earns its place: Chinese-language tasks. On native Chinese benchmarks, GLM-5 and Kimi are neck and neck for the top spot, and GLM's pricing makes it the more sensible choice for high-volume Chinese workloads. The GLM-4.6V variant handles multimodal requirements when DeepSeek can't.

In my multi-region setup, GLM-5's p99 from Frankfurt to Asian endpoints was the most consistent of the four. If you're serving a global Chinese-speaking audience, that's not a small thing.

My Production Routing Strategy

After all this testing, here's how I actually route traffic. This is the part of my architecture review where the rubber meets the road.

For 60% of my inference volume — classification, extraction, simple chat — I route to V4 Flash. The cost-per-quality ratio is unmatched in my benchmarks.

For complex reasoning and code review at scale, I split between Qwen3-32B and Kimi K2.5 depending on budget sensitivity. Qwen wins on cost, Kimi wins on quality.

For vision and multimodal requirements, Qwen3-VL-32B is my default unless the workload is purely Chinese, in which case GLM-4.6V takes the call.

For any client asking about "the cheapest viable model," I send them to Qwen3-8B at $0.01/M and let them discover what that gets them.
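The routing rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the production router: the `Request` fields are hypothetical, and every model identifier except `deepseek-v4-flash` (which appears in the snippet later in this post) is a guess at how the endpoint might name these models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    """Hypothetical request descriptor used only for this sketch."""
    task: str                      # e.g. "classification", "reasoning", "vision"
    chinese_only: bool = False     # workload is purely Chinese-language
    budget_sensitive: bool = True  # prefer cost over top reasoning quality
    cheapest: bool = False         # client asked for the cheapest viable model

def pick_model(req: Request) -> str:
    """Map a request to a model name following the rules described above."""
    if req.cheapest:
        return "qwen3-8b"                   # assumed identifier
    if req.task == "vision":
        # GLM-4.6V for purely Chinese workloads, Qwen3-VL-32B otherwise
        return "glm-4.6v" if req.chinese_only else "qwen3-vl-32b"
    if req.task in {"reasoning", "code_review"}:
        # Qwen wins on cost, Kimi wins on quality
        return "qwen3-32b" if req.budget_sensitive else "kimi-k2.5"
    # Classification, extraction, simple chat: the bulk of the volume
    return "deepseek-v4-flash"

print(pick_model(Request(task="classification")))
print(pick_model(Request(task="reasoning", budget_sensitive=False)))
```

Keeping the decision in a pure function like this makes the routing policy easy to unit-test without touching the network.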

Code: Routing to DeepSeek V4 Flash

Here's a snippet from my actual router. Nothing fancy — just the OpenAI SDK pointed at Global API's endpoint, which is what makes this whole comparison practical from one codebase.

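from openai import OpenAI

# Single client pointed at the unified endpoint; the key below is a placeholder.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

def classify_intent(user_message: str) -> str:
    """Return the model's intent label for a single user message."""
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": "Classify the intent into one of: billing, support, sales, other."},
            {"role": "user", "content": user_message}
        ],
        temperature=0.0,  # deterministic output for classification
        max_tokens=10     # a single label needs only a few tokens
    )
    # message.content can be None, so fall back to an empty string before stripping
    return (response.choices[0].message.content or "").strip()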

This function runs thousands of times per minute in my production setup, and at $0.25/M output tokens, the monthly bill is laughably small.
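To put a rough number on that, here is a back-of-the-envelope sketch of the output-token cost. The request rate of 1,000 calls per minute is an assumption chosen for illustration, as is the worst case of every call using all 10 output tokens. Input tokens are left out because this post only quotes output prices, so the real bill would be higher.

```python
def monthly_output_cost_usd(requests_per_minute, output_tokens_per_request,
                            price_per_million_usd, days=30):
    """Estimate monthly spend on output tokens only (input tokens excluded)."""
    minutes = days * 24 * 60
    total_tokens = requests_per_minute * output_tokens_per_request * minutes
    return total_tokens / 1_000_000 * price_per_million_usd

# Assumed load: 1,000 requests/minute, 10 output tokens each, $0.25 per million.
cost = monthly_output_cost_usd(1000, 10, 0.25)
print(f"${cost:.2f} per month")  # $108.00 per month
```

Under these assumptions the classifier emits 432 million output tokens in a 30-day month, which comes to about $108.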

Reliability and Multi-Region Notes

One thing I want to flag explicitly: all four of these model families expose OpenAI-compatible APIs. That sounds like a footnote, but in practice it's the difference between a unified routing layer and a maintenance nightmare. My failover logic is a single switch statement on model name, and my SDK stays consistent regardless of which provider I'm calling through Global API.
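As an illustration of that kind of failover, here is a minimal sketch that uses a dictionary lookup in place of a switch statement. The fallback chains, the model identifiers other than `deepseek-v4-flash`, and the `call_model` callable are all assumptions made for this example; in practice `call_model` would wrap the SDK call, and the `except` clause would catch the SDK's specific error types rather than every exception.

```python
# Hypothetical fallback chains keyed by primary model name.
FALLBACKS = {
    "deepseek-v4-flash": ["qwen3-32b", "glm-5"],
    "qwen3-32b": ["deepseek-v4-flash"],
    "kimi-k2.5": ["qwen3-32b"],
}

def complete_with_failover(primary, prompt, call_model, fallbacks=FALLBACKS):
    """Try the primary model, then each fallback in order.

    call_model(model, prompt) must return the completion text or raise.
    Returns (model_used, completion_text).
    """
    errors = []
    for model in [primary, *fallbacks.get(primary, [])]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # narrow this to SDK errors in real code
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))

# Simulated caller: the primary model is "down", the first fallback answers.
def fake_call(model, prompt):
    if model == "deepseek-v4-flash":
        raise TimeoutError("simulated regional outage")
    return f"answer from {model}"

print(complete_with_failover("deepseek-v4-flash", "hello", fake_call))
```

Because every family sits behind the same OpenAI-compatible interface, the only thing that changes between attempts is the model name.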

I haven't seen a meaningful SLA difference between the four families at the 99.9% availability tier. All of them, accessed through a unified endpoint, deliver the kind of reliability that lets me promise my clients real uptime guarantees. The differentiator is more about latency tails and regional routing consistency than raw uptime numbers.

Final Thoughts

If I had to pick a starting point for a new project, I'd standardize on DeepSeek V4 Flash for cost reasons and add Qwen for multimodal coverage. That's two model families and two SLAs to manage, and it covers roughly 90% of what most production LLM workloads need.

Kimi and GLM earn their place when specialized requirements (reasoning depth, Chinese-native quality) justify the integration overhead.

The honest truth is that the Chinese LLM ecosystem in 2026 isn't playing catch-up anymore. These are production-grade models running on production-grade infrastructure at price points that make Western alternatives look expensive. The hard part isn't picking one — it's building the routing layer that lets you use them all.

If you're evaluating these models for your own stack, Global API makes the comparison easy by exposing all four through a single endpoint with one API key. That's how I run all my benchmarks, and it's how I run production. Worth checking out if you're tired of juggling vendor accounts.

DEV Community