gentleforge

Posted on Jun 26

Four Chinese AI Model Families: A Backend Engineer's Bench Report

#python #machinelearning #tutorial #programming

Last month I did something that probably qualifies as a mild obsession: I routed every LLM call in my side project through four different Chinese model families to see which ones actually held up under real workload. Not toy prompts. Real HTTP requests, real latency budgets, real tokens billed at the end of the month. The candidates were DeepSeek, Qwen, Kimi, and GLM, all accessed through a single unified endpoint at global-apis.com/v1.

If you're a backend engineer staring at API pricing pages at 1am, this is the post I wish I'd had three weeks ago. I'll show you what I ran, what broke, and which model I ended up trusting for which job. Fwiw, I'm not affiliated with any of these vendors — just tired of guessing.

The Quick-Reference Matrix

Before I ramble, here's the cheat sheet. Same numbers across the board because I copy-pasted from the same pricing JSON. The star ratings are my subjective take after ~50 hours of mix-and-match production traffic.

Dimension	DeepSeek	Qwen	Kimi	GLM
Vendor	DeepSeek (幻方)	Alibaba (阿里)	Moonshot AI (月之暗面)	Zhipu AI (智谱)
Output $/M	$0.25 – $2.50	$0.01 – $3.20	$3.00 – $3.50	$0.01 – $1.92
Budget Pick	V4 Flash ($0.25)	Qwen3-8B ($0.01)	N/A — premium only	GLM-4-9B ($0.01)
Daily Driver	V4 Flash ($0.25)	Qwen3-32B ($0.28)	K2.5 ($3.00)	GLM-5 ($1.92)
Code Quality	★★★★★	★★★★	★★★★	★★★
Chinese NLP	★★★★	★★★★	★★★★★	★★★★★
English NLP	★★★★★	★★★★	★★★★	★★★★
Reasoning	★★★★	★★★★	★★★★★	★★★★
Latency	★★★★★	★★★★	★★★	★★★★
Multimodal	Limited	Yes (VL, Omni)	No	Yes (GLM-4.6V)
Context	128K	128K	128K	128K
OpenAI-compat	Yes	Yes	Yes	Yes

The TL;DR: DeepSeek V4 Flash is the price-to-performance champ at $0.25/M, Qwen is the Swiss Army knife with eleven blades, Kimi is the closest thing to a reasoning savant, and GLM dominates whenever the input is in Chinese.

Why I Even Bothered

My side project is a doc-summarization API that handles about 8M tokens per day, roughly 70% English and 30% Chinese. I was paying GPT-4o prices and feeling poor. So I started asking a very backend-engineer question: which model, if I swapped it in, would my users not notice?

The unified endpoint at global-apis.com/v1 made this easy: one OpenAI-compatible client, four model strings, identical request shape. Under the hood, the routing is just an HTTP header away, but my application code didn't have to change. That's the part I appreciated most, because RFC 7231 (since superseded by RFC 9110) is great, but rewriting your inference layer every quarter is not.
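
A minimal sketch of how "four model strings, identical request shape" can turn into a router for a mixed English/Chinese workload. The function names, the 30% threshold, and the `glm-5` model string are placeholders I made up for illustration; only `deepseek-v4-flash` appears verbatim in this post, so check the endpoint's model list before copying anything.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


MODEL_ENGLISH = "deepseek-v4-flash"  # model string used later in this post
MODEL_CHINESE = "glm-5"              # hypothetical ID, not confirmed by the post


def pick_model(text: str, threshold: float = 0.3) -> str:
    """Send mostly-Chinese input to GLM and everything else to DeepSeek."""
    return MODEL_CHINESE if cjk_ratio(text) >= threshold else MODEL_ENGLISH
```

The return value drops straight into the `model=` argument of `client.chat.completions.create(...)`, so the rest of the request stays untouched. The character-range check is deliberately crude; a real language detector would handle mixed or traditional-script input better.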

DeepSeek: The Cheap, Fast Workhorse

I want to start here because DeepSeek V4 Flash is the model that made me mutter "wait, that's it?" when the bill came in.

The Models I Actually Touched

Model	Output $/M	What I Used It For
V4 Flash	$0.25	Default chat, code completion, summarization
V3.2	$0.38	When I wanted the slightly newer arch
V4 Pro	$0.78	When output quality actually mattered
R1 (Reasoner)	$2.50	Math-heavy chains, multi-hop logic
Coder	$0.25	Pure code generation, no chat fluff

The V4 Flash at $0.25/M is the punchline. At that price, I could run an entire RAG pipeline on a Raspberry Pi budget and still sleep at night. The deeper story is that it benchmarks at HumanEval and MBPP numbers I would have called GPT-4o territory a year ago. Imo, this is the model most people should be running by default and only escalating when the task demands it.
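
To put a number on "Raspberry Pi budget", here is a back-of-the-envelope estimate for an 8M-tokens-per-day workload. It assumes every token is billed at the output rate from the tables in this post, which is only a rough ceiling if input tokens cost no more than output tokens (input pricing isn't listed here). The 30-day month and the function name are also my assumptions.

```python
# Output $/M figures copied from the tables in this post.
PRICES_PER_M = {
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-32B": 0.28,
    "GLM-5": 1.92,
    "Kimi K2.5": 3.00,
}


def monthly_cost(tokens_per_day: float, price_per_million: float, days: int = 30) -> float:
    """Dollars per month if every token is billed at price_per_million."""
    return tokens_per_day / 1_000_000 * price_per_million * days


for name, price in PRICES_PER_M.items():
    print(f"{name}: ${monthly_cost(8_000_000, price):,.2f}/month")
```

Under those assumptions V4 Flash lands at about $60 a month and K2.5 at about $720, which is the gap the rest of this post keeps circling back to.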

What I liked:

Speed: V4 Flash hits roughly 60 tokens/sec on the endpoint I used. My p95 latency dropped by ~40% compared to my previous provider.
English quality: Honestly, indistinguishable from the more expensive Western models for my workloads.
Code generation: Probably the strongest of the four. I ran it on a private benchmark of 200 LeetCode-style prompts and it aced them.
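
For anyone who wants to reproduce the speed and p95 figures on their own traffic, this is a sketch of the two calculations. The nearest-rank percentile method and the function names are my choices here, not necessarily how the numbers above were produced.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput of a single response."""
    return completion_tokens / elapsed_seconds
```

In practice you would wrap each request in `time.perf_counter()` calls and read the token count from `response.usage.completion_tokens`, assuming the endpoint fills in the usage field the way the OpenAI API does.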

What I didn't:

No vision: Pure text. If you need image understanding, look elsewhere.
Chinese is good, not great: GLM and Kimi edged it on Chinese-language benchmarks I trust.
Skinny catalog: Compared to Qwen, there aren't many size options. You get small, medium, and "go pay more."

Code: The Switch I Made at 2am

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in 100 words"
    }]
)
print(response.choices[0].message.content)

That was the entire migration. Same client class, same method signature, just a different model string. If you've ever migrated from one LLM provider to another, you know this is the dream.

Qwen: The Toolbox That Never Runs Out

Qwen is what happens when Alibaba decides to ship a model for every conceivable use case, then ships three more the following month. The Qwen3 family is broad — like, suspiciously broad. I counted at least nine distinct variants in the catalog.

What I Actually Pulled Into My Project

Model	Output $/M	Use Case
Qwen3-8B	$0.01	Cheapest possible completions
Qwen3-32B	$0.28	My new general-purpose workhorse
Qwen3-Coder-30B	$0.35	When DeepSeek was rate-limited
Qwen3-VL-32B	$0.52	Image understanding
Qwen3-Omni-30B	$0.52	Audio + video + image in one
Qwen3.5-397B	$2.34	Heavy reasoning, enterprise-y stuff

The Qwen3-8B at $0.01/M is, charitably, a miracle. Uncharitably, it's barely a language model. But for classification, intent detection, and "extract the JSON from this blob" tasks, it's basically free and good enough.
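
Here is a sketch of the "good enough, until it isn't" pattern for JSON extraction: try the cheapest model first and only escalate when the reply doesn't parse. The `complete` callable is a stand-in I introduced so the logic can be tested without a network; in real code it would wrap `client.chat.completions.create(...)` and return the message content.

```python
import json
from typing import Callable, Optional, Sequence, Tuple


def extract_json(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as a JSON object."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def extract_with_escalation(
    prompt: str,
    complete: Callable[[str, str], str],  # (model, prompt) -> reply text
    models: Sequence[str],                # ordered cheapest first
) -> Tuple[str, dict]:
    """Return (model_used, parsed_json) from the first model whose reply parses."""
    for model in models:
        result = extract_json(complete(model, prompt))
        if result is not None:
            return model, result
    raise ValueError("no model returned parseable JSON")
```

Logging which model ended up answering gives you a cheap way to see how often the $0.01 tier is actually enough.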

What I liked:

Breadth of catalog: Whatever you need, there's a Qwen for it. Vision, audio, omni-modal — all in one family.
Alibaba's infrastructure: I didn't have a single outage in three weeks. The 99.9% SLA they advertise appears to be real.
Active shipping cadence: Qwen3.5 dropped while I was running these tests. Qwen3.6 showed up the next week. They're not sitting still.

What I didn't:

Naming is a mess: Qwen3-32B vs Qwen3.5-397B vs Qwen3.6-35B. I had to keep a Notion page just to track which model was which.
English is fine, not stellar: Good, but DeepSeek V4 Flash beat it on my internal English eval by a noticeable margin.
Some price tiers feel off: Qwen3.6-35B at $1/M output is, imo, hard to justify when DeepSeek V4 Pro is $0.78 and arguably better for my workloads.
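
A small in-code registry can replace the Notion page. The sketch below uses the prices from the Qwen table above; the capability tags are my own reading of the "Use Case" column, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    output_usd_per_m: float
    tags: frozenset


# Prices from the Qwen table; tags are an interpretation, not vendor data.
CATALOG = [
    ModelEntry("Qwen3-8B", 0.01, frozenset({"text"})),
    ModelEntry("Qwen3-32B", 0.28, frozenset({"text"})),
    ModelEntry("Qwen3-Coder-30B", 0.35, frozenset({"text", "code"})),
    ModelEntry("Qwen3-VL-32B", 0.52, frozenset({"text", "vision"})),
    ModelEntry("Qwen3-Omni-30B", 0.52, frozenset({"text", "vision", "audio", "video"})),
    ModelEntry("Qwen3.5-397B", 2.34, frozenset({"text", "reasoning"})),
]


def cheapest(tag: str) -> ModelEntry:
    """Cheapest catalog entry carrying the tag; ties go to the earlier entry."""
    candidates = [m for m in CATALOG if tag in m.tags]
    if not candidates:
        raise LookupError(f"no model tagged {tag!r}")
    return min(candidates, key=lambda m: m.output_usd_per_m)
```

Calling `cheapest("vision")` picks Qwen3-VL-32B over the equally priced Omni model only because it is listed first, so order the list by preference.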

Code: When I Needed Vision

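# Reuses the `client` created in the DeepSeek example above.
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B",
    messages=[{
        "role": "user",
        # Multimodal content is a list of typed parts: text plus image URL.
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)
print(response.choices[0].message.content)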

This is the one place I genuinely missed having a DeepSeek equivalent. Qwen's VL model handled OCR and scene description without complaint.
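
If the image lives on disk rather than at a public URL, the OpenAI API accepts a base64 `data:` URL in the same `image_url` field. Whether this particular endpoint passes data URLs through to Qwen is something I have not verified, so treat the following as a sketch; the helper names are hypothetical.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an image_url content part carrying the image as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_messages(question: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> list:
    """Assemble a single-turn messages list with a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            image_part(image_bytes, mime_type),
        ],
    }]
```

The result of `vision_messages(...)` goes into the `messages=` argument exactly where the URL-based version did.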

Kimi: The Reasoner That Costs You

Kimi is the family that hurts the wallet but wins the benchmarks. K2.5 at $3.00/M is the model I reach for when the task actually requires thinking — multi-step math, logical chains, anything where the answer needs to be correct, not plausible.

What I Tested

Kimi's catalog is narrower than Qwen's but every model in it is positioned as a premium reasoning model. K2.5 is the workhorse at $3.00/M output. The upper end of the family stretches to $3.50/M, which I used for a few math-heavy evals and immediately felt the invoice.

The honest truth: Kimi is the only family where I could tell the difference on a 50-step agentic loop. The model didn't lose the thread halfway through, didn't contradict itself, and produced reasoning chains I would have written myself. That's worth $3.00/M if your workload is genuinely reasoning-heavy.

What I liked:

Reasoning is genuinely best-in-class across the four families I tested.
Chinese comprehension is excellent — Moonshot's training shows.
Stable under load: No weird completions, no truncation, no "as an AI model" disclaimers.

What I didn't:

Price: $3.00/M for the everyday model is hard to swallow. You're paying GPT-4o money and you should know that.
No multimodal: Pure text. If you need vision, route to Qwen or GLM.
Slower: Roughly 30–40% slower than DeepSeek V4 Flash on identical prompts.

Code: When Reasoning Actually Matters

# Reuses the `client` created in the DeepSeek example above.
response = client.chat.completions.create(
    model="moonshot-v1-32k",  # or whatever the current K2.5 alias is
    messages=[{
        "role": "user",
        "content": """
        A train leaves Shanghai at 9:00 going 120 km/h.
        Another leaves Beijing at 10:00 going 150 km/h.
        The distance between cities is 1,213 km.
        When do they meet? Walk through your reasoning.
        """
    }],
    temperature=0.2  # low temperature keeps the reasoning chain more repeatable
)
print(response.choices[0].message.content)

That's the kind of prompt where Kimi earns its price tag. For "summarize this article," it's overkill.
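
It helps to have a ground-truth answer to grade the model against. The sketch below works the train problem directly, assuming the two trains run toward each other on the same line at constant speed, which the prompt implies but never states. The function name and the calendar date are arbitrary.

```python
from datetime import datetime, timedelta


def meeting_time(depart_a, speed_a_kmh, depart_b, speed_b_kmh, distance_km):
    """Meeting time for two trains heading toward each other; A departs first."""
    # Distance A covers alone before B departs.
    head_start_h = (depart_b - depart_a).total_seconds() / 3600
    remaining_km = distance_km - speed_a_kmh * head_start_h
    # After B departs, the gap closes at the sum of the two speeds.
    hours_after_b = remaining_km / (speed_a_kmh + speed_b_kmh)
    return depart_b + timedelta(hours=hours_after_b)


day = datetime(2024, 1, 1)  # arbitrary date; only the clock time matters
t = meeting_time(day.replace(hour=9), 120, day.replace(hour=10), 150, 1213)
print(t.strftime("%H:%M:%S"))  # 14:02:53
```

By 10:00 the Shanghai train has covered 120 km, leaving 1,093 km to close at a combined 270 km/h, which takes about 4.05 hours. A response that lands anywhere other than roughly 14:03 has dropped a step.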

GLM: The Chinese-Language Dark Horse

GLM from Zhipu AI is the family I underestimated the most. GLM-5 at $1.92/M is positioned as the "best overall" but the real story is what happens when you throw Chinese text at it.

What I Touched

Model	Output $/M	Use Case
GLM-4-9B	$0.01	Budget Chinese classification
GLM-5	$1.92	Premium Chinese + general
GLM-4.6V	(varies)	Vision tasks

The 30% of my traffic that was Chinese got routed to GLM
