I Spent a Week Comparing DeepSeek, Qwen, Kimi, and GLM
One Tuesday, my OpenAI invoice hit $847. That was the moment I decided to take Chinese open-weight models seriously, not as a curiosity, but as actual production candidates.
Three months later I've run roughly 12 million tokens through DeepSeek, Qwen, Kimi, and GLM via Global API's unified endpoint. I built a small benchmark harness, threw real workloads at them, and started keeping score. This is what I learned, and more importantly, what I'd actually ship to production.
If you only have thirty seconds, here's the gist: DeepSeek V4 Flash is the price-to-performance champion at $0.25/M output. Qwen has the broadest catalog and genuinely good multimodal options. Kimi K2.5 dominates pure reasoning benchmarks but costs $3.00/M. GLM is the dark horse for Chinese-language work and has a surprisingly competent vision line. Everything else is nuance, and I have opinions about the nuance.
The Setup
Before diving into the comparison, fwiw, here's the harness I built. It's nothing fancy, just a wrapper that hits Global API's OpenAI-compatible endpoint and records latency, output token counts, and tokens per second for each run. Under the hood this is plain HTTP (RFC 9110, which replaced RFC 7231), but I won't bore you with that.
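```python
import time

import tiktoken
from openai import OpenAI

# Global API exposes an OpenAI-compatible endpoint, so the stock client works.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder key
    base_url="https://global-apis.com/v1",
)


def benchmark(model: str, prompt: str, runs: int = 5):
    """Send `prompt` to `model` `runs` times; record tokens, wall time, and tokens/sec."""
    # cl100k_base is an OpenAI tokenizer, so counts for these models are approximate.
    enc = tiktoken.get_encoding("cl100k_base")
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.perf_counter() - start
        # content can be None (e.g. an empty completion), so fall back to "".
        out_tokens = len(enc.encode(response.choices[0].message.content or ""))
        tps = out_tokens / elapsed
        results.append({
            "model": model,
            "tokens": out_tokens,
            "seconds": round(elapsed, 2),
            "tps": round(tps, 1),
        })
    return results
```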
One endpoint, four model families. That's the appeal of a unified gateway. No juggling keys, no rate-limiting gymnastics, no surprise billing from four different vendors.
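The harness records tokens and seconds but not dollars, and turning one into the other is plain arithmetic. Here is a minimal sketch, assuming only output tokens are billed at the per-million prices quoted in this post (input-token charges are ignored). The helper names `output_cost_usd` and `summarize` are mine, not part of the original harness, and the sample rows are hand-written, not real measurements.

```python
def output_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of `output_tokens` at a per-million-output-token price."""
    return output_tokens / 1_000_000 * price_per_million


def summarize(results: list, price_per_million: float) -> dict:
    """Collapse per-run dicts (same shape benchmark() returns) into one row."""
    total_tokens = sum(r["tokens"] for r in results)
    total_seconds = sum(r["seconds"] for r in results)
    return {
        "model": results[0]["model"],
        "runs": len(results),
        "avg_tps": round(total_tokens / total_seconds, 1),
        "cost_usd": round(output_cost_usd(total_tokens, price_per_million), 6),
    }


# Hand-written sample rows, NOT real measurements.
sample = [
    {"model": "deepseek-v4-flash", "tokens": 120, "seconds": 2.0, "tps": 60.0},
    {"model": "deepseek-v4-flash", "tokens": 180, "seconds": 3.0, "tps": 60.0},
]
print(summarize(sample, price_per_million=0.25))
```

For scale: if every one of 12 million tokens were an output token, `output_cost_usd(12_000_000, 0.25)` comes to $3.00, and the same volume at $3.00/M comes to $36.00.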
The Contenders at a Glance
Here's the matrix I ended up pinning to my monitor. All prices are per million output tokens, pulled from Global API's pricing page.
| Feature | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Developer | DeepSeek (幻方) | Alibaba (阿里) | Moonshot AI (月之暗面) | Zhipu AI (智谱) |
| Price Range | $0.25-$2.50/M | $0.01-$3.20/M | $3.00-$3.50/M | $0.01-$1.92/M |
| Top Pick | V4 Flash @ $0.25/M | Qwen3-32B @ $0.28/M | K2.5 @ $3.00/M | GLM-5 @ $1.92/M |
| Code Generation | 5/5 | 4/5 | 4/5 | 3/5 |
| Chinese | 4/5 | 4/5 | 5/5 | 5/5 |
| English | 5/5 | 4/5 | 4/5 | 4/5 |
| Reasoning | 4/5 | 4/5 | 5/5 | 4/5 |
| Speed | 5/5 | 4/5 | 3/5 | 4/5 |
| Vision/Multimodal | Limited | Yes (VL, Omni) | No | Yes (GLM-4.6V) |
| Context Window | 128K | 128K | 128K | 128K |
| API Compatibility | OpenAI | OpenAI | OpenAI | OpenAI |
The 128K context is table stakes now. Every vendor cleared that bar, so I stopped tracking it as a differentiator. What matters is what's done with those tokens.
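If you'd rather query the matrix than squint at it, it fits in a dict. This is a rough sketch: the scores are my 1-5 ratings from the table above, and attaching each family's ratings to its top-pick model and price is a simplification I'm assuming for illustration. `cheapest_meeting` is a hypothetical helper name.

```python
from typing import Optional

# Ratings and top-pick prices copied from the matrix above ($/M output tokens).
MATRIX = {
    "DeepSeek V4 Flash": {"price": 0.25, "code": 5, "chinese": 4, "english": 5, "reasoning": 4, "speed": 5},
    "Qwen3-32B":         {"price": 0.28, "code": 4, "chinese": 4, "english": 4, "reasoning": 4, "speed": 4},
    "Kimi K2.5":         {"price": 3.00, "code": 4, "chinese": 5, "english": 4, "reasoning": 5, "speed": 3},
    "GLM-5":             {"price": 1.92, "code": 3, "chinese": 5, "english": 4, "reasoning": 4, "speed": 4},
}


def cheapest_meeting(**minimums: int) -> Optional[str]:
    """Return the cheapest model whose ratings meet every minimum, or None."""
    candidates = [
        name
        for name, row in MATRIX.items()
        if all(row[key] >= floor for key, floor in minimums.items())
    ]
    return min(candidates, key=lambda name: MATRIX[name]["price"], default=None)


print(cheapest_meeting(code=5))
print(cheapest_meeting(chinese=5))
print(cheapest_meeting(reasoning=5, speed=4))
```

The last call returns `None`: nothing in the table scores 5 on reasoning and at least 4 on speed, which is the Kimi trade-off in one line.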
DeepSeek: The Value Workhorse
I'll start here because, imo, DeepSeek is the model most teams should default to.
I migrated a customer support summarization pipeline from GPT-4o-mini to DeepSeek V4 Flash two months ago. Monthly cost dropped from $312 to $41. Quality scores from my eval suite actually went up by 3%. I still don't fully understand how that's possible, but I'm not complaining.
The Lineup
| Model | Output $/M | What I Use It For |
|---|---|---|
| V4 Flash | $0.25 | Default everything |
| V3.2 | $0.38 | When I want the latest architecture |
| V4 Pro | $0.78 | Production paths where quality matters |
| R1 (Reasoner) | $2.50 | Math, logic chains, anything requiring chain-of-thought |
| Coder | $0.25 | Code-specific tasks, same price as Flash |
What Works
The price-to-performance ratio is genuinely absurd. V4 Flash at $0.25/M is the same price tier as the cheapest GPT-4o-mini variant, but the output quality feels closer to GPT-4o. On HumanEval and MBPP, DeepSeek's coding models are consistently at the top of the open-weight leaderboards.
Speed is the second thing I noticed. V4 Flash consistently hits ~60 tokens/sec on my benchmark, which is the fastest of any model I tested. For chat UIs where latency is UX, that matters more than people admit.
English quality is solid. I ran it through my usual battery of MMLU subsets, and it lands within noise of Western flagship models.
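To make the speed number concrete, here's a back-of-the-envelope conversion from tokens/sec to how long a user waits. It's a sketch that assumes a 500-token reply and ignores time-to-first-token and network overhead. The 25 tok/s comparison figure is what I measured for Kimi K2.5, covered further down.

```python
def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady generation rate."""
    return output_tokens / tokens_per_second


# 60 tok/s is the V4 Flash figure above; 25 tok/s is the Kimi K2.5 figure.
for tps in (60, 25):
    wait = seconds_to_generate(500, tps)
    print(f"{tps} tok/s -> {wait:.1f} s for a 500-token reply")
```

That works out to about 8.3 seconds versus 20 seconds for the same reply.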
What Doesn't
Vision is the obvious gap. DeepSeek doesn't have a native multimodal model that I've found, so anything image-related goes to Qwen or GLM.
Chinese-language quality is "very good" rather than "best in class." For pure Chinese understanding or generation, Kimi and GLM both edge it out. The difference is small, maybe 2-4% on benchmarks, but it exists.
Model variety is narrower than Qwen. If you need a 70B parameter sweet spot or a 200B beast, DeepSeek doesn't have one.
Code Example
```python
from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}],
)
print(response.choices[0].message.content)
```
This is the exact pattern I use in production. Swap deepseek-v4-flash for any other model name and it just works.
Qwen: The Swiss Army Knife
If DeepSeek is a scalpel, Qwen is the entire operating room. Alibaba ships so many models that I genuinely lost track during testing.
The Lineup
| Model | Output $/M | What I Use It For |
|---|---|---|
| Qwen3-8B | $0.01 | Classification, routing, ultra-cheap preprocessing |
| Qwen3-32B | $0.28 | My general-purpose default for non-critical paths |
| Qwen3-Coder-30B | $0.35 | When I want bigger context than DeepSeek Coder |
| Qwen3-VL-32B | $0.52 | Image understanding |
| Qwen3-Omni-30B | $0.52 | Audio + video + image in one call |
| Qwen3.5-397B | $2.34 | When I genuinely need the big brain |
What Works
The breadth is unmatched. Qwen3-8B at $0.01/M is so cheap that I use it as a routing classifier to decide whether to escalate to a larger model. That single trick cut my average inference cost by 40%.
Vision models are genuinely good. Qwen3-VL-32B handles OCR, chart understanding, and document parsing better than I expected. Omni is even more interesting, since it takes audio, video, and image in one request. I haven't found a Western equivalent at that price.
The Alibaba infrastructure backing is real. I've never had an outage hit a Qwen model, and time to first token is consistently under 2 seconds on the 32B variants.
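The routing-classifier trick from the top of this section can be sketched in a few lines. This is an illustration under assumptions, not my production router: the model IDs, the prompt wording, and the `classify` callable (which would wrap a chat-completion call to the cheap model) are all hypothetical.

```python
from typing import Callable

CHEAP_MODEL = "qwen3-8b"   # hypothetical model ID for the $0.01/M router
LARGE_MODEL = "qwen3-32b"  # hypothetical model ID to escalate to


def pick_model(prompt: str, classify: Callable[[str, str], str]) -> str:
    """Ask the cheap model to label the prompt, then choose who answers it.

    `classify(model, text)` is expected to return the model's raw text reply.
    """
    instruction = (
        "Answer with one word, simple or complex: how hard is this request?\n\n"
        + prompt
    )
    label = classify(CHEAP_MODEL, instruction).strip().lower()
    # Anything other than a clear "simple" escalates, so a garbled label fails safe.
    return CHEAP_MODEL if label.startswith("simple") else LARGE_MODEL


# Stand-in classifier so the sketch runs without network access.
def fake_classify(model: str, text: str) -> str:
    return "simple" if len(text) < 120 else "complex"


print(pick_model("What's your refund policy?", fake_classify))
```

Escalating on any unclear label costs a little more but avoids sending hard requests to the small model by accident.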
What Doesn't
Naming conventions are a mess. Qwen3, Qwen3.5, Qwen3.6, with variant letters I still don't fully understand. I keep a sticky note on my monitor. This is annoying enough that I'd dock them a point if I were scoring.
English quality is solid but not exceptional. On creative writing benchmarks, it lands slightly behind DeepSeek. Not by much, but consistently.
Some models are overpriced. Qwen3.6-35B at roughly $1/M feels steep when Qwen3-32B is $0.28/M and almost as good. I'm not sure who the target buyer is.
Kimi: The Reasoning Specialist
Kimi is the model I reach for when I'm stuck. It thinks harder than the others, and you can see the difference in the output.
The Lineup
| Model | Output $/M | What I Use It For |
|---|---|---|
| K2.5 | $3.00 | Hard reasoning, research synthesis |
| K2-Pro | $3.50 | Maximum reasoning quality |
What Works
The reasoning benchmarks are real. On GPQA, MATH, and the harder MMLU subsets, K2.5 outperforms everyone else in this comparison by 5-8%. If you have a task where getting the right answer matters more than speed or cost, Kimi is the answer.
Chinese-language quality is best-in-class, tied with GLM in my tests. For long-form Chinese content generation, Kimi has a slightly more natural voice.
Context handling at 128K is reliable. I dropped a 90K-token legal document in once, and it actually cited clauses correctly when I asked questions. That alone saved me a week's worth of manual review.
What Doesn't
Speed. Kimi is the slowest of the four, by a lot. K2.5 clocks around 25 tokens/sec on my benchmark, less than half of DeepSeek V4 Flash's ~60. For interactive UIs, that latency is noticeable.
Price is the real issue. $3.00/M for K2.5 means you cannot default to this model on cost-sensitive workloads. I only route to Kimi when other models fail my eval suite.
No vision. Like DeepSeek, Kimi doesn't have a multimodal variant. For image tasks, you're going to Qwen or GLM.
There's no budget tier. Kimi has no "Flash" equivalent at $0.25/M, so you're stuck paying premium pricing or routing elsewhere.
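One way to read "route to Kimi only when other models fail" at request time is a cheapest-first fallback chain. The sketch below assumes you have a per-answer check to run. `generate` and `passes` are hypothetical callables standing in for an API call and an eval check, and the model IDs are placeholders.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_passing(
    prompt: str,
    models: Iterable[str],
    generate: Callable[[str, str], str],
    passes: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try models in the given order; return (model, answer) for the first pass.

    Returns (None, None) if every model's answer fails the check.
    """
    for model in models:
        answer = generate(model, prompt)
        if passes(answer):
            return model, answer
    return None, None


# Stand-ins so the sketch runs offline: only the last model "gets it right".
canned = {"deepseek-v4-flash": "I think 41", "kimi-k2.5": "The answer is 42"}
model, answer = first_passing(
    "What is 6 * 7?",
    ["deepseek-v4-flash", "kimi-k2.5"],  # cheapest first; IDs are placeholders
    generate=lambda m, p: canned[m],
    passes=lambda a: "42" in a,
)
print(model, "->", answer)
```

The catch is that a failed cheap attempt still costs tokens and latency, so this only pays off when the cheap model passes most of the time.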
GLM: The Underrated Wildcard
GLM is the model I underestimated. I started testing it as an afterthought, and it's now handling two production pipelines.
The Lineup
| Model | Output $/M | What I Use It For |
|---|---|---|
| GLM-4-9B | $0.01 | Cheap Chinese-language routing |
| GLM-5 | $1.92 | Best overall GLM option |
What Works
Chinese-language quality is genuinely top-tier. For pure Chinese tasks, GLM-5 ties or slightly beats Kimi in my internal evals. If your users are primarily Chinese-speaking, this matters.
GLM-4.6V is a surprisingly competent vision model. For document understanding and OCR-heavy tasks in Chinese, it's my pick over Qwen3-VL.
Pricing on the small model is absurd. GLM-4-9B at $0.01/M is one of the cheapest models on any provider. I use it as a classifier.
What Doesn't
Code generation is the weakest of the four. On HumanEval, GLM lands 5-10 points below DeepSeek and Qwen. Not unusable, just not where I'd send critical code paths.
English quality is good but not exceptional. For English-heavy workloads, GLM feels a half-step behind the alternatives.
Documentation and ecosystem are thinner. I had to dig through Zhipu's repos to find good examples. Qwen's ecosystem is more mature.
The Real Workloads
Synthetic benchmarks are nice, but here's what actually shipped to production after testing:
Workload 1: Customer support summarization → DeepSeek V4 Flash. The 3% quality bump and 87% cost reduction are why this is no longer a question.
Workload 2: Document Q&A on Chinese contracts → GLM-5. Beat Kimi by 2% on my eval suite at 36% lower cost. No-brainer.
Workload 3: Multimodal receipt parsing → Qwen3-VL-32B. The only one of the four that does OCR well at this price point.
Workload 4: Math and logic chains → Kimi K2.5. Worth every penny when the answer has to be right.
Workload 5: Routing classifier → Qwen3-8B at $0.01/M. Too cheap to bother comparing alternatives.
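For anyone checking my math, the two percentages above fall straight out of the numbers quoted earlier in this post. A quick sketch (`percent_reduction` is just a helper name I picked):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100


# Monthly bill for the summarization pipeline: $312 -> $41.
print(round(percent_reduction(312, 41)))     # 87

# Output price per million tokens: Kimi K2.5 $3.00 -> GLM-5 $1.92.
print(round(percent_reduction(3.00, 1.92)))  # 36
```

Note that the 36% is a per-token price gap. It only equals the bill gap if both models emit about the same number of output tokens for the same task.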
My Actual Recommendations
If you're a backend engineer picking a default model today, here's what I'd do:
Default: DeepSeek V4