eagerspark

Posted on Jun 26

I Spent a Week Comparing DeepSeek, Qwen, Kimi, and GLM

#python #machinelearning #deepseek #tutorial

Three months ago my OpenAI invoice hit $847. That was the moment I decided to take Chinese open-weight models seriously, not as a curiosity but as actual production candidates.

Since then I've run roughly 12 million tokens through DeepSeek, Qwen, Kimi, and GLM via Global API's unified endpoint. I built a small benchmark harness, threw real workloads at them, and started keeping score. This is what I learned and, more importantly, what I'd actually ship to production.

If you only have thirty seconds, here's the gist: DeepSeek V4 Flash is the price-to-performance champion at $0.25/M output. Qwen has the broadest catalog and genuinely good multimodal options. Kimi K2.5 dominates pure reasoning benchmarks but costs $3.00/M. GLM is the dark horse for Chinese-language work and has a surprisingly competent vision line. Everything else is nuance, and I have opinions about the nuance.

The Setup

Before diving into the comparison, here's the harness I built. It's nothing fancy, just a wrapper that hits Global API's OpenAI-compatible endpoint and logs latency, output token counts, and throughput. Under the hood it's ordinary HTTPS requests, but I won't bore you with that.

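from openai import OpenAI
import time
import tiktoken

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder; load the real key from an env var
    base_url="https://global-apis.com/v1"
)

# cl100k_base is an OpenAI tokenizer, so its counts are only approximate for
# these models. It is used as a fallback when the response has no usage data.
enc = tiktoken.get_encoding("cl100k_base")

def benchmark(model: str, prompt: str, runs: int = 5):
    results = []

    for _ in range(runs):
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        # Wall-clock time for the whole request, network latency included.
        elapsed = time.perf_counter() - start

        # Prefer the server-reported count; content can be None on empty replies.
        text = response.choices[0].message.content or ""
        if response.usage is not None:
            out_tokens = response.usage.completion_tokens
        else:
            out_tokens = len(enc.encode(text))
        tps = out_tokens / elapsed

        results.append({
            "model": model,
            "tokens": out_tokens,
            "seconds": round(elapsed, 2),
            "tps": round(tps, 1)
        })

    return results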

One endpoint, four model families. That's the appeal of a unified gateway. No juggling keys, no rate-limiting gymnastics, no surprise billing from four different vendors.
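
Turning the harness output into dollars is one multiplication. Here is a minimal sketch, assuming the per-million output prices from the tables in this post and ignoring input-token charges. The helper names are illustrative, and every model ID except deepseek-v4-flash is a guess at the gateway's naming, so check them against your provider's model list.

```python
# Output prices in USD per million tokens, copied from the tables in this post.
# Model IDs other than "deepseek-v4-flash" are assumed names, not confirmed IDs.
OUTPUT_PRICE_PER_M = {
    "deepseek-v4-flash": 0.25,
    "qwen3-32b": 0.28,
    "kimi-k2.5": 3.00,
    "glm-5": 1.92,
}

def output_cost_usd(model: str, output_tokens: int) -> float:
    """Estimate output-side cost only; input tokens are billed separately."""
    return output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M[model]

def summarize(results: list[dict]) -> dict:
    """Aggregate benchmark rows (keys: model, tokens, seconds) for one model."""
    total_tokens = sum(r["tokens"] for r in results)
    total_seconds = sum(r["seconds"] for r in results)
    model = results[0]["model"]
    return {
        "model": model,
        "tokens": total_tokens,
        "avg_tps": round(total_tokens / total_seconds, 1),
        "cost_usd": round(output_cost_usd(model, total_tokens), 6),
    }
```

Feeding it the list returned by benchmark() gives one row per model, which is enough to rank them by cost per run.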

The Contenders at a Glance

Here's the matrix I ended up pinning to my monitor. All prices are per million output tokens, pulled from Global API's pricing page.

Feature	DeepSeek	Qwen	Kimi	GLM
Developer	DeepSeek (幻方)	Alibaba (阿里)	Moonshot AI (月之暗面)	Zhipu AI (智谱)
Price Range	$0.25-$2.50/M	$0.01-$3.20/M	$3.00-$3.50/M	$0.01-$1.92/M
Top Pick	V4 Flash @ $0.25/M	Qwen3-32B @ $0.28/M	K2.5 @ $3.00/M	GLM-5 @ $1.92/M
Code Generation	5/5	4/5	4/5	3/5
Chinese	4/5	4/5	5/5	5/5
English	5/5	4/5	4/5	4/5
Reasoning	4/5	4/5	5/5	4/5
Speed	5/5	4/5	3/5	4/5
Vision/Multimodal	Limited	Yes (VL, Omni)	No	Yes (GLM-4.6V)
Context Window	128K	128K	128K	128K
API Compatibility	OpenAI	OpenAI	OpenAI	OpenAI

The 128K context is table stakes now. Every vendor cleared that bar, so I stopped tracking it as a differentiator. What matters is what's done with those tokens.

DeepSeek: The Value Workhorse

I'll start here because, imo, DeepSeek is the model most teams should default to.

I migrated a customer support summarization pipeline from GPT-4o-mini to DeepSeek V4 Flash two months ago. Monthly cost dropped from $312 to $41. Quality scores from my eval suite actually went up by 3%. I still don't fully understand how that's possible, but I'm not complaining.

The Lineup

Model	Output $/M	What I Use It For
V4 Flash	$0.25	Default everything
V3.2	$0.38	When I want the latest architecture
V4 Pro	$0.78	Production paths where quality matters
R1 (Reasoner)	$2.50	Math, logic chains, anything requiring chain-of-thought
Coder	$0.25	Code-specific tasks, same price as Flash

What Works

The price-to-performance ratio is genuinely absurd. V4 Flash at $0.25/M is the same price tier as the cheapest GPT-4o-mini variant, but the output quality feels closer to GPT-4o. On HumanEval and MBPP, DeepSeek's coding models are consistently at the top of the open-weight leaderboards.

Speed is the second thing I noticed. V4 Flash consistently hits ~60 tokens/sec on my benchmark, which is the fastest of any model I tested. For chat UIs where latency is UX, that matters more than people admit.
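
For chat UIs, the number users feel is time to first token, not just overall tokens/sec. Here is a rough sketch of measuring both from a streamed response. It assumes you feed it the text pieces from a streaming call (with the OpenAI SDK, the delta.content of each chunk when stream=True is set). The function name and the injectable clock are illustrative choices, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Optional

def measure_stream(
    chunks: Iterable[Optional[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume streamed text pieces and report first-token and total latency.

    `chunks` is any iterable of text pieces. Empty or None pieces (for
    example role-only deltas) are skipped and do not count as a first token.
    """
    start = clock()
    first: Optional[float] = None
    pieces = []
    for piece in chunks:
        if not piece:
            continue
        if first is None:
            first = clock() - start  # time to first visible token
        pieces.append(piece)
    total = clock() - start
    return {"ttft": first, "total": total, "text": "".join(pieces)}
```

Passing the clock as a parameter keeps the function testable without a network call. In real use the default time.perf_counter is fine.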

English quality is solid. I ran it through my usual battery of MMLU subsets, and it lands within noise of Western flagship models.

What Doesn't

Vision is the obvious gap. DeepSeek doesn't have a native multimodal model that I've found, so anything image-related goes to Qwen or GLM.

Chinese-language quality is "very good" rather than "best in class." For pure Chinese understanding or generation, Kimi and GLM both edge it out. The difference is small, maybe 2-4% on benchmarks, but it exists.

Model variety is narrower than Qwen. If you need a 70B parameter sweet spot or a 200B beast, DeepSeek doesn't have one.

Code Example

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}]
)
print(response.choices[0].message.content)

This is the exact pattern I use in production. Swap deepseek-v4-flash for any other model name and it just works.

Qwen: The Swiss Army Knife

If DeepSeek is a scalpel, Qwen is the entire operating room. Alibaba ships so many models that I genuinely lost track during testing.

The Lineup

Model	Output $/M	What I Use It For
Qwen3-8B	$0.01	Classification, routing, ultra-cheap preprocessing
Qwen3-32B	$0.28	My general-purpose default for non-critical paths
Qwen3-Coder-30B	$0.35	When I want bigger context than DeepSeek Coder
Qwen3-VL-32B	$0.52	Image understanding
Qwen3-Omni-30B	$0.52	Audio + video + image in one call
Qwen3.5-397B	$2.34	When I genuinely need the big brain

What Works

The breadth is unmatched. Qwen3-8B at $0.01/M is so cheap that I use it as a routing classifier to decide whether to escalate to a larger model. That single trick cut my average inference cost by 40%.
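
The routing trick can be sketched in a few lines. This is an illustration, not the exact production setup: the prompt wording, the EASY/HARD labels, and the model IDs (apart from deepseek-v4-flash) are assumptions. The classify argument stands in for a call to the cheap classifier model.

```python
from typing import Callable

# Assumed model IDs; only "deepseek-v4-flash" appears verbatim in this post.
CHEAP_MODEL = "deepseek-v4-flash"
STRONG_MODEL = "kimi-k2.5"

ROUTER_PROMPT = (
    "Classify the user request as EASY or HARD. "
    "Answer with exactly one word.\n\nRequest: {request}"
)

def pick_model(request: str, classify: Callable[[str], str]) -> str:
    """Ask a small classifier for a label and map it to a target model.

    `classify` takes the router prompt and returns the classifier's raw text.
    Anything other than a clear EASY is escalated, so a confused classifier
    costs money rather than quality.
    """
    label = classify(ROUTER_PROMPT.format(request=request)).strip().upper()
    return CHEAP_MODEL if label == "EASY" else STRONG_MODEL
```

Defaulting to the stronger model on an unclear label is a design choice. Flip it if your workload tolerates the occasional weak answer better than the extra spend.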

Vision models are genuinely good. Qwen3-VL-32B handles OCR, chart understanding, and document parsing better than I expected. Omni is even more interesting, since it takes audio, video, and image in one request. I haven't found a Western equivalent at that price.
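
For image inputs, OpenAI-compatible endpoints generally take a list of content parts instead of a plain string. Here is a minimal sketch of building such a message with an inline base64 image. The content-part layout is the standard OpenAI chat-completions format. Whether the gateway accepts it for a given vision model is an assumption to verify, and the helper name is illustrative.

```python
import base64

def image_message(question: str, image_bytes: bytes, mime: str = "image/png") -> list[dict]:
    """Build a one-turn chat message carrying text plus an inline image.

    The image is embedded as a base64 data URL, so no separate upload
    step or public image host is needed.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }]
```

The returned list goes straight into the messages argument of client.chat.completions.create, alongside whichever vision model ID your provider exposes.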

The Alibaba infrastructure backing is real. I've never had an outage hit a Qwen model, and time to first token is consistently under 2 seconds on the 32B variants.

What Doesn't

Naming conventions are a mess. Qwen3, Qwen3.5, Qwen3.6, with variant letters I still don't fully understand. I keep a sticky note on my monitor. This is annoying enough that I'd dock them a point if I were scoring.

English quality is solid but not exceptional. On creative writing benchmarks, it lands slightly behind DeepSeek. Not by much, but consistently.

Some models are overpriced. Qwen3.6-35B at roughly $1/M feels steep when Qwen3-32B is $0.28/M and almost as good. I'm not sure who the target buyer is.

Kimi: The Reasoning Specialist

Kimi is the model I reach for when I'm stuck. It thinks harder than the others, and you can see the difference in the output.

The Lineup

Model	Output $/M	What I Use It For
K2.5	$3.00	Hard reasoning, research synthesis
K2-Pro	$3.50	Maximum reasoning quality

What Works

The reasoning benchmarks are real. On GPQA, MATH, and the harder MMLU subsets, K2.5 outperforms everyone else in this comparison by 5-8%. If you have a task where getting the right answer matters more than speed or cost, Kimi is the answer.

Chinese-language quality is best-in-class, tied with GLM in my tests. For long-form Chinese content generation, Kimi has a slightly more natural voice.

Context handling at 128K is reliable. I dropped a 90K-token legal document in once, and it actually cited clauses correctly when I asked questions. That alone saved me a week's worth of manual review.

What Doesn't

Speed. Kimi is the slowest of the four, by a lot. K2.5 clocks around 25 tokens/sec on my benchmark, less than half of DeepSeek V4 Flash's ~60. For interactive UIs, that latency is noticeable.

Price is the real issue. $3.00/M for K2.5 means you cannot default to this model on cost-sensitive workloads. I only route to Kimi when other models fail my eval suite.
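
Routing to the expensive model only on failure amounts to a fallback chain. Here is a minimal sketch, assuming you already have an eval check that scores a single answer. The generate and passes callables are placeholders for your own API call and eval, and the function name is illustrative.

```python
from typing import Callable, Sequence, Tuple

def answer_with_escalation(
    prompt: str,
    models: Sequence[str],
    generate: Callable[[str, str], str],
    passes: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models in order (cheapest first) and return (model, answer).

    `generate(model, prompt)` calls the API; `passes(answer)` is your own
    eval check. If nothing passes, the last model's answer is returned.
    """
    if not models:
        raise ValueError("models must not be empty")
    answer = ""
    for model in models:
        answer = generate(model, prompt)
        if passes(answer):
            return model, answer
    return models[-1], answer
```

The catch is that a failed cheap attempt is still billed, so this only pays off when the cheap model passes most of the time.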

No vision. Like DeepSeek, Kimi doesn't have a multimodal variant. For image tasks, you're going to Qwen or GLM.

There's no cheap tier. Kimi has no "Flash" equivalent at $0.25/M, so you're stuck paying premium pricing or routing elsewhere.

GLM: The Underrated Wildcard

GLM is the model I underestimated. I started testing it as an afterthought, and it's now handling two production pipelines.

The Lineup

Model	Output $/M	What I Use It For
GLM-4-9B	$0.01	Cheap Chinese-language routing
GLM-5	$1.92	Best overall GLM option

What Works

Chinese-language quality is genuinely top-tier. For pure Chinese tasks, GLM-5 ties or slightly beats Kimi in my internal evals. If your users are primarily Chinese-speaking, this matters.

GLM-4.6V is a surprisingly competent vision model. For document understanding and OCR-heavy tasks in Chinese, it's my pick over Qwen3-VL.

Pricing on the small model is absurd. GLM-4-9B at $0.01/M is one of the cheapest models on any provider. I use it as a classifier.

What Doesn't

Code generation is the weakest of the four. On HumanEval, GLM lands 5-10 points below DeepSeek and Qwen. Not unusable, just not where I'd send critical code paths.

English quality is good but not exceptional. For English-heavy workloads, GLM feels a half-step behind the alternatives.

Documentation and ecosystem are thinner. I had to dig through Zhipu's repos to find good examples. Qwen's ecosystem is more mature.

The Real Workloads

Synthetic benchmarks are nice, but here's what actually shipped to production after testing:

Workload 1: Customer support summarization → DeepSeek V4 Flash. The 3% quality bump and 87% cost reduction are why this is no longer a question.

Workload 2: Document Q&A on Chinese contracts → GLM-5. Beat Kimi by 2% on my eval suite at 36% lower cost. No-brainer.

Workload 3: Multimodal receipt parsing → Qwen3-VL-32B. The only one of the four that does OCR well at this price point.

Workload 4: Math and logic chains → Kimi K2.5. Worth every penny when the answer has to be right.

Workload 5: Routing classifier → Qwen3-8B at $0.01/M. Too cheap to bother comparing alternatives.
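
The percentages in Workloads 1 and 2 follow directly from numbers quoted earlier in the post. Here is a quick check, assuming the 36% refers to the gap in output list price between K2.5 and GLM-5 at similar token counts:

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100

# Workload 1: monthly bill, $312 -> $41
support_saving = percent_reduction(312, 41)
# Workload 2: output price per million tokens, Kimi K2.5 $3.00 -> GLM-5 $1.92
contracts_saving = percent_reduction(3.00, 1.92)

print(f"{support_saving:.0f}% and {contracts_saving:.0f}%")  # prints "87% and 36%"
```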

My Actual Recommendations

If you're a backend engineer picking a default model today, here's what I'd do:

Default: DeepSeek V4

DEV Community
