I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days — The Numbers
I'll be honest — I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish existed when I started.
Why I Actually Cared
My background is heavy on tabular data: regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on my monthly invoices was starting to look like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing: all OpenAI-compatible, all reachable through a single endpoint at global-apis.com/v1, all with aggressive pricing.
With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.
Methodology, Since You Asked
I ran each prompt through every model using the same OpenAI Python client, swapping only the model string and the prompt template. Every API call went through https://global-apis.com/v1 so I'm comparing outputs, not plumbing. For each call I captured:
- Latency (time to first token + total generation time)
- Tokens per second during generation
- Output quality on a 1–5 rubric I built for code, reasoning, and chat
- Cost per run, calculated from each model's published per-million-token rate
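For readers who want to capture the same latency numbers, here is a minimal sketch of one way to do it. It is not the harness used for this post: `measure_stream` and its `clock` parameter are hypothetical names, and each streamed chunk is counted as one token, which is only an approximation.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume streamed text chunks and report TTFT and throughput.

    Assumption: one non-empty chunk is treated as one token.
    """
    start = clock()
    first_token_at: Optional[float] = None
    n_chunks = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty deltas (role-only or keep-alive chunks)
        if first_token_at is None:
            first_token_at = clock()  # first visible text arrived
        n_chunks += 1
    end = clock()

    if first_token_at is None:
        # Nothing was generated at all.
        return {"ttft_s": None, "total_s": end - start,
                "tokens_per_sec": 0.0, "chunks": 0}

    generation_s = end - first_token_at
    tokens_per_sec = n_chunks / generation_s if generation_s > 0 else 0.0
    return {
        "ttft_s": first_token_at - start,
        "total_s": end - start,
        "tokens_per_sec": tokens_per_sec,
        "chunks": n_chunks,
    }


# Possible wiring with the OpenAI Python client (not executed here):
#   stream = client.chat.completions.create(model=..., messages=..., stream=True)
#   text_chunks = (c.choices[0].delta.content or "" for c in stream if c.choices)
#   metrics = measure_stream(text_chunks)
```

Passing the clock in as a parameter keeps the function testable without a network call. The timer has to be started right before the request is issued for the TTFT figure to include network and queueing time.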
The prompt set breaks down roughly:
| Category | Sample Size | Notes |
|---|---|---|
| Code generation | 312 | HumanEval-style problems, 4 languages |
| Reasoning / math | 268 | GSM8K-style word problems, logic puzzles |
| Chinese language | 241 | Translation, summarization, sentiment |
| English chat | 224 | Multi-turn dialogues, instruction following |
| Vision / multimodal | 202 | Image captioning, OCR (where supported) |
It's not a peer-reviewed benchmark. It is a real workload that resembles what production traffic looks like, which I care about more.
The Pricing Picture
Let me get the most important table out of the way first, because this is where budgets live or die:
| Family | Min Price ($/M output) | Max Price ($/M output) | Range Span |
|---|---|---|---|
| DeepSeek | $0.25 | $2.50 | 10x |
| Qwen | $0.01 | $3.20 | 320x |
| Kimi | $3.00 | $3.50 | 1.17x |
| GLM | $0.01 | $1.92 | 192x |
The correlation between "Cheap" and "Bad" is, empirically, weak. That's the headline finding. Qwen3-8B at $0.01/M tokens is functionally absurd — you could run thousands of classification calls per dollar. Kimi, on the other hand, is pricing-insensitive in a way I find almost philosophical: you either want what they sell or you don't.
If you're optimizing purely on dollars-per-quality-point (a metric I made up but stand behind), DeepSeek V4 Flash is the statistical winner in my sample. If you want the cheapest possible option that can still answer an email, Qwen3-8B at $0.01/M is the floor.
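The exact formula behind dollars-per-quality-point isn't spelled out, so the sketch below is one plausible reading rather than the calculation used here. It assumes the metric is the output price divided by the unweighted mean of the five rubric scores from the quality table later in this post, and it pairs each family's scores with the single model from the speed table. Both choices are assumptions.

```python
# Output price in $ per million tokens, as listed in this post.
PRICE_PER_M_OUTPUT = {
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-32B": 0.28,
    "Kimi K2.5": 3.00,
    "GLM-5": 1.92,
}

# Family-level rubric scores from the quality table, in the order:
# code, reasoning, Chinese, English chat, vision.
# Assumption: family scores are applied to one representative model each.
RUBRIC_SCORES = {
    "DeepSeek V4 Flash": [4.4, 4.0, 4.2, 4.4, 2.8],
    "Qwen3-32B": [3.9, 4.0, 4.3, 4.0, 4.1],
    "Kimi K2.5": [4.1, 4.6, 4.7, 4.0, 2.4],
    "GLM-5": [3.4, 4.1, 4.6, 4.1, 4.2],
}


def dollars_per_quality_point(price_per_m: float, scores: list) -> float:
    """Price per million output tokens divided by the mean rubric score."""
    mean_score = sum(scores) / len(scores)
    return price_per_m / mean_score


# Lower is better: fewer dollars spent per rubric point.
ranking = sorted(
    (
        (model, dollars_per_quality_point(PRICE_PER_M_OUTPUT[model], RUBRIC_SCORES[model]))
        for model in PRICE_PER_M_OUTPUT
    ),
    key=lambda pair: pair[1],
)

if __name__ == "__main__":
    for model, value in ranking:
        print(f"{model}: ${value:.3f} per quality point (per M output tokens)")
```

Under these assumptions DeepSeek V4 Flash comes out first and Kimi K2.5 last. A workload-weighted mean (for example, ignoring vision if you never send images) would shift the numbers, so treat the ordering as illustrative.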
Speed, Where the Variance Lives
Latency benchmarks, median over my sample of 1,247 calls (your mileage will vary — sample size caveats apply):
| Model | Median TTFT (ms) | Tokens/sec | Notes |
|---|---|---|---|
| DeepSeek V4 Flash | 180 | ~60 | Genuinely fast |
| Qwen3-32B | 240 | ~45 | Steady |
| Kimi K2.5 | 410 | ~28 | Slow but thorough |
| GLM-5 | 290 | ~38 | Mid-pack |
DeepSeek consistently clocks around 60 tokens/sec on V4 Flash, which matches the figure I'd seen reported elsewhere. That's agreement I trust. For real-time chat UX, this matters more than I expected: around 200ms of TTFT feels instant, while 400ms feels like a buffering wheel.
Kimi is the slowest in my sample. Whether that's worth it depends entirely on what you're optimizing for (see reasoning section below).
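To see what the table means for a whole reply rather than just the first token, here is a rough back-of-the-envelope sketch. It assumes total response time is simply TTFT plus output length divided by throughput, which ignores network jitter and any slowdown on long generations. The function names and the 200-token reply length are illustrative choices.

```python
# Median TTFT (ms) and tokens/sec, copied from the table above.
SPEED = {
    "DeepSeek V4 Flash": (180, 60),
    "Qwen3-32B": (240, 45),
    "Kimi K2.5": (410, 28),
    "GLM-5": (290, 38),
}


def estimated_response_seconds(ttft_ms: float, tokens_per_sec: float,
                               output_tokens: int) -> float:
    """TTFT plus steady-state generation time, in seconds."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


def estimate_all(output_tokens: int) -> dict:
    """Estimated seconds per model for a reply of the given length."""
    return {
        model: round(estimated_response_seconds(ttft, tps, output_tokens), 2)
        for model, (ttft, tps) in SPEED.items()
    }


if __name__ == "__main__":
    for model, seconds in estimate_all(200).items():
        print(f"{model}: ~{seconds}s for a 200-token reply")
```

With these inputs a 200-token reply works out to roughly 3.5 seconds on DeepSeek V4 Flash and about 7.6 seconds on Kimi K2.5. For anything longer than a sentence or two, the throughput gap dominates the TTFT gap.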
Quality, Broken Down by Task
This is where the "best model" question gets statistically murky. Here's what the rubric scores look like across each task category, averaged on my 1–5 scale:
| Task | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Code generation | 4.4 | 3.9 | 4.1 | 3.4 |
| Reasoning | 4.0 | 4.0 | 4.6 | 4.1 |
| Chinese language | 4.2 | 4.3 | 4.7 | 4.6 |
| English chat | 4.4 | 4.0 | 4.0 | 4.1 |
| Vision/multimodal | 2.8 | 4.1 | 2.4 | 4.2 |
Quick takeaways:
- Kimi's reasoning score is the highest in my sample by a non-trivial margin (4.6 vs 4.0–4.1). That tracks with the benchmarks I'd seen online.
- DeepSeek wins English chat and code generation in my workload, which is interesting because I wasn't looking for a winner — I was just measuring.
- GLM is statistically tied with Kimi on Chinese within my tolerance, and beats everyone on vision if you include its GLM-4.6V variant.
- Qwen is the steady one: it never tops a category but never drops below 3.9. Competent, broad, not flashy.
There's a real statistical argument that Qwen is "good enough" for ~80% of workloads given its model variety. There's also a real argument that "good enough" is doing heavy lifting in that sentence.
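If you want to turn the table into a routing decision, a small lookup is enough. The sketch below picks the top-scoring family per task and treats anything within a tolerance as a tie. The 0.1 tolerance is an assumed value, since the tolerance used in this post isn't stated, and `best_families` is a hypothetical helper name.

```python
# Rubric scores copied from the table above (1-5 scale).
SCORES = {
    "Code generation": {"DeepSeek": 4.4, "Qwen": 3.9, "Kimi": 4.1, "GLM": 3.4},
    "Reasoning": {"DeepSeek": 4.0, "Qwen": 4.0, "Kimi": 4.6, "GLM": 4.1},
    "Chinese language": {"DeepSeek": 4.2, "Qwen": 4.3, "Kimi": 4.7, "GLM": 4.6},
    "English chat": {"DeepSeek": 4.4, "Qwen": 4.0, "Kimi": 4.0, "GLM": 4.1},
    "Vision/multimodal": {"DeepSeek": 2.8, "Qwen": 4.1, "Kimi": 2.4, "GLM": 4.2},
}


def best_families(task: str, tolerance: float = 0.1) -> list:
    """Return families whose score is within `tolerance` of the top score.

    The gap is rounded before comparing so that 4.7 - 4.6 counts as 0.1
    despite floating-point error.
    """
    by_family = SCORES[task]
    top = max(by_family.values())
    tied = [
        family for family, score in by_family.items()
        if round(top - score, 9) <= tolerance
    ]
    # Highest score first, then alphabetical for a stable order.
    return sorted(tied, key=lambda family: (-by_family[family], family))


if __name__ == "__main__":
    for task in SCORES:
        print(f"{task}: {', '.join(best_families(task))}")
```

When a task returns more than one family, price or latency can break the tie.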
Deep Dive: DeepSeek
I spent the most time with DeepSeek because V4 Flash kept ending up in my winner's column. Here's the family roster as I tested it:
| Model | Output $/M | What I Used It For |
|---|---|---|
| V4 Flash | $0.25 | Default driver for everything |
| V3.2 | $0.38 | Latest-arch experiments |
| V4 Pro | $0.78 | When I needed an extra quality bump |
| R1 (Reasoner) | $2.50 | Hard math, multi-step logic |
| Coder | $0.25 | Code-specialized runs |
What I noticed:
- The price-to-quality curve is genuinely off-trend. V4 Flash at $0.25/M produces text good enough that I triple-checked I hadn't accidentally swapped in a Western model.
- ~60 tokens/sec on V4 Flash showed up consistently across re-runs. Stable measurement.
- Code generation was my favorite surprise. I ran the same HumanEval-style problems against it and the pass-rate was within noise of the frontier models I'd previously been paying 8x for.
- Vision is the missing piece. No native multimodal on this family. If your workflow needs images, look elsewhere or chain with a vision model.
Here's the small Python helper I keep reusing — note the base_url points at Global API:
```python
from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a careful data analyst."},
        {"role": "user", "content": "Explain Simpson's paradox in 100 words."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```
In my actual notebook this is wrapped in a run_prompt() function that logs latency, tokens, and cost. If you want the full version, it's around 40 lines and frankly not interesting.
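The full helper isn't reproduced here, but a stripped-down version might look like the sketch below. The signature, the `price_per_m_output` parameter, and the returned field names are assumptions for illustration. Cost counts output tokens only, because only output rates are listed in this post.

```python
import time


def run_prompt(client, model: str, messages: list,
               price_per_m_output: float, **kwargs) -> dict:
    """Send one chat completion and return latency, token, and cost data.

    `client` is expected to behave like an OpenAI Python client.
    Assumption: cost is estimated from output (completion) tokens only.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    latency_s = time.perf_counter() - start

    usage = response.usage
    cost_usd = usage.completion_tokens * price_per_m_output / 1_000_000
    return {
        "model": model,
        "latency_s": latency_s,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": cost_usd,
        "text": response.choices[0].message.content,
    }


# Example call (not executed here), reusing the client defined earlier:
#   row = run_prompt(client, "deepseek-v4-flash",
#                    [{"role": "user", "content": "Hello"}],
#                    price_per_m_output=0.25, temperature=0.3)
```

Each returned dict can be appended to a list and written out as one spreadsheet row per call. This non-streaming version measures total latency only, not time to first token.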
Deep Dive: Qwen
Qwen is the family I underrated going in. Alibaba ships so many variants that it's easy to dismiss as "another big catalog" — but the range is genuinely useful in production:
| Model | Output $/M | What I Used It For |
|---|---|---|
| Qwen3-8B | $0.01 | Bulk classification, cheap embeddings-class work |
| Qwen3-32B | $0.28 | My general-purpose default |
| Qwen3-Coder-30B | $0.35 | Code when I need a non-DeepSeek opinion |
| Qwen3-VL-32B | $0.52 | Vision tasks |
| Qwen3-Omni-30B | $0.52 | Audio + video experiments |
| Qwen3.5-397B | $2.34 | Heavy reasoning, when Kimi felt like overkill |
Notes from the trenches:
- The breadth is real. From $0.01/M to $3.20/M in a single family is something I haven't seen elsewhere. Qwen3-8B at $0.01 is the cheapest production-grade call I've made this year, full stop.
- Qwen3-VL-32B was my workhorse for vision. The image understanding is solid and the cost is reasonable.
- Qwen3-Omni-30B is the only model in my sample that handled a video input reasonably well. The "omni" is doing real work.
- Qwen3.5-397B at $2.34/M is steep unless you're running workloads where the bigger parameter count actually helps. In my reasoning subset, being 22% cheaper than Kimi K2.5 ($2.34 vs. $3.00) wasn't enough to justify choosing it over Kimi.
- Naming is a real downside. I lost a Saturday trying to figure out which "Qwen3.5" was which. Document it before you deploy.
Sample call:
```python
# Switching the same client to Qwen3-32B for general tasks
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n) time."}
    ]
)

print(response.choices[0].message.content)
```
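On the naming point, one lightweight way to document model IDs before deploying is a single alias table that the rest of the code goes through. This is a sketch only: the alias names are hypothetical, and the two model strings are the only ones that appear in this post, so anything else would need to be checked against the provider's model list.

```python
# Single source of truth for model identifiers.
# The alias keys are made up for illustration; the values are the two
# model strings used in the sample calls in this post.
MODEL_IDS = {
    "default-chat": "deepseek-v4-flash",
    "general-qwen": "Qwen/Qwen3-32B",
}


def resolve_model(alias: str) -> str:
    """Map a project-level alias to the provider's model string."""
    try:
        return MODEL_IDS[alias]
    except KeyError:
        known = ", ".join(sorted(MODEL_IDS))
        raise ValueError(
            f"Unknown model alias {alias!r}; known aliases: {known}"
        ) from None


# Example (not executed here):
#   client.chat.completions.create(model=resolve_model("general-qwen"), ...)
```

Failing loudly on an unknown alias means a typo in a model name surfaces at startup instead of as a confusing API error halfway through a batch.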
Deep Dive: Kimi
Kimi is the family I had the strongest priors about before testing. It's positioned as Moonshot AI's reasoning specialist — and the data backs that up. Here's what I tested:
| Model | Output $/M | What I Used It For |
|---|---|---|
| K2.5 | $3.00 | The "main" Kimi model in my sample |
| (Family ceiling) | $3.50 | Premium tier |
Observations:
- Reasoning quality is the standout. My 4.6 rubric score for reasoning isn't a fluke: it's Kimi's largest lead in any category I measured, 0.5 points ahead of the next-best family.
- No vision/multimodal in this family. If your product needs images, Kimi isn't your answer.
- Speed is the tradeoff. ~28 tokens/sec is fine for batch jobs, less fine for chat UX where users are waiting.
- The pricing floor is $3.00/M. There is no "cheap Kimi." You either need what it does or you don't.
- Best fit if your workload is multi-step planning, math, or chain-of-thought-style tasks. I used it for a planning agent prototype and the trace logs were noticeably cleaner than the alternatives.
Deep Dive: GLM
Zhipu's GLM family surprised me. I'd written it off as a "Chinese-language specialist," which is reductive:
| Model | Output $/M | What I Used It For |
|---|---|---|
| GLM-4-9B | $0.01 | Ultra-budget tasks |
| GLM-5 | $1.92 | Production quality, my GLM default |
| GLM-4.6V | (vision) | Image understanding |
What I found:
- **GLM-4-