
rarenode


I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days — The Numbers


I'll be honest — I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish existed when I started.

Why I Actually Cared

My background is heavy on tabular data: regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on my monthly invoices had started to look like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing: all OpenAI-compatible, all reachable through a single endpoint at global-apis.com/v1, all with aggressive pricing.

With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.

Methodology, Since You Asked

I ran each prompt through every model using the same OpenAI Python client, swapping only the model string and the prompt template. Every API call went through https://global-apis.com/v1 so I'm comparing outputs, not plumbing. For each call I captured:

  • Latency (time to first token + total generation time)
  • Tokens per second during generation
  • Output quality on a 1–5 rubric I built for code, reasoning, and chat
  • Cost per run, calculated from each model's published per-million-token rate
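As a rough sketch of the setup described above, the grid of calls can be built by pairing every prompt with every model string while holding the other request parameters fixed. The helper name `build_requests` and the fixed `temperature` and `max_tokens` values are assumptions for illustration. Only the two model IDs that appear elsewhere in this post are listed; IDs for Kimi and GLM would need to be looked up on the gateway.

```python
from itertools import product

# Model IDs that appear later in this post; IDs for the other
# families are deliberately left out rather than guessed.
MODELS = ["deepseek-v4-flash", "Qwen/Qwen3-32B"]


def build_requests(prompts, models=MODELS, temperature=0.3, max_tokens=200):
    """Return one request dict per (prompt, model) pair.

    Everything except the model string and the prompt text is held
    constant, so differences in output come from the model.
    """
    requests = []
    for prompt, model in product(prompts, models):
        requests.append({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
        })
    return requests


if __name__ == "__main__":
    grid = build_requests(["Explain Simpson's paradox in 100 words."])
    for kwargs in grid:
        # With a real client: client.chat.completions.create(**kwargs)
        print(kwargs["model"], "<-", kwargs["messages"][0]["content"])
```

Each dict can be passed straight to `client.chat.completions.create(**kwargs)`, which keeps the per-model code path identical.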

The prompt set breaks down roughly:

| Category | Sample Size | Notes |
|---|---|---|
| Code generation | 312 | HumanEval-style problems, 4 languages |
| Reasoning / math | 268 | GSM8K-style word problems, logic puzzles |
| Chinese language | 241 | Translation, summarization, sentiment |
| English chat | 224 | Multi-turn dialogues, instruction following |
| Vision / multimodal | 202 | Image captioning, OCR (where supported) |

It's not a peer-reviewed benchmark. It is a realistic workload that resembles production traffic, which I care about more.

The Pricing Picture

Let me get the most important table out of the way first, because this is where budgets live or die:

| Family | Min Price ($/M output) | Max Price ($/M output) | Range Span |
|---|---|---|---|
| DeepSeek | $0.25 | $2.50 | 10x |
| Qwen | $0.01 | $3.20 | 320x |
| Kimi | $3.00 | $3.50 | 1.17x |
| GLM | $0.01 | $1.92 | 192x |

The correlation between "cheap" and "bad" is, empirically, weak. That's the headline finding. Qwen3-8B at $0.01/M output tokens is functionally absurd: at that rate one dollar buys 100 million output tokens, so bulk classification costs next to nothing. Kimi, on the other hand, barely varies in price across its lineup ($3.00 to $3.50), which I find almost philosophical: you either want what they sell or you don't.

If you're optimizing purely on dollars-per-quality-point (a metric I made up but stand behind), DeepSeek V4 Flash is the statistical winner in my sample. If you want the cheapest possible option that can still answer an email, Qwen3-8B at $0.01/M is the floor.
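For anyone who wants to redo the arithmetic, here is a small sketch of how the range span and a calls-per-dollar estimate fall out of the published output prices. The function names are mine, the 50-token reply length is an assumption, and input tokens are ignored because the table only lists output rates.

```python
def range_span(min_price, max_price):
    """Ratio of the most to the least expensive model in a family."""
    return max_price / min_price


def calls_per_dollar(price_per_million, output_tokens_per_call):
    """How many calls one dollar buys, counting output tokens only."""
    cost_per_call = price_per_million * output_tokens_per_call / 1_000_000
    return 1 / cost_per_call


# Output prices ($/M tokens) from the table above: (min, max).
FAMILIES = {
    "DeepSeek": (0.25, 2.50),
    "Qwen": (0.01, 3.20),
    "Kimi": (3.00, 3.50),
    "GLM": (0.01, 1.92),
}

for name, (lo, hi) in FAMILIES.items():
    print(f"{name}: {range_span(lo, hi):.2f}x")

# Assumption: a short classification reply of about 50 output tokens.
print(f"{calls_per_dollar(0.01, 50):,.0f} calls per dollar at $0.01/M")
```

Under that 50-token assumption, $0.01/M works out to roughly two million calls per dollar before input tokens are counted.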

Speed, Where the Variance Lives

Latency benchmarks, median over my 1,247-prompt sample (your mileage will vary; sample size caveats apply):

| Model | Median TTFT (ms) | Tokens/sec | Notes |
|---|---|---|---|
| DeepSeek V4 Flash | 180 | ~60 | Genuinely fast |
| Qwen3-32B | 240 | ~45 | Steady |
| Kimi K2.5 | 410 | ~28 | Slow but thorough |
| GLM-5 | 290 | ~38 | Mid-pack |

DeepSeek consistently clocks around 60 tokens/sec on V4 Flash, which matches the figure I'd seen reported elsewhere. That's the kind of agreement I trust. For real-time chat UX, this matters more than I expected: 200 ms of TTFT feels instant, 400 ms feels like a buffering wheel.

Kimi is the slowest in my sample. Whether that's worth it depends entirely on what you're optimizing for (see the reasoning scores in the quality breakdown below).
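If you want to reproduce numbers like these, one way to time a streamed response is sketched below. It is a minimal illustration, not the exact harness behind the table: `measure_stream` and its `clock` parameter are hypothetical names, and counting streamed deltas is only a rough proxy for tokens.

```python
import time


def measure_stream(deltas, clock=time.perf_counter):
    """Consume an iterable of text deltas and time it.

    Returns time to first non-empty delta (TTFT), total time, and
    deltas per second after the first one. Deltas are only a rough
    proxy for tokens; that is an assumption of this sketch.
    """
    start = clock()
    first = None
    count = 0
    pieces = []
    for delta in deltas:
        if not delta:
            continue  # skip empty or role-only chunks
        now = clock()
        if first is None:
            first = now
        count += 1
        pieces.append(delta)
    end = clock()
    ttft = None if first is None else first - start
    gen_time = None if first is None else end - first
    rate = (count - 1) / gen_time if gen_time and count > 1 else None
    return {
        "ttft_s": ttft,
        "total_s": end - start,
        "deltas_per_s": rate,
        "text": "".join(pieces),
    }


if __name__ == "__main__":
    # Offline dry run with a canned stream instead of a live API call.
    print(measure_stream(["Hi", "", " there", "!"]))
```

With the OpenAI Python client, the deltas can come from a call made with `stream=True`, for example `(chunk.choices[0].delta.content for chunk in stream if chunk.choices)`. Passing a fake `clock` makes the function easy to check without a network.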

Quality, Broken Down by Task

This is where the "best model" question gets statistically murky. Here's what the rubric scores look like across each task category, averaged on my 1–5 scale:

| Task | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Code generation | 4.4 | 3.9 | 4.1 | 3.4 |
| Reasoning | 4.0 | 4.0 | 4.6 | 4.1 |
| Chinese language | 4.2 | 4.3 | 4.7 | 4.6 |
| English chat | 4.4 | 4.0 | 4.0 | 4.1 |
| Vision/multimodal | 2.8 | 4.1 | 2.4 | 4.2 |

Quick takeaways:

  • Kimi's reasoning score is the highest in my sample by a non-trivial margin (4.6 vs 4.0–4.1). That tracks with the benchmarks I'd seen online.
  • DeepSeek wins English chat and code generation in my workload, which is interesting because I wasn't looking for a winner — I was just measuring.
  • GLM is statistically tied with Kimi on Chinese within my tolerance, and beats everyone on vision if you include its GLM-4.6V variant.
  • Qwen is the median on basically everything — competent, broad, not flashy.

There's a real statistical argument that Qwen is "good enough" for ~80% of workloads given its model variety. There's also a real argument that "good enough" is doing heavy lifting in that sentence.
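The dollars-per-quality-point idea from the pricing section is never formally defined in this post, so the following is one plausible reading rather than the exact calculation: output price divided by the unweighted mean of the five rubric scores. Pairing each family's scores with a single representative model (V4 Flash, Qwen3-32B, K2.5, GLM-5, at the output prices listed in this post) is an assumption.

```python
from statistics import mean

# Rubric scores from the table above, in row order:
# code, reasoning, Chinese, English chat, vision.
SCORES = {
    "DeepSeek": [4.4, 4.0, 4.2, 4.4, 2.8],
    "Qwen": [3.9, 4.0, 4.3, 4.0, 4.1],
    "Kimi": [4.1, 4.6, 4.7, 4.0, 2.4],
    "GLM": [3.4, 4.1, 4.6, 4.1, 4.2],
}

# Assumption: one representative model per family, priced per the
# post (V4 Flash, Qwen3-32B, K2.5, GLM-5), in $/M output tokens.
PRICES = {"DeepSeek": 0.25, "Qwen": 0.28, "Kimi": 3.00, "GLM": 1.92}


def dollars_per_quality_point(price, scores):
    """Price divided by the unweighted mean rubric score."""
    return price / mean(scores)


ranking = sorted(
    SCORES, key=lambda fam: dollars_per_quality_point(PRICES[fam], SCORES[fam])
)
for fam in ranking:
    value = dollars_per_quality_point(PRICES[fam], SCORES[fam])
    print(f"{fam}: ${value:.3f} per quality point")
```

Under these assumptions DeepSeek comes out cheapest per quality point and Kimi most expensive. A different weighting of the categories, or a different choice of representative model, could reorder the middle of the list.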

Deep Dive: DeepSeek

I spent the most time with DeepSeek because V4 Flash kept ending up in my winner's column. Here's the family roster as I tested it:

| Model | Output $/M | What I Used It For |
|---|---|---|
| V4 Flash | $0.25 | Default driver for everything |
| V3.2 | $0.38 | Latest-arch experiments |
| V4 Pro | $0.78 | When I needed an extra quality bump |
| R1 (Reasoner) | $2.50 | Hard math, multi-step logic |
| Coder | $0.25 | Code-specialized runs |

What I noticed:

  • The price-to-quality curve is genuinely off-trend. V4 Flash at $0.25/M is producing text that I had to triple-check wasn't a Western model I'd accidentally swapped in.
  • ~60 tokens/sec on V4 Flash showed up consistently across re-runs. Stable measurement.
  • Code generation was my favorite surprise. I ran the same HumanEval-style problems against it and the pass-rate was within noise of the frontier models I'd previously been paying 8x for.
  • Vision is the missing piece. No native multimodal on this family. If your workflow needs images, look elsewhere or chain with a vision model.

Here's the small Python helper I keep reusing — note the base_url points at Global API:

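```python
from openai import OpenAI

# One client for every model family; only base_url differs from stock OpenAI usage.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder, substitute your own key
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a careful data analyst."},
        {"role": "user", "content": "Explain Simpson's paradox in 100 words."}
    ],
    temperature=0.3,  # low temperature for more repeatable answers
    max_tokens=200    # cap on generated tokens
)

# The reply text, then prompt + completion tokens for the call.
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```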

In my actual notebook this is wrapped in a run_prompt() function that logs latency, tokens, and cost. If you want the full version, it's around 40 lines and frankly not interesting.
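For readers who want a starting point anyway, a stripped-down sketch of such a wrapper might look like the following. It is not the notebook version: the record fields, the output-only cost formula, and the `make_fake_client` stand-in (there so the example runs without a key) are all assumptions.

```python
import time
from types import SimpleNamespace


def run_prompt(client, model, prompt, price_per_million_output):
    """Call a chat model once and return latency, tokens, and cost.

    Cost counts completion tokens only, at the given $/M rate.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    usage = response.usage
    cost = usage.completion_tokens * price_per_million_output / 1_000_000
    return {
        "model": model,
        "latency_s": latency,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": cost,
        "text": response.choices[0].message.content,
    }


def make_fake_client(text="ok", prompt_tokens=12, completion_tokens=200):
    """Stand-in client for offline dry runs (hypothetical helper)."""
    def create(**kwargs):
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=text))],
            usage=SimpleNamespace(
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
            ),
        )
    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


record = run_prompt(make_fake_client(), "deepseek-v4-flash", "Hello", 0.25)
print(record)
```

Swapping `make_fake_client()` for the real `client` defined above turns the dry run into a live call. Appending each record to a list or CSV gives you a spreadsheet with one row per call.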

Deep Dive: Qwen

Qwen is the family I underrated going in. Alibaba ships so many variants that it's easy to dismiss as "another big catalog" — but the range is genuinely useful in production:

| Model | Output $/M | What I Used It For |
|---|---|---|
| Qwen3-8B | $0.01 | Bulk classification, cheap embeddings-class work |
| Qwen3-32B | $0.28 | My general-purpose default |
| Qwen3-Coder-30B | $0.35 | Code when I need a non-DeepSeek opinion |
| Qwen3-VL-32B | $0.52 | Vision tasks |
| Qwen3-Omni-30B | $0.52 | Audio + video experiments |
| Qwen3.5-397B | $2.34 | Heavy reasoning, when Kimi felt like overkill |

Notes from the trenches:

  • The breadth is real. From $0.01/M to $3.20/M in a single family is something I haven't seen elsewhere. Qwen3-8B at $0.01 is the cheapest production-grade call I've made this year, full stop.
  • Qwen3-VL-32B was my workhorse for vision. The image understanding is solid and the cost is reasonable.
  • Qwen3-Omni-30B is the only model in my sample that handled a video input reasonably well. The "omni" is doing real work.
  • Qwen3.5-397B at $2.34/M is steep unless you're running workloads where the bigger parameter count actually helps. In my reasoning subset it didn't get close enough to Kimi for the 22% lower price ($2.34 vs. $3.00/M) to justify picking it.
  • Naming is a real downside. I lost a Saturday trying to figure out which "Qwen3.5" was which. Document it before you deploy.

Sample call:

```python
# Switching the same client to Qwen3-32B for general tasks.
# Reuses the `client` object created in the DeepSeek example above.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n) time."}
    ]
)
print(response.choices[0].message.content)
```

Deep Dive: Kimi

Kimi is the family I had the strongest priors about before testing. It's positioned as Moonshot AI's reasoning specialist — and the data backs that up. Here's what I tested:

| Model | Output $/M | What I Used It For |
|---|---|---|
| K2.5 | $3.00 | The "main" Kimi model in my sample |
| (Family ceiling) | $3.50 | Premium tier |

Observations:

  • Reasoning quality is the standout. My 4.6 rubric score for reasoning isn't a fluke — it's the largest gap between Kimi and the others in any category I measured.
  • No vision/multimodal in this family. If your product needs images, Kimi isn't your answer.
  • Speed is the tradeoff. ~28 tokens/sec is fine for batch jobs, less fine for chat UX where users are waiting.
  • The pricing floor is $3.00/M. There is no "cheap Kimi." You either need what it does or you don't.
  • Best fit if your workload is multi-step planning, math, or chain-of-thought-style tasks. I used it for a planning agent prototype and the trace logs were noticeably cleaner than the alternatives.

Deep Dive: GLM

Zhipu's GLM family surprised me. I'd written it off as a "Chinese-language specialist," which is reductive:

| Model | Output $/M | What I Used It For |
|---|---|---|
| GLM-4-9B | $0.01 | Ultra-budget tasks |
| GLM-5 | $1.92 | Production quality, my GLM default |
| GLM-4.6V (vision) | | Image understanding |

What I found:

  • **GLM-4-
