gentlenode

Posted on Jul 3

I Tested DeepSeek, Qwen, Kimi and GLM for a Month - Real Results

#ai #deepseek #webdev #python

Honestly, I never thought I'd be writing about Chinese AI models. Like, a year ago I was paying OpenAI $10/M output tokens and feeling good about it. Then I started poking around at what these Chinese labs were shipping and... well, let's just say my cloud bill looks very different now.

Here's the deal: DeepSeek, Qwen, Kimi, and GLM are basically the four horsemen of China's AI scene right now. Each one comes from a different lab, each one has a totally different vibe, and pricing ranges from literally pocket change to "is this a typo?" territory. I ran all four through real workloads for about a month. Here's what I actually found.

Why I Even Bothered

I run a tiny SaaS (a couple thousand in MRR, nothing fancy) and my biggest expense by FAR is API calls. I was bleeding money on GPT-4o for stuff that honestly didn't need to be that expensive. My buddy - shoutout to Wei - kept telling me Chinese models had caught up. I was skeptical. Then I tried DeepSeek V4 Flash on a whim and... yeah. My jaw kinda dropped.

The thing is, picking between these four is annoying because they're all good in different ways. DeepSeek is cheap. Qwen has like 47 models. Kimi is scary smart. GLM handles Chinese like a native (shocker, I know). So I spent a month actually using them on real client work instead of synthetic benchmarks. Pretty much everything below comes from that grind.

The Quick And Dirty Comparison

Before I get into the long stuff, here's the cheat sheet I wish someone had handed me on day one:

Thing	DeepSeek	Qwen	Kimi	GLM
Developer	DeepSeek (幻方)	Alibaba (阿里)	Moonshot AI (月之暗面)	Zhipu AI (智谱)
Price Range	$0.25-$2.50/M	$0.01-$3.20/M	$3.00-$3.50/M	$0.01-$1.92/M
Budget Pick	V4 Flash @ $0.25/M	Qwen3-8B @ $0.01/M	none (all premium)	GLM-4-9B @ $0.01/M
Top Dog	V4 Flash @ $0.25/M	Qwen3-32B @ $0.28/M	K2.5 @ $3.00/M	GLM-5 @ $1.92/M
Code Gen	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
Chinese	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
English	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Reasoning	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Speed	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Vision	Limited	Yes (VL, Omni)	Nope	Yes (GLM-4.6V)
Context	128K	128K	128K	128K
API Style	OpenAI	OpenAI	OpenAI	OpenAI

All four use OpenAI-compatible endpoints, which is HUGE because you can swap them in without rewriting your codebase. That's how I tested them - same prompts, different model strings, compare outputs. Speaking of which, lemme show you how easy this is.

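from openai import OpenAI

# Point the standard OpenAI client at the OpenAI-compatible gateway
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder, use your own key
    base_url="https://global-apis.com/v1"
)

# Same call shape for every model; only the model string changes
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}]
)
print(response.choices[0].message.content)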

That's it. That's the whole thing. You're not signing up for four different APIs with four different auth schemes - you swap the model name and you're done. More on that later.
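If you want to run the same "same prompt, different model string" comparison yourself, here's a rough sketch. It is not the exact harness behind this post, just one way to do it. The first two model IDs appear in this article; the Kimi and GLM IDs are hypothetical placeholders, so check your provider's model list before running it. The function takes any client that exposes `chat.completions.create`, such as the `client` from the snippet above.

```python
from typing import Any, Dict, List

# The first two IDs appear in this post. The last two are hypothetical
# placeholders, so check your provider's model list for the real names.
MODELS: List[str] = [
    "deepseek-v4-flash",
    "Qwen/Qwen3-32B",
    "kimi-k2.5",  # assumed ID
    "glm-5",      # assumed ID
]


def compare_models(client: Any, models: List[str], prompt: str) -> Dict[str, str]:
    """Send one prompt to each model and collect the replies by model name."""
    results: Dict[str, str] = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            results[model] = response.choices[0].message.content
        except Exception as exc:
            # Keep going if one model ID is wrong or the model is down
            results[model] = f"ERROR: {exc}"
    return results


# Usage, with the `client` created earlier:
# replies = compare_models(client, MODELS, "Explain quantum computing in 100 words")
# for name, text in replies.items():
#     print(f"--- {name} ---\n{text}\n")
```

Catching the exception per model means one bad model string doesn't kill the whole run, which matters when you're guessing at IDs.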

DeepSeek: My Daily Driver Now

Okay, so DeepSeek is the one I ended up using the most. The value proposition is just... stupid good. V4 Flash costs $0.25/M output tokens. Let that sink in. That's not a typo.

Here's the model lineup:

Model	Output $/M	Use it for
V4 Flash	$0.25	everyday coding, content, basically everything
V3.2	$0.38	latest architecture stuff
V4 Pro	$0.78	when you need production-grade quality
R1 (Reasoner)	$2.50	hard math, chain-of-thought stuff
Coder	$0.25	code-specific tasks

V4 Flash is what I default to. It hits around 60 tokens/sec which feels instant for most things, and the output quality genuinely surprised me. I was running it on summarization, code review, content rewriting - the boring everyday stuff - and it kept producing stuff I'd have sworn came from GPT-4o.
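A tokens-per-second figure like that is easy to sanity-check on your own setup. Here's a minimal sketch, assuming the provider fills in the standard `usage.completion_tokens` field on the response (not every OpenAI-compatible gateway does). The helper name is made up for illustration, and this isn't necessarily how the number above was measured.

```python
import time
from typing import Any


def measure_tokens_per_second(client: Any, model: str, prompt: str) -> float:
    """Return completion tokens divided by wall-clock seconds for one request."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start

    # `usage` is optional on some gateways, so fail loudly instead of guessing
    usage = getattr(response, "usage", None)
    if usage is None:
        raise ValueError("response has no usage data; cannot count tokens")

    return usage.completion_tokens / elapsed if elapsed > 0 else 0.0


# Usage, with the `client` created earlier:
# print(measure_tokens_per_second(client, "deepseek-v4-flash", "Summarize TCP in 200 words"))
```

Because the timer wraps the whole request, network latency and time-to-first-token are included, so this understates raw generation speed. Average a handful of runs before trusting the number.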

What I liked:

Price-to-performance ratio is INSANE. I literally cut my API bill by 70% the first week
Code generation is top tier. HumanEval, MBPP, whatever benchmark you throw at it, it's up there
Speed. 60 tok/sec on the Flash model means responses feel snappy
English output is indistinguishable from the big western models for most tasks
It comes from an open-weight lineage, which feels less sketchy somehow

What bugged me:

Little to no vision support. If you need image understanding, look elsewhere
Chinese isn't quite as good as GLM or Kimi. It's GOOD, just not the best of the four
Fewer model sizes. Qwen has like ten times more options if you're picky about that

For pure coding work, DeepSeek is hard to beat at this price point. I built a whole feature in one afternoon using V4 Flash as my pair programmer and the code was cleaner than what I would have written myself. Not exaggerating.

Qwen: When You Want Options

Qwen is Alibaba's baby and it's basically the "we have everything" store. If DeepSeek is a well-curated boutique, Qwen is a massive department store. They have a model for literally every niche.

Check this lineup:

Model	Output $/M	Use it for
Qwen3-8B	$0.01	tiny tasks, classification, cheap stuff
Qwen3-32B	$0.28	general purpose workhorse
Qwen3-Coder-30B	$0.35	code generation
Qwen3-VL-32B	$0.52	image understanding
Qwen3-Omni-30B	$0.52	multimodal (audio, video, image)
Qwen3.5-397B	$2.34	enterprise reasoning

That $0.01/M on Qwen3-8B is REAL. I use it for routing - like, deciding whether a user query needs the big model or if a small one can handle it. Costs basically nothing.

What I liked:

Widest model range in this comparison. From $0.01 to $3.20 - covers literally every budget
The vision models (VL series) are genuinely good
Omni-modal support means audio, video, and image in one model
Alibaba backing means the infrastructure doesn't go down
They ship new versions constantly. Qwen3.5, Qwen3.6, there's always something new

What bugged me:

Naming is a MESS. Qwen3-32B, Qwen3.5-397B, Qwen3.6-35B - keeping track is a job
English is good but not DeepSeek level. Slightly stiffer outputs sometimes
Some models are kinda pricey. Qwen3.6-35B at $1/M feels steep for what you get

Here's how I'd use it:

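# Reuses the `client` created earlier; only the model string changes
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists"}]
)
print(response.choices[0].message.content)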

Qwen is what I'd recommend to a team that's just starting to optimise their AI spend. The 8B model is so cheap you can spam it for routing, then escalate to the 32B or higher when you need real quality.
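To make the route-then-escalate idea concrete, here's a minimal sketch. It isn't my production router, just the shape of one. Assumptions: the small model's ID is `Qwen/Qwen3-8B` (guessed by analogy with `Qwen/Qwen3-32B`, so verify it), and the triage prompt and function names are made up for illustration.

```python
from typing import Any, Tuple

ROUTER_MODEL = "Qwen/Qwen3-8B"   # assumed ID, by analogy with "Qwen/Qwen3-32B"
STRONG_MODEL = "Qwen/Qwen3-32B"

# Illustrative triage prompt; tune the wording for your own traffic
ROUTER_PROMPT = (
    "Classify the user request as SIMPLE or COMPLEX. "
    "Reply with exactly one word.\n\nRequest: {query}"
)


def _ask(client: Any, model: str, content: str) -> str:
    """Send a single user message and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content or ""


def route_and_answer(client: Any, query: str) -> Tuple[str, str]:
    """Let the small model triage, then answer with the model it picks.

    Returns (model_used, answer).
    """
    verdict = _ask(client, ROUTER_MODEL, ROUTER_PROMPT.format(query=query))
    # Anything other than a clear SIMPLE escalates, so a confused router
    # costs a bit more money rather than answer quality.
    is_simple = verdict.strip().upper().startswith("SIMPLE")
    model = ROUTER_MODEL if is_simple else STRONG_MODEL
    return model, _ask(client, model, query)


# Usage, with the `client` created earlier:
# model_used, answer = route_and_answer(client, "What's the capital of France?")
```

Every query pays for one extra call to the small model, which at $0.01/M is close to free compared with sending everything to the bigger one.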

Kimi: The Brainy One

Kimi comes from Moonshot AI (the name 月之暗面 literally means "dark side of the moon" which is cool). This one is positioned as the thinking model. You know those chain-of-thought, step-by-step reasoning tasks? Kimi is REALLY good at those.

Pricing for Kimi is higher across the board:

Model	Output $/M	Use it for
K2.5	$3.00	reasoning, complex logic
(others)	up to $3.50	premium tier

Yeah, $3.00/M minimum. That's 12x what DeepSeek V4 Flash costs at $0.25/M. But here's the thing - when you NEED reasoning, you NEED reasoning. For tasks where I needed multi-step logic, math, or hard analytical work, Kimi was the best of the bunch.
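If you want to see what that gap means for a real bill, the arithmetic is simple. The prices below are the output prices from the tables in this post; the 50M tokens/month workload is a made-up example, and input-token costs are ignored because this post only lists output prices.

```python
# Output prices in USD per million tokens, copied from the tables in this post
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-8B": 0.01,
    "Qwen3-32B": 0.28,
    "Kimi K2.5": 3.00,
    "GLM-4-9B": 0.01,
    "GLM-5": 1.92,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Cost in USD of generating `output_tokens` tokens with `model`."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def price_ratio(model_a: str, model_b: str) -> float:
    """How many times more model_a costs than model_b per output token."""
    return OUTPUT_PRICE_PER_M[model_a] / OUTPUT_PRICE_PER_M[model_b]


# Hypothetical workload: 50M output tokens a month
monthly_tokens = 50_000_000
for name in ("DeepSeek V4 Flash", "Kimi K2.5"):
    print(f"{name}: ${output_cost(name, monthly_tokens):.2f}/month")
print(f"Kimi K2.5 vs V4 Flash: {price_ratio('Kimi K2.5', 'DeepSeek V4 Flash'):.0f}x")
```

At that volume it works out to $12.50 a month on V4 Flash versus $150.00 on K2.5, which is why Kimi only makes sense for the requests that actually need it.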

What I liked:

Reasoning benchmarks - it crushes them
Chinese output feels very natural
Context handling on long docs is solid
Output quality for complex tasks is genuinely top-tier

What bugged me:

Expensive. $3.00/M adds up FAST
Slower than the others. If you need snappy responses, this isn't it
No vision/multimodal support
Smaller model lineup

I only used Kimi when I really needed to think hard. Like, debugging a gnarly algorithmic problem or analyzing a 50-page contract. For everyday stuff? Way overkill.

GLM: The Chinese-Language Champion

GLM is from Zhipu AI and it's the one to beat if your work involves the Chinese language. GLM stands for... actually I always forget what it stands for. Doesn't matter. What matters is it's REALLY good at Chinese.

Pricing is interesting:

Model	Output $/M	Use it for
GLM-4-9B	$0.01	budget Chinese tasks
GLM-5	$1.92	top tier

That $0.01/M entry point for the 9B model is wild. I tested it on some Chinese content I had lying around and... it was noticeably better than DeepSeek or Qwen at the same size. Like, it actually understood nuance and idioms instead of translating word-by-word.

What I liked:

Best Chinese language quality of the four
GLM-4.6V is a solid vision model
Reasonable pricing across the range
OpenAI-compatible API

What bugged me:

Code generation is the weakest of the four. If you're building dev tools, look elsewhere
Less of a community compared to Qwen or DeepSeek
English output is fine but not inspiring

If I had a Chinese-focused product (translation app, content tool, whatever), GLM is what I'd reach for. For everything else, it's a solid runner-up.
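Since all four sit behind the same OpenAI-style API, the per-section takeaways so far boil down to a lookup table. Here's a minimal sketch, assuming you tag each request with a task label yourself. Only `deepseek-v4-flash` is a model ID shown in this post; the other IDs and the task labels are hypothetical.

```python
from typing import Dict

# Task -> model, following the section takeaways above.
# IDs marked "assumed" are hypothetical; check your provider's model list.
MODEL_FOR_TASK: Dict[str, str] = {
    "coding": "deepseek-v4-flash",
    "general": "deepseek-v4-flash",
    "routing": "Qwen/Qwen3-8B",     # assumed ID
    "vision": "Qwen/Qwen3-VL-32B",  # assumed ID
    "reasoning": "kimi-k2.5",       # assumed ID
    "chinese": "glm-5",             # assumed ID
}

DEFAULT_MODEL = "deepseek-v4-flash"


def pick_model(task: str) -> str:
    """Return the model for a task label, falling back to the cheap default."""
    return MODEL_FOR_TASK.get(task.strip().lower(), DEFAULT_MODEL)


# Usage, with the `client` created earlier:
# response = client.chat.completions.create(
#     model=pick_model("reasoning"),
#     messages=[{"role": "user", "content": "Find the bug in this algorithm..."}],
# )
```

Falling back to the cheapest general model keeps unknown task labels from quietly landing on the $3.00/M tier.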

The Verdict: What I'd Actually Use

After a month of running real workloads, here's my honest take:

For most indie hackers: DeepSeek V
