I Compared DeepSeek vs Qwen vs Kimi vs GLM - Real Results
ok so heres the thing. ive been building indie projects for like 6 years now and the past few months have been WILD when it comes to chinese AI models. like honestly, every time i blink theres a new model dropping and i genuinely cant keep up anymore.
last month i finally snapped and decided to actually sit down and test four of the big ones: DeepSeek, Qwen, Kimi, and GLM. i ran them through my actual workflows. coding tasks, content writing, some chinese language stuff for a client project, reasoning puzzles. the whole deal.
heres what i found. and yes, im gonna be brutally honest about the stuff that sucked too.
Why I Even Bothered With This
look, i know theres a million "AI comparison" posts out there. most of them are garbage. they just regurgitate marketing benchmarks and call it a day. i wanted to actually USE these models for real work and see which ones earned their keep.
the other thing is pricing has gotten SO weird. you got models at like $0.01 per million tokens and others at $3.50. thats a 350x difference. you cannot just pick one randomly and hope it works out. i learned that the hard way when i burned through $200 in a weekend testing stuff (dont ask).
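to put that gap in numbers, here is a rough sketch of the math. the 50M-token monthly workload is a made-up number purely for illustration, `output_cost_usd` is just a helper name i picked, and it assumes the "/M" prices are billed per million output tokens.

```python
def output_cost_usd(price_per_million: float, output_tokens: int) -> float:
    """Return the USD cost of `output_tokens` at a per-million-token price."""
    return price_per_million * output_tokens / 1_000_000


# hypothetical workload: 50 million output tokens in a month (made-up number)
MONTHLY_TOKENS = 50_000_000

cheap = output_cost_usd(0.01, MONTHLY_TOKENS)   # cheapest price quoted in this post
pricey = output_cost_usd(3.50, MONTHLY_TOKENS)  # most expensive price quoted in this post

print(f"cheap model:  ${cheap:.2f}")
print(f"pricey model: ${pricey:.2f}")
print(f"ratio: {pricey / cheap:.0f}x")
```

plug in your own token counts before trusting any of this. the point is only that the same workload can cost pocket change or real money depending on the model.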
i tested everything through Global API btw because their unified endpoint lets me swap models without rewriting code. absolute lifesaver. more on that later.
The Quick and Dirty Comparison
heres a quick table before i dive deep. ill keep the prices EXACT because messing those up would be criminal:
| Feature | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Developer | DeepSeek (幻方) | Alibaba (阿里) | Moonshot AI (月之暗面) | Zhipu AI (智谱) |
| Price Range (USD per 1M tokens) | $0.25-$2.50/M | $0.01-$3.20/M | $3.00-$3.50/M | $0.01-$1.92/M |
| Budget Pick | V4 Flash @ $0.25/M | Qwen3-8B @ $0.01/M | N/A (all premium) | GLM-4-9B @ $0.01/M |
| Best Overall | V4 Flash @ $0.25/M | Qwen3-32B @ $0.28/M | K2.5 @ $3.00/M | GLM-5 @ $1.92/M |
| Code | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Chinese | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| English | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Reasoning | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Vision | Limited | ✅ (VL, Omni) | ❌ | ✅ (GLM-4.6V) |
| Context | 128K | 128K | 128K | 128K |
| OpenAI API | ✅ | ✅ | ✅ | ✅ |
ok so right off the bat you can see these models are NOT interchangeable. they all have their thing.
GLM: The Underdog That Surprised Me
honestly, i almost skipped GLM. i figured "eh, its chinese-only, probably not useful for me." WRONG. i was so wrong.
GLM comes from Zhipu AI (智谱), and they make some genuinely solid models. the pricing is INSANE too. like GLM-4-9B at $0.01/M output tokens? thats basically free. i ran like 5000 queries and my bill was literally like four dollars.
heres the model lineup:
- GLM-4-9B at $0.01/M - the ultra-cheap workhorse
- GLM-5 at $1.92/M - their flagship, competes with the big boys
what blew me away was the chinese language performance. im working on this project that requires generating formal chinese business emails and GLM-5 just CRUSHED it. way better than DeepSeek. way better than Qwen for this specific use case. the nuance was actually there.
the downsides tho: code generation is mediocre. like its FINE for basic stuff but for anything complex i was reaching for DeepSeek or Qwen. and the english isnt bad but its not DeepSeek-level either.
oh and they have GLM-4.6V which handles vision tasks. i tested it on some product photos and it was pretty solid. not GPT-4o level but for the price? absolutely usable.
would i use GLM again? YES. specifically for chinese language work and any task where i need to save money. its like the budget king nobody talks about.
Kimi: The Brainy One That Costs a Lot
ok Kimi is from Moonshot AI (月之暗面) and these guys are clearly going for the "smart model" angle. you can tell because they only have ONE pricing tier and its expensive: $3.00-$3.50/M output.
their flagship K2.5 sits at $3.00/M and honestly? it earns it. i threw some genuinely hard reasoning tasks at it - like the kind where other models get confused halfway through - and Kimi just handled it. multi-step logic, math word problems, the works.
the thing is, i cant use it for everything. at $3.00/M output, running kimi on bulk tasks would bankrupt me. i literally used it for like 30 test queries and already felt the cost creep up. this is a "break glass in case of emergency" model for me.
where it falls short: speed. its noticeably slower than DeepSeek. and theres no vision/multimodal support which in 2025 is kinda wild. if i need to handle images, kimi is not my pick.
english is solid tho, no complaints there. its just expensive.
would i use Kimi again? YES but only for the hard stuff. its like hiring a genius consultant - you dont call them for every little question.
DeepSeek: My Daily Driver
ok heres where i get to be really excited. DeepSeek has become my default for like 80% of my work. its just SO good for the price.
lets talk models:
- V4 Flash at $0.25/M - this is the one i use constantly
- V3.2 at $0.38/M - their newer architecture
- V4 Pro at $0.78/M - for when i need extra polish
- R1 (Reasoner) at $2.50/M - the deep thinking model
- Coder at $0.25/M - specialized for code
heres what gets me. V4 Flash at $0.25/M genuinely rivals the quality of GPT-4o for most of my use cases. and its FAST. like 60 tokens/sec fast. when im iterating on code or generating content, that speed matters a lot.
i tested V4 Flash against some western models for english copywriting and it actually WON on a few prompts. like the tone was better. more natural. less "AI-ese."
code generation is the real standout tho. i ran it through my standard coding test suite and it consistently nailed HumanEval and MBPP style problems. better than Qwen for code in my experience. better than Kimi (which surprised me).
the downsides: no real vision capabilities. chinese is good but not GLM-level. and the model lineup is smaller than Qwen so if you need specific sizes you might be out of luck.
would i use DeepSeek again? ABSOLUTELY. its already my default.
Qwen: The Swiss Army Knife
alibaba's Qwen family is WILD because they have SO MANY models. like its almost overwhelming. heres what theyve got:
- Qwen3-8B at $0.01/M - the ultra-budget option
- Qwen3-32B at $0.28/M - general purpose sweet spot
- Qwen3-Coder-30B at $0.35/M - code specialized
- Qwen3-VL-32B at $0.52/M - vision language
- Qwen3-Omni-30B at $0.52/M - the everything model
- Qwen3.5-397B at $2.34/M - enterprise tier
the range is genuinely impressive. you can go from $0.01 to $3.20/M and find a Qwen model at basically every price point in between. theres also Qwen3.6 stuff in there which can get pricey.
where Qwen shines: multimodal. like if you need ONE model that handles text AND images AND audio AND video, Qwen3-Omni-30B at $0.52/M is genuinely compelling. i tested it on a mixed-media workflow and it just worked. no special handling needed.
the alibaba backing means the infrastructure is solid. enterprise grade. i never had uptime issues with any Qwen model during my testing period.
weaknesses tho, and i gotta be honest here: the naming is CONFUSING. like Qwen3 vs Qwen3.5 vs Qwen3.6 vs the various suffixes (VL, Omni, Coder). it took me like 20 minutes just to figure out which model was which. and some of the mid-tier models feel overpriced - Qwen3.6-35B at $1/M output is steep for what you get.
english is solid but in my testing DeepSeek edged it out slightly for fluency. not by a lot tho.
would i use Qwen again? YES but specifically when i need vision or multimodal capabilities. its the best in class for that.
How I Actually Use These Models Now
heres my actual workflow after all this testing:
- Daily coding work: DeepSeek V4 Flash. no contest. $0.25/M and its FAST.
- Hard reasoning tasks: Kimi K2.5. when i genuinely need the brain power, i pay the $3.00/M.
- Chinese language projects: GLM-5 or Kimi K2.5 (both are amazing at chinese).
- Vision/multimodal stuff: Qwen3-VL-32B or Qwen3-Omni-30B.
- Ultra budget experiments: Qwen3-8B or GLM-4-9B at $0.01/M.
i know thats a lot of models but thats the reality. you dont need ONE model for everything. you need the RIGHT model for each task.
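that kind of routing can live in a tiny lookup table. this is a minimal sketch, not my exact setup: the task labels are made up, and only `deepseek-v4-flash` and `Qwen/Qwen3-VL-32B` are model IDs that show up in the code in this post. the other IDs are placeholders, so check your provider's model list for the real names.

```python
# task label -> model ID. the task labels are made up for this sketch.
MODEL_FOR_TASK = {
    "coding": "deepseek-v4-flash",
    "vision": "Qwen/Qwen3-VL-32B",
    # placeholder IDs below: look up the exact names in your provider's model list
    "reasoning": "PLACEHOLDER-kimi-k2.5",
    "chinese": "PLACEHOLDER-glm-5",
    "budget": "PLACEHOLDER-qwen3-8b",
}
DEFAULT_MODEL = "deepseek-v4-flash"  # used when the task label is unknown


def pick_model(task: str) -> str:
    """Return the model ID for a task label, falling back to the default."""
    return MODEL_FOR_TASK.get(task.strip().lower(), DEFAULT_MODEL)


print(pick_model("coding"))
print(pick_model("Vision"))
print(pick_model("something else"))
```

a plain dict keeps the choice in one place, so changing my mind about a model is a one-line edit instead of a hunt through the codebase.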
Some Actual Code I Wrote
heres a basic python setup that lets me swap between models easily. i use this all the time:
from openai import OpenAI

# one client for every model, pointed at the unified endpoint
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder, use your own key
    base_url="https://global-apis.com/v1"
)

def ask_model(model_name, prompt):
    # send a single-turn chat request and return the reply text
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

prompt = "Explain the difference between TCP and UDP in simple terms"

print("=== DeepSeek V4 Flash ===")
print(ask_model("deepseek-v4-flash", prompt))

print("\n=== Qwen3-32B ===")
print(ask_model("Qwen/Qwen3-32B", prompt))
see how clean that is? same code structure, just swap the model name. thats why i love Global API's unified setup. i dont have to learn four different SDKs or auth systems.
heres another example for vision stuff with Qwen:
# reuses the `client` from the snippet above
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }]
)
print(response.choices[0].message.content)
The Stuff Nobody Talks About
a few things i noticed that arent in the spec sheets:
Latency matters more than you think. DeepSeek V4 Flash being fast means i iterate faster. when youre testing prompts or debugging code, those saved seconds add up.
Context window 128K is fine for MOST stuff. people obsess over context but unless youre doing novel-length analysis, 128K is plenty.
Model "tier" doesnt always mean quality. Qwen3-8B at $0.01/M is shockingly good for basic tasks. dont pay more than you need to.
Reasoning models are worth it sometimes. i was skeptical about Kimi K2.5 and R1 until i had a genuinely complex multi-step problem. then i got it.
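if you want to check the latency point on your own setup, a timer around the call is enough. rough sketch only: `timed` and `fake_model` are names i made up, and `fake_model` just sleeps so the example runs without an API key. swap in a real call to measure real latency.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[..., str], *args, **kwargs) -> Tuple[str, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call: waits briefly, then echoes the prompt."""
    time.sleep(0.05)
    return f"echo: {prompt}"


reply, seconds = timed(fake_model, "ping")
print(f"{reply!r} took {seconds:.3f}s")
```

one run tells you very little, so i would time a handful of calls per model and compare the typical number, not the best one.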
Which One Should You Pick?
heres my honest advice:
- If youre building a startup and need cheap reliable text generation: DeepSeek V4 Flash. seriously. just use it.
- If you need multimodal/vision: Qwen3-VL or Qwen3-Omni.
- If youre doing heavy chinese language work: GLM-5 or Kimi K2.5.
- If you need the absolute best reasoning: Kimi K2.5.
- If youre experimenting on a tight budget: Qwen3-8B or GLM-4-9B at $0.01/M.
theres no single winner. if you forced me to name one on price-to-performance, it would be DeepSeek V4 Flash. but honestly? the ecosystem is mature enough now that you should be using multiple models.
Final Thoughts
ok so wrapping this up. i spent way too many hours testing these models and im still finding new use cases. the chinese AI ecosystem has gotten legitimately good. like scary good. and the prices are so low that you can afford to experiment.
my biggest piece of advice: dont commit to one model. set up something like what i showed you above where you can swap models easily. test them on YOUR actual workloads. the benchmarks are useful but they dont capture everything.
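a bare-bones way to do that side-by-side testing, as a sketch under some assumptions: `compare_models` and `stub_ask` are hypothetical names, the two prompts are made up, and the stub exists only so this runs offline. to hit real models, pass in the `ask_model` function from the setup code earlier, since it takes the same `(model_name, prompt)` arguments.

```python
from typing import Callable, Dict, List


def compare_models(
    ask: Callable[[str, str], str],
    models: List[str],
    prompts: List[str],
) -> Dict[str, List[str]]:
    """Run every prompt through every model and collect the replies per model."""
    return {model: [ask(model, prompt) for prompt in prompts] for model in models}


def stub_ask(model_name: str, prompt: str) -> str:
    """Offline stand-in so the sketch runs without an API key."""
    return f"[{model_name}] {prompt}"


results = compare_models(
    stub_ask,
    models=["deepseek-v4-flash", "Qwen/Qwen3-32B"],
    prompts=["Summarize this changelog", "Write a regex for ISO dates"],
)
for model, replies in results.items():
    print(model, "->", replies)
```

judging the replies is still on you. this only saves the copy-pasting.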
if you want to try