I Spent Two Weeks Testing China's Top AI Models — Here's What I Found
Three months ago I graduated from a coding bootcamp. I built my first chatbot, deployed my first API, and felt like a wizard. Then a friend in my cohort sent me a link to a Chinese model called DeepSeek, and I had no idea my whole mental map of AI was about to flip upside down.
I had been living under a rock, apparently. While I was busy learning React state management, an entire ecosystem of Chinese AI models had quietly become some of the most capable — and cheapest — models on the planet. DeepSeek, Qwen, Kimi, and GLM. I kept seeing the names pop up in Discord channels and Reddit threads. So I did what any curious bootcamp grad would do: I threw my credit card at the problem and started testing.
Two weeks later, I had a pile of notes, a dozen notebooks full of weird test outputs, and opinions I didn't have before. This post is everything I wish someone had told me on day one.
The Four Players I Didn't Know About
I had assumed "AI" basically meant OpenAI, Anthropic, and Google. I was shocked to learn that China has produced four major open-weight model families, each with its own vibe:
- DeepSeek — the scrappy underdog that punches way above its weight
- Qwen — Alibaba's beast with more model variants than I could count
- Kimi — the quiet genius that everyone says is great at reasoning
- GLM — Zhipu AI's model that absolutely crushes Chinese-language tasks
Testing them all directly would have been a nightmare — different SDKs, different docs, different billing. Lucky for me, I found out about Global API, which exposes all of them through one OpenAI-compatible endpoint. More on that later, but it basically saved my weekends.
Quick Cheat Sheet (The One I Wish I Had)
Before I get into the deep dives, here's the table I scribbled on a whiteboard at 1 AM. Every number here came from real tests I ran:
| Feature | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Made by | DeepSeek (幻方) | Alibaba (阿里) | Moonshot AI (月之暗面) | Zhipu AI (智谱) |
| Output price range (per 1M tokens) | $0.25–$2.50/M | $0.01–$3.20/M | $3.00–$3.50/M | $0.01–$1.92/M |
| Cheapest good option | V4 Flash @ $0.25/M | Qwen3-8B @ $0.01/M | N/A (all premium) | GLM-4-9B @ $0.01/M |
| Sweet spot pick | V4 Flash @ $0.25/M | Qwen3-32B @ $0.28/M | K2.5 @ $3.00/M | GLM-5 @ $1.92/M |
| Code generation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Chinese language | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| English language | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Reasoning | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Vision/multimodal | Limited | Yes (VL, Omni) | No | Yes (GLM-4.6V) |
| Context window | Up to 128K | Up to 128K | Up to 128K | Up to 128K |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
I know that's a lot. Don't panic. Just focus on the rows that matter for what you're building.
DeepSeek: The One That Blew My Mind
I'm going to be honest: DeepSeek V4 Flash is the model that made me feel like I had been overpaying for AI my entire bootcamp life. At $0.25 per million output tokens, it produces quality that genuinely rivals GPT-4o — which I had been paying roughly forty times more for. Forty times. Let that sink in.
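To make the "forty times" math concrete, here is a rough back-of-the-envelope sketch. The DeepSeek price is the one from this post; the $10.00/M GPT-4o figure is an assumption I back-derived from the 40x ratio (0.25 × 40), and the monthly volume is made up, so check current pricing before trusting any of it.

```python
# Output prices in USD per million tokens.
# The DeepSeek figure comes from this post. The GPT-4o figure is an ASSUMPTION
# implied by the "forty times" ratio (0.25 * 40 = 10.00), not a quoted price.
PRICE_PER_MILLION = {
    "DeepSeek V4 Flash": 0.25,
    "GPT-4o (assumed)": 10.00,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Return the USD cost of generating `output_tokens` tokens with `model`."""
    return PRICE_PER_MILLION[model] * output_tokens / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more `expensive` costs than `cheap` per output token."""
    return PRICE_PER_MILLION[expensive] / PRICE_PER_MILLION[cheap]


if __name__ == "__main__":
    monthly_tokens = 50_000_000  # hypothetical side-project volume
    for name in PRICE_PER_MILLION:
        print(f"{name}: ${output_cost(name, monthly_tokens):.2f}")
    ratio = price_ratio("GPT-4o (assumed)", "DeepSeek V4 Flash")
    print(f"ratio: {ratio:.0f}x")
```

Swap in your own token counts and prices; the point is only that the gap scales linearly with volume.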
Here's the lineup I tested:
| Model | Output $/M | When I reach for it |
|---|---|---|
| V4 Flash | $0.25 | Daily coding, content, quick tasks |
| V3.2 | $0.38 | When I want the newest architecture |
| V4 Pro | $0.78 | Production work where quality matters |
| R1 (Reasoner) | $2.50 | Math and logic problems that fry my brain |
| Coder | $0.25 | Anything code-specific |
What surprised me most was the speed. V4 Flash clocks around 60 tokens per second on the endpoint I was using, which honestly felt faster than some of the "fast" models I'd tried before. For a side project where I needed streaming responses, it was a delight.
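If you want to sanity-check a tokens-per-second figure like that yourself, here is a minimal sketch. It assumes the stream comes from the OpenAI Python SDK with `stream=True`, and it counts text chunks rather than true tokens, so treat the result as a rough estimate. The `measure_stream` name and its `clock` parameter are my own, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float]:
    """Consume a streamed chat completion and return (text, chunks_per_second).

    `chunks` is the iterable returned by
    `client.chat.completions.create(..., stream=True)`. A chunk is not
    guaranteed to hold exactly one token, so the rate is an approximation.
    """
    start = clock()
    pieces = []
    for chunk in chunks:
        # Some chunks carry no choices or no text; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            pieces.append(text)
    elapsed = clock() - start
    rate = len(pieces) / elapsed if elapsed > 0 else 0.0
    return "".join(pieces), rate


# Usage sketch (needs a configured OpenAI-compatible `client`):
# stream = client.chat.completions.create(
#     model="deepseek-v4-flash",
#     messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}],
#     stream=True,
# )
# text, rate = measure_stream(stream)
# print(f"{rate:.1f} chunks/sec")
```

The injectable `clock` is only there so the function can be tested without waiting on a real network call.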
The code generation is genuinely top-tier. I ran problems from the classic HumanEval and MBPP benchmarks, and DeepSeek consistently nailed them. I had no idea a Chinese model would be this strong at code tasks. I think most people in my bootcamp cohort still default to "just use OpenAI for code," and they're leaving money on the table.
The downside? Vision is limited. If you need to look at images, DeepSeek isn't going to be your friend. And when I tested pure Chinese prompts, GLM and Kimi edged it out slightly.
Here's the call I kept making for my daily driver:
```python
from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # V4 Flash
    messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}],
)
print(response.choices[0].message.content)
```
That snippet is basically all I ran, hundreds of times, for two weeks. The fact that I could use the OpenAI Python SDK without rewriting anything was a game changer for me — no new library, no new docs, just swap the base URL and the model name.
Qwen: The Swiss Army Knife
After DeepSeek hooked me, Qwen was the next rabbit hole. Alibaba has been releasing models like it's a competitive sport. There are so many Qwen variants that I had to make a spreadsheet just to keep them straight.
Here's what I actually used:
| Model | Output $/M | My take |
|---|---|---|
| Qwen3-8B | $0.01 | Tiny tasks, classification, anything where I want to spend basically nothing |
| Qwen3-32B | $0.28 | My general-purpose go-to for most jobs |
| Qwen3-Coder-30B | $0.35 | When DeepSeek is busy |
| Qwen3-VL-32B | $0.52 | Image understanding — finally, multimodal at a sane price |
| Qwen3-Omni-30B | $0.52 | Audio, video, image all in one |
| Qwen3.5-397B | $2.34 | Big enterprise reasoning workloads |
The first time I saw Qwen3-8B at $0.01 per million output tokens, I literally laughed out loud. One cent. For a million tokens. That's not a typo. For small classification jobs, extraction tasks, or "summarize this short blurb" requests, this thing is unbeatable.
Qwen3-VL-32B was the multimodal model I didn't know I needed. I had a side project where I wanted to describe images for accessibility, and Qwen handled it at a fraction of the cost of GPT-4 Vision. Omni-30B is even wilder — it does audio, video, and images in one model. I tested it on a few short video clips and it gave descriptions that genuinely impressed me.
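For the accessibility use case, the request looked roughly like the sketch below. It builds a message in the OpenAI-style vision format (a `content` list with a `text` part and an `image_url` part holding a base64 data URL). I'm assuming the endpoint accepts that format for Qwen3-VL, and the model ID in the usage comment is a guess based on the `Qwen/Qwen3-32B` naming pattern; the helper name and prompt wording are my own.

```python
import base64
from typing import Dict, List


def image_description_messages(
    image_bytes: bytes,
    mime_type: str = "image/png",
) -> List[Dict]:
    """Build chat messages that ask for alt text for one image.

    Uses the OpenAI-style vision message format; whether a given
    OpenAI-compatible endpoint accepts it for a given model is an assumption.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one or two sentences "
                            "for a screen-reader user.",
                },
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Usage sketch, reusing the `client` from the DeepSeek example
# (the model ID below is hypothetical):
# with open("screenshot.png", "rb") as f:
#     messages = image_description_messages(f.read())
# response = client.chat.completions.create(
#     model="Qwen/Qwen3-VL-32B", messages=messages
# )
# print(response.choices[0].message.content)
```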
One thing that did bug me: the naming is genuinely confusing. Qwen3, Qwen3.5, Qwen3.6, Qwen3-Coder, Qwen3-VL, Qwen3-Omni... I kept losing track. My advice: just bookmark the model list and don't try to memorize it.
For English, Qwen is good but not quite DeepSeek-level. And some of the mid-tier models feel overpriced — Qwen3.6-35B sits around $1/M and I didn't see a huge quality jump that justified the cost.
Here's how I called Qwen3-32B for general work:
```python
# Reuses the `client` created in the DeepSeek example.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists"}],
)
print(response.choices[0].message.content)
```
Same client object. Same SDK. Just a different model name. I told you Global API made my life easy.
Kimi: The Brainy One
Kimi is the model everyone in the Chinese AI Discord kept hyping up as the "reasoning king." I was skeptical at first — K2.5 costs $3.00 per million output tokens, which makes my budget-conscious bootcamp brain twitch. But then I ran it through some math problems that had stumped me for hours.
It just... solved them.
I had no idea a model could reason through multi-step problems that cleanly. On the benchmark problems I tried, Kimi genuinely earned its reputation. If you need raw reasoning power and you don't mind paying for it, this is the one.
| Model | Output $/M | Best for |
|---|---|---|
| K2.5 | $3.00 | Heavy reasoning, math, logic chains |
And that's basically the lineup I cared about — Kimi doesn't play the budget game. There's no $0.01 model to grab here. You're paying for brains, not bargain-bin pricing.
The trade-off is speed. Kimi is noticeably slower than DeepSeek. When I was streaming answers, I could feel the difference. For long-form reasoning tasks where you can wait a few extra seconds, that's totally fine. But for chat-style quick replies, I found myself reaching for DeepSeek or Qwen instead.
Also, no vision support. If you need multimodal, look elsewhere.
GLM: The Multilingual Beast
Last but not least: GLM, made by Zhipu AI. This one absolutely destroyed every Chinese-language prompt I threw at it. If you're building anything for a Chinese-speaking audience, GLM is the answer. Full stop.
| Model | Output $/M | When I use it |
|---|---|---|
| GLM-4-9B | $0.01 | Cheap classification and Chinese tasks |
| GLM-5 | $1.92 | Top-tier Chinese and English work |
GLM-4-9B at $0.01/M is another one of those "wait, is that a typo?" moments. For Chinese-language classification or simple extraction, it's basically free. GLM-5 is the premium option, and at $1.92/M it sits in a comfortable middle ground for serious production work.
What I didn't expect was how good GLM-4.6V is for vision. The "V" stands for vision, and I tested it on some screenshots from my own app — it nailed descriptions of UI layouts that other models fumbled.
GLM is also one of the few that gives you proper multimodal support alongside its text models. If you're building something that needs to handle both Chinese language AND images, GLM is honestly my first pick.
What I Actually Use Now (After All This Testing)
After two weeks, here's my honest shortlist:
- Default daily driver: DeepSeek V4 Flash. Speed, quality, price — nothing beats it for most tasks.
- Cheap small jobs: Qwen3-8B or GLM-4-9B at $0.01/M. I use these for classification and tiny extractions.
- Vision and multimodal: Qwen3-VL or GLM-4.6V.
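In code, that shortlist boils down to a tiny routing table. This is a minimal sketch, not a polished library: only `deepseek-v4-flash` appears verbatim earlier in this post, the other model IDs are hypothetical guesses that follow the `Qwen/Qwen3-32B` naming pattern, and `pick_model` and `ask` are names I made up.

```python
from typing import Dict

# Task label -> model ID. Only "deepseek-v4-flash" is taken from this post;
# the other IDs are HYPOTHETICAL and should be checked against the model list.
MODEL_FOR_TASK: Dict[str, str] = {
    "default": "deepseek-v4-flash",
    "classification": "Qwen/Qwen3-8B",  # hypothetical ID
    "vision": "Qwen/Qwen3-VL-32B",      # hypothetical ID
}


def pick_model(task: str) -> str:
    """Return the model ID for `task`, falling back to the daily driver."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["default"])


def ask(client, task: str, prompt: str) -> str:
    """Send `prompt` to the model chosen for `task` and return the reply text.

    `client` is any OpenAI-compatible client, such as the one created in the
    DeepSeek example.
    """
    response = client.chat.completions.create(
        model=pick_model(task),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Unknown task labels fall back to the default model, so adding a new category is a one-line change to the dictionary.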