gentleforge

Posted on Jun 27

I Spent Real Money Testing Chinese AI Models: Here's The Truth

#machinelearning #programming #api #ai


honestly, I never thought I'd be writing this post. six months ago I was happily burning through my OpenAI credits, didn't even look at what was happening in China, and figured all the cool models came from SF.

boy was I wrong.

so here's the deal: I run a small SaaS thing, nothing fancy, and my API bill was starting to make me cry a little. GPT-4o is great, don't get me wrong, but when you're doing thousands of requests a day the math gets ugly FAST. I started hearing whispers about these Chinese models being legit, and I was skeptical. but the price difference was too wild to ignore, so I did the thing every indie hacker does: I burned actual money to find out.

I tested DeepSeek, Qwen, Kimi, and GLM through Global API's unified endpoint (more on that later) over about three weeks. ran real production-ish traffic through them, tracked latency, checked output quality, and yes, I have opinions. LOTS of them.

let's get into it.

So What Even Are These Models

quick crash course if you're new to this. China basically said "we're gonna build our own LLMs and they're gonna be GOOD" and then actually did it. four labs in particular have been killing it:

DeepSeek: from a quant fund called 幻方 (High-Flyer), these guys are obsessed with efficiency
Qwen: Alibaba's open source darling, they release a new model every other week
Kimi: made by Moonshot AI (月之暗面), they're the reasoning nerds
GLM: Zhipu AI (智谱), these guys have been around forever and they're really good at Chinese

all of them offer OpenAI-compatible APIs, which means you can literally just swap the base URL and model name and boom, you're using Chinese models with the same Python code you already have. that's what got me hooked, honestly.

The Speed Run: My Testing Setup

before I get into the juicy comparison, lemme tell you how I actually tested this stuff. I didn't just ask them "write me a poem" fifty times. I had three real use cases:

code generation for my backend (refactoring, writing new endpoints, that kinda thing)
long-form content for blog posts (marketing stuff, documentation)
customer support replies (lots of short interactions, needs to be fast)

I tracked tokens per second, cost per million output tokens, and just like... vibes. if a model gave me garbage three times in a row I downgraded it.
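a minimal sketch of how numbers like these could be tracked, in case it helps. this is not the exact harness behind the results in this post: `CallStats` and `timed_call` are names made up for illustration, and it assumes the output token count is read from the `usage.completion_tokens` field that OpenAI-compatible responses normally carry.

```python
import time
from dataclasses import dataclass


@dataclass
class CallStats:
    """Per-request numbers: output tokens, wall-clock time, and list price."""
    output_tokens: int
    seconds: float
    price_per_million: float  # USD per 1M output tokens

    @property
    def tokens_per_second(self) -> float:
        # End-to-end throughput; includes time spent waiting for the first token.
        return self.output_tokens / self.seconds if self.seconds > 0 else 0.0

    @property
    def cost_usd(self) -> float:
        # Output-side cost only; input tokens are billed separately.
        return self.output_tokens / 1_000_000 * self.price_per_million


def timed_call(fn, price_per_million: float) -> CallStats:
    """Run fn(), which must return the number of output tokens, and time it."""
    start = time.perf_counter()
    output_tokens = fn()
    elapsed = time.perf_counter() - start
    return CallStats(output_tokens, elapsed, price_per_million)


# Example with fixed numbers: 600 output tokens in 10 seconds at $0.25/M.
stats = CallStats(output_tokens=600, seconds=10.0, price_per_million=0.25)
print(f"{stats.tokens_per_second:.1f} tok/s, ${stats.cost_usd:.6f}")
```

with a real client, `fn` could be something like `lambda: client.chat.completions.create(...).usage.completion_tokens`. the "vibes" part stays manual.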

here's what I found.

DeepSeek: The One I Keep Coming Back To

okay so DeepSeek is wild. like genuinely wild. their V4 Flash model is $0.25/M output tokens and honestly I gotta say it punches WAY above its weight.

I started my testing thinking DeepSeek would be the "cheap but kinda eh" option. I was so wrong it's embarrassing. V4 Flash handles like 90% of what I throw at it, the code generation is actually top tier, and it's FAST. we're talking ~60 tokens/sec, which sounds boring until you remember GPT-4o sometimes feels like it's running on a potato.

here's the full lineup I tested:

Model	$/M Output	What It's Good For
V4 Flash	$0.25	literally everything, this is my default
V3.2	$0.38	latest architecture, barely more expensive
V4 Pro	$0.78	when I need production quality
R1 (Reasoner)	$2.50	math and logic, it's a thinker
Coder	$0.25	code stuff specifically

the code generation is honestly kinda magical. I threw some gnarly refactoring tasks at it and it handled things that other models choked on. scored top tier on HumanEval and MBPP from what I saw in benchmarks, and my real-world tests backed that up.

where it falls short though: vision stuff. like basically none. if you need image understanding you're gonna need to look elsewhere. and for Chinese language tasks, GLM and Kimi do edge it out slightly. but if you're working in English (which, let's be real, most of us are), DeepSeek is the move.

also, less model variety compared to Qwen. Qwen has like 47 different model sizes and DeepSeek has... fewer. but honestly I don't care, the ones they have are GOOD.

here's how I call it through Global API:

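from openai import OpenAI

# Point the standard OpenAI client at the Global API endpoint.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder; use your own key
    base_url="https://global-apis.com/v1"
)

# Same chat-completions call as with OpenAI; only the model name differs.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain quantum computing in 100 words"}]
)
print(response.choices[0].message.content)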

that's it. same OpenAI SDK you already know. just changed the model name and base URL. took me like 4 minutes to migrate my first endpoint.

Qwen: The One With Too Many Models

okay so Qwen is the Alibaba one and let me tell you, these guys release models like it's their JOB. because it is. the model range is INSANE. they have something for literally every budget.

the prices are kinda all over the place:

Model	$/M Output	What It's Good For
Qwen3-8B	$0.01	ultra cheap, basic stuff
Qwen3-32B	$0.28	my daily driver honestly
Qwen3-Coder-30B	$0.35	coding specifically
Qwen3-VL-32B	$0.52	image understanding
Qwen3-Omni-30B	$0.52	multimodal stuff
Qwen3.5-397B	$2.34	big boy enterprise reasoning
Qwen3.6-35B	$1.00	(this one feels steep tbh)

and the top range goes up to $3.20/M which is like... why. but whatever, you don't HAVE to use those.

what I genuinely love about Qwen is the VL (vision-language) series. Qwen3-VL-32B does image stuff that DeepSeek literally cannot do, and it's only $0.52/M. I use it for processing screenshots users upload. works great.
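for reference, a rough sketch of what a screenshot request can look like in the OpenAI chat format, which carries images as `image_url` content parts. treat the details as assumptions: `build_screenshot_messages` is a made-up helper, the model ID `qwen3-vl-32b` is a guess at how the gateway names it, and whether base64 data URLs are accepted depends on the provider.

```python
import base64


def build_screenshot_messages(image_bytes: bytes, question: str, mime: str = "image/png"):
    """Build an OpenAI-style user message holding a question plus one image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # The image travels inline as a base64 data URL.
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }
    ]


# Stand-in bytes; in practice this would be the uploaded file's contents.
messages = build_screenshot_messages(b"\x89PNG fake bytes", "What error is shown in this screenshot?")

# With a configured client, the call would then look like (model ID assumed):
# client.chat.completions.create(model="qwen3-vl-32b", messages=messages)
```

building the payload in its own function keeps it easy to check without hitting the API.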

also the Omni model is wild: it handles audio, video, and images all in one. I haven't had a real use case for this yet but I want one.

the downsides? the naming is CONFUSING. Qwen3-8B vs Qwen3.5-397B vs Qwen3-Coder-30B vs Qwen3-Omni-30B... I have to look these up every time. someone at Alibaba needs to chill with the versioning.

also some models are kinda overpriced. Qwen3.6-35B at $1/M for output feels steep when DeepSeek V4 Pro does similar stuff for $0.78. but the range means you can find exactly what you need if you're willing to dig.

oh and a thing I noticed: mid-range English performance is good but not quite DeepSeek level. it's close, but DeepSeek feels slightly more "natural" in English to me. could just be vibes though.

Kimi: The Brainy One

okay Kimi is interesting. it's made by Moonshot AI and these guys are CLEARLY optimizing for reasoning. the model literally thinks out loud before answering and it's kinda beautiful to watch.

but here's the catch: Kimi is EXPENSIVE. like, all of its models.

Model	$/M Output
K2.5	$3.00
(premium tier goes up to $3.50/M)

that's per MILLION output tokens. yeah. ouch.

so here's my take: if you need heavy reasoning (complex math, multi-step logic, planning out agent workflows), Kimi is genuinely best in class from what I tested. it blew my mind on some of the harder problems. but for everyday stuff? no way I'm paying $3/M when DeepSeek V4 Flash does 95% as well for $0.25/M.
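to put that gap in dollars, here's a quick back-of-the-envelope calculation. the traffic numbers (5,000 requests a day, 400 output tokens each) are invented purely for illustration and are not figures from this post; only the two per-million prices come from the tables above. input tokens are ignored.

```python
def monthly_output_cost(requests_per_day: int, avg_output_tokens: int,
                        price_per_million: float, days: int = 30) -> float:
    """Output-token cost in USD for one month of steady traffic."""
    total_tokens = requests_per_day * avg_output_tokens * days
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical workload: 5,000 requests/day, 400 output tokens per reply.
kimi = monthly_output_cost(5_000, 400, 3.00)    # Kimi K2.5 at $3.00/M
flash = monthly_output_cost(5_000, 400, 0.25)   # DeepSeek V4 Flash at $0.25/M

print(f"Kimi K2.5:         ${kimi:,.2f}/month")
print(f"DeepSeek V4 Flash: ${flash:,.2f}/month")
print(f"ratio: {kimi / flash:.0f}x")
```

under those made-up numbers that's 60M output tokens a month: $180 on Kimi versus $15 on V4 Flash, a 12x difference whatever the volume.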

I use Kimi for specific tasks only, like when I have a hard algorithmic problem and I want the BEST answer, not a fast one. it's a specialist tool, not a daily driver.

also Kimi doesn't do vision. at all. if you need multimodal, look elsewhere.

but I gotta say, when it comes to pure reasoning, Kimi was the king in my tests. for the right use case it's worth the money. for everything else, it's overkill.

GLM: The Underrated One

okay GLM is made by Zhipu AI and honestly I think they're the most underrated of the bunch. they've been around forever and they're really good at Chinese-language stuff (obviously) but also just solid in general.

here's the lineup:

Model	$/M Output
GLM-4-9B	$0.01
GLM-5	$1.92

so it's a weird range: super cheap at the bottom and kinda pricey at the top. but the cheap tier at $0.01/M is honestly wild. for simple tasks that's basically free.

the flagship GLM-5 at $1.92/M is genuinely good. not as cheap as DeepSeek V4 Flash ($0.25) but for Chinese-language tasks specifically, GLM-5 is hard to beat. if you're doing anything Chinese-related (translation, content for Chinese users, etc.), you want GLM.

also GLM-4.6V does vision stuff which is great. their multimodal game is solid.

but honestly for me, GLM sits in this awkward middle ground. it's not the cheapest, not the fastest, not the absolute best at reasoning. but it's REALLY good at Chinese and has solid vision support. if those matter to you, GLM is your pick.
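the picks from the sections above boil down to a simple "route by task" rule, which could be sketched like this. the task labels and the `pick_model` helper are hypothetical, and every model ID except `deepseek-v4-flash` is a guess at the gateway's naming, so check the provider's model list before copying.

```python
# Task label -> model ID. Only "deepseek-v4-flash" appears in this post's
# code; the other IDs are assumed spellings of the models discussed above.
ROUTES = {
    "code": "deepseek-v4-flash",
    "support": "deepseek-v4-flash",
    "vision": "qwen3-vl-32b",       # assumed ID for Qwen3-VL-32B
    "hard_reasoning": "kimi-k2.5",  # assumed ID for Kimi K2.5
    "chinese": "glm-5",             # assumed ID for GLM-5
}

DEFAULT_MODEL = "deepseek-v4-flash"  # cheap default for everything else


def pick_model(task: str) -> str:
    """Return the model for a task label, falling back to the cheap default."""
    return ROUTES.get(task, DEFAULT_MODEL)


print(pick_model("vision"))
print(pick_model("blog_post"))  # unknown label, so the default is used
```

because every model sits behind the same OpenAI-compatible endpoint, the result can go straight into the `model=` argument of the existing call.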

I personally don't use GLM much because my product