Alex Chen

Posted on Jun 27

I Compared Chinese AI Models to GPT-4o All Weekend — I Was Shocked

#ai #programming #machinelearning #python

I just graduated from a coding bootcamp three months ago, and I've been building little side projects like crazy. One of them is a chatbot for my friend's bakery website (yes, really — she sells sourdough, and yes, it's adorable). Anyway, I was burning through OpenAI credits fast and started wondering if there was a cheaper way. That's how I fell down this rabbit hole.

I had no idea what I was about to find.

How It All Started (And Why I'm Writing This)

So here's the thing. When I was in bootcamp, we used GPT-4o for basically everything. Homework, debugging, generating dummy data, writing tests — the usual. I just assumed that's what everyone used because, well, that's all anyone talked about.

Then one night I was doom-scrolling on Reddit and someone mentioned DeepSeek. I clicked the link, landed on their site, and immediately hit a wall. It wanted me to sign up with a Chinese phone number. I literally don't have one. I closed the tab and felt kind of dumb for even trying.

But the prices I saw before I closed it? They stuck in my head. I was shocked at how low they were.

I started asking around in some Discord servers. Someone pointed me to something called Global API. They said it lets you use Chinese AI models through the same kind of interface OpenAI uses. You pay with PayPal. You don't need a Chinese phone number. You get billed in dollars. I was skeptical but figured, what do I have to lose?

Spoiler: a lot has changed since bootcamp, and I needed to share this with other people like me who are just getting started.

The Pricing Thing That Completely Blew My Mind

Okay, let me just lay out the numbers I kept seeing over and over. I copied these from multiple sources and they all matched up, so I'm pretty confident they're right.

For American models, here's what you pay per million tokens:

GPT-4o charges $2.50 for input and $10.00 for output
Claude 3.5 Sonnet charges $3.00 for input and $15.00 for output
Gemini 1.5 Pro charges $1.25 for input and $5.00 for output
GPT-4o-mini charges $0.15 for input and $0.60 for output

For Chinese models:

DeepSeek V4 Flash charges $0.18 for input and $0.25 for output
Qwen3-32B charges $0.18 for input and $0.28 for output
GLM-5 charges $0.73 for input and $1.92 for output
Kimi K2.5 charges $0.59 for input and $3.00 for output

I stared at these numbers for way too long. Like, am I reading this wrong? GPT-4o is 40 times more expensive than DeepSeek V4 Flash at output. Forty. Times. I had no idea the gap was this wide.
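To double-check that math, here's a minimal sketch that computes the output-price ratio from the numbers listed above. The dictionary and function names are just made up for illustration, and the prices are the ones quoted in this post, not live data.

```python
# Output prices in USD per million tokens, copied from the lists above.
# (Illustrative names; these are the post's quoted prices, not live data.)
OUTPUT_PRICE = {
    "GPT-4o": 10.00,
    "Claude 3.5 Sonnet": 15.00,
    "Gemini 1.5 Pro": 5.00,
    "GPT-4o-mini": 0.60,
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-32B": 0.28,
    "GLM-5": 1.92,
    "Kimi K2.5": 3.00,
}


def output_price_ratio(expensive: str, cheap: str) -> float:
    """Return how many times the first model's output price is the second's."""
    return OUTPUT_PRICE[expensive] / OUTPUT_PRICE[cheap]


if __name__ == "__main__":
    ratio = output_price_ratio("GPT-4o", "DeepSeek V4 Flash")
    print(f"GPT-4o vs DeepSeek V4 Flash: {ratio:.0f}x")  # prints 40x
```

Swapping in any other pair of names from the dictionary gives the ratio for that matchup.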

Let me be honest — when something is 40x cheaper, my first reaction is "okay, it must be garbage." That's what bootcamp drilled into me. You get what you pay for. Pick the expensive tool because it's better.

But then I saw the benchmark scores. And that's where things got really weird.

The Benchmark Stuff I Barely Understood (But Tried Anyway)

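Full disclosure: I had to look up what MMLU, HumanEval, and C-Eval even were. Turns out they're standardized tests for AI models. MMLU is multiple-choice questions across dozens of academic subjects, HumanEval is a set of Python programming problems checked by unit tests, and C-Eval is a Chinese-language exam suite. Cool. Got it.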

General Reasoning (The Brain Test)

On the MMLU-style reasoning benchmarks, the scores look like this:

GPT-4o scores 88.7 and costs $10.00 per million output tokens
Claude 3.5 Sonnet scores 89.0 and costs $15.00 per million output tokens
Kimi K2.5 scores 87.0 and costs $3.00 per million output tokens
DeepSeek V4 Flash scores 85.5 and costs $0.25 per million output tokens
GLM-5 scores 86.0 and costs $1.92 per million output tokens
Qwen3.5-397B scores 87.5 and costs $2.34 per million output tokens

Okay wait. Let me say that again. DeepSeek V4 Flash is about 3 points behind GPT-4o on reasoning, and it costs 40 times less. Three points. That's it. That's the whole quality difference.

I was shook, honestly.
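Here's a quick sketch that puts the score gap and the price gap side by side, using only the numbers from the list above. The variable and function names are made up for illustration, and it needs Python 3.9 or newer for the type hints.

```python
# (score, output price in USD per million tokens) from the list above.
REASONING = {
    "GPT-4o": (88.7, 10.00),
    "Claude 3.5 Sonnet": (89.0, 15.00),
    "Kimi K2.5": (87.0, 3.00),
    "DeepSeek V4 Flash": (85.5, 0.25),
    "GLM-5": (86.0, 1.92),
    "Qwen3.5-397B": (87.5, 2.34),
}


def compare(baseline: str, other: str) -> tuple[float, float]:
    """Return (score gap in points, price ratio) of baseline relative to other."""
    base_score, base_price = REASONING[baseline]
    other_score, other_price = REASONING[other]
    gap = round(base_score - other_score, 1)
    ratio = round(base_price / other_price, 1)
    return gap, ratio


if __name__ == "__main__":
    for name in REASONING:
        if name != "GPT-4o":
            gap, ratio = compare("GPT-4o", name)
            print(f"{name}: GPT-4o is {gap:+.1f} points, at {ratio}x the output price")
```

A negative gap means the other model scored higher than GPT-4o, and a ratio below 1 means GPT-4o is the cheaper of the two.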

Code Generation (The Bootcamp Test)

Since I'm a code person, this is the one I cared about most. HumanEval is basically "can this thing write working code?" Here's what I found:

DeepSeek V4 Flash scores 92.0 and costs $0.25 per million tokens
Qwen3-Coder-30B scores 91.5 and costs $0.35 per million tokens
GPT-4o scores 92.5 and costs $10.00 per million tokens
Claude 3.5 Sonnet scores 93.0 and costs $15.00 per million tokens
DeepSeek Coder scores 91.0 and costs $0.25 per million tokens

So DeepSeek V4 Flash scores 92.0 on code. GPT-4o scores 92.5. That's a half-point difference. For forty times the price.

I had to read that multiple times. My brain kept trying to find the catch. The catch never came.

Chinese Language (The Surprise Test)

This one surprised me the most. On C-Eval, which tests Chinese language understanding:

GLM-5 scores 91.0 and costs $1.92 per million tokens
Kimi K2.5 scores 90.5 and costs $3.00 per million tokens
Qwen3-32B scores 89.0 and costs $0.28 per million tokens
GPT-4o scores 88.5 and costs $10.00 per million tokens
DeepSeek V4 Flash scores 88.0 and costs $0.25 per million tokens

The American model comes in second to last here, ahead of only DeepSeek V4 Flash. GLM-5 is the winner. I would never have guessed that in a million years.
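For anyone who wants to check the ordering, here's a small sketch that sorts the C-Eval scores listed above from highest to lowest. The names are illustrative and the scores are just the ones quoted in this post.

```python
# C-Eval scores copied from the list above (not re-measured).
CEVAL = {
    "GLM-5": 91.0,
    "Kimi K2.5": 90.5,
    "Qwen3-32B": 89.0,
    "GPT-4o": 88.5,
    "DeepSeek V4 Flash": 88.0,
}


def ranked(scores: dict) -> list:
    """Return (model, score) pairs sorted from best to worst score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for place, (model, score) in enumerate(ranked(CEVAL), start=1):
        print(f"{place}. {model}: {score}")
```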

Why This Isn't Common Knowledge (The Annoying Part)

Okay so if Chinese AI is so good and so cheap, why isn't everyone using it? I asked this out loud and a friend in the Discord looked at me like I was from another planet.

Here's the deal. There's a massive access problem.

If you want to use DeepSeek, Kimi, Qwen, or GLM directly, you usually run into:

A Chinese phone number requirement to register (which I don't have and you probably don't either)
WeChat Pay or Alipay as the only ways to pay (again, not happening for most people outside China)
Documentation that's mostly in Chinese
Weird geo-restrictions, at least some of the time

It's like being told there's a free buffet across town, but you don't have a car, the bus doesn't go there, and the menu is in a language you can't read. That's basically the situation.

I had no idea this was the actual barrier. I thought it was quality. It was never quality.

Global API: How I Actually Got Access

This is the part where I tell you how I personally got around all that stuff. Global API is the tool I kept hearing about, so I gave it a shot. They basically give you one endpoint that works for both American and Chinese models, with OpenAI-compatible formatting.

That last part is huge. In bootcamp we all learned how to call the OpenAI API. So when something says it's "OpenAI-compatible," that means I don't have to learn a new system. It's the same code, just pointed at a different URL.
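One way to picture "same code, different URL" is to keep the provider details in a small settings table and pick one at startup. This is only a sketch under assumptions: the `PROVIDERS` table, the `client_settings` helper, and the `GLOBAL_API_KEY` environment variable name are invented for illustration, and the Global API base URL is the one quoted later in this post. The returned dictionary is meant to be passed as keyword arguments to the `OpenAI(...)` constructor from the `openai` package.

```python
import os

# Hypothetical settings table. GLOBAL_API_KEY is an assumed variable name,
# not an official one; OPENAI_API_KEY is the one the openai package reads.
PROVIDERS = {
    "openai": {
        "base_url": None,  # None means: keep the openai package's default URL
        "key_env": "OPENAI_API_KEY",
    },
    "global_api": {
        "base_url": "https://global-apis.com/v1",  # URL as given in this post
        "key_env": "GLOBAL_API_KEY",
    },
}


def client_settings(provider: str) -> dict:
    """Build keyword arguments for an OpenAI-compatible client."""
    entry = PROVIDERS[provider]
    settings = {"api_key": os.environ.get(entry["key_env"], "")}
    if entry["base_url"] is not None:
        settings["base_url"] = entry["base_url"]
    return settings


if __name__ == "__main__":
    # With the openai package installed, the next step would be:
    #   client = OpenAI(**client_settings("global_api"))
    print(client_settings("global_api")["base_url"])
```

The point of the table is that switching providers becomes a one-word change, while the rest of the chat code stays the same.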

You sign up with email. You pay with PayPal or a regular credit card. You get billed in dollars. The docs are in English. You can be sitting in Ohio or Oman or wherever and it just works.

I cannot overstate how much this changed things for me.

The Head-to-Head Battles I Ran

I spent basically my whole Saturday doing side-by-side comparisons. Here's what I found.

DeepSeek V4 Flash vs GPT-4o

For output pricing, DeepSeek V4 Flash is $0.25 per million tokens and GPT-4o is $10.00 per million tokens. That's a 40x difference.

Quality-wise, I'd give both 4 or 5 stars, depending on what you're doing. GPT-4o is slightly better at general reasoning and it has vision (meaning you can send it images). DeepSeek V4 Flash is faster, at about 60 tokens per second versus 50, and it basically ties on code (92.0 versus 92.5 on HumanEval).

The big one? GPT-4o can process images. DeepSeek V4 Flash can't (at least not in this version). If you need vision, GPT-4o still wins. If you don't, V4 Flash is the obvious pick.
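That rule of thumb is simple enough to write down as a tiny routing function. This is a sketch, not code from the bakery project: `pick_model` is a made-up helper, `deepseek-v4-flash` is the model name used in the code later in this post, and `gpt-4o` is OpenAI's model ID.

```python
def pick_model(needs_vision: bool) -> str:
    """Choose a model ID following the rule of thumb above."""
    if needs_vision:
        return "gpt-4o"  # accepts image input
    return "deepseek-v4-flash"  # cheaper, text-only per this comparison


if __name__ == "__main__":
    print(pick_model(needs_vision=False))
    print(pick_model(needs_vision=True))
```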

Qwen3-32B vs GPT-4o-mini

This one was close to a sweep. Qwen3-32B is $0.28 per million output tokens versus GPT-4o-mini at $0.60, so Qwen3 wins on output price (about 2.1x cheaper). In my comparisons it also beat GPT-4o-mini on general quality, on code, and definitely on Chinese. The only place I found GPT-4o-mini ahead is input price, where it's slightly cheaper ($0.15 versus $0.18 per million tokens). If your workload is output-heavy and you're still using GPT-4o-mini in 2026, you're leaving money on the table.
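Since input and output tokens are priced separately, a fairer comparison multiplies each price by how many tokens you actually use. Here's a rough sketch using the prices listed earlier in this post. The function name and the example token counts are made up for illustration.

```python
# (input, output) prices in USD per million tokens, from the lists earlier.
PRICES = {
    "GPT-4o-mini": (0.15, 0.60),
    "Qwen3-32B": (0.18, 0.28),
}


def estimated_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the bill in USD for a given number of tokens."""
    input_price, output_price = PRICES[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical month: 5 million input tokens, 2 million output tokens.
    for model in PRICES:
        cost = estimated_cost(model, 5_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
```

With those example counts the totals come out to about $1.95 for GPT-4o-mini and $1.46 for Qwen3-32B, but an input-heavy workload with very little output would narrow or flip that gap.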

Kimi K2.5 vs Claude 3.5 Sonnet

Kimi K2.5 costs $3.00 per million output tokens, and Claude 3.5 Sonnet costs $15.00. That's a 5x price gap. On reasoning, they're basically tied — both deserve 5 stars. On Chinese, Kimi wins by a mile (because, well, it should). Claude is great at writing-style stuff, but if you're optimizing for cost, Kimi is the call.

Some Real Code From My Bakery Project

Let me show you what this actually looks like in code, because that's the part my bootcamp friends always ask about. The base URL is https://global-apis.com/v1 and the API is OpenAI-compatible, so I didn't have to change anything except the API key, the base URL, and the model name.

Here's how I switched my bakery chatbot from GPT-4o to DeepSeek V4 Flash:


python
from openai import OpenAI

# Old setup - calling GPT-4o through OpenAI directly
# client = OpenAI(api_key="sk-...")

# New setup - paying $0.25 per million output tokens
client = OpenAI(
    api_key="YOUR_GLOBAL_API_KEY",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a friendly bakery assistant. Help customers pick

DEV Community