purecast

Posted on Jun 30

I Ran My Freelance Work Through 4 Chinese LLMs — Here's the Damage

#programming #machinelearning #api #python

Last month I made a decision that probably saved me hundreds of dollars and definitely gave me a headache: I routed all my client work through Chinese-built LLMs for 30 days straight. DeepSeek, Qwen, Kimi, and GLM — every model family that matters from the Chinese AI scene. I'm a freelance dev, which means every API call is coming out of my pocket, not some VC-backed company's burn rate. I pay attention to what things cost.

This is the breakdown nobody asked for but every freelancer needs.

The Short Version (Before I Dive In)

Look, I get it — you're busy. You bill by the hour. Here's the TL;DR before I spend your attention:

DeepSeek V4 Flash is the cheapest viable daily driver I've ever used at $0.25/M output
Qwen3-32B is my pick when I need something reliable at $0.28/M
Kimi K2.5 is what I reach for when the client is paying premium and the problem is gnarly at $3.00/M
GLM-5 handles my Mandarin-language contracts at $1.92/M

If you're a freelancer reading this and you're still paying OpenAI prices for everything, we need to talk.

Why I Even Started This Experiment

I had a realization around tax season this year. I'd been charging clients $95/hour for backend dev work, but a chunk of that hour goes to API costs — code generation, refactoring suggestions, documentation drafts, the usual LLM-assisted workflow stuff. I was bleeding maybe $200/month on API calls without really tracking which calls were earning their keep.

So I got 精打细算 (counting every penny) about it. Meticulous. I started logging every request, every token, every dollar. And then I asked myself: am I getting more value than I'm paying for, or am I just defaulting to whatever I signed up for first?

Answer: I was defaulting. Badly.

The four model families I tested — DeepSeek, Qwen, Kimi, and GLM — all run through Global API's unified endpoint, which means I don't have to manage four separate accounts, four billing systems, four SDK setups. One key, one base_url, and I'm done. That alone was worth the switch for me.

The Side-by-Side That Actually Matters

Before I get into deep dives, here's the comparison table I built for my own notes. Sharing it because it's genuinely useful when you're deciding which model to spin up for a new client task.

What I Care About	DeepSeek	Qwen	Kimi	GLM
Who Made It	DeepSeek (幻方)	Alibaba (阿里)	Moonshot AI (月之暗面)	Zhipu AI (智谱)
Price Range	$0.25–$2.50/M	$0.01–$3.20/M	$3.00–$3.50/M	$0.01–$1.92/M
My Go-To Budget Pick	V4 Flash @ $0.25/M	Qwen3-8B @ $0.01/M	(no budget tier)	GLM-4-9B @ $0.01/M
My Daily Driver	V4 Flash @ $0.25/M	Qwen3-32B @ $0.28/M	K2.5 @ $3.00/M	GLM-5 @ $1.92/M
Code Generation	5/5	4/5	4/5	3/5
Chinese Quality	4/5	4/5	5/5	5/5
English Quality	5/5	4/5	4/5	4/5
Reasoning	4/5	4/5	5/5	4/5
Raw Speed	5/5	4/5	3/5	4/5
Image/Video Support	Limited	Yes (VL, Omni)	No	Yes (GLM-4.6V)
Context Window	128K	128K	128K	128K
OpenAI-Compatible API	Yes	Yes	Yes	Yes

A few things jumped out immediately. First: Kimi is the only family with no sub-$1/M tier. If you're running on a tight budget, you can basically rule it out for high-volume work. Second: Qwen has the widest spread, which means I can pick the exact price-per-quality point I want. Third: every one of them speaks OpenAI's API dialect, which is huge for someone like me who doesn't want to learn four different auth schemes.

DeepSeek: The One That Pays My Rent

I'm going to start with DeepSeek because it handles about 60% of my output tokens this month, and I mean that in the best possible way.

The Models I Actually Use

Here's what I keep in my mental rotation:

V4 Flash at $0.25/M — this is my workhorse. If a client says "write me a CRUD endpoint," this is what handles it.
V3.2 at $0.38/M, the previous-generation architecture, which I use when I want to A/B test outputs.
V4 Pro at $0.78/M — production-quality stuff where I need confidence in the response.
R1 (Reasoner) at $2.50/M — complex math, gnarly logic puzzles, architecture decisions. Worth every cent when the problem actually requires it.
Coder at $0.25/M — code-specific tasks, same price as Flash but tuned differently.

Why I'm Sticking With It

The price-to-quality ratio on V4 Flash is almost absurd. I was paying OpenAI roughly $10/M for output on a comparable model, and DeepSeek is giving me roughly the same quality at $0.25/M. That's a 40x cost reduction. On a typical month where I'm doing 50-80M output tokens, the math writes itself.
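To make that arithmetic concrete, here's a quick sketch using only the numbers in the paragraph above. The function name is mine, the $10/M figure is my rough recollection rather than a published rate, and it assumes every output token goes through V4 Flash, which isn't quite how my month actually splits.

```python
# Back-of-envelope check of the numbers in this section.
# Prices are USD per million output tokens.
OLD_PRICE_PER_M = 10.00    # rough figure I was paying before, not an official rate
FLASH_PRICE_PER_M = 0.25   # DeepSeek V4 Flash


def monthly_cost(output_tokens_millions: float, price_per_m: float) -> float:
    """Cost in USD for a month's output tokens at a flat per-million price."""
    return output_tokens_millions * price_per_m


reduction = OLD_PRICE_PER_M / FLASH_PRICE_PER_M
print(f"Cost reduction: {reduction:.0f}x")

for volume in (50, 80):  # my typical monthly range, in millions of output tokens
    cost = monthly_cost(volume, FLASH_PRICE_PER_M)
    print(f"{volume}M output tokens on V4 Flash: ${cost:.2f}")
```

That works out to 40x, and somewhere between $12.50 and $20.00 a month if Flash handled everything.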

Speed is the other thing. V4 Flash clocks around 60 tokens per second in my real-world usage, which means when I'm in flow and asking the model to generate code while I'm thinking, I don't get blocked waiting for a response. That matters for billable hours more than people realize — every second waiting is a second I'm not typing.

For code generation specifically, DeepSeek has been consistently top-tier on HumanEval and MBPP benchmarks, and that tracks with what I see in production. The model understands context, doesn't hallucinate random package names, and rarely produces syntactically broken code.

Where It Falls Down

Three weaknesses worth flagging:

Vision is basically nonexistent. If I need to look at a screenshot a client sent me, I can't use DeepSeek for that. I route to Qwen or GLM instead.
Chinese-language output is good but not the best. GLM and Kimi beat it on Chinese-language benchmarks, and I do have a couple of Mandarin-speaking clients where this matters.
The model lineup is narrower than Qwen's. Fewer size options means less flexibility to trade off cost against capability.

How I Actually Call It

Here's the snippet I keep in my snippets/ folder. If you're integrating DeepSeek via Global API, this is all you need:

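from openai import OpenAI

# One client for every model family: only the key and base_url differ from a stock OpenAI setup.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder; use your own Global API key
    base_url="https://global-apis.com/v1"
)

# Standard chat completion call; the model string selects DeepSeek V4 Flash.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Write a Python function to debounce API calls with exponential backoff"}
    ]
)
print(response.choices[0].message.content)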

That's it. The OpenAI SDK, one client object, and you're talking to DeepSeek. If you've ever used OpenAI's API directly, there's literally zero learning curve.
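Since I mentioned logging every request and every dollar, here's a minimal sketch of what that can look like. The helper name, the CSV layout, and the price table are my own illustration, not anything Global API provides, and it only counts output tokens because those are the prices I quote in this post.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Illustrative price table (USD per million output tokens), using prices quoted in this post.
PRICE_PER_M = {
    "deepseek-v4-flash": 0.25,
    "Qwen/Qwen3-32B": 0.28,
}


def log_usage(model: str, completion_tokens: int, log_path: Path) -> float:
    """Append one request to a CSV log and return its estimated output cost in USD."""
    cost = completion_tokens / 1_000_000 * PRICE_PER_M[model]
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header once, when the file is first created.
            writer.writerow(["timestamp_utc", "model", "completion_tokens", "cost_usd"])
        writer.writerow(
            [datetime.now(timezone.utc).isoformat(), model, completion_tokens, f"{cost:.6f}"]
        )
    return cost
```

With the OpenAI SDK, the token count comes from `response.usage.completion_tokens`, so after a call it would be something like `log_usage("deepseek-v4-flash", response.usage.completion_tokens, Path("usage.csv"))`.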

Qwen: The Toolbox With Everything In It

Qwen is what I call the Swiss Army knife family, and the nickname fits because Alibaba has put out a model for basically every use case you can imagine.

What I Keep on Hand

Qwen3-8B at $0.01/M — yes, one cent. For ultra-light tasks like classification, simple extraction, keyword tagging.
Qwen3-32B at $0.28/M — my general-purpose pick when DeepSeek isn't the right fit.
Qwen3-Coder-30B at $0.35/M — code generation when I want a second opinion.
Qwen3-VL-32B at $0.52/M — image understanding tasks.
Qwen3-Omni-30B at $0.52/M — audio, video, image, all in one model.
Qwen3.5-397B at $2.34/M — when the client is paying for the enterprise reasoning tier.

The price spread here is wild. From $0.01 to $3.20/M. There's literally a Qwen model for any budget you can think of.

Where It Shines

The biggest advantage Qwen has is vision and multimodal coverage. DeepSeek can't look at images. Kimi can't look at images. But Qwen has VL models and Omni models that handle audio, video, and images. When a client sends me a Figma screenshot and asks "what does this component do?" — Qwen handles that.

Qwen3-Omni-30B at $0.52/M is genuinely one of the best deals I've found for multimodal work. Audio transcription plus image understanding plus text in a single API call, at half the price of comparable Western models.
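For the screenshot case, here's a rough sketch of how the message can be built. It uses the OpenAI-style `image_url` content part with an inline base64 data URL; I'm assuming the endpoint passes that format through to the VL models, and the helper name is mine.

```python
import base64
from pathlib import Path


def image_message(prompt: str, image_path: Path, mime: str = "image/png") -> dict:
    """Build one user message carrying text plus an inline base64-encoded image."""
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Data URL form: data:<mime>;base64,<payload>
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The result goes straight into `messages=[image_message("What does this component do?", Path("figma.png"))]` on the same `client.chat.completions.create` call, with the `model` set to whatever ID your provider lists for the VL model you want.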

Where It Struggles

A few things that bug me:

Naming is a mess. Qwen3, Qwen3.5, Qwen3.6, Qwen3-Coder, Qwen3-VL, Qwen3-Omni. I have a sticky note on my monitor that says "Qwen3 = new, Qwen3.5 = bigger, Qwen3.6 = ???". Alibaba ships new versions faster than I can keep track of.
English is good but not DeepSeek-tier. For pure English-language generation, DeepSeek still wins for me.
Some models are overpriced for what they offer. Qwen3.6-35B at $1/M feels steep when Qwen3-32B at $0.28/M gets me 90% of the way there.

The Snippet I Use

For general-purpose tasks, this is what I default to:

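# Reuses the `client` object created in the DeepSeek snippet above.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",  # the only line that changes between model families
    messages=[
        {"role": "user", "content": "Refactor this Express.js route to use async/await and proper error handling"}
    ]
)
print(response.choices[0].message.content)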

Same client object, different model name. That's the magic of the OpenAI-compatible endpoint — switching model families is literally changing one string.
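Because the switch is one string, routing by task type can be a plain lookup table. This is a sketch of the idea rather than my exact setup: the task labels are made up for illustration, and the Kimi and GLM entries are placeholders because I haven't shown their real model IDs in this post.

```python
# Task type -> model ID. The first two IDs are the ones used in the snippets in this post.
# The Kimi and GLM values are placeholders: look up the real strings in your provider's model list.
ROUTES = {
    "code": "deepseek-v4-flash",
    "general": "Qwen/Qwen3-32B",
    "reasoning": "<kimi-k2.5-model-id>",  # placeholder, not a real ID
    "chinese": "<glm-5-model-id>",        # placeholder, not a real ID
}
DEFAULT_TASK = "code"


def pick_model(task: str) -> str:
    """Return the model ID for a task type, falling back to the daily driver."""
    return ROUTES.get(task, ROUTES[DEFAULT_TASK])
```

Then the call becomes `client.chat.completions.create(model=pick_model("general"), messages=...)`, and anything unrecognized falls back to the cheap default.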

Kimi: The Reasoning Premium

Kimi is the family I respect the most and use the least. Here's why that tension exists.

The Lineup

Kimi doesn't really do budget models. The pricing starts at $3.00/M for K2.5 and goes up to $3.50/M for their higher-end options. That's premium tier.

When I Reach For It

Despite the price, Kimi is my go-to for one specific category of work: hard reasoning tasks. When I'm dealing with a client who has a gnarly algorithm problem — graph traversal, dynamic programming, weird edge cases in state machines — Kimi consistently outperforms the cheaper models.

It leads on reasoning benchmarks, and that matches what I've seen in practice, so it isn't just marketing fluff. If I have a problem where I genuinely need the model to think hard before responding, Kimi earns its $3.00/M.

Speed is the tradeoff. Kimi is noticeably slower than DeepSeek. About half the tokens per second, in my experience. So I don't use it for interactive work where I'm in flow — I use it for batch problems where I can walk away and come back.
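Here's what "half the tokens per second" means in wall-clock terms. The 60 tokens/second figure is what I measured for V4 Flash, 30 is just half of that, and the response sizes are illustrative. It ignores network latency and time to first token.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response, ignoring network latency and time to first token."""
    return tokens / tokens_per_second


FLASH_TPS = 60.0          # what I measured for V4 Flash
KIMI_TPS = FLASH_TPS / 2  # "about half", so roughly 30 tokens/second

for tokens in (300, 1500):  # illustrative response sizes
    flash = seconds_to_generate(tokens, FLASH_TPS)
    kimi = seconds_to_generate(tokens, KIMI_TPS)
    print(f"{tokens} tokens: Flash {flash:.0f}s, Kimi {kimi:.0f}s")
```

Five seconds versus ten is tolerable. Twenty-five versus fifty is where I stop watching the stream and go make coffee.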

The Verdict

If your work is reasoning-heavy and you're billing $150+/hour, the $3.00/M is justifiable because Kimi gets the answer right more often, which means less time debugging the model's output. If your work is more "generate boilerplate" than "solve hard problems," Kimi is overkill.
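A rough way to see the break-even: compare the price premium for one answer against what the same money buys in billable time. The 4,000-token answer size is a made-up example, and this ignores input tokens and the fact that a reasoning-heavy model may produce longer output for the same question.

```python
KIMI_PRICE_PER_M = 3.00    # Kimi K2.5, USD per million output tokens
FLASH_PRICE_PER_M = 0.25   # DeepSeek V4 Flash
HOURLY_RATE = 150.0        # USD per hour, the billing rate mentioned above


def extra_cost(output_tokens: int) -> float:
    """Extra USD paid for producing the same output on Kimi instead of Flash."""
    return output_tokens / 1_000_000 * (KIMI_PRICE_PER_M - FLASH_PRICE_PER_M)


def break_even_seconds(output_tokens: int) -> float:
    """Seconds of billable time that cost the same as the Kimi premium."""
    return extra_cost(output_tokens) / HOURLY_RATE * 3600


tokens = 4_000  # hypothetical size of one long reasoning answer
print(f"Premium: ${extra_cost(tokens):.4f}")
print(f"Break-even: {break_even_seconds(tokens):.2f}s of debugging time")
```

Under those assumptions the premium is about a cent per answer, which a fraction of a second of saved debugging pays back. The per-request price only starts to hurt at volume.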

GLM: The Quiet Performer

GLM from Zhipu AI is the family I underestimated going into this test. It's now one of my top picks for specific use cases.

What I Use

GLM-4-9B at $0.01/M — same ultra-budget tier as Qwen3-8B. Great for classification, extraction, simple transformations.
GLM-5 at $1.92/M — the flagship, where the real magic happens.

Why GLM Earned a Spot in My Rotation

Chinese language quality is exceptional. I'm not just talking benchmarks — I have a client whose codebase has Mandarin comments, Mandarin variable names, Mandarin documentation. GLM-5 handles this better than any other model I tested. Better than Kimi, better than Qwen, better than DeepSeek.

That's because Zhipu optimized GLM specifically for Chinese, and it shows in the output. For Mandarin-language work, GLM is my default.

The other thing GLM brings is GLM-4.6V for multimodal tasks. Not as feature-rich as Qwen's Omni series, but it handles image understanding reliably and it's priced competitively.

Where It Loses

Code generation is the weak spot. GLM gets the job done but it doesn't have the same polish as DeepSeek or Qwen for code-specific work. If I'm generating code, I'm not using GLM as my first choice.

English output is good but not great. It works, but if I'm writing English documentation for a Western client, I usually route to DeepSeek or Qwen for the final pass.

What I Actually Spend Money On Now

Here's the real talk — what my monthly bill looks like after switching:

About 60% of my output tokens go through DeepSeek V4 Flash ($0.25/M)
About 20% goes through Qwen3-32B ($0.28/M)
About 10% goes through GLM-5 ($1.92/M) for Chinese work
About 10% goes through Kimi K2.5 ($3.00/M) for hard reasoning
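If you want to sanity-check a split like this, the blended price is a weighted average. A quick sketch using the shares and prices from the list above, plus the 50-80M monthly output volume I mentioned earlier (output tokens only, input costs ignored):

```python
# Share of output tokens and price (USD per million output tokens) for each model.
MIX = {
    "DeepSeek V4 Flash": (0.60, 0.25),
    "Qwen3-32B": (0.20, 0.28),
    "GLM-5": (0.10, 1.92),
    "Kimi K2.5": (0.10, 3.00),
}

# Weighted average price across the whole mix.
blended = sum(share * price for share, price in MIX.values())
print(f"Blended price: ${blended:.3f}/M")

# How much of the bill comes from the two premium models.
premium = sum(share * price for name, (share, price) in MIX.items()
              if name in ("GLM-5", "Kimi K2.5"))
print(f"Premium models' share of spend: {premium / blended:.0%}")

for volume in (50, 80):  # millions of output tokens per month
    print(f"{volume}M output tokens: ${volume * blended:.2f}")
```

The blended price comes out at about $0.70/M, or roughly $35 to $56 a month at that volume. The two premium models account for about 70% of the bill while handling 20% of the tokens, so that's where any further trimming would have to come from.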

My total monthly API spend dropped from roughly $280 to about $47. Same quality of work, sometimes better.
