Alex Chen

Posted on Jun 28

I Ran DeepSeek, Qwen, Kimi, and GLM Through Real Client Work

#machinelearning #webdev #ai #programming

Last Tuesday I had a problem. A client wanted me to build a content moderation pipeline that could handle roughly 2 million tokens a day, route Chinese customer support emails, and run a coding assistant for their internal dev team. The budget? About $200/month for inference.

That's when I fell down the Chinese AI model rabbit hole.

I've been a freelance dev for six years. I bill by the hour, which means every API call is money out of my own pocket when I'm prototyping. I don't have a CTO approving six-figure LLM budgets. I have a notebook where I write down what each query cost me, and I cross-reference it against what I charged the client.

So when I say I spent a weekend pitting DeepSeek, Qwen, Kimi, and GLM against each other on actual billable work, I mean I literally tracked the dollar difference between them. Here's what I found.

Why I Even Looked at Chinese Models

Honestly? I resisted for a while. I've been running OpenAI and Anthropic for years. Muscle memory, mostly. But a buddy of mine who's also freelancing showed me his March invoice from a Chinese provider. His bill was $47. Mine was $412. Same kind of work. That got my attention.

I started small. Pulled in DeepSeek first because every dev thread I read said it was cheap. Then I branched out. Qwen because Alibaba's name kept popping up. Kimi because I needed something with real reasoning chops. And GLM because I had a bilingual project that wasn't getting the Chinese quality I needed from Western models.

All four have OpenAI-compatible APIs, which means I didn't have to rewrite a single line of my existing code. That's the unlock right there. Swap the base URL, swap the model name, done.

Here's how I actually tested them.

The Test Setup (Real Numbers, Not Vibes)

I built a small benchmark suite. Four jobs that mirror what my clients actually pay me for:

Bulk content summarization — 800 articles, average 2,000 tokens each
English coding tasks — LeetCode-style problems plus real codebase refactoring
Chinese customer email classification — routing intents for a Shanghai-based e-commerce client
Multi-step reasoning — math word problems, logic puzzles, the stuff my consulting clients throw at me

I ran each job through every model. Tracked tokens, tracked cost, tracked whether the output was usable on the first try or needed a re-roll.
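A minimal sketch of that kind of tracking, for anyone who wants something more structured than a notebook. The `RunRecord` fields and the `summarize` helper are hypothetical names chosen for illustration, not part of any library, and the sketch only counts output tokens.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark run. All field names here are illustrative."""
    model: str
    job: str
    output_tokens: int
    price_per_million: float  # USD per 1M output tokens
    usable_first_try: bool

    @property
    def cost_usd(self) -> float:
        # Output-token cost only; input tokens are ignored in this sketch.
        return self.output_tokens / 1_000_000 * self.price_per_million


def summarize(records: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Aggregate run count, total cost, and first-try success rate per model."""
    summary: dict[str, dict[str, float]] = {}
    for r in records:
        s = summary.setdefault(
            r.model, {"runs": 0, "cost_usd": 0.0, "first_try_ok": 0}
        )
        s["runs"] += 1
        s["cost_usd"] += r.cost_usd
        s["first_try_ok"] += int(r.usable_first_try)
    for s in summary.values():
        s["first_try_rate"] = s["first_try_ok"] / s["runs"]
    return summary
```

Feeding it one `RunRecord` per model per job gives a per-model cost and first-try rate side by side, which is the comparison the rest of this post leans on.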

Here's what each one charges per million output tokens (input is cheaper, but output is where bills explode):

Provider	Budget Pick	Mid-Tier Workhorse	Premium Model	Range
DeepSeek	V4 Flash @ $0.25	V4 Pro @ $0.78	R1 @ $2.50	$0.25–$2.50
Qwen	Qwen3-8B @ $0.01	Qwen3-32B @ $0.28	Qwen3.5-397B @ $2.34	$0.01–$3.20
Kimi	—	K2.5 @ $3.00	K2.5 Pro @ $3.50	$3.00–$3.50
GLM	GLM-4-9B @ $0.01	GLM-4 Plus @ $0.92	GLM-5 @ $1.92	$0.01–$1.92

Kimi doesn't really do "budget." That's the first thing to know. Everything they sell is priced like premium whiskey.
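To make those prices easy to plug into estimates, here is a small sketch that turns the table into a lookup. The dictionary keys are the display names from the table, not confirmed API model IDs, and `output_cost` and `cheapest_first` are hypothetical helper names.

```python
# USD per 1M output tokens, copied from the table above.
# Keys are display names from the post, not necessarily API model IDs.
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4 Flash": 0.25,
    "DeepSeek V4 Pro": 0.78,
    "DeepSeek R1": 2.50,
    "Qwen3-8B": 0.01,
    "Qwen3-32B": 0.28,
    "Qwen3.5-397B": 2.34,
    "Kimi K2.5": 3.00,
    "Kimi K2.5 Pro": 3.50,
    "GLM-4-9B": 0.01,
    "GLM-4 Plus": 0.92,
    "GLM-5": 1.92,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Output-token cost in USD for a given model and token count."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def cheapest_first(models: list[str]) -> list[str]:
    """Sort candidate models from cheapest to most expensive output price."""
    return sorted(models, key=lambda m: OUTPUT_PRICE_PER_M[m])
```

For example, `output_cost("Kimi K2.5", 2_000_000)` works out to $6.00, while the same volume on V4 Flash is $0.50.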

DeepSeek: My New Default for Most Stuff

I went into this thinking DeepSeek would be a curiosity. I left thinking it's my new daily driver.

V4 Flash at $0.25/M output is the headline number. That's not a typo. A quarter per million tokens. Let me put that in freelance terms: if I process 1 million output tokens in a month, that's 25 cents. I used to spend that on a single complex GPT-4 call.

The model itself? Fast. I clocked V4 Flash at around 60 tokens per second on average, which is among the snappiest I've seen. It handled my English coding benchmarks almost as well as GPT-4o, and on HumanEval-style problems it punched above its weight. For the content summarization job, it was my second-cheapest option and quality was a pass — meaning the client didn't ask me to redo it.

Where it stumbles: No native vision. If your client needs image understanding, DeepSeek isn't doing it. Chinese-language quality is also slightly behind GLM and Kimi — not bad, just not the leader. And the model lineup isn't as deep as Qwen's, so if you need a very specific size or behavior, you might not find a match.

For me, the math is simple. If I billed 40 hours last month and 12 of those hours leaned on GPT-4 calls, I was probably spending $80–$150 on inference alone. With V4 Flash, that drops to maybe $15. That's roughly an extra $100 in my pocket (somewhere between $65 and $135) for the same deliverables.

Here's what the swap looks like in practice:

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Refactor this Python class to use dataclasses"}
    ]
)
print(response.choices[0].message.content)

That's literally the only change from my old OpenAI code. New API key, new model name, new URL. Everything else identical.

Qwen: The One With the Most Options

If DeepSeek is a scalpel, Qwen is a Swiss Army knife. Alibaba's team has built a model for basically every niche I can think of.

The lineup is wild:

Qwen3-8B at $0.01/M — For tasks I used to skip because the cost wasn't worth it. Tag generation, simple classification, anything high-volume and low-complexity.
Qwen3-32B at $0.28/M — My general-purpose pick. Slightly more than DeepSeek V4 Flash, but it handles ambiguity better in my experience.
Qwen3-Coder-30B at $0.35/M — Specifically tuned for code. I haven't stress-tested this one enough yet, but initial runs were solid.
Qwen3-VL-32B at $0.52/M — Vision-language model. This is what I reach for when the client sends me a screenshot and asks "what does this error mean?"
Qwen3-Omni-30B at $0.52/M — Audio, video, image, text. I haven't had a project that needed this yet, but it's nice to know it exists.
Qwen3.5-397B at $2.34/M — Their enterprise reasoning beast. Overkill for most freelance work, but for the one consulting gig a year that needs serious inference, it's there.

The price ladder is the real story. I can route different parts of the same pipeline to different Qwen models and optimize cost without leaving the API. Summarization goes through Qwen3-8B at $0.01. The complex reasoning layer goes through Qwen3-32B at $0.28. Vision tasks use the VL variant at $0.52. One provider, one bill, six models to choose from.
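The routing idea can be sketched as a plain lookup table. This is an illustration under assumptions: only `Qwen/Qwen3-32B` appears verbatim as a model ID in this post, so the other two IDs are guesses that follow the same naming pattern, and `pick_model` and `build_request` are hypothetical helpers.

```python
# Task-to-model routing table. Only "Qwen/Qwen3-32B" appears verbatim in this
# post; the other two IDs are assumed to follow the same naming pattern.
QWEN_ROUTES = {
    "summarize": "Qwen/Qwen3-8B",     # assumed ID, $0.01/M tier
    "reasoning": "Qwen/Qwen3-32B",    # $0.28/M tier
    "vision": "Qwen/Qwen3-VL-32B",    # assumed ID, $0.52/M tier
}
DEFAULT_MODEL = "Qwen/Qwen3-32B"


def pick_model(task: str) -> str:
    """Return the model for a task type, falling back to the general-purpose pick."""
    return QWEN_ROUTES.get(task, DEFAULT_MODEL)


def build_request(task: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": pick_model(task),
        "messages": [{"role": "user", "content": prompt}],
    }
```

With an OpenAI-compatible client already configured, the call would be `client.chat.completions.create(**build_request("summarize", text))`. Keeping the table in one place means a price change or a new model is a one-line edit. A real vision request would also need image content in the message, which this text-only sketch leaves out.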

Where it stumbles: The naming is genuinely confusing. Qwen3, Qwen3.5, Qwen3.6, Qwen3-Coder, Qwen3-VL, Qwen3-Omni — I keep a cheat sheet pinned to my monitor. Some models in the mid-range feel overpriced for what they deliver. Qwen3.6-35B at $1/M is a tough sell when GLM-5 gives me similar quality for $1.92 but with better Chinese support.

For a freelance dev with varied clients, Qwen is the "I don't know exactly what I'll need this month" pick. That flexibility is worth a small premium.

Here's my typical Qwen call for general coding work:

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted lists"}
    ]
)
print(response.choices[0].message.content)

Kimi: The Premium Reasoning Pick

Moonshot AI built Kimi for a different crowd. The pricing tells you everything: $3.00/M for K2.5, $3.50/M for K2.5 Pro. That's not budget territory. That's "I need this to be right the first time" territory.

And honestly? When I ran my multi-step reasoning benchmarks, Kimi delivered. Math word problems, logic puzzles, multi-hop questions — it was consistently the most accurate of the four. If I'm doing a consulting engagement where the client is paying me $200/hour and the LLM call is in the critical path of the deliverable, I want Kimi.

Where it stumbles: The price. There's no budget option. Every Kimi model is a premium model. For high-volume work, this is a non-starter. I used it for maybe 5% of my test workload, and even then I was wincing at the bill.

Also: no vision/multimodal support. If your work involves images, Kimi isn't in the running.

But for the specific jobs where reasoning quality is the whole point — think legal document analysis, financial modeling assistance, complex code architecture reviews — Kimi earned its place in my toolkit. I just don't reach for it often.

GLM: The Bilingual Powerhouse

Zhipu AI's GLM family is what I pull out when a project gets serious about Chinese language quality.

GLM-5 at $1.92/M is the flagship, and on Chinese-language benchmarks it ties or beats Kimi. The reasoning isn't quite at Kimi's level in English, but for Chinese-first work, GLM is the one to beat. My Shanghai e-commerce client had me routing about 50,000 Chinese customer emails a month through GLM, and the classification accuracy was noticeably better than what I got from Western models — including the expensive ones.

The budget play: GLM-4-9B at $0.01/M. Yes, a penny per million tokens. That's not a typo. For high-volume, low-complexity Chinese tasks — entity extraction, sentiment tagging, spam filtering — this is unbeatable. I batched my email routing through this model for the easy 80% and reserved GLM-5 for the genuinely complex 20%.
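One way to sketch that easy/complex split is a two-tier router. How "easy" gets decided is not spelled out above, so the `is_short` rule below is purely a stand-in assumption, and the model names are display names from this post rather than confirmed API IDs. The `classify` callable is whatever function actually sends the email to the model.

```python
from typing import Callable, Tuple

CHEAP_MODEL = "GLM-4-9B"   # display name from the post, not a confirmed API ID
STRONG_MODEL = "GLM-5"     # display name from the post, not a confirmed API ID


def is_short(text: str, max_chars: int = 200) -> bool:
    """Hypothetical 'easy' rule: treat short emails as low-complexity."""
    return len(text) <= max_chars


def route_email(
    text: str,
    classify: Callable[[str, str], str],
    is_easy: Callable[[str], bool] = is_short,
) -> Tuple[str, str]:
    """Send easy emails to the cheap tier and everything else to the strong tier.

    `classify(text, model)` is supplied by the caller and returns an intent label.
    Returns (model_used, label).
    """
    model = CHEAP_MODEL if is_easy(text) else STRONG_MODEL
    return model, classify(text, model)
```

Swapping `is_short` for a keyword check or a confidence score from the cheap model changes the split without touching the routing code.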

Where it stumbles: Vision is there but not as mature as Qwen's. The model lineup doesn't have the depth of Qwen's, though it covers the essentials. Speed is good but not DeepSeek-fast. And for pure English work, it's solid but not exceptional — I'd usually reach for DeepSeek V4 Flash first.

For my bilingual freelance work, GLM is now non-negotiable. The combination of GLM-4-9B for volume and GLM-5 for quality gives me a Chinese-language stack that's both cheap and accurate.

The Billable Hours Math (Where I Actually Care)

Let me put this in concrete terms for fellow freelancers.

Say you have a client project that involves processing about 5 million output tokens per month across mixed tasks. Here's what each provider would cost you at my recommended picks:

DeepSeek V4 Flash only: 5M × $0.25 = $1.25/month
Qwen mixed (Qwen3-32B primary): roughly 5M × $0.28 = $1.40/month
GLM mixed (4-9B + 5): blended ~$0.50/M = $2.50/month
Kimi K2.5: 5M × $3.00 = $15.00/month

Compare that to GPT-4o at $10/M output, which would run $50/month for the same workload.

If you're billing the client $5,000 for the project and your inference cost drops from $50 to $2, that's an extra $48 in your margin. Across 10 clients a month? $480. That's a meaningful chunk of my rent.
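The arithmetic above is easy to rerun for a different monthly volume. A small sketch, using only the prices quoted in this section (the GLM figure is the blended ~$0.50/M estimate, and `monthly_cost` is a hypothetical helper name):

```python
MONTHLY_OUTPUT_TOKENS = 5_000_000

# USD per 1M output tokens, as quoted in this section.
PRICE_PER_M = {
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-32B": 0.28,
    "GLM blended": 0.50,   # rough blend of GLM-4-9B and GLM-5
    "Kimi K2.5": 3.00,
    "GPT-4o": 10.00,
}


def monthly_cost(price_per_m: float, tokens: int = MONTHLY_OUTPUT_TOKENS) -> float:
    """Monthly output-token cost in USD, rounded to cents."""
    return round(price_per_m * tokens / 1_000_000, 2)


costs = {name: monthly_cost(price) for name, price in PRICE_PER_M.items()}

# Savings per month relative to the GPT-4o baseline.
savings_vs_gpt4o = {
    name: round(costs["GPT-4o"] - cost, 2)
    for name, cost in costs.items()
    if name != "GPT-4o"
}

if __name__ == "__main__":
    for name, cost in costs.items():
        print(f"{name}: ${cost:.2f}/month")
```

Change `MONTHLY_OUTPUT_TOKENS` to your own volume and the same table falls out.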

The catch: you have to actually validate that the cheaper model gives you usable output. If I have to re-run a job three times because V4 Flash hallucinated, my time cost eats the API savings. So test before you commit. Spend an afternoon, run your real workloads, track the results. That's what I did, and it's why I can write this article with confidence instead of vibes.
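To put a number on that catch, here is a rough sketch of cost per usable result once re-runs and your own time are included. It assumes every attempt succeeds independently with the same probability, which is a simplification, and all parameter names are hypothetical.

```python
def effective_cost(
    api_cost_per_attempt: float,
    first_try_rate: float,
    minutes_per_retry: float,
    hourly_rate: float,
) -> float:
    """Expected USD cost per usable result, counting re-runs and your time.

    Assumes each attempt independently succeeds with probability
    `first_try_rate`, so the expected number of attempts is 1 / rate.
    """
    if not 0 < first_try_rate <= 1:
        raise ValueError("first_try_rate must be in (0, 1]")
    expected_attempts = 1 / first_try_rate
    expected_retries = expected_attempts - 1
    api_cost = api_cost_per_attempt * expected_attempts
    time_cost = expected_retries * (minutes_per_retry / 60) * hourly_rate
    return api_cost + time_cost
```

The time term usually dominates: at a freelance hourly rate, a few minutes spent on each re-roll costs far more than the API call itself, which is why the first-try rate matters more than the sticker price.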

What I Actually Use Day to Day

After all this testing, here's my current setup:

80% of my queries go to DeepSeek V4 Flash. Default driver. Fast, cheap, good enough for content, coding, and general reasoning.
15% goes to Qwen. Qwen3-32B when I need a slightly more polished response for client-facing copy, Qwen3-VL-32B when the task involves vision.
4% goes to GLM-4-9B or GLM-5. Anything Chinese-language, especially customer-facing.
1% goes to Kimi K2.5. The hardest reasoning tasks where I genuinely cannot afford a wrong answer.

This isn't the "right" answer for everyone. If your work is 90% Chinese, flip the priorities. If you're doing high-stakes legal AI, lean heavier on Kimi. If you're processing millions of tokens a day, the ultra-budget models from Qwen and GLM are your friends.

The Code That Ties It All Together

One of the things I love about routing everything through Global API is that my fallback logic is trivial. If one model is having a bad day, or if I want to A/B test outputs, I can swap with one line:


from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

def get_completion(prompt: str, model: str = "deepseek-v4-flash"):
    """My standard wrapper. Change the default model, change my whole stack."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Daily coding work
code_result = get_completion("Write a debounce function in JavaScript")

# Bump to Qwen when I need vision or slightly higher quality
vision_result = get_completion(
    "Explain what this error message means",  # placeholder prompt
    model="Qwen/Qwen3-32B",
)
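The fallback idea mentioned above can be sketched as a loop over an ordered list of models. This is an illustration, not a documented feature of any provider: `complete_with_fallback` is a hypothetical helper, and `complete` stands for any function with the same `(prompt, model)` shape as the `get_completion` wrapper.

```python
from typing import Callable, Sequence, Tuple


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    complete: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, text) from the first success.

    `complete(prompt, model)` is supplied by the caller. Any exception from it
    is treated as "this model is having a bad day" and the next model is tried.
    """
    last_error = None
    for model in models:
        try:
            return model, complete(prompt, model)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    raise RuntimeError(f"All models failed: {list(models)}") from last_error
```

A call might look like `complete_with_fallback(prompt, ["deepseek-v4-flash", "Qwen/Qwen3-32B"], lambda p, m: get_completion(p, model=m))`. In real use you would probably catch only the client's network and rate-limit errors rather than every exception.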
