
Posted on Jun 21

Why I Migrated From GPT-4o to DeepSeek — A Backend Engineer's Notes

#webdev #programming #python #api

Six months ago, my monthly OpenAI bill crossed four figures and I finally snapped. Not because the cost was unbearable in absolute terms, but because I had a sneaking suspicion I was overpaying for marginal quality gains. So I did what any sane backend engineer would do: I instrumented my service to log token usage by endpoint, spun up parallel calls to every major Chinese model, and started comparing numbers like my paycheck depended on it. Spoiler — it kind of did.
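The per-endpoint logging mentioned above can be quite small. What follows is a minimal sketch of that kind of instrumentation, not the actual production code: `UsageLedger` and its method names are hypothetical, and the only thing it assumes is that an OpenAI-style chat completion response carries `usage.prompt_tokens` and `usage.completion_tokens`.

```python
from collections import defaultdict
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0


class UsageLedger:
    """Accumulates token counts per (endpoint, model) pair. Hypothetical helper."""

    def __init__(self) -> None:
        self._totals = defaultdict(UsageTotals)

    def record(self, endpoint: str, model: str, usage) -> None:
        # `usage` is the usage object taken from a chat completion response.
        entry = self._totals[(endpoint, model)]
        entry.calls += 1
        entry.prompt_tokens += usage.prompt_tokens
        entry.completion_tokens += usage.completion_tokens

    def totals(self) -> dict:
        return dict(self._totals)


# Stand-in for `response.usage`, so the sketch runs without a network call.
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=18)

ledger = UsageLedger()
ledger.record("/tickets/classify", "gpt-4o", fake_usage)
ledger.record("/tickets/classify", "gpt-4o", fake_usage)
print(ledger.totals())
```

In a real service you would call `ledger.record(...)` right after each completion returns and flush the totals to whatever metrics store you already use.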

This is the story of what I found when I actually ran Chinese AI models (DeepSeek, Qwen, Kimi, GLM) head-to-head against the US incumbents (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) on a real production workload. Not a synthetic benchmark, not a vibes-based Twitter thread — actual requests flowing through my service. Fwiw, the results were not what I expected.

The Pricing Problem Nobody Wants to Talk About

Let's start with the part CFOs care about. The price gap between US and Chinese models in 2026 isn't a rounding error — it's a yawning chasm. Here's what I'm currently paying (or would pay) per million tokens:

Model	Origin	Input $/M	Output $/M	Output-price multiplier vs DeepSeek V4 Flash
DeepSeek V4 Flash	🇨🇳	$0.18	$0.25	1× (baseline)
Qwen3-32B	🇨🇳	$0.18	$0.28	1.1×
GPT-4o-mini	🇺🇸	$0.15	$0.60	2.4×
Kimi K2.5	🇨🇳	$0.59	$3.00	12×
GLM-5	🇨🇳	$0.73	$1.92	7.7×
Gemini 1.5 Pro	🇺🇸	$1.25	$5.00	20×
GPT-4o	🇺🇸	$2.50	$10.00	40×
Claude 3.5 Sonnet	🇺🇸	$3.00	$15.00	60×

Sixty times. Let that marinate. Claude 3.5 Sonnet's output price is 60× DeepSeek V4 Flash's (the input-price gap is smaller, about 17×). For my workload, heavy on short-to-medium classification and extraction calls, that's the difference between roughly $40/month and $2,400/month. Same corpus, same prompts, same downstream business logic.
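If you want to run the same arithmetic on your own traffic, here is a rough sketch. The prices are copied from the table above; the function names and the token volumes in the usage lines are made up for illustration. Note that the blended multiplier depends on your input/output mix: it only reaches the full 60× when output tokens dominate.

```python
# (input $/M tokens, output $/M tokens), copied from the pricing table above.
PRICES = {
    "deepseek-v4-flash": (0.18, 0.25),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a month's worth of tokens on one model."""
    input_price, output_price = PRICES[model]
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price


def blended_multiplier(model: str, baseline: str,
                       input_tokens: int, output_tokens: int) -> float:
    """How many times more `model` costs than `baseline` for the same traffic."""
    return (monthly_cost(model, input_tokens, output_tokens)
            / monthly_cost(baseline, input_tokens, output_tokens))


# Hypothetical traffic mix: 50M input tokens and 10M output tokens per month.
for name in PRICES:
    print(name, round(monthly_cost(name, 50_000_000, 10_000_000), 2))
```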

The knee-jerk reaction is "yeah but you get what you pay for." Does that hold up? Let me show you the numbers.

Benchmark Numbers, Because Vibes Don't Ship to Production

I pulled community-average scores for the three categories I care about as a backend engineer: general reasoning (MMLU-style), code generation (HumanEval), and Chinese-language performance (C-Eval). These are approximate — your mileage will absolutely vary based on prompt format, temperature, and whether you remembered to escape your JSON properly. Imo, they paint a clear picture regardless.

General Reasoning

Model	MMLU-style Score	Output $/M
Claude 3.5 Sonnet	89.0	$15.00
GPT-4o	88.7	$10.00
Qwen3.5-397B	87.5	$2.34
Kimi K2.5	87.0	$3.00
GLM-5	86.0	$1.92
DeepSeek V4 Flash	85.5	$0.25

The spread between the best and worst here is about 3.5 points. That's not nothing, but it's also not 60× of anything. My read is that these models are converging on the same training-data-plus-RLHF plateau, and that the remaining differences come down to fine-tuning specifics rather than fundamental capability gaps.
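To put the two gaps side by side, here is a quick calculation over the table above. The scores and prices are the table's own; the variable names are just for illustration.

```python
# (MMLU-style score, output $/M), copied from the General Reasoning table.
REASONING = {
    "Claude 3.5 Sonnet": (89.0, 15.00),
    "GPT-4o": (88.7, 10.00),
    "Qwen3.5-397B": (87.5, 2.34),
    "Kimi K2.5": (87.0, 3.00),
    "GLM-5": (86.0, 1.92),
    "DeepSeek V4 Flash": (85.5, 0.25),
}

scores = [score for score, _ in REASONING.values()]
prices = [price for _, price in REASONING.values()]

score_spread = max(scores) - min(scores)      # absolute gap in points
relative_gap = score_spread / max(scores)     # gap as a fraction of the top score
price_ratio = max(prices) / min(prices)       # most expensive vs cheapest output

print(f"score spread: {score_spread:.1f} points ({relative_gap:.1%} of the top score)")
print(f"output price ratio: {price_ratio:.0f}x")
```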

Code Generation (HumanEval)

Model	Score	Output $/M
Claude 3.5 Sonnet	93.0	$15.00
GPT-4o	92.5	$10.00
DeepSeek V4 Flash	92.0	$0.25
Qwen3-Coder-30B	91.5	$0.35
DeepSeek Coder	91.0	$0.25

This is the section that made me audibly laugh when I first saw it. DeepSeek V4 Flash scores within one point of GPT-4o on HumanEval while charging 40× less for output tokens. And the specialized DeepSeek Coder variant — built specifically for this task — is a hair behind at 91.0 for the same $0.25/M. If you're not using these for code-adjacent workloads, you're leaving real money on the table.

Chinese Language (C-Eval)

Model	Score	Output $/M
GLM-5	91.0	$1.92
Kimi K2.5	90.5	$3.00
Qwen3-32B	89.0	$0.28
GPT-4o	88.5	$10.00
DeepSeek V4 Flash	88.0	$0.25

Shocking absolutely no one, models trained on Chinese corpora perform better on Chinese-language evaluations. GLM-5 and Kimi K2.5 top this list, with Qwen3-32B punching far above its weight at $0.28/M. Even DeepSeek V4 Flash, which is positioned as a generalist, lands within half a point of GPT-4o on C-Eval, for 40× less money.

The Real Moat: Access, Not Quality

Here's where I have to get real for a second. Picking Chinese models based on benchmarks alone is easy. Actually deploying them? That's where the friction lives. The obstacles have little to do with model quality; they're mostly commercial, regulatory, and operational:

Concern	US Models	Chinese Direct	Global API
Payment	Credit card ✅	WeChat/Alipay ❌	PayPal + cards ✅
Signup	Email ✅	Chinese phone # ❌	Email ✅
Wire format	OpenAI-compatible ✅	Custom per provider ❌	OpenAI-compatible ✅
Geo-restrictions	None ✅	Often blocked ❌	None ✅
Docs language	English ✅	Mostly Chinese ❌	English ✅
Support	English ✅	Chinese ❌	Both ✅
Currency	USD ✅	CNY only ❌	USD ✅

The primary barrier to Chinese models in 2026 isn't model quality; for my workloads that gap is small enough to stop worrying about. It's the sheer operational overhead of getting an account, getting verified, getting paid, and then dealing with N different SDK quirks from N different providers. Going direct, the providers don't all speak the same wire format, which means you'd need to maintain N client implementations.

That's why I ended up routing everything through Global API — it gives me OpenAI-compatible endpoints, USD billing, and PayPal support, which means I can A/B test providers without touching my application code.

Code Example: The Drop-In Replacement

Here's the beautiful thing about OpenAI-compatible APIs. Switching providers is literally a one-line config change in most codebases. Here's a simplified version of what my service looks like:

import json
import os

from openai import OpenAI

# Any OpenAI-compatible endpoint works here; only base_url and the key change.
client = OpenAI(
    api_key=os.getenv("GLOBAL_API_KEY"),
    base_url="https://global-apis.com/v1",
)

def classify_ticket(text: str) -> dict:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",  # swap to gpt-4o, claude-3.5-sonnet, etc.
        messages=[
            {"role": "system", "content": "Classify the support ticket. Return JSON."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    # message.content is a JSON string; parse it so the return type matches the annotation.
    return json.loads(response.choices[0].message.content)

I run the exact same code path against gpt-4o, deepseek-v4-flash, qwen3-32b, kimi-k2.5, and glm-5 — the only thing that changes is the model string. This is what proper API design looks like, and frankly, the OpenAI spec has become the de facto standard (see also: every other provider scrambling to clone it). If you're not exploiting that portability, you're working too hard.
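Running one prompt across several model strings can be sketched roughly like this. `compare_models` is a hypothetical helper, not part of the OpenAI SDK; it assumes a client whose `chat.completions.create` behaves like the one above and returns a non-streaming response.

```python
import time


def compare_models(client, models, messages):
    """Send the same messages to each model and collect output plus basic stats."""
    results = {}
    for model in models:
        started = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        # Some proxies may omit usage, so read it defensively.
        usage = getattr(response, "usage", None)
        results[model] = {
            "content": response.choices[0].message.content,
            "latency_s": time.perf_counter() - started,
            "completion_tokens": getattr(usage, "completion_tokens", None),
        }
    return results
```

Usage would look like `compare_models(client, ["gpt-4o", "deepseek-v4-flash"], messages)`, after which you can diff the `content` fields or feed them to whatever grading you trust.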

Head-to-Head: The Matchups That Mattered for Me

I won't bore you with every possible pairing. Here are the three that actually moved the needle in my workload.

DeepSeek V4 Flash vs GPT-4o

Dimension	V4 Flash	GPT-4o	Winner
Output cost	$0.25/M	$10.00/M	V4 Flash (40× cheaper)
General quality	B+	A	GPT-4o (small margin)
Code	A	A	Tie
Throughput	~60 tok/s	~50 tok/s	V4 Flash
Context window	128K	128K	Tie
Vision input	❌	✅	GPT-4o

My verdict: V4 Flash for everything except image-bearing requests. The quality delta is real but small — maybe 3-5% on my classification tasks. The cost delta is not small. If you need vision, pay the OpenAI tax and route through the same Global API proxy; otherwise, I don't see a defensible reason to default to GPT-4o in 2026.
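The "everything except image-bearing requests" rule is easy to encode. A minimal sketch, assuming OpenAI-style messages where an image arrives as a content part with `"type": "image_url"`; `has_image` and `pick_model` are hypothetical names.

```python
def has_image(messages) -> bool:
    """True if any message carries an image content part."""
    for message in messages:
        content = message.get("content")
        # Plain-text messages use a string; multimodal ones use a list of parts.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_model(messages,
               text_model: str = "deepseek-v4-flash",
               vision_model: str = "gpt-4o") -> str:
    """Route image-bearing requests to the vision model, everything else to the cheap one."""
    return vision_model if has_image(messages) else text_model
```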

Qwen3-32B vs GPT-4o-mini

Dimension	Qwen3-32B	GPT-4o-mini	Winner
Output cost	$0.28/M	$0.60/M	Qwen (2.1× cheaper)
General quality	A-	B+	Qwen
Code	A-	B+	Qwen
Chinese	A	B	Qwen

My verdict: Qwen wins on every axis I tested. The pricing is close, but the quality gap isn't — Qwen3-32B consistently outperformed GPT-4o-mini on my extraction and rewriting tasks. If you're still defaulting to -mini for cost reasons, you should probably stop. The savings are an illusion once you account for retries and quality issues.
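The retry point can be made concrete with a back-of-the-envelope model. Assume each attempt succeeds independently with probability `success_rate` and you retry until one succeeds; the expected number of attempts is then `1 / success_rate`, so the effective price scales by the same factor. The success rates in the usage lines are invented for illustration, not measurements from my service.

```python
def effective_cost(price_per_m: float, success_rate: float) -> float:
    """Expected $/M tokens once retries are included (retry-until-success model)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_m / success_rate


# Hypothetical success rates; plug in your own measured numbers.
print(effective_cost(0.60, 0.80))  # GPT-4o-mini output price at 80% usable responses
print(effective_cost(0.28, 0.95))  # Qwen3-32B output price at 95% usable responses
```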

Kimi K2.5 vs Claude 3.5 Sonnet

Dimension	K2.5	Claude 3.5 Sonnet	Winner
Output cost	$3.00/M	$15.00/M	K2.5 (5× cheaper)
Reasoning	A+	A+	Tie (essentially)
Chinese	A+	B	K2.5
Long context	200K	200K	Tie
Tool use	A	A+	Claude (small edge)

My verdict: This was the hardest call. Claude 3.5 Sonnet genuinely has the best tool-use behavior I've seen — fewer hallucinations, better structured outputs, more reliable function calling. If your product leans heavily on agentic workflows with multiple tool invocations, Claude's edge is real. But for pure reasoning, K2.5 ties it at 1/5 the price, and beats it outright on Chinese. Honestly, the right answer here might be "use K2.5 for the bulk path, fall back to Claude for tool-heavy flows" — which is exactly what I'm doing.

Code Example: The Fallback Pattern

Since I brought it up, here's how I implement the tiered routing. It's nothing fancy: a wrapper that picks a primary and a fallback model from a complexity hint, tries the primary, and escalates to the fallback if the call fails:


import logging
import os

from openai import OpenAI

logger = logging.getLogger(__name__)

client = OpenAI(
    api_key=os.getenv("GLOBAL_API_KEY"),
    base_url="https://global-apis.com/v1",
)

def generate_with_fallback(prompt: str, complexity: str = "low") -> str:
    # Route based on request complexity heuristic
    if complexity == "low":
        primary = "deepseek-v4-flash"
        fallback = "gpt-4o"
    elif complexity == "tool_heavy":
        primary = "claude-3.5-sonnet"
        fallback = "kimi-k2.5"
    else:
        primary = "kimi-k2.5"
        fallback = "claude-3.5-sonnet"

    try:
        response = client.chat.completions.create(
            model=primary,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        return response.choices[0].message.content
    except Exception as e:
        # Log the failure, then retry once on the fallback model
        logger.warning("Primary %s failed: %s, escalating to %s", primary, e, fallback)
        response = client.chat.completions.create(
            model=fallback,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        return response.choices[0].message.content
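The wrapper above only escalates when the call raises. If you also want to escalate on low confidence, one option is to have the cheap model return JSON with a self-reported `confidence` field and check it before accepting the answer. This is a sketch under that assumption: the field name, the 0.7 threshold, and `needs_escalation` are all hypothetical, and self-reported confidence is not well calibrated, so treat it as a coarse filter.

```python
import json


def needs_escalation(raw: str, min_confidence: float = 0.7) -> bool:
    """True when a reply is unparseable or self-reports confidence below the threshold."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all): treat as a failed answer.
        return True
    if not isinstance(parsed, dict):
        return True
    confidence = parsed.get("confidence")
    # Reject missing or non-numeric values; bool is excluded because it subclasses int.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True
    return confidence < min_confidence
```

You would call this on the primary model's `message.content` and, when it returns `True`, repeat the request against the fallback model.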
