I Ran DeepSeek V4 and Gemini 2.0 Pro Head-to-Head for a Month

#deepseek #webdev #programming #machinelearning

Six months ago my team inherited a search ranking pipeline that was bleeding money. The previous engineer had hardcoded GPT-4o for every classification step, and at our request volume, the bill was — and I am not exaggerating — more than our database cluster. So I did what any reasonable backend engineer would do: I spent a month rerouting traffic through cheaper models and measuring the damage.

This post is the writeup. Fwiw, the data here is real production traffic from a ranking workload processing roughly 8 million queries per week. The numbers I quote come from logs, not synthetic benchmarks. Imo, that's the only way to actually know what you're paying for.

Why These Two, Specifically

When I started the audit, Global API was already routing us through their unified SDK. That gave me access to all 184 models they expose (yes, really, I counted) at prices ranging from $0.01 to $3.50 per million tokens. The question became: which subset actually deserves my attention?

DeepSeek V4 caught my eye because the Reddit threads were unusually consistent — people reporting solid performance on structured tasks at roughly a tenth of GPT-4o pricing. Gemini 2.0 Pro was on my shortlist because Google's tooling around it is genuinely nice for batch jobs, and we already had GCP credits burning a hole in our finance team's wallet.

The plan was simple: serve 10% of traffic to each candidate, keep GPT-4o as a control, and measure latency, quality (via a sampled human eval), and cost per million classifications. What follows is the cleanup of that mess.
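A split like that can be sketched with deterministic hash bucketing, so the same request always lands in the same arm and the logs stay comparable. This is a minimal sketch under assumptions, not a description of any particular router: the `assign_arm` helper, the arm names, and the idea of hashing a request ID are all hypothetical.

```python
import hashlib

# Hypothetical arm names and percentage shares; the control keeps the remaining 80%.
ARMS = [
    ("deepseek-v4-flash", 10),
    ("gemini-2.0-pro", 10),
    ("gpt-4o-control", 80),
]

def assign_arm(request_id: str) -> str:
    """Map a request ID to an experiment arm deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    upper = 0
    for arm, share in ARMS:
        upper += share
        if bucket < upper:
            return arm
    raise AssertionError("arm shares must sum to 100")
```

Hashing instead of calling a random number generator means a retried request goes back to the same model, which keeps per-arm latency and quality numbers from bleeding into each other.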

The Pricing Reality Check

Before I get into the actual results, here's the raw pricing data. I pulled this from the Global API pricing page the morning I started writing this post:

Model	Input ($/M)	Output ($/M)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Let that sink in for a second. GPT-4o output is $10.00 per million tokens. DeepSeek V4 Flash output is $1.10. That is a roughly 9x difference on output tokens, and the input side ($2.50 versus $0.27) is about 9x as well. If your ranking pipeline is doing any non-trivial generation (and ours was, because we re-rank with natural language rationales), that 9x dominates your entire cost structure.

Now, cheap models are cheap for a reason, and the obvious question is quality. I will get to that. But first, let me show you how I actually wired this up.

Wiring It Up Under the Hood

Global API exposes an OpenAI-compatible endpoint, which means you can use the official openai Python client with almost zero changes. This is one of those rare cases where a vendor's "we're compatible with everything" claim is actually true. The base URL is https://global-apis.com/v1 and you just point your existing client at it.

Here is the minimal setup:

import openai
import os
import time

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def classify(query: str, candidates: list[str]) -> dict:
    """Re-rank candidates for a query using DeepSeek V4 Flash."""
    prompt = (
        f"Query: {query}\n"
        f"Candidates: {candidates}\n"
        "Return JSON with `best_index` and `score` fields."
    )

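    # Time the call itself; time.time() on its own is a timestamp, not a latency.
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "result": response.choices[0].message.content,  # raw JSON string
        "latency_ms": latency_ms,
    }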

Notice the response_format flag. Getting structured JSON back without regex gymnastics is, imo, the single biggest quality-of-life improvement in the OpenAI-compatible ecosystem. It Just Works through Global API's proxy, which is not always the case when you swap base URLs.
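One caveat worth a few lines of code: JSON mode constrains the syntax of the reply, not its schema, so the fields still need checking before they reach the ranker. Here is a minimal validation sketch. The `parse_ranking` helper is hypothetical, and it assumes `best_index` is a zero-based position in the candidate list, which the prompt above does not actually specify.

```python
import json

def parse_ranking(raw: str, num_candidates: int) -> dict:
    """Validate the JSON string returned by the model.

    Raises ValueError if the payload is malformed or out of range.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    index = data.get("best_index")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(index, int) or isinstance(index, bool):
        raise ValueError("best_index must be an integer")
    if not 0 <= index < num_candidates:
        raise ValueError("best_index is out of range")
    if not isinstance(score, (int, float)) or isinstance(score, bool):
        raise ValueError("score must be a number")
    return {"best_index": index, "score": float(score)}
```

Treating a ValueError here as a signal to fall back to the original retrieval order is one reasonable policy; silently trusting the payload is not.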

The Head-to-Head Numbers

After two weeks of real traffic, the picture was clearer than I expected.

Latency. DeepSeek V4 Flash averaged 1.2 seconds end-to-end (including network to Global API and JSON parsing). Gemini 2.0 Pro was closer to 1.6s on the same workload, though it was more consistent — fewer long-tail outliers above 3s. For our re-ranking step, which sits between retrieval and final result rendering, anything under 2s p95 was acceptable. Both models cleared that bar.
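Averages and p95 answer different questions, so it helps to compute both from the same latency log. A minimal sketch using only the standard library; `latency_summary` is a hypothetical helper, and it assumes the latencies are already collected as a list of seconds.

```python
import statistics

def latency_summary(latencies_s: list[float]) -> dict:
    """Return mean, p50 and p95 for a list of latencies in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": cuts[49],
        "p95": cuts[94],
    }
```

A workload where 90% of calls take 1s and 10% take 5s has a mean of 1.4s but a p95 of 5s, which is exactly the kind of long tail an average hides.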

Throughput. When I ramped concurrency up to 50 in-flight requests, DeepSeek V4 Flash held steady at around 320 tokens/sec aggregate throughput. Gemini 2.0 Pro topped out around 280 tokens/sec under the same load. Neither hit rate limits, which suggests we are nowhere near the ceiling either provider sets.

Quality. This is where the human eval gets interesting. I had two annotators score 500 random samples from each model on a 1-5 scale for "would this re-ranking improve the search result?" DeepSeek V4 Flash scored 4.23 on average. Gemini 2.0 Pro scored 4.31. GPT-4o (the control) scored 4.48. That gap between DeepSeek V4 Flash and GPT-4o, about 0.25 on a 5-point scale, is the entire quality difference between the cheap model and the expensive one (Gemini 2.0 Pro trails the control by 0.17). And the cheap model costs roughly a tenth as much.

I will be honest: I did not expect this. I went in assuming I would find some hidden failure mode in the cheap model that would torpedo the cost savings. I did not. The 84.6% average benchmark score that Global API reports for DeepSeek V4 on its internal evals lined up with what I saw in production.

A/B Comparison Script

Because I am fundamentally lazy, I wrote a small harness that hits both models in parallel and prints their outputs, latency, and token counts side by side. Sharing it here in case it saves someone else a few hours:

import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

async def call_model(model: str, prompt: str) -> dict:
    """Call one model and record its output, latency, and token usage."""
    loop = asyncio.get_running_loop()
    start = loop.time()  # monotonic clock of the running event loop
    try:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return {
            "model": model,
            "output": resp.choices[0].message.content,
            "latency_s": loop.time() - start,
            "input_tokens": resp.usage.prompt_tokens,
            "output_tokens": resp.usage.completion_tokens,
        }
    except Exception as e:
        # Record the failure instead of raising, so one model's error
        # does not discard the other model's result.
        return {"model": model, "error": str(e)}

async def compare(prompt: str, models: list[str]):
    tasks = [call_model(m, prompt) for m in models]
    return await asyncio.gather(*tasks)

results = asyncio.run(compare(
    "Rank these products for query 'waterproof hiking boots'...",
    ["deepseek-ai/DeepSeek-V4-Flash", "gemini-2.0-pro"],
))
for r in results:
    print(r)

The AsyncOpenAI client handles connection pooling; the concurrency itself comes from asyncio.gather, which launches every call at once, so cap in-flight requests yourself on larger batches. If you are not using async for this kind of comparison, you are leaving 3-5x throughput on the table. The OpenAI SDK's HTTP client is built on httpx under the hood, and httpx's async connection pool is genuinely good.
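For batches bigger than a two-model comparison, an explicit cap on in-flight requests is worth adding. A minimal sketch with asyncio.Semaphore; `gather_bounded` is a hypothetical helper, and the default of 50 simply mirrors the concurrency level from the throughput test.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def gather_bounded(
    jobs: list[Callable[[], Awaitable[T]]],
    max_in_flight: int = 50,
) -> list[T]:
    """Run zero-argument coroutine factories, at most max_in_flight at a time.

    Results come back in the same order as the input jobs.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:  # wait here once the cap is reached
            return await job()

    return await asyncio.gather(*(run(job) for job in jobs))
```

Usage would look something like `await gather_bounded([lambda p=p: call_model(model, p) for p in prompts])`; the `p=p` default binds each prompt at definition time rather than at call time.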

Cost Math That Actually Matters

Here is the calculation that got my VP to sign off on the migration. Our pipeline processes ~8M queries/week. Each query involves roughly 800 input tokens and 200 output tokens for the re-ranking step.

Model	Weekly Input Cost	Weekly Output Cost	Total
GPT-4o	8M × 800 / 1M × $2.50 = $16,000	8M × 200 / 1M × $10.00 = $16,000	$32,000
DeepSeek V4 Flash	8M × 800 / 1M × $0.27 = $1,728	8M × 200 / 1M × $1.10 = $1,760	$3,488
DeepSeek V4 Pro	8M × 800 / 1M × $0.55 = $3,520	8M × 200 / 1M × $2.20 = $3,520	$7,040

I will let those numbers speak for themselves. The "40-65% cost reduction" figure that the original article cited is, if anything, conservative for ranking workloads specifically. We are seeing closer to 89% reduction by switching from GPT-4o to DeepSeek V4 Flash, at a quality cost of about 0.25 points on our 5-point human eval.

If you are running 100M tokens/day through GPT-4o for tasks that do not require its full reasoning depth, you are, respectfully, lighting money on fire.
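The table arithmetic is easy to rerun with your own volumes. A small sketch that reproduces the weekly totals above; the prices are copied from the pricing table in this post, and `weekly_cost` is a hypothetical helper, so swap in your own query counts and token sizes.

```python
# (input, output) prices in USD per million tokens, from the pricing table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
}

def weekly_cost(
    model: str,
    queries: int = 8_000_000,
    input_tokens: int = 800,
    output_tokens: int = 200,
) -> float:
    """Weekly spend in USD for one model at the given per-query token counts."""
    in_price, out_price = PRICES[model]
    input_cost = queries * input_tokens / 1_000_000 * in_price
    output_cost = queries * output_tokens / 1_000_000 * out_price
    return round(input_cost + output_cost, 2)

if __name__ == "__main__":
    baseline = weekly_cost("GPT-4o")
    for name in PRICES:
        cost = weekly_cost(name)
        saving = 1 - cost / baseline
        print(f"{name}: ${cost:,.2f}/week ({saving:.1%} below GPT-4o)")
```

Note that this ignores cache hits and any prompt-caching discounts, so treat it as an upper bound on spend at a given volume.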

What I Learned the Hard Way

A few practical notes that are not in the marketing material:

1. Caching is not optional. We got a 40% cache hit rate just by deduplicating identical queries within a 1-hour window. At DeepSeek V4 Flash pricing, that is essentially free money. The cost of building the cache layer (Redis with a 1-hour TTL) was recovered in under three days.

2. Streaming is more than UX. When you stream responses, you can start rendering the top result as soon as the first 20-30 tokens come back. Perceived latency drops dramatically. It also lets you fail fast — if the model is generating nonsense in the first 50 tokens, you can cut the connection and fall back.

3. Tier your model usage. This is the big one. We use DeepSeek V4 Flash for the easy 80% of queries (clear intent, short candidates) and DeepSeek V4 Pro for the 20% that need longer context or more nuanced reasoning. Global API also exposes what they call GA-Economy, which sits at roughly half the cost of DeepSeek V4 Flash for very simple classification. We have not adopted it broadly yet, but the 50% cost reduction for simple queries is real.

4. Monitor quality continuously. I built a small eval pipeline that samples 1% of production traffic, runs it through GPT-4o as a "judge," and flags disagreements. The first time our cheap model went off the rails on a new query distribution, this caught it within hours. If you only check quality quarterly, you will have months of degraded results before you notice.

5. Implement fallback paths. RFC 7807 (Problem Details for HTTP APIs