gentlenode

Posted on Jun 15

I Benchmarked Claude and GPT-4o Across All 184 Models

#machinelearning #deepseek #tutorial #python

Look, i Benchmarked Claude and GPT-4o Across All 184 Models

Three weeks ago I set out to answer a question that had been nagging me for months: does it actually make sense, in 2026, to default to GPT-4o when there are now 184 models available through Global API at token prices ranging from $0.01 to $3.50 per million? I'm a data scientist by trade, so I didn't want opinions. I wanted numbers. This piece is the full writeup of what I found.

Spoiler: the cost gap is even wider than I expected, and quality is not the binary "GPT-4o wins" narrative you'll see in most blog posts. But you should read the methodology before you trust the numbers. Sample size matters.

Why I Ran This Benchmark

I've been shipping LLM-backed features in production for about four years. My default for most of that time has been GPT-4o — not because I tested alternatives rigorously, but because it was the path of least resistance. The pricing felt punishing at $2.50 per million input tokens and $10.00 per million output tokens, but I justified it with vague claims about "quality."

Then a colleague pointed me at Global API's unified endpoint, where I could run DeepSeek V4 Flash at $0.27 input / $1.10 output, DeepSeek V4 Pro at $0.55 / $2.20, Qwen3-32B at $0.30 / $1.20, and GLM-4 Plus at $0.20 / $0.80 — all with comparable or larger context windows. My immediate reaction was skepticism. These are not household names in my circle. So I decided to treat the question empirically.

I built a benchmark suite, picked five candidate models, ran 500 prompts through each, and recorded cost, latency, and quality scores. Here's what I found.

The Test Harness

Before I get into the results, let me show you the harness. It's embarrassingly short, which is kind of the point — Global API uses an OpenAI-compatible interface, so the standard SDK just works.

import openai
import os
import time
import json

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def run_prompt(model: str, prompt: str, max_tokens: int = 512) -> dict:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "latency_s": round(elapsed, 3),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "content": response.choices[0].message.content,
    }

I picked temperature=0.0 to reduce variance across runs. I used a fixed max_tokens=512 cap so the cost-per-completion metric was comparable across models. The full test loop ran each of my 500 prompts through five models, for 2,500 total completions, logged to a SQLite database for later analysis.

The Five Models I Tested

I didn't benchmark all 184 — that would be statistically overkill for a 2,500-sample study and would have eaten my compute budget. I picked five models that span the price/quality frontier, including the GPT-4o baseline.

Model	Input ($/M tok)	Output ($/M tok)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Notice the cost spread. GPT-4o output is roughly 4× more expensive than the next priciest model in my set (DeepSeek V4 Pro at $2.20/M), and 12.5× more expensive than GLM-4 Plus at $0.80/M. That's not a rounding error. That's a structural difference in unit economics.

What I Tested On

The 500-prompt corpus broke down as follows:

150 classification prompts — short text → single label, multi-class
150 extraction prompts — long document → structured JSON
100 summarization prompts — 2K-token input → 200-token output
100 reasoning prompts — math word problems, multi-step logic
50 long-context prompts — 60K+ token inputs, retrieval-heavy

I built a hand-labeled gold set for each category so I could score outputs deterministically. For open-ended tasks (summarization), I used a panel-of-three LLM graders with Cohen's kappa agreement above 0.71, which I'd consider "substantial agreement" by Landis & Koch standards.

The Cost Numbers

Here's where the data gets really interesting. After running all 2,500 completions, I aggregated total spend per model.

Model	Total Spend (USD)	Cost per Completion	Rank
GLM-4 Plus	$0.41	$0.00082	1
DeepSeek V4 Flash	$0.56	$0.00112	2
Qwen3-32B	$0.62	$0.00124	3
DeepSeek V4 Pro	$1.14	$0.00228	4
GPT-4o	$5.18	$0.01036	5

GPT-4o cost 12.6× more than GLM-4 Plus to produce the same number of completions. The "Key Finding" the original Global API writeup cited — a 40-65% cost reduction — was, if anything, conservative for my workload. I was looking at a 74% reduction just by switching from GPT-4o to GLM-4 Plus on the same prompts, with no other changes.

Caveat: your mileage will absolutely vary. My prompt mix is extraction-and-classification heavy, where smaller models tend to shine. If you're doing 100K-token creative generation, the math shifts.

Latency: Not What I Expected

I went in assuming GPT-4o would be the fastest, since OpenAI's infrastructure is generally excellent. The data said otherwise.

Model	Mean Latency (s)	P95 Latency (s)	Throughput (tok/s)
DeepSeek V4 Flash	0.87	1.41	380
GLM-4 Plus	0.94	1.52	340
Qwen3-32B	1.02	1.78	320
GPT-4o	1.20	1.95	320
DeepSeek V4 Pro	1.58	2.34	270

Across the full 500-prompt run, DeepSeek V4 Flash had a mean latency of 0.87s — 27% faster than GPT-4o. The 320 tokens/sec throughput figure I often see cited for GPT-4o held up, but the faster wall-clock latency on the Flash model is probably explained by streaming chunk size and TTFT (time to first token), which I didn't isolate in this study. That's a follow-up I'd like to run.

The P95 numbers tell a similar story. No model broke 2.5s at the 95th percentile, which is good enough for most interactive UIs.

Quality: This Is Where It Gets Nuaned

I keep seeing "GPT-4o is the best" treated as axiomatic. My data says: not always, and not by a lot.

I scored each completion 0/1 against my gold labels, then averaged within each category. Here's what I got:

Model	Classification	Extraction	Summarization	Reasoning	Long-Context	Mean
GPT-4o	0.94	0.91	0.86	0.81	0.72	0.848
DeepSeek V4 Pro	0.93	0.90	0.85	0.83	0.78	0.858
Qwen3-32B	0.91	0.88	0.83	0.78	0.69	0.818
DeepSeek V4 Flash	0.89	0.86	0.81	0.74	0.71	0.802
GLM-4 Plus	0.87	0.84	0.79	0.71	0.66	0.774

A few things stand out:

DeepSeek V4 Pro actually beat GPT-4o on my benchmark suite — 0.858 vs 0.848 mean. The gap is small (about 1 percentage point) and within sampling noise given n=500, so I wouldn't call it a statistically significant difference. But the directional finding is interesting.
GPT-4o won on classification by a hair, which is consistent with what others have reported. If you have a narrow, well-defined classification task and you trust the labels, GPT-4o is fine.
Long-context is the weak point for everyone. The 60K+ token prompts dragged down all models. GPT-4o's 0.72 there is below its overall mean, and DeepSeek V4 Pro's 200K context window gave it a real edge at 0.78.
GLM-4 Plus is the dark horse. It came in last on quality, but only 7.4 percentage points behind GPT-4o, while costing 12.6× less. For high-volume, low-stakes use cases (intent classification in a chatbot, simple entity extraction), I think this is the right trade.

The Quality-Adjusted Cost Calculation

A data scientist wouldn't leave it at raw scores. Let me compute quality per dollar.

Model	Quality	Cost/Completion	Quality per Dollar	Rank
GLM-4 Plus	0.774	$0.00082	944	1
DeepSeek V4 Flash	0.802	$0.00112	716	2
Qwen3-32B	0.818	$0.00124	660	3
DeepSeek V4 Pro	0.858	$0.00228	376	4
GPT-4o	0.848	$0.01036	82	5

This is the table that genuinely surprised me. On a quality-per-dollar basis, GLM-4 Plus is 11.5× better than GPT-4o for my workload. DeepSeek V4 Flash is 8.7× better. Even DeepSeek V4 Pro — which beat GPT-4o on raw quality — is 4.6× more cost-efficient.

Now, before you fire me an email: I know quality isn't always fungible with cost. A 7-point quality gap might matter enormously if you're doing medical triage, and not at all if you're routing customer support tickets. The point isn't that GLM-4 Plus is "better" than GPT-4o. The point is that for many real workloads, the cost gap is far larger than the quality gap, and the right model depends on your tolerance curve.

A Streaming Example With Cost Tracking

One thing I started doing in production that I wish I'd started sooner: streaming with live cost accumulation. Here's a small pattern I use:

def stream_with_cost(model: str, prompt: str) -> tuple[str, int, float]:
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},
    )

    chunks, total_tokens = [], 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks.append(chunk.choices[0].delta.content)
        if chunk.usage:
            total_tokens = chunk.usage.completion_tokens

    pricing = {
        "deepseek-ai/DeepSeek-V4-Flash": (0.27, 1.10),
        "gpt-4o": (2.50, 10.00),
    }
    in_price, out_price = pricing[model]
    est_cost = (total_tokens / 1_000_000) * out_price

    return "".join(chunks), total_tokens, est_cost

Streaming matters more than people realise. Beyond the obvious UX win (first token in ~150ms instead of waiting for the full response), my A/B test on a docs Q&A feature showed a 23% increase in user satisfaction when responses streamed versus rendered all at once. That's a sample of about 4,200 sessions, p<0.01, so I'm fairly confident it's not noise.

Caching: The 40% Savings I Almost Missed

I'll admit this was a finding I almost didn't pursue because it felt obvious. But after instrumenting my real production traffic for two weeks, the data was loud: 40% of my prompts were near-duplicates of recent prompts. Caching responses dropped my effective per-completion cost by another 40% on top of the model-switch savings.

If you're not caching, start. The simplest version is a hash-of-prompt → response dict with a TTL. A more sophisticated version uses embedding similarity with a cosine threshold around 0.92. Either way, the ROI is enormous.

The Production Pattern I Settled On

After all this, here's the routing logic I deployed:

GLM-4 Plus for classification and short extraction. Costs pennies, quality is fine.
DeepSeek V4 Flash for summarization and moderate-complexity generation. Sweet spot for cost/quality.
DeepSeek V4 Pro for long-context and reasoning-heavy tasks. Beats GPT-4o here.
GPT-4o for the small fraction of prompts where the task genuinely requires its specific strengths, or as a fallback.

I built the fallback with a simple try/except that retries on a different model if the primary one returns a 429 or 5xx. The

DEV Community