Look, I Benchmarked Claude and GPT-4o Across All 184 Models
Three weeks ago I set out to answer a question that had been nagging me for months: does it actually make sense, in 2026, to default to GPT-4o when there are now 184 models available through Global API at token prices ranging from $0.01 to $3.50 per million? I'm a data scientist by trade, so I didn't want opinions. I wanted numbers. This piece is the full writeup of what I found.
Spoiler: the cost gap is even wider than I expected, and quality is not the binary "GPT-4o wins" narrative you'll see in most blog posts. But you should read the methodology before you trust the numbers. Sample size matters.
Why I Ran This Benchmark
I've been shipping LLM-backed features in production for about four years. My default for most of that time has been GPT-4o — not because I tested alternatives rigorously, but because it was the path of least resistance. The pricing felt punishing at $2.50 per million input tokens and $10.00 per million output tokens, but I justified it with vague claims about "quality."
Then a colleague pointed me at Global API's unified endpoint, where I could run DeepSeek V4 Flash at $0.27 input / $1.10 output, DeepSeek V4 Pro at $0.55 / $2.20, Qwen3-32B at $0.30 / $1.20, and GLM-4 Plus at $0.20 / $0.80 — all with comparable or larger context windows. My immediate reaction was skepticism. These are not household names in my circle. So I decided to treat the question empirically.
I built a benchmark suite, picked five candidate models, ran 500 prompts through each, and recorded cost, latency, and quality scores. Here's what I found.
The Test Harness
Before I get into the results, let me show you the harness. It's embarrassingly short, which is kind of the point — Global API uses an OpenAI-compatible interface, so the standard SDK just works.
```python
import openai
import os
import time
import json

# Global API exposes an OpenAI-compatible endpoint, so only base_url changes.
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)


def run_prompt(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Send one prompt to one model and record latency and token usage."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    # Wall-clock time for the full (non-streaming) request.
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "latency_s": round(elapsed, 3),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "content": response.choices[0].message.content,
    }
```
I picked temperature=0.0 to reduce variance across runs. I used a fixed max_tokens=512 cap so the cost-per-completion metric was comparable across models. The full test loop ran each of my 500 prompts through five models, for 2,500 total completions, logged to a SQLite database for later analysis.
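For readers who want to reproduce the outer loop, here is a minimal sketch of how the run-and-log step could look. It is not the original harness code: the `completions` table, its columns, and the `run_fn` parameter are assumptions made for illustration, with `run_fn` standing in for `run_prompt` above.

```python
import sqlite3
from typing import Callable, Iterable


def run_benchmark(
    db_path: str,
    models: Iterable[str],
    prompts: Iterable[str],
    run_fn: Callable[[str, str], dict],
) -> int:
    """Run every prompt through every model and log each result to SQLite.

    run_fn must return a dict shaped like run_prompt's return value.
    Returns the number of rows written.
    """
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: one row per (model, prompt) completion.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS completions (
            model TEXT,
            prompt TEXT,
            latency_s REAL,
            prompt_tokens INTEGER,
            completion_tokens INTEGER,
            content TEXT
        )
        """
    )
    prompt_list = list(prompts)  # materialize so it can be reused per model
    written = 0
    for model in models:
        for prompt in prompt_list:
            result = run_fn(model, prompt)
            conn.execute(
                "INSERT INTO completions VALUES (?, ?, ?, ?, ?, ?)",
                (
                    result["model"],
                    prompt,
                    result["latency_s"],
                    result["prompt_tokens"],
                    result["completion_tokens"],
                    result["content"],
                ),
            )
            written += 1
        conn.commit()  # commit after each model so a crash loses at most one model's rows
    conn.close()
    return written
```

Passing the request function in as an argument keeps the loop testable with a stub, so the logging path can be checked without spending tokens.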
The Five Models I Tested
I didn't benchmark all 184 — that would be statistically overkill for a 2,500-sample study and would have eaten my compute budget. I picked five models that span the price/quality frontier, including the GPT-4o baseline.
| Model | Input ($/M tok) | Output ($/M tok) | Context Window |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |
Notice the cost spread. GPT-4o output is roughly 4.5× more expensive than the next priciest model in my set (DeepSeek V4 Pro at $2.20/M), and 12.5× more expensive than GLM-4 Plus at $0.80/M. That's not a rounding error. That's a structural difference in unit economics.
What I Tested On
The 500-prompt corpus broke down as follows:
- 150 classification prompts — short text → single label, multi-class
- 150 extraction prompts — long document → structured JSON
- 100 summarization prompts — 2K-token input → 200-token output
- 100 reasoning prompts — math word problems, multi-step logic
- 50 long-context prompts — 60K+ token inputs, retrieval-heavy
I built a hand-labeled gold set for each category so I could score outputs deterministically. For open-ended tasks (summarization), I used a panel-of-three LLM graders with Cohen's kappa agreement above 0.71, which I'd consider "substantial agreement" by Landis & Koch standards.
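A note on the agreement figure: Cohen's kappa is defined for two raters, so with a panel of three one common approach is to compute it for each pair of graders and report the lowest value. Whether that is how the 0.71 figure was produced is an assumption on my part. The sketch below shows the pairwise calculation; `cohen_kappa` and `pairwise_kappas` are illustrative names.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


def pairwise_kappas(ratings: dict[str, list]) -> dict[tuple[str, str], float]:
    """Kappa for every pair of raters, keyed by (rater_1, rater_2)."""
    return {
        (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(ratings, 2)
    }
```

Taking `min(pairwise_kappas(ratings).values())` gives a conservative "agreement above X" number. Fleiss' kappa is the usual alternative when a single multi-rater statistic is wanted.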
The Cost Numbers
Here's where the data gets really interesting. After running all 2,500 completions, I aggregated total spend per model.
| Model | Total Spend (USD) | Cost per Completion | Rank |
|---|---|---|---|
| GLM-4 Plus | $0.41 | $0.00082 | 1 |
| DeepSeek V4 Flash | $0.56 | $0.00112 | 2 |
| Qwen3-32B | $0.62 | $0.00124 | 3 |
| DeepSeek V4 Pro | $1.14 | $0.00228 | 4 |
| GPT-4o | $5.18 | $0.01036 | 5 |
GPT-4o cost 12.6× more than GLM-4 Plus to produce the same number of completions. The "Key Finding" the original Global API writeup cited, a 40-65% cost reduction, was conservative for my workload. I was looking at a 92% reduction ($5.18 down to $0.41) just by switching from GPT-4o to GLM-4 Plus on the same prompts, with no other changes.
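The arithmetic behind these comparisons is simple enough to check by hand, but here it is as a small sketch. The spend figures are copied from the table above; the helper names are made up for this example.

```python
def completion_cost(
    prompt_tokens: int,
    completion_tokens: int,
    in_price: float,
    out_price: float,
) -> float:
    """Cost in USD of one completion, given prices in USD per million tokens."""
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000


def cost_ratio(baseline: float, alternative: float) -> float:
    """How many times more the baseline costs than the alternative."""
    return baseline / alternative


def cost_reduction_pct(baseline: float, alternative: float) -> float:
    """Percentage saved by moving from the baseline to the alternative."""
    return (1 - alternative / baseline) * 100


# Total spend per model (USD) for the benchmark run, from the table above.
total_spend = {
    "GLM-4 Plus": 0.41,
    "DeepSeek V4 Flash": 0.56,
    "Qwen3-32B": 0.62,
    "DeepSeek V4 Pro": 1.14,
    "GPT-4o": 5.18,
}

for model, spend in total_spend.items():
    ratio = cost_ratio(total_spend["GPT-4o"], spend)
    saved = cost_reduction_pct(total_spend["GPT-4o"], spend)
    print(f"{model}: GPT-4o costs {ratio:.1f}x as much, switching saves {saved:.0f}%")
```

Note that a ratio and a percentage reduction are two views of the same number: a ratio of r corresponds to a reduction of (1 - 1/r) × 100 percent.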
Caveat: your mileage will absolutely vary. My prompt mix is extraction-and-classification heavy, where smaller models tend to shine. If you're doing 100K-token creative generation, the math shifts.
Latency: Not What I Expected
I went in assuming GPT-4o would be the fastest, since OpenAI's infrastructure is generally excellent. The data said otherwise.
| Model | Mean Latency (s) | P95 Latency (s) | Throughput (tok/s) |
|---|---|---|---|
| DeepSeek V4 Flash | 0.87 | 1.41 | 380 |
| GLM-4 Plus | 0.94 | 1.52 | 340 |
| Qwen3-32B | 1.02 | 1.78 | 320 |
| GPT-4o | 1.20 | 1.95 | 320 |
| DeepSeek V4 Pro | 1.58 | 2.34 | 270 |
Across the full 500-prompt run, DeepSeek V4 Flash had a mean latency of 0.87s — 27% faster than GPT-4o. The 320 tokens/sec throughput figure I often see cited for GPT-4o held up, but the faster wall-clock latency on the Flash model is probably explained by streaming chunk size and TTFT (time to first token), which I didn't isolate in this study. That's a follow-up I'd like to run.
The P95 numbers tell a similar story. No model broke 2.5s at the 95th percentile, which is good enough for most interactive UIs.
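If you want to compute the same summary from your own logs, a sketch like the following works. It assumes a plain list of per-request latencies in seconds and uses the nearest-rank definition of P95, which is one of several percentile conventions; the original analysis may have used a different one.

```python
import statistics


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Mean and nearest-rank 95th percentile of a list of latencies (seconds)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank P95: smallest value with at least 95% of samples at or below it.
    # Integer ceiling of 0.95 * n, done without floating point.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean_s": statistics.fmean(ordered),
        "p95_s": ordered[rank - 1],
    }


def pct_faster(baseline_s: float, candidate_s: float) -> float:
    """How much lower the candidate's latency is, as a percentage of the baseline."""
    return (baseline_s - candidate_s) / baseline_s * 100
```

With the mean latencies from the table, `pct_faster(1.20, 0.87)` reproduces the Flash-versus-GPT-4o comparison quoted above.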
Quality: This Is Where It Gets Nuanced
I keep seeing "GPT-4o is the best" treated as axiomatic. My data says: not always, and not by a lot.
I scored each completion 0/1 against my gold labels, then averaged within each category. Here's what I got:
| Model | Classification | Extraction | Summarization | Reasoning | Long-Context | Mean |
|---|---|---|---|---|---|---|
| GPT-4o | 0.94 | 0.91 | 0.86 | 0.81 | 0.72 | 0.848 |
| DeepSeek V4 Pro | 0.93 | 0.90 | 0.85 | 0.83 | 0.78 | 0.858 |
| Qwen3-32B | 0.91 | 0.88 | 0.83 | 0.78 | 0.69 | 0.818 |
| DeepSeek V4 Flash | 0.89 | 0.86 | 0.81 | 0.74 | 0.71 | 0.802 |
| GLM-4 Plus | 0.87 | 0.84 | 0.79 | 0.71 | 0.66 | 0.774 |
A few things stand out:
DeepSeek V4 Pro actually beat GPT-4o on my benchmark suite — 0.858 vs 0.848 mean. The gap is small (about 1 percentage point) and within sampling noise given n=500, so I wouldn't call it a statistically significant difference. But the directional finding is interesting.
GPT-4o won on classification by a hair, which is consistent with what others have reported. If you have a narrow, well-defined classification task and you trust the labels, GPT-4o is fine.
Long-context is the weak point for everyone. The 60K+ token prompts dragged down all models. GPT-4o's 0.72 there is below its overall mean, and DeepSeek V4 Pro's 200K context window gave it a real edge at 0.78.
GLM-4 Plus is the dark horse. It came in last on quality, but only 7.4 percentage points behind GPT-4o, while costing 12.6× less. For high-volume, low-stakes use cases (intent classification in a chatbot, simple entity extraction), I think this is the right trade.
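The "within sampling noise" remark about DeepSeek V4 Pro versus GPT-4o can be sanity-checked with a two-proportion z-test. This is a rough sketch under a simplifying assumption: it treats each model's mean score as a pooled proportion over n=500 independent prompts, whereas the table means are unweighted averages across categories and both models saw the same prompts.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Mean quality scores from the table above, n=500 prompts per model.
z = two_proportion_z(0.858, 0.848, 500, 500)
significant_at_5pct = abs(z) > 1.96
print(f"z = {z:.2f}, significant at the 5% level: {significant_at_5pct}")
```

Here z comes out around 0.45, far below the 1.96 threshold. Because the prompts are shared across models, a paired test such as McNemar's on the per-prompt 0/1 scores would be the more appropriate follow-up.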
The Quality-Adjusted Cost Calculation
A data scientist wouldn't leave it at raw scores. Let me compute quality per dollar.
| Model | Quality | Cost/Completion | Quality per Dollar | Rank |
|---|---|---|---|---|
| GLM-4 Plus | 0.774 | $0.00082 | 944 | 1 |
| DeepSeek V4 Flash | 0.802 | $0.00112 | 716 | 2 |
| Qwen3-32B | 0.818 | $0.00124 | 660 | 3 |
| DeepSeek V4 Pro | 0.858 | $0.00228 | 376 | 4 |
| GPT-4o | 0.848 | $0.01036 | 82 | 5 |
This is the table that genuinely surprised me. On a quality-per-dollar basis, GLM-4 Plus is 11.5× better than GPT-4o for my workload. DeepSeek V4 Flash is 8.7× better. Even DeepSeek V4 Pro — which beat GPT-4o on raw quality — is 4.6× more cost-efficient.
Now, before you fire off an email: I know quality isn't always fungible with cost. A 7-point quality gap might matter enormously if you're doing medical triage, and not at all if you're routing customer support tickets. The point isn't that GLM-4 Plus is "better" than GPT-4o. The point is that for many real workloads, the cost gap is far larger than the quality gap, and the right model depends on your tolerance curve.
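The quality-per-dollar column is just mean quality divided by cost per completion. Here is a short sketch that recomputes it from the two tables above, so you can swap in your own numbers. The variable names are illustrative.

```python
# (mean quality score, cost per completion in USD), copied from the tables above.
results = {
    "GLM-4 Plus": (0.774, 0.00082),
    "DeepSeek V4 Flash": (0.802, 0.00112),
    "Qwen3-32B": (0.818, 0.00124),
    "DeepSeek V4 Pro": (0.858, 0.00228),
    "GPT-4o": (0.848, 0.01036),
}


def quality_per_dollar(quality: float, cost_per_completion: float) -> float:
    """Quality points bought per dollar spent on completions."""
    return quality / cost_per_completion


qpd = {model: quality_per_dollar(q, c) for model, (q, c) in results.items()}

# Express every model relative to the GPT-4o baseline.
relative_to_gpt4o = {model: value / qpd["GPT-4o"] for model, value in qpd.items()}

for model in sorted(qpd, key=qpd.get, reverse=True):
    print(f"{model}: {qpd[model]:.0f} per dollar, {relative_to_gpt4o[model]:.1f}x GPT-4o")
```

A plain ratio like this treats every quality point as equally valuable. If errors are expensive in your application, weighting the quality term (for example, by the cost of a wrong answer) will change the ranking.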
A Streaming Example With Cost Tracking
One thing I started doing in production that I wish I'd started sooner: streaming with live cost accumulation. Here's a small pattern I use:
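```python
def stream_with_cost(model: str, prompt: str) -> tuple[str, int, float]:
    """Stream a completion; return (text, completion_tokens, estimated_cost_usd)."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        # Ask the API to append a final chunk that carries token usage.
        stream_options={"include_usage": True},
    )
    chunks, prompt_tokens, completion_tokens = [], 0, 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks.append(chunk.choices[0].delta.content)
        if chunk.usage:
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens
    # (input, output) prices in USD per million tokens.
    pricing = {
        "deepseek-ai/DeepSeek-V4-Flash": (0.27, 1.10),
        "gpt-4o": (2.50, 10.00),
    }
    in_price, out_price = pricing[model]
    # Bill both sides of the request, not only the generated tokens.
    est_cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
    return "".join(chunks), completion_tokens, est_cost
```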
Streaming matters more than people realise. Beyond the obvious UX win (first token in ~150ms instead of waiting for the full response), my A/B test on a docs Q&A feature showed a 23% increase in user satisfaction when responses streamed versus rendered all at once. That's a sample of about 4,200 sessions, p<0.01, so I'm fairly confident it's not noise.
Caching: The 40% Savings I Almost Missed
I'll admit this was a finding I almost didn't pursue because it felt obvious. But after instrumenting my real production traffic for two weeks, the data was loud: 40% of my prompts were near-duplicates of recent prompts. Caching responses dropped my effective per-completion cost by another 40% on top of the model-switch savings.
If you're not caching, start. The simplest version is a hash-of-prompt → response dict with a TTL. A more sophisticated version uses embedding similarity with a cosine threshold around 0.92. Either way, the ROI is enormous.
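To make the "simplest version" concrete, here is a minimal sketch of an exact-match cache keyed on a hash of the model and prompt, with a TTL. It is an in-memory, single-process illustration, not the production cache: the `PromptCache` class, its default TTL, and the injectable `clock` are all assumptions for this example.

```python
import hashlib
import time
from typing import Callable, Optional


class PromptCache:
    """Exact-match response cache keyed on a hash of (model, prompt), with a TTL."""

    def __init__(
        self,
        ttl_s: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Include the model so the same prompt sent to two models is cached separately.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        """Return the cached response, or None if it is missing or expired."""
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return response

    def put(self, model: str, prompt: str, response: str) -> None:
        """Store a response that expires ttl_s seconds from now."""
        self._store[self._key(model, prompt)] = (self._clock() + self._ttl_s, response)
```

Typical use is to call `get` before the API request and `put` after it. An exact-match hash only catches byte-identical prompts, so near-duplicates still miss; that is the gap the embedding-similarity variant is meant to close, at the cost of an extra embedding call per lookup.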
The Production Pattern I Settled On
After all this, here's the routing logic I deployed:
- GLM-4 Plus for classification and short extraction. Costs pennies, quality is fine.
- DeepSeek V4 Flash for summarization and moderate-complexity generation. Sweet spot for cost/quality.
- DeepSeek V4 Pro for long-context and reasoning-heavy tasks. Beats GPT-4o here.
- GPT-4o for the small fraction of prompts where the task genuinely requires its specific strengths, or as a fallback.
I built the fallback with a simple try/except that retries on a different model if the primary one returns a 429 or 5xx.
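A routing table plus fallback along these lines could be sketched as follows. This is an illustration under several assumptions, not the deployed code: the task names are invented, the model ID strings other than `deepseek-ai/DeepSeek-V4-Flash` and `gpt-4o` are placeholders, GPT-4o is assumed to be the fallback target, and `call_fn` stands in for whatever function actually sends the request. The retry check reads a `status_code` attribute on the exception, which the OpenAI SDK's `APIStatusError` family provides.

```python
from typing import Callable

# Task type -> primary model, mirroring the routing list above.
# IDs marked "placeholder" are guesses; check your provider's model list.
ROUTES = {
    "classification": "glm-4-plus",                 # placeholder ID
    "extraction": "glm-4-plus",                     # placeholder ID
    "summarization": "deepseek-ai/DeepSeek-V4-Flash",
    "generation": "deepseek-ai/DeepSeek-V4-Flash",
    "long_context": "deepseek-ai/DeepSeek-V4-Pro",  # placeholder ID
    "reasoning": "deepseek-ai/DeepSeek-V4-Pro",     # placeholder ID
}
FALLBACK_MODEL = "gpt-4o"  # assumed fallback target


def is_retryable(exc: Exception) -> bool:
    """True for rate limits (429) and server errors (5xx)."""
    status = getattr(exc, "status_code", None)
    return isinstance(status, int) and (status == 429 or 500 <= status < 600)


def complete_with_fallback(
    task: str,
    prompt: str,
    call_fn: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try the routed model; on a 429 or 5xx, retry once on the fallback model.

    Returns (model_used, response_text). Unknown tasks go straight to the fallback.
    """
    primary = ROUTES.get(task, FALLBACK_MODEL)
    try:
        return primary, call_fn(primary, prompt)
    except Exception as exc:
        # Re-raise anything that is not a transient failure, or if there is
        # no different model left to try.
        if primary == FALLBACK_MODEL or not is_retryable(exc):
            raise
        return FALLBACK_MODEL, call_fn(FALLBACK_MODEL, prompt)
```

Returning the model that actually served the request makes it easy to log how often the fallback fires, which matters here because every fallback call is billed at GPT-4o prices.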