
gauravdagde

Originally published at preto.ai

GPT-5 vs Claude Sonnet 4: real per-task cost and benchmark comparison for production workloads

You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. Pricing pages give you per-million-token numbers. Benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.

This post is that comparison.

Deprecation note before we go further. Claude Sonnet 4 (claude-sonnet-4-20250514, launched May 22, 2025) is deprecated and retires on June 15, 2026. Anthropic's recommended migration target is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6 because the original Sonnet 4 won't be in production a quarter from now. Numbers below cover both.

TL;DR

  1. GPT-5 is roughly 1.6–2.0x cheaper per task than Sonnet 4 / 4.6 on most workload mixes ($1.25/$10 vs $3/$15 per MTok).
  2. GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet wins on agentic tool-use reliability (Tau-Bench).
  3. The most cost-effective production setup is neither alone — a routing layer that uses GPT-5 nano or Haiku 4.5 for simple work and escalates to GPT-5 or Sonnet 4.6 only when needed.

Pricing reality (April 2026)

| Dimension | GPT-5 (Aug 2025) | Sonnet 4 (deprecating) | Sonnet 4.6 (current) |
|---|---|---|---|
| Input | $1.25 / MTok | $3.00 / MTok | $3.00 / MTok |
| Output | $10.00 / MTok | $15.00 / MTok | $15.00 / MTok |
| Cached input | ~$0.125 / MTok (~90% off) | $0.30 / MTok (cache read) | $0.30 / MTok (cache read) |
| Context window | 400K (GPT-5.4: 2x rate above 272K input) | 200K | 1M (flat pricing) |
| Max output | 128K | 64K | 64K |
| Reasoning model? | Yes — reasoning tokens billed as output; 5 effort levels | Extended thinking; thinking tokens billed as output | Extended + adaptive thinking |
| Batch API | 50% off | 50% off | 50% off |
| Knowledge cutoff | ~Sep/Oct 2024 | ~Mar 2025 | Aug 2025 |

Sources: OpenAI GPT-5 model page, Anthropic pricing, Anthropic models overview.

The headline: GPT-5 is 2.4x cheaper on input and 1.5x cheaper on output. On a typical 4:1 input:output mix, that blends to a 1.6–2.0x cost advantage. Caching changes the picture: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), front-loading cost on the first call.
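A quick sanity check on that blend at list prices, no caching:

```python
# Blended cost at a 4:1 input:output token mix, e.g. 800 in / 200 out.
GPT5 = (1.25, 10.00)    # ($/MTok input, $/MTok output)
SONNET = (3.00, 15.00)

def blended_cost(rates, tokens_in=800, tokens_out=200):
    price_in, price_out = rates
    return (tokens_in * price_in + tokens_out * price_out) / 1e6

print(blended_cost(SONNET) / blended_cost(GPT5))  # ~1.8
```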

The long-context line is where Sonnet 4.6 quietly wins. GPT-5.4 charges 2x the standard rate above 272K input tokens. Sonnet 4.6 has flat pricing across its full 1M token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), Sonnet 4.6 is often cheaper per request despite the higher per-token rate.


Head-to-head benchmarks (launch window)

| Benchmark | GPT-5 | Sonnet 4 | Margin |
|---|---|---|---|
| SWE-bench Verified | 74.9% | 72.7% (80.2% high-compute) | Tight; high-compute Sonnet leads |
| AIME 2025 (no tools) | 94.6% | 70.5% | GPT-5 +24.1pp |
| GPQA Diamond | 88.4% (Pro) | 75.4% | GPT-5 +13.0pp |
| MMMU (multimodal) | 84.2% | 74.4% | GPT-5 +9.8pp |
| Aider Polyglot | 88% | not published | n/a |
| Tau-Bench Retail | not published | 80.5% | Sonnet |
| Tau-Bench Airline | not published | 60.0% | Sonnet |

Sources: OpenAI GPT-5 launch, Anthropic Claude 4 launch.

The pattern: GPT-5 dominates pure reasoning benchmarks (math, science, multimodal). Sonnet 4 holds its own on agentic tool-use where reliability matters more than peak intelligence. Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.

The current generation narrows the gap significantly. SWE-bench Verified: Sonnet 4.6 at 79.6%, GPT-5.4 at ~80%, Opus 4.5/4.6 at 80.8–80.9%. The gap between picking GPT-5.4 and Sonnet 4.6 for SWE work is much smaller than the launch numbers suggest.


Cost per real task

Five workloads at typical sizes, no caching, no batching.
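Every table below falls out of the same arithmetic; a small helper reproduces them if you want to plug in your own token counts:

```python
# Per-task cost at list prices ($/MTok), ignoring reasoning-token overhead.
PRICES = {"GPT-5": (1.25, 10.00), "Sonnet 4 / 4.6": (3.00, 15.00)}
WORKLOADS = {
    "support reply": (200, 150),
    "code review": (4_000, 800),
    "summarization": (3_000, 400),
    "RAG Q&A": (2_500, 250),
    "agent run": (8_000, 3_500),
}

for task, (tokens_in, tokens_out) in WORKLOADS.items():
    for model, (price_in, price_out) in PRICES.items():
        cost = (tokens_in * price_in + tokens_out * price_out) / 1e6
        print(f"{task:14s} {model:15s} ${cost:.6f}")
```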

Customer support reply (200 in / 150 out)

| Model | Cost per reply | At 100K replies/mo |
|---|---|---|
| GPT-5 | $0.001750 | $175 |
| Sonnet 4 / 4.6 | $0.002850 | $285 |

GPT-5 1.6x cheaper. Both handle this well; pick on cost.

Code review of 500-line PR (4,000 in / 800 out)

| Model | Cost per review |
|---|---|
| GPT-5 | $0.013000 |
| Sonnet 4 / 4.6 | $0.024000 |

GPT-5 1.85x cheaper. Sonnet 4.6's SWE-bench parity and agentic tool-use track record make it the conventional choice for code-review agents anyway. The premium buys reliability on the tool-use chain, not raw intelligence.

Document summarization (3,000 in / 400 out)

| Model | Cost per document |
|---|---|
| GPT-5 | $0.007750 |
| Sonnet 4 / 4.6 | $0.015000 |

GPT-5 1.94x cheaper. Quality comparable for sub-200K-token documents. Above 272K tokens, Sonnet 4.6's flat 1M context flips the cost picture — cheaper per long-document call than GPT-5.4.

RAG-enabled Q&A (2,500 in / 250 out)

| Model | Cost per query |
|---|---|
| GPT-5 | $0.005625 |
| Sonnet 4 / 4.6 | $0.011250 |

GPT-5 2.0x cheaper. Both handle RAG well; pick on cost unless you've benchmarked Sonnet 4.6 outperforming on your specific retrieval-grounded answer quality.

Agentic task with 5 tool calls (~8,000 in / ~3,500 out incl. reasoning)

| Model | Cost per run |
|---|---|
| GPT-5 | $0.045000 |
| Sonnet 4 / 4.6 | $0.076500 |

GPT-5 1.7x cheaper per agent run. Sonnet 4 / 4.6's agentic reliability advantage means production agents often pay the premium for fewer retries and tool-use failures. The cost-per-successful-task gap narrows considerably once retry math is included.

⚠️ Reasoning token caveat. Add ~30–60% to GPT-5 output cost at reasoning_effort >= medium. Reasoning tokens are silent and consume your output budget. Sonnet 4.6 extended thinking has the same dynamic. Numbers above assume low effort.


Production gotchas neither pricing page mentions

1. GPT-5 reasoning inflation

reasoning_effort has 5 levels (none/low/medium/high/xhigh). xhigh runs ~3–5x the cost of low because of hidden reasoning token volume. And the output budget (max_output_tokens in the Responses API, max_completion_tokens in Chat Completions) ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see.

```python
# Good — explicit effort controls hidden reasoning cost
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the incident report in five bullet points."

response = client.responses.create(
    model="gpt-5",
    input=prompt,
    reasoning={"effort": "low"},  # explicit, not default
    max_output_tokens=500,        # ceiling includes reasoning tokens
)
```

"Be concise" in the prompt does not control reasoning verbosity. Source.

2. Sonnet output verbosity

Anthropic's own docs note Sonnet's "engaging responses" and recommend prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5's. Cost implication: more output tokens at $15/MTok. Mitigation: explicit length constraints + a max_tokens ceiling.
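A minimal sketch of that mitigation with the Anthropic Python SDK; the model id and the length budget here are illustrative placeholders, not recommendations:

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id; check Anthropic's models page
    max_tokens=300,             # hard ceiling on billed output tokens
    system="Answer in at most three sentences. No preamble, no summary recap.",
    messages=[{"role": "user", "content": "Why did last night's deploy fail?"}],
)
print(message.content[0].text)
```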

3. Long-context pricing flip

Above 272K input tokens, GPT-5.4 charges 2x standard input rate. Sonnet 4.6's 1M context has flat pricing throughout. For workloads regularly using long context — codebases >50K lines, multi-document research synthesis, RAG with large retrieval windows — Sonnet 4.6 can be cheaper per request despite the higher per-token rate.

4. Rate limits

OpenAI GPT-5 (post-Sept 2025 increase): T1 500K TPM, T2 1M, T3 2M, T4 4M, T5 40M. Source.

Anthropic: T1 (after $5 deposit) 50 RPM, T2 1K, T3 2K, T4 4K.

The two aren't directly comparable (TPM vs RPM), but production teams with bursty traffic on Anthropic regularly need tier escalation earlier than equivalent OpenAI workloads.

5. Reliability (Dec 2025 reference)

IsDown's LLM provider report for Dec 2025: Anthropic 20 incidents (7 major), 184.5 hrs total; OpenAI 22 incidents (1 major), 182.7 hrs. Anthropic had fewer but more severe incidents; OpenAI's were more frequent but minor.

6. Recent security incidents

The November 2025 OpenAI Mixpanel breach (API portal customer profiles) and the Anthropic Claude Code internal-files exposure are both public; see the AI Incident DB.

7. Geographic / data residency premiums

Anthropic charges a 1.1x multiplier for US-only inference_geo on Opus 4.6+. Bedrock and Vertex regional endpoints add a ~10% premium for Sonnet 4.5+. Source.


The decision framework

Choose GPT-5 (or GPT-5 mini) when

  • Math-heavy reasoning, science Q&A, technical analysis
  • Structured-output generation at lower cost
  • High-volume workloads where the 1.6–2.0x price gap compounds
  • RAG and document summarization on documents under 272K tokens
  • Customer support replies where per-reply cost matters more than peak quality

Choose Claude Sonnet 4.6 when

  • Agentic workflows with tool-use reliability requirements
  • Software engineering agents (code review, refactoring, multi-file patches)
  • Long-context workloads — flat 1M pricing beats GPT-5's 2x above 272K
  • Writing-heavy tasks where coherent verbose output is preferred
  • Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice

Choose neither alone for production

The architecture that minimizes cost-per-task across a real product is a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, escalating to GPT-5 or Sonnet 4.6 only for requests that need the capability.

Berkeley's RouteLLM benchmarks demonstrate ~85% cost reduction at 95% quality on routable workloads. The setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.
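A toy version of that routing layer to make the shape concrete. The keyword heuristic below stands in for a trained router (RouteLLM uses a learned classifier), and the escalation target could just as easily be Sonnet 4.6 through the Anthropic SDK:

```python
from openai import OpenAI

client = OpenAI()

SIMPLE_PATTERNS = ("order status", "reset password", "refund policy", "opening hours")

def pick_model(prompt: str) -> str:
    # Stand-in for a trained router: short, FAQ-shaped requests go to the cheap tier.
    if len(prompt) < 300 and any(p in prompt.lower() for p in SIMPLE_PATTERNS):
        return "gpt-5-nano"
    return "gpt-5"

def answer(prompt: str) -> str:
    response = client.responses.create(
        model=pick_model(prompt),
        input=prompt,
        reasoning={"effort": "low"},  # keep hidden reasoning cost bounded
    )
    return response.output_text
```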


What the pricing comparison doesn't capture

The 2x price gap between GPT-5 and Sonnet 4.6 is real, but it's not the most consequential variable in your bill. The variables that matter more, in roughly this order:

  1. Whether you're routing at all. A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.
  2. Whether prompt caching is active. Up to 90% off cached input on both providers. The bug that breaks the cache (timestamps in the prefix) is more expensive than picking the more expensive model; see the sketch after this list.
  3. Whether reasoning_effort is set. Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently.
  4. Whether you're on the Batch API for batchable work. Flat 50% off — invisible if traffic is all real-time, enormous if any non-realtime work is in the mix.
  5. Then, finally, the per-token rate of the model you picked.
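On point 2, the cache-breaking bug looks like this in practice. Prefix caching on both providers requires a byte-stable prefix, so the fix is purely about ordering:

```python
from datetime import datetime, timezone

SYSTEM = "You are a support agent for Acme. Follow the policy below...\n"
user_msg = "Where is my order?"

# Bad: a volatile timestamp leads the prompt, so the prefix changes on
# every call and prefix caching never hits.
bad_prompt = f"[{datetime.now(timezone.utc)}] {SYSTEM}{user_msg}"

# Good: static instructions first (stable, cacheable prefix); anything
# volatile goes at the end, after the cacheable portion.
good_prompt = f"{SYSTEM}{user_msg}\n[request time: {datetime.now(timezone.utc)}]"
```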

Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more.

Full breakdown with the per-task cost table, deprecation timeline, and production gotchas: preto.ai/blog/gpt-5-vs-claude-sonnet-4/
