You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. Pricing pages give you per-million-token numbers. Benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.
This post is that comparison.
Deprecation note before we go further. Claude Sonnet 4 (`claude-sonnet-4-20250514`, launched May 22, 2025) is deprecated and retires on June 15, 2026. Anthropic's recommended migration target is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6, because the original Sonnet 4 won't be in production a quarter from now. Numbers below cover both.
TL;DR
- GPT-5 is roughly 1.6–2.0x cheaper per task than Sonnet 4 / 4.6 on most workload mixes ($1.25/$10 vs $3/$15 per MTok).
- GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet wins on agentic tool-use reliability (Tau-Bench).
- The most cost-effective production setup is neither alone — a routing layer that uses GPT-5 nano or Haiku 4.5 for simple work and escalates to GPT-5 or Sonnet 4.6 only when needed.
Pricing reality (April 2026)
| Dimension | GPT-5 (Aug 2025) | Sonnet 4 (deprecating) | Sonnet 4.6 (current) |
|---|---|---|---|
| Input | $1.25 / MTok | $3.00 / MTok | $3.00 / MTok |
| Output | $10.00 / MTok | $15.00 / MTok | $15.00 / MTok |
| Cached input | ~$0.125 / MTok (~90% off) | $0.30 / MTok (cache read) | $0.30 / MTok (cache read) |
| Context window | 400K (2x rate >272K on GPT-5.4) | 200K | 1M (flat pricing) |
| Max output | 128K | 64K | 64K |
| Reasoning model? | Yes — reasoning tokens billed as output; 5 effort levels | Extended thinking; thinking tokens as output | Extended + adaptive thinking |
| Batch API | 50% off | 50% off | 50% off |
| Knowledge cutoff | ~Sep/Oct 2024 | ~Mar 2025 | Aug 2025 |
Sources: OpenAI GPT-5 model page, Anthropic pricing, Anthropic models overview.
The headline: GPT-5 is 2.4x cheaper on input and 1.5x cheaper on output. On a typical 4:1 input:output mix, that blends to a 1.6–2.0x cost advantage. Caching changes the picture: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), front-loading cost onto the first call.
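The cache-write premium is pure arithmetic, so it's worth sketching the breakeven. A minimal sketch using the table's published per-MTok rates (no API calls involved):

```python
# Sketch: when does Anthropic's 1.25x cache-write premium pay off?
# Rates from the pricing table above, in $ per MTok.
BASE_INPUT = 3.00                    # Sonnet 4 / 4.6 standard input
CACHE_WRITE = BASE_INPUT * 1.25      # 5-min TTL write premium
CACHE_READ = 0.30                    # ~90% off on subsequent reads

def cost_with_cache(prefix_mtok: float, reads: int) -> float:
    """First call writes the cache; the next `reads` calls hit it."""
    return prefix_mtok * (CACHE_WRITE + reads * CACHE_READ)

def cost_without_cache(prefix_mtok: float, reads: int) -> float:
    return prefix_mtok * BASE_INPUT * (1 + reads)

# With a 100K-token shared prefix (0.1 MTok), caching already wins
# after a single re-read; with zero re-reads, the write premium loses.
assert cost_with_cache(0.1, 1) < cost_without_cache(0.1, 1)
assert cost_with_cache(0.1, 0) > cost_without_cache(0.1, 0)
```

The takeaway: the write premium only hurts prefixes that are never reused within the TTL.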
The long-context line is where Sonnet 4.6 quietly wins. GPT-5.4 charges 2x the standard rate above 272K input tokens. Sonnet 4.6 has flat pricing across its full 1M token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), Sonnet 4.6 is often cheaper per request despite the higher per-token rate.
Head-to-head benchmarks (launch window)
| Benchmark | GPT-5 | Sonnet 4 | Margin |
|---|---|---|---|
| SWE-bench Verified | 74.9% | 72.7% (80.2% high-compute) | Tight; high-compute Sonnet leads |
| AIME 2025 (no tools) | 94.6% | 70.5% | GPT-5 +24.1pp |
| GPQA Diamond | 88.4% (Pro) | 75.4% | GPT-5 +13.0pp |
| MMMU (multimodal) | 84.2% | 74.4% | GPT-5 +9.8pp |
| Aider Polyglot | 88% | n/a published | — |
| Tau-Bench Retail | n/a published | 80.5% | Sonnet |
| Tau-Bench Airline | n/a published | 60.0% | Sonnet |
Sources: OpenAI GPT-5 launch, Anthropic Claude 4 launch.
The pattern: GPT-5 dominates pure reasoning benchmarks (math, science, multimodal). Sonnet 4 holds its own on agentic tool-use where reliability matters more than peak intelligence. Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.
Current generation narrows significantly. SWE-bench Verified: Sonnet 4.6 79.6%, GPT-5.4 ~80%, Opus 4.5/4.6 80.8–80.9%. The gap between picking GPT-5.4 and Sonnet 4.6 for SWE work is much smaller than the launch numbers suggest.
Cost per real task
Five workloads at typical sizes, no caching, no batching.
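Every number below comes from the same arithmetic. A minimal sketch to reproduce them from the table rates (the dict keys are labels for this calculation, not API model IDs):

```python
# Per-task cost from (input rate, output rate) in $ per MTok.
RATES = {"gpt-5": (1.25, 10.00), "sonnet-4.6": (3.00, 15.00)}

def task_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = RATES[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# Customer support reply: 200 in / 150 out
assert round(task_cost("gpt-5", 200, 150), 6) == 0.001750
assert round(task_cost("sonnet-4.6", 200, 150), 6) == 0.002850
```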
Customer support reply (200 in / 150 out)
| Model | Cost | At 100K replies/mo |
|---|---|---|
| GPT-5 | $0.001750 | $175 |
| Sonnet 4 / 4.6 | $0.002850 | $285 |
GPT-5 1.6x cheaper. Both handle this well; pick on cost.
Code review of 500-line PR (4,000 in / 800 out)
| Model | Cost |
|---|---|
| GPT-5 | $0.013000 |
| Sonnet 4 / 4.6 | $0.024000 |
GPT-5 1.85x cheaper. Sonnet 4.6's SWE-bench parity and agentic tool-use track record make it the conventional choice for code-review agents anyway. The premium buys reliability on the tool-use chain, not raw intelligence.
Document summarization (3,000 in / 400 out)
| Model | Cost |
|---|---|
| GPT-5 | $0.007750 |
| Sonnet 4 / 4.6 | $0.015000 |
GPT-5 1.94x cheaper. Quality comparable for sub-200K-token documents. Above 272K tokens, Sonnet 4.6's flat 1M context flips the cost picture — cheaper per long-document call than GPT-5.4.
RAG-enabled Q&A (2,500 in / 250 out)
| Model | Cost |
|---|---|
| GPT-5 | $0.005625 |
| Sonnet 4 / 4.6 | $0.011250 |
GPT-5 2.0x cheaper. Both handle RAG well; pick on cost unless you've benchmarked Sonnet 4.6 outperforming on your specific retrieval-grounded answer quality.
Agentic task with 5 tool calls (~8,000 in / ~3,500 out incl reasoning)
| Model | Cost |
|---|---|
| GPT-5 | $0.045000 |
| Sonnet 4 / 4.6 | $0.076500 |
GPT-5 1.7x cheaper per agent run. Sonnet 4 / 4.6's agentic reliability advantage means production agents often pay the premium for fewer retries and tool-use failures. Cost-per-successful-task ratio narrows considerably once retry math is included.
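The retry math is simple enough to sketch. The success rates below are illustrative placeholders, not benchmark numbers; substitute your own measured rates:

```python
# Sketch: cost per *successful* agent run. Assuming independent attempts,
# the expected number of tries for one success is 1 / p (geometric).
def cost_per_success(cost_per_run: float, success_rate: float) -> float:
    return cost_per_run / success_rate

# Illustrative success rates (assumptions, not measured numbers):
gpt5 = cost_per_success(0.045, 0.90)
sonnet = cost_per_success(0.0765, 0.96)

# The 1.7x sticker-price gap narrows once retries are priced in:
assert sonnet / gpt5 < 0.0765 / 0.045
```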
⚠️ Reasoning token caveat. Add ~30–60% to GPT-5 output cost at `reasoning_effort >= medium`. Reasoning tokens are silent and consume your output budget. Sonnet 4.6 extended thinking has the same dynamic. Numbers above assume `low` effort.
Production gotchas neither pricing page mentions
1. GPT-5 reasoning inflation
`reasoning_effort` has 5 levels (none/low/medium/high/xhigh). `xhigh` runs ~3–5x the cost of `low` because of hidden reasoning token volume. `max_completion_tokens` ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see.
```python
# Good — explicit; controls hidden reasoning cost
from openai import OpenAI

client = OpenAI()
prompt = "Summarize this PR in three bullets."  # example input

response = client.responses.create(
    model="gpt-5",
    input=prompt,
    reasoning={"effort": "low"},  # explicit, not the default
    max_output_tokens=500,        # ceiling includes reasoning tokens
)
```
"Be concise" in the prompt does not control reasoning verbosity. Source.
2. Sonnet output verbosity
Anthropic's own docs note Sonnet's "engaging responses" and recommend prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5's. Cost implication: more output tokens at $15/MTok. Mitigation: explicit length constraints plus a `max_tokens` ceiling.
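A minimal sketch of that mitigation, assuming the Anthropic Python SDK; the model ID here is a placeholder to verify against Anthropic's current models list:

```python
import os

# Explicit length cap in the system prompt plus a hard max_tokens ceiling
# keeps the output-token bill bounded.
params = dict(
    model="claude-sonnet-4-5",  # assumption: substitute the current Sonnet ID
    max_tokens=300,             # hard ceiling on billed output tokens
    system="Answer in at most 3 sentences. No preamble, no recap.",
    messages=[{"role": "user", "content": "Summarize the incident report."}],
)

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(**params)
    print(response.content[0].text)
```

Note that `max_tokens` alone truncates mid-sentence; the system-prompt constraint is what actually shortens the answer.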
3. Long-context pricing flip
Above 272K input tokens, GPT-5.4 charges 2x standard input rate. Sonnet 4.6's 1M context has flat pricing throughout. For workloads regularly using long context — codebases >50K lines, multi-document research synthesis, RAG with large retrieval windows — Sonnet 4.6 can be cheaper per request despite the higher per-token rate.
4. Rate limits
OpenAI GPT-5 (post-Sept 2025 increase): T1 500K TPM, T2 1M, T3 2M, T4 4M, T5 40M. Source.
Anthropic: T1 (after $5 deposit) 50 RPM, T2 1K, T3 2K, T4 4K.
Not directly comparable (TPM vs RPM), but production teams hitting bursty traffic on Anthropic regularly need tier escalation earlier than equivalent OpenAI workloads.
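Whichever provider you hit limits on first, the standard client-side answer is the same while you wait on a tier bump. A minimal backoff sketch; in real code, catch the SDK's specific rate-limit exception (the 429 error class) rather than bare `Exception`:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry `call` with exponential backoff and jitter on failure."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # narrow to the SDK's RateLimitError in practice
            if attempt == max_retries - 1:
                raise
            # 2^attempt growth, randomized to avoid synchronized retry storms
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```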
5. Reliability (Dec 2025 reference)
IsDown LLM provider report Dec 2025: Anthropic 20 incidents (7 major), 184.5 hrs total. OpenAI 22 incidents (1 major), 182.7 hrs. Anthropic fewer total but more severe; OpenAI more frequent but minor.
6. Recent security incidents
November 2025 OpenAI Mixpanel breach (API portal customer profiles); Anthropic Claude Code internal-files exposure. Both public. AI Incident DB.
7. Geographic / data residency premiums
Anthropic 1.1x multiplier for US-only inference_geo on Opus 4.6+. Bedrock and Vertex regional endpoints add ~10% premium for Sonnet 4.5+. Source.
The decision framework
Choose GPT-5 (or GPT-5 mini) when
- Math-heavy reasoning, science Q&A, technical analysis
- Structured-output generation at lower cost
- High-volume workloads where the 1.6–2.0x price gap compounds
- RAG and document summarization on documents under 272K tokens
- Customer support replies where per-reply cost matters more than peak quality
Choose Claude Sonnet 4.6 when
- Agentic workflows with tool-use reliability requirements
- Software engineering agents (code review, refactoring, multi-file patches)
- Long-context workloads — flat 1M pricing beats GPT-5's 2x above 272K
- Writing-heavy tasks where coherent verbose output is preferred
- Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice
Choose neither alone for production
The architecture that minimizes cost-per-task across a real product is a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, escalating to GPT-5 or Sonnet 4.6 only for requests that need the capability.
Berkeley's RouteLLM benchmarks demonstrate ~85% cost reduction at 95% quality on routable workloads. The setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.
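A minimal sketch of the idea, with placeholder model names and a crude keyword heuristic standing in for RouteLLM's learned router:

```python
# Placeholder model names; substitute your providers' current cheap/frontier IDs.
CHEAP, FRONTIER = "gpt-5-nano", "gpt-5"

def route(prompt: str) -> str:
    """Send obviously hard or long requests to the frontier model."""
    hard_signals = ("refactor", "prove", "multi-step", "debug", "analyze")
    if len(prompt) > 2000 or any(s in prompt.lower() for s in hard_signals):
        return FRONTIER
    return CHEAP

assert route("What are your support hours?") == CHEAP
assert route("Debug this race condition across three modules") == FRONTIER
```

In production you'd replace the keyword check with a small trained classifier, but even a heuristic router captures much of the saving because simple requests dominate most traffic.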
What the pricing comparison doesn't capture
The 2x price gap between GPT-5 and Sonnet 4.6 is real but it's not the most consequential variable in your bill. The variables that matter more, in roughly this order:
- Whether you're routing at all. A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.
- Whether prompt caching is active. Up to 90% off cached input on both providers. A bug that breaks the cache (e.g. a timestamp in the prompt prefix) costs more than picking the pricier model.
- Whether `reasoning_effort` is set. Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently.
- Whether you're on the Batch API for batchable work. Flat 50% off — invisible if traffic is all real-time, enormous if any non-realtime work is in the mix.
- Then, finally, the per-token rate of the model you picked.
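The caching bullet deserves a concrete illustration. Prompt caches on both providers match on an exact prefix, so any volatile value placed before the static system prompt drives the hit rate to zero (prompt contents here are invented for illustration):

```python
from datetime import datetime, timezone

SYSTEM = "You are a support agent for Acme. Follow the refund policy."  # static

def bad_prompt(user_msg: str) -> str:
    # Timestamp first -> prefix differs on every call -> 0% cache hits
    return f"[{datetime.now(timezone.utc).isoformat()}] {SYSTEM}\n{user_msg}"

def good_prompt(user_msg: str) -> str:
    # Static prefix first; volatile data after the cacheable boundary
    return f"{SYSTEM}\n[{datetime.now(timezone.utc).isoformat()}] {user_msg}"

assert good_prompt("hi").startswith(SYSTEM)      # stable, cacheable prefix
assert not bad_prompt("hi").startswith(SYSTEM)   # prefix changes per call
```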
Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more.
Full breakdown with the per-task cost table, deprecation timeline, and production gotchas: preto.ai/blog/gpt-5-vs-claude-sonnet-4/