TL;DR: GPT-5.5 (codename "Spud") is the first from-scratch base model since GPT-4.5. New "o2" training infra → 60% fewer hallucinations, 40% better token efficiency. Wins Terminal-Bench 2.0 (82.7%) for agentic coding. Loses SWE-bench Pro (58.6% vs Claude Opus 4.7's 64.3%) for real-codebase work. Most users won't feel it — and that's the actual story.
The release
OpenAI shipped GPT-5.5 on April 23, 2026 to ChatGPT (Plus/Pro/Business/Enterprise), and on April 24 to the API and Codex. Multiple outlets, including Axios, reported the internal codename: "Spud".
- Released: 2026-04-23 (ChatGPT) / 2026-04-24 (API + Codex)
- Pricing: $5 input / $30 output per 1M tokens (standard); $30 input / $180 output per 1M tokens (Pro) (worked example below)
- Context window: 1,000,000 tokens (API); 400,000 tokens (Codex)
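To make the standard-tier pricing concrete, here is a minimal back-of-the-envelope estimator. The helper name and the example token counts are made up for illustration; the per-million rates are the standard-tier prices listed above.

```python
# Standard-tier list prices from the fact box above (USD per 1M tokens)
INPUT_PER_M = 5.00
OUTPUT_PER_M = 30.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of one request at standard-tier pricing (illustrative helper)."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a 20k-token document summarized into 1k tokens of output
print(f"${estimate_cost(20_000, 1_000):.2f}")  # $0.13
```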
Why "from-scratch" matters
This is the part people are missing. GPT-5.1 through 5.4 all shared the same pretrained weights as GPT-5. The improvements came entirely from post-training iteration (RLHF, SFT, preference modeling). The base model — the engine — was identical.
GPT-5.5 is different. Pretraining was redone from scratch on new infrastructure ("o2"). Sam Altman called this "a specific phase of intelligence development."
Here's the structure:
| Version | Pre-training | Post-training |
|---|---|---|
| GPT-5 | A | A-based |
| GPT-5.1–5.4 | A (same) | iterated |
| GPT-5.5 | B (new) | B-based |
The reason this matters for developers: post-training improvements have a ceiling. They can refine behavior on top of base capability, but they can't break through it. New base = new ceiling.
What "o2" delivered
Two measurable wins, both in OpenAI's release notes:
1. Hallucination reduction: 60% (vs GPT-5)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# long_legal_doc_summary: your multi-page prompt string

# Before (GPT-5): occasional fabrication on long reasoning chains
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": long_legal_doc_summary}],
)
# → ~3-5% factual error rate on multi-page reasoning

# After (GPT-5.5): same workload, ~60% fewer fabrications
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": long_legal_doc_summary}],
)
This is the kind of improvement that matters for legal review, long-form research, and agentic workflows where errors compound.
2. Token efficiency: 40% reduction for equivalent work
# Same task, fewer tokens used
# Translation: your API bill drops ~40% on identical workloads
# (assuming the 40% claim holds across your specific task mix)
If you run anything in production that's API-cost-sensitive — agents, RAG pipelines, batch summarization — this is the headline number for you.
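One easy way to check is to read the usage block the API returns with every response. A minimal sketch, assuming the official OpenAI Python client; `my_prompt` is a placeholder for one representative task from your workload:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# my_prompt: placeholder for one representative task from your workload
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": my_prompt}],
)

# Every response carries its own token accounting
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")
```

Log those numbers across a day of traffic on both models and you have your own answer to the 40% question.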
The benchmark split
Here's where it gets interesting. GPT-5.5 doesn't strictly dominate.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 (agentic coding) | 82.7% | lower | GPT-5.5 |
| SWE-bench Verified | 88.7% | (not directly compared) | — |
| SWE-bench Pro (real codebases) | 58.6% | 64.3% | Claude |
What this tells you:
- Building autonomous agents that complete multi-step coding tasks? GPT-5.5 leads.
- Fixing real bugs in real codebases? Claude Opus 4.7 leads by ~6 percentage points.
Both benchmarks test "coding," but they define it differently. Terminal-Bench 2.0 measures how well an agent finishes self-directed work. SWE-bench Pro measures fixing actual bug reports in real OSS repos. The gap is meaningful.
Why "users won't feel it"
This was the most-quoted reaction post-launch: "It's much better, but most users won't notice the difference."
The honest answer points to Ethan Mollick's Jagged Frontier concept: AI capability doesn't improve uniformly. Improvements concentrate in specific regions of the capability surface. GPT-5.5's gains are concentrated in:
- Long-context reasoning
- Multi-step autonomous execution
- Token-efficient inference
- Agentic tool use
What most consumer-facing usage hits:
- Short questions
- Summaries
- Drafting
- One-shot completions
GPT-5 already crossed the perceptual threshold for those. The new gains live in regions that most ChatGPT users don't push against.
Who actually benefits
If you're a developer reading dev.to, you're probably in at least one of these buckets:
# Set each flag to 1 if it describes you, else 0
is_api_user = 1
builds_agents = 1
works_with_long_documents = 0
uses_codex = 0
runs_high_volume_inference = 1
cost_sensitive_workload = 1

benefit_score = (
    20 * is_api_user +
    25 * builds_agents +
    20 * works_with_long_documents +
    15 * uses_codex +
    10 * runs_high_volume_inference +
    10 * cost_sensitive_workload
)
# score >= 40: you'll feel it
# score < 20: you probably won't
Simon Willison's review captured it well: "fast, effective, highly capable." That's the developer view. The casual ChatGPT user view is closer to "feels about the same."
Practical takeaways
- If you're paying for API: re-run your workload on GPT-5.5 and measure actual token usage (see the sketch after this list). The 40% claim is workload-dependent.
- If you build agents: Terminal-Bench 2.0 leadership is real — try it on multi-step pipelines that GPT-5 used to drop.
- If you fix real codebases: Claude Opus 4.7 still has a meaningful edge on SWE-bench Pro. Use the right tool per job.
- If you do long-document RAG: the hallucination reduction is the biggest unlock. Test on your worst-case docs.
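To turn the first takeaway into numbers, a small harness like the one below is enough. It is a sketch, assuming the official OpenAI Python client; `SAMPLE_PROMPTS` stands in for a handful of representative tasks from your own workload.

```python
from openai import OpenAI

client = OpenAI()

# Stand-in for a few representative tasks from your own workload
SAMPLE_PROMPTS = [
    "Summarize this contract clause by clause: ...",
    "Find and fix the bug in this function: ...",
]

def total_tokens(model: str) -> int:
    """Run every sample prompt through one model and sum reported token usage."""
    total = 0
    for prompt in SAMPLE_PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        total += response.usage.total_tokens
    return total

old, new = total_tokens("gpt-5"), total_tokens("gpt-5.5")
print(f"gpt-5: {old} tokens | gpt-5.5: {new} tokens")
print(f"reduction: {100 * (1 - new / old):.1f}%")
```

Run it on prompts that look like your production traffic, not toy questions; that is where the efficiency claim will or won't show up.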
Resources
- OpenAI announcement
- Simon Willison's review (April 23)
- Axios on the "Spud" codename
- Ethan Mollick on Jagged Frontier
Anyone running production workloads — are you actually seeing the 40% token reduction, or is it task-dependent? Curious what the real-world numbers look like across different use cases.