TL;DR: GPT-5.5 (codename "Spud") is the first from-scratch base model since GPT-4.5. New "o2" training infra → 60% fewer hallucinations, 40% better token efficiency. Wins Terminal-Bench 2.0 (82.7%) for agentic coding. Loses SWE-bench Pro (58.6% vs Claude Opus 4.7's 64.3%) for real-codebase work. Most users won't feel it — and that's the actual story.
The release
OpenAI shipped GPT-5.5 on April 23, 2026 to ChatGPT (Plus/Pro/Business/Enterprise), and on April 24 to the API and Codex. Multiple outlets, including Axios, reported the internal codename: "Spud".
- Released: 2026-04-23 (ChatGPT) / 2026-04-24 (API + Codex)
- Pricing: $5 input / $30 output per 1M tokens (standard); $30 input / $180 output per 1M tokens (Pro) (worked example below)
- Context window: 1,000,000 tokens (API); 400,000 tokens (Codex)
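To make the standard-tier pricing concrete, here is a minimal back-of-the-envelope estimator. The helper name and the example token counts are made up for illustration; the per-million rates are the standard-tier prices listed above.

```python
# Standard-tier list prices from the fact box above (USD per 1M tokens)
INPUT_PER_M = 5.00
OUTPUT_PER_M = 30.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of one request at standard-tier pricing (illustrative helper)."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a 20k-token document summarized into 1k tokens of output
print(f"${estimate_cost(20_000, 1_000):.2f}")  # $0.13
```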
Why "from-scratch" matters
This is the part people are missing. GPT-5.1 through 5.4 all shared the same pretrained weights as GPT-5. The improvements came entirely from post-training iteration (RLHF, SFT, preference modeling). The base model — the engine — was identical.
GPT-5.5 is different. Pretraining was redone from scratch on new infrastructure ("o2"). Sam Altman called this "a specific phase of intelligence development."
Here's the structure:
| Version | Pre-training | Post-training |
|---|---|---|
| GPT-5 | A | A-based |
| GPT-5.1–5.4 | A (same) | iterated |
| GPT-5.5 | B (new) | B-based |
The reason this matters for developers: post-training improvements have a ceiling. They can refine behavior on top of base capability, but they can't break through it. New base = new ceiling.
What "o2" delivered
Two measurable wins, both in OpenAI's release notes:
1. Hallucination reduction: 60% (vs GPT-5)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# long_legal_doc_summary: your multi-page prompt string

# Before (GPT-5): occasional fabrication on long reasoning chains
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": long_legal_doc_summary}],
)
# → ~3-5% factual error rate on multi-page reasoning

# After (GPT-5.5): same workload, ~60% fewer fabrications
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": long_legal_doc_summary}],
)
This is the kind of improvement that matters for legal review, long-form research, and agentic workflows where errors compound.
2. Token efficiency: 40% reduction for equivalent work
# Same task, fewer tokens used
# Translation: your API bill drops ~40% on identical workloads
# (assuming the 40% claim holds across your specific task mix)
If you run anything in production that's API-cost-sensitive — agents, RAG pipelines, batch summarization — this is the headline number for you.
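One easy way to check is to read the usage block the API returns with every response. A minimal sketch, assuming the official OpenAI Python client; `my_prompt` is a placeholder for one representative task from your workload:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# my_prompt: placeholder for one representative task from your workload
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": my_prompt}],
)

# Every response carries its own token accounting
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")
```

Log those numbers across a day of traffic on both models and you have your own answer to the 40% question.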
The benchmark split
Here's where it gets interesting. GPT-5.5 doesn't strictly dominate.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 (agentic coding) | 82.7% | lower | GPT-5.5 |
| SWE-bench Verified | 88.7% | (not directly compared) | — |
| SWE-bench Pro (real codebases) | 58.6% | 64.3% | Claude |
What this tells you:
- Building autonomous agents that complete multi-step coding tasks? GPT-5.5 leads.
- Fixing real bugs in real codebases? Claude Opus 4.7 leads by ~6 percentage points.
Both benchmarks test "coding," but they define it differently. Terminal-Bench 2.0 measures how well an agent finishes self-directed work. SWE-bench Pro measures fixing actual bug reports in real OSS repos. The gap is meaningful.
Why "users won't feel it"
This was the most-quoted reaction post-launch: "It's much better, but most users won't notice the difference."
The honest answer points to Ethan Mollick's Jagged Frontier concept: AI capability doesn't improve uniformly. Improvements concentrate in specific regions of the capability surface. GPT-5.5's gains are concentrated in:
- Long-context reasoning
- Multi-step autonomous execution
- Token-efficient inference
- Agentic tool use
What most consumer-facing usage hits:
- Short questions
- Summaries
- Drafting
- One-shot completions
GPT-5 already crossed the perceptual threshold for those. The new gains live in regions that most ChatGPT users don't push against.
Who actually benefits
If you're a developer reading dev.to, you're probably in at least one of these buckets:
# Set each flag to 1 if it describes you, else 0
is_api_user = 1
builds_agents = 1
works_with_long_documents = 0
uses_codex = 0
runs_high_volume_inference = 1
cost_sensitive_workload = 1

benefit_score = (
    20 * is_api_user +
    25 * builds_agents +
    20 * works_with_long_documents +
    15 * uses_codex +
    10 * runs_high_volume_inference +
    10 * cost_sensitive_workload
)
# score >= 40: you'll feel it
# score < 20: you probably won't
Simon Willison's review captured it well: "fast, effective, highly capable." That's the developer view. The casual ChatGPT user view is closer to "feels about the same."
Practical takeaways
- If you're paying for API: re-run your workload on GPT-5.5 and measure actual token usage (see the sketch after this list). The 40% claim is workload-dependent.
- If you build agents: Terminal-Bench 2.0 leadership is real — try it on multi-step pipelines that GPT-5 used to drop.
- If you fix real codebases: Claude Opus 4.7 still has a meaningful edge on SWE-bench Pro. Use the right tool per job.
- If you do long-document RAG: the hallucination reduction is the biggest unlock. Test on your worst-case docs.
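To turn the first takeaway into numbers, a small harness like the one below is enough. It is a sketch, assuming the official OpenAI Python client; `SAMPLE_PROMPTS` stands in for a handful of representative tasks from your own workload.

```python
from openai import OpenAI

client = OpenAI()

# Stand-in for a few representative tasks from your own workload
SAMPLE_PROMPTS = [
    "Summarize this contract clause by clause: ...",
    "Find and fix the bug in this function: ...",
]

def total_tokens(model: str) -> int:
    """Run every sample prompt through one model and sum reported token usage."""
    total = 0
    for prompt in SAMPLE_PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        total += response.usage.total_tokens
    return total

old, new = total_tokens("gpt-5"), total_tokens("gpt-5.5")
print(f"gpt-5: {old} tokens | gpt-5.5: {new} tokens")
print(f"reduction: {100 * (1 - new / old):.1f}%")
```

Run it on prompts that look like your production traffic, not toy questions; that is where the efficiency claim will or won't show up.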
Resources
- OpenAI announcement
- Simon Willison's review (April 23)
- Axios on the "Spud" codename
- Ethan Mollick on Jagged Frontier
Anyone running production workloads — are you actually seeing the 40% token reduction, or is it task-dependent? Curious what the real-world numbers look like across different use cases.