Yesterday (April 23, 2026) OpenAI released GPT-5.5. Codename "Spud."
The surprising part isn't the model itself. GPT-5.4 shipped six weeks ago.
OpenAI's Chief Scientist Jakub Pachocki said during the briefing that the last two years have actually been slow. That one line is the real context for this release.
Six Weeks, and "Spud"
GPT-5.4 came out six weeks ago. The release before that was in December. Before that, November.
The era when model releases were quarterly events is over. They're weekly-to-monthly events now.
The reason this pace is possible is simple. AI is accelerating AI development. According to OpenAI, Codex has 4 million weekly active users and ChatGPT has 9 million paying business users. Real usage feedback at that scale flows straight back into the next training cycle.
Look at Pachocki's statement again.
"The last two years have been surprisingly slow."
He's not saying the present is slow. He's declaring that the future will be faster. GPT-5.5 arrived in six weeks and even that, he's saying, was slow.
Greg Brockman described it in the same briefing as "a new class of intelligence" and "a big step toward agentic and intuitive computing." Strip the marketing and one thing remains: the model refresh cycle is now shorter than most product planning cycles.
The Benchmarks, As Published
Here are the numbers.
Terminal-Bench 2.0 — complex command-line workflows requiring planning, tool use, and iteration:
GPT-5.5 82.7%
GPT-5.4 75.1%
Claude Opus 4.7 69.4%
Gemini 3.1 Pro 68.5%
OSWorld-Verified — how well the model operates a computer autonomously:
GPT-5.5 78.7%
Claude Opus 4.7 78.0%
GPT-5.4 75.0%
SWE-Bench Pro — resolving real GitHub issues in a single pass:
GPT-5.5 58.6%
On Terminal-Bench, GPT-5.5 leads Opus 4.7 by 13.3 points. That's a big jump. But on OSWorld, the gap is 0.7 points. Dominant on one axis, barely ahead on another.
Not "crushing it" — just leading. And the era of ranking models by a single benchmark is already behind us. Computer use has been an area Anthropic invested in heavily, and the more honest reading is that OpenAI just about caught up rather than blew past.
Also: benchmarks are marketing material. OpenAI picked the numbers most favorable to itself. How the model actually feels is something each team has to verify on its own workloads, and a harness as small as the sketch below is enough to start.
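What that verification can look like, as a minimal sketch. It assumes the official `openai` Python SDK; the tasks, the pass/fail checks, and the model names are placeholders to swap for your own ("gpt-5.5" and "gpt-5.4" are the names from this post, not confirmed API identifiers).

```python
# Minimal own-workload eval: run the same tasks against two models
# and count how often each passes your own acceptance check.
# Assumes the official `openai` SDK; model names are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Your real workload goes here: (prompt, pass/fail check) pairs.
TASKS = [
    ("Write a bash one-liner that counts unique IPs in access.log",
     lambda out: "sort" in out and "uniq" in out),
    ("Summarize: 'The deploy failed because the migration timed out.'",
     lambda out: "migration" in out.lower()),
]

def score(model: str) -> float:
    passed = 0
    for prompt, check in TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if check(resp.choices[0].message.content or ""):
            passed += 1
    return passed / len(TASKS)

for model in ("gpt-5.5", "gpt-5.4"):  # hypothetical names from this post
    print(model, f"{score(model):.0%}")
```

Ten real tasks pulled from your own backlog will tell you more than any leaderboard screenshot.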
1M Context and Weird Token Economics
The pricing is interesting.
GPT-5.5 $5 / $30 per 1M tokens (input / output)
GPT-5.5 Pro $30 / $180 per 1M tokens (input / output)
Context window 1M
Batch / Flex half the standard rate
Priority 2.5x the standard rate
It's more expensive than GPT-5.4. But OpenAI claims it does the same work with fewer tokens, and their own post states that GPT-5.5 matches GPT-5.4's per-token latency in production serving.
Translation: the unit price went up, but token consumption goes down enough that the final bill could be similar or lower. What actually hits your wallet depends on your workload. Long-running agent tasks with lots of reasoning might come out ahead. Apps with tons of short one-shot calls might just get more expensive.
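To make that concrete, a back-of-the-envelope sketch. The GPT-5.5 rates come from the table above; the GPT-5.4 rates and all token counts are assumptions for illustration, since the post doesn't publish them.

```python
# Back-of-the-envelope bill comparison for one job on two models.
# GPT-5.5 rates are from this post; the GPT-5.4 rates below are
# ASSUMED for illustration, not published numbers.
PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.4": (3.00, 18.00),  # hypothetical placeholder
}

def bill(model: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# Same agent task; suppose the newer model burns half the tokens.
old = bill("gpt-5.4", tokens_in=800_000, tokens_out=200_000)  # $6.00
new = bill("gpt-5.5", tokens_in=400_000, tokens_out=100_000)  # $5.00
print(f"gpt-5.4: ${old:.2f}   gpt-5.5: ${new:.2f}")
```

With these assumed rates, break-even sits at a 40% token reduction (the inverse of the 5/3 price ratio). Save more tokens than that and the pricier model is the cheaper bill; save less and it isn't.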
And 1M context. OpenAI caught up to territory Anthropic entered earlier. Long document analysis, full-repo understanding, long-running agent sessions — there are real workloads where 1M matters.
The GPT-5.5 Pro pricing deserves a second look: $30 input, $180 output. That's not priced for hobby developers. It's squarely an enterprise workload tier, for agents running all day and complex research pipelines; nothing else makes sense at those rates.
Mythos, Code Red, and the Shape of Competition
The most telling line in the Axios report is this.
Internally at OpenAI, Anthropic's rise was reportedly treated as a "code red" moment, and that moment drove a pivot toward enterprise customers.
In the GPT-5.5 briefing OpenAI explicitly referenced Anthropic's Mythos. Mythos is Anthropic's latest frontier model, announced earlier this month but given only a limited rollout because of its cybersecurity capabilities. OpenAI's reason for bringing it up is clear: the signal they want to send is "we have Mythos-class cyber capability too."
The frontier model race right now isn't tech versus tech. It's enterprise budget versus enterprise budget. Fortune's piece quotes the CIO of Bank of New York, which runs Anthropic and OpenAI side by side across 220+ AI use cases. Customers like that are the ones actually moving the market.
This is the real reason models ship every six weeks. It's not technical necessity. It's that your competitor can ship every six weeks, and the moment you slow down, enterprise contracts start sliding over to whoever didn't.
The interesting part is that this competitive dynamic is a win for users. A better model every six weeks, with pricing pressure arriving alongside it. Just having multiple frontier labs active keeps the whole field healthier.
What's Left Behind the Numbers
So what do you actually do with this?
Building your stack around a single model is an increasingly bad bet. There's a very high probability that a better model ships in six weeks. Might be OpenAI. Might be Anthropic. Might be Google. You can't predict which one.
The investment goes one layer up. Harness design, multi-agent orchestration, tool chains, evaluation pipelines, context engineering. These layers survive model swaps. Better yet, they get better as models get better.
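What "one layer up" can look like in code, as a minimal sketch. The `Model` interface, the `TriageHarness`, and the model id are all illustrative, not any framework's real API; only the `openai` SDK call is real.

```python
# A harness that treats the model as a swappable dependency.
# Names here are illustrative; the point is the shape: prompts,
# steps, and checks live in your code, not in the model choice.
from typing import Protocol

class Model(Protocol):
    """Anything with a complete() method can plug into the harness."""
    def complete(self, prompt: str) -> str: ...

class OpenAIChat:
    """Thin adapter over the official `openai` SDK."""
    def __init__(self, name: str):
        from openai import OpenAI
        self.client, self.name = OpenAI(), name
    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.name,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""

class TriageHarness:
    """The layer that survives model swaps: workflow, not weights."""
    def __init__(self, model: Model):
        self.model = model
    def run(self, ticket: str) -> str:
        plan = self.model.complete(f"Plan the steps to triage:\n{ticket}")
        return self.model.complete(f"Execute this plan:\n{plan}")

# Next month's better model is a one-line swap, not a rewrite:
harness = TriageHarness(OpenAIChat("gpt-5.5"))  # hypothetical model id
```

The prompts, the steps, and the checks all live in the harness. When the next model ships, the constructor argument changes and everything else compounds.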
Releases like GPT-5.5 are no longer news — they're environment. Infrastructure that updates on a schedule. Building your workflow on that assumption is the realistic stance for 2026.
The people who don't get emotionally tossed around by a 1-2 point benchmark swing are the ones who go the distance. If Terminal-Bench 82.7% becomes 85% in a few months, your workflow design mostly still applies.
"Models get replaced. Workflows compound."