Owen

Posted on • Originally published at ofox.ai

GPT-5.5 Released: First Fully Retrained Base Model Since GPT-4.5, 1M Context, $5/$30 Pricing

TL;DR — OpenAI shipped GPT-5.5 on April 23, 2026. It is the first ground-up retrain since GPT-4.5, tops the Artificial Analysis Intelligence Index at 60, hits 82.7% on Terminal-Bench 2.0, and doubles the per-token price to $5/$30. ofox has it live — swap in openai/gpt-5.5 and you are done.

What OpenAI actually shipped

From the official announcement: GPT-5.5 is the next default frontier model in ChatGPT and Codex, and the first OpenAI API model to ship with a 1M-token context window.

More importantly, it is the first fully retrained base model since GPT-4.5. Every GPT-5.x release between them — 5.1, 5.2, 5.3, 5.4 — was a post-training iteration on top of the same base. GPT-5.5 is not. The architecture, pretraining corpus, and agent-oriented objectives have all been reworked.

The positioning is explicit: this is an agent model. OpenAI describes it as a system that "takes a sequence of actions, uses tools, checks its own work, and keeps going until a task is finished" — without needing a human to re-prompt at every handoff.
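
That loop is easy to picture in code. Here is a minimal, stubbed sketch of the act-observe cycle (the fake_model policy and TOOLS registry are illustrative stand-ins for this post, not OpenAI's tool-calling API; in production the policy would be a chat call to openai/gpt-5.5 with tool definitions):

```python
# Minimal agent loop: act, observe, feed the result back, repeat until done.

def fake_model(history):
    """Stub policy: run the tests, then apply a fix, then finish."""
    steps = [("run_tests", "pytest -q"), ("apply_patch", "fix.diff"), ("done", None)]
    return steps[sum(1 for role, _ in history if role == "action")]

TOOLS = {
    "run_tests": lambda arg: f"ran {arg}: 1 failure",
    "apply_patch": lambda arg: f"applied {arg}",
}

def run_agent(max_steps=10):
    history = []
    for _ in range(max_steps):
        action, arg = fake_model(history)
        if action == "done":          # the model decides the task is finished
            return history
        observation = TOOLS[action](arg)   # execute the tool
        history.append(("action", (action, arg)))
        history.append(("observation", observation))
    return history

trace = run_agent()
```

The point of the sketch is the shape: no human re-prompt between steps, the model sees its own tool output and keeps going until it emits a stop signal.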

Two variants

Variant           API model ID          Context              Input / Output (per 1M)
GPT-5.5 Thinking  openai/gpt-5.5       1M (400K in Codex)   $5 / $30
GPT-5.5 Pro       openai/gpt-5.5-pro   1M                   $30 / $180

Thinking is the default — it's what replaces GPT-5.4 in ChatGPT. Pro is the higher-accuracy, higher-latency variant for tasks where you are willing to pay 6× for a few extra percentage points of reliability (more on that below).

Batch and Flex pricing is half the standard rate, same as the rest of the GPT-5.x series.
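
To see what those rates mean per call, a quick back-of-envelope helper (rates copied from the table above; the 20K-in / 2K-out token counts are just an example workload):

```python
# Per-call cost at the published rates (USD per 1M tokens: input, output).
RATES = {
    "openai/gpt-5.5":     (5.00, 30.00),
    "openai/gpt-5.5-pro": (30.00, 180.00),
}

def call_cost(model, input_tokens, output_tokens, batch=False):
    inp, out = RATES[model]
    scale = 0.5 if batch else 1.0  # Batch/Flex run at half the standard rate
    return scale * (input_tokens * inp + output_tokens * out) / 1_000_000

# A 20K-in / 2K-out agent step:
standard = call_cost("openai/gpt-5.5", 20_000, 2_000)      # $0.16
pro      = call_cost("openai/gpt-5.5-pro", 20_000, 2_000)  # $0.96, the 6x premium
```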

The benchmark story

OpenAI published the full comparison grid. The agentic-task numbers are where GPT-5.5 visibly pulls ahead:

Benchmark                         GPT-5.5 Thinking  GPT-5.4 Thinking  Opus 4.7  Gemini 3.1 Pro
Terminal-Bench 2.0                82.7%             75.1%             69.4%     68.5%
GDPval (knowledge work)           84.9%             83.0%             80.3%     67.3%
OSWorld-Verified (computer use)   78.7%             75.0%             78.0%     —
Toolathlon (agentic tool use)     55.6%             54.6%             48.8%     —
BrowseComp (Pro variant)          90.1%             89.3%             79.3%     85.9%
FrontierMath T4 (Pro variant)     39.6%             38.0%             22.9%     16.7%
CyberGym                          81.8%             79.0%             73.1%     —

Terminal-Bench 2.0 is the standout: a 13-point lead over Opus 4.7 on agentic command-line tasks. GDPval — OpenAI's own economic-value benchmark covering 44 knowledge-work occupations — comes in at 84.9%.

On Artificial Analysis's Intelligence Index, GPT-5.5 (xhigh) scores 60, three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview (both at 57) — ending what had been a three-way tie at the top.

Where GPT-5.5 doesn't win

Not every benchmark goes GPT-5.5's way. Third-party aggregators have put together a cleaner cross-vendor view.

The honest picture:

  • SWE-Bench Pro: Opus 4.7 wins at 64.3% vs 58.6%. This is the benchmark that most closely maps to "fix a real GitHub issue in a real codebase."
  • SWE-Bench Verified: Opus 4.7 87.6%; GPT-5.5 was not scored here.
  • MCP-Atlas (scaled tool use): Opus 4.7 77.3% vs 75.3%.
  • Multilingual Q&A (MMMLU): 83.2% — noticeably behind Opus 4.7 (91.5%) and Gemini 3.1 Pro (92.6%).
  • Agentic financial analysis (Finance Agent v1.1): Opus 4.7 64.4% vs 61.5%.

The pattern is consistent. GPT-5.5 leads on planning-and-execution — Terminal-Bench, Toolathlon, computer use, long-horizon coding. Opus 4.7 leads on codebase-resolution — SWE-Bench, MCP-Atlas, multilingual understanding. They are not competing on the same axis.

The hallucination problem

One number worth flagging before you swap GPT-5.4 for GPT-5.5 in production: on Artificial Analysis's AA-Omniscience benchmark, GPT-5.5 (xhigh) hits the highest recorded accuracy at 57% — but also the highest hallucination rate at 86%. By comparison, Opus 4.7 (max) hallucinates at 36% and Gemini 3.1 Pro Preview at 50%.

The interpretation matters. AA-Omniscience measures how often the model confidently asserts something that turns out to be wrong. GPT-5.5 is better at producing the right answer when it knows it, but also more willing to confabulate when it doesn't. For agentic workflows that grade themselves as they run, this is a risk — a confident wrong action is worse than a stop-and-ask.
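
A cheap mitigation while you evaluate: instruct the model to abstain rather than guess, and treat abstention as a stop-and-ask in your pipeline. The prompt and parse rule below are illustrative conventions for this post, not an OpenAI feature:

```python
# Ask for an explicit abstention token, then gate actions on it.
ABSTAIN_PROMPT = (
    "Answer only if you are confident. "
    "If you are not confident, reply with exactly: UNSURE"
)

def gate(answer: str):
    """Map a raw model reply to (proceed?, payload)."""
    text = answer.strip()
    if text.upper() == "UNSURE":
        return False, None   # escalate to a human instead of acting
    return True, text
```

This does not fix the underlying hallucination rate, but it converts some confident wrong actions into recoverable pauses, which is the failure mode that matters for self-grading agents.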

Pricing: the per-token price doubled

GPT-5.4 was $2.50 / $15. GPT-5.5 is $5 / $30. On raw per-token cost that's a 2× jump — the largest single-release price increase OpenAI has made in the GPT-5.x series.

OpenAI's argument is token efficiency: GPT-5.5 "uses significantly fewer tokens to complete the same Codex tasks" than GPT-5.4. Artificial Analysis measured this at roughly a 40% reduction in total tokens per Intelligence Index run, which nets out to about a 20% higher running cost at the top of the index.
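
The arithmetic behind that net figure is simple: double the per-token price times roughly 60% of the tokens lands at about 1.2x the old cost.

```python
# Sanity-check on Artificial Analysis's math: price doubled, tokens down ~40%.
old_cost = 1.0                           # GPT-5.4 cost for a fixed workload
new_cost = old_cost * 2.0 * (1 - 0.40)   # 2x per-token price, 60% of the tokens
# new_cost is ~1.2, i.e. about 20% higher running cost
```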

That still makes it cheaper than Opus 4.7 at equivalent intelligence. AA notes: "GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800)." Gemini 3.1 Pro Preview hits similar index scores at ~$900, so GPT-5.5 is not the cost leader — but it's not the outlier either.

The 1M context window — and Codex's 400K

API developers get 1M tokens in both Responses and Chat Completions. Codex users get 400K. Why the split?

Codex runs many parallel agent sessions with aggressive caching; 400K is a throughput-and-cost decision, not a capability one. If you are feeding a model the full source of a mid-size codebase plus a year of commits plus the docs in a single shot, use the API, not Codex.

1M tokens is real but expensive. At $5 per 1M input, filling the full window costs $5 on input alone — before you get any output back. Long-context is a tool for tasks that actually need it, not a default.
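
A practical consequence: treat context as a budget, not a ceiling to fill. A rough guard like this keeps input spend predictable (chars/4 is a common token-count approximation, not an exact tokenizer; the 200K default is an arbitrary example):

```python
# At $5 per 1M input tokens, context size is the dominant cost lever.
INPUT_USD_PER_TOKEN = 5.00 / 1_000_000

def budget_context(docs, max_tokens=200_000):
    """Keep whole documents in order until the token budget is spent."""
    kept, used = [], 0
    for doc in docs:
        t = len(doc) // 4 + 1            # crude chars/4 token estimate
        if used + t > max_tokens:
            break
        kept.append(doc)
        used += t
    return kept, used * INPUT_USD_PER_TOKEN  # docs to send, estimated input cost

# Filling the whole window, input only: 1_000_000 * INPUT_USD_PER_TOKEN == $5
```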

Community reactions

The developer reaction has been measured, not euphoric. A few recurring threads:

  • "The jump is bigger than 5.4 → 5.5 suggests." Because this is a base-model retrain, the delta is larger than the version bump implies. Multiple developers on Hacker News flagged this as the real story.
  • "Terminal-Bench is not SWE-Bench." Experienced practitioners pushed back on the agent-coding headline. VentureBeat's coverage noted the Terminal-Bench lead is real but narrow against Anthropic's in-development Mythos Preview, which scored 82.0% on the same eval.
  • "The price hike is going to hurt." ChatGPT Plus users got 200 GPT-5.5 messages per week at launch, which several Reddit threads flagged as a material downgrade in effective usage even if the model is smarter per call.

The net read from the community: this is a genuine capability step, especially for agent workflows, but it is not a no-brainer swap. If your workload is short prompts and tight budgets, GPT-5.4 still makes sense.

Access via ofox

ofox has GPT-5.5 live on day one. If you are already using ofox, one line changes:

from openai import OpenAI

client = OpenAI(
    api_key="sk-of-your-api-key",
    base_url="https://api.ofox.ai/v1"
)

response = client.chat.completions.create(
    model="openai/gpt-5.5",
    messages=[{"role": "user", "content": "Refactor this service to use structured tool calls"}]
)
print(response.choices[0].message.content)

For the Pro variant on the same endpoint:

response = client.chat.completions.create(
    model="openai/gpt-5.5-pro",
    messages=[{"role": "user", "content": "Trace this race condition across all consumers"}]
)

No ofox key yet? Sign up at ofox.ai — one key covers GPT-5.5, Claude, Gemini, Kimi, and the rest.

When to upgrade — and when not to

Running GPT-5.4 in production today? Here's the practical split:

Agent workflows with multi-step tool calls, terminal automation, browsing, or computer use — upgrade. The Terminal-Bench and OSWorld gaps are large enough to matter.

Long-context analysis over codebases, filings, or research corpora — upgrade for the 1M window. Just budget for it.

Codebase-resolution tasks that were working well on GPT-5.4 or Opus 4.6 — test before switching. SWE-Bench Pro suggests Opus 4.7 may still be the better pick for this shape of work.

High-volume, latency-sensitive chat at GPT-5.4 price points — stay put. GPT-5.4 isn't going anywhere and the per-call cost is half.

Tasks where hallucination is the failure mode (factual Q&A, citation generation, compliance) — test before switching. The AA-Omniscience numbers are a real signal.
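
If you want that split encoded rather than remembered, a trivial router does it. The task categories are this post's shorthand, and the Opus model ID is a guess at ofox's naming convention, so verify both before shipping:

```python
# Route by task shape, per the tradeoffs above. Categories are informal.
ROUTING = {
    "agent":        "openai/gpt-5.5",             # terminal/browse/computer-use
    "long_context": "openai/gpt-5.5",             # 1M-window analysis
    "swe_resolve":  "anthropic/claude-opus-4.7",  # SWE-Bench-shaped work: test first
    "chat":         "openai/gpt-5.4",             # high-volume, price-sensitive
    "factual":      "openai/gpt-5.4",             # hallucination-sensitive: test first
}

def pick_model(task_kind: str) -> str:
    return ROUTING.get(task_kind, "openai/gpt-5.4")  # conservative default
```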

Originally published on ofox.ai/blog.
