TL;DR — OpenAI shipped GPT-5.5 on April 23 2026. It is the first ground-up retrain since GPT-4.5, tops the Artificial Analysis Intelligence Index at 60, hits 82.7% on Terminal-Bench 2.0, and doubles the per-token price to $5/$30. ofox has it live — swap in openai/gpt-5.5 and you are done.
## What OpenAI actually shipped
From the official announcement: GPT-5.5 is the next default frontier model in ChatGPT and Codex, and the first OpenAI API model to ship with a 1M-token context window.
More importantly, it is the first fully retrained base model since GPT-4.5. Every GPT-5.x release between them — 5.1, 5.2, 5.3, 5.4 — was a post-training iteration on top of the same base. GPT-5.5 is not. The architecture, pretraining corpus, and agent-oriented objectives have all been reworked.
The positioning is explicit: this is an agent model. OpenAI describes it as a system that "takes a sequence of actions, uses tools, checks its own work, and keeps going until a task is finished" — without needing a human to re-prompt at every handoff.
## Two variants
| Variant | API model ID | Context | Input / Output |
|---|---|---|---|
| GPT-5.5 Thinking | openai/gpt-5.5 | 1M (400K in Codex) | $5 / $30 per 1M |
| GPT-5.5 Pro | openai/gpt-5.5-pro | 1M | $30 / $180 per 1M |
Thinking is the default — it's what replaces GPT-5.4 in ChatGPT. Pro is the higher-accuracy, higher-latency variant for tasks where you are willing to pay 6× for a few extra percentage points of reliability (more on that below).
Batch and Flex pricing is half the standard rate, same as the rest of the GPT-5.x series.
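If you're budgeting the switch, the table collapses into a one-function cost estimator. A minimal sketch using the rates above; the `estimate_cost` helper and its structure are our own convention, not part of any SDK:

```python
# Rough cost estimator for the two GPT-5.5 variants, using the rates in the
# table above. USD per 1M tokens: (input, output).
PRICES = {
    "openai/gpt-5.5": (5.00, 30.00),
    "openai/gpt-5.5-pro": (30.00, 180.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  batch: bool = False) -> float:
    """Return the estimated USD cost of one call. Batch/Flex runs at half rate."""
    in_rate, out_rate = PRICES[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return cost / 2 if batch else cost

# A 200K-token input with an 8K-token response on the Thinking variant:
print(f"${estimate_cost('openai/gpt-5.5', 200_000, 8_000):.2f}")  # $1.24
```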
## The benchmark story
OpenAI published the full comparison grid. The agentic-task numbers are where GPT-5.5 visibly pulls ahead:
| Benchmark | GPT-5.5 Thinking | GPT-5.4 Thinking | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | 68.5% |
| GDPval (knowledge work) | 84.9% | 83.0% | 80.3% | 67.3% |
| OSWorld-Verified (computer use) | 78.7% | 75.0% | 78.0% | — |
| Toolathlon (agentic tool use) | 55.6% | 54.6% | — | 48.8% |
| BrowseComp (Pro variant) | 90.1% | 89.3% | 79.3% | 85.9% |
| FrontierMath T4 (Pro variant) | 39.6% | 38.0% | 22.9% | 16.7% |
| CyberGym | 81.8% | 79.0% | 73.1% | — |
Terminal-Bench 2.0 is the standout: a 13-point lead over Opus 4.7 on agentic command-line tasks. GDPval — OpenAI's own economic-value benchmark covering 44 knowledge-work occupations — comes in at 84.9%.
On Artificial Analysis's Intelligence Index, GPT-5.5 (xhigh) scores 60, three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview (both at 57) — ending what had been a three-way tie at the top.
## Where GPT-5.5 doesn't win
Not every benchmark goes GPT-5.5's way. Third-party aggregators have put together a cleaner cross-vendor view, and the honest picture looks like this:
- SWE-Bench Pro: Opus 4.7 wins at 64.3% vs 58.6%. This is the benchmark that most closely maps to "fix a real GitHub issue in a real codebase."
- SWE-Bench Verified: Opus 4.7 87.6%; GPT-5.5 was not scored here.
- MCP-Atlas (scaled tool use): Opus 4.7 77.3% vs 75.3%.
- Multilingual Q&A (MMMLU): 83.2% — noticeably behind Opus 4.7 (91.5%) and Gemini 3.1 Pro (92.6%).
- Agentic financial analysis (Finance Agent v1.1): Opus 4.7 64.4% vs 61.5%.
The pattern is consistent. GPT-5.5 leads on planning-and-execution: Terminal-Bench, Toolathlon, computer use, long-horizon coding. Opus 4.7 leads on repository-scale issue resolution and breadth: SWE-Bench, MCP-Atlas, multilingual understanding. They are not competing on the same axis.
## The hallucination problem
One number worth flagging before you swap GPT-5.4 for GPT-5.5 in production: on Artificial Analysis's AA-Omniscience benchmark, GPT-5.5 (xhigh) hits the highest recorded accuracy at 57% — but also the highest hallucination rate at 86%. By comparison, Opus 4.7 (max) hallucinates at 36% and Gemini 3.1 Pro Preview at 50%.
The interpretation matters. AA-Omniscience measures how often the model confidently asserts something that turns out to be wrong. GPT-5.5 is better at producing the right answer when it knows it, but also more willing to confabulate when it doesn't. For agentic workflows that grade themselves as they run, this is a real risk: a confident wrong action is worse than a stop-and-ask.
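One cheap mitigation is to make stop-and-ask an explicit, machine-checkable option instead of hoping the model volunteers it. A sketch of that pattern, using a client configured the same way as in the Access section below; the NEEDS_REVIEW sentinel and the routing are our own convention, not an API feature:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-of-your-api-key", base_url="https://api.ofox.ai/v1")

# Give the model an explicit escape hatch instead of forcing a guess.
GUARD = ("If you are not confident in a factual claim or a proposed action, "
         "reply with exactly 'NEEDS_REVIEW:' followed by your question, "
         "instead of answering.")

response = client.chat.completions.create(
    model="openai/gpt-5.5",
    messages=[
        {"role": "system", "content": GUARD},
        {"role": "user", "content": "Which consumers read from the orders topic?"},
    ],
)
answer = response.choices[0].message.content
if answer.startswith("NEEDS_REVIEW:"):
    print("Escalating to a human:", answer)  # route to review instead of acting
else:
    print(answer)
```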
## Pricing: the per-token price doubled
GPT-5.4 was $2.50 / $15. GPT-5.5 is $5 / $30. On raw per-token cost that's a 2× jump — the largest single-release price increase OpenAI has made in the GPT-5.x series.
OpenAI's argument is token efficiency: GPT-5.5 "uses significantly fewer tokens to complete the same Codex tasks" than GPT-5.4. Artificial Analysis measured this at roughly a 40% reduction in total tokens per Intelligence Index run, which nets out to about a 20% higher running cost at the top of the index (2× the price × ~0.6× the tokens ≈ 1.2× the cost).
That still makes it cheaper than Opus 4.7 at equivalent intelligence. AA notes: "GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800)." Gemini 3.1 Pro Preview hits similar index scores at ~$900, so GPT-5.5 is not the cost leader — but it's not the outlier either.
## The 1M context window — and Codex's 400K
API developers get 1M tokens in both Responses and Chat Completions. Codex users get 400K. Why the split?
Codex runs many parallel agent sessions with aggressive caching; 400K is a throughput-and-cost decision, not a capability one. If you are feeding a model the full source of a mid-size codebase plus a year of commits plus the docs in a single shot, use the API, not Codex.
1M tokens is real but expensive. At $5 per 1M input, filling the full window costs $5 on input alone — before you get any output back. Long-context is a tool for tasks that actually need it, not a default.
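To sanity-check whether a corpus is worth the window, a rough chars-per-token heuristic is enough. A sketch under the common ~4-characters-per-token assumption; the real tokenizer will differ, and the `./my-service` path is a placeholder:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4            # rough heuristic; actual tokenization varies
INPUT_RATE = 5.00 / 1_000_000  # GPT-5.5 Thinking, USD per input token
WINDOW = 1_000_000             # API context window

def corpus_estimate(root: str, exts=(".py", ".md")) -> tuple[int, float]:
    """Estimate token count and input cost for all matching files under root."""
    chars = sum(len(p.read_text(errors="ignore"))
                for p in Path(root).rglob("*")
                if p.suffix in exts and p.is_file())
    tokens = chars // CHARS_PER_TOKEN
    return tokens, tokens * INPUT_RATE

tokens, cost = corpus_estimate("./my-service")  # placeholder path
print(f"~{tokens:,} tokens, ~${cost:.2f} input cost, "
      f"{'fits' if tokens <= WINDOW else 'does not fit'} in the 1M window")
```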
## Community reactions
The developer reaction has been measured, not euphoric. A few recurring threads:
- "The jump is bigger than 5.4 → 5.5 suggests." Because this is a base-model retrain, the delta is larger than the version bump implies. Multiple developers on Hacker News flagged this as the real story.
- "Terminal-Bench is not SWE-Bench." Experienced practitioners pushed back on the agent-coding headline. VentureBeat's coverage noted the Terminal-Bench lead is real but narrow against Anthropic's in-development Mythos Preview, which scored 82.0% on the same eval.
- "The price hike is going to hurt." Plus users got 200 messages per week of GPT-5.5 at launch, which several Reddit threads flagged as a material downgrade in effective usage even if the model is smarter per call.
The net read from the community: this is a genuine capability step, especially for agent workflows, but it is not a no-brainer swap. If your workload is short prompts and tight budgets, GPT-5.4 still makes sense.
## Access via ofox
ofox has GPT-5.5 live on day one. If you are already using ofox, one line changes:
```python
from openai import OpenAI

# Point the standard OpenAI SDK at the ofox endpoint.
client = OpenAI(
    api_key="sk-of-your-api-key",
    base_url="https://api.ofox.ai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-5.5",
    messages=[{"role": "user", "content": "Refactor this service to use structured tool calls"}],
)
print(response.choices[0].message.content)
```
For the Pro variant on the same endpoint:
```python
response = client.chat.completions.create(
    model="openai/gpt-5.5-pro",
    messages=[{"role": "user", "content": "Trace this race condition across all consumers"}],
)
```
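Since GPT-5.5 is positioned as an agent model, most real usage looks like a tool loop rather than a single call. A minimal sketch of that loop, reusing the `client` from above; the `get_ci_status` tool and its stubbed result are made-up examples, and a production loop would add timeouts and error handling:

```python
import json

# One hypothetical tool; register your real functions here.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ci_status",
        "description": "Fetch the latest CI run status for a branch",
        "parameters": {
            "type": "object",
            "properties": {"branch": {"type": "string"}},
            "required": ["branch"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Is CI green on main? If not, summarize the failure."}]

# Let the model call tools until it produces a final answer, capped at 10 steps.
for _ in range(10):
    response = client.chat.completions.create(
        model="openai/gpt-5.5", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    if not msg.tool_calls:          # no more tool calls: final answer
        print(msg.content)
        break
    messages.append(msg)            # keep the assistant turn in the transcript
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = {"status": "failed", "job": "unit-tests"}  # stubbed tool result
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```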
No ofox key yet? Sign up at ofox.ai — one key covers GPT-5.5, Claude, Gemini, Kimi, and the rest.
## When to upgrade — and when not to
Running GPT-5.4 in production today? Here's the practical split:
- **Agent workflows** with multi-step tool calls, terminal automation, browsing, or computer use: upgrade. The Terminal-Bench and OSWorld gaps are large enough to matter.
- **Long-context analysis** over codebases, filings, or research corpora: upgrade for the 1M window. Just budget for it.
- **Codebase-resolution tasks** that were working well on GPT-5.4 or Opus 4.6: test before switching. SWE-Bench Pro suggests Opus 4.7 may still be the better pick for this shape of work.
- **High-volume, latency-sensitive chat** at GPT-5.4 price points: stay put. GPT-5.4 isn't going anywhere and the per-call cost is half.
- **Tasks where hallucination is the failure mode** (factual Q&A, citation generation, compliance): test before switching. The AA-Omniscience numbers are a real signal.
## Related reading
- Claude Opus 4.7 API Review and Upgrade Guide
- Best AI Model for Coding 2026
- Claude vs GPT vs Gemini: Model Comparison Guide
Originally published on ofox.ai/blog.