정상록

GPT-5.5 Released: What the Marketing Headlines Don't Tell You

OpenAI announced GPT-5.5 on April 23, 2026. The API rolled out one day later on April 24. Four days in, the marketing claims and benchmark hype are everywhere — but the picture is more nuanced than headlines suggest.

This post is a source-checked digest: every claim below is cross-validated against openai.com, developers.openai.com, the OpenAI/Codex GitHub issue tracker, and reporting from CNBC, TechCrunch, Fortune, and Help Net Security.

What Was Actually Announced

  • Announcement: April 23, 2026 (Brockman, Glaese, Chen, Pachocki — not Sam Altman)
  • API release: April 24, 2026 (one day later, separate safeguard process)
  • Model IDs: gpt-5.5, snapshot gpt-5.5-2026-04-23
  • Knowledge cutoff: December 1, 2025
  • ChatGPT availability: Plus, Pro, Business, Enterprise (immediate)
  • Codex availability: Plus, Pro, Business, Enterprise, Edu, Go

Benchmarks: Where GPT-5.5 Is SOTA

All scores below are at reasoning effort xhigh per OpenAI's official tables.

| Benchmark | GPT-5.4 | GPT-5.5 | Delta / Note |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | 75.1% | 82.7% | +7.6pp (SOTA) |
| Expert-SWE (Internal) | 68.5% | 73.1% | +4.6pp |
| GDPval (knowledge work) | 83.0% | 84.9% | +1.9pp; beats Claude Opus 4.7 (80.3%) |
| OSWorld-Verified | 75.0% | 78.7% | +3.7pp; beats Claude (78.0%) |
| Tau2-bench Telecom | 92.8% | 98.0% | +5.2pp; no prompt tuning |
| FrontierMath Tier 4 | 27.1% | 35.4% | +8.3pp; beats Claude (22.9%) |
| ARC-AGI-2 | 73.3% | 85.0% | +11.7pp |
| MRCR v2 8-needle, 512K-1M | 36.6% | 74.0% | +37.4pp (~2x recovery) |

The MRCR v2 long-context recovery is the standout: GPT-5.4 recovered barely more than a third of the needles in the 512K-1M range, while GPT-5.5 retains roughly three-quarters.

Where GPT-5.5 Is NOT #1

This is the part most marketing posts skip. Per OpenAI's own published comparison tables:

| Benchmark | GPT-5.5 | Leader | Margin |
| --- | --- | --- | --- |
| SWE-Bench Pro | 58.6% | Claude Opus 4.7 (64.3%) | -5.7pp |
| GPQA Diamond | 93.6% | Claude Opus 4.7 (94.2%) | -0.6pp |
| Humanity's Last Exam (with tools) | 52.2% | Claude Opus 4.7 (54.7%) | -2.5pp |
| ARC-AGI-1 (Verified) | 95.0% | Gemini 3.1 Pro (98.0%) | -3.0pp |
| BrowseComp | 84.4% | Gemini 3.1 Pro (85.9%) | -1.5pp |

OpenAI itself notes in the announcement that SWE-Bench Pro has potential memorization concerns documented in the literature. Take any single benchmark with appropriate skepticism.

The 1M Context Catch

OpenAI markets GPT-5.5 with a "1M context window"; the exact figure in the developer docs is 1,050,000 tokens. But the window you actually get depends heavily on where you use the model.

| Environment | Context window | Source |
| --- | --- | --- |
| API (gpt-5.5) | 1,050,000 tokens | developers.openai.com |
| Codex (official) | 400,000 tokens | OpenAI announcement |
| Codex (measured) | 258,400 tokens | openai/codex#19319 (bug report) |
| Max output | 128,000 tokens | developers.openai.com |

Users in the GitHub issue are reporting "exceeds the context window" errors at unexpectedly low input sizes. If you're building tooling that depends on the full 1M window, validate the actual environment, not the marketing claim.
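One way to guard against this programmatically is a pre-flight token count against the limit of the specific environment you're targeting. The sketch below is illustrative, not from any SDK: it assumes the o200k_base encoding is a reasonable proxy for GPT-5.5's tokenizer (the actual encoding hasn't been published), and `EFFECTIVE_LIMITS` / `fits_context` are names I made up, populated from the table above.

```python
# Pre-flight context check: validate against the environment's *effective*
# window, not the marketed 1M figure.
# Assumption: o200k_base approximates GPT-5.5's tokenizer (unconfirmed).
import tiktoken

# Figures from the table above; the "measured" Codex value comes from the
# openai/codex#19319 bug report, not official documentation.
EFFECTIVE_LIMITS = {
    "api": 1_050_000,
    "codex_official": 400_000,
    "codex_measured": 258_400,
}

MAX_OUTPUT = 128_000  # per developers.openai.com

enc = tiktoken.get_encoding("o200k_base")

def fits_context(prompt: str, environment: str) -> bool:
    """True if the prompt plus a full-size response fits the window."""
    return len(enc.encode(prompt)) + MAX_OUTPUT <= EFFECTIVE_LIMITS[environment]
```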

Pricing: 2x Increase + Long Context Premium

The published API pricing:

```
gpt-5.5
  Input:        $5.00 / 1M tokens
  Output:      $30.00 / 1M tokens
  Cached input: $0.50 / 1M tokens

gpt-5.5-pro (parallel test-time compute variant)
  Input:       $30.00 / 1M tokens
  Output:     $180.00 / 1M tokens
```

That's 2x GPT-5.4's input price and 3x its output price (GPT-5.4: $2.50 input / $10.00 output).

The hidden premium: inputs over 272K tokens get 2x input cost and 1.5x output cost. So if you actually use the full 1M window, you're paying double on input. This makes "1M context is essentially priced twice" a fair characterization.
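To make the premium concrete, here is a minimal cost estimator. It assumes the 2x/1.5x multipliers apply to the entire request once input crosses 272K tokens; the announcement doesn't say whether the premium is marginal or total, so treat this as an upper-bound sketch.

```python
# Rough GPT-5.5 cost estimator with the long-context premium.
# Assumption: the 2x input / 1.5x output multiplier applies to the whole
# request once input exceeds 272K tokens (marginal vs. total unspecified).
INPUT_RATE = 5.00 / 1_000_000    # $ per input token
OUTPUT_RATE = 30.00 / 1_000_000  # $ per output token
LONG_CONTEXT_THRESHOLD = 272_000

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = INPUT_RATE, OUTPUT_RATE
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate *= 2.0   # long-context input premium
        out_rate *= 1.5  # long-context output premium
    return input_tokens * in_rate + output_tokens * out_rate

# A 900K-token input with 20K output: ~$9.90.
# The same output behind a 250K-token input: ~$1.85.
print(estimate_cost(900_000, 20_000), estimate_cost(250_000, 20_000))
```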

Other pricing modifiers:

  • Batch / Flex: 50% of standard
  • Priority processing: 250% of standard
  • Regional processing (data residency): +10%
  • Codex Fast mode: 1.5x speed at 2.5x cost
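These modifiers stack onto the base rates. A tiny helper, assuming they compose multiplicatively (the docs list them separately and don't spell out how tiers interact, so this is a guess):

```python
# Apply published pricing modifiers to a base dollar cost.
# Assumption: modifiers compose multiplicatively; interactions between
# tiers (e.g. batch + regional) aren't specified in the docs.
MODIFIERS = {
    "standard": 1.0,
    "batch_flex": 0.5,  # 50% of standard
    "priority": 2.5,    # 250% of standard
}
REGIONAL_SURCHARGE = 1.10  # +10% for data residency

def modified_cost(base_cost: float, tier: str, regional: bool = False) -> float:
    cost = base_cost * MODIFIERS[tier]
    return cost * REGIONAL_SURCHARGE if regional else cost
```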

OpenAI argues that token-efficiency gains offset the price hike for many workloads. Your mileage will depend heavily on the task mix.

Safety: AISI Found a Universal Jailbreak in 6 Hours

The UK AI Security Institute (AISI) ran a 6-hour expert red team and found a universal jailbreak before launch. OpenAI says they fixed it before release. However, per Transformer News, AISI did not directly verify the fix in the final deployment configuration.

GPT-5.5 is rated "High" on OpenAI's Preparedness Framework for both cybersecurity and biology (below "Critical" but above prior models). OpenAI launched a Bio Bug Bounty program for finding biology safeguard bypasses.

For cybersecurity, OpenAI is positioning defensively via the Trusted Access for Cyber program — vetted defenders get expanded access to GPT-5.5's cyber capabilities. SecureBio evaluation reportedly found "wet-lab virology troubleshooting assistance above expert level," which is the basis for the High rating.

Practical Guidance (Day 4)

Real-world feedback is still thin. Based on what OpenAI has published:

Use GPT-5.5 for:

  • Agentic coding workflows (Terminal-Bench-style tasks)
  • Computer use / OS automation (OSWorld-Verified)
  • Long-context recall in the 512K-1M range
  • Tier-4 frontier mathematics
  • Knowledge work where GDPval is representative

Consider Claude Opus 4.7 for:

  • Pure SWE-Bench Pro-style coding tasks
  • Academic reasoning (GPQA Diamond)
  • Humanity's Last Exam-style questions

Cost optimization:

  • Stay below 272K input tokens to avoid the long-context premium
  • Use Batch/Flex modes for 50% off when latency is flexible (see the sketch after this list)
  • Cached input drops the cost to $0.50/1M (a 90% saving)
  • In Codex, plan for ~258K-400K of context, not 1M
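As a sketch of the first two levers: the Responses API exposes a service_tier parameter, and OpenAI's cached-input discount applies automatically to repeated prompt prefixes. Whether gpt-5.5 accepts `service_tier="flex"` exactly as earlier models did is an assumption; check the pricing docs.

```python
# Cheaper GPT-5.5 calls when latency is flexible.
# Assumption: gpt-5.5 accepts the same service_tier values as earlier
# OpenAI models (unconfirmed at time of writing).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a code-review assistant. ..."  # keep this prefix stable

response = client.responses.create(
    model="gpt-5.5",
    service_tier="flex",         # 50% of standard pricing, best-effort latency
    instructions=SYSTEM_PROMPT,  # identical prefixes hit the $0.50/1M cache rate
    input="Review this diff: ...",
)
print(response.output_text)
```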

Caveats Worth Repeating

  1. All benchmarks above are at reasoning effort xhigh. Default API settings will likely produce lower scores; a sketch for pinning the effort explicitly follows this list.
  2. We're 4 days post-launch. External reproduction and independent evaluation are pending.
  3. OpenAI's comparison tables have empty cells (-) for some Claude/Gemini entries, so "SOTA across the board" is an overstatement of what was actually published.
  4. Korean and other non-English language performance is not specifically benchmarked in the announcement.
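On caveat 1: if you want benchmark-comparable behavior, pin the effort yourself rather than relying on the default. A minimal sketch, assuming gpt-5.5 exposes xhigh through the same reasoning.effort parameter that earlier OpenAI reasoning models use (verify the accepted values against developers.openai.com):

```python
# Pin reasoning effort to match the reported benchmark configuration.
# Assumption: "xhigh" is a valid reasoning.effort value for gpt-5.5 (the
# announcement reports scores at xhigh, but the parameter is unverified).
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "xhigh"},  # default effort will likely score lower
    input="Prove or refute: every planar graph is 4-colorable.",
)
print(response.output_text)
```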

Disclaimer: AI-assisted research digest. Verify primary sources before making decisions. We're 4 days into the release; expect updates as third-party evaluations come in.
