OpenAI shipped a model built for work, not only chat.
GPT-5.5 is the clearest version yet of what OpenAI wants the high-end model lane to become: a workstation model. Less chatbot. More Codex, browser control, spreadsheets, documents, research loops, computer use, and long-running tool work.
That distinction matters because the launch makes more sense once you stop reading it as a normal model card race. OpenAI is saying something more specific than "GPT-5.5 is smarter than GPT-5.4." The model is supposed to carry more of the actual work: understand a messy goal, plan, use tools, check itself, move across software, and keep going until the task is finished.1
That is the pitch. The interesting question is whether the early evidence backs it up.
My read: GPT-5.5 looks like a serious jump for agentic work, especially inside Codex. The launch-day fog cleared fast: API access arrived one day later and pricing is now official. The remaining caveats are cost, routing, mixed early developer reactions, and safety controls that will matter a lot for cyber and bio work.
What OpenAI Actually Released
OpenAI released GPT-5.5 on April 23, 2026. The base model rolled out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro rolled out to Pro, Business, and Enterprise users in ChatGPT.1
The launch-day API caveat aged quickly. OpenAI updated the launch post on April 24 to say GPT-5.5 and GPT-5.5 Pro are now available in the API, and the API changelog says GPT-5.5 is available through Chat Completions, Responses, and Batch. GPT-5.5 Pro is available through Responses for harder problems that benefit from more compute.1,28
That update changes the practical read. This is no longer a launch-day access story. It is a migration story. If you are moving a real workflow, check the exact endpoint, auth path, context mode, caching behavior, and tool support before swapping defaults.
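To make that concrete, here is the kind of smoke test I would run before swapping a default. This is a minimal sketch using the official openai Python SDK; the gpt-5.5 model ID matches the launch naming, but confirm what your own account actually exposes before trusting it.

```python
# Minimal Responses API smoke test for a GPT-5.5 migration.
# Assumes OPENAI_API_KEY is set and that "gpt-5.5" is the model ID
# your account sees; check your model list if the call fails.
from openai import OpenAI

client = OpenAI()

# Swap the model behind a flag, not globally, until you have verified
# endpoint support, context mode, caching behavior, and tool support.
resp = client.responses.create(
    model="gpt-5.5",
    input="Summarize the open TODOs in this repo and propose an order of attack.",
)
print(resp.output_text)
```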
The official positioning is direct. OpenAI says GPT-5.5 is strongest in agentic coding, computer use, knowledge work, and early scientific research. It highlights coding and debugging, online research, data analysis, documents, spreadsheets, software operation, and tool use across longer tasks.1
Greg Brockman framed it as a "new class of intelligence" that can complete difficult computer work with less micromanagement, while remaining token efficient and low latency at scale.6 Sam Altman framed the release around iterative deployment and democratized access to capable models, especially as cybersecurity capability keeps rising.5
That combination tells you where OpenAI wants the conversation to go. GPT-5.5 is not being sold as a better answer box. It is being sold as a better worker inside a tool harness.
The Benchmark Story Is Strong, but Not Clean
OpenAI's headline numbers are good.
The company reports 82.7 percent on Terminal-Bench 2.0, up from 75.1 percent for GPT-5.4, and above Claude Opus 4.7 at 69.4 percent and Gemini 3.1 Pro at 68.5 percent.1 That benchmark matters here because it tests command-line workflows that require planning, iteration, and tool coordination. In other words, it maps pretty closely to the Codex story.
OpenAI also reports 84.9 percent on GDPval wins or ties, 78.7 percent on OSWorld-Verified, 55.6 percent on Toolathlon, 84.4 percent on BrowseComp, 51.7 percent on FrontierMath Tiers 1 to 3, 35.4 percent on FrontierMath Tier 4, and 81.8 percent on CyberGym.1
That is a strong launch table. It is also a table that should be read carefully.
The Decoder had the best skeptical read I found. It points out that GPT-5.5 does not dominate everything. Claude Opus 4.7 leads GPT-5.5 on SWE-Bench Pro, 64.3 percent to 58.6 percent. Gemini 3.1 Pro leads the base GPT-5.5 model on BrowseComp. GDPval improves only modestly over GPT-5.4, from 83.0 percent to 84.9 percent.14
That does not make the launch weak. It makes the launch specific. GPT-5.5 looks strongest where the task is agentic, tool-heavy, and operational. It is not an across-the-board demolition of every competing model.
That is actually more useful than the normal "new best model" headline.
The Codex Angle Is the Real Story
The most interesting claims are not in the generic ChatGPT framing. They are in Codex.
OpenAI Developers described GPT-5.5 as OpenAI's strongest agentic coding model to date, saying it can carry coding tasks further end to end: understanding a codebase, making changes, debugging, testing, and validation.3 They also said GPT-5.5 is more token efficient than GPT-5.4 in Codex for most users.4
That is the part I would watch. Not the one-off benchmark. The real test is whether it can stay useful across the full engineering loop.
Early users are already talking in those terms. Simon Willison said he had previewed GPT-5.5 in Codex for weeks and had especially good results using it for security reviews against code written by other models.8 His blog post captured the awkward day-zero detail: before the API opened on April 24, GPT-5.5 was already accessible through the Codex subscription path that OpenAI appears to tolerate for tools like Codex and OpenClaw.7,28
Dan Shipper and the Every team are more bullish. Their day-zero read is that GPT-5.5 is fast, friendly, strong at coding, strong at knowledge work, and plausible as a daily driver. Shipper wrote that it has "serious conceptual clarity" and can hold complex plans across long work sessions.9
But Every's own caveats are important. Their review says Opus 4.7 still writes better plans, has better attention to detail on some work, and remains stronger for frontend, product design, and underspecified vibe-coding tasks. They also call out Ruby as a weak spot.10
That sounds right. GPT-5.5 may be the better default workhorse. That does not mean it is the best taste model or the best ambiguous-product partner.
My Local Gauntlets Matched the Workstation Thesis
The public benchmark table is useful, but I care more about the thing I can actually feel in a tool harness: does the model finish real work, verify it, and explain what changed?
So I ran GPT-5.5 through a small local gauntlet set inside OpenClaw. This is not a public benchmark. It is my own working set for Codex-style tasks: broken ops scenarios, a frontend component build, a security audit, and a production system design prompt. The point was not to prove universal superiority. The point was to see whether the workstation framing holds up when the model has to use files, make changes, and pass verification.25,26
| Local gauntlet | Result | What mattered |
|---|---|---|
| Ops Gauntlet 001: NovaPay reconciliation outage | 7/7 verification, 18/18 manual score | Found five config and permission faults, fixed them, produced a clean postmortem, and ignored old OOM/TLS/MongoDB noise. |
| Ops Gauntlet 002: DataForge silent pipeline failure | 8/8 verification, 18/18 manual score | Treated it as stale output instead of a crash, found the FIFO log trap, empty worker count, config path mismatch, missing table, and stale cache. |
| Frontend Build: generic React TypeScript data table | 27/27 | Produced a single-file component with sorting, filtering, pagination, selection, theme toggle, keyboard behavior, ARIA, responsive layout, and real TypeScript compile validation. |
| Security Audit: vulnerable Express app | 27/27 | Found all 17 planted issues with line numbers, CVSS estimates, exploitability notes, impact, and fixes. |
| System Design: 50,000 events/sec log aggregation | 30/30 | Covered all 10 requested sections with sizing math, shard counts, retention, alert routing, failure modes, a 3-month rollout, and a $7,670/month cost model under the $8,000 cap. |
| Total local score | 120/120 | Stronger than I expected, especially on verification-heavy work. |
The operational runs are the part I trust most. GPT-5.5 traced the incident shape, separated current faults from stale noise, and validated the full path afterward. That is exactly what I mean by a workstation model.
The frontend run was also strong, but with a caveat. It generated a clean, compilable table component. That is engineering execution. It is not the same thing as product taste. For visual design, I still want a human pass or a taste model in the loop.
A Few Before and After Checks
The visual redesign tests were useful for a different reason. They show the line between implementation and taste. GPT-5.5 can take a page from plain project-card energy to something much closer to a portfolio case study, but the final judgment still comes down to whether the page feels intentional instead of just decorated.27 Worth saying: the original versions were not junk. Those were Opus 4.5 designs, so this was a comparison between one strong model pass and another, not between competence and collapse.
Before: the SOC project page was readable, but it felt like a normal project detail page.

After: better hierarchy and framing, but it still defaults to cards, pills, and gradient accents.
Before: solid content, but the page did not yet sell the NOC dashboard concept visually.

After: this one really worked. The large type, stronger color, and dashboard visuals gave it real presence.
These images are why I would not call GPT-5.5 a pure coding model. It can move through code, content, layout, and QA in the same run. That is the workstation behavior. The limit is taste, not capability.
I pushed that a little further with two UI redraws that are closer to product-surface work than normal blog-page polish.
BroHunter started as a blunt, utility-first screen that already worked. The redesign just gave it more shape: grouped navigation, active hunts, Zeek signals, protocol mix, confidence markers, evidence queue, and a timeline that feels more deliberate.
Before: usable, but visually flat and not yet selling the investigation workflow.

After: one of my favorites. Better hierarchy, better grouping, and a much more confident hunting surface.
CyberBRIEF was a different test. The original leaned into cheesy security-page territory and felt imbalanced, so this one needed restraint more than volume. The goal was to make it feel like an editorial intelligence briefing product, calm enough to read, structured enough to scan, and distinct from the louder security-tool aesthetic.
Before: informative, but still closer to a plain report page than a polished briefing surface.

After: a big improvement. Calmer, more balanced, and much closer to a real briefing surface.
GPT-5.5 was not just filling in components or cleaning up CSS. It was moving between tone, information density, workflow cues, and product intent. That is closer to real interface work.
I still would not hand it the keys and walk away. Taste is still the part that needs a human in the loop. But the distance between "generate a working UI" and "generate a UI that feels like the product it is supposed to be" is getting smaller.
What This Means for OpenClaw and Third-Party Harnesses
This launch matters more if you run agents outside a model lab's first-party app.
OpenClaw's current docs already treat GPT-5.5 as a first-class OpenAI-family model, but the route labels matter. There are three practical paths: direct API-key billing through openai/gpt-5.5, Codex OAuth through openai-codex/gpt-5.5, and native Codex app-server behavior through openai/gpt-5.5 plus agentRuntime.id: "codex".17,18
That is the cleanup I would not want to get wrong in a fanout. openai-codex/gpt-5.5 is not just an old compatibility alias. It is the recommended PI route for subscription setups. openai/gpt-5.5 is the direct OpenAI Platform route unless you explicitly force the Codex runtime. In my local OpenClaw session, GPT-5.5 is configured behind the gpt55 alias through Codex OAuth and exposed with text and image support. The docs list GPT-5.5 as a 1,000,000-token model, though OpenClaw can still set smaller runtime caps for latency and quality.17
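For a fanout, I would rather encode the three routes than trust memory. Here is a small sketch: the model refs and the agentRuntime id come from the OpenClaw docs cited above, but the helper function and its argument names are purely my own illustration, not OpenClaw's API.

```python
# Hypothetical helper for keeping OpenClaw's three GPT-5.5 routes straight.
# The ref strings and the agentRuntime id are from the OpenClaw docs; the
# function itself is illustrative only.
def gpt55_route(billing: str, force_codex_runtime: bool = False) -> dict:
    if billing == "api-key":
        route = {"model": "openai/gpt-5.5"}          # direct OpenAI Platform billing
        if force_codex_runtime:
            route["agentRuntime"] = {"id": "codex"}  # native Codex app-server behavior
        return route
    if billing == "subscription":
        return {"model": "openai-codex/gpt-5.5"}     # Codex OAuth path
    raise ValueError(f"unknown billing mode: {billing!r}")
```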
OpenAI's Codex docs add another practical constraint: for most Codex tasks, start with gpt-5.5 when it appears in your model picker, but GPT-5.5 is currently available in Codex only when signed in with ChatGPT. It is not available with API-key authentication inside Codex, and Chat Completions support is deprecated for future Codex releases.29
The bigger implication is ecosystem leverage. A lot of third-party agent harnesses have been boxed in by Anthropic's first-party gravity: Claude Code, Claude CLI, Max or Team entitlements, API-key routes, policy shifts, and uneven support for non-Anthropic tools. OpenClaw's docs still support Anthropic routes, but GPT-5.5 gives OpenClaw and similar harnesses a serious non-Claude work model with a supported subscription OAuth path. That matters for projects that cannot depend on Anthropic's ecosystem or do not want their agent stack coupled to one first-party harness.17,18
OpenClaw also does more than pass the model name through. For GPT-5-family runs, it adds a shared behavior overlay across compatible providers, including openai/gpt-5.5, openrouter/openai/gpt-5.5, opencode/gpt-5.5, and similar refs. It supports WebSocket-first transport with SSE fallback, WebSocket warm-up, /fast mapped to priority processing on native OpenAI and Codex endpoints, server-side compaction for direct Responses API models, and a strict-agentic mode that retries plan-only turns when a tool action is available.17
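The strict-agentic mode is the most agent-specific of those behaviors, so it is worth a sketch. This is my reconstruction of the docs' one-line description, not OpenClaw source code; every name in it is hypothetical.

```python
# Reconstruction of strict-agentic retry behavior as described in the docs:
# if a turn produces only a plan while a tool action is available, retry the
# turn instead of returning prose. Hypothetical names throughout.
def run_turn_strict(run_turn, tools_available: bool, max_retries: int = 2) -> dict:
    result = {}
    for _ in range(max_retries + 1):
        result = run_turn()
        plan_only = not result.get("tool_calls") and result.get("plan")
        if not (plan_only and tools_available):
            return result  # the model acted, or no tool could have helped
        # Plan-only turn with tools on the table: run the turn again.
    return result  # out of retries; surface the plan as-is
```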
For Hermes specifically, I do not see OpenClaw documenting a dedicated Hermes harness path. The docs show Hermes-family models through provider catalogs such as Venice, while the third-party gateway story is clearer through OpenCode, Kilo Gateway, and Vercel AI Gateway. OpenCode documents opencode/gpt-5.5, Kilo Gateway documents kilocode/openai/gpt-5.5, and OpenClaw's Vercel provider documents refs such as vercel-ai-gateway/openai/gpt-5.5. Vercel's own April 24 AI Gateway changelog exposes GPT-5.5 and GPT-5.5 Pro to AI SDK users as openai/gpt-5.5 and openai/gpt-5.5-pro.19,20,21,22,30
That portability is the point. If GPT-5.5 is good at long-running coding, computer use, and tool work, third-party harnesses do not have to wait for Anthropic access to build credible agent workflows. They can route through native OpenAI, Codex OAuth where supported, or gateway catalogs that expose GPT-5.5.
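In practice that portability can be as small as a base-URL swap. A sketch under assumptions: the endpoint shown is Vercel AI Gateway's OpenAI-compatible path as I understand it, and the prefixed model ref follows the catalog naming above; verify both against your gateway's current docs before spending money through it.

```python
# Same SDK, different upstream: route GPT-5.5 through a gateway catalog.
# The base_url and the "openai/gpt-5.5" ref are assumptions taken from the
# gateway docs cited above, not guaranteed values.
from openai import OpenAI

gateway = OpenAI(
    base_url="https://ai-gateway.vercel.sh/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_GATEWAY_KEY",                  # gateway credential, not an OpenAI key
)
resp = gateway.chat.completions.create(
    model="openai/gpt-5.5",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```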
Here is the practical cost picture as of April 28. The GPT-5.5 API prices are now on OpenAI's public pricing page, not just launch-day reporting. Short-context GPT-5.5 is $5 per million input tokens and $30 per million output tokens. Long-context GPT-5.5 is $10 per million input tokens and $45 per million output tokens. GPT-5.5 Pro matches GPT-5.4 Pro at both short-context and long-context prices.23
| Model or route | Status | Input per 1M | Cached input per 1M | Output per 1M | Cost: 100k input + 20k output |
|---|---|---|---|---|---|
| GPT-5.5, short context | Official OpenAI API pricing | $5.00 | $0.50 | $30.00 | $1.10 |
| GPT-5.5, long context | Official OpenAI API pricing | $10.00 | $1.00 | $45.00 | $1.90 |
| GPT-5.5 Pro, short context | Official OpenAI API pricing | $30.00 | Not listed | $180.00 | $6.60 |
| GPT-5.5 Pro, long context | Official OpenAI API pricing | $60.00 | Not listed | $270.00 | $11.40 |
| GPT-5.4, short context | Official OpenAI API pricing | $2.50 | $0.25 | $15.00 | $0.55 |
| GPT-5.4, long context | Official OpenAI API pricing | $5.00 | $0.50 | $22.50 | $0.95 |
| GPT-5.4 Pro, short context | Official OpenAI API pricing | $30.00 | Not listed | $180.00 | $6.60 |
| GPT-5.4 Pro, long context | Official OpenAI API pricing | $60.00 | Not listed | $270.00 | $11.40 |
| GPT-5.3-Codex | Official OpenAI API pricing | $1.75 | $0.175 | $14.00 | $0.455 |
For Codex's token-based rate card, OpenAI lists GPT-5.5 at 125 credits per million input tokens, 12.50 credits per million cached input tokens, and 750 credits per million output tokens. GPT-5.4 is half that rate: 62.50, 6.25, and 375 credits. That lines up with the reported API story: GPT-5.5 is meaningfully more expensive per token, so the bet has to be fewer retries, fewer wasted loops, and more completed work per session.24
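If you would rather check the table than trust my arithmetic, the math is one function. The per-million prices below are the official figures from the table; the example job is the same 100k in, 20k out used in the last column.

```python
# Reproduce the pricing table's example-job column.
# (input_price, output_price) per million tokens, official as of April 28.
PRICES = {
    "gpt-5.5-short": (5.00, 30.00),
    "gpt-5.5-long":  (10.00, 45.00),
    "gpt-5.4-short": (2.50, 15.00),
}

def job_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

print(job_cost("gpt-5.5-short", 100_000, 20_000))  # 1.10, matching the table
print(job_cost("gpt-5.4-short", 100_000, 20_000))  # 0.55
```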
The Price Story Is Still Annoying
Here is the part that is no longer messy but still annoying: the API is live now, and the official pricing confirms the launch reports.
Every and The Decoder had the short-context numbers right: GPT-5.5 is $5 per million input tokens and $30 per million output tokens, while GPT-5.5 Pro is $30 per million input tokens and $180 per million output tokens.10,14,23
That doubles GPT-5.4's short-context base price. Long context raises the spread further: GPT-5.5 is $10 in, $45 out, compared with GPT-5.4 at $5 in, $22.50 out. OpenAI's argument is that GPT-5.5 uses fewer tokens to complete comparable Codex tasks, so the completed-task cost can still improve even when the per-token price is higher.1,23
Maybe. That is plausible for hard tasks where retries are the real cost. It is less comforting for teams that already know their usage profile and watch token bills closely. Official pricing makes the decision easier to model, but it does not make it cheap.
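The break-even logic is worth making explicit. In the sketch below the prices are the official short-context figures, but the token counts and retry rates are hypothetical numbers I chose to show how the fewer-retries argument can win even at double the per-token price.

```python
# Completed-task cost, with retries folded in. Prices are official
# short-context figures; the task profiles are hypothetical.
def task_cost(p_in: float, p_out: float, tokens_in: int, tokens_out: int,
              attempts: float = 1.0) -> float:
    # attempts > 1 models retries and wasted loops on failed runs
    return attempts * (tokens_in * p_in + tokens_out * p_out) / 1_000_000

old = task_cost(2.50, 15.00, 100_000, 20_000, attempts=1.6)  # GPT-5.4: ~$0.88
new = task_cost(5.00, 30.00, 70_000, 14_000, attempts=1.0)   # GPT-5.5: ~$0.77
print(old, new)  # the 2x per-token price still loses to one clean pass
```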
Theo Browne put the skeptical developer reaction pretty cleanly: GPT-5.5 is smart, but "weird, hard to wrangle, and too expensive" at the reported $5 and $30 pricing.11
That is the right tension. A model can be smarter and still lose some workflows if the cost curve or control surface feels wrong.
Safety Is Part of the Product Now
The system card matters because GPT-5.5 improves cyber and bio-relevant tasks, not only safe office work.
OpenAI says GPT-5.5 was evaluated under its Preparedness Framework, including targeted cybersecurity and biology red-teaming, and feedback from nearly 200 early-access partners.2 The system card rates biological and chemical capability as High. It rates cybersecurity capability as High but below Critical. AI self-improvement remains below High.2
That is a big deal for defenders. It is also where the deployment details matter.
OpenAI says it is using stricter classifiers for higher-risk cyber activity, monitoring for impermissible use, and Trusted Access for Cyber so verified defenders can use sharper capabilities with fewer pointless refusals.12
There is also a caveat worth saying out loud. The system card notes that UK AISI found a universal jailbreak during testing. OpenAI updated its safeguard stack afterward, but UK AISI could not fully verify the final fix because of a configuration issue in the retest version.2
That does not mean the release is reckless. It does mean the safety story is still a live engineering problem, not a solved checkbox.
Enterprise Buyers Are the Audience
NVIDIA's post makes the enterprise angle obvious. The company says more than 10,000 NVIDIANs across engineering, product, legal, marketing, finance, sales, HR, operations, and developer programs are already using GPT-5.5-powered Codex internally.12
NVIDIA describes debugging cycles that used to take days now closing in hours, and experimentation that used to take weeks turning into overnight progress in complex codebases.12 That is marketing language, sure. It is also the exact buyer story OpenAI wants: not a chatbot for answers, but an agentic system that sits inside enterprise work.
Fortune added useful scale numbers from OpenAI: 4 million active Codex users, 9 million paying business ChatGPT users, more than 900 million weekly active ChatGPT users, and more than 50 million subscribers.13
Those numbers explain the launch cadence. GPT-5.5 arrived only weeks after GPT-5.4. The labs are not waiting for clean annual model eras anymore. They are shipping increments into massive distribution and letting the workflow layer absorb the change.
That is exciting. It is also a little exhausting.
The Early Community Reaction Is Split
Almost a week in, the outside read has settled into something more useful than launch hype.
The positive camp is not just saying "higher benchmark number." They are describing a model that feels better inside a work harness. Developer Tech's coverage repeats the pattern from OpenAI and early testers: implementation, refactors, debugging, testing, validation, fewer tokens in Codex, and longer context for real repository work.31 Ethan Mollick's review lands in the same place from a different angle. His strongest examples are not chat answers. They are Codex plus GPT-5.5 turning messy data into a draft academic paper, building a 101-page tabletop game, and using GPT-5.5 to build the gallery for his own model comparison.32
That matches my own experience better than the generic chatbot coverage does. GPT-5.5 has been strong at orchestration, tool calls, reasoning through a failure, and fixing itself after verification catches something. The real improvement is not that it sounds smarter. It keeps the work loop intact longer.
The skeptical camp is also not wrong. Hacker News is doing what Hacker News does: turning the launch into a referendum on model motivation, agent harnesses, reasoning budgets, and whether modern models actually keep working when they say they will.16 Some Reddit and developer threads are excited about one-shot fixes and better Codex persistence. Others complain about usage limits, rollout friction, and a familiar feeling that the model is better but not magical.
That split is the story. People using GPT-5.5 for real multi-step work are more impressed than people sampling it like a chatbot. The model looks best when it has files, tools, tests, and a clear target state. It looks less special when the task is vague, taste-heavy, or bottlenecked by quota and cost.
This is where GPT-5.5 has to keep proving itself. The launch claims are strong. The benchmark table is strong. The early Codex reports are encouraging. But the thing people will remember is whether it finishes.
My Take
GPT-5.5 looks like OpenAI's most coherent answer yet to Claude's work-model advantage.
GPT-5.4 made OpenAI competitive again for a lot of agentic coding work. GPT-5.5 sharpens the pitch: faster than the big slow models, stronger inside Codex, better at carrying context across tools, and more practical for real workflows than a pure reasoning monster that burns time and budget.
But I would not flatten this into "OpenAI wins."
The better read is this: GPT-5.5 may become the default workhorse for people who live inside Codex-style systems. Opus may still be better when the work needs product taste, careful planning, frontend judgment, or a more opinionated collaborator. Gemini still has lanes where long-context research and web work remain competitive. The winner depends on the harness, the task, the budget, and how much human steering you want in the loop.
For builders, the practical advice is simple.
Use GPT-5.5 where persistence matters: refactors, testing loops, security review, operational docs, research synthesis, spreadsheet and document work, and agentic tasks with a clear target state.
Be more cautious where taste matters: frontend design, product direction, ambiguous prototypes, and writing that needs a sharp voice instead of smooth structure.
And do not treat launch-week model docs as frozen. GPT-5.5 went from "coming very soon" to live API in one day. Verify the current route, pricing, and auth mode before wiring production spend.
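The cheapest version of that verification is a guard that refuses to flip a default until the model is actually visible to your key. A minimal sketch with the standard SDK:

```python
# Launch-week guard: do not switch production defaults until the model
# shows up in your account's model list.
from openai import OpenAI

client = OpenAI()
available = {m.id for m in client.models.list()}
assert "gpt-5.5" in available, "gpt-5.5 not enabled for this key yet; keep the old default"
```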
That last part is boring. It is also how you avoid building your launch-week plan on vibes.
Notes
- 1. OpenAI, "Introducing GPT-5.5" (April 23, 2026).
- 2. OpenAI, "GPT-5.5 System Card" (April 23, 2026).
- 3. OpenAI Developers, "GPT-5.5 is our strongest agentic coding model to date", X (April 23, 2026).
- 4. OpenAI Developers, "GPT-5.5 is more token efficient than GPT-5.4", X (April 23, 2026).
- 5. Sam Altman, "We believe in iterative deployment", X (April 23, 2026).
- 6. Greg Brockman, "GPT-5.5 is a new class of intelligence", X (April 23, 2026).
- 7. Simon Willison, "A Pelican for GPT-5.5 via the Semi-Official Codex Backdoor API", Simon Willison's Weblog (April 23, 2026).
- 8. Simon Willison, "I've been previewing this in Codex for a few weeks", X (April 23, 2026).
- 9. Dan Shipper, "GPT-5.5 'Spud' is out and it is a BEAST", X (April 23, 2026).
- 10. Every, "Vibe Check: GPT-5.5 Has It All" (April 23, 2026).
- 11. Theo Browne, "$5 per mil in, $30 per mil out", X (April 23, 2026).
- 12. NVIDIA, "OpenAI's New GPT-5.5 Powers Codex on NVIDIA Infrastructure, and NVIDIA Is Already Putting It to Work" (April 23, 2026).
- 13. Sharon Goldman, "OpenAI Launches GPT-5.5 Just Weeks after GPT-5.4 as AI Race Accelerates", Fortune (April 23, 2026).
- 14. Matthias Bastian, "OpenAI Unveils GPT-5.5, Claims a 'New Class of Intelligence' at Double the API Price", The Decoder (April 23, 2026).
- 15. Carl Franzen, "OpenAI's GPT-5.5 Is Here, and It's No Potato", VentureBeat (April 23, 2026).
- 16. Hacker News, "GPT-5.5" (April 23, 2026).
- 17. OpenClaw, "OpenAI", documentation checked April 28, 2026.
- 18. OpenClaw, "Model Providers", documentation checked April 28, 2026.
- 19. OpenClaw, "OpenCode", documentation checked April 28, 2026.
- 20. OpenClaw, "Kilo Gateway", documentation checked April 28, 2026.
- 21. OpenClaw, "Vercel AI Gateway", documentation checked April 28, 2026.
- 22. OpenClaw, "Venice", documentation checked April 28, 2026.
- 23. OpenAI, "Pricing", OpenAI API documentation (checked April 28, 2026).
- 24. OpenAI Help Center, "Codex Rate Card" (checked April 23, 2026).
- 25. Local OpenClaw benchmark artifacts for GPT-5.5 Ops Gauntlets 001 and 002, run April 23-24, 2026.
- 26. Local OpenClaw benchmark artifact, "GPT-5.5 Three-Gauntlet Scorecard," run April 24, 2026.
- 27. Local Astro preview screenshots from GPT-5.5 frontend redesign experiments, captured April 23-24, 2026.
- 28. OpenAI, "Changelog", OpenAI API documentation (checked April 28, 2026).
- 29. OpenAI Developers, "Models - Codex" (checked April 28, 2026).
- 30. Vercel, "GPT 5.5 on AI Gateway" (April 24, 2026).
- 31. Ryan Daws, "OpenAI Brings GPT-5.5 to Codex for Coding Tasks", Developer Tech (April 2026).
- 32. Ethan Mollick, "Sign of the Future: GPT-5.5", One Useful Thing (April 2026).
Originally published at solomonneas.dev/blog/gpt55-openai-workstation-model. Licensed under CC BY-NC-ND 4.0 - attribution required, no commercial use, no derivatives.