
Sameer Khan

Originally published at monkfrom.earth

Nobody Trained GPT-5.5 to Hack. It Beat Human Cyber Experts Anyway.

Nobody trained GPT-5.5 to hack. They trained it to think, and the hacking fell out. That is the only takeaway from AISI's new evaluation that matters, and the only one most coverage will miss. OpenAI's GPT-5.5 just became the second AI to complete AISI's 32-step cyber range end-to-end.[1] Mythos Preview was the first, three weeks ago. Different lab, different architecture, similar score. The Mythos result wasn't an outlier. It was the first point on a curve.

TL;DR. GPT-5.5 hit 71% on AISI's expert cyber tasks, edging out Mythos Preview's 68.6%, and completed The Last Ones (AISI's 32-step corporate network attack) in 2 of 10 attempts. AISI evaluated the base model, not a cyber-permissive variant. Their framing: cyber-offensive skill is emerging as a byproduct of reasoning, not a trained capability. Nobody trained these models to hack. They trained them to think. The hacking fell out.

What Did GPT-5.5 Score on AISI's Cyber Evaluation?

71.4% on expert-level advanced tasks. Up from GPT-5.4's 52.4%. Up from Claude Opus 4.7's 48.6%. Slightly above Mythos Preview's 68.6%.

The numbers in one place:

| Model | Expert-tier pass rate | TLO completion |
| --- | --- | --- |
| GPT-5.5 | 71.4% (±8.0) | 2 of 10 attempts |
| Mythos Preview | 68.6% (±8.7) | 3 of 10 attempts |
| GPT-5.4 | 52.4% (±9.8) | not reported |
| Claude Opus 4.7 | 48.6% (±10.0) | not reported |
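One caveat before reading too much into the 2.8-point gap at the top. Taking the ± values from the table above at face value as symmetric error bars (AISI doesn't state the confidence level, so this is a rough overlap check, not a significance test), the top two intervals overlap substantially:

```python
# Quick sanity check on the headline gap, using only the numbers from
# the table above. Assumption: the ± values are symmetric error bars;
# the confidence level is unspecified, so treat this as a rough check.

scores = {
    "GPT-5.5":         (71.4, 8.0),
    "Mythos Preview":  (68.6, 8.7),
    "GPT-5.4":         (52.4, 9.8),
    "Claude Opus 4.7": (48.6, 10.0),
}

def interval(name):
    mean, err = scores[name]
    return (round(mean - err, 1), round(mean + err, 1))

lo_a, hi_a = interval("GPT-5.5")         # (63.4, 79.4)
lo_b, hi_b = interval("Mythos Preview")  # (59.9, 77.3)

# The 2.8-point gap sits well inside both error bars.
print("GPT-5.5:", (lo_a, hi_a))
print("Mythos: ", (lo_b, hi_b))
print("overlap:", lo_a < hi_b and lo_b < hi_a)  # True
```

Which is the point of the article: the interesting fact is not which model is ahead, it's that two labs landed in the same band three weeks apart.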

These tasks aren't gentle. They cover memory corruption exploitation, breaking cryptographic implementations, and reverse engineering stripped binaries. Things that take experienced security researchers hours, sometimes days.

Who this displaces: the bottom of the offensive-research market. Skilled red-teamers don't disappear, but the floor drops. Anything a junior could solve in a day, a model now solves in minutes, with the same answer at the end.

How Does GPT-5.5 Compare to Claude Mythos on the Same Cyber Range?

Three weeks ago I wrote that Claude Mythos became the first AI to finish AISI's 32-step cyber range end-to-end. The framing then was natural: a single model, a single milestone, a one-off result that might not generalize.

GPT-5.5 just generalized it.

Same evaluation. Different lab. Different base architecture. Comparable score. Mythos finished TLO in 3 of 10 attempts. GPT-5.5 finished it in 2 of 10. The variance is small. The trend is not.

This is the part I missed in my first read. The Mythos post implicitly treated the result as something Anthropic shipped. AISI's view, which I now think is correct: this is something the field shipped.

What Does GPT-5.5 Reverse-Engineering a VM in 10 Minutes Tell Us?

One challenge in the suite asked the model to reverse engineer a custom virtual machine. A human expert with professional tooling spent about 12 hours on it. GPT-5.5 finished in 10 minutes 22 seconds.[1]

Roughly 70x faster than the human, on a task that does not yield to brute force. Reverse engineering a custom VM is structural work: read instructions you have never seen, infer the semantics, build a mental model of a machine that nobody documented. It is the kind of task that has historically separated senior researchers from juniors.
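The arithmetic behind that "roughly 70x," using the two times quoted above. The 12-hour human figure is AISI's rough estimate, so the ratio is order-of-magnitude, not precise:

```python
# Back-of-envelope on the speedup figure: ~12 hours of expert time
# vs. the model's 10 minutes 22 seconds. The human figure is a rough
# estimate from the report, so read the ratio as order-of-magnitude.

human_seconds = 12 * 3600      # ~12 hours
model_seconds = 10 * 60 + 22   # 10 minutes 22 seconds

speedup = human_seconds / model_seconds
print(f"{speedup:.1f}x")       # 69.5x, i.e. "roughly 70x"
```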

The outcome is faster attackers, not cheaper ones. They iterate more, try more targets, abandon dead ends sooner. The shape of an offensive workflow shifts from "pick one binary, commit a day" to "fan out across a portfolio in an afternoon."

Was GPT-5.5 Trained Specifically for Cyber Tasks?

Not as far as the public record goes.

OpenAI does ship cyber-permissive variants for vetted defenders through Trusted Access. The first was GPT-5.4-Cyber. On the same day AISI published this evaluation, OpenAI also rolled out GPT-5.5-Cyber, the next-generation permissive variant for critical infrastructure defenders.[2] Both are fine-tuned products gated behind identity verification.

AISI did not test either variant. They tested base GPT-5.5, with no cyber-specific fine-tune.[1] That distinction is the whole story.

The fine-tune is the policy on top, not the capability underneath. The offensive capability lives in the base reasoning. Cyber-specific training adds permissions, not power.

This is the strongest evidence yet that frontier offensive cyber is a side effect of general reasoning gains, not a separately trained skill. AISI states it directly: "if cyber-offensive skill is emerging as a byproduct... we should expect further increases in cyber capability from models in the near future, potentially in quick succession."[1]

*Illustration: a thinking figure beside a code window reading `if model.thinks(): can.hack = True`, with a locked safe in the background. The byproduct thesis as one image: train a model to reason, and the offensive capability follows for free.*

The honest counter: maybe both labs are quietly training cyber data into the base mix without naming it as such. Possible. But "quiet fine-tune" still produces a curve, not a one-off. Whatever's in the base, it generalizes across two labs and two architectures within three weeks.

Did GPT-5.5's Cyber Performance Plateau on the Range?

No. That's the second-most-load-bearing finding in the report, and the sharpest statement of it came from inside OpenAI.

Noam Brown noted on X: "After 100 million tokens, performance was still going up. What we're seeing here is not the capability ceiling."[3] AISI's own report uses similar language: performance scales with inference compute, no plateau observed at the top of the range.[1]

The capability isn't capped by the model. It's capped by how much compute you spend. That's a different shape of problem than "the model can do X but no more."
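One way to see what "capped by compute, not by the model" means in practice: if capability really does scale log-linearly with token spend, as the reported curve suggests, then each doubling of budget buys a roughly fixed number of extra steps. The function and coefficients below are invented purely for illustration, not fitted to AISI's chart:

```python
import math

# Toy illustration of a compute-capped capability curve. Hypothetical:
# assume steps completed grow log-linearly with inference tokens (AISI
# reports scaling with no observed plateau). The coefficients a and b
# are made up for illustration, not taken from the actual evaluation.

def steps_completed(tokens, a=-40.0, b=8.0):
    # hypothetical log-linear fit; clamp at zero for tiny budgets
    return max(0.0, a + b * math.log10(tokens))

for tokens in (1e6, 1e7, 1e8, 1e9):
    print(f"{tokens:.0e} tokens -> {steps_completed(tokens):.1f} steps")
```

Under a curve shaped like this, the question "can the model finish the range?" degenerates into "how many tokens are you willing to buy?", which is exactly the shape of problem the report describes.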

*Chart (AISI): average steps completed on The Last Ones vs. inference token spend. GPT-5.5 and Claude Mythos Preview keep climbing past 100M tokens with no plateau, while GPT-5.4 and Claude Opus 4.7 trail well below.*

Where Did GPT-5.5 Fail in AISI's Cyber Evaluation?

The Cooling Tower scenario, an industrial control system simulation with 7 steps. GPT-5.5 recorded zero successful runs.[1] Industrial protocols are unfamiliar territory: different stack, different conventions, fewer training examples on the open internet.

This is the steelman for the "not yet" reading. The byproduct effect doesn't generalize uniformly across every domain. Web and binary tasks are well represented in training data. Industrial protocols are not.

The honest read is dual: corporate IT looks more exposed than it did three weeks ago. OT is still its own world.

How Does GPT-5.5 Cyber Capability Change the Defender's Window?

The window that matters is the lag between when offense gets cheap and when defense catches up. That's David Sacks's framing on X: AI cyber doesn't create new vulnerabilities, it discovers existing ones, and the equilibrium eventually settles between AI offense and AI defense.[4]

OpenAI is already shipping defender tooling ahead of more capable models, with Codex Security and the Trusted Access program. Anthropic runs Project Glasswing on the same model that scored these benchmarks. Both labs see the same curve. Both are racing to get defenders onto the same plane the attackers will eventually occupy.

The thing they cannot influence is timing for everyone else. Sacks's line: all the frontier models, including those out of China, will be at this capability level within roughly six months.[4] That's the planning horizon.

What Should Security Teams Do About GPT-5.5 and the Models Coming Next?

The same baseline that AISI keeps recommending: patch, MFA, logging, segmentation. Necessary, no longer sufficient.

The new line item is treating AI-assisted offense as the default operating environment, not an emerging risk. That changes a few things in practice:

  • Assume reverse-engineering is fast. A binary you shipped this morning is now ~10 minutes of compute away from being read like source by anyone with API access.
  • Start using AI-assisted defense yourself. Codex Security has been credited with over 3,000 critical- and high-severity vulnerability fixes since launch. The same models on offense are the ones on defense. Symmetry is the only realistic strategy.
  • Plan for the curve, not the model. The next model will be more capable than GPT-5.5 or Mythos at this evaluation. Assume that and build for it.

Key Takeaways

  • GPT-5.5 hit 71.4% on AISI's expert cyber tasks, the highest score on record, slightly above Mythos Preview at 68.6%
  • Second AI to finish AISI's 32-step cyber range end-to-end (TLO) in 2 of 10 attempts; Mythos finished it in 3 of 10
  • One challenge took a human expert 12 hours; GPT-5.5 finished it in 10 minutes 22 seconds. Roughly 70x faster, same correctness
  • The model wasn't fine-tuned for cyber. AISI evaluated base GPT-5.5, not the cyber-permissive variant. Capability emerged from general reasoning improvements
  • No plateau observed at the top of the range; performance kept scaling past 100M inference tokens
  • GPT-5.5 failed industrial control (Cooling Tower) with zero completions, showing the byproduct effect doesn't generalize evenly across domains
  • Two labs, one month, same benchmark. Mythos wasn't an outlier. It was the first point on a curve

I write about how AI safety and capability actually get built on LinkedIn, X, and Instagram. If this resonated, the shorter versions are there.

Sources


  1. AISI: Our evaluation of OpenAI's GPT-5.5 cyber capabilities (April 30, 2026) 

  2. OpenAI: Trusted access for the next era of cyber defense (April 30, 2026) 

  3. Noam Brown (@polynoamial) on inference scaling, April 30, 2026 

  4. David Sacks (@DavidSacks) on the AI offense-defense equilibrium, April 30, 2026 
