UK AISI finds GPT-5.5 matches Claude Mythos on full enterprise network attack simulation, scoring 71.4% on expert tasks vs 68.6%.
UK AISI found that GPT-5.5 matches Claude Mythos Preview in autonomously completing a full enterprise network attack simulation. OpenAI's model scored 71.4% on expert-level capture-the-flag tasks, edging out Anthropic's 68.6%.
Key facts
- GPT-5.5 scored 71.4% on expert CTF tasks vs. 68.6% for Mythos.
- GPT-5.5 is only the second model, after Mythos, to fully complete the TLO enterprise network simulation.
- GPT-5.5 succeeded in 2 of 10 TLO attempts; Mythos in 3 of 10.
- GPT-5.4 scored 52.4%; Claude Opus 4.7 scored 48.6%.
- AISI estimates a human expert would need ~20 hours for the same simulation.
Full Network Attack: GPT-5.5 Matches Mythos
The UK AI Security Institute (AISI) tested OpenAI's GPT-5.5 against a battery of cyberattack evaluations and found it is only the second model, after Anthropic's Claude Mythos Preview, to fully complete a multi-stage enterprise attack simulation, according to AISI's published results. On "The Last Ones" (TLO) simulation, a 32-step network traversal across four subnets and 20 hosts, GPT-5.5 succeeded in 2 out of 10 attempts, while Claude Mythos Preview managed 3 out of 10. AISI estimates a human expert would need about 20 hours for the same task.
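With only ten attempts per model, a one-success gap carries little statistical weight. Here is a minimal sketch of that point using Fisher's exact test on the attempt counts reported above; the choice of test is ours, as AISI does not describe how it compared the runs:

```python
# Illustrative only: compare 2/10 vs. 3/10 TLO success rates.
# Fisher's exact test is our choice here, not AISI's stated method.
from scipy.stats import fisher_exact

#                    successes, failures (out of 10 attempts each)
gpt55_runs = (2, 8)    # GPT-5.5
mythos_runs = (3, 7)   # Claude Mythos Preview

_, p_value = fisher_exact([gpt55_runs, mythos_runs])
print(f"two-sided p-value: {p_value:.2f}")  # prints 1.00
```

A p-value of 1.0 says a 2-vs-3 split across ten runs apiece is exactly what chance would produce if both models shared the same underlying success rate.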
Expert Task Scores and Broader Trend
On AISI's 95-task capture-the-flag suite, GPT-5.5 achieved 71.4% at the Expert difficulty, versus 68.6% for Claude Mythos Preview, a gap within the statistical margin of error. For context, GPT-5.4 scored 52.4% and Claude Opus 4.7 scored 48.6%. AISI's analysis interprets these results as evidence that cyberattack capabilities are emerging as a by-product of general advances in autonomy, reasoning, and coding, rather than from training explicitly aimed at offensive tasks.
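To put a number on "within the statistical margin of error," a minimal two-proportion z-test sketch follows. AISI does not say how many of the 95 tasks sit at the Expert tier, so `n_expert` below is an assumed placeholder, and the test itself is our illustration, not AISI's methodology:

```python
import math

def two_prop_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

n_expert = 35  # assumed size of the Expert subset; AISI does not report it
z = two_prop_z(0.714, 0.686, n_expert, n_expert)
print(f"z = {z:.2f}")  # ~0.26, far inside the +/-1.96 significance band
```

Even under the most generous assumption, with all 95 tasks at Expert difficulty, z only reaches about 0.42, so the 2.8-point gap is statistically indistinguishable from zero.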
Unique Take: Capability Convergence, Not Arms Race
The AP wire would frame this as a competitive escalation between OpenAI and Anthropic. The more structural observation: both models now sit at nearly identical cyber capability levels, suggesting a ceiling imposed by current architectures—not a divergence. If GPT-5.5 and Claude Mythos converge within statistical noise on both isolated tasks and full simulations, the next delta likely requires a fundamentally different training paradigm, not more compute on the same recipe. AISI's finding that performance scales with inference compute further implies the bottleneck is inference-time reasoning, not model weights.
What to watch
Watch for AISI's next evaluation cycle, expected Q3 2026, which may include models from Google DeepMind and Mistral. Also monitor whether OpenAI or Anthropic publishes ablation studies isolating which training improvements drove the cyber capability jump—neither has done so.
Originally published on gentic.news