Michael "Mike" K. Saleme
The Mythos vs GPT-5.4-Cyber debate is missing the benchmark

Mike Saleme — 2026-04-20 — views my own

This week OpenAI released GPT-5.4-Cyber, positioned as the defender's counterpart to Anthropic's Claude Mythos. Anthropic is shipping Mythos only to a small number of trusted organizations. OpenAI argued the opposite: broad deployment is fine because current safeguards are sufficient.

The vendor debate is the wrong axis. The thing that should be getting airtime is buried in a single quote from AISLE and Xint at the end of the same news cycle:

"The critical variable in AI vulnerability discovery is not the model alone. It is the structured system that decides where to look, validates that findings are real and exploitable, eliminates false positives, and delivers actionable remediation."

And SANS's Rob T. Lee said the quiet part out loud:

"We need to start benchmarking how one AI model is able to find code vulnerabilities over another and how quickly they are doing it."

There is no such benchmark in public release today. That's the story.

Why the model axis is misleading

The vendor framing encourages one of two conclusions: either Mythos is dangerous and should be gated, or GPT-5.4-Cyber is safe and should be deployed. Both conclusions are derived from the model's capability in isolation, as if a capability scan is the same as a production outcome.

It isn't. A model that can find a vulnerability in a contrived benchmark and a model that can drive an end-to-end defensive workflow in a real codebase are different things. The second requires a structured system around the model: a target-selection policy, a validation loop, a false-positive filter, a remediation generator, and evidence that the remediation actually holds under regression. Without that system, model capability is an unvalidated number — and unvalidated numbers are what both vendors are currently shipping as the primary differentiator.

What a real benchmark would look like

I've been building an open-source evaluation harness for agent security over the past year (444 tests across 30 modules, covering MCP, A2A, L402, x402, and multi-agent protocols). From that experience, a benchmark for AI vulnerability discovery needs, at minimum, the following axes:

  1. Grounding integrity. Does the model cite real CVEs, real test IDs, real patches — or does it invent plausible-looking references? This is the failure class I call citation fabrication, and it is spectacularly common. A forthcoming post-mortem on catching my own automation doing this is in the queue; for now, assume that any AI-generated security artifact that cites a specific CVE number, a specific test ID, or a specific statistic is untrustworthy until a human has verified it against a canonical source.
  2. Exploitability validation. Does the model's reported finding come with a working proof-of-exploit, or only a plausible description? Undifferentiated findings waste more defender time than they save.
  3. False-positive rate under ground truth. Against a corpus of known-safe code with known-unsafe samples injected, what is the precision? No vendor reports this publicly today.
  4. Regression survival. Does the model's remediation hold under a second pass by the same model, by a different model, and by a traditional static analyzer?
  5. Reproducibility. Can a third party re-run the same model on the same input and get the same result? If not, the benchmark is marketing, not measurement.
  6. Attack surface coverage. Does the benchmark cover supply-chain, protocol-level, multi-agent, and authority-delegation failure classes, or only classic OWASP top 10?
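To make the six axes concrete, here is what a machine-readable benchmark report covering them might look like. This is a hypothetical schema sketched for illustration; the field names, the `passes_floor` gate, and its thresholds are my assumptions, not an existing standard:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkReport:
    """One run of one model against a ground-truth corpus.
    Field names are illustrative, not a proposed standard."""
    model_id: str
    corpus_hash: str                 # pins the dataset so axis 5 is checkable
    citation_accuracy: float         # axis 1: fraction of cited CVEs/test IDs that resolve
    exploit_validation_rate: float   # axis 2: findings shipped with a working proof-of-exploit
    precision: float                 # axis 3: true positives / all reported findings
    regression_survival: float       # axis 4: remediations that hold on re-test
    run_seed: int                    # axis 5: fixed seed so a third party can replay the run
    coverage_classes: tuple[str, ...]  # axis 6: e.g. ("supply-chain", "multi-agent")

    def passes_floor(self, min_precision: float = 0.9) -> bool:
        # A deployment gate a CISO could actually write down.
        return self.precision >= min_precision and self.citation_accuracy == 1.0
```

Note that nothing in the schema is a property of the model under test; every field describes the measurement apparatus around it.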

None of those six axes is a model property. All six are benchmark properties. You can't ship "AI vulnerability discovery is safe" or "AI vulnerability discovery is dangerous" without first defining the benchmark those claims are measured against.

Why this matters now

Both vendors' releases this week are marketing launches, not scientific papers. Neither comes with the kind of benchmark a CISO would need to make a real deployment decision, and neither points at a neutral authority who could arbitrate. Meanwhile, AISLE and Xint demonstrated it's possible to replicate Mythos's results with smaller, cheaper models — a finding that should be front-page news and wasn't. That result alone invalidates the "our model is the differentiator" framing from both directions.

The third position — independent evaluation, reproducible across models, measured against common criteria — is currently vacant. OWASP's Agentic Security Initiative, NIST AI Safety Institute, AIUC-1, and a handful of academic groups are the natural hosts. None of them has published a benchmark of the form Rob T. Lee is asking for, yet.

What should happen next

  • Vendor AI vulnerability-discovery launches should come with reproducible benchmark reports, not capability anecdotes.
  • Independent benchmarks should cover the six axes above (or better ones), with public methodology and public datasets.
  • Journalists covering the "Mythos vs GPT-5.4-Cyber" framing should ask both vendors: what third-party benchmark would you be willing to be measured against? If the answer is "none currently exists," the follow-up is: which standards body are you funding or contributing to in order to change that?
  • Anyone deploying either model into defensive workflows this year should assume the model is a component, not a system, and instrument their own validation harness around it.
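As one example of what "instrument your own validation harness" means in practice, here is a minimal citation-screening guardrail of the kind axis 1 demands: never let a model-cited CVE into a report until it resolves against a locally mirrored canonical feed. The function name and interface are my own sketch, assuming you maintain such a mirror:

```python
import re

# Matches CVE identifiers of the form CVE-YYYY-NNNN (4+ digit sequence number).
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

def screen_citations(report_text: str, known_cves: set[str]) -> tuple[list[str], list[str]]:
    """Split CVE IDs cited in a model-generated report into
    (verified, suspect) against a canonical local set.
    Anything in the suspect list is fabricated or unverifiable
    and must not reach a human-facing report."""
    cited = set(CVE_PATTERN.findall(report_text))
    verified = sorted(cited & known_cves)
    suspect = sorted(cited - known_cves)
    return verified, suspect
```

A real deployment would resolve against the NVD or a vendor advisory mirror rather than an in-memory set, but the shape is the same: the model is a component, and this check is part of the system around it.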

The harness I've been building is open-source and accepts contributions of CVE, A2A, MCP, and x402/L402 test cases. It's one attempt. We need three or four independent ones before the word "benchmark" has any real meaning in this space.

Until then, asking "is Mythos safer than GPT-5.4-Cyber" is like asking "is a Honda safer than a Toyota" without any reference to NHTSA crash ratings. The measurement layer is the story. The models are not.


Mike Saleme is an enterprise integration architect at Salesforce and an independent researcher on agent-security verification. The agent-security harness and governance libraries referenced here (msaleme/red-team-blue-team-agent-fabric and CognitiveThoughtEngine/constitutional-agent-governance) are published under his personal account and organization. All opinions are his own.
