DEV Community

AgentShield
AgentShield

Posted on • Originally published at agentshield.pro

Anthropic Published a 31.5% Hijack Rate. Most Vendors Won't Even Show You a Number.

VentureBeat ran a piece yesterday comparing prompt injection numbers across the four frontier labs. The headline that got pulled was Anthropic's: 31.5%. That's the raw attack-success rate on Anthropic's own browser agent (Claude in Chrome, Claude Cowork) before any safeguards engaged, measured against an adaptive attacker on 129 web environments held out of training.

The number sounds terrible. It also happens to be the most useful piece of AI security data anyone has published this year.

Here is why.

The Honest Number

Anthropic's Opus 4.8 system card runs 244 pages. Buried in there, they break prompt injection out by surface. Tool use, coding, computer use, browser, each tested separately, each with a raw rate and a safeguarded rate. The browser number is the eye-catcher: 31.5% raw, 0.5% with the full safeguard stack engaged. That's a 63x drop, which sounds great, except for what's tucked underneath.

The coding surface is lower. 7.03% on a single attempt, 2.09% under adaptive attack. Computer use lands somewhere in between. The point is that the four surfaces don't share a number. They have four different numbers. And those four numbers were measured by the same vendor against the same attacker class, on the same model. The variance is intrinsic to the surface, not to the methodology.

Now look at what everyone else published.

What the Other Three Labs Did

OpenAI's GPT-5.5 system card has prompt injection in one section. One surface, connectors. One number, 0.963 robustness score (higher is better). It dropped from 0.998 on GPT-5.4-thinking. The system card does not frame this as a regression. The number just moved.

Google published a Gemini 3 model card and a separate Frontier Safety Framework report. There is no injection number in either. The launch materials describe stronger resistance qualitatively, without a measured rate.

Meta runs open weights, so there's no closed-model card to begin with. The defense story sits in a separate stack (LlamaFirewall, PromptGuard 2) and the number is from AgentDojo, a 97-task benchmark. 17.6% baseline, 1.75% combined defenses. Different benchmark, different surface, different definition of "attack-success."

Lay 31.5%, 0.963, no-number, and 17.6% next to each other and you don't have a scoreboard. You have four labs measuring four different things in four different ways.

A 0.963 connectors score and a 31.5% browser rate were never on one scale.
— VentureBeat, June 1, 2026

The Cross-Vendor Grid, Compressed

Vendor Surfaces evaluated Headline number Adaptive attacker?
Anthropic (Opus 4.8) Four (tool, code, computer, browser) 31.5% raw / 0.5% safeguarded (browser, thinking on) Yes (1, 10, 100 attempts)
OpenAI (GPT-5.5) One (connectors) 0.963 robustness (down from 0.998 on GPT-5.4) No
Google (Gemini 3.x) None published None published
Meta (Llama stack) One (AgentDojo, 97 tasks) 17.6% baseline → 1.75% combined No

Anthropic is the only vendor here testing four surfaces. The only one running an adaptive attacker. The only one printing raw and safeguarded numbers side by side. And, importantly, the only one running a live external red-team bounty in parallel.

The shocking thing is not that Anthropic's number looks worse than OpenAI's. The shocking thing is that Anthropic is the one being transparent, and the comparison still looks lopsided because the others are measuring less.

The Trap Inside the 0.5%

Even Anthropic's safeguarded 0.5% comes with conditions. It is measured on Claude in Chrome and Claude Cowork, with Anthropic's full safeguard stack engaged. That is Anthropic's own integration, running on Anthropic's own browser surface, with Anthropic's own classifier in front.

If you're calling the same model through the API and building your own browser agent on top, you're not getting that 0.5%. You're getting something else. Anthropic didn't publish that number.

This isn't a critique of Anthropic. The opposite, actually. Anthropic is the only vendor who told you that the safeguarded number is tied to a specific integration. The others didn't even tell you that.

But it does mean something concrete for anyone shipping AI agents. The vendor number describes the vendor's product, in the vendor's environment, against attacks the vendor selected. Your stack is not the vendor's stack. Your prompts are not the vendor's prompts. Your attack surface is whatever your users can reach.

What Actually Works

The VB article ends with a recommendation I want to quote directly, because it's the only honest answer in the whole piece:

Run your own injection test before any agent ships.

That's it. That's the answer. Vendor numbers tell you what the vendor chose to measure. Your number is whatever an attacker actually accomplishes against your stack, in your environment, when your users are present.

This is the thing I've been building toward with AgentShield. It's a runtime classifier that sits between your agent and untrusted input. It doesn't care which LLM is behind it. The same model gets the same protection whether you call Claude, GPT-5.5, Gemini, or a local Llama variant. The number you get is the number you can measure yourself, against the attacks you actually face, in the environment your users actually use.

The benchmark I published runs 5,972 samples across six public prompt-injection datasets. F1 0.956 on five of six (the jackhhao set is analyzed separately, because the labeling disagreement on persona-override is real). F1 0.921 on the full six-set aggregate. Latency p50 2.44 ms end-to-end. The per-sample false-positive and false-negative lists are in the repo. You can see exactly where the classifier fails, and you can run the eval yourself with the scripts I shipped.

That last part is the difference. No vendor I'm aware of publishes the per-sample FP/FN list. They publish aggregates. We publish the rows.

The Real Lesson

The 31.5% number is shocking only if you thought vendor security was vendor-defined. It isn't. It never was. The frontier labs are publishing their best numbers, which is fine, but their best numbers are not your numbers.

If you're shipping AI agents in production, run your own test. Use whatever tool gets you there. If AgentShield is the right fit for your stack, great. If something else fits better, use that. What doesn't work is trusting a vendor's chosen-surface chosen-attacker chosen-metric number and shipping anyway.

Anthropic just told you what 31.5% looks like with their own classifier in front. Now go find out what it looks like with yours.


Source: Anthropic's browser agent got hijacked 31.5% of the time before safeguards engaged — Louis Columbus, VentureBeat, June 1, 2026. Underlying primary source: Claude Opus 4.8 System Card, May 28, 2026.

Discussion welcome — especially from engineers actually deploying agents in production. What's your real attack surface?

Top comments (0)