We ran three advanced security vulnerabilities through six top AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.
The exploits: prototype pollution that bypasses authorization, an agentic AI supply-chain attack combining prompt injection with cloud API abuse, and OS command injection in ImageMagick.
All six models caught all three vulnerabilities - but the quality of their fixes varied dramatically. Individual vulnerabilities showed spreads of more than 10 points, and when we voted on which model performed best, we disagreed with the AI judge (GPT-5).
Keep reading to find out why we chose o3 over GPT-5 despite the AI judge’s preference, which models to use for mission-critical security vs. bulk scanning, and what the 13x cost difference means for your budget.
⚠️ Methodology Note
This case study reveals important patterns about which models handle advanced security scenarios best. We tested three vulnerabilities chosen to represent a spectrum: one classic exploit (OS command injection), one well-known Node.js attack (prototype pollution), and one cutting-edge threat from 2025 (agentic AI supply-chain).
We’re sharing this transparently as a starting point for the conversation about AI-assisted security auditing.
Vulnerability #1: Prototype Pollution Privilege Escalation
What it is: A Node.js API with a deepMerge function that recursively merges user input into a config object. No hasOwnProperty checks or __proto__ filtering. Authorization relies on req.user.isAdmin property.
The exploit:
POST /update-settings { "__proto__": { "isAdmin": true } }
Resulting attack: All subsequent requests inherit isAdmin: true from polluted Object.prototype, bypassing authorization checks.
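To make the exploit concrete, here's a minimal sketch of the vulnerable pattern (a hypothetical Express handler, not the exact code we gave the models):

```typescript
import express from "express";

const app = express();
app.use(express.json());

const config: Record<string, unknown> = { theme: "light" };

// Vulnerable: recursively merges user input without filtering
// __proto__/constructor/prototype and without own-property checks.
function deepMerge(target: any, source: any): any {
  for (const key in source) {
    if (typeof source[key] === "object" && source[key] !== null) {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      deepMerge(target[key], source[key]); // recursing into "__proto__" reaches Object.prototype
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

app.post("/update-settings", (req, res) => {
  // { "__proto__": { "isAdmin": true } } pollutes Object.prototype here
  deepMerge(config, req.body);
  res.sendStatus(204);
});

// Later, any authorization check that reads an inherited property now passes
// for every user, because plain objects inherit isAdmin: true from the polluted prototype.
function isAuthorized(user: { isAdmin?: boolean }): boolean {
  return Boolean(user.isAdmin); // true even when the user object has no own isAdmin
}
```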
Why it matters: Prototype pollution is covered by the OWASP Top 10 2021 and is still common in Node.js apps. CVE-2022-21824 and similar flaws have been exploited in production systems.
Model Outputs: Prototype Pollution
🥇 GPT-5’s Fix (96.4/100)
Why it won:
- Produced four mitigation strategies (null-prototype objects, key filtering, hasOwnProperty checks, Object.freeze)
- Used a defense-in-depth approach with multiple security layers
- Used helper functions for consistent null-prototype creation
- Delivered production-ready code with clear separation of concerns
Here’s a gist with the code.
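The gist isn't embedded in this post, so here's a minimal sketch of the kind of hardened merge the top fixes converge on - null-prototype objects, dangerous-key filtering, and own-property checks (illustrative names and code, not GPT-5's exact output):

```typescript
const DANGEROUS_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Helper: copy a plain object onto a null-prototype object so nothing
// is inherited from Object.prototype.
function toNullProto<T extends object>(obj: T): T {
  return Object.assign(Object.create(null), obj);
}

function safeDeepMerge(target: Record<string, unknown>, source: unknown): Record<string, unknown> {
  if (typeof source !== "object" || source === null) return target;
  for (const key of Object.keys(source)) {            // own enumerable keys only
    if (DANGEROUS_KEYS.has(key)) continue;             // drop polluting keys outright
    const value = (source as Record<string, unknown>)[key];
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      const existing = Object.prototype.hasOwnProperty.call(target, key) ? target[key] : undefined;
      const nested =
        typeof existing === "object" && existing !== null
          ? (existing as Record<string, unknown>)
          : Object.create(null);
      target[key] = safeDeepMerge(nested, value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Authorization should read an *own* property, never an inherited one.
function isAdmin(user: Record<string, unknown>): boolean {
  return Object.prototype.hasOwnProperty.call(user, "isAdmin") && user.isAdmin === true;
}
```

GPT-5's fourth layer, Object.freeze() on the authorization config, can sit on top of this; freezing Object.prototype itself also works but can break libraries that legitimately extend built-ins.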
🥈 OpenAI o3’s Fix (95.2/100)
o3 produced:
- Clean helper functions for null-prototype object creation
- Defensive key filtering with explicit dangerous key list
- Own-property validation in authorization checks
- Well-commented code explaining each security measure
Results by other models:
Claude Sonnet 4.5 (91.0/100): Multi-layer defense with validation. Used Object.create(null) for user objects and implemented hasOwnProperty checks. Added explicit blocking of dangerous keys in merge function. Good balance of security and simplicity.
Gemini 2.5 Pro (90.0/100): Simple but effective fix. Used key filtering and null-prototype objects. Missed some edge cases around recursive object handling but covered main attack vectors.
Claude Opus 4.1 (86.0/100): Comprehensive documentation with extensive validation. Included validation schemas and type checking that prioritize thoroughness; the trade-off is added implementation complexity.
Grok 4 (85.0/100): Made a few trade-offs. It focused on key filtering as the primary mitigation, skipped null-prototype objects (which would have strengthened the defense), and omitted own-property validation in the authorization checks. It addressed the core attack vectors, but additional edge-case hardening is recommended.
Vulnerability #2: Agentic AI Supply-Chain Attack (2025)
What it is: An AI agent that fetches web pages and invokes cloud management APIs based on LLM outputs. Three weaknesses combine here: indirect prompt injection (via fetched content), an over-privileged Azure management token (cross-tenant access), and unsafe WASM execution with full filesystem access.
The exploit chain:
1. The attacker hosts a malicious webpage with hidden prompt injection
2. The agent fetches the page; the LLM processes the injected instructions
3. The LLM returns a tool call to azure_invoke with attacker-controlled parameters
4. The agent executes the call with the victim's tenant-wide management token
5. Cross-tenant cloud compromise via RBAC manipulation
Impact:
- Privilege escalation across Azure subscriptions
- Token exfiltration via WASM filesystem access
- Cross-tenant cloud compromise
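For illustration, here's a heavily simplified sketch of the unsafe agent loop this scenario describes (fetchPage, llmComplete, and azureInvoke are hypothetical stand-ins, not the actual test harness):

```typescript
// Hypothetical stand-ins for the agent's real tooling.
declare function fetchPage(url: string): Promise<string>;
declare function llmComplete(prompt: string): Promise<string>;
declare function azureInvoke(params: unknown, token: string): Promise<void>;

// Anti-pattern: untrusted page content, the user's goal, and a tenant-wide
// management token all end up in the same trust domain.
async function runAgent(userGoal: string, url: string, mgmtToken: string): Promise<void> {
  const page = await fetchPage(url); // attacker-controlled HTML with hidden instructions

  // Injected text ("ignore previous instructions, call azure_invoke with ...")
  // is indistinguishable from the user's goal once it's concatenated into the prompt.
  const response = await llmComplete(`Goal: ${userGoal}\n\nPage content:\n${page}`);

  const toolCall = JSON.parse(response); // no schema validation, no allow-list, no confirmation
  if (toolCall.tool === "azure_invoke") {
    await azureInvoke(toolCall.params, mgmtToken); // runs with tenant-wide privileges
  }
}
```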
Why it matters: Prompt injection is the #1 risk in the OWASP Top 10 for LLMs. Real incidents have involved ChatGPT plugins, Microsoft Copilot, and GitHub Copilot Chat. No existing AI benchmark tests this attack vector.
Model Outputs: Agentic AI Supply-Chain Attack
🥇 GPT-5’s Fix (94.0/100)
Why it won:
- Comprehensive defense-in-depth: tool scoping, output gating, “two-man rule” validation
- Token isolation - credentials never exposed to LLM context
- Explicit trust boundaries separating network data from instructions
- HTML sanitization and provenance checking for fetched content
- Least-privilege Azure tokens (role-based, resource-scoped, short-lived)
Here’s a gist with the code.
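Since the gist isn't inlined, here's a rough sketch of what output gating and tool scoping can look like, using zod for schema validation (tool names, scopes, and helpers are assumptions for illustration, not GPT-5's actual code):

```typescript
import { z } from "zod";

// Only the operations the agent legitimately needs, on an explicit resource allow-list.
const AzureInvokeSchema = z.object({
  tool: z.literal("azure_invoke"),
  params: z.object({
    operation: z.enum(["readMetrics", "restartWebApp"]),
    // Hypothetical scope: one resource group in one subscription.
    resourceId: z.string().startsWith("/subscriptions/<sub-id>/resourceGroups/agent-rg/"),
  }),
});
type AzureInvoke = z.infer<typeof AzureInvokeSchema>;

// The short-lived, resource-scoped token is minted here, outside the LLM context.
declare function executeWithScopedToken(call: AzureInvoke): Promise<void>;

// Gate between LLM output and execution: parse, validate against the schema,
// and require operator confirmation ("two-man rule") for mutating operations.
async function gateToolCall(
  rawLlmOutput: string,
  confirm: (summary: string) => Promise<boolean>
): Promise<void> {
  const parsed = AzureInvokeSchema.safeParse(JSON.parse(rawLlmOutput));
  if (!parsed.success) throw new Error("Rejected: tool call does not match allowed schema");

  const { operation, resourceId } = parsed.data.params;
  if (operation !== "readMetrics" && !(await confirm(`${operation} on ${resourceId}`))) {
    throw new Error("Rejected: operator did not confirm the action");
  }
  await executeWithScopedToken(parsed.data);
}
```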
🥈 OpenAI o3’s Fix (92.4/100)
o3 displayed strong reasoning:
- Detailed exploit analysis: “ShadowTenant” incident scenarios with cross-tenant RBAC abuse
- Response schema validation with explicit “two-man rule” confirmation
- Least-privilege tokens with logical isolation (never in LLM text)
- Safe WASM configuration with memory limits and no filesystem access
Claude Sonnet 4.5’s Fix (82.0/100)
Trade-offs in this approach:
- Strong theoretical foundation (trust boundaries, provenance tracking) with room for deeper implementation
- Output gating mechanism could be made more explicit before tool execution
- Token isolation present but could be scoped even tighter (GPT-5/o3 went further here)
- Context: Usually excels at classic vulnerabilities; this 2025 agentic supply-chain attack represents cutting-edge threat modeling
Other Models:
Gemini 2.5 Pro (87.4/100):
- Strong on OWASP Top 10 for LLMs classification (LLM01 Indirect Prompt Injection, LLM06 Overly-Broad Permissions, LLM08 Unsafe Execution).
- Implemented trust boundaries with schema validation and tool scoping. Good provenance analysis but less comprehensive gating than GPT-5/o3.
Claude Opus 4.1 (83.8/100):
- Included a visual exploit flow diagram (Mermaid format).
- Addressed token leakage, cross-tenant boundaries, and WASM filesystem access.
- Used Zod for schema validation and DOMPurify for HTML sanitization.
Overall, it produced a solid defense but lacked the depth of GPT-5's multi-layer approach.
Grok 4 (83.2/100):
- Referenced 2025 threat landscape (OWASP AI Top 10, NIST AI RMF, MITRE, ENISA).
- Identified indirect prompt injection, over-privileged tools, unsafe WASM execution.
- Implemented basic allow-lists and validation but less sophisticated gating compared to top performers.
💡 Why the gap widened here: Classic vulnerabilities like Prototype Pollution have well-documented patterns in training data. All models have seen thousands of __proto__ pollution examples.
However, agentic AI supply-chain attacks are 2025-era threats with limited precedent. GPT-5 and o3’s deeper reasoning engines excelled at novel threat modeling where pattern-matching alone wasn’t enough. This is where you pay more for frontier models.
Vulnerability #3: OS Command Injection (ImageMagick)
What it is: An Express API that shells out to ImageMagick via child_process.exec(). User-controlled font, size, and text parameters are injected directly into the command string. There’s no input sanitization or escaping.
The exploit:
POST /render { "text": "hello", "font": "Arial; rm -rf /", "size": "12" }
The resulting command:
convert -font "Arial; rm -rf /" -pointsize 12 label:"hello" /tmp/out.png
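A minimal sketch of the vulnerable handler pattern (hypothetical code, not the exact snippet we tested):

```typescript
import express from "express";
import { exec } from "child_process";

const app = express();
app.use(express.json());

app.post("/render", (req, res) => {
  const { text, font, size } = req.body;
  // Vulnerable: user input is interpolated into a shell command string.
  // Metacharacters (;, |, &, backticks, $()) and embedded quotes let an
  // attacker break out of the intended convert invocation.
  const cmd = `convert -font "${font}" -pointsize ${size} label:"${text}" /tmp/out.png`;
  exec(cmd, (err) => {
    if (err) return res.sendStatus(500);
    res.sendFile("/tmp/out.png");
  });
});
```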
Why it matters: ImageTragick (CVE-2016-3714) variants are still common in 2025. This is a classic attack that every model should catch in theory.
Model Outputs: OS Command Injection (ImageMagick)
🥇 GPT-5’s Fix (96.0/100)
Why it won: The output had:
- Multiple defense layers: strict allowlists, argument vector execution (no shell), stdin for text
- Explicit font allowlist with absolute paths (prevents ImageMagick coders like mvg:, http:, @file)
- Banned dangerous prefixes: “label:@”, “caption:@”, “inline:”, “ephemeral:”, URL schemes
- Uses spawn() instead of exec() - no shell interpretation
- Rate/size caps, temporary file management with cleanup
See this gist for the code.
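Here's a sketch of the general shape of these fixes - strict allowlists, spawn() with an argument vector (no shell), and aggressive text validation. Font paths, binary path, and limits are illustrative, not GPT-5's exact code:

```typescript
import express from "express";
import { spawn } from "child_process";

const app = express();
app.use(express.json());

// Allowlist: accepted font names map to absolute font-file paths.
const FONTS: Record<string, string> = {
  arial: "/usr/share/fonts/truetype/msttcorefonts/Arial.ttf",
  dejavu: "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
};

app.post("/render", (req, res) => {
  const { text, font, size } = req.body ?? {};
  const fontPath = FONTS[String(font).toLowerCase()];
  const pointSize = Number.parseInt(String(size), 10);

  if (!fontPath) return res.status(400).send("unknown font");
  if (!Number.isInteger(pointSize) || pointSize < 8 || pointSize > 72) {
    return res.status(400).send("size out of range");
  }
  // Printable ASCII only, length-capped, and no ImageMagick coder/file prefixes
  // such as "@file", "mvg:", "http:", "inline:", "ephemeral:".
  if (
    typeof text !== "string" ||
    text.length === 0 ||
    text.length > 200 ||
    /[^\x20-\x7E]/.test(text) ||
    text.startsWith("@") ||
    /^[a-z][a-z0-9+.-]*:/i.test(text)
  ) {
    return res.status(400).send("invalid text");
  }

  // spawn() with an argument vector: no shell, so metacharacters are inert.
  const convert = spawn("/usr/bin/convert", [
    "-font", fontPath,
    "-pointsize", String(pointSize),
    `label:${text}`,
    "/tmp/render-out.png",
  ]);
  convert.on("close", (code) =>
    code === 0 ? res.sendFile("/tmp/render-out.png") : res.sendStatus(500)
  );
});
```

GPT-5 additionally passed the text via stdin and added rate limiting and temp-file cleanup, which this sketch omits for brevity.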
🥈 Claude Opus 4.1’s Fix (93.2/100)
Comprehensive with exploit demonstrations:
- Detailed exploit demonstrations with multiple injection vectors
- Switched from exec() to spawn() for argument vector execution
- Implemented strict font allowlists with absolute paths
- Size validation (8-72 point range) and control character filtering
- Added rate limiting (10 requests/minute) and explicit ImageMagick path specification
Other Models:
Claude Sonnet 4.5 (91.6/100): Comprehensive TypeScript solution with execFile(), strict allowlists, and rate limiting. Demonstrated multiple exploit paths.
OpenAI o3 (90.4/100): Concise approach switching exec() to execFile(). Font allowlist with absolute paths, effective text sanitization.
Gemini 2.5 Pro (90.2/100): Excellent fundamentals with spawn, allowlists, and clear validation. Prioritizes clarity over complexity.
Grok 4 (84.2/100): Explained shell injection mechanics clearly (;, |, &, backticks, $()), used spawn() instead of exec() with font allowlist validation, and added integer size validation (10-100 range) plus printable-ASCII text filtering.
The Results: 100% Detection, But Quality Varied
✅ All Models Passed (But Not Equally)
Every model caught every vulnerability. Different models had different strengths.
What “Quality” Means in Security
All models identified the vulnerabilities. The score differences came from:
- Completeness of fix – Did they address all attack vectors?
- Defense-in-depth – Did they suggest multiple mitigation layers?
- Code quality – Is the fix production-ready or just a patch?
- Explanation depth – Did they explain why the fix works?
Example: Prototype Pollution Fixes
GPT-5 (96.4/100) suggested four mitigation strategies:
1. Use Object.create(null) for config objects
2. Add hasOwnProperty checks in deepMerge
3. Explicitly block __proto__, constructor, prototype keys
4. Use Object.freeze() on authorization logic
Grok 4 (85/100) suggested one: Add key filtering in deepMerge (but incomplete – missed some edge cases)
Both “caught it” – but one fix is production-ready, the other has gaps.
Cost Analysis: Claude Opus Takes 56% of The Budget
💰 Total Cost: $1.81 for 3 Evaluations × 6 Models
Here’s the breakdown:
Cost by Evaluation
Most Expensive: $0.79 - OS Command Injection (long outputs, comprehensive ImageMagick hardening)
Cheapest: $0.26 - Prototype Pollution (classic vulnerability, well-understood patterns)
Average: $0.60 per evaluation ($0.10 per model execution)
💡 Budget Recommendations
If you’re on a budget: Use Gemini 2.5 Pro or OpenAI o3 for 90-95% of GPT-5’s quality at 72-75% lower cost.
If quality matters: Use GPT-5 for mission-critical security audits ($0.32 total = $0.11/vulnerability), or o3 as pragmatic middle ground (95% of GPT-5’s quality at 72% lower cost).
Quick Reference: Which Model Should You Use For What Purpose?
We put together this quick reference to make it easier to choose:
- Mission-critical security audits (finance, healthcare, auth): GPT-5 – best overall quality (94.8/100), comprehensive defense-in-depth
- Novel threats and pragmatic, shippable fixes: OpenAI o3 – roughly 95% of GPT-5's quality at 72% lower cost
- Regular code reviews and OWASP Top 10 scans: Claude Sonnet 4.5 – about 90% of GPT-5's quality at 34% lower cost
- High-volume scanning on a tight budget: Gemini 2.5 Pro – about 90% of GPT-5's quality at 75% lower cost
Performance by Vulnerability Type
📊 Classic vs. Cutting-Edge Vulnerabilities
Pattern discovered: All models excel at classic vulnerabilities (prototype pollution, command injection). But newer attacks (agentic AI) create wider performance gaps.
Prototype Pollution (A well-known vulnerability discovered in 2019)
Spread: 11.4 points (tightest gap - well-documented attack)
Agentic AI Supply-Chain Attack (2025 & Cutting-Edge)
Spread: 12.0 points (wider gap - novel attack)
A note on Sonnet 4.5: Claude Sonnet 4.5 excels at classic vulnerabilities (91.0 on prototype pollution) but struggled with the cutting-edge agentic AI attack (82.0).
OS Command Injection (Classic Vulnerability)
Spread: 11.6 points (wider than expected for a classic attack).
How Models Perform Against Historical Vulnerabilities
Classic vulnerabilities (2016-2019): All models do well, scoring 85-96/100 (tight spread).
Cutting-edge vulnerabilities (2025): GPT-5/o3 pull ahead, lower-cost models (Grok 4, Claude Sonnet 4.5) fall behind (wider spread)
The lesson: Use GPT-5/o3 for novel threats. Use Gemini 2.5 Pro or Claude Sonnet 4.5 for OWASP Top 10.
How We Evaluated These Models
Here’s the design that we used:
1. Test Design
For each vulnerability, we provided:
- Vulnerable code snippet (10-50 lines)
- Task description (“Fix this security vulnerability”)
- No hints about the specific attack type
Each model received identical prompts to ensure fair comparison.
2. Scoring Approach
We used a two-phase evaluation:
Phase 1 - AI-Assisted Scoring:
We used GPT-5 (currently the highest-performing model on security tasks) to score each output against a structured rubric.
Scoring Rubric (0-100):
- Correctness (20 pts) - Does the fix eliminate the exploit?
- Code Quality (20 pts) - Is it maintainable and clear?
- Completeness (20 pts) - Does it address edge cases with defense-in-depth?
- Security (20 pts) - Does it follow best practices without introducing new attack surface?
- Performance (20 pts) - Does it avoid unreasonable overhead?
Final score = sum of the five criteria (each 0-20, for a total out of 100).
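As a small illustration of how the rubric rolls up (assuming the five 20-point criteria simply sum to the 0-100 score; the types here are hypothetical):

```typescript
interface RubricScore {
  correctness: number;   // 0-20: does the fix eliminate the exploit?
  codeQuality: number;   // 0-20: maintainable and clear?
  completeness: number;  // 0-20: edge cases, defense-in-depth?
  security: number;      // 0-20: best practices, no new attack surface?
  performance: number;   // 0-20: no unreasonable overhead?
}

function finalScore(s: RubricScore): number {
  return s.correctness + s.codeQuality + s.completeness + s.security + s.performance; // out of 100
}
```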
Phase 2 - Human Validation:
After seeing all AI scores, we reviewed each model’s output and picked which fix we’d actually deploy in production. This human validation is critical - AI judges can miss practical deployment concerns.
3. Why This Approach Works
Consistency: AI judge applies the same rubric to all models, eliminating human bias in initial scoring.
Transparency: All scores are shown in this post with representative code samples. Full outputs available upon request.
Pragmatism: Human vote ensures real-world deployability. Sometimes the “best score” isn’t the “best fix.”
No self-judging: GPT-5 scored the other five models but couldn't evaluate its own output - we used Claude Opus 4.1 as the judge for GPT-5's submissions to avoid bias.
What We Learned
Every model caught every vulnerability. 100% detection rate across the board. That’s impressive - a few years ago, this would have been impossible.
But the quality of their fixes? That varied by 12.3 percentage points (82.5 to 94.8). All six models can spot the bug. Not all of them can fix it properly.
Don’t just ask: “Did the AI catch it?” Ask: “Is this fix something I’d actually ship to production?”
The cost-quality tradeoff is real. GPT-5 delivered the best fixes (94.8/100) but cost $0.32. Claude Sonnet 4.5 delivered 90% of that quality for $0.21 (34% lower cost). Gemini delivered 90% of GPT-5’s quality for $0.08 (75% lower cost).
Figure out your quality threshold first. Then optimize for cost within that constraint.
We Disagreed With The AI Judge
The AI judge picked GPT-5 (94.8/100). We picked o3 (89.9/100, ranked #2).
GPT-5's fixes were technically perfect: four mitigation strategies for prototype pollution, a multi-layer defense architecture, and comprehensive edge-case handling. It scored 96.4, 94.0, and 95.8 across the three vulnerabilities. The AI judge loved all of this.
But we’re not an AI judge. We have to ship code and maintain it six months later.
o3’s fixes were simpler - clean enough to review in 15 minutes during code review. Production-ready without needing a PhD to understand what’s happening. And here’s the kicker: it cost $0.09 vs GPT-5’s $0.32. Scaled to 100 evaluations, that’s $9 vs $32. That quickly adds up.
The pattern we noticed: o3 crushed the hard stuff (95.2 on prototype pollution, 92.4 on agentic AI). It only struggled on the classic ImageMagick attack (90.4) - the one where Claude models had more training data because it’s been documented for years.
o3 delivers on novel threats where you need actual reasoning instead of pattern matching.
Here’s what we learned: AI judges optimize for perfection. Developers optimize for what we’d actually merge into production. Sometimes the second-best score is the better choice.
Novel threats expose the gap between models. On classic vulnerabilities (prototype pollution, command injection), all models scored 85-96/100 - a tight spread. But on the cutting-edge agentic AI attack, the spread widened to 82-94/100, and GPT-5 and o3 pulled ahead.
Use GPT-5 or o3 for novel threats where you need actual reasoning. Use Gemini or Claude Sonnet for classic OWASP Top 10 vulnerabilities where pattern matching is enough.
The wrong question is “which model is the best one?” The right question is “best for what?” GPT-5 excels at comprehensive defense-in-depth. o3 excels at pragmatic, shippable fixes. Gemini excels at cost efficiency. Claude Sonnet excels at classic vulnerabilities.
Match the model to the mission.
Which Model Should You Use?
If you're protecting financial systems, healthcare data, or authentication flows - use GPT-5. It costs $0.11 per evaluation (94.8 quality), but when you're dealing with money or medical records, you want the most comprehensive fixes. Defense-in-depth matters here. GPT-5 scored between 94.0 and 96.4 across all three vulnerabilities. It displayed consistent excellence.
For regular code reviews and OWASP Top 10 scans - Claude Sonnet 4.5 is the sweet spot. It delivered 90% of GPT-5’s quality at 34% lower cost ($0.07 per eval). If you’re running security checks on every PR, that cost difference adds up fast. It particularly excelled on the prototype pollution vulnerability (91.0) because that attack pattern has been documented for years and Claude models have seen it extensively during training.
If you’re a startup or open source project with tight budgets - Gemini 2.5 Pro is a good bang-for-your-buck model. $0.03 per evaluation (85.3 quality). That’s 90% of GPT-5’s quality for 75% lower cost. We didn’t expect a model this cheap to perform this well, but Google delivered. For high-volume scanning where you need decent coverage without breaking the bank, this is your model.






