Red Team Frameworks and Plugins
XBOW Benchmark vs OWASP WSTG
A Framework Comparison for AI-Augmented Penetration Testing
Table of Contents
- Purpose & Scope
- Core Dimensions Compared
- Vulnerability Coverage Overlap
- XBOW Category Exploit Rates
- OWASP WSTG Category Overview
- How They Complement Each Other
- The Fundamental Tension
1. Purpose & Scope
XBOW Benchmark is an evaluation framework — it measures how well an AI hacking agent can autonomously find and exploit vulnerabilities. It answers:
"How capable is this tool?"
It is empirical, binary, and time-bound.
OWASP WSTG is a testing methodology — it defines what a thorough web application pentest should cover. It answers:
"What should be tested, and how?"
It is prescriptive, comprehensive, and human-authored.
They operate at different layers: XBOW grades the agent, WSTG governs the engagement.
2. Core Dimensions Compared
| Dimension | XBOW Benchmark | OWASP WSTG |
|---|---|---|
| Primary audience | AI/tool developers, red teams evaluating agents | Pentesters, security engineers, auditors |
| Output | Exploit rate, time-to-exploit score | Test checklist, finding evidence, remediation guidance |
| Pass/fail model | Binary — working PoC or nothing | Tiered — finding severity + evidence quality |
| Scope | Web vulns: SQLi, XSS, IDOR, SSRF, SSTI, auth bypass | 90+ test cases across 12 categories incl. cryptography, business logic, client-side |
| Business logic coverage | Minimal — pattern-matching exploits only | Extensive — WSTG-BUSL is a full category |
| Maintenance | Continuous, benchmark evolves with agent capability | Versioned releases (currently v4.2); slower update cycle |
| Reproducibility | High — isolated Docker environments, deterministic flags | Moderate — results depend heavily on tester skill |
| Evidence standard | Flag capture proves exploitation | Requires PoC + HTTP evidence + impact narrative |
| Optimised for | Speed and autonomy | Completeness and rigour |
3. Vulnerability Coverage Overlap
Both frameworks cover the high-signal web categories — but XBOW tests exploitation depth while WSTG tests coverage breadth.
XBOW stress-tests the categories AI handles best: injection flaws, broken authorisation, SSRF, and information disclosure. WSTG adds the human-dependent domains XBOW leaves out: business logic abuse, complex multi-step auth flows, WebSocket attacks, and cryptographic weaknesses that require contextual reasoning.
Key gap: An agent that scores 80% on XBOW still cannot reliably handle roughly 25–30% of WSTG's total test surface.
XBOW → WSTG Test ID Mapping (WSTG v4.2 identifiers)
SQLi → WSTG-INPV-05
XSS (Reflected) → WSTG-INPV-01, WSTG-CLNT-01 (DOM-based)
IDOR → WSTG-ATHZ-02, WSTG-ATHZ-04
SSRF → WSTG-INPV-19
Auth Bypass → WSTG-ATHN-01 through WSTG-ATHN-10
Info Disclosure → WSTG-INFO, WSTG-ERRH
SSTI → WSTG-INPV-18
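For teams wiring agent output into WSTG-structured reporting, the mapping above can live as a small lookup table. A minimal sketch using WSTG v4.2 identifiers — the dictionary name, key strings, and helper are our own illustration, not part of either framework:

```python
# Illustrative mapping from XBOW-style vulnerability categories to
# WSTG v4.2 test identifiers. Structure and names are ours, not official.
XBOW_TO_WSTG = {
    "sqli": ["WSTG-INPV-05"],
    "xss_reflected": ["WSTG-INPV-01", "WSTG-CLNT-01"],
    "idor": ["WSTG-ATHZ-02", "WSTG-ATHZ-04"],
    "ssrf": ["WSTG-INPV-19"],
    "auth_bypass": [f"WSTG-ATHN-{n:02d}" for n in range(1, 11)],
    "info_disclosure": ["WSTG-INFO", "WSTG-ERRH"],
    "ssti": ["WSTG-INPV-18"],
}

def wstg_ids(xbow_category: str) -> list[str]:
    """Return the WSTG test IDs covered by an XBOW category (empty if unmapped)."""
    return XBOW_TO_WSTG.get(xbow_category, [])
```

A lookup like `wstg_ids("sqli")` lets a report generator tag each agent finding with the methodology test it satisfies.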
4. XBOW Category Exploit Rates
Current top-agent performance as of early 2026:
| Vulnerability Category | Approx. Success Rate | Notes |
|---|---|---|
| SQL Injection | ~85% | Error-based + blind both covered |
| IDOR / AuthZ | ~80% | Sequential ID assumptions a weak point |
| Cross-Site Scripting | ~70% | Context-aware payload selection required |
| SSRF | ~60% | Redirect handling can block agents |
| Auth Bypass | ~45% | JWT alg:none, default creds viable; SSO chains not |
| Business Logic | ~15% | Pattern-matching fails; requires intent reasoning |
5. OWASP WSTG Category Overview
| Category | Code | AI Viability | Notes |
|---|---|---|---|
| Information Gathering | WSTG-INFO | ✅ High | Mechanical, high-volume — ideal for agents |
| Configuration & Deployment | WSTG-CONF | ✅ High | HTTP methods, cloud storage checks |
| Identity Management | WSTG-IDNT | 🔶 Medium | Account enumeration viable; lockout logic varies |
| Authentication | WSTG-ATHN | 🔶 Medium | Default creds yes; MFA/SSO chains require humans |
| Authorisation | WSTG-ATHZ | ✅ High | IDOR and privilege escalation well-covered |
| Session Management | WSTG-SESS | ✅ High | Cookie attributes, CSRF, fixation |
| Input Validation | WSTG-INPV | ✅ High | SQLi, XSS, SSRF, XXE, SSTI |
| Error Handling | WSTG-ERRH | ✅ High | Stack traces, verbose errors |
| Cryptography | WSTG-CRYP | 🔶 Medium | Weak ciphers detectable; nuanced analysis requires humans |
| Business Logic | WSTG-BUSL | ❌ Low | Human domain — requires understanding of intent |
| Client-Side Testing | WSTG-CLNT | 🔶 Medium | DOM XSS viable; clickjacking context-dependent |
| API Testing | WSTG-APIT | ✅ High | GraphQL introspection, mass assignment well-covered |
6. How They Complement Each Other
The two frameworks are most powerful when used together rather than as alternatives.
XBOW as a Pre-Engagement Calibration Tool
Before deploying an AI agent on a live engagement, run it against the benchmark to understand where it is reliable and where it is not. Do not trust it on auth bypass if its XBOW score in that category is below 50%.
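That calibration rule can be encoded as a simple gate before an engagement starts. A minimal sketch, assuming you track per-category benchmark scores yourself — the score values echo the approximate rates in section 4, and all names here are our own:

```python
# Hypothetical per-category XBOW scores for one agent (fractions, not %).
# Values echo the approximate rates cited earlier in this article.
AGENT_SCORES = {
    "sqli": 0.85,
    "idor": 0.80,
    "xss": 0.70,
    "ssrf": 0.60,
    "auth_bypass": 0.45,
    "business_logic": 0.15,
}

def trusted_categories(scores: dict[str, float], threshold: float = 0.50) -> set[str]:
    """Categories where the agent's benchmark score clears the trust threshold."""
    return {cat for cat, rate in scores.items() if rate >= threshold}
```

With the 50% threshold from the rule above, this agent would be trusted on injection and authorisation work but not on auth bypass or business logic.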
WSTG as the Engagement Contract
Structure pentest scope, evidence packs, and the final report against WSTG test IDs. This gives clients an auditable methodology regardless of whether findings were surfaced by an agent or a human.
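One way to make that contract concrete is to key every finding, agent- or human-sourced, to a WSTG test ID, and to track which in-scope tests produced no finding. A minimal sketch — the field names and helper are our own, not a WSTG artefact:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """A single pentest finding, keyed to the WSTG test it satisfies."""
    wstg_id: str          # e.g. "WSTG-INPV-05"
    title: str
    severity: str         # e.g. "high"
    found_by: str         # "agent" or "human"
    evidence: list[str] = field(default_factory=list)  # HTTP request/response excerpts

def untested_negatives(findings: list["Finding"], in_scope: set[str]) -> set[str]:
    """In-scope WSTG IDs with no finding: the negative results WSTG says to document."""
    return in_scope - {f.wstg_id for f in findings}
```

The second function matters for compliance: a WSTG-structured report documents what was tested and found clean, not only what was exploited.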
Recommended Handoff Points
| AI Agent handles autonomously | Human pentester handles |
|---|---|
| WSTG-INPV (injection flaws) | WSTG-BUSL (all tests) |
| WSTG-ATHZ (IDOR, privilege escalation) | WSTG-ATHN (SSO, MFA, OAuth chains) |
| WSTG-INFO (recon and fingerprinting) | Any finding requiring understanding of intended app behaviour |
| WSTG-ERRH (error disclosure) | CVSS scoring and impact narrative |
| WSTG-APIT (GraphQL, mass assignment) | Client communication and triage of false positives |
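The handoff table can be mirrored in orchestration code so each WSTG test is routed to a default executor. An illustrative sketch — the category sets follow the table above, but the routing function and its defaults are our own assumption, not a prescribed workflow:

```python
# Default routing per WSTG v4.2 category, mirroring the handoff table above.
AGENT_CATEGORIES = {"WSTG-INPV", "WSTG-ATHZ", "WSTG-INFO", "WSTG-ERRH", "WSTG-APIT"}
HUMAN_CATEGORIES = {"WSTG-BUSL", "WSTG-ATHN"}

def route(wstg_id: str) -> str:
    """Route a WSTG test ID (e.g. 'WSTG-INPV-05') to 'agent', 'human', or 'review'."""
    category = "-".join(wstg_id.split("-")[:2])  # drop the trailing test number
    if category in AGENT_CATEGORIES:
        return "agent"
    if category in HUMAN_CATEGORIES:
        return "human"
    return "review"  # unmapped categories default to human review
```

Sending unmapped categories to `"review"` rather than to the agent is a deliberately conservative default, consistent with the medium-viability ratings in section 5.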
7. The Fundamental Tension
XBOW optimises for speed and autonomy — it rewards an agent that finds a working SQLi in 90 seconds.
WSTG optimises for completeness and rigour — it rewards a tester who documents every negative result alongside every finding.
These goals can conflict:
- An AI agent chasing high XBOW scores may skip the slow, methodical negative-confirmation work that WSTG requires.
- A pentest report that covers only the things the agent could exploit autonomously is not a WSTG-compliant report — it is an automated scanner output.
The Core Distinction
| Framework | Role |
|---|---|
| XBOW Benchmark | Tells you what AI can do offensively |
| OWASP WSTG | Tells you what a pentest must cover professionally |
Both are necessary. Neither is sufficient alone.