DEV Community

teum

numasec Wants to Be the Claude Code of Penetration Testing

An open-source MCP-native AI agent that chains exploits, not just lists them. · FrancescoStabile/numasec

The Gap Nobody Talks About
Every developer in 2025 has an AI pair programmer. Claude Code writes your functions, Copilot catches your typos, Cursor helps you navigate a codebase you inherited at 9am on a Monday. The tooling for writing software has been completely reinvented.

Security hasn't.

Sure, there are LLM wrappers that will tell you to "check for SQL injection" or generate a generic OWASP checklist. But that's not penetration testing — that's a textbook with a chat interface. Real pentesting is about chaining — finding the leaked API key in a JavaScript bundle, using it to trigger an SSRF, pivoting to cloud metadata, and landing account takeover. It's adversarial reasoning, not search.

numasec is the first open-source project I've seen that's actually built for that adversarial loop, not bolted onto it.


What numasec Actually Does
The pitch is blunt: "Like Claude Code, but for pentesting." That framing is either incredibly confident or a recipe for disappointment. After digging through the repository, I'd say it earns more of that confidence than you'd expect from a 33-star project.

Here's the concrete setup: you clone the repo, install the Python tooling via pip install numasec, build the TypeScript agent layer with Bun, and launch an interactive TUI. You pick your LLM — DeepSeek, Claude, GPT, Ollama, any OpenAI-compatible endpoint — type pentest https://yourapp.com, and the agent takes over.
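Sketched end to end, that setup looks roughly like this (the TUI commands beyond `pentest <url>` are illustrative, so check the repo for the exact incantation):

```shell
# Python tooling layer
git clone https://github.com/FrancescoStabile/numasec
cd numasec
pip install numasec

# TypeScript agent layer (Bun required)
bun install

# Launch the interactive TUI, pick a model, then hand it a target
numasec
# inside the TUI:
#   pentest https://yourapp.com
```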

Under the hood, numasec ships with 33 security tools and 34 attack templates, coordinated by a deterministic planner based on the CHECKMATE paper from late 2024. This is the architectural detail that separates numasec from "I asked GPT-4 to hack this site": the CHECKMATE methodology pins the attack sequence down deterministically, while the AI handles analysis and adaptation. That's a meaningful distinction. It means the agent isn't hallucinating a pentest methodology on the fly; it's executing a structured plan with LLM-powered reasoning filling the gaps.
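To make that split concrete, here is a minimal sketch of the pattern (not numasec's actual code; all names are hypothetical). The plan is a fixed data structure, and the model only gets to interpret tool output, never to pick the next step.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    tool: str                      # which built-in tool to invoke
    depends_on: list[str] = field(default_factory=list)

# The attack sequence is fixed up front -- the planner, not the model,
# decides what runs and in what order (the CHECKMATE-style split).
PLAN = [
    Step("recon", tool="http_fingerprint"),
    Step("auth_probe", tool="jwt_scanner", depends_on=["recon"]),
    Step("injection", tool="sqli_scanner", depends_on=["recon"]),
]

def run_plan(plan: list[Step], run_tool: Callable[[str], dict],
             analyze: Callable[[str, dict], str]) -> dict[str, str]:
    """Execute steps in plan order; the LLM (`analyze`) only interprets
    raw tool output -- it never chooses or reorders steps."""
    done: dict[str, str] = {}
    for step in plan:
        if any(dep not in done for dep in step.depends_on):
            continue  # deterministic gating, not model judgment
        raw = run_tool(step.tool)
        done[step.name] = analyze(step.name, raw)
    return done

# Stub tool runner and stub "LLM" for illustration
results = run_plan(
    PLAN,
    run_tool=lambda tool: {"tool": tool, "hits": 1},
    analyze=lambda name, raw: f"{name}: {raw['hits']} candidate finding(s)",
)
```

The point of the sketch is the inversion of control: swap the stubs for real tools and a real model, and the model still cannot wander off the plan.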

The tool coverage is legitimately broad. On the injection side: SQL (blind, time-based, union, error-based), NoSQL, OS command injection, SSTI, XXE, GraphQL introspection, and CRLF. On authentication: JWT attacks including alg:none, weak HS256, and kid path traversal; OAuth misconfiguration; credential spraying; IDOR; CSRF; privilege escalation. Client- and server-side: XSS in all three flavors (reflected, stored, DOM-based), SSRF with cloud metadata detection, CORS misconfigs, path traversal, HTTP request smuggling, race conditions, file upload bypass.
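As one concrete example from that list, the classic alg:none check boils down to forging an unsigned token and seeing whether the verifier honors the attacker-controlled header. A minimal, self-contained sketch in Python (not numasec's implementation):

```python
import base64
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def forge_alg_none(payload: dict) -> str:
    """Craft an unsigned token for the alg:none check.

    A vulnerable verifier that trusts the header's `alg` field will
    accept this token despite the empty signature segment.
    """
    header = {"alg": "none", "typ": "JWT"}
    return (
        b64url(json.dumps(header).encode())
        + "."
        + b64url(json.dumps(payload).encode())
        + "."  # empty third segment: no signature at all
    )

token = forge_alg_none({"sub": "admin", "iat": 1700000000})
```

Send `token` in the `Authorization: Bearer` header; if the application accepts it, the verifier is trusting attacker-supplied metadata about how to verify.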

Every finding gets a CWE ID, CVSS 3.1 score, OWASP Top 10 category, and a MITRE ATT&CK technique. That's not fluff — that's the difference between a finding that gets filed and a finding that gets fixed.
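In practice, that enrichment means each finding carries a machine-readable record along these lines (field names are my own illustration, not numasec's actual schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class Finding:
    """Illustrative shape of an enriched finding."""
    title: str
    cwe_id: str            # e.g. CWE-89 for SQL injection
    cvss_31: float         # CVSS 3.1 base score, 0.0-10.0
    owasp: str             # OWASP Top 10 category
    attack_technique: str  # MITRE ATT&CK technique ID

finding = Finding(
    title="Blind SQL injection in /search",
    cwe_id="CWE-89",
    cvss_31=9.8,
    owasp="A03:2021-Injection",
    attack_technique="T1190",
)
record = asdict(finding)  # ready to serialize into a report or tracker
```

A record like this drops straight into a ticketing system with severity and remediation context attached, which is exactly what gets it fixed rather than filed.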


The MCP Architecture Is the Real Story
Here's what I think most people will miss in a first pass: numasec isn't just an AI that runs security tools. It's MCP-native.

Model Context Protocol is the same extensibility layer that Claude Code and Cursor use. numasec ships its 33 built-in tools over MCP and lets you connect any external MCP server. This means if you've built custom tooling for your internal attack surface — say, a proprietary scanner for your API gateway — you can wire it in without forking the project. Same protocol, same interface.
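If you have used MCP with Claude Code or Cursor, the registration pattern is familiar: a config entry naming a command that speaks MCP over stdio. A hypothetical entry for that internal gateway scanner might look like this (file location and key names are illustrative, not numasec's documented format):

```json
{
  "mcpServers": {
    "internal-gateway-scanner": {
      "command": "python",
      "args": ["-m", "gateway_scanner.mcp_server"],
      "env": { "GATEWAY_BASE_URL": "https://api.internal.example" }
    }
  }
}
```

Because the protocol is shared, the same server could serve your numasec runs and your Claude Code sessions without modification.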

This is genuinely forward-thinking architecture. Most security automation tools are monolithic and extension-hostile. numasec is betting that MCP becomes the standard for agentic tool composition, and that bet looks increasingly reasonable in 2025.

The stack is a hybrid: Python for the security tooling layer, TypeScript/Bun for the agent runtime. You can install via pip install numasec, pull a Docker image (docker run -it francescosta/numasec), or build from source. The CI is live on GitHub Actions and the release tagging looks active — the latest push was April 2026, so this isn't an abandoned research prototype.


The Benchmarks: Impressive, With Caveats
The numbers are the headline: 96% recall on OWASP Juice Shop v17 (25 out of 26 ground-truth vulnerabilities), 100% on DVWA across all 7 vulnerability categories, and full coverage on WebGoat. The benchmarks are reproducible — they live in tests/benchmarks/ and you can run them yourself.
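The benchmarks live under `tests/benchmarks/`; assuming a standard pytest layout (a guess on my part, so verify against the repo), reproducing a run would look something like:

```shell
git clone https://github.com/FrancescoStabile/numasec
cd numasec
pip install -e .                          # editable install for local runs
pytest tests/benchmarks/ --collect-only   # list the available benchmark targets first
pytest tests/benchmarks/                  # then run against a local Juice Shop / DVWA instance
```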

I'll be direct: these are controlled environments designed to contain known, documented vulnerabilities. Juice Shop, DVWA, and WebGoat are intentionally vulnerable applications built for exactly this kind of testing. Performance against production applications with custom authentication flows, WAFs, rate limiting, and non-standard architectures will be lower — sometimes significantly. A 96% recall against Juice Shop does not translate to 96% recall against your fintech app's staging environment.

That said, outperforming "most manual security assessments" on standardized benchmarks is a real claim. Most bug bounty hunters and junior pentesters miss more than 4% of Juice Shop vulnerabilities. The bar numasec is clearing isn't fake.


Who Should NOT Use This
Let me be honest about the failure modes, because they matter.

If you need compliance-grade reporting, numasec is not there yet. The output is structured and CWE-tagged, but a 33-star MIT project isn't your SOC 2 audit tool.

If your target has aggressive WAF rules or bot detection, the agent's automated traffic patterns will get rate-limited or blocked before it chains anything interesting. Evasion isn't a listed capability.

If you're a non-technical security buyer looking for a SaaS dashboard, this is a CLI tool with a TUI. The setup requires Bun, Python, and some comfort with environment configuration. It's built for practitioners.

And obviously: only use this against applications you have explicit authorization to test. The README says "ethical hacking." That word "ethical" is load-bearing.


Verdict: Watch This One
Numasec is 33 stars away from obscurity and several architectural decisions ahead of where most security automation projects are. The MCP-native design, the CHECKMATE-grounded planner, and the exploit-chaining focus make it a fundamentally different artifact than "GPT with Burp Suite."

If you're a security engineer wanting to automate reconnaissance against your own staging environments, try it today. If you're a bug bounty hunter looking to scale coverage on web targets, the benchmark numbers suggest real signal. If you're an AI tooling builder curious about how agentic systems handle adversarial reasoning tasks, the architecture is worth studying even if you never run a pentest.

The project is early. The star count tells you that. But the architecture tells you someone thought carefully before writing the first line of code. That's rarer than it should be.

penetration-testing, ai-agents, mcp, devsecops, open-source, security-automation, bug-bounty, llm-tooling
