razashariff
We Scanned 27 AI Agent Frameworks Against OWASP Agentic AI Top 10 — Here Are the Results

AI agents are everywhere. CrewAI has 45K+ GitHub stars. AutoGPT has 182K+. LangChain sits at 100K+. But here is the question nobody seems to be asking: how secure are these frameworks?

OWASP released the Agentic AI Top 10 in 2025, identifying the most critical security risks in autonomous AI systems. We built a free scanner that checks agent code against all of them.

The results were not great.

The Numbers

We scanned 27 of the most popular agent frameworks and SDKs:

  • 9 FAIL (critical findings -- exec(), os.system(), no sandboxing)
  • 9 WARN (high-severity issues -- supply chain risks, prompt injection vectors)
  • 9 PASS (clean scan)
  • 31 total OWASP violations across all frameworks

The full registry with every framework, verdict, risk score, and OWASP mapping is live at registry.agentsign.dev.

What We Check

12 detection rules, each mapped to a specific OWASP Agentic AI risk:

| Rule | OWASP | Severity | What it catches |
| --- | --- | --- | --- |
| AS-001 | AA-03 | CRITICAL | Unsafe code execution (`exec`, `eval`, `os.system`) |
| AS-002 | AA-05 | HIGH | Hardcoded secrets and API keys |
| AS-003 | AA-04 | MEDIUM | Excessive permissions |
| AS-004 | AA-02 | HIGH | Prompt injection via file input |
| AS-005 | AA-02 | CRITICAL | Known injection patterns (SQL, XSS, command) |
| AS-006 | AA-09 | HIGH | Code execution without sandboxing |
| AS-007 | AA-06 | LOW | Supply chain without integrity checks |
| AS-008 | AA-01 | HIGH | Excessive agency / auto-approval |
| AS-009 | AA-07 | MEDIUM | Unsafe output handling (XSS via agent output) |
| AS-010 | AA-08 | MEDIUM | Insufficient logging/monitoring |
| AS-011 | AA-10 | HIGH | Data exfiltration patterns |
| AS-012 | MCP-07 | HIGH | MCP server without authentication |
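At their core, rules like AS-001 reduce to pattern matching over source text. Here is a minimal sketch of that idea in Python -- the patterns and field names mirror the scan response shown later in this post, but this is an illustration, not the actual AgentSign implementation:

```python
import re

# Hypothetical patterns in the spirit of AS-001 (unsafe code execution);
# the real AgentSign rule set may differ.
AS_001_PATTERNS = [
    r"\bexec\s*\(",
    r"\beval\s*\(",
    r"\bos\.system\s*\(",
]

def check_as_001(code: str) -> list[dict]:
    """Return one finding per dangerous pattern present in the code."""
    findings = []
    for pattern in AS_001_PATTERNS:
        if re.search(pattern, code):
            findings.append({
                "rule": "AS-001",
                "owasp": "AA-03",
                "severity": "CRITICAL",
                "detail": f"Dangerous code pattern: {pattern}",
            })
    return findings

print(check_as_001("exec(user_input)"))  # one CRITICAL finding
print(check_as_001("print('hello')"))    # []
```

Regex matching is deliberately blunt -- it catches the call whether or not it is reachable, which is exactly why a FAIL verdict means "go look," not "you are compromised."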

Notable Findings

Some of the most-starred projects have the most critical findings:

  • Open Interpreter (57K stars): Risk score 80/100. exec(), os.system(), child_process, no sandbox, excessive agency. This is a code agent that runs commands on your machine by design, but the scan flags that there are no isolation mechanisms.

  • AutoGPT (182K stars): Risk score 65/100. exec(), os.system(), no sandbox. The most-starred AI agent framework fails on unsafe code execution.

  • LangChain (100K stars): WARN verdict. Supply chain risks and prompt injection vectors. Not critical, but worth monitoring.

  • Anthropic SDK, Vercel AI SDK, Google ADK: All PASS with clean scans. These frameworks were designed with security constraints from the start.

How to Scan Your Own Agent

No signup. No API key. Three ways to use it:

1. GitHub Action (recommended)

Create .github/workflows/agentsign.yml:

```yaml
name: AgentSign Security Scan
on: [push, pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: razashariff/agentsign-action@v1
        with:
          path: '.'
          fail-on: 'FAIL'
```

Every push and PR gets scanned. FAIL blocks the merge. Outputs include verdict, risk score, and findings count.
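Since the action reports a verdict, risk score, and findings count, a follow-up step could surface them in the job log. The output names below (`verdict`, `risk_score`) are assumptions -- check the action's README for the real ones:

```yaml
# Hypothetical extension of the workflow above; output names are guesses.
      - uses: razashariff/agentsign-action@v1
        id: scan
        with:
          path: '.'
          fail-on: 'FAIL'
      - name: Report scan result
        if: always()  # print the result even when the scan step fails
        run: |
          echo "Verdict: ${{ steps.scan.outputs.verdict }}"
          echo "Risk score: ${{ steps.scan.outputs.risk_score }}"
```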

2. cURL

```bash
curl -X POST https://registry.agentsign.dev/api/scan \
  -H "Content-Type: application/json" \
  -d '{"code": "exec(user_input)", "name": "my-agent"}'
```

Returns:

```json
{
  "verdict": "FAIL",
  "risk_score": 40,
  "findings": [
    {
      "rule": "AS-001",
      "owasp": "AA-03",
      "severity": "CRITICAL",
      "detail": "Dangerous code patterns: exec()"
    }
  ]
}
```
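The same endpoint works from a script, which makes it easy to wire into a pre-push hook. A stdlib-only sketch (the endpoint URL and payload shape are taken from the cURL example above):

```python
import json
import sys
import urllib.request

API_URL = "https://registry.agentsign.dev/api/scan"

def build_request(code: str, name: str = "my-agent") -> urllib.request.Request:
    """Construct the POST request for the /api/scan endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps({"code": code, "name": name}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def scan(code: str, name: str = "my-agent") -> dict:
    with urllib.request.urlopen(build_request(code, name)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Usage: python scan.py path/to/agent.py
    result = scan(open(sys.argv[1]).read())
    print(f"{result['verdict']} (risk {result['risk_score']})")
    sys.exit(1 if result["verdict"] == "FAIL" else 0)
```

Exiting non-zero on FAIL means any shell hook or CI step that runs this script blocks on critical findings, mirroring the GitHub Action's `fail-on: 'FAIL'` behaviour.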

3. Shields.io Badge

Add a live security badge to your README:

```markdown
![AgentSign](https://img.shields.io/endpoint?url=https://registry.agentsign.dev/api/badge/YOUR-AGENT-NAME)
```

PASS = green, WARN = yellow, FAIL = red. Badge responses are cached for five minutes.

API Endpoints

All public, all free, rate-limited at 30 req/min:

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/scan | Scan code against the 12 OWASP-mapped rules (max 50 KB) |
| GET | /api/badge/:name | Shields.io-compatible badge endpoint |
| GET | /api/rules/version | Current rules version and count |
| GET | /api/registry | Full registry as JSON |
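The registry endpoint makes it straightforward to pull the full data set and slice it yourself, say, to list only the FAIL verdicts sorted by risk. The field names below (`name`, `verdict`, `risk_score`) mirror the scan response shown earlier; verify them against the actual `/api/registry` payload:

```python
import json
import urllib.request

REGISTRY_URL = "https://registry.agentsign.dev/api/registry"

def failing(entries: list[dict]) -> list[dict]:
    """Keep only FAIL-verdict entries, highest risk score first."""
    fails = [e for e in entries if e.get("verdict") == "FAIL"]
    return sorted(fails, key=lambda e: e.get("risk_score", 0), reverse=True)

if __name__ == "__main__":
    with urllib.request.urlopen(REGISTRY_URL) as resp:
        for entry in failing(json.load(resp)):
            print(f"{entry['name']}: {entry['risk_score']}/100")
```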

Why This Matters

The OWASP Agentic AI Top 10 exists because these are real attack vectors. Agents that call exec() without sandboxing can be hijacked through prompt injection. Agents with hardcoded secrets leak them. Agents without logging leave no audit trail.

As agents get more autonomous -- booking flights, writing code, managing infrastructure -- the blast radius of a compromised agent grows. Static analysis is not a silver bullet, but it is the minimum. If your agent framework fails basic pattern matching against known risks, that is worth knowing.
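To make the exec() risk concrete, here is a contrived illustration (not taken from any real framework) of why the scanner flags the pattern at all: an agent tool that passes model output to `exec()` runs whatever string reaches it, with the agent's full privileges.

```python
# Contrived example: an agent "tool" with no sandboxing or allowlisting.
def naive_agent_tool(llm_generated_code: str, scope: dict) -> None:
    # AS-001 / AA-03: bare exec() -- the exact pattern the scanner flags.
    exec(llm_generated_code, scope)

# A prompt-injected "answer" is indistinguishable from intended tool code:
payload = "import os; stolen = os.environ.get('API_KEY', '<none>')"
scope: dict = {}
naive_agent_tool(payload, scope)
print("attacker exfiltrated:", scope["stolen"])
```

A static scanner cannot tell whether the string is trusted; it can only tell you the trapdoor exists, which is the point of the FAIL verdict.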

Links

  • Live registry: registry.agentsign.dev
Feedback, issues, or want your framework rescanned? Open an issue or reach out at contact@agentsign.dev.
