onfafanutifafa

I benchmarked Python AI-app security scanners. Here's what each catches.

This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.

The four tools

Bandit (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.

Semgrep — multi-language SAST with community rule packs. Hand-written rules, free, fast.

vulnhuntr (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.

getdebug — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).

Test 1 — paired vulnerable/safe fixtures

10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).

Tool        TP  FP  FN   Precision  Recall
getdebug     5   0   0    100%       100%
bandit       1   1   4    50%        20%
semgrep      1   1   4    50%        20%
vulnhuntr    —   —   —    (unable to complete; see below)
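For reference, the Precision and Recall columns follow the usual definitions, TP / (TP + FP) and TP / (TP + FN). A minimal sketch that recomputes them from the counts in the table above (the function name is illustrative, not part of any tool's API):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Counts copied from the table above.
results = {"getdebug": (5, 0, 0), "bandit": (1, 1, 4), "semgrep": (1, 1, 4)}
for tool, counts in results.items():
    p, r = precision_recall(*counts)
    print(f"{tool:<10} precision={p:.0%} recall={r:.0%}")
```

With only five pairs, a single extra hit or miss moves each figure by 20 points or more, so the percentages are best read alongside the raw counts.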

Bandit and Semgrep both catch the unsafe-tool-output fixture via their generic subprocess.run(shell=True) rules. That's a TP on the vulnerable variant. But they also fire on the safe variant — the allowlist-then-run pattern:

# Safe pattern: Bandit and Semgrep both flag this (a false positive)
import subprocess

ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    # Model input only selects a key; the command itself comes from the static dict
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout

Neither tool knows cmd came from a static dict, not the model. They see shell=True and fire. getdebug's regex specifically requires the tool_call.input.X / block.input.X reference in the sink arg, so the allowlist-then-run pattern stays clean.
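To make the difference concrete, here is a minimal sketch of a sink check that looks for model input in the call arguments. It is not getdebug's actual rule: the pattern, the `SINK_WITH_MODEL_INPUT` name, and the `flags_line` helper are assumptions for illustration. It fires only when a `tool_call.input.X` or `block.input.X` reference appears inside the `subprocess.run(...)` call on the same line:

```python
import re

# Hypothetical pattern: subprocess.run( ... tool_call.input.X ... ) on one line.
SINK_WITH_MODEL_INPUT = re.compile(
    r"subprocess\.run\([^)]*\b(?:tool_call|block)\.input\.\w+"
)

def flags_line(line: str) -> bool:
    """True if the sink call's arguments reference model-controlled input."""
    return bool(SINK_WITH_MODEL_INPUT.search(line))

vulnerable = "subprocess.run(tool_call.input.command, shell=True)"
safe = "subprocess.run(cmd, shell=True, capture_output=True)"
print(flags_line(vulnerable), flags_line(safe))
```

A single-line regex like this would miss calls split across several lines or input that reaches the sink through an intermediate variable. That is the usual trade-off of a cheap prefilter.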

Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}. That's the gap.
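As an illustration of what such a pattern might look like, the sketch below flags a dict literal whose role is `system` and whose `content` is an f-string with an interpolated value. The regex and the helper name are assumptions made for this example, not a rule taken from getdebug, Bandit, or Semgrep:

```python
import re

# Hypothetical pattern: {"role": "system", "content": f"...{name}..."}
SYSTEM_FSTRING = re.compile(
    r"""["']role["']\s*:\s*["']system["']\s*,\s*"""
    r"""["']content["']\s*:\s*f["'][^"']*\{[^}]+\}"""
)

def interpolates_into_system_prompt(source: str) -> bool:
    """True if a system message's content is an f-string with a placeholder."""
    return bool(SYSTEM_FSTRING.search(source))

risky = '{"role": "system", "content": f"You are helping {name}."}'
static = '{"role": "system", "content": "You are a helpful assistant."}'
print(interpolates_into_system_prompt(risky))
print(interpolates_into_system_prompt(static))
```

The check is deliberately narrow: it assumes the `role` key comes before `content` and that the whole literal sits on one line.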

Test 2 — real-world signal/noise

We ran the three working tools against simonw/llm, Simon Willison's clean CLI for LLMs (48 Python files).

Tool        Total findings    Signal
bandit      1,189            1,158 are 'assert_used' (pytest);
                              zero AI-app coverage
semgrep     3                3 generic-SAST hits;
                              zero AI-app coverage
getdebug    6                6 AI-app findings: 1 prompt-injection,
                              5 unbounded-stream

Bandit's 1,189 findings on 48 files are almost entirely the assert_used warning on pytest assertions (1,158 of 1,189, about 97%), a well-known default that most real-world configs disable. Semgrep's 3 findings are real, but none is AI-app specific. All 6 of getdebug's findings fall into AI-app categories.
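One way to check a breakdown like this on your own code is to tally a Bandit JSON report by test ID. The sketch below assumes a report produced with `bandit -r . -f json -o report.json` and relies on the report's `results` list and the `test_id` field of each finding; the file name and helper names are illustrative. `B101` is the ID of the `assert_used` check.

```python
import json
from collections import Counter

def tally_by_test_id(report: dict) -> Counter:
    """Count findings per Bandit test ID in a parsed JSON report."""
    return Counter(item["test_id"] for item in report.get("results", []))

def signal_count(report: dict, noisy_ids=("B101",)) -> int:
    """Number of findings left after dropping the noisy test IDs."""
    counts = tally_by_test_id(report)
    return sum(n for test_id, n in counts.items() if test_id not in noisy_ids)

# Tiny stand-in report; a real run would load report.json with json.load().
sample = json.loads(
    '{"results": [{"test_id": "B101"}, {"test_id": "B101"}, {"test_id": "B602"}]}'
)
print(tally_by_test_id(sample))
print(signal_count(sample))
```

To suppress the check at scan time instead of filtering afterwards, Bandit accepts `--skip B101`.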

About vulnhuntr

vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:

  • --llm claude-code mode (no-API-key option) crashes with ModuleNotFoundError in 1.2.2.
  • --llm gpt with gpt-4o-mini fails Pydantic validation on the response.
  • --llm gpt with gpt-4o hits OpenAI's default 30K TPM rate limit on small accounts.
  • Default file-selection heuristic identifies "network-exposed" entry points — simonw/llm is a CLI, so vulnhuntr selected zero files to analyze.

We'll re-benchmark when its 2026 stack stabilises.

What this means for you

If you ship Python code that calls an LLM, run all three. They're complementary:

bandit -r .                              # general Python hygiene
semgrep --config auto .                  # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze .       # AI-app behavioural patterns

None of them subsumes the others. Bandit and Semgrep catch general SAST issues; getdebug catches the "serialised the whole user object into the prompt" class of bug, which is hard to cover with a sustainable hand-written rule in generic SAST.

Reproduce every number at getdebug.dev/bench. Corpus and methodology are open at getdebug-ai/codesecbench.

Read the full post at https://www.getdebug.dev/blog/python-ai-app-prefilters
