This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.
The four tools
Bandit (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.
Semgrep — multi-language SAST with community rule packs. Hand-written rules, free, fast.
vulnhuntr (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.
getdebug — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).
Test 1 — paired vulnerable/safe fixtures
10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).
Tool        TP   FP   FN   Precision   Recall
getdebug     5    0    0   100%        100%
bandit       1    1    4   50%         20%
semgrep      1    1    4   50%         20%
vulnhuntr    -    -    -   (unable to complete; see below)
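The Precision and Recall columns follow the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). As a quick sanity check, here is a minimal sketch that recomputes them from the counts in the table. The function name and the dictionary layout are illustrative only and are not part of any of the tools.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts copied from the Test 1 table, as (TP, FP, FN).
counts = {
    "getdebug": (5, 0, 0),
    "bandit": (1, 1, 4),
    "semgrep": (1, 1, 4),
}

for tool, (tp, fp, fn) in counts.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool:<9} precision={p:.0%} recall={r:.0%}")
```

With only five vulnerable fixtures, each missed or extra finding moves these percentages by a large step, so they are best read as a coverage check rather than a statistical estimate.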
Bandit and Semgrep both catch the unsafe-tool-output fixture via their generic subprocess.run(shell=True) rules. That's a TP on the vulnerable variant. But they also fire on the safe variant — the allowlist-then-run pattern:
# Safe pattern: Bandit and Semgrep both flag this as a false positive
import subprocess

ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    # Look up a fixed command by tag; the model never supplies the command string
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout
Neither tool knows cmd came from a static dict, not the model. They see shell=True and fire. getdebug's regex specifically requires the tool_call.input.X / block.input.X reference in the sink arg, so the allowlist-then-run pattern stays clean.
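To make that distinction concrete, here is a minimal sketch of a sink-argument check in Python. The regex is a hypothetical approximation written for this post, not getdebug's actual rule, and the VULNERABLE line is an assumed example of what a vulnerable fixture might contain.

```python
import re

# Hypothetical pattern (not getdebug's real rule): flag a subprocess call only
# when a tool_call.input.X or block.input.X reference appears in its arguments.
MODEL_INPUT_IN_SINK = re.compile(
    r"subprocess\.(?:run|call|Popen|check_output)\("
    r"[^)]*\b(?:tool_call|block)\.input\.\w+"
)

# Assumed vulnerable shape: model-controlled input goes straight into the shell.
VULNERABLE = "subprocess.run(tool_call.input.cmd, shell=True, capture_output=True)"
# Sink line from the allowlist-then-run fixture: cmd comes from a static dict.
SAFE = "subprocess.run(cmd, shell=True, capture_output=True)"


def flags(line):
    """Return True when the line matches the model-input-in-sink pattern."""
    return MODEL_INPUT_IN_SINK.search(line) is not None


print(flags(VULNERABLE))
print(flags(SAFE))
```

The trade-off of matching on the sink argument is that a value copied into a local variable first would slip past a single-line pattern like this one, which is one reason to treat it as a prefilter rather than a full taint analysis.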
Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}. That's the gap.
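For illustration, a pattern for the system-message case could look roughly like the sketch below. This is a hypothetical regex written for this post, not one taken from getdebug, Bandit, or Semgrep. It assumes the message is written as a single-line dict literal with "role" before "content".

```python
import re

# Hypothetical pattern: a dict literal with role "system" whose content is an
# f-string containing at least one {placeholder}.
SYSTEM_FSTRING = re.compile(
    r"""["']role["']\s*:\s*["']system["']\s*,\s*"""
    r"""["']content["']\s*:\s*[fF]["'][^"']*\{[^{}]+\}"""
)

INTERPOLATED = '{"role": "system", "content": f"You are helping {name}."}'
STATIC = '{"role": "system", "content": "You are a helpful assistant."}'


def interpolates_system_prompt(line):
    """Return True when a system message is built from an f-string placeholder."""
    return SYSTEM_FSTRING.search(line) is not None


print(interpolates_system_prompt(INTERPOLATED))
print(interpolates_system_prompt(STATIC))
```

A match only says that some value is interpolated into the system prompt. Whether that value is user-controlled still needs a human or a deeper analysis to decide.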
Test 2 — real-world signal/noise
We ran the three working tools against simonw/llm, Simon Willison's clean CLI for LLMs (48 Python files).
Tool       Total findings   Signal
bandit     1,189            1,158 are 'assert_used' (pytest); zero AI-app coverage
semgrep    3                3 generic-SAST hits; zero AI-app coverage
getdebug   6                6 AI-app findings: 1 prompt-injection, 5 unbounded-stream
Bandit's 1,189 findings on 48 files is almost entirely the assert_used warning on pytest assertions — a well-known default everyone disables in real configs. Semgrep's 3 findings are real but none AI-app specific. getdebug's 6 are all AI-app categorized.
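If you want to see how much of a Bandit run is assert noise, you can tally its JSON report by check ID. The sketch below assumes a report produced with bandit -r . -f json, whose "results" entries carry a "test_id" field (B101 is the assert_used check). The helper name and the sample data are made up for illustration.

```python
from collections import Counter


def split_signal(report, noisy_ids=frozenset({"B101"})):
    """Tally Bandit results by test_id and separate known-noisy checks."""
    by_id = Counter(result["test_id"] for result in report["results"])
    noisy = sum(n for test_id, n in by_id.items() if test_id in noisy_ids)
    return by_id, noisy, sum(by_id.values()) - noisy


# Tiny stand-in for a parsed Bandit JSON report (illustrative data only).
sample = {
    "results": [
        {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_a.py"},
        {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_b.py"},
        {"test_id": "B602", "test_name": "subprocess_popen_with_shell_equals_true",
         "filename": "app.py"},
    ]
}

by_id, noisy, remaining = split_signal(sample)
print(by_id, noisy, remaining)

# The same subtraction on the figures reported above: findings left once
# assert_used is set aside.
print(1189 - 1158)
```

Applied to the numbers above, setting assert_used aside leaves 31 Bandit findings on simonw/llm. Running Bandit with --skip B101 gives the same effect at scan time.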
About vulnhuntr
vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:
- --llm claude-code mode (no-API-key option) crashes with ModuleNotFoundError in 1.2.2.
- --llm gpt with gpt-4o-mini fails pydantic validation on the response.
- --llm gpt with gpt-4o hits OpenAI's default 30K TPM rate limit on small accounts.
- The default file-selection heuristic identifies "network-exposed" entry points; simonw/llm is a CLI, so vulnhuntr selected zero files to analyze.
We'll re-benchmark when its 2026 stack stabilises.
What this means for you
If you ship Python code that calls an LLM, run all three. They're complementary:
bandit -r . # general Python hygiene
semgrep --config auto . # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze . # AI-app behavioural patterns
None of them subsumes the others. The first two catch general SAST issues; getdebug catches the "serialised the whole user object into the prompt" class, which is hard to cover with a sustainable hand-written rule in generic SAST.
Reproduce every number at getdebug.dev/bench. Corpus and methodology are open at getdebug-ai/codesecbench.
read it here https://www.getdebug.dev/blog/python-ai-app-prefilters