We ran 24 scanners against 26 real Python apps containing ~700 labelled vulnerabilities, and scored each scanner on how many it actually caught.
Disclosure: we built the benchmark and our own scanner is in it, which is exactly why the whole thing is open source. Rerun it yourself.
Top of the board by recall (% of real vulns found):
- Kolega Enterprise - 95%
- GPT-5.5 (agentic) - 58%
- GLM-5.1 (agentic) - 56%
- DeepSeek V4 Flash (agentic) - 55%
- Claude Opus 4.8 (agentic) - 52%
...and the bottom of the board:
- Semgrep - 19%
- Snyk - 17%
- SonarQube - 6%
TLDR
The SAST tools most teams actually run (Semgrep, Snyk, SonarQube) each found under 1 in 5 real vulnerabilities. SonarQube found about 1 in 16. A general-purpose LLM with zero security training, just dropped into an agent loop, found roughly 3x more than the dedicated scanners.
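If you want to check the arithmetic behind those "1 in N" and "3x" figures, here is a minimal sketch. The recall numbers are the rounded percentages from the lists above; the helper names `one_in_n` and `times_more` are made up for illustration, and because the inputs are rounded the results can differ slightly from figures computed on the raw data.

```python
# Rounded recall percentages copied from the leaderboard excerpt above.
recall_pct = {
    "GPT-5.5 (agentic)": 58,
    "Semgrep": 19,
    "Snyk": 17,
    "SonarQube": 6,
}

def one_in_n(pct: float) -> float:
    """Turn a recall percentage into 'finds about 1 in N real vulns'."""
    return 100 / pct

def times_more(a_pct: float, b_pct: float) -> float:
    """How many times more vulnerabilities tool A found than tool B."""
    return a_pct / b_pct

for name in ("Semgrep", "Snyk", "SonarQube"):
    print(f"{name}: about 1 in {one_in_n(recall_pct[name]):.1f}")

best_sast = max(recall_pct[n] for n in ("Semgrep", "Snyk", "SonarQube"))
factor = times_more(recall_pct["GPT-5.5 (agentic)"], best_sast)
print(f"GPT-5.5 (agentic) vs best of the three: {factor:.1f}x")
```

Comparing against the best of the three SAST tools is the conservative choice: the gap to Snyk or SonarQube is larger.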
Why? Pattern matchers only catch what matches a known signature, and most real bugs (broken access control, auth that breaks across files, logic flaws) are not a pattern. They are the code not meaning what the author thought it meant.
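Here is a hypothetical sketch of a bug with no signature. It is not taken from the benchmark apps; `Invoice`, `INVOICES` and both functions are invented for illustration. The vulnerable version has no `eval`, no string-built SQL and no shell call, so a rule has nothing to match on. The flaw is a check that is simply absent.

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    id: int
    owner_id: int
    total: float

# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(id=1, owner_id=10, total=99.0),
    2: Invoice(id=2, owner_id=20, total=250.0),
}

def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # Looks harmless: no dangerous sink, no tainted string.
    # But any logged-in user can read any invoice by guessing its id.
    return INVOICES[invoice_id]

def get_invoice_fixed(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # The ownership check is the whole fix.
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice belongs to another user")
    return invoice
```

Calling `get_invoice_vulnerable(10, 2)` returns user 20's invoice to user 10, while `get_invoice_fixed(10, 2)` raises `PermissionError`. Spotting the difference means knowing what the code was supposed to enforce, not what it looks like.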
One result that stuck out: Grok 4.20 had the best precision of anything tested (93%, it basically never cried wolf) but only 26% recall. So you can be extremely precise and still miss three quarters of the bugs. A clean report does not mean secure code.
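To see how both numbers can be true at once, here is a minimal sketch using the standard definitions of precision and recall. The counts are hypothetical, picked only so that they round to the 93% and 26% above; they are not the benchmark's raw data.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Of everything the tool flagged, what fraction was a real vuln?"""
    return true_pos / (true_pos + false_pos)

def recall(true_pos: int, false_neg: int) -> float:
    """Of all real vulns, what fraction did the tool flag?"""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts: 100 real vulnerabilities, 28 findings reported.
true_pos = 26   # reported and real
false_pos = 2   # reported but not real (crying wolf)
false_neg = 74  # real but never reported

print(f"precision: {precision(true_pos, false_pos):.0%}")
print(f"recall:    {recall(true_pos, false_neg):.0%}")
```

With these counts the report is short and almost entirely correct, yet 74 of the 100 real bugs never appear in it.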
Full leaderboard, methodology and the raw data: https://realvuln.com