We benchmarked 24 SAST tools on ~700 real vulnerabilities. The 3 best-known ones came last

#security #sast #devops #devsecops

We ran 24 scanners against 26 real Python apps, ~700 labelled vulnerabilities, and scored them on how many they actually caught.

Disclosure: we built the benchmark and our own scanner is in it, which is exactly why the whole thing is open source. Rerun it yourself.

Top of the board by recall (% of real vulns found):

Kolega Enterprise - 95%
GPT-5.5 (agentic) - 58%
GLM-5.1 (agentic) - 56%
DeepSeek V4 Flash (agentic) - 55%
Claude Opus 4.8 (agentic) - 52%

...and the bottom of the board:

Semgrep - 19%
Snyk - 17%
SonarQube - 6%

TL;DR

The SAST tools most teams actually run (Semgrep, Snyk, SonarQube) each found fewer than 1 in 5 real vulnerabilities. SonarQube found about 1 in 16. A general-purpose LLM with zero security training, just dropped into an agent loop, found roughly 3x as many as the dedicated scanners.

Why? Pattern matchers only catch what matches a known signature, and most real bugs (broken access control, auth that breaks across files, logic flaws) are not a pattern. They are the code not meaning what the author thought it meant.
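
To make that concrete, here is a minimal sketch of a broken access control bug. Every name in it (`Invoice`, `INVOICES`, `get_invoice_vulnerable`, `get_invoice_fixed`) is made up for illustration and is not taken from the benchmark apps. The point is that the vulnerable function contains no dangerous call, no string concatenation and no obvious signature to match on. The bug is a check that is missing.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: int
    owner_id: int
    total: float


# Hypothetical in-memory "database" used only for this illustration.
INVOICES = {
    1: Invoice(id=1, owner_id=10, total=99.0),
    2: Invoice(id=2, owner_id=20, total=250.0),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # Looks harmless: a plain dictionary lookup, nothing risky-looking.
    # The flaw is what is absent: nothing ties the invoice to the caller,
    # so any logged-in user can read any invoice by guessing its id.
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # The fix is an ownership check, which only makes sense once you
    # know what the code is supposed to mean.
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice belongs to another user")
    return invoice
```

In this sketch, user 10 calling `get_invoice_vulnerable(10, 2)` gets user 20's invoice back, while `get_invoice_fixed(10, 2)` raises `PermissionError`. Telling the two apart means reasoning about intent, not matching syntax.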

One result that stuck out: Grok 4.20 had the best precision (% of reported findings that are real) of anything tested at 93%, so it basically never cried wolf, but only 26% recall. So you can be extremely precise and still miss nearly three quarters of the bugs. A clean report does not mean secure code.
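
If the two metrics blur together, this short sketch shows how they come apart. The counts are invented round numbers chosen only to land near the ratios above. They are not the benchmark's raw data, and `precision_recall` is a hypothetical helper, not part of the benchmark code.

```python
def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from raw finding counts."""
    reported = true_positives + false_positives   # everything the tool flagged
    actual = true_positives + false_negatives     # everything that was really there
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical scanner: 100 labelled vulnerabilities exist, the tool
# reports 28 findings, and 26 of those are real.
precision, recall = precision_recall(
    true_positives=26, false_positives=2, false_negatives=74
)
print(f"precision={precision:.0%} recall={recall:.0%}")
# prints: precision=93% recall=26%
```

Almost every finding in that report is real, yet 74 of the 100 vulnerabilities never show up in it.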

Full leaderboard, methodology and the raw data: https://realvuln.com