DEV Community

Muhammad Hasan

We benchmarked 24 SAST tools on ~700 real vulnerabilities. The 3 best-known ones came last

We ran 24 scanners against 26 real Python apps, ~700 labelled vulnerabilities, and scored them on how many they actually caught.

Disclosure: we built the benchmark and our own scanner is in it, which is exactly why the whole thing is open source. Rerun it yourself.

Top of the board by recall (% of real vulns found)

  1. Kolega Enterprise - 95%
  2. GPT-5.5 (agentic) - 58%
  3. GLM-5.1 (agentic) - 56%
  4. DeepSeek V4 Flash (agentic) - 55%
  5. Claude Opus 4.8 (agentic) - 52%
...and the bottom of the board:

  - Semgrep - 19%
  - Snyk - 17%
  - SonarQube - 6%

TLDR

The SAST tools most teams actually run (Semgrep, Snyk, SonarQube) each found fewer than 1 in 5 real vulnerabilities. SonarQube found about 1 in 16. General-purpose LLMs with zero security training, just dropped into an agent loop, found roughly 3x as many as Semgrep or Snyk (52-58% recall versus 17-19%).

Why? Pattern matchers only catch what matches a known signature, and most real bugs (broken access control, auth that breaks across files, logic flaws) are not a pattern. They are the code not meaning what the author thought it meant.
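To make that concrete, here is a minimal sketch of the difference. It is not how Semgrep, Snyk, or SonarQube work internally; `SIGNATURES`, `pattern_scan`, and both code samples are hypothetical, made up for illustration. The toy scanner flags a textbook injection sink but has nothing to match in a handler whose only bug is an ownership check that was never written.

```python
import re

# Hypothetical signature list: a toy stand-in for a pattern-matching scanner.
SIGNATURES = {
    "use of eval": re.compile(r"\beval\s*\("),
    "subprocess with shell=True": re.compile(r"shell\s*=\s*True"),
}


def pattern_scan(source: str) -> list[str]:
    """Return the name of every signature that matches the source text."""
    return [name for name, rx in SIGNATURES.items() if rx.search(source)]


# Sample 1: a known-dangerous call, so a signature matches.
INJECTION = '''
def calc(expr):
    return eval(expr)
'''

# Sample 2: broken access control. No line is dangerous on its own;
# the bug is that the invoice owner is never compared to the caller.
MISSING_OWNER_CHECK = '''
def get_invoice(user, invoice_id, db):
    invoice = db.find_invoice(invoice_id)
    return invoice
'''

print(pattern_scan(INJECTION))
print(pattern_scan(MISSING_OWNER_CHECK))
```

The second sample comes back clean because the vulnerability is an absence, and you cannot write a signature for code that is not there without knowing what the function was supposed to enforce.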

One result that stuck out: Grok 4.20 had the best precision of anything tested (93%, it basically never cried wolf) but only 26% recall. So you can be extremely precise and still miss three quarters of the bugs. A clean report does not mean secure code.
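For anyone who wants the two metrics spelled out, here is a rough sketch. The `score` function and the finding IDs are hypothetical, and it assumes a report matches a labelled vulnerability by exact ID; the benchmark's real matching rules may differ, so check the methodology page. The numbers are a toy case, not benchmark data.

```python
def score(reported: set[str], labelled: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one scanner run.

    precision = correct findings / everything the scanner reported
    recall    = correct findings / everything that was really there
    """
    true_positives = len(reported & labelled)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return precision, recall


# Toy data: 8 labelled vulnerabilities, and a scanner that reports
# only 2 findings, both of them real.
labelled = {f"vuln-{i}" for i in range(1, 9)}
reported = {"vuln-1", "vuln-2"}

precision, recall = score(reported, labelled)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Every finding in that toy report is correct, yet six of the eight bugs are still in the code. That is the same shape as the Grok result, just with smaller numbers.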

Full leaderboard, methodology and the raw data: https://realvuln.com
