I ran the same code through an LLM-based security scanner 5 times. I got 5 different severity scores.
That was the moment I decided to throw away the LLM approach entirely and build a pure static analysis engine. Here's why — and what happened when I did.
The Problem: LLM Scanners Are Non-Deterministic
When I first started building a security scanner for AI-generated code, the obvious approach was to use an LLM. Feed it code, ask it to find vulnerabilities. Simple.
It worked — sort of. The LLM did find real issues. But:
Run 1: "Found 3 critical vulnerabilities. Risk score: 85/100."
Run 2: "Found 5 vulnerabilities, 2 critical. Risk score: 72/100."
Run 3: "Found 4 vulnerabilities. Risk score: 91/100."
Same code. Different results every time.
This is a reproducibility problem, and it invites the code-analysis equivalent of p-hacking: rerun the scan until you get the score you want. If your scanner gives different answers depending on when you run it, which answer do you trust? Can you show this to a security team and say "the code is safe" when the next run might disagree?
Impact: Non-reproducible results make LLM scanners useless for compliance, CI/CD integration, and any workflow that needs consistent answers.
The Real Cost of LLM-Based Scanning
Beyond non-determinism, the LLM approach had other problems:
Latency
Each scan required an API call — 3-15 seconds per analysis. For a developer running scans on every commit, that's unbearable.
Cost
GPT-4 API calls for code analysis: roughly $0.03-0.10 per scan. At 100 scans/day, that's $3-10/day just in API costs. For a SaaS product charging $29/month, the margins don't work.
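To make the margin problem concrete, here is a rough back-of-the-envelope calculation. The per-scan prices, scan volume, and subscription price come from the paragraph above; the 30-day month and the idea that all 100 scans/day belong to a single paying user are assumptions for illustration.

```python
# Figures from the paragraph above; DAYS_PER_MONTH and the
# single-subscriber framing are assumptions, not measured data.
SCANS_PER_DAY = 100
COST_PER_SCAN_USD = (0.03, 0.10)  # low and high estimate
DAYS_PER_MONTH = 30
SUBSCRIPTION_USD = 29


def monthly_api_cost(cost_per_scan: float) -> float:
    """API spend per month at the given per-scan price."""
    return round(SCANS_PER_DAY * cost_per_scan * DAYS_PER_MONTH, 2)


low, high = (monthly_api_cost(c) for c in COST_PER_SCAN_USD)
print(f"Monthly API cost: ${low:.0f}-${high:.0f} vs ${SUBSCRIPTION_USD} subscription")
```

Under those assumptions, even the low estimate costs several times what the subscription brings in.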
Context Window Limits
Large codebases hit token limits. You end up chunking code and losing cross-file context — which is exactly where many vulnerabilities hide.
Hallucinations
The LLM occasionally invented vulnerabilities that didn't exist. "I found a SQL injection on line 47" — line 47 was a comment.
Impact: High cost + high latency + false positives = a scanner nobody wants to use.
The Static Analysis Alternative
I stepped back and asked: what do I actually need to detect?
Not complex semantic vulnerabilities. Not novel zero-days. I need to catch the specific, repeatable patterns that AI code assistants produce:
- Hardcoded API keys matching known formats
- Shell execution functions with string interpolation
- Disabled SSL/TLS verification flags
- Empty catch blocks
- Known typosquatted package names
- Persistence mechanisms (crontab, systemd, git hooks)
These are all pattern-matching problems. Regex and AST analysis handle them well.
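A minimal sketch of what a few such rules might look like as regexes. The rule names and patterns here are hypothetical and written for Python source; they are not CodeHeal's actual rule set, and a line-by-line scan like this would miss an empty handler whose `pass` sits on the following line.

```python
import re

# Hypothetical rules for illustration only, not the real rule set.
RULES = {
    "hardcoded_aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "shell_exec_interpolation": re.compile(r"\bos\.system\(\s*f[\"']"),
    "tls_verification_disabled": re.compile(r"\bverify\s*=\s*False\b"),
    "empty_except_block": re.compile(r"^\s*except\b[^:]*:\s*pass\s*$"),
}


def scan_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every rule that matches a line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


sample = '''import os, requests
key = "AKIAIOSFODNN7EXAMPLE"
os.system(f"rm -rf {path}")
requests.get(url, verify=False)
try:
    risky()
except Exception: pass
'''

for lineno, rule in scan_lines(sample):
    print(f"line {lineno}: {rule}")
```

The key in the sample is the placeholder value AWS uses in its own documentation, not a live credential.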
The Architecture
Input Code
↓
Line-by-line scan (regex patterns)
↓
Category classification (14 categories)
↓
Severity × Confidence scoring
↓
Composite risk detection (cross-category analysis)
↓
Final risk rank: SAFE / CAUTION / DANGEROUS / CRITICAL
No API calls. No tokens. No temperature settings. Just deterministic pattern matching.
Impact: Same code → same result, every time. Zero variance.
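The pipeline above could be sketched roughly as follows. Everything specific here is an assumption: the three rules, their point values, and the rank thresholds are invented for illustration, and the confidence and composite stages are left out to keep the example short. The part worth noticing is the last few lines, which hash five reports of the same input to check that they are identical.

```python
import hashlib
import json
import re

# Hypothetical rules: (category, severity points, pattern). Values are illustrative.
RULES = [
    ("secrets", 40, re.compile(r"AKIA[0-9A-Z]{16}")),
    ("shell_exec", 30, re.compile(r"\bos\.system\(")),
    ("tls", 20, re.compile(r"verify\s*=\s*False")),
]

# Assumed score floors for the four ranks; the real cutoffs are not given in this post.
RANKS = [(70, "CRITICAL"), (40, "DANGEROUS"), (1, "CAUTION"), (0, "SAFE")]


def scan(source: str) -> dict:
    """Line-by-line regex scan, then sum the points and map the total to a rank."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for category, points, pattern in RULES:
            if pattern.search(line):
                findings.append({"line": lineno, "category": category, "points": points})
    total = sum(f["points"] for f in findings)
    rank = next(name for floor, name in RANKS if total >= floor)
    return {"findings": findings, "score": total, "rank": rank}


def fingerprint(report: dict) -> str:
    """Stable hash of a report, so two runs can be compared byte for byte."""
    return hashlib.sha256(json.dumps(report, sort_keys=True).encode()).hexdigest()


code = 'key = "AKIAIOSFODNN7EXAMPLE"\nos.system("ls " + user_input)\n'
reports = [scan(code) for _ in range(5)]
print(reports[0]["rank"], len({fingerprint(r) for r in reports}))
```

Because nothing in `scan` depends on time, randomness, or a network call, the set of fingerprints always has exactly one element.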
The Results: Static vs. LLM
After building both approaches and testing them against the same dataset:
| Metric | LLM Scanner | Static Analysis |
|---|---|---|
| Scan time | 3-15 seconds | < 50ms |
| Reproducibility | ~60% consistent | 100% |
| API cost per scan | $0.03-0.10 | $0 |
| False positive rate | ~15% | ~5% |
| Detection categories | Flexible but unpredictable | 14 fixed categories |
| CI/CD friendly | Barely | Fully |
The static approach won on speed, reproducibility, cost, and false positives: the metrics that matter most for production use.
Where LLM Still Wins
To be fair, LLM-based analysis handles novel vulnerability patterns better. If there's a completely new attack vector that no rule covers, an LLM might catch it while static analysis won't.
But for the known, repeatable patterns that make up 90%+ of AI-generated code vulnerabilities? Static analysis is faster, cheaper, and more reliable.
Design Decisions That Mattered
1. Confidence Coefficients
Not all pattern matches are equally meaningful. A regex match for AKIA (AWS key prefix) in a variable assignment is high-confidence. The same string in a comment is low-confidence.
Each detection rule has both severity (how bad is this?) and confidence (how sure are we?). The final score combines both:
Score = Severity Points × Confidence Coefficient
This dramatically reduced false positives compared to naive pattern matching.
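Here is one way the formula could look in code. The `Finding` class, the point values, and the comment-detection heuristic are all assumptions made for this sketch; only the multiplication itself comes from the formula above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    severity_points: int  # how bad is this?
    confidence: float     # how sure are we? 0.0 to 1.0

    @property
    def score(self) -> float:
        # Score = Severity Points x Confidence Coefficient
        return self.severity_points * self.confidence


def confidence_for_aws_key(line: str) -> float:
    """Assumed heuristic: a match inside a comment is far less trustworthy."""
    return 0.2 if line.lstrip().startswith("#") else 0.9


assigned = 'aws_key = "AKIAIOSFODNN7EXAMPLE"'
commented = "# example key: AKIAIOSFODNN7EXAMPLE"

in_assignment = Finding("hardcoded_aws_key", 40, confidence_for_aws_key(assigned))
in_comment = Finding("hardcoded_aws_key", 40, confidence_for_aws_key(commented))
debug_flag = Finding("tls_verification_disabled", 20, 0.9)

print(in_assignment.score, in_comment.score, debug_flag.score)
```

With these made-up numbers, the high-severity match in a comment scores below the medium-severity match found with high confidence, which is the ordering the scoring is meant to produce.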
2. Composite Risk Detection
Individual findings tell part of the story. But some combinations are worse than the sum of their parts:
- External communication + hardcoded secrets = probable data exfiltration
- Obfuscation + shell execution = likely malicious payload
- Persistence mechanism + external connection = potential backdoor
The scanner detects these cross-category patterns and applies bonus risk scores. This catches sophisticated threats that single-pattern scanners miss.
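A sketch of how the cross-category check might work, using set containment. The category names and bonus values are hypothetical; only the three pairings and their interpretations come from the list above.

```python
# Hypothetical category names and bonus points, for illustration only.
COMPOSITE_RULES = [
    (frozenset({"external_communication", "hardcoded_secret"}), 25, "probable data exfiltration"),
    (frozenset({"obfuscation", "shell_execution"}), 30, "likely malicious payload"),
    (frozenset({"persistence", "external_communication"}), 30, "potential backdoor"),
]


def composite_bonus(categories_found: set[str]) -> tuple[int, list[str]]:
    """Add a bonus for every composite rule whose categories were all detected."""
    bonus, labels = 0, []
    for required, points, label in COMPOSITE_RULES:
        if required <= categories_found:  # subset test: all required categories present
            bonus += points
            labels.append(label)
    return bonus, labels


found = {"external_communication", "hardcoded_secret", "persistence"}
print(composite_bonus(found))
```

The bonus is computed from the set of categories already found, so it adds no extra pass over the code and stays just as deterministic as the per-line rules.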
3. Category-Specific Tuning
Not all vulnerability categories behave the same way:
- Secret leakage patterns are highly specific (known key formats) → high confidence
- Code quality patterns (TODO markers, debug logs) are noisy → lower severity
- Ransomware patterns (encryption + file ops + shadow deletion) → composite detection
Each of the 14 categories has independently tuned rules based on real-world testing.
Impact: 93 rules across 14 categories, each calibrated for precision over recall.
Lessons Learned
Start with the boring solution. Regex isn't exciting. But it's fast, free, deterministic, and handles 90%+ of the problem.
Reproducibility matters more than cleverness. A scanner that gives consistent results is infinitely more useful than one that's sometimes brilliant.
Score the confidence, not just the severity. A high-severity match with low confidence should score lower than a medium-severity match with high confidence.
Cross-category analysis catches what single rules miss. The most dangerous code combines multiple suspicious patterns.
What's Next
I've packaged this engine into CodeHeal, a security scanner specifically built for AI-generated code. Static analysis, 14 detection categories, deterministic results.
If you're reviewing AI-generated code and want consistent, instant security checks:
Try CodeHeal free — no account required →
Previously: Why AI-Generated Code is a Security Minefield