When I started building a security scanner for AI-generated code, I did what everyone does in 2026: I threw an LLM at it.
That was a mistake. Here's why I ripped it out and replaced it with static analysis — and why the results are objectively better.
The LLM Approach (Week 1)
The idea was simple: feed code into an LLM, ask it to identify security vulnerabilities, return a severity score. Modern, elegant, "AI-powered."
I built the prototype in a day. It worked... sort of.
```
Input: eval(user_input)

Run 1: Severity 8.5 - "Critical command injection vulnerability"
Run 2: Severity 6.2 - "Moderate risk, depends on context"
Run 3: Severity 9.1 - "Extremely dangerous, immediate fix required"
Run 4: Severity 7.0 - "High risk injection vector"
Run 5: Severity 8.5 - "Critical vulnerability"
```
Same code. Five runs. Five different answers. The severity scores ranged from 6.2 to 9.1.
This is not a security tool. This is a random number generator with opinions.
The p-Hacking Problem
If you're not familiar with p-hacking in research: it's when you run experiments multiple times and cherry-pick the results that support your hypothesis. LLM-based code analysis has the same fundamental problem.
I ran a systematic test: the same 20 code samples, scanned 5 times each. The results were devastating:
- Score variance: Average deviation of ±1.8 points on a 10-point scale
- Category disagreement: 23% of the time, the LLM categorized the same vulnerability differently across runs
- False negative rate: On run 3, it completely missed a SQL injection that it caught on runs 1, 2, 4, and 5
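The variance measurement itself is trivial to script. Here's a minimal sketch using the five severity scores from the eval(user_input) example above (the scores are from this post; the metric names are mine):

```python
import statistics

# Severity scores from 5 runs of the same eval(user_input) sample.
runs = [8.5, 6.2, 9.1, 7.0, 8.5]

mean = statistics.mean(runs)
# Average absolute deviation from the mean, on the 10-point scale.
avg_dev = sum(abs(s - mean) for s in runs) / len(runs)

print(f"mean={mean:.2f}, avg deviation=±{avg_dev:.2f}, "
      f"spread={max(runs) - min(runs):.1f}")
# → mean=7.86, avg deviation=±1.01, spread=2.9
```

Run this over 20 samples × 5 runs and the ±1.8 average deviation falls straight out. No statistics degree required to see the problem.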
When your security scanner gives different results depending on when you run it, you can't trust any of the results.
The Breaking Point
The moment I decided to abandon the LLM approach was embarrassingly simple.
I had a test file with an obvious eval(input()) — the textbook example of command injection. I ran the scan 10 times to check consistency. Eight times it flagged it correctly. Twice it said "low risk, as this pattern is common in REPL implementations."
A security scanner that sometimes thinks eval(input()) is fine is worse than no scanner at all. It gives you false confidence.
Starting Over with Static Analysis
I went back to basics. Pattern matching. Regular expressions. Abstract syntax tree (AST) analysis. The kind of "boring" technology that's been catching vulnerabilities since the 1970s.
Here's what changed immediately:
Determinism
```
Input: eval(user_input)

Run 1: CRITICAL - Command injection (score: 20)
Run 2: CRITICAL - Command injection (score: 20)
Run 3: CRITICAL - Command injection (score: 20)
...
Run 100: CRITICAL - Command injection (score: 20)
```
Same input, same output. Every. Single. Time. This is what a security tool should do.
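The determinism isn't an achievement; it's a property you get for free. Here's a minimal sketch of what an AST-based detector for this one rule looks like (the rule name and score format are illustrative, not my actual engine):

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}  # illustrative subset of rules

def scan(source: str) -> list[dict]:
    """Flag calls to dangerous builtins. Pure function of its input:
    the same source always yields the same findings."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings.append({
                "severity": "CRITICAL",
                "rule": "command-injection",
                "score": 20,
                "line": node.lineno,
            })
    return findings

# Deterministic: 100 runs, byte-identical results.
results = [scan("eval(user_input)") for _ in range(100)]
assert all(r == results[0] for r in results)
print(results[0])
```

There's no temperature parameter, no sampling, no model version drift. The rule either matches or it doesn't.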
Speed
| Approach | Time per scan | Cost per scan |
|---|---|---|
| LLM-based | 3-8 seconds | $0.002-0.01 |
| Static analysis | 15-50ms | $0.00 |
That's not a small difference. It's the difference between "scan on every commit" and "scan when you remember to."
Coverage
This surprised me the most. I expected the LLM to catch more edge cases. It didn't.
The LLM was great at explaining why something was dangerous. But it was inconsistent at detecting it in the first place. Static analysis with well-crafted patterns caught more vulnerabilities more reliably.
I ended up with 14 categories and 93 detection rules covering:
- Command injection and code execution
- Obfuscation and encoding tricks
- Data exfiltration patterns
- Cryptographic weaknesses
- Destructive file operations
- And 9 more categories specific to AI-generated code patterns
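Structurally, a rule engine like this is just data plus a loop. The sketch below shows the shape (the rule IDs, patterns, scores, and descriptions here are made-up illustrations, not my actual 93-rule set):

```python
import re

# A few illustrative rules; the real engine has 93 across 14 categories.
RULES = [
    {"id": "CMD-001", "category": "command-injection",
     "pattern": re.compile(r"\beval\s*\("), "score": 20,
     "description": "eval() executes arbitrary code passed to it."},
    {"id": "OBF-001", "category": "obfuscation",
     "pattern": re.compile(r"base64\.b64decode\s*\("), "score": 8,
     "description": "Base64 decoding is often used to hide payloads."},
    {"id": "EXF-001", "category": "data-exfiltration",
     "pattern": re.compile(r"os\.environ"), "score": 6,
     "description": "Reads environment variables, which may hold secrets."},
]

def scan(code: str) -> list[dict]:
    """Return every rule that matches. Deterministic by construction."""
    return [
        {"id": r["id"], "category": r["category"], "score": r["score"]}
        for r in RULES if r["pattern"].search(code)
    ]

print(scan("payload = base64.b64decode(blob); eval(payload)"))
```

Note the pre-written description on each rule — that's also how I replaced the LLM's natural-language explanations, as covered below.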
What Static Analysis Does Better
1. No Hallucinated Vulnerabilities
LLMs sometimes report vulnerabilities that don't exist. They see a pattern that looks like it could be dangerous and flag it, even when the context makes it safe. Static analysis only fires on exact pattern matches — no imagination, no hallucination.
2. Composite Risk Detection
One thing I built into the static engine that LLMs struggled with: detecting when multiple low-severity findings combine into a high-severity risk.
For example: reading environment variables (low risk) + making HTTP calls (low risk) + base64 encoding (low risk) = potential credential exfiltration (critical risk).
The LLM would sometimes catch this composite pattern, sometimes not. The static engine catches it every time because the rules are explicit.
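A composite rule is easy to express once findings carry category tags. This is a simplified sketch of the idea; the category names and scores are illustrative, not my actual rule definitions:

```python
# Composite rule: individually low-risk findings that together
# suggest credential exfiltration.
COMPOSITE_RULES = [
    {
        "name": "potential-credential-exfiltration",
        "requires": {"env-read", "http-call", "base64-encode"},
        "score": 20,  # critical, even though each part alone is low
    },
]

def apply_composites(findings: list[dict]) -> list[dict]:
    """Append a composite finding when all its required categories appear."""
    present = {f["category"] for f in findings}
    extra = [
        {"category": rule["name"], "score": rule["score"]}
        for rule in COMPOSITE_RULES
        if rule["requires"] <= present  # all required categories found
    ]
    return findings + extra

low_risk = [
    {"category": "env-read", "score": 3},
    {"category": "http-call", "score": 3},
    {"category": "base64-encode", "score": 2},
]
print(apply_composites(low_risk)[-1])
# → {'category': 'potential-credential-exfiltration', 'score': 20}
```

Because the rule is an explicit set intersection, there's no run where the engine "forgets" to connect the three findings.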
3. AI-Specific Patterns
LLMs analyzing LLM-generated code have a blind spot: they share the same training data. The patterns that AI code assistants produce are patterns the analyzing LLM considers "normal."
Static analysis doesn't have this bias. A hardcoded API key is a hardcoded API key, regardless of whether a human or AI wrote it.
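Hardcoded-secret detection is a good example of a rule that needs zero judgment. A minimal sketch, assuming two common key shapes (AWS-style access key IDs, which are "AKIA" plus 16 uppercase alphanumerics, and generic quoted assignments — real scanners ship far more patterns than this):

```python
import re

API_KEY_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    # Generic: api_key/secret assigned a long quoted literal.
    re.compile(r"(?i)(api[_-]?key|secret)\s*=\s*['\"][^'\"]{16,}['\"]"),
]

def has_hardcoded_key(code: str) -> bool:
    return any(p.search(code) for p in API_KEY_PATTERNS)

print(has_hardcoded_key('api_key = "sk_live_abcdef0123456789"'))  # True
print(has_hardcoded_key('api_key = os.environ["API_KEY"]'))       # False
```

The second call returning False is the point: reading the key from the environment is the correct pattern, and the regex can't be talked out of that distinction.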
What I Lost (And Why It's Okay)
No Natural Language Explanations
The LLM could explain why eval() is dangerous in plain English, with context about how an attacker might exploit it. Static analysis just says "Command injection detected, line 42."
My solution: Pre-written descriptions for each rule. Not as dynamic, but consistent and accurate.
No Context-Aware Analysis
The LLM could sometimes understand that eval("2 + 2") with a hardcoded string is less dangerous than eval(user_input). Static analysis treats both as matches.
My solution: Confidence levels. High confidence for clear-cut cases (eval(input())), medium for ambiguous ones (eval() with non-obvious arguments).
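Confidence tiers fall out naturally from the AST: a literal argument, an input() call, and an arbitrary expression are all distinguishable node types. A sketch of the idea (the tier names and the decision to skip literal arguments entirely are my illustrative choices):

```python
import ast

def eval_confidence(source: str) -> list[tuple[int, str]]:
    """Assign a confidence tier to each eval() call based on its argument."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            arg = node.args[0] if node.args else None
            if isinstance(arg, ast.Constant):
                continue  # eval("2 + 2"): hardcoded literal, skip
            if (isinstance(arg, ast.Call)
                    and isinstance(arg.func, ast.Name)
                    and arg.func.id == "input"):
                findings.append((node.lineno, "high"))  # eval(input())
            else:
                findings.append((node.lineno, "medium"))  # unknown argument
    return findings

print(eval_confidence("eval(input())\neval(data)\neval('2 + 2')"))
# → [(1, 'high'), (2, 'medium')]
```

It's cruder than genuine context understanding, but the crude answer is the same answer every time.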
No New Vulnerability Discovery
Static analysis only finds what you tell it to look for. It won't discover novel attack vectors.
My solution: This is fine for the target use case. AI-generated code tends to repeat the same vulnerability patterns. I don't need to discover zero-days — I need to catch the same 93 mistakes that AI keeps making.
The Numbers After 3 Months
| Metric | LLM Approach | Static Analysis |
|---|---|---|
| Consistency | ~77% same result | 100% same result |
| Speed | 3-8 sec | 15-50ms |
| Cost per scan | $0.002-0.01 | $0.00 |
| False positive rate | ~12% | ~5% |
| False negative rate | ~8% | ~3% |
| Rules/patterns | "Vibes" | 93 explicit rules |
The static analysis approach is better in literally every measurable dimension except "sounds impressive on a landing page."
When to Use LLMs for Security
I'm not saying LLMs are useless for security. They're great for:
- Code review assistance: Explaining findings in natural language
- Threat modeling: Brainstorming attack vectors
- Documentation: Generating security guidelines
But for automated scanning — where you need speed, consistency, and reliability — static analysis wins. It's not even close.
The Uncomfortable Industry Truth
The security tool market is rushing to add "AI-powered" to every product. But for pattern-based vulnerability detection, the AI adds latency, cost, and inconsistency without improving accuracy.
Sometimes the boring solution is the right one.
Try the Static Analysis Approach
CodeHeal is the scanner I built after ditching the LLM approach. 14 categories, 93 rules, deterministic results, zero API costs. Paste your code and see for yourself.
Previously: Why AI-Generated Code is a Security Minefield