When I started building a security scanner for AI-generated code, I did what everyone does in 2026: I threw an LLM at it.
That was a mistake. Here's why I ripped it out and replaced it with static analysis — and why the results are objectively better.
The LLM Approach (Week 1)
The idea was simple: feed code into an LLM, ask it to identify security vulnerabilities, return a severity score. Modern, elegant, "AI-powered."
I built the prototype in a day. It worked... sort of.
```
Input: eval(user_input)

Run 1: Severity 8.5 - "Critical command injection vulnerability"
Run 2: Severity 6.2 - "Moderate risk, depends on context"
Run 3: Severity 9.1 - "Extremely dangerous, immediate fix required"
Run 4: Severity 7.0 - "High risk injection vector"
Run 5: Severity 8.5 - "Critical vulnerability"
```
Same code. Five runs. Five different answers. The severity scores ranged from 6.2 to 9.1.
This is not a security tool. This is a random number generator with opinions.
The p-Hacking Problem
If you're not familiar with p-hacking in research: it's when you run experiments multiple times and cherry-pick the results that support your hypothesis. LLM-based code analysis has the same fundamental problem.
I ran a systematic test: the same 20 code samples, scanned 5 times each. The results were devastating:
- Score variance: Average deviation of ±1.8 points on a 10-point scale
- Category disagreement: 23% of the time, the LLM categorized the same vulnerability differently across runs
- False negative rate: On run 3, it completely missed a SQL injection that it caught on runs 1, 2, 4, and 5
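The variance measurement itself is trivial to script. Here's a minimal sketch using the five severity scores from the eval(user_input) example above (the scores are from this post; the metric names are mine):

```python
import statistics

# Severity scores from 5 runs of the same eval(user_input) sample.
runs = [8.5, 6.2, 9.1, 7.0, 8.5]

mean = statistics.mean(runs)
# Average absolute deviation from the mean, on the 10-point scale.
avg_dev = sum(abs(s - mean) for s in runs) / len(runs)

print(f"mean={mean:.2f}, avg deviation=±{avg_dev:.2f}, "
      f"spread={max(runs) - min(runs):.1f}")
# → mean=7.86, avg deviation=±1.01, spread=2.9
```

Run this over 20 samples × 5 runs and the ±1.8 average deviation falls straight out. No statistics degree required to see the problem.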
When your security scanner gives different results depending on when you run it, you can't trust any of the results.
The Breaking Point
The moment I decided to abandon the LLM approach was embarrassingly simple.
I had a test file with an obvious eval(input()) — the textbook example of command injection. I ran the scan 10 times to check consistency. Eight times it flagged it correctly. Twice it said "low risk, as this pattern is common in REPL implementations."
A security scanner that sometimes thinks eval(input()) is fine is worse than no scanner at all. It gives you false confidence.
Starting Over with Static Analysis
I went back to basics. Pattern matching. Regular expressions. Abstract syntax tree (AST) analysis. The kind of "boring" technology that's been catching vulnerabilities since the 1970s.
Here's what changed immediately:
Determinism
```
Input: eval(user_input)

Run 1: CRITICAL - Command injection (score: 20)
Run 2: CRITICAL - Command injection (score: 20)
Run 3: CRITICAL - Command injection (score: 20)
...
Run 100: CRITICAL - Command injection (score: 20)
```
Same input, same output. Every. Single. Time. This is what a security tool should do.
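The determinism isn't an achievement; it's a property you get for free. Here's a minimal sketch of what an AST-based detector for this one rule looks like (the rule name and score format are illustrative, not my actual engine):

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}  # illustrative subset of rules

def scan(source: str) -> list[dict]:
    """Flag calls to dangerous builtins. Pure function of its input:
    the same source always yields the same findings."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings.append({
                "severity": "CRITICAL",
                "rule": "command-injection",
                "score": 20,
                "line": node.lineno,
            })
    return findings

# Deterministic: 100 runs, byte-identical results.
results = [scan("eval(user_input)") for _ in range(100)]
assert all(r == results[0] for r in results)
print(results[0])
```

There's no temperature parameter, no sampling, no model version drift. The rule either matches or it doesn't.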
Speed
| Approach | Time per scan | Cost per scan |
|---|---|---|
| LLM-based | 3-8 seconds | $0.002-0.01 |
| Static analysis | 15-50ms | $0.00 |
That's not a small difference. It's the difference between "scan on every commit" and "scan when you remember to."
Coverage
This surprised me the most. I expected the LLM to catch more edge cases. It didn't.
The LLM was great at explaining why something was dangerous. But it was inconsistent at detecting it in the first place. Static analysis with well-crafted patterns caught more vulnerabilities more reliably.
I ended up with 14 categories and 93 detection rules covering:
- Command injection and code execution
- Obfuscation and encoding tricks
- Data exfiltration patterns
- Cryptographic weaknesses
- Destructive file operations
- And 9 more categories specific to AI-generated code patterns
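Structurally, a rule engine like this is just data plus a loop. The sketch below shows the shape (the rule IDs, patterns, scores, and descriptions here are made-up illustrations, not my actual 93-rule set):

```python
import re

# A few illustrative rules; the real engine has 93 across 14 categories.
RULES = [
    {"id": "CMD-001", "category": "command-injection",
     "pattern": re.compile(r"\beval\s*\("), "score": 20,
     "description": "eval() executes arbitrary code passed to it."},
    {"id": "OBF-001", "category": "obfuscation",
     "pattern": re.compile(r"base64\.b64decode\s*\("), "score": 8,
     "description": "Base64 decoding is often used to hide payloads."},
    {"id": "EXF-001", "category": "data-exfiltration",
     "pattern": re.compile(r"os\.environ"), "score": 6,
     "description": "Reads environment variables, which may hold secrets."},
]

def scan(code: str) -> list[dict]:
    """Return every rule that matches. Deterministic by construction."""
    return [
        {"id": r["id"], "category": r["category"], "score": r["score"]}
        for r in RULES if r["pattern"].search(code)
    ]

print(scan("payload = base64.b64decode(blob); eval(payload)"))
```

Note the pre-written description on each rule — that's also how I replaced the LLM's natural-language explanations, as covered below.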
What Static Analysis Does Better
1. No Hallucinated Vulnerabilities
LLMs sometimes report vulnerabilities that don't exist. They see a pattern that looks like it could be dangerous and flag it, even when the context makes it safe. Static analysis only fires on exact pattern matches — no imagination, no hallucination.
2. Composite Risk Detection
One thing I built into the static engine that LLMs struggled with: detecting when multiple low-severity findings combine into a high-severity risk.
For example: reading environment variables (low risk) + making HTTP calls (low risk) + base64 encoding (low risk) = potential credential exfiltration (critical risk).
The LLM would sometimes catch this composite pattern, sometimes not. The static engine catches it every time because the rules are explicit.
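A composite rule is easy to express once findings carry category tags. This is a simplified sketch of the idea; the category names and scores are illustrative, not my actual rule definitions:

```python
# Composite rule: individually low-risk findings that together
# suggest credential exfiltration.
COMPOSITE_RULES = [
    {
        "name": "potential-credential-exfiltration",
        "requires": {"env-read", "http-call", "base64-encode"},
        "score": 20,  # critical, even though each part alone is low
    },
]

def apply_composites(findings: list[dict]) -> list[dict]:
    """Append a composite finding when all its required categories appear."""
    present = {f["category"] for f in findings}
    extra = [
        {"category": rule["name"], "score": rule["score"]}
        for rule in COMPOSITE_RULES
        if rule["requires"] <= present  # all required categories found
    ]
    return findings + extra

low_risk = [
    {"category": "env-read", "score": 3},
    {"category": "http-call", "score": 3},
    {"category": "base64-encode", "score": 2},
]
print(apply_composites(low_risk)[-1])
# → {'category': 'potential-credential-exfiltration', 'score': 20}
```

Because the rule is an explicit set intersection, there's no run where the engine "forgets" to connect the three findings.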
3. AI-Specific Patterns
LLMs analyzing LLM-generated code have a blind spot: they share the same training data. The patterns that AI code assistants produce are patterns the analyzing LLM considers "normal."
Static analysis doesn't have this bias. A hardcoded API key is a hardcoded API key, regardless of whether a human or AI wrote it.
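Hardcoded-secret detection is a good example of a rule that needs zero judgment. A minimal sketch, assuming two common key shapes (AWS-style access key IDs, which are "AKIA" plus 16 uppercase alphanumerics, and generic quoted assignments — real scanners ship far more patterns than this):

```python
import re

API_KEY_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    # Generic: api_key/secret assigned a long quoted literal.
    re.compile(r"(?i)(api[_-]?key|secret)\s*=\s*['\"][^'\"]{16,}['\"]"),
]

def has_hardcoded_key(code: str) -> bool:
    return any(p.search(code) for p in API_KEY_PATTERNS)

print(has_hardcoded_key('api_key = "sk_live_abcdef0123456789"'))  # True
print(has_hardcoded_key('api_key = os.environ["API_KEY"]'))       # False
```

The second call returning False is the point: reading the key from the environment is the correct pattern, and the regex can't be talked out of that distinction.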
What I Lost (And Why It's Okay)
No Natural Language Explanations
The LLM could explain why eval() is dangerous in plain English, with context about how an attacker might exploit it. Static analysis just says "Command injection detected, line 42."
My solution: Pre-written descriptions for each rule. Not as dynamic, but consistent and accurate.
No Context-Aware Analysis
The LLM could sometimes understand that eval("2 + 2") with a hardcoded string is less dangerous than eval(user_input). Static analysis treats both as matches.
My solution: Confidence levels. High confidence for clear-cut cases (eval(input())), medium for ambiguous ones (eval() with non-obvious arguments).
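Confidence tiers fall out naturally from the AST: a literal argument, an input() call, and an arbitrary expression are all distinguishable node types. A sketch of the idea (the tier names and the decision to skip literal arguments entirely are my illustrative choices):

```python
import ast

def eval_confidence(source: str) -> list[tuple[int, str]]:
    """Assign a confidence tier to each eval() call based on its argument."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            arg = node.args[0] if node.args else None
            if isinstance(arg, ast.Constant):
                continue  # eval("2 + 2"): hardcoded literal, skip
            if (isinstance(arg, ast.Call)
                    and isinstance(arg.func, ast.Name)
                    and arg.func.id == "input"):
                findings.append((node.lineno, "high"))  # eval(input())
            else:
                findings.append((node.lineno, "medium"))  # unknown argument
    return findings

print(eval_confidence("eval(input())\neval(data)\neval('2 + 2')"))
# → [(1, 'high'), (2, 'medium')]
```

It's cruder than genuine context understanding, but the crude answer is the same answer every time.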
No New Vulnerability Discovery
Static analysis only finds what you tell it to look for. It won't discover novel attack vectors.
My solution: This is fine for the target use case. AI-generated code tends to repeat the same vulnerability patterns. I don't need to discover zero-days — I need to catch the same 93 mistakes that AI keeps making.
The Numbers After 3 Months
| Metric | LLM Approach | Static Analysis |
|---|---|---|
| Consistency | ~77% same result | 100% same result |
| Speed | 3-8 sec | 15-50ms |
| Cost per scan | $0.002-0.01 | $0.00 |
| False positive rate | ~12% | ~5% |
| False negative rate | ~8% | ~3% |
| Rules/patterns | "Vibes" | 93 explicit rules |
The static analysis approach is better in literally every measurable dimension except "sounds impressive on a landing page."
When to Use LLMs for Security
I'm not saying LLMs are useless for security. They're great for:
- Code review assistance: Explaining findings in natural language
- Threat modeling: Brainstorming attack vectors
- Documentation: Generating security guidelines
But for automated scanning — where you need speed, consistency, and reliability — static analysis wins. It's not even close.
The Uncomfortable Industry Truth
The security tool market is rushing to add "AI-powered" to every product. But for pattern-based vulnerability detection, the AI adds latency, cost, and inconsistency without improving accuracy.
Sometimes the boring solution is the right one.
Try the Static Analysis Approach
CodeHeal is the scanner I built after ditching the LLM approach. 14 categories, 93 rules, deterministic results, zero API costs. Paste your code and see for yourself.
Previously: Why AI-Generated Code is a Security Minefield