ayame0328

Posted on Mar 13

How I Replaced LLM-Based Code Analysis with Static Analysis — and Got Better Results


I ran the same code through an LLM-based security scanner 5 times. I got 5 different severity scores.

That was the moment I decided to throw away the LLM approach entirely and build a pure static analysis engine. Here's why — and what happened when I did.

The Problem: LLM Scanners Are Non-Deterministic

When I first started building a security scanner for AI-generated code, the obvious approach was to use an LLM. Feed it code, ask it to find vulnerabilities. Simple.

It worked, sort of. The LLM did find real issues. But here is what three of those runs reported:

Run 1: "Found 3 critical vulnerabilities. Risk score: 85/100."
Run 2: "Found 5 vulnerabilities, 2 critical. Risk score: 72/100."
Run 3: "Found 4 vulnerabilities. Risk score: 91/100."

Same code. Different results every time.

This is the p-hacking problem in AI code analysis: when a scanner gives different answers from run to run, you can keep rerunning it until you get the answer you want. So which answer do you trust? Can you show this to a security team and say "the code is safe" when the next run might disagree?

Impact: Non-reproducible results make LLM scanners useless for compliance, CI/CD integration, and any workflow that needs consistent answers.

The Real Cost of LLM-Based Scanning

Beyond non-determinism, the LLM approach had other problems:

Latency

Each scan required an API call — 3-15 seconds per analysis. For a developer running scans on every commit, that's unbearable.

Cost

GPT-4 API calls for code analysis cost roughly $0.03-0.10 per scan. At 100 scans/day, that's $3-10/day, or about $90-300 over a 30-day month, in API costs alone. Against a SaaS price of $29/month, the margins don't work.

Context Window Limits

Large codebases hit token limits. You end up chunking code and losing cross-file context — which is exactly where many vulnerabilities hide.

Hallucinations

The LLM occasionally invented vulnerabilities that didn't exist. "I found a SQL injection on line 47" — line 47 was a comment.

Impact: High cost + high latency + false positives = a scanner nobody wants to use.

The Static Analysis Alternative

I stepped back and asked: what do I actually need to detect?

Not complex semantic vulnerabilities. Not novel zero-days. I need to catch the specific, repeatable patterns that AI code assistants produce:

Hardcoded API keys matching known formats
Shell execution functions with string interpolation
Disabled SSL/TLS verification flags
Empty catch blocks
Known typosquatted package names
Persistence mechanisms (crontab, systemd, git hooks)

These are all pattern-matching problems, which is exactly what regex and AST analysis are built for.
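To make the list above concrete, here is a minimal sketch of a line-by-line regex scan in Python. The rule IDs, category names, and regexes are illustrative assumptions, not CodeHeal's actual rule set, and they cover only three of the pattern families listed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str          # hypothetical identifier
    category: str         # hypothetical category name
    pattern: re.Pattern


# Illustrative rules only; a real rule set would be far larger.
RULES = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    Rule("aws-access-key", "secret_leak",
         re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    # Shell execution call whose first argument is an f-string.
    Rule("shell-interpolation", "shell_exec",
         re.compile(r"\b(?:os\.system|subprocess\.(?:run|call|Popen))\(\s*f[\"']")),
    # TLS verification switched off, e.g. requests.get(url, verify=False).
    Rule("tls-verify-disabled", "insecure_transport",
         re.compile(r"\bverify\s*=\s*False\b")),
]


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule id) for every rule that matches a line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                findings.append((lineno, rule.rule_id))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
SAMPLE = '''import os, requests
AWS_KEY = "AKIAIOSFODNN7EXAMPLE"
os.system(f"rm -rf {path}")
requests.get(url, verify=False)
'''

print(scan(SAMPLE))
# [(2, 'aws-access-key'), (3, 'shell-interpolation'), (4, 'tls-verify-disabled')]
```

Because the scan is a pure function of the input text, calling it twice on the same source always returns the same list.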

The Architecture

Input Code
    ↓
Line-by-line scan (regex patterns)
    ↓
Category classification (14 categories)
    ↓
Severity × Confidence scoring
    ↓
Composite risk detection (cross-category analysis)
    ↓
Final risk rank: SAFE / CAUTION / DANGEROUS / CRITICAL

No API calls. No tokens. No temperature settings. Just deterministic pattern matching.

Impact: Same code → same result, every time. Zero variance.
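The last step of the pipeline, and the "same result every time" property, can be sketched roughly like this. The thresholds and the finding fields (`line`, `rule`, `score`) are assumptions made for illustration; the post does not state the real cutoffs.

```python
import hashlib
import json

# Hypothetical cutoffs, checked from highest to lowest.
RANK_THRESHOLDS = [(75, "CRITICAL"), (40, "DANGEROUS"), (10, "CAUTION")]


def risk_rank(total_score: float) -> str:
    """Map a total score onto the four ranks from the pipeline."""
    for minimum, rank in RANK_THRESHOLDS:
        if total_score >= minimum:
            return rank
    return "SAFE"


def report(findings: list[dict]) -> dict:
    """Build a report that does not depend on the order findings arrived in."""
    ordered = sorted(findings, key=lambda f: (f["line"], f["rule"]))
    total = sum(f["score"] for f in ordered)
    return {"findings": ordered, "total": total, "rank": risk_rank(total)}


def fingerprint(rep: dict) -> str:
    """Hash the report so two runs can be compared byte for byte."""
    payload = json.dumps(rep, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


findings = [
    {"line": 3, "rule": "shell-interpolation", "score": 30},
    {"line": 2, "rule": "aws-access-key", "score": 15},
]
first = report(findings)
second = report(list(reversed(findings)))
print(first["rank"], fingerprint(first) == fingerprint(second))
```

Comparing fingerprints like this is one cheap way to assert determinism in CI: if two runs over the same commit ever produce different hashes, something non-deterministic has crept into the scanner.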

The Results: Static vs. LLM

After building both approaches and testing them against the same dataset:

Metric	LLM Scanner	Static Analysis
Scan time	3-15 seconds	< 50ms
Reproducibility	~60% consistent	100%
API cost per scan	$0.03-0.10	$0
False positive rate	~15%	~5%
Detection categories	Flexible but unpredictable	14 fixed categories
CI/CD friendly	Barely	Fully

The static approach won on every metric that matters for production use.

Where LLM Still Wins

To be fair, LLM-based analysis handles novel vulnerability patterns better. If there's a completely new attack vector that no rule covers, an LLM might catch it while static analysis won't.

But for the known, repeatable patterns that make up 90%+ of AI-generated code vulnerabilities? Static analysis is faster, cheaper, and more reliable.

Design Decisions That Mattered

1. Confidence Coefficients

Not all pattern matches are equally meaningful. A regex match for AKIA (the AWS access key ID prefix) in a variable assignment is high-confidence. The same string in a comment is low-confidence.

Each detection rule has both severity (how bad is this?) and confidence (how sure are we?). The final score combines both:

Score = Severity Points × Confidence Coefficient

This dramatically reduced false positives compared to naive pattern matching.
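As a rough sketch of that formula, the snippet below scores the AKIA example in both contexts. The severity points, the confidence coefficients, and the way context is detected (a leading `#` versus a simple assignment) are all assumed values chosen for illustration, not the scanner's real calibration.

```python
import re

# Hypothetical severity points and confidence coefficients.
SEVERITY_POINTS = {"low": 10, "medium": 25, "high": 40}
CONF_COMMENT = 0.25      # match sits inside a comment
CONF_ASSIGNMENT = 1.0    # match sits in a variable assignment
CONF_OTHER = 0.5         # anything else

AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
ASSIGNMENT = re.compile(r"^\s*\w+\s*=")


def confidence_for(line: str) -> float:
    """Pick a confidence coefficient from the context of the line."""
    if line.lstrip().startswith("#"):
        return CONF_COMMENT
    if ASSIGNMENT.match(line):
        return CONF_ASSIGNMENT
    return CONF_OTHER


def score_line(line: str, severity: str = "high") -> float:
    """Score = Severity Points x Confidence Coefficient, or 0 if no match."""
    if not AWS_KEY.search(line):
        return 0.0
    return SEVERITY_POINTS[severity] * confidence_for(line)


print(score_line('AWS_KEY = "AKIAIOSFODNN7EXAMPLE"'))     # 40.0
print(score_line("# old key: AKIAIOSFODNN7EXAMPLE"))      # 10.0
```

With these assumed numbers, a high-severity match in a comment (40 × 0.25 = 10) scores below a medium-severity match in an assignment (25 × 1.0 = 25), which is the behavior the formula is meant to produce.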

2. Composite Risk Detection

Individual findings tell part of the story. But some combinations are worse than the sum of their parts:

External communication + hardcoded secrets = probable data exfiltration
Obfuscation + shell execution = likely malicious payload
Persistence mechanism + external connection = potential backdoor

The scanner detects these cross-category patterns and applies bonus risk scores. This catches sophisticated threats that single-pattern scanners miss.
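One simple way to express these combinations is as sets of categories that must all be present. In the sketch below, the category names and bonus values are hypothetical, and "external communication" and "external connection" are treated as the same category for brevity.

```python
# Each entry: (categories that must all be present, bonus points, label).
# Names and point values are illustrative assumptions.
COMPOSITE_RULES = [
    (frozenset({"external_comm", "hardcoded_secret"}), 30, "probable data exfiltration"),
    (frozenset({"obfuscation", "shell_exec"}), 40, "likely malicious payload"),
    (frozenset({"persistence", "external_comm"}), 35, "potential backdoor"),
]


def composite_bonus(categories_found: set[str]) -> tuple[int, list[str]]:
    """Return the total bonus and the labels of every combination that fired."""
    bonus = 0
    labels = []
    for required, points, label in COMPOSITE_RULES:
        # Subset test: the combination fires only if every category was seen.
        if required <= categories_found:
            bonus += points
            labels.append(label)
    return bonus, labels


print(composite_bonus({"external_comm", "hardcoded_secret", "persistence"}))
# (65, ['probable data exfiltration', 'potential backdoor'])
print(composite_bonus({"shell_exec"}))
# (0, [])
```

The bonus is added on top of the per-finding scores, so a file with several individually mild findings can still reach a high rank when the findings line up into one of these combinations.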

3. Category-Specific Tuning

Not all vulnerability categories behave the same way:

Secret leakage patterns are highly specific (known key formats) → high confidence
Code quality patterns (TODO markers, debug logs) are noisy → lower severity
Ransomware patterns (encryption + file ops + shadow deletion) → composite detection

Each of the 14 categories has independently tuned rules based on real-world testing.

Impact: 93 rules across 14 categories, each calibrated for precision over recall.

Lessons Learned

Start with the boring solution. Regex isn't exciting. But it's fast, free, deterministic, and handles 90%+ of the problem.
Reproducibility matters more than cleverness. A scanner that gives consistent results is infinitely more useful than one that's sometimes brilliant.
Score the confidence, not just the severity. A high-severity match with low confidence can, and often should, score lower than a medium-severity match with high confidence.
Cross-category analysis catches what single rules miss. The most dangerous code combines multiple suspicious patterns.

What's Next

I've packaged this engine into CodeHeal, a security scanner specifically built for AI-generated code. Static analysis, 14 detection categories, deterministic results.

If you're reviewing AI-generated code and want consistent, instant security checks:

Try CodeHeal free, no account required.

Previously: Why AI-Generated Code is a Security Minefield