
Juan Torchia

Originally published at juanchi.dev

LLMs generating security reports: I ran the same prompt on my own code


I made an architecture mistake that took me three weeks to see — and I only saw it because an LLM pointed it out first. I'm not telling you that to seem humble. I'm telling you because that same LLM ignored a real vulnerability I had exposed on a Railway endpoint for two months.

That contrast — catching something minor, missing something major — is exactly the problem I want to tear apart today.

A few days ago, a Hacker News thread stopped me cold: Linux kernel committers were receiving LLM-generated security reports. 115 upvotes, a heated debate about whether those reports were noise or signal. Most people focused on false positives. Nobody talked about false negatives.

That's the thesis I actually care about.


LLM security reports on real code: the experiment I ran

I took the exact pattern described in the HN thread — an LLM acting as a security reviewer over a diff or a code file — and applied it to three parts of my own production infrastructure: a Next.js webhook handler, an auth module I wrote during the pandemic when I was still learning to code seriously, and a Railway wrapper that handles environment variables.

The prompt I used was deliberately simple. I didn't want to give it extra context or help it along. I wanted to see what it could find on its own:

# Base prompt for security review
PROMPT = """
You are a security engineer reviewing this code.
List real vulnerabilities, ordered by severity.
Do not give general context. Do not explain what SQL injection is.
Just list what you SEE in this specific code.
"""

I ran it against Claude Opus 4 and GPT-4o. The results weren't identical — which is already information.

What they found (real)

Claude flagged three things in the webhook handler:

  1. Missing signature verification on the incoming payload — REAL. I knew about it but had left it "for later." That "later" had been going on for two months (a sketch of the fix follows below).
  2. A console.log(req.body) that could log sensitive data on certain requests — REAL, and I hadn't noticed it.
  3. A rate limiter implemented in memory (no Redis) that doesn't survive a container restart — REAL.

GPT-4o found the same three, plus one that was noise:

  1. It flagged that I was using Math.random() to generate session IDs — FALSE POSITIVE. Those values weren't session IDs; they were internal log-correlation IDs. Security-irrelevant.

Up to that point, the experiment seemed to validate the process. Three real findings, one false positive. Reasonable.
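
For reference, the fix for that first finding is small. This is a minimal sketch of HMAC signature verification, not my actual handler: the header format, the WEBHOOK_SECRET variable, and the SHA-256 scheme are assumptions that depend on what the webhook provider documents.

import { createHmac, timingSafeEqual } from "node:crypto";

// Minimal sketch: recompute the HMAC over the raw request body and compare it
// to the signature header in constant time. Header name, secret source and
// hash scheme are assumptions; check the provider's docs for the real values.
export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret = process.env.WEBHOOK_SECRET ?? ""
): boolean {
  if (!signatureHeader || !secret) return false;

  // Some providers prefix the header (e.g. "sha256=..."); strip it if needed.
  const received = Buffer.from(signatureHeader.replace(/^sha256=/, ""));
  const expected = Buffer.from(
    createHmac("sha256", secret).update(rawBody).digest("hex")
  );

  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

The raw body detail matters: if the framework has already parsed and re-serialized req.body, the recomputed HMAC won't match what the provider signed.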

What they did NOT find (and that's where the problem lives)

The old auth module — the one I wrote in 2021 when I transitioned from infrastructure to development — had something deeper. The token comparison logic was vulnerable to timing attacks on certain code paths:

import { timingSafeEqual } from "node:crypto";

// This looks harmless. It isn't.
// Direct string comparison is vulnerable to timing attacks
// because the comparison can short-circuit on the first differing byte
function validateToken(receivedToken: string, expectedToken: string): boolean {
  // ❌ Vulnerable: direct comparison
  return receivedToken === expectedToken;

  // ✅ Correct: constant-time comparison
  // (timingSafeEqual throws if the buffers differ in length,
  //  so check or normalize lengths first)
  // return timingSafeEqual(
  //   Buffer.from(receivedToken),
  //   Buffer.from(expectedToken)
  // );
}

Neither model flagged it. Neither one.

Why? Because the code "looked correct." The function returns a boolean, compares two strings, is clearly named. A fast reviewer — human or LLM — walks right past it.


The real problem: false negatives give you an excuse

When the Linux kernel starts receiving LLM-generated security reports, the natural debate is "how many are false positives?" That's a reasonable question. But it's the wrong question.

The question that matters is: how many real vulnerabilities are NOT showing up in those reports?

Because a false positive is something you can discard. It's annoying, you lose time, but it doesn't hurt you. A false negative — a vulnerability the LLM didn't see — gives you something worse: the feeling that you already reviewed it. That the code is clean. That you can deploy without worry.

That's exactly what happened to me with the Vercel breach. Not the breach itself, but the mental logic surrounding it: the incident broke my infrastructure, yes, but more than that it broke my excuse. The excuse that "someone already reviewed this."

When I ran the LLM over my code and got back three real findings, my first instinct was to think: "good, now I know my problems." But the timing attack was still there. Invisible. With an implicit "reviewed by AI" stamp on it.

My thesis, stated plainly: the danger of LLM security reports isn't that they generate noise. It's that they generate confidence.


What kinds of vulnerabilities LLMs see poorly

After the experiment, I got methodical. I tested more code. I tried different prompts. I gave it context, no context, explicit chain-of-thought. Here's the pattern that emerged:

They see well:

  • Hardcoded secrets in code (API keys, plaintext passwords)
  • Missing validation on obvious inputs
  • SQL queries concatenated with string interpolation
  • Dependencies with known CVEs (if they're in the training data)
  • Logs exposing sensitive data

They see poorly:

  • Vulnerabilities that depend on execution context (race conditions, timing attacks)
  • Authorization problems that require understanding the business model
  • Implicit access control logic (what the code does NOT do, not what it does)
  • Vulnerabilities in the interaction between two modules the LLM doesn't see together

That last category strikes me as the most dangerous. When I used the same pattern I applied in CrabTrap — an LLM as an intermediate judge in front of my agent — I learned that LLMs are good at evaluating what's in front of them. They're bad at reasoning about what's missing or about emergent behavior in systems.

A security review is no different.
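
To make that last category concrete, here's a hypothetical example of the "what the code does NOT do" problem. None of this is from my codebase, and the Prisma-style db client is just a stand-in; the point is the line that isn't there.

// Hypothetical Prisma-style client, declared only so the snippet type-checks.
declare const db: {
  invoice: {
    findUnique(args: { where: { id: string } }): Promise<{ id: string; ownerId: string; total: number } | null>;
  };
};

// Fetches an invoice for the logged-in user. Reviewed in isolation it looks
// fine: input is typed, errors are handled, nothing is concatenated into a
// query. The vulnerability is the missing ownership check, so any
// authenticated user can read any invoice by iterating ids.
async function getInvoice(invoiceId: string, session: { userId: string }) {
  const invoice = await db.invoice.findUnique({ where: { id: invoiceId } });
  if (!invoice) throw new Error("Not found");

  // Missing: if (invoice.ownerId !== session.userId) throw new Error("Forbidden");
  return invoice;
}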


Common mistakes when using LLMs to review security

1. Giving it the file instead of the system

The timing attack the models missed was in a file reviewed in isolation. If I'd passed the full flow — from endpoint to validation — maybe it would have caught it. Maybe.

2. Interpreting silence as approval

"Found nothing" doesn't mean "nothing's there." It means "found nothing in what it processed." The distinction matters.

3. Not specifying the threat model

An LLM without context assumes a generic threat model. It doesn't know if the adversary is a script kiddie or a well-resourced team with time. The prompt I built was deliberately neutral — that was my mistake too.

4. Trusting a single model

GPT-4o and Claude found different things. That alone tells you neither has complete coverage. Running them as independent queries and comparing outputs is more honest than trusting just one.
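
One cheap way to operationalize that: treat each model's report as a set of findings and separate what they agree on from what only one saw. This sketch only covers the comparison step; normalizing the two reports into a common shape is the part I'm hand-waving.

interface Finding {
  file: string;
  line: number;
  title: string;
}

// Loose key so near-identical findings from different models collide.
const keyOf = (f: Finding) => `${f.file}:${f.line}:${f.title.toLowerCase()}`;

// Split two reports into agreed findings (triage first) and
// single-model findings (still worth a look, but noisier).
function compareReports(a: Finding[], b: Finding[]) {
  const keysA = new Set(a.map(keyOf));
  const keysB = new Set(b.map(keyOf));
  return {
    agreed: a.filter((f) => keysB.has(keyOf(f))),
    onlyA: a.filter((f) => !keysB.has(keyOf(f))),
    onlyB: b.filter((f) => !keysA.has(keyOf(f))),
  };
}

Agreement is not verification, though: the timing attack above is exactly the kind of finding both models will happily agree to omit.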

5. Not iterating the prompt by code type

A webhook handler needs a different prompt than an auth module. The local context changes which vulnerabilities are actually relevant.


FAQ: LLM security reports and code analysis

Can LLMs replace a real pentest?

No. Not even close. An LLM can do a first pass over static code and catch obvious problems. A pentest involves execution context, real interaction with the system, privilege escalation, runtime behavior analysis. They're different tools for different moments. The LLM is useful before the pentest, not instead of it.

How reliable are AI-generated security reports for production code?

Depends on what you expect from them. For finding hardcoded secrets, unvalidated inputs, or obvious injection patterns: pretty reliable. For finding logic vulnerabilities, implicit authorization problems, or timing bugs: don't use them as your only source. My experiment gave three true positives and one false positive — but the most serious vulnerability didn't make it into the report.

Does it make sense to send LLM-generated security reports to open source projects like the kernel?

It's a question that divides the ecosystem, and for good reason. If the report is verified by a human before being sent and describes a real vulnerability: yes, it adds value. If it's a raw LLM output without human curation sent to maintainers who already have packed queues: it's noise with a real human cost. The problem isn't that the LLM generates it. The problem is when that human verification step disappears from the chain.

What prompt gives the best results for LLM security reviews?

In my tests, the most useful prompts have three components: threat model specification ("assume the attacker has access to logs but not to source code"), scope restriction ("don't explain general concepts, only what you see in this code"), and a request for evidence ("for each finding, cite the exact line and explain the concrete attack vector"). Without that, the outputs are generic and hard to act on.
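
Assembled, that looks something like the template below. The bracketed parts are placeholders to adapt per codebase; I'm not claiming this exact wording is optimal.

// Prompt template with the three components: threat model, scope, evidence.
const SECURITY_REVIEW_PROMPT = `
You are a security engineer reviewing the code below.

Threat model: assume the attacker [can read logs but not source code,
can send arbitrary requests to the public endpoints, ...].

Scope: do not explain general concepts. Only report what you can point to
in THIS code. If you find nothing, say "no findings" instead of padding.

For each finding, give:
- severity (critical / high / medium / low)
- the exact line(s) involved
- the concrete attack vector, step by step
`;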

Do LLMs see vulnerabilities better than static security linters like Semgrep or Bandit?

Complementary, not superior. Semgrep and Bandit are deterministic: if you define a rule, it applies it every time, no hallucinations. LLMs have more contextual reasoning ability but are non-deterministic and can invent problems or miss patterns not represented in training. My current stack runs them in parallel: Semgrep in CI for automatic coverage, LLM for contextual review on critical PRs.

Is it worth automating LLM security reviews in the CI/CD pipeline?

Carefully. The token cost of reviewing every commit can scale fast — I have logs of what each design decision costs in my agent and the numbers were eye-opening. For automatic CI, the best approach is a selective trigger: files that touch authentication, secret handling, or input validation. Not the full diff on every push.
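
In practice the selective trigger is just a path filter in front of the LLM call. A hypothetical sketch; the patterns are examples, not a complete definition of what counts as security-sensitive:

// Decide which changed files go through the LLM security review.
// Patterns are illustrative; tune them to where auth, secrets and
// input validation actually live in your repo.
const SENSITIVE_PATTERNS = [/auth/i, /session/i, /token/i, /webhook/i, /\.env/, /middleware/i];

function filesToReview(changedFiles: string[]): string[] {
  return changedFiles.filter((path) =>
    SENSITIVE_PATTERNS.some((pattern) => pattern.test(path))
  );
}

// Example: only the first two files get sent to the model.
filesToReview([
  "src/app/api/webhooks/route.ts",
  "src/lib/auth/validateToken.ts",
  "README.md",
]);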


What I accepted, what I don't buy, and the honest trade-off

I accepted that LLMs are useful as a first review layer. They're better than reviewing nothing. They found three real problems in my code that I'd been putting off — and "later" had been going on for two months.

What I don't buy is the narrative that LLM-as-security-reviewer is sufficient. Or worse: that it's equivalent to expert human review. That narrative exists because it's convenient — for vendors, for teams with deadlines, for anyone who wants the feeling that a security process exists without the cost of actually running it well.

The timing attack the models ignored wasn't esoteric. It's a known, documented pattern with a one-line fix. They missed it because it was implicit in the behavior, not explicit in the syntax.

The honest trade-off is this: an LLM security report gives you coverage over what's visible. The invisible stays invisible — and now it comes with a "reviewed" stamp on it.

That stamp — that's what bothers me. And it's what pushed me to run this experiment instead of sitting back with that first reassuring pass.

If you're building something with a real attack surface — a public endpoint, token handling, user data — don't let an LLM security report be the end of the process. Use it as the starting point. The difference matters more than it looks.


If you want to see how I built the intermediate evaluation system that uses LLM-as-judge in my production agent, the CrabTrap post has the full technical details. And if the token cost question in automated review workflows concerns you, the numbers I measured in my own logs give context for why selective triggering matters.


This article was originally published on juanchi.dev
