There's a huge difference between the plumber who fixes your pipes and the one who breaks them. But if the same plumber can do either depending on whether you ask or not, you've got a process problem, not a tool problem.
That's exactly what happened to me. I approved three PRs in the same sprint. All three had hardcoded keys. All three came with partial suggestions from Copilot or Claude. And when someone finally flagged the problem in code review — weeks later — my first thought was: why didn't either of them catch it earlier? Worse: would they have caught it if someone had asked them directly?
N-Day-Bench tries to answer exactly that question. And the answer left me with more questions than I started with.
What N-Day-Bench Actually Measures
N-Day-Bench is a benchmark published in early 2025 that evaluates whether LLMs can identify real vulnerabilities — not synthetic ones, not CTF challenges — in actual production codebases. "N-Day" because it works with already-known vulnerabilities (they have assigned CVEs), not zero-days.
The methodology is more honest than most:
- They take real CVEs with real affected code
- They give the models relevant context (not the entire repo, just the pertinent files)
- They ask the model to identify the vulnerability without hinting at the CVE
- They measure whether the model finds the correct problem, not whether it generates plausible-sounding security text
That last point matters. A lot of security benchmarks are satisfied if the model mentions the right type of vulnerability. N-Day-Bench requires precision: correct file, approximate line, real exploitation mechanism.
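To make that precision requirement concrete, here's a sketch of what a scoring check like that could look like. This is my reconstruction, not the paper's actual harness — the field names and line tolerance are assumptions:

```javascript
// Hypothetical sketch of N-Day-Bench-style scoring — the exact harness
// and field names are my assumption, not taken from the paper.
function isCorrectFinding(prediction, groundTruth, lineTolerance = 5) {
  // A finding only counts if it names the right file...
  const sameFile = prediction.file === groundTruth.file;
  // ...lands near the vulnerable line...
  const nearLine =
    Math.abs(prediction.line - groundTruth.line) <= lineTolerance;
  // ...and describes the actual exploitation mechanism,
  // not just the vulnerability class.
  const sameMechanism = prediction.mechanism === groundTruth.mechanism;
  return sameFile && nearLine && sameMechanism;
}

// "Mentions SQL injection somewhere" is not enough:
const truth = { file: 'db.js', line: 42, mechanism: 'sqli-string-concat' };
console.log(isCorrectFinding(
  { file: 'db.js', line: 44, mechanism: 'sqli-string-concat' }, truth)); // true
console.log(isCorrectFinding(
  { file: 'app.js', line: 44, mechanism: 'sqli-string-concat' }, truth)); // false
```

Under a scheme like this, a model that says "this codebase probably has injection issues" scores zero — which is the whole point.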
The published results show that the best models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro in the evaluated versions) correctly identify between 20% and 35% of vulnerabilities when queried directly. Sounds low. But compare it to an average developer doing manual code review on someone else's code — the number isn't all that different.
The real problem is somewhere else entirely.
The Gap Nobody Mentions in the Papers
There's something N-Day-Bench doesn't measure directly but that you can infer from the data: the difference between generation mode and audit mode.
When an LLM is completing code — which is how we use it 90% of the time — it's not in critical mode. It's in collaborative mode. Its implicit goal is to produce code that works and is coherent with the surrounding context. Security is a secondary constraint unless you explicitly push it to the front.
When you specifically ask it to audit, the frame shifts. The same model, with the same code, finds things it didn't flag while generating it.
// Example of what happened to me — reconstructed
// Generation: "complete this DB connection function"
const connectDB = async () => {
  return await mongoose.connect(
    'mongodb://admin:MyPassword123@prod-server:27017/mydb', // the model completed this
    { useNewUrlParser: true }
  );
};

// Audit: "find security problems in this code"
// Response from the same model:
// "Line 3: hardcoded credential in the connection string.
//  Attack vector: exposure in repositories, logs, stack traces.
//  Severity: CRITICAL. Fix: use environment variables."
Same model. Same code. Different prompt, different output.
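And the fix the audit suggests takes a few lines. This is a minimal sketch — the variable name `MONGODB_URI` and the fail-fast behavior are my choices, not a canonical pattern:

```javascript
// Sketch of the audit's suggested fix: the credential lives in the
// environment, never in the source. Names here are placeholders.
function buildMongoUri() {
  const uri = process.env.MONGODB_URI;
  if (!uri) {
    // Fail loudly at startup instead of silently falling back
    // to a hardcoded default.
    throw new Error('MONGODB_URI is not set');
  }
  return uri;
}

// Then the connection function stays credential-free:
// const connectDB = async () =>
//   mongoose.connect(buildMongoUri(), { useNewUrlParser: true });
```

The fail-fast throw matters: a missing variable should break the deploy, not quietly connect to the wrong database.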
That's not a bug in the model. It's a bug in me. I didn't put it in audit mode while reviewing those PRs.
The Numbers That Bother Me
Back to N-Day-Bench. The 20–35% detection rate sounds reasonable until you look at what kind of vulnerabilities it's missing.
Models are reasonably good with:
- SQL injection in classic patterns
- Hardcoded credentials (the benchmark confirms this)
- Obvious XSS in templates
- Dependencies with known CVEs if you give them the package.json
Models consistently fail with:
- Business logic vulnerabilities (the code is "correct" but the flow is exploitable)
- Subtle race conditions
- Authorization problems that require understanding the full data model
- Vulnerabilities that emerge from the interaction between components, not from a single isolated component
That second group is exactly the kind of vulnerability that wrecks you in production. It's not the hardcoded password — you catch that with a grep. It's the endpoint that validates permissions correctly but, combined with an "import configuration" feature, gives you arbitrary path traversal.
N-Day-Bench confirms what I suspected: LLMs are good as a first line of defense against the obvious stuff. They're terrible as substitutes for a real security review.
What I Changed in My Workflow After Reading the Paper
I'm not a security researcher. I'm an architect who learned this the hard way — the same way I learned infrastructure by running rm -rf on a server my first week of hosting work, the same way I learned about cold starts migrating from Vercel to Railway over a weekend.
What I added:
#!/bin/bash
# Pre-commit hook I added to the project
# Doesn't replace anything, it's just the first line of defense
echo "Running basic pre-commit audit..."

# Obvious secrets — only scan lines being added, not the whole diff
if git diff --cached | grep '^+' | grep -iqE \
  "(password|secret|key|token)[[:space:]]*[:=][[:space:]]*['\"][^'\"]{8,}"; then
  echo "⚠️ Possible hardcoded credential detected"
  exit 1
fi

# For the LLM review, this goes in the PR template:
# "Paste the new files into Claude with this prompt:
# 'You are a security auditor. Find security vulnerabilities
# in this code. Be specific: file, line, exploitation mechanism.
# Don't tell me to use HTTPS — I already know that. Give me the non-obvious stuff.'"
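Before trusting a pattern like that in a hook, it's worth sanity-checking it against known-good and known-bad lines. A quick standalone check (the regex is a copy you can tweak):

```shell
# Sanity-check the credential pattern before wiring it into the hook.
PATTERN="(password|secret|key|token)[[:space:]]*[:=][[:space:]]*['\"][^'\"]{8,}"

# Should match: a hardcoded credential
echo 'const password = "MyPassword123"' | grep -iqE "$PATTERN" \
  && echo "caught: hardcoded credential"

# Should NOT match: reading from the environment
echo 'const password = process.env.DB_PASSWORD' | grep -iqE "$PATTERN" \
  || echo "clean: env var usage"
```

Two lines of testing saved me from a pattern that either fired on everything or on nothing.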
The real change isn't technical. It's that the PR template now has a mandatory section: "Security audit prompt output." You can't merge without pasting it in. It forces the frame shift on the model.
Common Mistakes When Using LLMs for Security Review
Mistake 1: Generic prompt. "Does this code have any security issues?" is the worst possible prompt. The model will list OWASP best practices you already know. Better: "Assume I'm an attacker with read access to this repo. How would you exploit this specific code?"
Mistake 2: Insufficient context. You send an isolated function. The model can't detect vulnerabilities that depend on broader context. Send at least the files that interact directly with that code.
Mistake 3: Trusting silence. If the model didn't find anything, it doesn't mean there's nothing there. It means it didn't find anything with that prompt and that context. N-Day-Bench shows that 65–80% of real vulnerabilities pass right through the LLM filter.
Mistake 4: Not iterating. If the model says "I don't see any issues," ask again with the frame shifted: "What unexpected input could break this function?" or "How would you abuse the error handling here?"
Mistake 5: Only using it on new code. The most dangerous vulnerabilities tend to live in old code nobody touches. That code has no tests, no context, and nobody ever puts it in the PR template.
For context on how I think about tools and their limitations, my approach with Docker follows the same logic: understand what the tool actually measures before trusting that it measures what you need.
FAQ: LLMs and Vulnerability Detection
Does N-Day-Bench test real production vulnerabilities or constructed examples?
Real vulnerabilities with assigned CVEs. That's what sets it apart from earlier benchmarks. They take the affected code from the commit that introduced the bug, give the model relevant context, and verify whether it can identify the same problem that the researcher who reported the CVE found. This isn't an academic exercise.
Which model performs best on the benchmark?
In the evaluated versions, frontier models (GPT-4o, Claude 3.5 Sonnet) stay in similar ranges — 30–35% under optimal conditions. The difference between models is smaller than the difference between good and bad prompts with the same model. That's technically interesting and practically important.
Does it make sense to use LLMs for security review if they only find 35%?
Depends on what that 35% replaces. If it replaces zero review, it's a massive improvement. If it replaces a dedicated security engineer, it's a risk. The benchmark doesn't say LLMs are bad at security — it says they're good at a specific subset of vulnerabilities. Using them well means knowing that subset.
Why does the same model generate vulnerable code and then find it in an audit?
The prompt frame changes the behavior. In generation mode, the goal is to complete functional, coherent code. In audit mode, the goal is to find problems. It's not model inconsistency — it's that you're asking it to do two different things. N-Day-Bench operates exclusively in audit mode, which is the one that matters for security review.
Does this replace SAST tools like Semgrep or Snyk?
No, and N-Day-Bench doesn't claim otherwise. SAST is deterministic — it looks for known patterns with high precision. LLMs are probabilistic — they can reason about context and semantics but with less consistency. They're complementary. SAST for the known and systematic, LLMs for reasoning about business logic and emergent patterns.
Does the benchmark account for false positive costs?
There's a real limitation in the paper here: it measures recall (how many real bugs it found) but doesn't measure precision in the same way (how many alerts were noise). In practice, a model that generates 50 alerts per PR with 2 real ones is worse than one that generates 5 with 2 real ones. The authors acknowledge this gap, and future versions of the benchmark should address it.
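The precision gap in that comparison is easy to make concrete. The numbers below are the hypothetical ones from the paragraph above, not benchmark data:

```javascript
// Precision = true positives / total alerts. Same recall (2 real bugs
// found), very different signal-to-noise for the reviewer.
function precision(truePositives, totalAlerts) {
  return truePositives / totalAlerts;
}

const noisyModel = precision(2, 50); // 0.04 — 48 false alarms per PR
const quietModel = precision(2, 5);  // 0.4  — reviewers will actually read it
console.log(noisyModel, quietModel);
```

At 4% precision, reviewers learn to skip the report entirely — which is worse than not running it, because it creates the illusion of coverage.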
What I'd Do Differently
My criticism isn't of N-Day-Bench the paper — it's methodologically honest. It's of how it's going to be read.
The headline "LLMs can find real vulnerabilities" is going to generate confidence where it should generate process. Teams are going to read the 35% as "we run the code through the model and we're done." It doesn't work like that. Same thing with open data — when I sonified Buenos Aires bus traffic I learned that having the data isn't the same as understanding it. The model has the security data. Using it well requires design.
What I'd do differently on a team today:
- Security review prompt library — don't invent the prompt every time. Have 5–6 battle-tested prompts that shift the model's frame in different ways.
- Mandatory LLM audit in PR template — like I did, but with specific prompts, not "does it have problems?"
- Categorize by vulnerability type — use LLMs for what they're good at (credentials, obvious XSS, known patterns) and SAST + human review for business logic.
- Don't treat silence as safety — explicitly document what you audited, with what tool, with what scope.
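The prompt library in the first bullet doesn't need to be fancy. A checked-in object is enough to start — these prompts are mine, adapt them:

```javascript
// Minimal security-review prompt library — an illustrative starting
// point, not a canonical set. Each entry shifts the model's frame
// in a different direction.
const securityPrompts = {
  attacker: 'Assume I am an attacker with read access to this repo. ' +
    'How would you exploit this specific code?',
  inputs: 'What unexpected input could break this function?',
  errors: 'How would you abuse the error handling here?',
  authz: 'List every place this code assumes the caller is already ' +
    'authorized. Which of those assumptions are unchecked?',
  composition: 'Which combinations of these functions, called in an ' +
    'unintended order, produce behavior no single function allows?',
};

// Usage: pick a frame, prepend it to the diff, demand specifics.
function buildAuditPrompt(frame, code) {
  return securityPrompts[frame] +
    '\n\nBe specific: file, line, exploitation mechanism.\n\n' + code;
}
```

Five frames, run separately, surface more than one generic "find vulnerabilities" pass — the benchmark's own prompt sensitivity suggests as much.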
The gap between "finding" and "not committing" is real. I was lying to myself. But the lie wasn't that models are useless for security — it was that the way I was using them was wrong.
The difference between the plumber who fixes and the one who breaks isn't the plumber. It's who's supervising and what you're asking them to do.
That I can control. And now I do.