The setup
Last month a friend DM'd me a screenshot. An AI security agent had "discovered" a vulnerability in a popular open-source project. The agent walked through exploitation steps, suggested a patch, the whole nine yards. Looked legit.
Then someone pointed out the CVE ID it kept almost-quoting was from years earlier.
This is going to keep happening. As we wire LLMs into vulnerability research workflows, we run into a problem that doesn't have a clean analogue in traditional static analysis: the tool you're using may have already seen the answer in its training data, and it cannot reliably tell you which findings came from reasoning and which came from memory.
I've spent the last few months adding AI-assisted triage to a security workflow at a contracting gig. Here's what I've learned about not getting fooled.
Why this happens (the root cause)
LLMs train on whatever crawlable text is on the open internet. That includes:
- The full NVD database
- GitHub Security Advisories
- CVE writeups on blogs
- Bug bounty disclosures (after the embargo lifts)
- Mailing list archives (oss-security, full-disclosure, etc.)
- Project changelogs and commit messages
If a CVE was disclosed before a model's training cutoff, the model has very likely seen a description of the bug, the patch, and probably someone's analysis of it. When you point that same model at the vulnerable file, it isn't always finding the bug — sometimes it's recognizing it.
The tricky part: the model usually can't tell you which is which. It generates the same confident output either way. There's no internal flag for "I retrieved this from memory" versus "I derived this from the code in front of me."
This is the same phenomenon that makes leaked benchmark questions useless for evaluation — if the benchmark made it into training, the model "solves" it by recall. The security version just has higher stakes.
The validation workflow
Here's the rough process I run on any AI-flagged finding before it gets escalated. None of this is exotic — it's stuff I wish I'd been doing from day one.
Step 1: Check the public databases first
Before you trust any finding, fuzzy-match the bug fingerprint against known CVEs. The NVD publishes JSON data feeds you can pull locally:
import json
from difflib import SequenceMatcher
from pathlib import Path

# NVD yearly feeds: https://nvd.nist.gov/vuln/data-feeds
def load_nvd_feed(year: int) -> list[dict]:
    path = Path(f"nvdcve-1.1-{year}.json")
    return json.loads(path.read_text())["CVE_Items"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(ai_description: str, package: str, threshold: float = 0.55):
    matches = []
    for year in range(2010, 2027):
        for item in load_nvd_feed(year):
            desc = item["cve"]["description"]["description_data"][0]["value"]
            # Cheap pre-filter: only compare CVEs that mention the package
            if package.lower() not in desc.lower():
                continue
            score = similarity(ai_description, desc)
            if score >= threshold:
                matches.append((score, item["cve"]["CVE_data_meta"]["ID"], desc))
    return sorted(matches, reverse=True)
# ai_finding holds the model's vulnerability description, verbatim
hits = find_matches(ai_finding, package="openssl")
for score, cve_id, desc in hits[:5]:
    print(f"{score:.2f} {cve_id}: {desc[:120]}...")
If you get a hit above ~0.6 similarity, your "discovery" is almost certainly a memorized CVE. SequenceMatcher is dumb, but it catches the obvious cases. For better recall, use sentence embeddings (the sentence-transformers library works fine), but start with the dumb thing — it's faster to debug.
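Here's roughly what the embedding upgrade looks like, as a sketch. It assumes the sentence-transformers package and its public all-MiniLM-L6-v2 model; swap in whatever embedding model you already use.

from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; any sentence-embedding model works here
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(a: str, b: str) -> float:
    # Cosine similarity between the two descriptions. Unlike SequenceMatcher,
    # this also catches paraphrased CVE text, not just near-verbatim overlap.
    vecs = _embedder.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(vecs[0], vecs[1]))

Same advice on the cutoff: calibrate it against a few findings you already know are memorized CVEs before trusting it.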
Step 2: Check the timeline
Git history doesn't lie. If the model reports "a buffer overflow in parse_packet," run blame on the offending lines and check what the file looked like at different points in time:
# When was the suspect line introduced?
git log --all --follow -p -- path/to/file.c | head -200
# Did a security fix already land near this code?
git log --all --source --remotes --grep="security\|CVE" \
-- path/to/file.c
If a fix landed for this exact code path years ago and the model is "discovering" it against modern source, you've already got your answer. Either the bug is fixed (and the model is recalling the pre-fix version), or there's a regression — which is worth knowing either way, but it's not a novel discovery.
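Once that grep surfaces a candidate fix commit, you can check mechanically whether it's already in the history you're auditing. A minimal sketch; the function name and repo-path argument are mine, not from any particular tool:

import subprocess

def fix_already_landed(repo: str, fix_commit: str) -> bool:
    # Exit code 0 means fix_commit is an ancestor of HEAD, i.e. the patch the
    # model may be "remembering" is already in the code you're auditing.
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", fix_commit, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0

If this returns True and the model still flags the bug against current source, you're looking at recall or a regression, not a novel discovery.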
Step 3: Force the model to reason from scratch
Here's a trick that's saved me a lot of time. Run the analysis again with the package name and obvious identifiers redacted. Replace function names with hashes:
import re
import hashlib

def anonymize(source: str, package: str) -> str:
    # Strip package name and CVE-ish identifiers the model could pattern-match on
    source = re.sub(rf"\b{re.escape(package)}\b", "PACKAGE_X", source, flags=re.I)
    source = re.sub(r"CVE-\d{4}-\d+", "CVE-REDACTED", source)

    # Hash long identifiers so memorized function names don't trigger recall
    def hash_ident(m: re.Match) -> str:
        return "fn_" + hashlib.sha256(m.group(0).encode()).hexdigest()[:8]

    return re.sub(r"\b[a-z_][a-z0-9_]{6,}\b", hash_ident, source)
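To use it, run the same prompt over both the original and the anonymized source and diff the answers. The ask_model parameter below is a placeholder for whatever LLM call you already have, not a real API:

from collections.abc import Callable

def compare_runs(source: str, package: str, ask_model: Callable[[str], str]) -> tuple[str, str]:
    # ask_model is your own wrapper around whichever LLM you're using (hypothetical)
    prompt = "Audit this code for memory-safety and injection bugs:\n\n"
    original = ask_model(prompt + source)
    blinded = ask_model(prompt + anonymize(source, package))
    return original, blinded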
If the model still flags the same vulnerability class on the anonymized code, the finding is probably grounded in the code in front of it. If it suddenly can't find anything, you were getting recall.
This isn't bulletproof — distinctive code structure can still trigger memory — but it filters out a lot of noise. I haven't tested this thoroughly against every model family, so calibrate your threshold against findings you already know the answer to.
Prevention: building this into your workflow
A few habits that have stuck:
- Treat AI findings as leads, not conclusions. Same as a static analyzer warning. You wouldn't ship a fix for a gosec G104 without reading the code; don't ship one for an LLM finding either.
- Note the model's training cutoff in the report. Any CVE disclosed before that date is suspect by default (see the date-check sketch after this list).
- Cross-check against multiple sources. NVD, GitHub Advisory DB, the project's own security page (for FreeBSD that's freebsd.org/security).
- Require a working PoC before triaging as P1. If the model can't produce a reproducer that actually runs against the current code, the finding is theoretical at best.
- Log the prompt and full output. When you eventually find out a "discovery" was a memory hit, you want to know what the prompt looked like so you can adjust.
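For the training-cutoff habit, the NVD items from step 1 already carry a publishedDate field, so flagging suspect findings is mechanical. The cutoff here is a placeholder; use your model's documented date:

from datetime import date

MODEL_CUTOFF = date(2024, 1, 1)  # placeholder; substitute your model's actual cutoff

def disclosed_before_cutoff(nvd_item: dict) -> bool:
    # NVD 1.1 feed items have a publishedDate like "2021-03-09T16:15Z"
    published = date.fromisoformat(nvd_item["publishedDate"][:10])
    return published < MODEL_CUTOFF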
The uncomfortable truth
Even when an AI tool does genuinely identify a real bug, you usually can't tell from the output alone whether it reasoned its way there or got lucky with memorization. That isn't a bug in any specific tool — it's a property of how these models work. The validation step isn't optional and it isn't going away.
The good news is that the validation is straightforward. The bad news is that I keep meeting teams who skip it because the AI sounded confident.
Don't skip it.