Why one-shot LLM security audits keep missing real bugs

#ai #security #opensource #devops

The 3am audit that found nothing

Last month I inherited a Node service that had been in production for two years with zero security review. Classic. My first move was the lazy one: dump the repo into an LLM and ask "find me security issues."

I got back a 40-item list. Most of it was garbage. A handful of "SQL injection" findings on code that used parameterized queries. A panicky warning about a hardcoded JWT secret that was actually a test fixture. Two real bugs buried in the noise, both of which I missed on first read because I'd already started ignoring the output.

This is the problem with single-pass LLM security scanning. It feels productive. It produces output. The output is mostly wrong, and the few right answers get lost.

Let me explain why this happens and what the open-source security community has converged on as a fix.

Why single-shot prompting fails at security

A vulnerability isn't a string pattern. It's a property of how data flows through a system. Whether an eval(userInput) call is exploitable depends on where userInput comes from upstream, what middleware ran before it, and what the deployment context is. A model that sees one file at a time, or even the whole repo in one prompt, can't reason about that reliably.
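
To make that concrete, here's a tiny sketch of a pattern-based scanner run against two made-up handlers (the handler code and the regex are mine, purely for illustration). The sink line is identical in both, so a string match flags both, even though only the first one evaluates request data:

```python
import re

# Naive pattern scanner: flags any line that calls eval(...).
SINK = re.compile(r"\beval\s*\(")

def flag_sinks(source: str) -> list[int]:
    """Return 1-based line numbers whose text matches the sink pattern."""
    return [n for n, line in enumerate(source.splitlines(), start=1)
            if SINK.search(line)]

# Hypothetical handler: userInput comes straight from the request.
REACHABLE = """\
function handler(req) {
  const userInput = req.query.expr;
  return eval(userInput);
}"""

# Hypothetical handler: same sink line, but the value is a constant.
CONSTANT = """\
function handler(req) {
  const userInput = "1 + 1";
  return eval(userInput);
}"""

print(flag_sinks(REACHABLE))  # [3]
print(flag_sinks(CONSTANT))   # [3]
```

Telling these two apart requires looking at line 2, not line 3. That's the data-flow question a pattern match never asks.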

Three specific failure modes I keep seeing:

Context collapse. When you stuff 200 files into one prompt, the model treats the whole thing as a soup. It loses the call graph. The thing it flags in utils/sanitize.js is actually safe because every caller already validated the input.
No verification step. The model proposes a finding and then moves on. There's no "prove it" stage. Nothing constructs an exploit path or even checks whether the suspicious sink is reachable.
Hallucinated taint. I've watched models invent function signatures that don't exist, then warn me about a vulnerability in the made-up function. Once it's in the output, it looks identical to a real finding.
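
The hallucinated-taint problem in particular has a cheap, deterministic partial fix: before a finding reaches a human, check that the code it cites actually exists. Here's a minimal sketch, assuming each finding carries a line number and the name of the function or symbol it complains about (that finding shape is my assumption, not something any particular tool emits):

```python
import re

def is_grounded(finding: dict, source: str) -> bool:
    """Reject findings that point at code which is not in the file.

    `finding` is assumed to look like {"line": int, "symbol": str}.
    """
    lines = source.splitlines()
    line_no = finding.get("line")
    symbol = finding.get("symbol", "")
    # The cited line has to exist in the file.
    if not isinstance(line_no, int) or not 1 <= line_no <= len(lines):
        return False
    if not symbol:
        return False
    # The cited symbol has to appear on that line as a whole identifier.
    pattern = r"(?<![\w$])" + re.escape(symbol) + r"(?![\w$])"
    return re.search(pattern, lines[line_no - 1]) is not None


source = "const q = buildQuery(id);\ndb.run(q);"
print(is_grounded({"line": 1, "symbol": "buildQuery"}, source))   # True
print(is_grounded({"line": 1, "symbol": "sanitizeAll"}, source))  # False
```

This says nothing about whether a finding is exploitable. It only filters out findings about code that isn't there, which is exactly the kind that otherwise looks identical to a real one.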

The root cause is that security analysis is a pipeline, not a single question. Static analyzers have known this for decades — that's why tools like CodeQL split things into extraction, query, and result interpretation phases.

The multi-stage agent pattern

The pattern that's emerging in projects like evilsocket/audit is to split the work across multiple specialized agent stages, each with a narrow job and its own context window. You can think of it as a security-focused take on the same agentic patterns we already use for code search and refactoring.

Here's the rough shape. I'm not going to claim this is exactly how audit implements its 8 stages internally — go read the repo for the specifics — but this is the structure that keeps showing up:


[1] Recon       -> map the codebase, identify entry points
[2] Surface     -> list sources (user input) and sinks (dangerous calls)
[3] Triage      -> pick candidate flows worth deep analysis
[4] Deep read   -> trace each candidate through the call graph
[5] Hypothesis  -> propose specific vulnerability + payload
[6] Verify      -> attempt to construct a concrete exploit path
[7] Filter      -> drop unverified / unreachable findings
[8] Report      -> structured output with evidence

The critical insight: each stage gets a fresh context focused on its job. Stage 4 doesn't need to remember that stage 1 found 200 files. It just needs the candidate flow it's analyzing. With less unrelated code in the window, the model has fewer chances to fall into the context collapse described above.
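
One crude way to build that focused context is to hand a later stage only a numbered window of lines around the sink. This is a sketch under a big assumption: a fixed line window is a stand-in for real call-graph tracing and will miss flows that start further away. The function name and the radius parameter are mine:

```python
def context_for(candidate: dict, source: str, radius: int = 3) -> str:
    """Return only the lines near the candidate's sink, with line numbers.

    `candidate` is assumed to carry a 1-based "line" key.
    """
    lines = source.splitlines()
    start = max(1, candidate["line"] - radius)
    end = min(len(lines), candidate["line"] + radius)
    # Prefix each line with its number so the model can cite exact locations.
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))


source = "\n".join(f"line{n}" for n in range(1, 11))
print(context_for({"line": 5}, source, radius=1))
# 4: line4
# 5: line5
# 6: line6
```

Numbering the lines explicitly also means the model doesn't have to count them itself when it reports where something is.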

A minimal version you can build today

You don't need a giant framework to try this. Here's a stripped-down sketch in Python that captures the core idea — a two-stage pipeline that surfaces candidates first, then verifies them in isolated calls:


import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-7"

def find_candidates(file_path: str, source: str) -> list[dict]:
    # Stage 1: cheap pass, just locate suspicious sinks
    resp = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system="You list potentially dangerous sinks. No analysis yet.",
        messages=[{
            "role": "user",
            "content": f"File: {file_path}\n\n{source}\n\n"
                       "List each line number containing a sink "
                       "(exec, eval, subprocess, raw SQL, etc). "
                       "Output JSON: [{line, sink_type}]."
        }]
    )
    # parse_json is a helper you supply: it extracts the JSON array from the reply text
    return parse_json(resp.content[0].text)

def verify_candidate(file_path: str, source: str, candidate: dict) -> dict | None:
    # Stage 2: focused analysis on ONE candidate at a time
    # Fresh context = no contamination from other findings
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=(
            "You verify whether a sink is actually exploitable. "
            "Trace the data backwards. If you cannot construct "
            "a concrete input that reaches the sink unsanitized, "
            "respond with 'NOT_EXPLOITABLE' and stop."
        ),
        messages=[{
            "role": "user",
            "content": f"File: {file_path}\nSink at line {candidate['line']}: "
                       f"{candidate['sink_type']}\n\n{source}\n\n"
                       "Provide: 1) data flow, 2) example payload, "
                       "3) confidence (low/med/high)."
        }]
    )
    # parse_finding is a helper you supply: it returns None when the reply says NOT_EXPLOITABLE
    return parse_finding(resp.content[0].text)
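
The snippet calls parse_json and parse_finding without defining them. Here's one possible shape for those two helpers. It's a sketch, and the reply format it expects (a JSON array somewhere in the text, a "confidence: ..." phrase in the verification reply) is my assumption about what the prompts above will produce:

```python
import json
import re
from typing import Optional

def parse_json(text: str) -> list[dict]:
    """Pull a JSON array out of a model reply; return [] if none parses."""
    # Greedy match from the first "[" to the last "]" in the reply.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only entries that look like candidates with an integer line.
    return [item for item in data
            if isinstance(item, dict) and isinstance(item.get("line"), int)]

def parse_finding(text: str) -> Optional[dict]:
    """Return None for NOT_EXPLOITABLE replies, else a small finding dict."""
    if "NOT_EXPLOITABLE" in text:
        return None
    match = re.search(r"confidence\W*(low|med(?:ium)?|high)", text, re.IGNORECASE)
    level = match.group(1).lower() if match else "unknown"
    if level.startswith("med"):
        level = "med"
    return {"confidence": level, "analysis": text.strip()}
```

Both helpers fail closed: a reply that doesn't parse yields no candidates rather than an exception halfway through a run. Whether that's the right trade-off depends on how much you trust the model to stick to the format.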

The key moves:

Stage 1 is cheap and broad. It doesn't try to be right about exploitability. It just finds things to look at.
Stage 2 runs once per candidate. No cross-contamination. The model can't confuse findings between unrelated files.
The system prompt explicitly allows "not exploitable." This is huge. If your prompt rewards finding bugs, you'll find bugs that aren't there. You have to give the model an honorable way to say nothing's wrong.
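
Wiring the two stages together is a short loop. Here's a sketch of a driver that takes the two stage functions as arguments, so it can be exercised with fakes before spending any tokens (audit_file and the fake stages are names I made up for this example):

```python
from typing import Callable

def audit_file(path: str, source: str,
               find: Callable, verify: Callable) -> list[dict]:
    """Run the cheap pass once, then one isolated verify call per candidate."""
    findings = []
    for candidate in find(path, source):
        result = verify(path, source, candidate)
        # None means the verifier took the "not exploitable" exit.
        if result is not None:
            findings.append({**candidate, **result})
    return findings


# Fake stages standing in for the model calls.
def fake_find(path, source):
    return [{"line": 3, "sink_type": "eval"}, {"line": 9, "sink_type": "raw SQL"}]

def fake_verify(path, source, candidate):
    return {"confidence": "high"} if candidate["sink_type"] == "eval" else None

print(audit_file("app.js", "...", fake_find, fake_verify))
# [{'line': 3, 'sink_type': 'eval', 'confidence': 'high'}]
```

In real use you'd pass find_candidates and verify_candidate from the snippet above in place of the fakes.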

I ran a version of this against the same Node service that produced the 40-item garbage list. Got back 6 findings. 4 were real. That's 4 out of 6, compared with 2 out of 40 from the single-pass run. The signal-to-noise ratio went from "unusable" to "I can fix these before lunch."

Where this still falls down

Be honest with yourself about the limits:

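Cross-file taint is hard. Even staged agents struggle when the source and sink live in different files, and more so when they live in different services. You probably still need a real SAST tool for whole-repo data flow. Semgrep is my default, though as far as I know its cross-file analysis is a paid-tier feature, so check what your edition covers.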
Auth and business logic bugs are invisible. No agent can tell you that /admin/delete should require a role check unless it knows your role model. That context has to come from you.
Cost adds up. A 200-file repo with 50 candidates means 50+ deep-analysis calls, on top of the cheap first-stage pass over every file. Run it on diffs in CI, not on the whole repo on every commit.
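
Limiting a run to the diff mostly comes down to asking git which files changed. Here's a sketch, assuming the base branch is origin/main and that only .js and .ts files are worth scanning (both are placeholders, adjust for your repo):

```python
import subprocess

def parse_changed_files(diff_output: str,
                        extensions: tuple = (".js", ".ts")) -> list[str]:
    """Keep only paths with a scannable extension from `git diff --name-only` output."""
    paths = (line.strip() for line in diff_output.splitlines())
    return [p for p in paths if p.endswith(extensions)]

def changed_files(base: str = "origin/main") -> list[str]:
    """List added, copied, modified, or renamed files relative to the merge base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_changed_files(out)


print(parse_changed_files("src/a.js\nREADME.md\n\nlib/b.ts\n"))
# ['src/a.js', 'lib/b.ts']
```

The --diff-filter flag leaves out deleted files, since there's nothing left in them to scan.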

Prevention: bake it into the pipeline

The trick is to stop treating security review as a quarterly event. A few things I've started doing:

Run the candidate-finding stage on every PR diff. It's cheap. It catches the obvious stuff.
Reserve the deep-verification stage for diffs that touch authentication, file I/O, or anything that takes user input.
Keep a .security-ignore file for known-safe patterns (test fixtures, intentional eval in sandboxes). Re-flagging the same false positive every week trains your team to ignore the tool.
Whatever the agent reports, treat it as a hypothesis, not a verdict. The human still owns the call.
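
The first three items boil down to a routing decision per changed file. Here's a sketch of that decision, assuming .security-ignore holds one glob pattern per line with # comments, and assuming a hand-picked list of sensitive path patterns (the file format and the patterns are my inventions, so pick ones that match your layout):

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for code that deserves the expensive stage.
SENSITIVE = ("*auth*", "*upload*", "*routes/*")

def load_ignore(text: str) -> list[str]:
    """Parse ignore-file text: one glob per line, blank lines and # comments skipped."""
    stripped = (line.strip() for line in text.splitlines())
    return [s for s in stripped if s and not s.startswith("#")]

def plan(path: str, ignore_patterns: list[str]) -> str:
    """Return 'skip', 'deep', or 'candidates' for one changed file."""
    if any(fnmatchcase(path, p) for p in ignore_patterns):
        return "skip"        # known-safe, don't re-flag it every week
    if any(fnmatchcase(path, p) for p in SENSITIVE):
        return "deep"        # candidate pass plus per-candidate verification
    return "candidates"      # cheap pass only


patterns = load_ignore("# test fixtures\ntest/fixtures/*\n")
print(plan("test/fixtures/jwt.js", patterns))  # skip
print(plan("src/auth/login.js", patterns))     # deep
print(plan("src/utils/format.js", patterns))   # candidates
```

Path patterns are a blunt proxy for "takes user input", so expect to tune the list. The value is that the expensive stage stops running on files nobody worries about.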

Multi-stage agents won't replace your security team. But they'll stop wasting your security team's time on hallucinated SQL injection in code that doesn't touch a database. That alone is worth the afternoon it takes to wire one up.