For years, security teams have been counting vulnerabilities. The question we should have been asking is how they connect, and whether an attacker could chain them into something catastrophic.
The Ceiling Every Security Leader Eventually Hits
There is a particular kind of frustration that sets in after you have invested seriously in a security program and still cannot answer the one question that actually matters: if a sophisticated attacker came after us today, how far would they get?
You have the scanners. You have the SAST integrated into the CI/CD pipeline. You run DAST against your applications regularly. The vulnerability backlog is long but managed. On paper, the posture looks reasonable. And yet that question, how far would they get, remains genuinely unanswerable, because none of your tools are designed to answer it. They are designed to find individual flaws. They are not designed to reason about how those flaws connect.
Real-world attacks are almost never single-step events. The breaches that make headlines, the ones that result in data exfiltration, ransomware deployment, or full infrastructure takeover, are almost always the product of chains. A low-severity information disclosure on one microservice feeds into an authentication bypass on another, which opens a path to a privileged API endpoint, which ultimately reaches the database nobody was thinking about. Each individual link in that chain might be rated "Medium" in isolation. Together, they are a critical breach waiting to happen.
This is the ceiling. You can scan ten thousand endpoints in an hour, but you still cannot see the chain.
Why Traditional Tools Were Never Built for This
To understand what needs to change, it helps to be precise about what traditional security tools actually do and what they fundamentally cannot do.
Most legacy scanning tools are stateless. When a scanner tests one endpoint, it has no memory of what it found on the endpoint it tested five minutes ago. It does not accumulate context. It does not notice that the JavaScript file it just crawled contained a commented-out developer note referencing an internal staging API. It has no intellectual curiosity to pivot based on that discovery. It finishes its checklist and moves on.
This architectural limitation creates what you might call an exploitability gap. Security teams end up with thousands of medium-severity findings and no reliable way to know which three of them, combined in the right sequence, would give an attacker the keys to the entire cloud environment. We have been producing enormous volumes of data while remaining genuinely blind to the insight buried inside it.
Human pentesters solve this problem, but only partially, and at significant cost. A skilled tester brings exactly the kind of stateful reasoning that scanners lack. They remember what they found. They form hypotheses. They adapt. But they are also expensive, scarce, and bounded by time. A two-week engagement against a complex microservices application is not enough time to explore the full depth of what is possible. It never has been.
What It Actually Means for AI To Think Like a Pentester
The phrase "AI-powered security" has been stretched so far in marketing materials that it has nearly lost meaning. A machine learning model that reprioritizes your vulnerability queue is not the same thing as an AI that reasons about your application. The distinction matters enormously in practice, and it is worth being specific about what agentic AI pentesting actually does differently, because the gap between the two is not incremental. It is architectural.
In an agentic system, the large language model acts as a reasoning engine, not a script runner. It does not execute a predefined list of checks. It forms hypotheses, acts on them, observes the results, and adapts its approach based on what it finds. That cognitive loop, running continuously across thousands of requests, is what produces the behavior that actually resembles how a skilled human attacker thinks.
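That loop can be made concrete with a minimal sketch. Everything here is illustrative: the function names, the observation format, and the toy "leads" mechanism are assumptions for the sake of the example, not any real product's API.

```python
def agent_loop(target, seed_hypotheses, max_steps=50):
    """A minimal sketch of the hypothesize-act-observe-adapt loop.

    `target` is any callable mapping a hypothesis to an observation dict.
    All names here are illustrative, not a real tool's interface.
    """
    memory = []                     # every (hypothesis, observation) pair so far
    queue = list(seed_hypotheses)   # hypotheses waiting to be tested
    findings = []
    for _ in range(max_steps):
        if not queue:
            break                   # nothing plausible left to try
        hypothesis = queue.pop(0)
        observation = target(hypothesis)            # act
        memory.append((hypothesis, observation))    # observe: state only grows
        if observation.get("exploitable"):
            findings.append(hypothesis)
        # adapt: each observation can suggest new hypotheses to pursue
        queue.extend(h for h in observation.get("leads", []) if h not in queue)
    return findings, memory
```

The key property is that the queue is not fixed in advance: what the agent tests next depends on what it has already observed, which is exactly the behavior a predefined checklist cannot exhibit.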
The process looks something like this in practice. During reconnaissance, the AI is not just crawling for links. It is analyzing API documentation, reading response headers, observing how the application behaves under different inputs, and building a model of the application's intent. From that model, it generates hypotheses: given that this application uses GraphQL and exposes a debug parameter, what are the most plausible paths to a broken object-level authorization vulnerability?
It then tests those hypotheses. And here is where the behavior diverges most sharply from traditional automation: when a test fails, say a WAF blocks a standard payload, the AI does not log the block and move on. It reasons about the rejection. It considers whether an obfuscated payload or an alternative encoding might slip through. It tries a different angle. This is adaptation, not iteration.
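The "reason about the rejection" step can be illustrated with a toy retry strategy. The transformations below are classic, well-known tricks (percent-encoding, double-encoding, SQL comment padding, case changes); the code is a simplified sketch of the idea, not a real evasion engine.

```python
import urllib.parse

def encoding_variants(payload):
    """Yield progressively transformed forms of a payload to retry after a
    block. A simplified illustration, not a real evasion engine."""
    yield payload                                           # original form
    yield urllib.parse.quote(payload)                       # percent-encoding
    yield urllib.parse.quote(urllib.parse.quote(payload))   # double-encoding
    yield payload.replace(" ", "/**/")                      # SQL comment padding
    yield payload.upper()                                   # case variation

def retry_with_encodings(is_blocked, payload):
    """Try each variant until one gets past the filter; None if all blocked."""
    for variant in encoding_variants(payload):
        if not is_blocked(variant):     # adapt: re-encode and try again
            return variant
    return None
```

Against a naive filter that matches the literal string `' OR`, the first percent-encoded variant already slips through, which is precisely why signature-based blocking is brittle against an adversary that adapts.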
Most critically, it maintains state. If the AI discovers a partial UUID buried in a response header during one request, it stores that. When it later encounters an administrative endpoint, it attempts to use that UUID to impersonate a privileged user. The connection between those two discoveries, separated by potentially thousands of requests, is exactly the kind of connection that stateless tools miss entirely and human pentesters might catch only if they have enough time.
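A toy version of that cross-request memory makes the mechanism tangible. Assume, for illustration, an agent that sifts every response header for UUID-shaped values and keeps them for later use; the class and method names are hypothetical.

```python
import re

# Standard UUID shape: 8-4-4-4-12 hex characters
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)

class AttackMemory:
    """Cross-request state: artifacts noticed in one response are available
    to every later step. Names are illustrative, not a real tool's API."""
    def __init__(self):
        self.artifacts = {}                     # kind -> set of observed values

    def ingest_headers(self, headers):
        """Harvest UUID-shaped tokens from any response header."""
        for value in headers.values():
            for uuid in UUID_RE.findall(value):
                self.artifacts.setdefault("uuid", set()).add(uuid)

    def candidates(self, kind):
        """Everything of a given kind seen so far, for replay elsewhere."""
        return sorted(self.artifacts.get(kind, set()))
```

A UUID harvested during request twelve is still sitting in `candidates("uuid")` when the agent reaches an administrative endpoint thousands of requests later. A stateless scanner has no equivalent of that lookup.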
What 100K Attack Paths Actually Means
The figure sounds like a marketing number until you work through the math of a moderately complex application. Consider an enterprise system with fifty API endpoints, five user roles, and ten meaningful input variations per endpoint. The number of possible sequences through that state space, covering different orderings, different role combinations, and different input chains, expands combinatorially into the hundreds of thousands before you have even accounted for timing, environmental conditions, or multi-step chaining across services.
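The back-of-envelope arithmetic is easy to check. With the numbers above, there are 2,500 distinct single actions, and ordered chains of just two distinct actions already run into the millions, which is why the hundred-thousand figure is better read as a floor than a boast:

```python
from math import perm

endpoints, roles, inputs = 50, 5, 10
actions = endpoints * roles * inputs    # 2,500 distinct single requests
assert actions == 2500

# Ordered sequences of just two distinct actions already dwarf 100K:
two_step_chains = perm(actions, 2)      # 2500 * 2499 = 6,247,500
assert two_step_chains == 6_247_500
```

And that count still ignores timing, environmental conditions, and chains longer than two steps, each of which multiplies the space further.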
Traditional automation covers the intended paths, the ones the developers built and tested. Skilled human pentesters cover the obvious deviant paths, the ones that violate assumptions in predictable ways. An agentic AI penetration testing tool has the computational capacity to explore the deep deviant paths: the one-in-a-hundred-thousand sequence that triggers a race condition, surfaces an unhandled exception, or chains a low-severity disclosure into a critical authorization bypass.
The goal is not to report a hundred thousand issues. Nobody wants that report and nobody would read it. The goal is for the AI to internally explore that space, discard the paths that lead to dead ends, and surface the small number, perhaps ten, perhaps twenty, that represent verified, exploitable chains with real business impact. The hundred thousand paths are the search space. The output is the insight.
From Vulnerability Reports to Verified Exploit Chains
One of the most quietly exhausting parts of working in security is the false positive debate. A scanner flags a high-severity finding. A developer pushes back. The security team spends days producing evidence that the finding is real. The developer spends days arguing that it is not exploitable in their specific configuration. Meanwhile the actual risk sits unaddressed.
Agentic AI changes this dynamic fundamentally by producing proof rather than assertions. Because the AI is reasoning its way through the attack rather than pattern-matching against a signature database, it does not report that a SQL injection might exist. It reports the exact payload it used, the database version it successfully queried, and a reproducible sequence of steps that demonstrates the unauthorized access. The finding is not a theory. It is a dossier.
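The difference between an assertion and a dossier can be expressed as a data shape. The structure below is a hypothetical sketch of what a proof-based finding carries; every field name is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ExploitDossier:
    """What a proof-based finding carries, versus a bare scanner flag.
    Field names are illustrative, not any platform's schema."""
    title: str
    payload: str                  # the exact input that worked
    evidence: str                 # what came back, e.g. the queried DB version
    repro_steps: list = field(default_factory=list)   # ordered, replayable

    def is_verified(self) -> bool:
        """A finding counts as proof only if it can be replayed end to end."""
        return bool(self.payload and self.evidence and self.repro_steps)
```

A scanner flag populates only the title. A verified chain populates all four fields, and `is_verified` is the line between "might be exploitable" and "was exploited, reproducibly."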
When a developer receives a ticket that includes the precise steps an attacker would follow, the data they would access, and a reproducible demonstration of the exploit, the conversation changes entirely. There is no longer a debate about whether the risk is real. The discussion moves immediately to how quickly it can be fixed. That compression of the time between finding and remediation is one of the most significant practical benefits of proof-based validation, and it is something that traditional scanning, regardless of how sophisticated, simply cannot deliver.
How Agentic AI Fits into a Security Program That Already Exists
A reasonable concern when evaluating agentic AI is whether it replaces the security investments an organization has already made. It does not, and any vendor claiming otherwise is oversimplifying. What it does is make those existing investments more valuable by connecting them.
SAST tools find flaws in code, but they lack the runtime context to know whether those flaws are actually reachable from the outside. Agentic AI can take a SAST finding and attempt to reach it through the application's actual attack surface. If it cannot get there, the priority of that fix drops. If it can, and if it can chain that flaw with something else to produce a meaningful exploit, the priority rises accordingly. This is the kind of contextual prioritization that security teams have been trying to do manually for years.
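That prioritization logic is simple enough to sketch. The function below is a toy model of the idea, with an assumed 1-to-4 severity scale and invented inputs; no real platform scores findings this crudely.

```python
def reprioritize(base_severity, reachable, part_of_verified_chain):
    """Toy contextual prioritization: a static finding's priority moves
    with runtime evidence. Scale (1=low .. 4=critical) is illustrative."""
    priority = base_severity
    if not reachable:
        priority = max(1, priority - 2)    # unreachable from outside: drops
    elif part_of_verified_chain:
        priority = min(4, priority + 1)    # chains into an exploit: rises
    return priority
```

A "High" SAST finding that cannot be reached through the attack surface falls to the bottom of the queue; a "Medium" one that chains into a verified exploit jumps ahead of it. That inversion is exactly what severity scores alone cannot express.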
DAST tools provide breadth, meaning continuous wide-coverage scanning that catches obvious issues as they are introduced. Agentic AI provides depth, meaning focused reasoning-driven validation of the complex scenarios that broad scanning cannot reach. These are complementary functions, not competing ones. The instinct to treat them as alternatives is the same instinct that once led teams to choose between SAST and DAST, when the answer was always to use both and understand what each is actually good for.
Human pentesters remain essential, but their role sharpens considerably when agentic AI handles the exploratory work. Instead of spending engagement hours on reconnaissance and surface-level testing, skilled testers can focus entirely on the scenarios that genuinely require human creativity: business logic abuse that requires understanding of domain-specific context, social engineering, physical security, and the kind of lateral thinking that no AI system yet replicates reliably. Agentic AI does not make human pentesters redundant. It makes them significantly more effective by giving them a pre-mapped terrain to work from rather than a blank canvas.
The Governance Problem Nobody Talks About Enough
Giving an AI system the ability to reason like an attacker against your own infrastructure is genuinely powerful. It is also, if done carelessly, genuinely dangerous. This is the part of the agentic AI conversation that deserves more honest attention than it typically receives.
The risks are not hypothetical. An AI system exploring attack paths without proper execution boundaries could inadvertently trigger denial-of-service conditions by generating too many requests too quickly. It could interact with production systems in ways that affect real users. It could follow a reasoning chain into territory that was never intended to be in scope. These are not arguments against agentic AI. They are arguments for building and deploying it with governance as a first principle rather than an afterthought.
The right architecture for this involves several layers working together. Execution should be confined to staging and non-production environments, so that exploit validation never touches live user data or operational systems. Scope boundaries, the equivalent of rules of engagement in a traditional pentest, need to be enforced at the platform level, not just defined in a configuration file that a misconfigured deployment can ignore. Rate limits prevent exploratory behavior from becoming accidental load testing. And every decision the AI makes during its reasoning process should be logged in a form that a security team can audit and a compliance officer can review without needing to understand the underlying model architecture.
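Two of those layers, scope enforcement and rate limiting, fit in a few lines. This is a sketch of the principle under assumed names; production enforcement also needs the audit log, environment isolation, and a kill switch described above.

```python
import time
from urllib.parse import urlparse

class EngagementGuard:
    """Platform-level rules of engagement: a host allowlist plus a simple
    request throttle. Illustrative sketch, not a complete control plane."""
    def __init__(self, allowed_hosts, max_requests_per_sec):
        self.allowed_hosts = set(allowed_hosts)
        self.min_interval = 1.0 / max_requests_per_sec
        self._last_sent = 0.0

    def authorize(self, url):
        """Refuse out-of-scope targets; throttle in-scope ones."""
        host = urlparse(url).hostname
        if host not in self.allowed_hosts:
            raise PermissionError(f"{host} is out of scope")
        wait = self.min_interval - (time.monotonic() - self._last_sent)
        if wait > 0:
            time.sleep(wait)          # throttle instead of flooding the target
        self._last_sent = time.monotonic()
```

The important design choice is that `authorize` sits between the reasoning engine and the network: the AI can hypothesize whatever it likes, but no request leaves without passing a check the AI cannot rewrite.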
The explainability question is particularly important. A CISO authorizing agentic AI testing needs to be able to answer, at any point, why the system chose a specific attack path. Not because they distrust the AI, but because they are accountable for what it does. Platforms that provide a full reasoning trace, a step-by-step log of the AI's internal logic from hypothesis through execution to finding, give security leaders the visibility they need to operate responsibly. Black-box AI has no place in security testing, regardless of how impressive its output might be.
What Changes When You Can See the Whole Chain
There is a meaningful difference between knowing that vulnerabilities exist in your application and knowing how an attacker would actually use them. Security programs that operate only at the first level, finding individual flaws, scoring them by severity, and working through the backlog, are measuring the wrong thing. They are measuring the presence of problems, not the presence of risk.
The mental model shift that agentic AI enables is moving from a static inventory of weaknesses to a dynamic map of exposure. That map answers different questions. Not "how many critical findings do we have this quarter" but "what is the worst thing an attacker could do to us right now, and how would they do it." Those are the questions that boards ask after a breach. The organizations investing in attack path reasoning are the ones that can answer them before one happens.
That shift in framing also changes how security communicates with the rest of the business. A finding that says "SQL injection vulnerability present in checkout API" lands differently than "we validated a four-step exploit chain that allows an unauthenticated attacker to access the payment records of any customer." The second framing conveys actual business risk. It is the kind of language that drives remediation priority, informs engineering investment decisions, and justifies security budget in terms that a non-technical executive can genuinely understand.
The Question Worth Sitting With
Exploring a hundred thousand attack paths is not a vanity metric. It is the only honest response to the complexity of modern application environments, which are genuinely too intricate for any human team to map exhaustively, no matter how skilled or well-resourced.
The organizations that will handle the next decade of threat evolution well are not necessarily the ones with the largest security budgets. They are the ones that stopped asking "what vulnerabilities do we have" and started asking "how would an attacker actually get to our most critical assets" and then built programs designed to answer that question continuously, with proof, at scale.
The ceiling that every security leader hits eventually is not a budget ceiling or a talent ceiling. It is a reasoning ceiling. Agentic AI is the first technology that meaningfully raises it.