Bala Paranj
Claude Mythos is Not a Silver Bullet. LLMs Can Find Bugs in Your Code. That's One Class of Security Problem.

Anthropic recently published a guide on using LLMs to secure source code. It describes a six-step loop — threat model, sandbox, discovery, verification, triage, patching — for using Claude Opus to find vulnerabilities in codebases. The process is well-designed. The separation of discovery and verification into independent agents is better than most approaches. The emphasis on threat modeling before scanning is correct. The practical advice (pin your dependencies, sandbox faithfully, don't let the discovery agent self-censor) comes from real experience with real teams.

For the class of problem it addresses — finding known vulnerability patterns in source code — this is a strong guide. Teams should read it and use it.

The concern is with the framing. "Using LLMs to secure source code" presents one class of security problem as if it were the security problem. That framing makes several other classes of security problem — arguably the ones that cause the most damaging breaches — invisible. Not rejected. Not deprioritized. Invisible. When a frontier AI lab publishes a guide that frames security as "find bugs in code and fix them," the industry follows, and the invisible problems stay invisible.

What the guide covers well

The guide addresses pattern-recognizable source code defects: buffer overflows, SQL injections, missing input validation, use-after-free, type confusion — vulnerability classes that have a recognizable shape in code. LLMs are capable here because their training data contains thousands of examples of each pattern. A model that has seen ten thousand SQL injections can spot the ten-thousand-and-first. Discovery is straightforward to parallelize, as the guide notes, and the bottleneck has shifted to verification and patching.

This is real, useful, important work. Source code vulnerabilities are dangerous. Finding them faster is better than finding them slower. None of what follows diminishes that.

But source code bugs are one class of security problem among several, and the others don't respond to this approach at all.

The classes the guide doesn't see

Configuration posture

Does your cloud infrastructure honor the rules you declared? A public S3 bucket isn't a source code vulnerability. There's no buffer overflow to find, no injection to detect. The code that provisioned the bucket might be flawless. The configuration is the problem — and it either matches business intent or it doesn't.

LLM-powered source code scanning cannot find a configuration that violates a posture rule, because the violation isn't in the source code. It's in the deployed state. A different tool, evaluating the configuration against declared invariants, is the right instrument for this class. This guide doesn't mention it, because the guide's frame starts and ends with source code.
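
As a minimal sketch of what "evaluating the configuration against declared invariants" could look like, the Python below checks a snapshot of deployed bucket settings against two declared rules. The bucket names, the `public_read` and `encrypted` fields, and the rules are all hypothetical, invented for illustration; a real tool would pull the state from the cloud provider rather than from a literal dictionary.

```python
# Hypothetical snapshot of deployed bucket settings (deployed state, not source code).
deployed_state = {
    "build-artifacts": {"public_read": False, "encrypted": True},
    "customer-exports": {"public_read": True, "encrypted": True},
}

# Declared invariants: a rule name plus a predicate every bucket must satisfy.
invariants = [
    ("no public read access", lambda cfg: not cfg["public_read"]),
    ("encryption at rest enabled", lambda cfg: cfg["encrypted"]),
]

def posture_violations(state, rules):
    """Return sorted (bucket, rule) pairs for every invariant that does not hold."""
    return sorted(
        (bucket, rule_name)
        for bucket, cfg in state.items()
        for rule_name, holds in rules
        if not holds(cfg)
    )

print(posture_violations(deployed_state, invariants))
```

Nothing in this check reads application source. The input is the deployed state and the output is a list of broken rules, which is why a source code scanner has nothing to match against here.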

Compound and relational risk

Do two safe-looking configurations combine into a dangerous path? An IAM role that can assume another role that can read a sensitive bucket isn't a bug in any service's source code. Each resource is configured correctly in isolation. The risk exists only in the relationship, in how the resources compose.

An LLM scanning each service's code individually would find nothing wrong in any of them, because nothing is wrong in any of them. The vulnerability is an edge in the configuration graph, not a node in the source tree. Source code scanning, however sophisticated, checks nodes. Compound risk lives in edges. The tool that catches it is one that evaluates the graph — deterministically, across resources — not one that reads each file looking for bug patterns.
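
To make the nodes-versus-edges point concrete, here is a small sketch that treats configuration as a directed graph and searches it for a path from a low-privilege role to a sensitive bucket. The resource names and the edge list are hypothetical, and breadth-first search is just one simple way to evaluate such a graph; it is not presented as how any particular product works.

```python
from collections import deque

# Hypothetical configuration graph: an edge A -> B means "A can act as or on B".
edges = {
    "role/ci-runner": ["role/deployer"],        # role assumption (trust relationship)
    "role/deployer": ["bucket/customer-data"],  # read permission
    "role/analyst": ["bucket/public-reports"],
}

def find_path(graph, start, target):
    """Breadth-first search; return the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path(edges, "role/ci-runner", "bucket/customer-data"))
```

Each entry in `edges` looks harmless on its own. The finding only appears when the traversal follows two of them in sequence.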

Intent verification

Does the deployed state match what the business intended? "Make this bucket public" is correct for a CDN and a breach for customer data. The source code that provisions it might be identical. The LLM scanning it would see the same code. The intent is opposite, and the intent is not in the code — it's in a business decision the model was never part of.

A security tool that checks whether the configuration matches declared intent catches the mismatch. A tool that scans source code for known vulnerability patterns doesn't, because there's no vulnerability pattern to match — the code does what it was told to do. The question is whether what it was told is what the business meant. That's a different class of problem.
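
A rough sketch of that check, assuming intent is recorded somewhere outside the code as a per-resource declaration. The two buckets, the `public_read` setting, and the shape of `declared_intent` are hypothetical; the point is only that identical deployed settings produce different verdicts once intent is an input.

```python
# Two buckets with identical deployed settings (hypothetical names and fields).
deployed = {
    "cdn-assets": {"public_read": True},
    "customer-records": {"public_read": True},
}

# Declared intent: a business decision recorded per resource, outside the code.
declared_intent = {
    "cdn-assets": {"public_read": True},
    "customer-records": {"public_read": False},
}

def intent_mismatches(state, intent):
    """Return (resource, setting, declared, actual) for each setting that differs."""
    return [
        (name, key, wanted, state.get(name, {}).get(key))
        for name, settings in sorted(intent.items())
        for key, wanted in settings.items()
        if state.get(name, {}).get(key) != wanted
    ]

print(intent_mismatches(deployed, declared_intent))
```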

Architectural correctness

Is the system designed correctly, not just bug-free? A system can have zero source code vulnerabilities — every buffer bounded, every input validated, every injection prevented — and still be architecturally wrong. Wrong trust boundaries. Wrong blast radius for a failure. Wrong assumptions about what's internal versus external. A microservice that trusts everything on the internal network, by design, has no source code vulnerability. It has a design flaw that a source code scanner cannot see, because the flaw is in the architecture, not the code.

The four things the industry combines into one

Underneath this confusion is a failure to make a classification that is simple enough to fit on a napkin. The security industry routinely combines four distinct things under "code" or "source code" and then builds tools and guides as if they were one problem:

Application code — the source code of the software. Python, Go, Java, the functions and logic. Buffer overflows, SQL injections, and use-after-free vulnerabilities live here. LLM scanning works here because these bugs have recognizable patterns.

Application config — the configuration that governs how the application behaves. Feature flags, connection strings, retry policies, session timeouts, CORS settings. A misconfigured session timeout isn't a code bug — the code correctly reads the config and applies it. The config value is wrong. Scanning the application code doesn't find it.

Infrastructure code — Infrastructure as Code: Terraform, CloudFormation, Pulumi, the templates that provision cloud resources. Tools like Checkov and tfsec scan these templates for misconfigurations before deployment. This is a real and useful category, but it's checking the instructions for building infrastructure, not the state of what was built.

Infrastructure config — the actual deployed state of cloud resources. IAM policies, S3 bucket settings, security groups, VPC configurations, trust relationships. This exists in production right now, regardless of what the Terraform said should exist. Drift happens. Manual changes happen. The deployed state is the reality an attacker sees.

Each of these has different vulnerability classes, needs different tools, operates on different epistemics, and changes at a different rate. Application code changes with every commit. Infrastructure config changes when someone modifies a cloud console setting or when drift accumulates silently. A tool built for one doesn't cover the others, because they're different problems.

The Anthropic guide covers the first one — application code. By framing it as securing source code without distinguishing these four, it lets the reader assume the same approach covers all of them. It doesn't. An LLM scanning your Go code for injection vulnerabilities cannot see that your S3 bucket's deployed configuration violates your posture rules. A tool checking your Terraform templates cannot see that the deployed state has drifted from what the template specified. A tool evaluating your deployed infrastructure config against declared invariants cannot find a buffer overflow in your application code. Each tool is right for its quadrant and wrong for the other three.
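
The gap between infrastructure code and infrastructure config can be shown with a simple diff. The sketch below compares what a template declared with what is deployed; the resource name `sg-web`, its fields, and both dictionaries are hypothetical stand-ins for a parsed template and a live-state snapshot.

```python
def drift(template, deployed):
    """Return {resource: {setting: (templated, actual)}} for every difference."""
    report = {}
    for resource in sorted(set(template) | set(deployed)):
        want = template.get(resource, {})
        have = deployed.get(resource, {})
        diffs = {
            key: (want.get(key), have.get(key))
            for key in sorted(set(want) | set(have))
            if want.get(key) != have.get(key)
        }
        if diffs:
            report[resource] = diffs
    return report

# Hypothetical data: the template's declaration vs. the live state after a console edit.
template = {"sg-web": {"ingress_cidr": "10.0.0.0/8", "port": 443}}
deployed = {"sg-web": {"ingress_cidr": "0.0.0.0/0", "port": 443}}

print(drift(template, deployed))
```

A linter that reads only `template` would pass this resource. The problem exists only in `deployed`, which is the quadrant an attacker actually sees.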

The market confusion exists because vendors, guides, and thought leadership routinely say "code" when they mean one of these four, and the audience hears "all of them." An industry that made this four-way distinction explicit in every guide, every product description, and every conference talk would immediately clarify which tools cover which problems, where the gaps are, and what "comprehensive security" requires: different instruments for each quadrant, not one instrument applied to all four.

Where the tools actually land

                        Code                    Configuration
                ┌───────────────────────┬───────────────────────┐
                │                       │                       │
                │  LLM scanning         │  Largely uncovered    │
                │  (Claude/Mythos)      │                       │
  Application   │  SAST, SCA            │  No mainstream tool   │
                │                       │  checks whether app   │
                │  ← Anthropic's guide  │  config matches       │
                │    lives here         │  declared intent      │
                │                       │                       │
                ├───────────────────────┼───────────────────────┤
                │                       │                       │
                │  IaC linters          │  CSPM scanners        │
                │  (Checkov, tfsec,     │  (Prowler, Wiz)       │
 Infrastructure │   Trivy)              │  check individual     │
                │                       │  resource settings    │
                │  Checks the template  │                       │
                │  before deploy        │  Compound/relational  │
                │                       │  risk across resources│
                │                       │  ← the gap            │
                └───────────────────────┴───────────────────────┘

Mythos fits in one cell: Application × Code. It is a more powerful version of the tool in the top-left quadrant. It does not move into the other three. A 10x improvement in the top-left leaves the bottom-right — where compound cross-resource risk lives in the actual deployed state — exactly as uncovered as it was before.

Where the risk actually lives

                     Low Impact              High Impact
                ┌───────────────────────┬───────────────────────┐
                │                       │                       │
                │  Individual misconfigs│  Known CVEs under     │
                │  that are noise —     │  active exploitation  │
   Visible      │  open port on an      │                       │
  (tools find   │  internal test box,   │  EPSS high-probability│
  it, dashboards│  suppression issues   │  findings             │
   show it)     │  teams mute           │                       │
                │                       │  ← industry focus     │
                ├───────────────────────┼───────────────────────┤
                │                       │                       │
                │  Stale app config     │  Compound cross-      │
                │  that's technically   │  resource paths       │
   Invisible    │  wrong but unreachable│                       │
  (no tool in   │                       │  Intent mismatches    │
   the pipeline │                       │                       │
   sees it)     │                       │  Architectural trust- │
                │                       │  boundary violations  │
                │                       │                       │
                │                       │  ← where breaches live│
                └───────────────────────┴───────────────────────┘

The industry's tools and attention concentrate in the top row. Dashboards show these findings. Metrics track them. Teams triage and remediate them. EPSS improves prioritization within the top-right. LLM scanning accelerates discovery within the top-left.

The bottom-right — invisible and high impact — is where the breach reports live. Compound paths that no single-resource check flags. Configurations that are correct in isolation but wrong in combination. Architectural trust boundaries that exist only as assumptions, never declared, never verified. These risks produce no dashboard finding, no scanner alert, no EPSS score. They surface as breaches, not as findings.

A more powerful model makes the top-left faster. Better prediction makes the top-right smarter. Neither touches the bottom-right. That quadrant needs a different kind of tool — one that evaluates the deployed configuration graph against declared invariants, deterministically, including the cross-resource paths that live in the edges rather than the nodes.

Pattern-recognizable vs intent-dependent

The deeper issue is that the guide treats all security problems as pattern-recognizable — problems where the vulnerability has a shape the model can learn from examples. For source code bugs, this is true. A SQL injection has a shape. A buffer overflow has a shape. An LLM trained on thousands of examples matches the shape.

But configuration posture, compound risk, and intent verification are intent-dependent problems. The verdict depends not on whether the code matches a known vulnerability pattern, but on whether it matches what the operator intended. A public bucket is a vulnerability or a feature, depending on the intent. An overprivileged role is a risk or a deliberate choice, depending on the context. A cross-account trust relationship is an attack vector or a required integration, depending on the business decision. No model trained on vulnerability patterns can distinguish between these, because the difference isn't in the pattern — it's in the intent, which the model doesn't have.

For pattern-recognizable problems, LLMs are a genuine advance. For intent-dependent problems, they're the wrong instrument, because the answer depends on information the model structurally doesn't have: what did the operator mean this configuration to do?

A more powerful model is a better tool for one quadrant, not a silver bullet for all four

Anthropic's newest model, Mythos, is being positioned as a step change in vulnerability discovery. The Roytman article on O'Reilly Radar says to "plan for an order of magnitude more findings over the next 24 months." The capabilities are real — finding a 27-year-old bug in OpenBSD or a 16-year-old bug in FFmpeg that millions of fuzzing runs missed is genuinely impressive work.

But a model that is ten times better at finding application code vulnerabilities is ten times better at quadrant one. It does nothing for quadrants two, three, and four. A more powerful engine in a car that has no brakes is a faster car with no brakes — the improvement makes the missing subsystems more consequential, not less.

Worse, an order of magnitude more findings flowing into a reactive pipeline that already has a 6% patch rate doesn't improve security — it overwhelms it. The teams receiving those findings still need to verify, triage, and patch them. If discovery scales by 10x and the downstream capacity doesn't, the backlog grows by 10x, alert fatigue intensifies, and the signal-to-noise ratio drops. The Roytman article recognizes this and prescribes better prediction for triage. But neither the prediction model nor the more powerful discovery model asks the question that would reduce the input volume: does the configuration that exposes these vulnerabilities to the network satisfy the rules we declared? A vulnerability in code that's unreachable due to correct configuration posture is a finding that never needed to enter the backlog.

There are no silver bullets in security. There never have been. Fred Brooks said it about software engineering in 1986, and it's no less true when the tool is a frontier AI model. A model that finds more bugs faster is a better version of one tool in one quadrant. Security is a property of all four quadrants simultaneously, and the quadrants that have no tool don't become less important because one quadrant got a more powerful one. They become more exposed, because the attention, the budget, and the narrative all flow toward the impressive capability and away from the unglamorous work of verifying configuration, checking compound risk, and declaring intent.

The industry's job is not to celebrate the silver bullet. It's to ask: what class of problem does this solve, what classes does it not, and what covers the rest? That question is simple, it fits on a napkin, and nobody with a silver bullet to sell is going to ask it for you.

The market already demonstrated the cost of not asking it. When Anthropic's Claude Mythos leaked in late March 2026, cybersecurity stocks lost billions in days — CrowdStrike fell 7%, Palo Alto Networks 6%, Tenable 9%, the iShares Cybersecurity ETF dropped 4.5%. The market's logic was the four-quadrant conflation in action: if AI finds vulnerabilities, cybersecurity companies are obsolete. But CrowdStrike does endpoint protection on live networks. Zscaler inspects live traffic through a Zero Trust Exchange. Rubrik does data resilience and recovery. None of them operate in Application × Code — the one quadrant Mythos actually covers. The market treated all of security as one quadrant, panicked, and then corrected: six weeks later, the same ETFs hit record highs as enterprises responded to the new threat by increasing security spend, not decreasing it. The selloff wasn't wrong about the technology. It was wrong about the classification — the same classification error this article exists to prevent.

The 6% signal

The guide's own data tells a story worth pausing on. "As of May 22, 2026, we had disclosed 1,596 vulnerabilities. To our knowledge, 97 of these have been patched." That's a 6% patch rate. Discovery is easy. Everything after discovery is hard.
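
The arithmetic behind that figure, using only the two numbers from the quote:

```python
disclosed = 1596  # vulnerabilities disclosed, per the guide's quote
patched = 97      # of those, known to be patched

patch_rate = patched / disclosed
unpatched = disclosed - patched

print(f"patch rate: {patch_rate:.1%}, unpatched: {unpatched}")
# prints "patch rate: 6.1%, unpatched: 1499"
```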

The guide's response is to improve the post-discovery pipeline — better verification, better triage, better patching. That's correct for the reactive model. But the 6% also asks a different question the guide doesn't consider: how many of those 1,596 would have been prevented if the configuration and architecture that allowed them to exist had been verified against declared rules before deployment? A buffer overflow in source code can't be prevented by configuration verification. But a misconfigured trust boundary that exposes the vulnerable service to the internet — turning a local bug into a remote exploit — could have been caught as a posture violation before the source code bug mattered.

The reactive loop (find → verify → triage → patch) is necessary for the bugs that exist. Verification of configuration posture reduces the number of bugs that are reachable and exploitable in the first place. The two are complementary, not competitive. The guide presents only the first and frames it as comprehensive.
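
One way to picture how the two fit together is a sketch that splits scanner findings by whether verified posture says the affected service is exposed. The finding IDs, service names, and the `internet_exposed` map are hypothetical, and treating unknown services as reachable is an assumption chosen to fail safe.

```python
# Hypothetical findings from a code scanner, keyed by the service they affect.
findings = [
    {"id": "F-1", "service": "billing-api"},
    {"id": "F-2", "service": "batch-worker"},
    {"id": "F-3", "service": "billing-api"},
]

# Hypothetical verified posture: which services the deployed config exposes externally.
internet_exposed = {"billing-api": True, "batch-worker": False}

def split_by_reachability(findings, exposure):
    """Partition finding IDs into (reachable, unreachable); unknown services count as reachable."""
    reachable, unreachable = [], []
    for finding in findings:
        exposed = exposure.get(finding["service"], True)
        (reachable if exposed else unreachable).append(finding["id"])
    return reachable, unreachable

print(split_by_reachability(findings, internet_exposed))
```

The unreachable bugs still exist and still deserve a fix. The sketch only shows that posture verification changes which findings are urgent, something the reactive loop cannot determine from source code alone.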

The stochastic acknowledgment

The guide makes an honest admission: "Models are stochastic, and a large codebase can have a long tail of vulnerabilities that continue to trickle in even when the code is unchanged." Each scan produces different results. Run it again and you find different bugs.

For a bug-hunting tool, that's acceptable — more runs, wider coverage. But this is also an acknowledgment that the tool cannot provide assurance. An auditor doesn't want "we found 47 bugs this run and might find different ones next run." They want "this configuration satisfies these rules, deterministically, reproducibly, same input, same verdict, every time." The guide doesn't distinguish between these two use cases — discovery (stochastic, more runs is better) and assurance (deterministic, reproducibility is the requirement) — because it treats all of security as discovery.

Vassilev's NIST proof (June 9, 2026) formalizes this: no finite set of rules governing a stochastic system is universally complete. Discovery via LLM will always have a long tail. That's fine for discovery. It's disqualifying for assurance. The two serve different purposes and require different instruments.

What the industry needs to hear

LLM-powered source code scanning is a real advance for a real class of security problems. Use it. The guide is well-written and the process is sound for its domain.

But the industry needs to hold two things simultaneously:

First, this solves bug-finding, not security. Security is a property of the whole system — source code, configuration, architecture, intent, composition. A system with zero source code bugs and a misconfigured trust boundary is not secure. A system with zero source code bugs and a compound IAM path to sensitive data is not secure. Securing source code is one layer. Treating it as the whole is how the other layers stay unaddressed.

Second, different classes of problems need different instruments. Pattern-recognizable source code bugs → LLM scanning (the guide's domain). Configuration posture against declared rules → deterministic verification. Compound/relational risk across resources → graph evaluation. Intent verification → declared intent checked mechanically. Each has its own tool, its own epistemics, and its own strengths. Lumping them into "use LLMs to secure code" makes three of the four invisible.

The guide's closing says: "We believe it's getting easier for models to find and exploit vulnerabilities in code. Thus, our work as defenders is to find and fix the vulnerabilities in our code before adversaries exploit them." That's true for the vulnerabilities that live in code. The ones that live in configuration, in composition, in the gap between intent and state — those require a different sentence entirely. And until someone writes that sentence with equal authority, the industry will keep optimizing one class and leaving the others to be discovered in breach reports.


References: Anthropic, "Using LLMs to secure source code" (May 27, 2026); Vassilev, NIST/IEEE Security and Privacy (June 9, 2026). If you work on a class of security problem this article says is missing from the LLM-scanning frame — configuration posture, compound risk, intent verification, architectural correctness — and you have a different view of how it relates to source code scanning, that's the conversation worth having.
