
Gerus Lab


AI Found 271 Bugs in Firefox. Here's What Everyone Got Wrong About It.

Last week, Mozilla dropped a bombshell: Anthropic's Mythos AI model found 271 vulnerabilities in Firefox 150. Headlines screamed. Twitter exploded. "AI is replacing security researchers!" "Zero-days are numbered!" "$20,000 bugs found automatically!"

We spent a day digging into the actual CVE data, commit history, and Mozilla's security advisories. What we found is more nuanced — and in some ways more interesting — than the hype suggests.

The Real Numbers

Let's start with what Mozilla actually claimed: Mythos found 271 vulnerabilities in Firefox 150. That sounds enormous. But when you trace those 271 to actual CVEs, the picture shifts:

  • CVE-2026-6746: Use-after-free in DOM Core & HTML (1 bug, Anthropic-reported directly)
  • CVE-2026-6784: Memory safety bugs in Firefox 150 + Thunderbird 150 (55 bugs)
  • CVE-2026-6785: Memory safety bugs across Firefox ESR branches (154 bugs)
  • CVE-2026-6786: More memory safety bugs across ESR + stable (107 bugs)

That's 317 bugs across four CVEs. But here's the thing — CVEs 6785 and 6786 cover ESR releases and Thunderbird too, not just Firefox 150. So "271 Firefox 150 vulnerabilities" is a number that doesn't cleanly map to those CVEs.
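The arithmetic is easy to check yourself against the per-CVE counts listed above:

```python
# Per-CVE bug counts as listed in Mozilla's advisories (see the list above).
cve_counts = {
    "CVE-2026-6746": 1,    # use-after-free, Anthropic-reported directly
    "CVE-2026-6784": 55,   # Firefox 150 + Thunderbird 150
    "CVE-2026-6785": 154,  # Firefox ESR branches
    "CVE-2026-6786": 107,  # ESR + stable
}

total = sum(cve_counts.values())
print(total)  # 317 -- already more than the headline 271, and two of the
              # four CVEs cover ESR/Thunderbird, not just Firefox 150
```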

Additionally, many of these "memory safety bugs" are the kind that get caught by fuzzing tools like AFL++ or libFuzzer routinely. They're real issues — but calling each one a "vulnerability found by AI" conflates automated pattern-matching with genuine exploit discovery.
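To make the distinction concrete, here is a toy illustration of the mutate-and-retain loop that fuzzers like AFL++ and libFuzzer automate at massive scale. The target function and its "bug" are entirely hypothetical; real fuzzers add coverage instrumentation and far smarter mutation strategies on top of this skeleton:

```python
import random

def parse_header(data: bytes) -> str:
    """Hypothetical target: crashes on a specific malformed input."""
    if len(data) >= 4 and data[:2] == b"MZ":
        if data[2] > 0x7F:  # the planted "bug": an unchecked length byte
            raise IndexError("out-of-bounds read simulated")
        return "ok"
    return "rejected"

def mutate(seed: bytes) -> bytes:
    """Flip one random byte of the seed."""
    buf = bytearray(seed)
    i = random.randrange(len(buf))
    buf[i] = random.randrange(256)
    return bytes(buf)

def fuzz(seed: bytes = b"MZ\x00\x00", iterations: int = 10_000):
    """Blind mutation loop; coverage feedback is what real fuzzers add."""
    corpus = [seed]
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        try:
            if parse_header(candidate) == "ok":
                corpus.append(candidate)  # keep inputs that still parse
        except IndexError:
            return candidate  # crashing input found
    return None

crash = fuzz()
```

No reasoning about memory layout is happening here, which is exactly why "found by fuzzing" and "found by AI analysis" deserve different labels.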

A deep-dive by one security researcher, who spent hours combing the commit history, puts Anthropic's directly attributed novel findings closer to 3 CVEs. The rest were swept up in aggregate memory-safety audits.

What Mythos Actually Did Well

None of this is to say Mythos isn't impressive. It is. The use-after-free CVE-2026-6746 in particular — a classic memory corruption class that leads to RCE — was directly attributed to Anthropic. That's a real find.

What Mythos demonstrated is that LLMs running in an agentic loop (Anthropic's description: "thousands of complementary runs") can:

  1. Navigate large codebases at scale without human fatigue
  2. Pattern-match against known vulnerability classes faster than a human auditor
  3. Generate and verify hypotheses about memory layout and pointer aliasing
  4. Triage findings and push only promising leads to human reviewers
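The four capabilities above can be sketched as a toy campaign loop. Everything here is a stand-in for illustration, not Anthropic's actual harness: `run_probe` fakes one agentic analysis run, and the ranking heuristic (independent-run agreement times confidence) is an assumption about how such triage could work:

```python
from collections import Counter

def run_probe(region, run_id):
    """Hypothetical stand-in for one LLM analysis run over a code region.

    Returns (finding, confidence) pairs; here it fakes a use-after-free
    candidate that several independent runs keep rediscovering.
    """
    if region == "dom/core" and run_id % 3 == 0:
        return [("possible use-after-free in NodeIterator", 0.9)]
    return [("stylistic issue", 0.1)]

def campaign(regions, runs_per_region, top_k=3):
    """Many runs -> aggregate -> push only repeated, high-scoring leads."""
    tally = Counter()
    scores = {}
    for region in regions:
        for run_id in range(runs_per_region):
            for finding, score in run_probe(region, run_id):
                key = (region, finding)
                tally[key] += 1
                scores[key] = max(scores.get(key, 0.0), score)
    # Rank by (how often independent runs agree) x (model confidence).
    ranked = sorted(tally, key=lambda k: tally[k] * scores[k], reverse=True)
    return ranked[:top_k]  # these go to human reviewers

leads = campaign(["dom/core", "layout"], runs_per_region=9)
```

The point of the sketch: no single run has to be brilliant. The value comes from aggregation and triage across thousands of cheap runs.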

This is augmentation, not replacement. The "$20,000 finding" wasn't a single magic insight — it was a long campaign of automated runs that collectively produced several dozen trophies, with humans validating the interesting ones.

The Balance-of-Power Question

Here's what the Mozilla post ("The Zero-Days Are Numbered") gets at, even if the headline overstates it: the economics of vulnerability research are shifting.

Traditional offensive security required:

  • Specialized humans (expensive, scarce)
  • Long audit cycles (weeks to months per codebase)
  • Deep expertise in specific architectures

AI-augmented security now offers:

  • 24/7 automated analysis
  • Parallelized hypothesis generation
  • Consistent coverage of boring code paths humans skip

For defenders, this is net positive. Running Mythos-class tools over your codebase before shipping is becoming as standard as running a linter. We already do this at Gerus-lab on every significant release — we integrated automated LLM-assisted code review into our CI pipeline six months ago, and it's caught three real issues that our human review missed.

For attackers, the same tools are available. The asymmetry isn't that defenders now win — it's that the floor for both sides rises dramatically.

What This Means for How We Build

At Gerus-lab, we work across Web3, AI systems, and SaaS backends where security isn't optional. Our takeaways from the Mythos findings:

1. Memory safety matters even if you write TypeScript

Most of our frontend stack is JS/TS. But our infrastructure runs Rust services, and we integrate C++ libraries via FFI in some blockchain tooling. The Firefox findings are a reminder that memory safety bugs remain the dominant vulnerability class in 2026. If you're calling native code anywhere, audit it with automated tools.
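A cheap first step is simply knowing where your codebase crosses into native code, so automated audits can focus there. This is a minimal sketch with illustrative patterns only; a real pass would cover many more FFI mechanisms:

```python
import re

# Illustrative patterns: places where a "safe" language crosses into
# native code and memory-safety bugs become possible again.
FFI_PATTERNS = {
    "python-ctypes": re.compile(r"\bimport\s+ctypes\b"),
    "rust-unsafe":   re.compile(r"\bunsafe\s*\{"),
    "node-ffi":      re.compile(r"require\(['\"]ffi-napi['\"]\)"),
}

def flag_ffi_boundaries(source: str):
    """Return which native-boundary patterns appear in a source blob."""
    return [name for name, pat in FFI_PATTERNS.items() if pat.search(source)]

flag_ffi_boundaries("import ctypes\nlib = ctypes.CDLL('libfoo.so')")
# -> ["python-ctypes"]
```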

2. "AI found it" ≠ "AI understands it"

The aggregate CVEs contain bugs that were found by fuzzing, not by reasoning. Mythos's contribution was the coordination layer — running many probes, aggregating results, prioritizing. Understanding why a use-after-free is exploitable still requires human expertise. Don't outsource your threat modeling to an LLM yet.

3. Coverage beats cleverness

The most valuable thing AI security tools do isn't finding the sophisticated 0-day. It's exhaustively covering the boring attack surface. The 154-bug CVE-2026-6785 wasn't dramatic — it was methodical. Your CI pipeline should be doing the same.

Here's the basic structure of how we run automated security review in our CI pipeline at Gerus-lab:

```yaml
# .github/workflows/security.yml
name: AI Security Review

on: [push, pull_request]

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # needed so `git diff HEAD~1` has a parent commit

      - name: Static Analysis
        run: |
          pip install semgrep
          semgrep --config=auto --json > semgrep-results.json

      - name: LLM-Assisted Review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          git diff HEAD~1 | python scripts/llm_security_review.py \
            --focus "memory safety, injection, auth bypass" \
            --output security-llm-review.json

      - name: Aggregate & Triage
        run: |
          python scripts/triage.py \
            --semgrep semgrep-results.json \
            --llm security-llm-review.json \
            --min-severity medium \
            --fail-on-critical
```

This won't find a Firefox-level use-after-free (you need deeper static analysis for that). But for web services, APIs, and smart contracts, it catches the class of bugs that make it into production.
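For reference, the aggregate-and-triage step in the workflow above reduces to a small merge-filter-decide function. This is a minimal sketch assuming a simplified result schema (`results` from semgrep, `findings` from the LLM reviewer, each with a `severity` field); the real scripts would parse the tools' actual output formats:

```python
SEVERITIES = ["low", "medium", "high", "critical"]

def triage(semgrep_results: dict, llm_results: dict,
           min_severity: str = "medium", fail_on_critical: bool = True):
    """Merge findings from both tools, filter by severity, pick an exit code."""
    floor = SEVERITIES.index(min_severity)
    findings = (semgrep_results.get("results", [])
                + llm_results.get("findings", []))
    kept = [f for f in findings
            if SEVERITIES.index(f.get("severity", "low")) >= floor]
    has_critical = any(f.get("severity") == "critical" for f in kept)
    exit_code = 1 if (fail_on_critical and has_critical) else 0
    return kept, exit_code

kept, code = triage(
    {"results": [{"severity": "high", "check_id": "sql-injection"}]},
    {"findings": [{"severity": "critical"}, {"severity": "low"}]},
)
# `code` is 1 here: a critical finding survived the filter, so CI fails.
```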

4. The real gap is exploit validation

Mythos found bugs. Mozilla patched them. But "exploitable in practice" is a different bar. CVE-2026-6746 is potentially RCE, but exploit development is still a human-intensive process. The AI-to-weaponized-exploit pipeline isn't automated yet. That gap is narrowing, but it exists.

The Honest Assessment

Mozilla's "The Zero-Days Are Numbered" thesis is probably correct in the long run. But "numbered" doesn't mean "tomorrow." What Mythos demonstrated is that AI can run large-scale, systematic security audits faster and cheaper than human teams — for certain bug classes, in certain codebases.

The 271 number was inflated by aggregate CVEs and cross-version coverage. The real Anthropic-direct contribution was smaller but still meaningful. The most important finding isn't a specific CVE — it's that AI-assisted security auditing is now production-grade for well-structured codebases.

For teams like ours at Gerus-lab building on Web3, AI, and SaaS stacks, the practical move is:

  • Integrate automated LLM code review into CI now
  • Use it for coverage, not cleverness
  • Keep human experts for threat modeling and exploit assessment
  • Follow the actual CVE data, not the headlines

The zero-days aren't numbered yet. But the era of "we didn't have time to audit that module" is ending.


Building something that needs real security review, not just checkbox compliance? We do that at Gerus-lab — from TON smart contract audits to backend hardening for SaaS products. See our work or reach out.
