TL;DR
Pattern-matching scanners (Semgrep, Snyk, CodeQL) find what their rulebook encodes. Bugs that arrive as structural variants — the sink is three calls away, the taint flows through an unusual shape, the CVE matters but the pattern doesn't match verbatim — slip through.
I built mythos-agent, an open-source AI code reviewer (MIT, TypeScript, GitHub), to layer an LLM-based hypothesis stage on top of a traditional SAST foundation. This post is the technical writeup: what the pipeline looks like, what bug classes it surfaces that regex-only scanners miss, and where it still gets things wrong.
npx mythos-agent scan # pattern scan, no API key
npx mythos-agent hunt # full AI hypothesis + analyzer pipeline
1. The problem: rulebook coverage vs. bug space
A pattern scanner's ruleset is a finite set of (sink, source, condition) triples. A security reviewer reading the same code carries a much larger implicit model — they notice that this DB transaction reads and writes the same row without locking, that this handler joins a user-supplied path against a config root without resolving symlinks, that this eval receives a value that's been stringified three functions upstream.
Concrete example. Semgrep's default TypeScript ruleset catches this:
app.get('/run', (req, res) => {
eval(req.query.code); // flagged: eval() on request input
});
It does not catch this, even though it's the same bug:
function normalise(input: unknown) {
return String(input).trim();
}
function buildPayload(raw: string) {
return normalise(raw);
}
app.get('/run', (req, res) => {
const payload = buildPayload(req.query.code as string);
new Function(payload)(); // not flagged: sink ≠ eval, source is 2 calls away
});
The pattern rule is looking for eval(<tainted>) literally. The real bug is <any dynamic-code sink>(<tainted, possibly transformed, possibly renamed>). You can write a Semgrep rule for this variant — but you can only write rules for variants you've already thought of. The space of "things that behave like eval" is open-ended.
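To make that open-endedness concrete, here is a toy version of the rulebook approach — a hard-coded dynamic-code sink list (illustrative only, not mythos-agent's or Semgrep's actual rules). Every entry is a variant someone already thought of; anything off the list passes silently:

```typescript
// Toy sink enumeration: each entry is one variant someone already
// thought of. Anything not on the list is invisible to the scanner.
const DYNAMIC_CODE_SINKS: RegExp[] = [
  /\beval\s*\(/,
  /\bnew\s+Function\s*\(/,
  /\bsetTimeout\s*\(\s*["'`]/, // string-argument setTimeout
  /\bvm\.runInThisContext\s*\(/,
];

function flagSinks(source: string): number[] {
  // Return 1-based line numbers that contain a known sink.
  return source
    .split("\n")
    .flatMap((line, i) =>
      DYNAMIC_CODE_SINKS.some((re) => re.test(line)) ? [i + 1] : []
    );
}
```

The list can only grow by enumeration — which is exactly the coverage gap the hypothesis stage is meant to close.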
2. The approach: hypothesis generation per function
The mythos-agent pipeline is four stages:
Recon → Hypothesize → Analyze → Exploit (optional)
The interesting stage is Hypothesize. For each function the parser extracts, a prompted LLM agent produces specific, code-grounded security claims — not CWE labels, but statements about this code:
- "This handler reads `req.query.path` and passes it to `fs.readFileSync` via `path.join(ROOT, userPath)` without resolving symlinks. Potential path traversal if the filesystem contains symlinks pointing outside `ROOT`."
- "This transaction reads `balance` at line 42 and writes `balance - amount` at line 51, without wrapping in `SELECT … FOR UPDATE` or an equivalent lock. Potential TOCTOU race allowing double-spend under concurrent requests."
The hypotheses are inputs to the next stage, not outputs to the user.
3. The analyzer: grading hypotheses against the code
A separate analyzer agent re-reads the function with the hypothesis attached and decides whether the claim actually holds given the control flow, input reachability, and sink characteristics. Findings get a confidence score in [0, 1]; `--severity high` only surfaces results above a threshold.
This two-stage split matters. The hypothesis stage is allowed to be speculative — it's cheap to generate a hypothesis that turns out to be wrong, and the analyzer will filter it. The analyzer stage is allowed to be conservative. Running them together in a single prompt collapses the useful separation: the model both proposes and evaluates, and in practice that means it emits plausibility-matched false positives.
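The split can be sketched as types and a filter. The interfaces and the 0.8 threshold below are assumptions for illustration, not mythos-agent's actual internals:

```typescript
// Illustrative types for the two-stage split (not the real interfaces).
interface Hypothesis {
  file: string;
  line: number;
  claim: string; // code-grounded security claim about this function
}

interface Finding extends Hypothesis {
  confidence: number; // analyzer-assigned, in [0, 1]
  evidence: string;
}

// Stage 1: speculative — cheap to be wrong.
type Hypothesizer = (fnSource: string) => Promise<Hypothesis[]>;

// Stage 2: conservative — re-reads the code with the claim attached.
type Analyzer = (fnSource: string, h: Hypothesis) => Promise<Finding>;

const HIGH_SEVERITY_THRESHOLD = 0.8; // assumed value for illustration

async function review(
  fnSource: string,
  hypothesize: Hypothesizer,
  analyze: Analyzer
): Promise<Finding[]> {
  const hypotheses = await hypothesize(fnSource);
  const findings = await Promise.all(
    hypotheses.map((h) => analyze(fnSource, h))
  );
  // Only surface findings the analyzer is confident in.
  return findings.filter((f) => f.confidence >= HIGH_SEVERITY_THRESHOLD);
}
```

The key property is that the filter sits between the two model calls, so a speculative stage-one claim never reaches the user without stage-two evidence.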
Example output (real, from scanning a test corpus):
✗ src/api/transfer.ts:38 [HIGH, conf 0.88]
Hypothesis: read-modify-write of `balance` without row lock;
concurrent requests can double-spend.
Evidence: line 42 reads `balance`, line 51 writes `balance - amount`;
no FOR UPDATE / transaction isolation in scope.
Suggested: wrap in BEGIN ... SELECT ... FOR UPDATE ... COMMIT,
or use SERIALIZABLE isolation level.
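The suggested remediation, sketched against a hypothetical database client (`Tx`, `query`, and `queryOne` are assumed interfaces for illustration, not a real library):

```typescript
// Hypothetical minimal client interface, for illustration only.
interface Tx {
  query(sql: string, params?: unknown[]): Promise<void>;
  queryOne(sql: string, params?: unknown[]): Promise<{ balance: number }>;
}

async function transfer(db: Tx, account: string, amount: number): Promise<void> {
  await db.query("BEGIN");
  try {
    // FOR UPDATE takes a row lock at read time, so a concurrent
    // transfer blocks here instead of reading a stale balance.
    const { balance } = await db.queryOne(
      "SELECT balance FROM accounts WHERE id = $1 FOR UPDATE",
      [account]
    );
    if (balance < amount) throw new Error("insufficient funds");
    await db.query(
      "UPDATE accounts SET balance = balance - $2 WHERE id = $1",
      [account, amount]
    );
    await db.query("COMMIT");
  } catch (e) {
    await db.query("ROLLBACK");
    throw e;
  }
}
```

`SERIALIZABLE` isolation is the broader alternative the finding mentions; `FOR UPDATE` is the narrower fix because it locks only the row being debited.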
4. Structural variant analysis
Given a reference CVE (from NVD, or a user-supplied patch), the variant analyzer searches the codebase for AST-shape-similar regions with semantic-role matching on inputs/sinks. Similar in spirit to what Google Project Zero described in the public Big Sleep writeup, applied to an open-source TypeScript toolchain.
The use case this actually solves: "we patched bug X in module A; are there other places in the codebase that look like module A before the patch?" Regex search over git diff misses these because the variant can rename the variables, reorder the statements, split a helper out, etc.
5. What's in the box
- 43 scanner categories (15 production-wired, 28 experimental): SQL injection, SSRF, path traversal, command injection, XSS, JWT algorithm confusion, session handling, race conditions, crypto audit, secrets, IaC misconfig, supply chain, AI/LLM security, API security, cloud misconfig, zero trust, privacy/GDPR, GraphQL, WebSocket, CORS, OAuth, SSTI, and more.
- 329+ built-in rules across 8 languages (TypeScript, JavaScript, Python, Go, Java, PHP, C/C++, Rust). Rules compose — "SQL injection" is N smaller rules, not one regex.
- Output: SARIF 2.1.0 (drop-in for GitHub Code Scanning), HTML reports, JSON for piping.
- Backends: Claude, GPT-4o, Ollama, or any OpenAI-compatible endpoint. Pattern-only mode works offline without any API key — the hypothesis stage is opt-in.
- Releases are Sigstore-signed (cosign) with CycloneDX SBOMs attached to each GitHub release.
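For context on the SARIF output above, the minimal 2.1.0 shape a finding maps to looks roughly like this. Field names follow the SARIF spec; the rule id, message, and `$schema` URI shown are illustrative, not mythos-agent's actual output:

```typescript
// Minimal SARIF 2.1.0 log with one result (illustrative values).
interface SarifLog {
  version: "2.1.0";
  $schema: string;
  runs: Array<{
    tool: { driver: { name: string } };
    results: Array<{
      ruleId: string;
      level: "none" | "note" | "warning" | "error";
      message: { text: string };
      locations: Array<{
        physicalLocation: {
          artifactLocation: { uri: string };
          region: { startLine: number };
        };
      }>;
    }>;
  }>;
}

const log: SarifLog = {
  version: "2.1.0",
  $schema: "https://json.schemastore.org/sarif-2.1.0.json",
  runs: [
    {
      tool: { driver: { name: "mythos-agent" } },
      results: [
        {
          ruleId: "race-condition/double-spend", // hypothetical rule id
          level: "error",
          message: { text: "read-modify-write of `balance` without row lock" },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: "src/api/transfer.ts" },
                region: { startLine: 38 },
              },
            },
          ],
        },
      ],
    },
  ],
};
```

GitHub Code Scanning consumes this shape directly via the code-scanning upload API.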
6. Where it still gets things wrong
Hypothesis-driven scanning is not free. Honest limits:
- Dynamically-typed languages (Python, JS) produce more noise than statically-typed ones. Type information is a signal the analyzer leans on heavily; without it, confidence scores drift lower and the high-severity filter leaves more on the floor.
- Inter-procedural taint across package boundaries still loses signal. If the tainted value crosses into a third-party dep with no source, the hypothesis stage has to reason about the dep's public surface, and it often over-generates.
- Cost. Running the hypothesis stage across a 100k-LOC codebase with Claude or GPT-4o is not free. The `--severity high` filter helps; incremental scans on changed files help more. CI integration should scope to diff-only by default.
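Diff-only scoping is straightforward to sketch. The `main` base ref and the extension list below are assumptions for illustration; the real CLI may expose this differently:

```typescript
// Diff-only scoping sketch: scan only files changed since the base ref.
import { execSync } from "node:child_process";

// Pure part: keep only files in languages the scanner covers.
function scannable(files: string[]): string[] {
  return files.filter((f) => /\.(ts|js|py|go|java|php|c|cpp|rs)$/.test(f));
}

// Shell part: "main" as the base ref is an assumption; CI systems
// usually expose the PR's real base branch.
function changedFiles(baseRef = "main"): string[] {
  const out = execSync(`git diff --name-only ${baseRef}...HEAD`, {
    encoding: "utf8",
  });
  return scannable(out.split("\n").filter(Boolean));
}
```

The three-dot `baseRef...HEAD` form diffs against the merge base, which is what you want in a PR pipeline.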
7. Try it
One command, no install, no API key needed for pattern-only mode:
npx mythos-agent quick # 10-second security check
npx mythos-agent scan # full pattern scan
npx mythos-agent hunt # AI-guided scan (needs a model endpoint)
npx mythos-agent fix --apply # AI-generated patches for high-confidence findings
- GitHub: https://github.com/mythos-agent/mythos-agent
- Landing / docs: https://mythos-agent.com
- Community (EN): https://mythos-agent.com/discord
- Community (CN · 飞书): https://mythos-agent.com/feishu
- Releases: Sigstore-signed, SBOM attached
MIT licensed. v4.0.0 shipped today. If you have a codebase you'd want tested against hypothesis generation (public or a redacted snippet), open an issue or a discussion — I'm specifically looking for cases where the analyzer produces unexpected false positives, since those are the most useful signal for tuning the prompt.
Questions I'd value technical feedback on
- For per-function hypothesis generation, where has the "speculate then analyze" split produced the most noise in systems you've built or used?
- For structural variant analysis on dynamically-typed languages, what's your experience with AST-shape normalisation to get useful similarity scores across Python or JS?
- Which SARIF 2.1.0 consumers beyond GitHub Code Scanning actually render SARIF well, and which silently drop half the fields?
Thanks for reading. ⭐ Star on GitHub if this is useful; open an issue if you find a bug.
