TL;DR
Pattern-matching scanners (Semgrep, Snyk, CodeQL) find what their rulebook encodes. Bugs that arrive as structural variants — the sink is three calls away, the taint flows through an unusual shape, the CVE matters but the pattern doesn't match verbatim — slip through.
I built mythos-agent, an open-source AI code reviewer (MIT, TypeScript, GitHub), to layer an LLM-based hypothesis stage on top of a traditional SAST foundation. This post is the technical writeup: what the pipeline looks like, what bug classes it surfaces that regex-only scanners miss, and where it still gets things wrong.
npx mythos-agent scan # pattern scan, no API key
npx mythos-agent hunt # full AI hypothesis + analyzer pipeline
1. The problem: rulebook coverage vs. bug space
A pattern scanner's ruleset is a finite set of (sink, source, condition) triples. A security reviewer reading the same code carries a much larger implicit model — they notice that this DB transaction reads and writes the same row without locking, that this handler joins a user-supplied path against a config root without resolving symlinks, that this eval receives a value that's been stringified three functions upstream.
Concrete example. Semgrep's default TypeScript ruleset catches this:
app.get('/run', (req, res) => {
eval(req.query.code); // flagged: eval() on request input
});
It does not catch this, even though it's the same bug:
function normalise(input: unknown) {
return String(input).trim();
}
function buildPayload(raw: string) {
return normalise(raw);
}
app.get('/run', (req, res) => {
const payload = buildPayload(req.query.code as string);
new Function(payload)(); // not flagged: sink ≠ eval, source is 2 calls away
});
The pattern rule is looking for eval(<tainted>) literally. The real bug is <any dynamic-code sink>(<tainted, possibly transformed, possibly renamed>). You can write a Semgrep rule for this variant — but you can only write rules for variants you've already thought of. The space of "things that behave like eval" is open-ended.
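To make that open-endedness concrete, here is a toy version of the rulebook approach — a hard-coded dynamic-code sink list (illustrative only, not mythos-agent's or Semgrep's actual rules). Every entry is a variant someone already thought of; anything off the list passes silently:

```typescript
// Toy sink enumeration: each entry is one variant someone already
// thought of. Anything not on the list is invisible to the scanner.
const DYNAMIC_CODE_SINKS: RegExp[] = [
  /\beval\s*\(/,
  /\bnew\s+Function\s*\(/,
  /\bsetTimeout\s*\(\s*["'`]/, // string-argument setTimeout
  /\bvm\.runInThisContext\s*\(/,
];

function flagSinks(source: string): number[] {
  // Return 1-based line numbers that contain a known sink.
  return source
    .split("\n")
    .flatMap((line, i) =>
      DYNAMIC_CODE_SINKS.some((re) => re.test(line)) ? [i + 1] : []
    );
}
```

The list can only grow by enumeration — which is exactly the coverage gap the hypothesis stage is meant to close.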
2. The approach: hypothesis generation per function
The mythos-agent pipeline is four stages:
Recon → Hypothesize → Analyze → Exploit (optional)
The interesting stage is Hypothesize. For each function the parser extracts, a prompted LLM agent produces specific, code-grounded security claims — not CWE labels, but statements about this code:
- "This handler reads `req.query.path` and passes it to `fs.readFileSync` via `path.join(ROOT, userPath)` without resolving symlinks. Potential path traversal if the filesystem contains symlinks pointing outside `ROOT`."
- "This transaction reads `balance` at line 42 and writes `balance - amount` at line 51, without wrapping in `SELECT … FOR UPDATE` or an equivalent lock. Potential TOCTOU race allowing double-spend under concurrent requests."
The hypotheses are inputs to the next stage, not outputs to the user.
3. The analyzer: grading hypotheses against the code
A separate analyzer agent re-reads the function with the hypothesis attached and decides whether the claim actually holds given the control flow, input reachability, and sink characteristics. Findings get a confidence score in [0, 1]; `--severity high` only surfaces results above a threshold.
This two-stage split matters. The hypothesis stage is allowed to be speculative — it's cheap to generate a hypothesis that turns out to be wrong, and the analyzer will filter it. The analyzer stage is allowed to be conservative. Running them together in a single prompt collapses the useful separation: the model both proposes and evaluates, and in practice that means it emits plausibility-matched false positives.
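The split can be sketched as types and a filter. The interfaces and the 0.8 threshold below are assumptions for illustration, not mythos-agent's actual internals:

```typescript
// Illustrative types for the two-stage split (not the real interfaces).
interface Hypothesis {
  file: string;
  line: number;
  claim: string; // code-grounded security claim about this function
}

interface Finding extends Hypothesis {
  confidence: number; // analyzer-assigned, in [0, 1]
  evidence: string;
}

// Stage 1: speculative — cheap to be wrong.
type Hypothesizer = (fnSource: string) => Promise<Hypothesis[]>;

// Stage 2: conservative — re-reads the code with the claim attached.
type Analyzer = (fnSource: string, h: Hypothesis) => Promise<Finding>;

const HIGH_SEVERITY_THRESHOLD = 0.8; // assumed value for illustration

async function review(
  fnSource: string,
  hypothesize: Hypothesizer,
  analyze: Analyzer
): Promise<Finding[]> {
  const hypotheses = await hypothesize(fnSource);
  const findings = await Promise.all(
    hypotheses.map((h) => analyze(fnSource, h))
  );
  // Only surface findings the analyzer is confident in.
  return findings.filter((f) => f.confidence >= HIGH_SEVERITY_THRESHOLD);
}
```

The key property is that the filter sits between the two model calls, so a speculative stage-one claim never reaches the user without stage-two evidence.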
Example output (real, from scanning a test corpus):
✗ src/api/transfer.ts:38 [HIGH, conf 0.88]
Hypothesis: read-modify-write of `balance` without row lock;
concurrent requests can double-spend.
Evidence: line 42 reads `balance`, line 51 writes `balance - amount`;
no FOR UPDATE / transaction isolation in scope.
Suggested: wrap in BEGIN ... SELECT ... FOR UPDATE ... COMMIT,
or use SERIALIZABLE isolation level.
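The suggested remediation, sketched against a hypothetical database client (`Tx`, `query`, and `queryOne` are assumed interfaces for illustration, not a real library):

```typescript
// Hypothetical minimal client interface, for illustration only.
interface Tx {
  query(sql: string, params?: unknown[]): Promise<void>;
  queryOne(sql: string, params?: unknown[]): Promise<{ balance: number }>;
}

async function transfer(db: Tx, account: string, amount: number): Promise<void> {
  await db.query("BEGIN");
  try {
    // FOR UPDATE takes a row lock at read time, so a concurrent
    // transfer blocks here instead of reading a stale balance.
    const { balance } = await db.queryOne(
      "SELECT balance FROM accounts WHERE id = $1 FOR UPDATE",
      [account]
    );
    if (balance < amount) throw new Error("insufficient funds");
    await db.query(
      "UPDATE accounts SET balance = balance - $2 WHERE id = $1",
      [account, amount]
    );
    await db.query("COMMIT");
  } catch (e) {
    await db.query("ROLLBACK");
    throw e;
  }
}
```

`SERIALIZABLE` isolation is the broader alternative the finding mentions; `FOR UPDATE` is the narrower fix because it locks only the row being debited.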
4. Structural variant analysis
Given a reference CVE (from NVD, or a user-supplied patch), the variant analyzer searches the codebase for AST-shape-similar regions with semantic-role matching on inputs/sinks. Similar in spirit to what Google Project Zero described in the public Big Sleep writeup, applied to an open-source TypeScript toolchain.
The use case this actually solves: "we patched bug X in module A; are there other places in the codebase that look like module A before the patch?" Regex search over git diff misses these because the variant can rename the variables, reorder the statements, split a helper out, etc.
5. What's in the box
- 43 scanner categories (15 production-wired, 28 experimental): SQL injection, SSRF, path traversal, command injection, XSS, JWT algorithm confusion, session handling, race conditions, crypto audit, secrets, IaC misconfig, supply chain, AI/LLM security, API security, cloud misconfig, zero trust, privacy/GDPR, GraphQL, WebSocket, CORS, OAuth, SSTI, and more.
- 329+ built-in rules across 8 languages (TypeScript, JavaScript, Python, Go, Java, PHP, C/C++, Rust). Rules compose — "SQL injection" is N smaller rules, not one regex.
- Output: SARIF 2.1.0 (drop-in for GitHub Code Scanning), HTML reports, JSON for piping.
- Backends: Claude, GPT-4o, Ollama, or any OpenAI-compatible endpoint. Pattern-only mode works offline without any API key — the hypothesis stage is opt-in.
- Releases are Sigstore-signed (cosign) with CycloneDX SBOMs attached to each GitHub release.
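For context on the SARIF output above, the minimal 2.1.0 shape a finding maps to looks roughly like this. Field names follow the SARIF spec; the rule id, message, and `$schema` URI shown are illustrative, not mythos-agent's actual output:

```typescript
// Minimal SARIF 2.1.0 log with one result (illustrative values).
interface SarifLog {
  version: "2.1.0";
  $schema: string;
  runs: Array<{
    tool: { driver: { name: string } };
    results: Array<{
      ruleId: string;
      level: "none" | "note" | "warning" | "error";
      message: { text: string };
      locations: Array<{
        physicalLocation: {
          artifactLocation: { uri: string };
          region: { startLine: number };
        };
      }>;
    }>;
  }>;
}

const log: SarifLog = {
  version: "2.1.0",
  $schema: "https://json.schemastore.org/sarif-2.1.0.json",
  runs: [
    {
      tool: { driver: { name: "mythos-agent" } },
      results: [
        {
          ruleId: "race-condition/double-spend", // hypothetical rule id
          level: "error",
          message: { text: "read-modify-write of `balance` without row lock" },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: "src/api/transfer.ts" },
                region: { startLine: 38 },
              },
            },
          ],
        },
      ],
    },
  ],
};
```

GitHub Code Scanning consumes this shape directly via the code-scanning upload API.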
6. Where it still gets things wrong
Hypothesis-driven scanning is not free. Honest limits:
- Dynamically-typed languages (Python, JS) produce more noise than statically-typed ones. Type information is a signal the analyzer leans on heavily; without it, confidence scores drift lower and the high-severity filter leaves more on the floor.
- Inter-procedural taint across package boundaries still loses signal. If the tainted value crosses into a third-party dep with no source, the hypothesis stage has to reason about the dep's public surface, and it often over-generates.
- Cost. Running the hypothesis stage across a 100k-LOC codebase with Claude or GPT-4o is not free. The `--severity high` filter helps; incremental scans on changed files help more. CI integration should scope to diff-only by default.
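Diff-only scoping is straightforward to sketch. The `main` base ref and the extension list below are assumptions for illustration; the real CLI may expose this differently:

```typescript
// Diff-only scoping sketch: scan only files changed since the base ref.
import { execSync } from "node:child_process";

// Pure part: keep only files in languages the scanner covers.
function scannable(files: string[]): string[] {
  return files.filter((f) => /\.(ts|js|py|go|java|php|c|cpp|rs)$/.test(f));
}

// Shell part: "main" as the base ref is an assumption; CI systems
// usually expose the PR's real base branch.
function changedFiles(baseRef = "main"): string[] {
  const out = execSync(`git diff --name-only ${baseRef}...HEAD`, {
    encoding: "utf8",
  });
  return scannable(out.split("\n").filter(Boolean));
}
```

The three-dot `baseRef...HEAD` form diffs against the merge base, which is what you want in a PR pipeline.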
7. Try it
One command, no install, no API key needed for pattern-only mode:
npx mythos-agent quick # 10-second security check
npx mythos-agent scan # full pattern scan
npx mythos-agent hunt # AI-guided scan (needs a model endpoint)
npx mythos-agent fix --apply # AI-generated patches for high-confidence findings
- GitHub: https://github.com/mythos-agent/mythos-agent
- Landing / docs: https://mythos-agent.com
- Community (EN): https://mythos-agent.com/discord
- Community (CN · 飞书): https://mythos-agent.com/feishu
- Releases: Sigstore-signed, SBOM attached
MIT licensed. v4.0.0 shipped today. If you have a codebase you'd want tested against hypothesis generation (public or a redacted snippet), open an issue or a discussion — I'm specifically looking for cases where the analyzer produces unexpected false positives, since those are the most useful signal for tuning the prompt.
Questions I'd value technical feedback on
- For per-function hypothesis generation, where has the "speculate then analyze" split produced the most noise in systems you've built or used?
- For structural variant analysis on dynamically-typed languages, what's your experience with AST-shape normalisation to get useful similarity scores across Python or JS?
- Which SARIF 2.1.0 consumers beyond GitHub Code Scanning actually render SARIF well, and which silently drop half the fields?
Thanks for reading. ⭐ Star on GitHub if this is useful; open an issue if you find a bug.
