I Love Semgrep. So I Compiled Its Rules to Native Code.
I love Semgrep. It has thousands of community-contributed security rules that catch real vulnerabilities. But every time I ran it on a large codebase, I'd wait... and wait.
The problem? Semgrep parses and interprets its YAML rules at runtime on every scan, through a Python CLI wrapped around its matching engine. For a 500K-line monorepo, that meant 4+ minutes per scan.
So I asked myself: what if I compiled those rules to native code instead?
The Idea
Semgrep rules are just pattern matching. A rule like this:
rules:
  - id: sql-injection
    pattern: execute($QUERY)
    message: "Possible SQL injection"
says "find any call to execute() with one argument." That's not fundamentally different from what tree-sitter does with its query language.
What if I translated Semgrep patterns into tree-sitter queries at build time, embedded them in the binary, and matched against ASTs directly?
The Hard Part: Metavariables
Semgrep uses $VARIABLES to capture arbitrary code:
eval($USER_INPUT)
This matches eval(x), eval(foo.bar), eval(getInput()) — anything.
Tree-sitter queries don't have metavariables. They have captures:
(call_expression
  function: (identifier) @func
  arguments: (arguments (_) @arg))
The @func and @arg are captures — they grab whatever matches that position.
So I built a translator. It parses Semgrep patterns, identifies metavariables, and generates tree-sitter queries with captures in the right places.
// Simplified version of the pattern compiler
fn compile_pattern(semgrep: &str) -> TreeSitterQuery {
    let ast = parse_semgrep_pattern(semgrep);
    let mut query = String::new();
    for node in ast.walk() {
        match node {
            Metavar(name) => {
                // $X becomes (_) @x
                query.push_str(&format!("(_) @{}", name.to_lowercase()));
            }
            Literal(text) => {
                query.push_str(&format!("\"{}\"", text));
            }
            // ... more cases
        }
    }
    TreeSitterQuery::new(&query)
}
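To make the translation concrete, here is a minimal, dependency-free sketch that handles just one shape: a call like `name($A, b)` whose arguments are metavariables or plain identifiers. The names `Arg`, `parse_call_pattern`, and `call_pattern_to_query` are hypothetical and are not taken from RMA's code. The node names (`call_expression`, `identifier`, `arguments`) are assumed to follow the grammar used in the query above. The `.` anchors are an addition of this sketch: they pin the argument count, since an unanchored `(_) @arg` would match any argument of any call.

```rust
/// One argument in a (very) simplified Semgrep call pattern.
#[derive(Debug, PartialEq)]
enum Arg {
    /// `$NAME`: matches any single node.
    Metavar(String),
    /// A bare identifier that must match exactly.
    Ident(String),
}

fn is_ident(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_alphanumeric() || c == '_')
}

/// Parse `callee(arg1, arg2)` into the callee name and its arguments.
/// Returns `None` for anything this sketch does not understand
/// (nested calls, `...`, string literals, and so on).
fn parse_call_pattern(pattern: &str) -> Option<(String, Vec<Arg>)> {
    let pattern = pattern.trim();
    let open = pattern.find('(')?;
    let inner = pattern.strip_suffix(')')?.get(open + 1..)?;
    let callee = pattern[..open].trim();
    if !is_ident(callee) {
        return None;
    }
    let mut args = Vec::new();
    for raw in inner.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        let arg = match raw.strip_prefix('$') {
            Some(name) if is_ident(name) => Arg::Metavar(name.to_string()),
            None if is_ident(raw) => Arg::Ident(raw.to_string()),
            _ => return None,
        };
        args.push(arg);
    }
    Some((callee.to_string(), args))
}

/// Emit a tree-sitter query string for the call pattern.
/// Metavariables become wildcard captures; identifiers become
/// `(identifier)` captures constrained by an `#eq?` predicate.
fn call_pattern_to_query(pattern: &str) -> Option<String> {
    let (callee, args) = parse_call_pattern(pattern)?;
    if args.is_empty() {
        return None; // zero-argument calls are out of scope for this sketch
    }
    let mut children = String::new();
    let mut predicates = format!(" (#eq? @callee \"{}\")", callee);
    for (i, arg) in args.iter().enumerate() {
        match arg {
            // $X becomes (_) @x, followed by an anchor to the next sibling
            Arg::Metavar(name) => {
                children.push_str(&format!(" (_) @{} .", name.to_lowercase()));
            }
            Arg::Ident(text) => {
                children.push_str(&format!(" (identifier) @_lit{} .", i));
                predicates.push_str(&format!(" (#eq? @_lit{} \"{}\")", i, text));
            }
        }
    }
    Some(format!(
        "((call_expression function: (identifier) @callee arguments: (arguments .{})){})",
        children, predicates
    ))
}
```

Calling `call_pattern_to_query("execute($QUERY)")` yields a single-line query that captures the argument as `@query` and restricts the callee to `execute`. Anything the sketch can't parse, such as `func($ARG, ...)`, returns `None`, which is where a fallback matcher would take over. Note that Semgrep also requires repeated metavariables to bind to the same code, which this sketch does not attempt.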
The Ellipsis Problem
Semgrep's ... operator matches "zero or more of anything":
func($ARG, ...)
This matches func(a), func(a, b), func(a, b, c, d, e).
Tree-sitter queries can't express this directly. For these patterns, I fall back to walking the AST manually and checking if the structure matches.
Not as fast as native tree-sitter queries, but still faster than Semgrep's runtime rule interpretation.
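The manual fallback can be sketched as a small recursive matcher over an argument list. This is an illustration under simplifying assumptions, not RMA's implementation: arguments are represented by their source text rather than tree-sitter nodes, and `Pat` and `args_match` are hypothetical names.

```rust
/// One element of an argument pattern.
#[derive(Debug, Clone, PartialEq)]
enum Pat {
    /// `$X`: exactly one argument, whatever it is.
    Metavar,
    /// Source text that must match exactly.
    Literal(String),
    /// `...`: zero or more arguments.
    Ellipsis,
}

/// Returns true if `args` (argument source text, in order) matches `pats`.
fn args_match(pats: &[Pat], args: &[&str]) -> bool {
    match pats.split_first() {
        // Pattern exhausted: succeed only if the arguments are too.
        None => args.is_empty(),
        // Try letting `...` absorb 0, 1, 2, ... arguments.
        Some((Pat::Ellipsis, rest)) => {
            (0..=args.len()).any(|skip| args_match(rest, &args[skip..]))
        }
        // A metavariable consumes exactly one argument.
        Some((Pat::Metavar, rest)) => {
            !args.is_empty() && args_match(rest, &args[1..])
        }
        // A literal consumes one argument with identical text.
        Some((Pat::Literal(text), rest)) => {
            args.first().map_or(false, |arg| *arg == text.as_str())
                && args_match(rest, &args[1..])
        }
    }
}
```

With the pattern `[Pat::Metavar, Pat::Ellipsis]` standing in for `func($ARG, ...)`, the matcher accepts `["a"]`, `["a", "b"]`, and longer lists, and rejects an empty argument list. The backtracking is exponential in the worst case with many ellipses, which is rarely a concern for argument lists but worth knowing.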
Build-Time Compilation
The magic happens in build.rs. At compile time:
- Parse all 647 Semgrep YAML files
- Translate each pattern to a tree-sitter query (or AST walker)
- Serialize everything to a binary blob
- Embed it with include_bytes!()
// In the compiled binary
static RULES: &[u8] = include_bytes!("compiled_rules.bin");

// At runtime - instant loading
fn load_rules() -> RuleSet {
    bincode::deserialize(RULES).unwrap()
}
No file I/O. No YAML parsing. No pattern compilation. The rules are just there.
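For a feel of the build-time half, here is a minimal sketch of the serialize-and-write step using only the standard library. It swaps bincode for a hand-rolled length-prefixed format purely to stay dependency-free, and the function names (`encode_rules`, `decode_rules`, `write_rule_blob`) are hypothetical, not RMA's.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Encode compiled queries as a flat blob: for each query, a little-endian
/// u32 byte length followed by the UTF-8 bytes. (A stand-in for bincode.)
fn encode_rules(queries: &[&str]) -> Vec<u8> {
    let mut blob = Vec::new();
    for q in queries {
        blob.extend_from_slice(&(q.len() as u32).to_le_bytes());
        blob.extend_from_slice(q.as_bytes());
    }
    blob
}

/// Decode the blob back into query strings; `None` if it is malformed.
fn decode_rules(mut blob: &[u8]) -> Option<Vec<String>> {
    let mut out = Vec::new();
    while !blob.is_empty() {
        if blob.len() < 4 {
            return None; // truncated length prefix
        }
        let (len_bytes, rest) = blob.split_at(4);
        let len = u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]])
            as usize;
        if rest.len() < len {
            return None; // truncated payload
        }
        let (text, tail) = rest.split_at(len);
        out.push(String::from_utf8(text.to_vec()).ok()?);
        blob = tail;
    }
    Some(out)
}

/// What a build script would call, passing Cargo's `OUT_DIR` as `out_dir`.
fn write_rule_blob(out_dir: &Path, queries: &[&str]) -> io::Result<PathBuf> {
    let path = out_dir.join("compiled_rules.bin");
    fs::write(&path, encode_rules(queries))?;
    Ok(path)
}
```

In a real build.rs, `main` would read the `OUT_DIR` environment variable that Cargo sets and pass it to `write_rule_blob`. The runtime side would then typically embed the file with `include_bytes!(concat!(env!("OUT_DIR"), "/compiled_rules.bin"))`, since `include_bytes!` resolves relative paths against the source file rather than the build output directory.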
Results
On a 500K LOC monorepo:
| Tool | Time |
|---|---|
| Semgrep | 4m 12s |
| RMA | 23s |
About 11x faster (252s vs. 23s). The difference gets bigger as codebases grow.
What's Still Rough
- False positives on generated code (working on better heuristics)
- Some Semgrep features aren't supported yet (taint mode is partial)
- Error messages could be clearer
Try It
cargo install rma-cli
rma scan .
Or with the interactive TUI:
rma scan . --interactive
It's MIT licensed: github.com/bumahkib7/rust-monorepo-analyzer
Would love feedback, especially if you try it on your own projects. What rules are missing? Too many false positives? Let me know.
If you're interested in the pattern compiler implementation, check out crates/rules/build.rs in the repo.