RustDev

Posted on Feb 5

How I Compiled 647 Semgrep Rules to Native Rust

#rust #security #opensource #tutorial

I Love Semgrep. So I Compiled Its Rules to Native Code.

I love Semgrep. It has thousands of community-contributed security rules that catch real vulnerabilities. But every time I ran it on a large codebase, I'd wait... and wait.

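The problem? Semgrep parses and interprets its YAML rules on every run: a Python CLI loads the rules and hands them to the semgrep-core engine (written in OCaml), which matches the patterns at runtime. For a 500K-line monorepo, that meant 4+ minutes per scan.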

So I asked myself: what if I compiled those rules to native code instead?

The Idea

Semgrep rules are just pattern matching. A rule like this:

rules:
  - id: sql-injection
    pattern: execute($QUERY)
    message: "Possible SQL injection"

says "find any call to execute() with exactly one argument." That's not fundamentally different from what tree-sitter does with its query language.

What if I translated Semgrep patterns into tree-sitter queries at build time, embedded them in the binary, and matched against ASTs directly?

The Hard Part: Metavariables

Semgrep uses $VARIABLES to capture arbitrary code:

eval($USER_INPUT)

This matches eval(x), eval(foo.bar), eval(getInput()) — anything.

Tree-sitter queries don't have metavariables. They have captures:

(call_expression
  function: (identifier) @func
  arguments: (arguments (_) @arg))

The @func and @arg are captures — they grab whatever matches that position.

So I built a translator. It parses Semgrep patterns, identifies metavariables, and generates tree-sitter queries with captures in the right places.

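// Simplified version of the pattern compiler
fn compile_pattern(semgrep: &str) -> TreeSitterQuery {
    let ast = parse_semgrep_pattern(semgrep);
    let mut query = String::new();
    for node in ast.walk() {
        match node {
            Metavar(name) => {
                // $X becomes (_) @x: a wildcard node plus a lowercase capture name.
                // The trailing space keeps adjacent fragments from running together.
                query.push_str(&format!("(_) @{} ", name.to_lowercase()));
            }
            Literal(text) => {
                // Escape backslashes and quotes so the literal stays one valid query string
                let escaped = text.replace('\\', "\\\\").replace('"', "\\\"");
                query.push_str(&format!("\"{}\" ", escaped));
            }
            // ... more cases
        }
    }
    TreeSitterQuery::new(query.trim_end())
}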
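To make the translation concrete, here is a minimal, std-only sketch for the narrow case of a simple call pattern like eval($USER_INPUT). It is not the code from the repo: the Arg type and the parse_call_pattern and to_query functions are hypothetical names made up for illustration, the node names (call_expression, identifier, arguments) assume a JavaScript-style grammar like the query above, and the argument splitting is naive (no nested calls, no ... support). It only builds the query text and does not run it through tree-sitter.

```rust
/// One argument slot in a call pattern such as `eval($USER_INPUT)`.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum Arg {
    /// `$NAME`: matches any single expression and captures it.
    Metavar(String),
    /// Anything else is treated as literal source text.
    Literal(String),
}

/// Semgrep metavariable names use only uppercase letters, digits and `_`.
fn is_metavar_name(name: &str) -> bool {
    name.starts_with(|c: char| c.is_ascii_uppercase() || c == '_')
        && name
            .chars()
            .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_')
}

/// Splits `callee(arg, arg)` into the callee name and its argument slots.
/// Returns `None` for anything that is not a simple call pattern.
fn parse_call_pattern(pattern: &str) -> Option<(String, Vec<Arg>)> {
    let open = pattern.find('(')?;
    let inner = pattern.trim_end().strip_suffix(')')?.get(open + 1..)?;
    let callee = pattern[..open].trim().to_string();
    if callee.is_empty() {
        return None;
    }
    let args = inner
        .split(',') // naive: breaks on nested calls that contain commas
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| match s.strip_prefix('$') {
            Some(name) if is_metavar_name(name) => Arg::Metavar(name.to_string()),
            _ => Arg::Literal(s.to_string()),
        })
        .collect();
    Some((callee, args))
}

/// Builds tree-sitter query text: one wildcard capture per argument,
/// anchored with `.` so the argument count has to match exactly.
fn to_query(callee: &str, args: &[Arg]) -> String {
    let mut slots = String::new();
    let mut predicates = format!("(#eq? @callee \"{}\")", callee);
    for (i, arg) in args.iter().enumerate() {
        match arg {
            Arg::Metavar(name) => {
                slots.push_str(&format!(" . (_) @{}", name.to_lowercase()));
            }
            Arg::Literal(text) => {
                // Literals are captured too, then pinned to their source text.
                let escaped = text.replace('\\', "\\\\").replace('"', "\\\"");
                slots.push_str(&format!(" . (_) @lit{}", i));
                predicates.push_str(&format!(" (#eq? @lit{} \"{}\")", i, escaped));
            }
        }
    }
    let tail = if args.is_empty() { "" } else { " ." };
    format!(
        "((call_expression function: (identifier) @callee arguments: (arguments{}{})) {})",
        slots, tail, predicates
    )
}
```

Feeding it eval($USER_INPUT) yields a query with a single anchored wildcard captured as @user_input, plus an #eq? predicate that pins the callee to eval. Pinning literals with #eq? is a choice made for this sketch, not necessarily how the real translator handles them.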

The Ellipsis Problem

Semgrep's ... operator matches "zero or more of anything":

func($ARG, ...)

This matches func(a), func(a, b), func(a, b, c, d, e).

Tree-sitter queries have no direct equivalent that covers every place ... can appear. For these patterns, I fall back to walking the AST manually and checking whether the structure matches.

That's not as fast as a native query, but still faster than interpreting the rule at runtime.
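As a rough illustration of what such a manual check can look like, here is a small backtracking matcher for an argument list. It is a sketch under simplifying assumptions, not the walker from the repo: arguments are plain strings instead of AST nodes, and Slot, Bindings, match_from and match_args are hypothetical names. A ... slot tries to swallow zero, one, two, and so on arguments until the rest of the pattern fits, and a metavariable that appears twice has to bind the same text both times.

```rust
/// One slot in an argument pattern such as `func($ARG, ...)`.
#[derive(Debug, Clone, PartialEq)]
enum Slot {
    /// `$NAME`: consumes exactly one argument.
    Metavar(String),
    /// `...`: consumes zero or more arguments.
    Ellipsis,
}

/// Metavariable name paired with the argument text it matched.
type Bindings<'a> = Vec<(String, &'a str)>;

/// Tries to match `pattern` against `args`, recording bindings in `env`.
/// On failure, `env` is left exactly as it was on entry.
fn match_from<'a>(pattern: &[Slot], args: &[&'a str], env: &mut Bindings<'a>) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: succeed only if no arguments are left over.
        None => args.is_empty(),
        // Let `...` swallow 0, 1, 2, ... arguments and retry the rest.
        Some((Slot::Ellipsis, rest)) => {
            (0..=args.len()).any(|skip| match_from(rest, &args[skip..], env))
        }
        Some((Slot::Metavar(name), rest)) => {
            let (first, remaining) = match args.split_first() {
                Some(parts) => parts,
                None => return false,
            };
            let existing = env.iter().find(|(n, _)| n == name).map(|(_, v)| *v);
            if let Some(bound) = existing {
                // Already bound earlier in the pattern: the text must agree.
                return bound == *first && match_from(rest, remaining, env);
            }
            env.push((name.clone(), *first));
            if match_from(rest, remaining, env) {
                return true;
            }
            env.pop(); // undo the binding before backtracking
            false
        }
    }
}

/// Returns the metavariable bindings if `args` fits `pattern`, else `None`.
fn match_args<'a>(pattern: &[Slot], args: &[&'a str]) -> Option<Bindings<'a>> {
    let mut env = Vec::new();
    if match_from(pattern, args, &mut env) {
        Some(env)
    } else {
        None
    }
}
```

With the pattern [Metavar("ARG"), Ellipsis], any non-empty argument list matches and binds ARG to the first argument, while an empty list does not. The worst case is exponential in the number of ... slots, which is usually acceptable for short argument lists.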

Build-Time Compilation

The magic happens in build.rs. At compile time:

Parse all 647 Semgrep YAML files
Translate each pattern to a tree-sitter query (or AST walker)
Serialize everything to a binary blob
Embed it with include_bytes!()
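The shape of those steps can be sketched with nothing but the standard library. This is an assumption-heavy outline rather than the actual build.rs: rule_files and build_blob are hypothetical helpers, the compile callback stands in for the real YAML-to-query translator, and the serialization format is left entirely to that callback. Only the file name compiled_rules.bin comes from the article.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collects the `.yaml` / `.yml` files directly inside `dir`,
/// sorted so the resulting blob is reproducible across builds.
fn rule_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_yaml = matches!(
            path.extension().and_then(|ext| ext.to_str()),
            Some("yaml") | Some("yml")
        );
        if is_yaml {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}

/// Reads every rule file, runs it through `compile` (a stand-in for the
/// real translator), concatenates the results and writes them into
/// `out_dir/compiled_rules.bin`. Returns the path of the written blob.
fn build_blob(
    rules_dir: &Path,
    out_dir: &Path,
    compile: impl Fn(&str) -> Vec<u8>,
) -> io::Result<PathBuf> {
    let mut blob = Vec::new();
    for file in rule_files(rules_dir)? {
        let yaml = fs::read_to_string(&file)?;
        blob.extend(compile(&yaml));
    }
    let dest = out_dir.join("compiled_rules.bin");
    fs::write(&dest, &blob)?;
    Ok(dest)
}
```

In a real build script, main would typically call something like build_blob with the directory from the OUT_DIR environment variable that Cargo sets, and print a cargo:rerun-if-changed line for the rules directory so the blob is rebuilt when a rule changes. A blob written to OUT_DIR is usually embedded with include_bytes!(concat!(env!("OUT_DIR"), "/compiled_rules.bin")).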

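// In the compiled binary
static RULES: &[u8] = include_bytes!("compiled_rules.bin");

// At runtime: no files to read, just deserialize the embedded bytes
fn load_rules() -> RuleSet {
    // The blob comes from our own build script, so a failure here is a build bug
    bincode::deserialize(RULES).expect("embedded rule blob should deserialize")
}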

No file I/O. No YAML parsing. No pattern compilation. The rules are just there.
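To show the serialize-then-load round trip without pulling in any crates, here is a sketch that swaps bincode for a hand-rolled, length-prefixed format. Everything in it is an assumption for illustration: the CompiledRule struct and its fields, the byte layout, and the function names. In the real binary the bytes would come from include_bytes!, while here they are simply produced in memory.

```rust
/// Hypothetical compiled form of one rule; the real crate's types may differ.
#[derive(Debug, Clone, PartialEq)]
struct CompiledRule {
    id: String,
    message: String,
    query: String,
}

/// Appends a string as a little-endian u32 length followed by its UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Reads a little-endian u32 at `pos` and advances `pos` past it.
fn take_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let end = pos.checked_add(4)?;
    let b = bytes.get(*pos..end)?;
    *pos = end;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Reads one length-prefixed string at `pos` and advances `pos` past it.
fn take_str(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = take_u32(bytes, pos)? as usize;
    let end = pos.checked_add(len)?;
    let raw = bytes.get(*pos..end)?;
    *pos = end;
    String::from_utf8(raw.to_vec()).ok()
}

/// Build-time half: turn the rule set into one byte blob.
fn encode_rules(rules: &[CompiledRule]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(rules.len() as u32).to_le_bytes());
    for rule in rules {
        put_str(&mut out, &rule.id);
        put_str(&mut out, &rule.message);
        put_str(&mut out, &rule.query);
    }
    out
}

/// Runtime half: rebuild the rule set from the embedded bytes.
/// Returns `None` if the blob is truncated or not valid UTF-8.
fn decode_rules(bytes: &[u8]) -> Option<Vec<CompiledRule>> {
    let mut pos = 0;
    let count = take_u32(bytes, &mut pos)?;
    let mut rules = Vec::new();
    for _ in 0..count {
        rules.push(CompiledRule {
            id: take_str(bytes, &mut pos)?,
            message: take_str(bytes, &mut pos)?,
            query: take_str(bytes, &mut pos)?,
        });
    }
    Some(rules)
}
```

Encoding a rule set and decoding the result gives back an equal rule set, and a truncated blob decodes to None instead of panicking. A real project would more likely keep a serde-based format such as bincode, as the article does; the point here is only that loading is a single pass over bytes already in memory.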

Results

On a 500K LOC monorepo:

Tool	Time
Semgrep	4m 12s
RMA	23s

About 11x faster (252 s vs. 23 s). The difference gets bigger as codebases grow.

What's Still Rough

False positives on generated code (working on better heuristics)
Some Semgrep features aren't supported yet (taint mode is partial)
Error messages could be clearer

Try It

cargo install rma-cli
rma scan .

Or with the interactive TUI:

rma scan . --interactive

It's MIT licensed: github.com/bumahkib7/rust-monorepo-analyzer

Would love feedback, especially if you try it on your own projects. What rules are missing? Too many false positives? Let me know.

If you're interested in the pattern compiler implementation, check out crates/rules/build.rs in the repo.
