Bukhari bin mahmoud Kibuka

How I Compiled 647 Semgrep Rules to Native Rust

I Love Semgrep. So I Compiled Its Rules to Native Code.

I love Semgrep. It has thousands of community-contributed security rules that catch real vulnerabilities. But every time I ran it on a large codebase, I'd wait... and wait.

The problem? Semgrep loads and interprets its YAML rules at runtime, on every scan. For a 500K-line monorepo, that meant 4+ minutes per scan.

So I asked myself: what if I compiled those rules to native code instead?

The Idea

Semgrep rules are just pattern matching. A rule like this:

rules:
  - id: sql-injection
    languages: [javascript]  # assumed for illustration
    severity: WARNING
    pattern: execute($QUERY)
    message: "Possible SQL injection"

says "find any call to execute() with exactly one argument." That's not fundamentally different from what tree-sitter does with its query language.

What if I translated Semgrep patterns into tree-sitter queries at build time, embedded them in the binary, and matched against ASTs directly?

The Hard Part: Metavariables

Semgrep uses $VARIABLES to capture arbitrary code:

eval($USER_INPUT)

This matches eval(x), eval(foo.bar), eval(getInput()) — anything.

Tree-sitter queries don't have metavariables. They have captures:

(call_expression
  function: (identifier) @func
  arguments: (arguments (_) @arg))

The @func and @arg are captures — they grab whatever matches that position.

So I built a translator. It parses Semgrep patterns, identifies metavariables, and generates tree-sitter queries with captures in the right places.

// Simplified version of the pattern compiler
fn compile_pattern(semgrep: &str) -> TreeSitterQuery {
    let ast = parse_semgrep_pattern(semgrep);
    let mut query = String::new();
    for node in ast.walk() {
        match node {
            Metavar(name) => {
                // $X becomes (_) @x
                query.push_str(&format!("(_) @{}", name.to_lowercase()));
            }
            Literal(text) => {
                query.push_str(&format!("\"{}\"", text));
            }
            // ... more cases
        }
    }
    TreeSitterQuery::new(&query)
}
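
To make the translation concrete, here is a minimal, self-contained sketch that handles just one pattern shape: a plain function call whose arguments are all metavariables. The function name `call_pattern_to_query` and the restriction to `name($A, $B)` patterns are assumptions for illustration, not code from the repo. The node names (`call_expression`, `identifier`, `arguments`) follow the query shown earlier and differ between grammars.

```rust
/// Translate a call pattern such as `eval($USER_INPUT)` into a tree-sitter
/// query string. Hypothetical helper: it only supports `name($A, $B)`-style
/// patterns and returns None for anything else (including `...`).
fn call_pattern_to_query(pattern: &str) -> Option<String> {
    let pattern = pattern.trim();
    let open = pattern.find('(')?;
    let inner = pattern.strip_suffix(')')?.get(open + 1..)?;
    let func = pattern[..open].trim();
    if func.is_empty() || !func.chars().all(|c| c.is_alphanumeric() || c == '_') {
        return None;
    }

    let mut captures = Vec::new();
    for arg in inner.split(',').map(str::trim).filter(|a| !a.is_empty()) {
        // A metavariable is `$` followed by uppercase letters, digits, or `_`.
        let name = arg.strip_prefix('$')?;
        let valid = !name.is_empty()
            && name
                .chars()
                .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_');
        if !valid {
            return None;
        }
        // $X becomes (_) @x
        captures.push(format!("(_) @{}", name.to_lowercase()));
    }
    if captures.is_empty() {
        return None;
    }

    // The `.` anchors pin the captures to consecutive children, so a
    // one-metavariable pattern only matches calls with one argument.
    // The #eq? predicate restricts the callee to the literal function name.
    Some(format!(
        "(call_expression\n  function: (identifier) @func\n  arguments: (arguments . {} .)\n  (#eq? @func \"{}\"))",
        captures.join(" . "),
        func
    ))
}
```

Calling `call_pattern_to_query("eval($USER_INPUT)")` yields a query with a single `(_) @user_input` capture, while a pattern containing `...` returns `None` and would be routed to a different code path.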

The Ellipsis Problem

Semgrep's ... operator matches "zero or more of anything":

func($ARG, ...)

This matches func(a), func(a, b), func(a, b, c, d, e).

Tree-sitter queries can't express this directly. For these patterns, I fall back to walking the AST manually and checking if the structure matches.

That's not as fast as a native tree-sitter query, but still faster than running the rule through Semgrep.
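
The core of such a fallback can be sketched without tree-sitter at all, by matching a pattern's argument list against a call's actual arguments. The `PatArg` type and `args_match` function are hypothetical names for illustration, and plain strings stand in for AST nodes; the real walker operates on syntax trees. This sketch also does not enforce that a repeated metavariable binds to the same code each time.

```rust
/// One element of a pattern's argument list (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum PatArg {
    /// `$NAME`: matches exactly one argument, whatever it is.
    Metavar(String),
    /// A concrete argument that must match textually.
    Literal(String),
    /// `...`: matches zero or more arguments.
    Ellipsis,
}

/// Returns true if `args` matches `pattern`, trying every possible
/// length for each `...` via backtracking.
fn args_match(pattern: &[PatArg], args: &[&str]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: succeed only if no arguments are left over.
        None => args.is_empty(),
        // Let `...` absorb 0, 1, 2, ... leading arguments.
        Some((PatArg::Ellipsis, rest)) => {
            (0..=args.len()).any(|skip| args_match(rest, &args[skip..]))
        }
        // A metavariable consumes exactly one argument.
        Some((PatArg::Metavar(_), rest)) => {
            !args.is_empty() && args_match(rest, &args[1..])
        }
        // A literal consumes one argument that must be textually equal.
        Some((PatArg::Literal(text), rest)) => {
            args.first() == Some(&text.as_str()) && args_match(rest, &args[1..])
        }
    }
}
```

With the pattern `[Metavar("ARG"), Ellipsis]`, standing in for `func($ARG, ...)`, this accepts `["a"]`, `["a", "b"]`, and longer lists, and rejects an empty argument list.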

Build-Time Compilation

The magic happens in build.rs. At compile time:

  1. Parse all 647 Semgrep YAML files
  2. Translate each pattern to a tree-sitter query (or AST walker)
  3. Serialize everything to a binary blob
  4. Embed it with include_bytes!()

// In the compiled binary
static RULES: &[u8] = include_bytes!("compiled_rules.bin");

// At runtime - instant loading
fn load_rules() -> RuleSet {
    bincode::deserialize(RULES).unwrap()
}

No file I/O. No YAML parsing. No pattern compilation. The rules are just there.
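
Steps 3 and 4 amount to a round trip: encode at build time, decode at startup. The sketch below shows that round trip with a hand-rolled length-prefixed format built on the standard library, purely so it stays self-contained; the snippet above uses bincode instead. `CompiledRule`, `encode_rules`, and `decode_rules` are hypothetical names, and the two-field rule shape is an assumption.

```rust
/// A compiled rule as it might be stored in the blob (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct CompiledRule {
    id: String,
    query: String,
}

/// Append a string as a little-endian u32 length followed by its UTF-8 bytes.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

/// Read a little-endian u32 at `pos`. Returns None if the input is too short.
fn get_u32(bytes: &[u8], pos: usize) -> Option<u32> {
    let b = bytes.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Read one length-prefixed string, advancing `pos` past it.
fn get_str(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = get_u32(bytes, *pos)? as usize;
    let start = *pos + 4;
    let end = start.checked_add(len)?;
    let s = std::str::from_utf8(bytes.get(start..end)?).ok()?.to_owned();
    *pos = end;
    Some(s)
}

/// Build-time side: the bytes a build script could write to a file.
fn encode_rules(rules: &[CompiledRule]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(rules.len() as u32).to_le_bytes());
    for rule in rules {
        put_str(&mut buf, &rule.id);
        put_str(&mut buf, &rule.query);
    }
    buf
}

/// Runtime side: decode the embedded bytes. Returns None on malformed input.
fn decode_rules(bytes: &[u8]) -> Option<Vec<CompiledRule>> {
    let count = get_u32(bytes, 0)? as usize;
    let mut pos = 4;
    let mut rules = Vec::new();
    for _ in 0..count {
        let id = get_str(bytes, &mut pos)?;
        let query = get_str(bytes, &mut pos)?;
        rules.push(CompiledRule { id, query });
    }
    Some(rules)
}
```

In a real build script, the encoded bytes would typically be written under the directory named by Cargo's `OUT_DIR` environment variable and pulled in with `include_bytes!(concat!(env!("OUT_DIR"), "/compiled_rules.bin"))`. Returning `Option` rather than calling `unwrap()` lets the caller decide how to report a corrupt blob.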

Results

On a 500K LOC monorepo, Semgrep versus RMA (rust-monorepo-analyzer, the tool built on this approach):

Tool      Time
Semgrep   4m 12s
RMA       23s

About 11x faster (252 seconds versus 23). The difference gets bigger as codebases grow.

What's Still Rough

  • False positives on generated code (working on better heuristics)
  • Some Semgrep features aren't supported yet (taint mode is partial)
  • Error messages could be clearer

Try It

cargo install rma-cli
rma scan .

Or with the interactive TUI:

rma scan . --interactive

It's MIT licensed: github.com/bumahkib7/rust-monorepo-analyzer

Would love feedback, especially if you try it on your own projects. What rules are missing? Too many false positives? Let me know.


If you're interested in the pattern compiler implementation, check out crates/rules/build.rs in the repo.
