SEN LLC

Posted on Apr 16

Writing a Regex Playground in 200 Lines of Rust

#rust #regex #cli #tutorial

A small CLI that does what grep -E refuses to: shows you the contents of every capture group, including named ones, and lets you try --replace and --split against the same pattern without leaving the shell. The whole thing is two crates and about 200 lines of regex-relevant code.

🔗 GitHub: https://github.com/sen-ltd/regex-test

I have a regex problem and a tooling problem.

The regex problem is the same one everybody has. I write a pattern that almost works on three lines of input, paste it into the script, and discover at 2 a.m. that one of the named capture groups is silently None because I forgot to make a \s+ non-greedy. The fix is always boring. The reason the bug is hard to find is that, at the moment of writing, I had no easy way to inspect every group of every match against the inputs that mattered. I had two options that both lose:

regex101.com. It's the obvious choice and it is genuinely good — multi-flavor support, named-group inspector, replacement preview, the works. But it lives in a browser tab on someone else's server, and the pattern I'm debugging is for parsing log lines from a customer's internal tool. Pasting that into a third-party site is a conversation I would rather not have with our security team.
grep -E. It runs locally, and for a quick "does this match anywhere" question it's perfect. But grep -E only ever shows you matched lines, not capture groups. If your pattern is (?P<user>\w+)@(?P<host>\S+) and you want to see what user and host actually got bound to, grep -E cannot help you. There is grep -oP if you have GNU grep with PCRE, and even then it only prints group 0 unless you write some awk glue around it.

What I wanted was something I could alias rt='regex-test' and use as a scratch pad: hand it a pattern and a string, get back the full match, every capture group with its name and byte offset, and optionally the result of --replace or --split. No browser, no logging, no servers.

So I wrote it. It is one Rust binary, two dependencies (clap and regex), and about 200 lines of regex-touching code spread across three files. This article walks through the parts I think are worth showing.

The choice that determines everything: which regex engine

The Rust ecosystem has roughly three options for matching a regex.

regex. The crate maintained under the rust-lang organization (it is a separate crate, not part of the standard library). RE2-style. Linear time worst case. No backreferences, no lookaround, no recursion. Compiled and bundled in something like 800 KB of Rust code. This is the boring, correct choice.
fancy-regex. Wraps regex and falls back to a backtracking engine when the pattern asks for features regex doesn't support. You get lookbehind, backreferences, and PCRE-flavored extras at the cost of accepting that some pathological patterns can blow up exponentially.
pcre2. Bindings to libpcre2. Closest to "every regex you know from Perl/Python/PHP." Adds a C dependency.

For a CLI debug tool I picked plain regex. The argument is short and I think correct: most of the patterns you debug at 2 a.m. are log-shaped data, and log-shaped data does not need lookbehind to parse. The one feature that actually hurts — backreferences — turns out to be rare in practice and almost always replaceable with a two-pass approach. In return I get a guaranteed linear-time engine and a cargo build --release that produces a 9 MB statically-linked binary. The first time you accidentally feed a runaway PCRE pattern to a script in production and watch it consume the box for 40 minutes, you become a fan of linear-time guarantees.

The right place to put this decision is the README. People who reach for regex-test and miss (?<=foo) deserve to be told upfront, not after they file an issue.

The architecture, such as it is

Three modules and a shim:

src/
├── main.rs          # 90 lines: parse CLI, dispatch, exit code.
├── cli.rs           # clap derive, no project logic.
├── tester.rs        # Pure: pattern + text -> MatchReport.
├── formatter.rs     # Pure: MatchReport -> text / json / highlight.
└── color.rs         # ANSI palette.

The split between tester and formatter is the only architectural choice in the entire program, and it matters because of testability. tester knows nothing about ANSI, JSON, or stdout. formatter knows nothing about regex::Regex. The two of them communicate through a single owned data type called MatchReport whose fields are all String, usize, and Vec. That sounds boring but it is the difference between "I unit-test these two modules in isolation in microseconds" and "I have to spawn a subprocess for every test." For a tool of this size the rule is unambiguous: keep the integration tests for things only the integration test can check (exit codes, JSON shape, ANSI bytes), and let the inner modules be pure.

Concretely, the integration test file is 15 tests built on assert_cmd::Command::cargo_bin("regex-test"), and the inner modules contribute 25 tests that don't spawn anything. That is 40 tests in total, and the binary is built once.
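To make the tester/formatter boundary concrete, here is a minimal sketch of what the shared data and a pure text formatter could look like. The Capture and Match fields mirror the ones used later in this article; the MatchReport wrapper, the format_text name, and the <none> placeholder for unmatched groups are assumptions for illustration, not the repository's actual code.

```rust
// Plain owned data: nothing here borrows from the regex crate or the input.
#[derive(Debug, Clone, PartialEq)]
pub struct Capture {
    pub index: usize,
    pub name: Option<String>,
    pub value: Option<String>,
    pub start: Option<usize>,
    pub end: Option<usize>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Match {
    pub text: String,
    pub start: usize,
    pub end: usize,
    pub captures: Vec<Capture>,
}

// Hypothetical wrapper: the real MatchReport may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct MatchReport {
    pub matches: Vec<Match>,
}

/// Pure text formatter: MatchReport in, String out, no I/O and no regex types.
pub fn format_text(report: &MatchReport) -> String {
    let mut out = String::new();
    for (n, m) in report.matches.iter().enumerate() {
        out.push_str(&format!(
            "Match {}: {}  at {}..{} (len {})\n",
            n + 1,
            m.text,
            m.start,
            m.end,
            m.end - m.start
        ));
        // Drop the whole-match group by its index, not by its position.
        for c in m.captures.iter().filter(|c| c.index != 0) {
            // Named groups are labeled by name, unnamed ones by index.
            let label = match &c.name {
                Some(name) => name.clone(),
                None => c.index.to_string(),
            };
            match (&c.value, c.start, c.end) {
                (Some(v), Some(s), Some(e)) => {
                    out.push_str(&format!("  {} = {} ({}..{})\n", label, v, s, e));
                }
                // Assumed placeholder for a group that did not participate.
                _ => out.push_str(&format!("  {} = <none>\n", label)),
            }
        }
    }
    out
}
```

Because every field is an owned String, usize, or Vec, a unit test can build a MatchReport by hand and assert on the returned string without compiling a pattern or spawning the binary.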

The interesting bit #1: walking captures

The Rust regex crate gives you Captures, which behaves like a Vec<Option<Match>> indexed by group number. You can also call name("user") on it to look up a group by name. What Captures does not offer is a way to enumerate which slots are named: for that you iterate capture_names() on the parent Regex, which lists the slot names in declaration order. Once you have both, the lowering is straightforward:

fn captures_to_match(caps: &regex::Captures<'_>, re: &Regex) -> Match {
    let m0 = caps.get(0).expect("group 0 always exists");
    let mut groups = Vec::with_capacity(caps.len());
    for (i, name) in re.capture_names().enumerate() {
        let g = caps.get(i);
        groups.push(Capture {
            index: i,
            name: name.map(|s| s.to_string()),
            value: g.map(|m| m.as_str().to_string()),
            start: g.map(|m| m.start()),
            end: g.map(|m| m.end()),
        });
    }
    Match {
        text: m0.as_str().to_string(),
        start: m0.start(),
        end: m0.end(),
        captures: groups,
    }
}

The crucial line is re.capture_names(). It returns an iterator of Option<&str> aligned with the group indices, with None for unnamed groups and Some("user") for (?P<user>...). So index 0 is the whole match (always None as a name), index 1 is the first declared group, and so on. By enumerating that iterator and calling caps.get(i) with the same index, every capture I emit knows both its index and, if it has one, its name. The upshot at the user-visible end is regex101-style output:

$ regex-test '(?P<user>\w+)@(?P<host>\S+)' 'alice@ex.com bob@foo.bar' --all --named
Match 1: alice@ex.com  at 0..12 (len 12)
  user = alice (0..5)
  host = ex.com (6..12)
Match 2: bob@foo.bar  at 13..24 (len 11)
  user = bob (13..16)
  host = foo.bar (17..24)

--named runs a second pass that drops every unnamed group, including group 0:

pub fn filter_named(matches: &mut [Match]) {
    for m in matches.iter_mut() {
        m.captures.retain(|c| c.name.is_some());
    }
}

That retain is the entire reason the formatter has a .filter(|c| c.index != 0) instead of a .skip(1). I learned this the wrong way: my first version skipped the first element of the list, which after filter_named was [user, host] rather than [group 0, user, host], and so user silently vanished from the output. Test #6 in the integration suite is the regression test for this exact mistake. (You will write this same bug, and your test for it will look much the same.)
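The skip-versus-filter bug is easy to reproduce without the regex crate at all. The sketch below uses a cut-down Capture with only index and name; the helper names labels_by_position and labels_by_index are made up for this illustration.

```rust
#[derive(Debug, Clone)]
struct Capture {
    index: usize,
    name: Option<String>,
}

/// Same idea as the --named pass: keep only groups that have a name.
fn filter_named(captures: &mut Vec<Capture>) {
    captures.retain(|c| c.name.is_some());
}

/// Buggy: assumes the whole-match group is always the first element.
fn labels_by_position(captures: &[Capture]) -> Vec<String> {
    captures.iter().skip(1).filter_map(|c| c.name.clone()).collect()
}

/// Fixed: drops the whole-match group by its index, wherever it sits.
fn labels_by_index(captures: &[Capture]) -> Vec<String> {
    captures
        .iter()
        .filter(|c| c.index != 0)
        .filter_map(|c| c.name.clone())
        .collect()
}

/// Capture list for a pattern with two named groups, before filtering.
fn sample() -> Vec<Capture> {
    vec![
        Capture { index: 0, name: None },
        Capture { index: 1, name: Some("user".to_string()) },
        Capture { index: 2, name: Some("host".to_string()) },
    ]
}
```

After filter_named the list no longer starts with group 0, so the positional version returns only host while the index-based version returns both user and host.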

The interesting bit #2: replace, with template syntax for free

The whole reason regex replace is its own command-line flag is that the substitution syntax is half the value. The Rust regex crate accepts $1, $name, ${name}, and even $0 (the whole match) inside the replacement template, and resolves them when you call replace/replace_all. There is no extra work to expose this to the user — I just thread the user's --replace argument straight through:

let replaced = opts.replace.as_ref().map(|template| {
    if opts.all {
        re.replace_all(input, template.as_str()).into_owned()
    } else {
        re.replace(input, template.as_str()).into_owned()
    }
});

Seven lines, and the crate does all the work. The output:

$ regex-test '(\w+)' 'hello world' --all --replace '<$1>'
Replaced:
<hello> <world>

$ regex-test '(?P<word>\w+)' 'hi there' --all --replace '[${word}]'
Replaced:
[hi] [there]

The reason this is worth having as a flag instead of a separate tool is the feedback loop. If you are debugging a replacement, the question you actually want to answer is "what did $1 end up being for each match?" In regex-test the answer is on the screen below the replacement, because the match list is still printed. The two pieces of information that are usually a context switch apart (the captured pieces and the replaced output) sit on the same screen, against the same input. That is the entire point of the tool.

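A note on ${name} versus $name: Rust's regex follows the rule that $name greedily takes as many word characters as it can, so $name1 is the group name1 even if you meant $name followed by a literal 1. The ${...} braces remove the ambiguity. A reference to a group that doesn't exist expands to the empty string rather than raising an error, so the mistake is silent. If a replacement comes out with pieces missing, this is usually why.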
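The delimiting rule can be modeled in a few lines of std-only Rust. This is a simplified sketch of the rule as described above, not the regex crate's implementation; split_reference is a hypothetical helper that receives the text immediately after a $ and returns the reference name plus whatever literal text follows it.

```rust
/// Simplified model of how a `$ref` is delimited in a replacement template.
/// `rest` is the text immediately after a `$`. Returns (name, remainder),
/// or None when no reference starts here.
fn split_reference(rest: &str) -> Option<(&str, &str)> {
    if let Some(inner) = rest.strip_prefix('{') {
        // Braced form: the name runs up to the closing brace.
        let close = inner.find('}')?;
        return Some((&inner[..close], &inner[close + 1..]));
    }
    // Bare form: take the longest run of ASCII letters, digits, and '_'.
    let len = rest
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(rest.len());
    if len == 0 {
        None
    } else {
        Some((&rest[..len], &rest[len..]))
    }
}
```

With this model, the text after the $ in $name1 yields the name name1, while the text after the $ in ${name}1 yields name followed by a literal 1, which is the difference the braces buy you.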

The interesting bit #3: ANSI highlight in 20 lines

The third format is highlight: same human-readable text output, but the input is echoed once at the top with each match wrapped in bold/yellow. It is the part of the output that makes you stop wanting to grep for the matched ranges by hand. The implementation is a stitch:

fn highlight_input(input: &str, matches: &[Match], pal: &Palette) -> String {
    if matches.is_empty() {
        return input.to_string();
    }
    let mut out = String::with_capacity(input.len() + matches.len() * 8);
    let mut cursor = 0;
    for m in matches {
        if m.start < cursor {
            continue; // out of order or overlapping: skip defensively
        }
        out.push_str(&input[cursor..m.start]);
        out.push_str(pal.match_hl());
        out.push_str(&input[m.start..m.end]);
        out.push_str(pal.reset());
        cursor = m.end;
    }
    out.push_str(&input[cursor..]);
    out
}

That walk relies on regex::Regex::captures_iter returning non-overlapping matches in start-position order, which it does, so the defensive continue should never fire on matches that come straight from the crate; it is there so that a hand-built or reordered match list cannot make the slices panic. Zero-width matches (e.g. \b) are a different case: captures_iter does yield them, and they pass the guard with start == end, so they produce an empty highlighted span rather than a crash. The Palette type is the same trick I use in every Rust CLI of mine: it stores a single enabled: bool, set once at startup, and every accessor returns either the real ANSI escape or "". Nothing downstream has to ask whether color is on.
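For readers who have not seen the Palette trick, here is a minimal sketch of it. The accessor names match the ones called in highlight_input; the specific escape codes (SGR 1;33 for bold yellow, SGR 0 for reset), the constructor, and the paint helper are assumptions, and the real color.rs may differ.

```rust
/// Color switch decided once at startup, for example from a flag or a
/// terminal check (how it is decided is outside this sketch).
#[derive(Debug, Clone, Copy)]
pub struct Palette {
    enabled: bool,
}

impl Palette {
    pub fn new(enabled: bool) -> Self {
        Palette { enabled }
    }

    /// Bold yellow when colors are on, empty string otherwise.
    pub fn match_hl(&self) -> &'static str {
        if self.enabled { "\x1b[1;33m" } else { "" }
    }

    /// Reset all attributes when colors are on, empty string otherwise.
    pub fn reset(&self) -> &'static str {
        if self.enabled { "\x1b[0m" } else { "" }
    }
}

/// Hypothetical helper: wraps `text` in the highlight escape.
/// With colors disabled it degrades to a plain copy of the text.
pub fn paint(text: &str, pal: &Palette) -> String {
    format!("{}{}{}", pal.match_hl(), text, pal.reset())
}
```

The calling code never checks whether color is enabled; it always pushes the accessor's return value, which makes the colored and plain outputs share one code path and keeps the formatter easy to test with a disabled palette.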

What it deliberately doesn't do

A short list, because tools should be honest about their limits:

No PCRE lookbehind or lookahead. The regex crate, like RE2, doesn't have them. If your pattern needs them, you want fancy-regex or pcre2, both of which trade the linear-time guarantee for the feature.
No backreferences. Same reason. The most common use of \1 is "match the same thing twice in a row" (e.g. detecting doubled words), and you can usually do that with a two-pass approach: one regex finds candidates, a second pass in your language of choice verifies the equality.
No multi-line --split with surrounding context. regex-test --split gives you the pieces between matches. If you want the matches themselves with context, that's --all against the same pattern. Two flags, two answers, and I am not going to glue them together because awk already does that job.
No piping mode. regex-test does not stream stdin line-by-line the way grep does. If you need that, grep -P '<pattern>' file is right next to it in your $PATH.

The shape these omissions take: regex-test is the inner-loop tool. You use it to figure out what your pattern matches and what the captures are, on a small piece of input. Once the pattern is right, you bake it into whatever script or service is going to run it for real. It is not trying to replace grep for whole-file streaming, and it is not trying to replace regex101 for the part of regex authoring where you click around. It is the one thing in between.
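The two-pass alternative to backreferences mentioned in the list above can be sketched for the doubled-word case. To keep the example dependency-free, the first pass here is a hand-written scan over runs of alphanumeric and underscore characters standing in for a \w+ regex; that substitution, and the function names, are assumptions for illustration.

```rust
/// Pass 1: find candidate tokens with their byte offsets.
/// A std-only stand-in for iterating the matches of a `\w+` pattern.
fn words(input: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in input.char_indices() {
        let is_word = ch.is_alphanumeric() || ch == '_';
        match (is_word, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                out.push((s, &input[s..i]));
                start = None;
            }
            _ => {}
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        out.push((s, &input[s..]));
    }
    out
}

/// Pass 2: the equality check a backreference would have done.
/// Returns the offset and text of the first word of each doubled pair.
fn doubled_words(input: &str) -> Vec<(usize, &str)> {
    let tokens = words(input);
    tokens
        .windows(2)
        .filter(|pair| pair[0].1 == pair[1].1)
        .map(|pair| pair[0])
        .collect()
}
```

Note that this version treats any two adjacent tokens as a pair, so words separated by punctuation also count as doubled; a pattern like (\w+)\s+\1 in a backtracking engine would require whitespace between them. Tightening that is a matter of checking the text between the two offsets in the second pass.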

Try it in 30 seconds with Docker

docker build -t regex-test https://github.com/sen-ltd/regex-test.git
docker run --rm regex-test '(?P<user>\w+)@(?P<host>\S+)' 'alice@example.com' --named

Or clone the repository and run cargo install --path . from its root if you have a Rust toolchain. The release profile is set to strip + lto + codegen-units=1 + opt-level="z" + panic="abort", so the resulting binary is around 1 MB stripped. The Alpine image weighs in at 10 MB.

What I'd build next

If I had another afternoon I'd add a --bench flag that wraps the match in std::time::Instant and reports per-pattern compile time and per-match wall-clock. The point would not be to win benchmarks against PCRE; it would be to give people who suspect their regex is "slow" a one-shot way to check whether it actually is. Most of the time the answer is "your pattern is fine, your I/O is the problem," and a tool that makes that easy to demonstrate is doing useful work.
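The timing part of such a flag needs nothing beyond the standard library. The sketch below is hypothetical, since --bench does not exist in the current version: timed wraps any closure in std::time::Instant, and bench_line is an assumed output format.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed
/// wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let started = Instant::now();
    let value = f();
    (value, started.elapsed())
}

/// Hypothetical report line, in milliseconds with three decimals.
fn bench_line(label: &str, elapsed: Duration) -> String {
    format!("{}: {:.3} ms", label, elapsed.as_secs_f64() * 1000.0)
}
```

In the tool, one call would wrap pattern compilation and another would wrap the match loop, so compile time and match time are reported separately. A single run is noisy, so a real flag would probably want to repeat the closure and report a minimum or median.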

I'd also probably accept a --patterns-file so you could feed it a \n-separated list of patterns to try against the same input — turning the inner-loop tool into a small fuzzer for "which of these five candidate patterns wins on this input?" That is the kind of thing where regex-test would actually displace the browser tab, instead of just being faster than it.

But neither of those is in the current version. The current version is the smallest thing I would alias to rt and use every day, and that was the goal.
