SEN LLC

Posted on Apr 16

Reimplementing fd in 500 lines of Rust (and what I learned about the crates it stands on)

#rust #cli #filesystem #tutorial


A minimal fd-style file finder in four small modules. Regex or glob, .gitignore-aware, type/depth/exclude filters. Not a replacement for fd. A teaching sibling.

📦 GitHub: https://github.com/sen-ltd/fdlite

There is a particular kind of Rust CLI that gets the "wow, it's so fast" response from people who try it for the first time. ripgrep, fd, bat, eza, hyperfine — the whole @sharkdp / @BurntSushi constellation. I've used fd every day for years. I've read its source a couple of times. But reading source is not the same as building the thing. So I built fdlite: a deliberately small, deliberately partial reimplementation of fd, to see how much of its usefulness falls out of the crates it's built on, and how much has to be written by hand.

Short answer: a lot falls out of the crates. The first working version was under 300 lines. The version I ended up shipping, with polish and tests, is around 500. Here is what I learned.

What I was willing to give up

The first decision was the one everyone makes when they try to reproduce a piece of production software: where is the scope knife going to cut? I drew the line like this.

In scope, because they are the reason you reach for fd in the first place:

regex match against the filename (the default)
glob match (--glob)
.gitignore respected out of the box
file type filter (--type f|d|l)
extension filter (--extension rs)
exclude globs (--exclude)
depth limit (--depth)
hidden files toggle (--hidden)
--print0 for xargs pipelines
same exit codes as grep/fd (0 found, 1 nothing, 2 bad input)
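
The exit-code convention in the last item could be wired up roughly like this. It is a minimal sketch using only the standard library; the Outcome enum and the classify, exit_status, and exit_code functions are hypothetical names for illustration and are not taken from the fdlite source.

```rust
use std::process::ExitCode;

/// Hypothetical summary of a run; not a type from fdlite itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Found,
    NothingFound,
    BadInput,
}

/// Classify a run: Err means bad input, Ok(n) carries the number of matches.
pub fn classify(result: &Result<usize, String>) -> Outcome {
    match result {
        Err(_) => Outcome::BadInput,
        Ok(0) => Outcome::NothingFound,
        Ok(_) => Outcome::Found,
    }
}

/// Numeric status: 0 found, 1 nothing, 2 bad input.
pub fn exit_status(outcome: Outcome) -> u8 {
    match outcome {
        Outcome::Found => 0,
        Outcome::NothingFound => 1,
        Outcome::BadInput => 2,
    }
}

/// Value that `fn main() -> ExitCode` could return.
pub fn exit_code(outcome: Outcome) -> ExitCode {
    ExitCode::from(exit_status(outcome))
}
```

Keeping the numeric mapping in its own function makes it testable without spawning a process.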

Out of scope, deliberately:

parallel directory traversal (fd walks with a work-stealing thread pool; fdlite is single-threaded)
colour output and terminal detection
smart case
executing a command per match (-x / -X)
custom ignore-file names, per-directory ignore overrides, fine-grained ignore surgery
full glob grammar (**, brace expansion, class negation)

The point isn't "everything fd does, but worse." It's "what is the subset that fits in an afternoon, and what does that teach you?"

Four modules, one glue file

The whole thing lives in four small modules under src/, plus a tiny main.rs that stitches them together and a lib.rs that re-exports everything for the tests. The shape:

src/
  cli.rs         # clap derive — one struct, every flag
  matcher.rs     # regex | glob dispatch, plus extension constraint
  walker.rs      # ignore-aware walk, all the filters
  formatter.rs   # \n or \0 record separator
  main.rs        # parse → options → walk → format → print
  lib.rs         # re-exports, so tests can reach everything

Every module exposes a pure data type and a function or two. main.rs is 70 lines and boring on purpose.
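
To give a feel for how small such a module can be, here is a sketch of what a formatter along the lines of formatter.rs might look like. The function names and signatures are assumptions, not the actual fdlite code; only the behavior (newline by default, NUL for --print0) comes from the description above.

```rust
use std::io::{self, Write};
use std::path::PathBuf;

/// Record separator: newline by default, NUL when --print0 is set.
pub fn separator(print0: bool) -> u8 {
    if print0 {
        b'\0'
    } else {
        b'\n'
    }
}

/// Write every path followed by the chosen separator.
/// Assumption: non-UTF-8 names are written lossily; a stricter tool
/// might write the raw OS bytes instead.
pub fn write_records<W: Write>(
    out: &mut W,
    paths: &[PathBuf],
    print0: bool,
) -> io::Result<()> {
    let sep = [separator(print0)];
    for path in paths {
        out.write_all(path.to_string_lossy().as_bytes())?;
        out.write_all(&sep)?;
    }
    Ok(())
}
```

Taking any Write rather than stdout directly means a test can pass a Vec<u8> and compare bytes.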

The `ignore` crate does the hardest thing for you

Parsing .gitignore is the kind of problem that looks tiny from the outside and is absolutely not. The grammar has !-negation, leading-slash-anchored patterns, trailing-slash directory-only patterns, ** any-depth wildcards, line-order precedence (later rules override earlier ones), parent-directory lookup, the global ~/.config/git/ignore file, the per-repo .git/info/exclude file, and the interaction with the bare-.ignore file that ripgrep and fd invented. Writing a correct implementation is a week of work minimum.

You don't have to. The ignore crate (on crates.io as ignore = "0.4") is the exact same crate that powers fd and ripgrep. It exposes a WalkBuilder that does the whole thing for you:

use ignore::WalkBuilder;
use std::path::PathBuf;

fn walk_with_ignore(opts: &WalkOptions) -> Vec<PathBuf> {
    let mut builder = WalkBuilder::new(&opts.root);
    builder
        .hidden(!opts.include_hidden)     // note: hidden(true) *skips* hidden
        .git_ignore(true)
        .git_global(true)
        .git_exclude(true)
        .require_git(false)               // read .gitignore even outside a git repo
        .parents(true);
    if let Some(d) = opts.max_depth {
        builder.max_depth(Some(d));
    }

    let mut out = Vec::new();
    for dent in builder.build().flatten() {
        if dent.depth() == 0 {
            continue; // skip the starting directory itself
        }
        if let Some(rel) = accept(opts, dent.path()) {
            out.push(rel);
        }
    }
    out
}

Two notes on things that surprised me.

require_git(false) is non-default and you almost always want it. By default, ignore only honors .gitignore files that sit inside a discovered git repo. That fits ripgrep's "I refuse to grep random files on your machine" ethos, but it makes unit tests annoying, and it means running fdlite in a freshly un-tarred source archive quietly ignores the .gitignore it ships with. My first integration test caught this: I built a fixture tree with a .gitignore containing *.log plus a debug.log, and asserted the log was hidden. It wasn't. Thirty seconds of reading WalkBuilder's docs told me why, and the fix is one line.

hidden(true) skips hidden files, not the other way around. The method is asking a question in a particular voice: "should the walker enforce hiding-of-hidden-files?" The naming is defensible once you see it from inside the crate, but I inverted it twice before I stopped reaching for the wrong constant.

`walkdir` is the fallback when the user says `--no-gitignore`

The ignore crate is a superset of walkdir in the sense that you can turn off all the ignore-file logic and get back a plain recursive walk. But the API is heavier, and every filter has to be switched off (one by one, or together with standard_filters(false)); miss one and the walk still reads some dotfiles. When the user says --no-gitignore, I didn't want any of that magic; I wanted a dumb depth-first walk.

For that there is walkdir (also by @BurntSushi), which is what ignore is built on top of:

use walkdir::WalkDir;
use std::path::PathBuf;

fn walk_plain(opts: &WalkOptions) -> Vec<PathBuf> {
    let mut walker = WalkDir::new(&opts.root);
    if let Some(d) = opts.max_depth {
        walker = walker.max_depth(d);
    }
    let mut out = Vec::new();
    for entry in walker.into_iter().flatten() {
        if entry.depth() == 0 {
            continue;
        }
        let path = entry.path();
        if !opts.include_hidden && is_hidden(path) {
            continue;
        }
        if let Some(rel) = accept(opts, path) {
            out.push(rel);
        }
    }
    out
}

Two walkers, one accept function — the difference is entirely in how you set up the iterator, not in how you filter its output. That turned out to be the cleanest way to keep the filter logic readable.
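
The walk_plain snippet calls an is_hidden helper that is not shown. One possible shape, using only the standard library, is sketched below; the real fdlite helper may differ. This version checks every component rather than only the file name, on the assumption that entries inside a hidden directory such as .git/ should be skipped too, since the loop above does not prune directories.

```rust
use std::path::{Component, Path};

/// Hypothetical helper: true if any named component of `path` starts
/// with a dot. `.` and `..` are not `Component::Normal`, so a root of
/// "." does not make every entry look hidden.
/// Caveat: if the search root itself lives under a dot-directory,
/// everything below it is reported as hidden; stripping the root
/// prefix first would avoid that.
pub fn is_hidden(path: &Path) -> bool {
    path.components().any(|component| match component {
        Component::Normal(name) => name
            .to_str()
            .map_or(false, |s| s.starts_with('.')),
        _ => false,
    })
}
```

With walkdir, an alternative is to prune hidden directories up front with filter_entry, which also avoids descending into them at all.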

The filter pipeline is one function, and that is on purpose

Everything a walked entry has to pass through — type filter, exclude globs, the primary pattern — lives in a single accept function that returns Option<PathBuf>:

fn accept(opts: &WalkOptions, path: &Path) -> Option<PathBuf> {
    // File type filter.
    if let Some(t) = opts.type_filter {
        let meta = path.symlink_metadata().ok()?;
        let ft = meta.file_type();
        let ok = match t {
            EntryType::F => ft.is_file(),
            EntryType::D => ft.is_dir(),
            EntryType::L => ft.is_symlink(),
        };
        if !ok {
            return None;
        }
    }

    // Exclude globs (checked against filename only).
    if !opts.excludes.is_empty() {
        if let Some(name) = path.file_name().and_then(|s| s.to_str()) {
            for g in &opts.excludes {
                if g.is_match(name) {
                    return None;
                }
            }
        }
    }

    // Primary pattern.
    if !opts.matcher.is_match(path) {
        return None;
    }

    let rel = path.strip_prefix(&opts.root).unwrap_or(path).to_path_buf();
    Some(rel)
}

The nice thing about keeping the pipeline in one place is that adding a new filter is a one-hunk change. Want --size +1M? Add a clause here. Want --owner root? Add a clause here. You never have to decide where to put it.
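
As an illustration of how small such a clause can be, here is a sketch of a minimum-size check that accept could call. The --size flag does not exist in fdlite; the passes_min_size name and the Option<u64> parameter are hypothetical.

```rust
use std::path::Path;

/// Hypothetical extra clause for `accept`: with no limit everything
/// passes; with a limit, only regular files of at least `min_bytes` do.
pub fn passes_min_size(path: &Path, min_bytes: Option<u64>) -> bool {
    let Some(min) = min_bytes else { return true };
    match path.symlink_metadata() {
        // symlink_metadata does not follow links, so a symlink is
        // never counted as a regular file here.
        Ok(meta) => meta.is_file() && meta.len() >= min,
        // Unreadable or vanished entries are dropped.
        Err(_) => false,
    }
}
```

Inside accept this would be one more early return: if !passes_min_size(path, opts.min_size) { return None; }, assuming a min_size field were added to WalkOptions.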

One subtle call: I deliberately use symlink_metadata() instead of metadata() for the --type l branch. metadata() follows symlinks, so ft.is_symlink() on a valid link will always be false. This is the kind of thing you can miss for a long time until you write the test.
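
The difference is easy to see side by side. The sketch below uses only std::fs; symlink_flags is a hypothetical helper written for this illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns (is_symlink via metadata, is_symlink via symlink_metadata).
/// `fs::metadata` follows the link and describes the target, so for a
/// valid link the first value is false; `fs::symlink_metadata`
/// describes the link itself, so the second value is true.
pub fn symlink_flags(path: &Path) -> io::Result<(bool, bool)> {
    let followed = fs::metadata(path)?.file_type().is_symlink();
    let not_followed = fs::symlink_metadata(path)?.file_type().is_symlink();
    Ok((followed, not_followed))
}
```

For a link pointing at an existing file this yields (false, true); for a regular file it yields (false, false). A dangling link makes fs::metadata fail, so the function returns an error in that case.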

Regex and glob share an interface

The primary matcher decides, for each walked entry, whether the filename passes. It has to handle two dialects — regex (the default) and glob (with --glob) — and I didn't want the walker to know which was in use. A tiny trait solves it:

use regex::Regex;
use std::path::Path;

trait PathMatch: Send + Sync {
    fn matches(&self, s: &str) -> bool;
}

impl PathMatch for Regex {
    fn matches(&self, s: &str) -> bool { self.is_match(s) }
}

pub struct Glob(Regex);
impl PathMatch for Glob {
    fn matches(&self, s: &str) -> bool { self.0.is_match(s) }
}

pub struct Matcher {
    primary: Box<dyn PathMatch>,
    extension: Option<String>,
}

impl Matcher {
    pub fn new(pattern: &str, as_glob: bool, extension: Option<String>) -> Result<Self, String> {
        let primary: Box<dyn PathMatch> = if pattern.is_empty() {
            Box::new(Regex::new("").unwrap())
        } else if as_glob {
            Box::new(Glob::new(pattern)?)
        } else {
            Box::new(Regex::new(pattern)
                .map_err(|e| format!("invalid regex {pattern:?}: {e}"))?)
        };
        Ok(Matcher { primary, extension })
    }

    pub fn is_match(&self, path: &Path) -> bool {
        let name = path.file_name().and_then(|s| s.to_str());
        let Some(name) = name else { return false; };
        if !self.primary.matches(name) { return false; }
        if let Some(ext) = &self.extension {
            return path.extension().and_then(|s| s.to_str()) == Some(ext.as_str());
        }
        true
    }
}

The Glob type is a regex under the hood, compiled from a tiny glob-to-regex function that handles *, ?, and [abc]. I did not reach for the globset crate, even though it is great, because the whole point of this project is to show how small the glob problem gets when you only need single-segment patterns. Forty lines of glob_to_regex and a thorough test file cover the ground.
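
For a sense of scale, here is a sketch of what a glob-to-regex translation for that subset could look like. It is not the fdlite implementation: it returns the anchored regex source as a String (which something like Glob::new could then hand to Regex::new), and the choice to reject [!abc] and [^abc] with an error is an assumption made for this sketch.

```rust
/// Translate a single-segment glob (`*`, `?`, `[abc]`) into an
/// anchored regex pattern string.
pub fn glob_to_regex(glob: &str) -> Result<String, String> {
    let mut re = String::from("^");
    let mut chars = glob.chars();
    while let Some(c) = chars.next() {
        match c {
            '*' => re.push_str(".*"),
            '?' => re.push('.'),
            '[' => {
                // Collect everything up to the closing bracket.
                let mut class = String::new();
                let mut closed = false;
                for m in chars.by_ref() {
                    if m == ']' {
                        closed = true;
                        break;
                    }
                    class.push(m);
                }
                if !closed {
                    return Err(format!("unclosed '[' in glob {glob:?}"));
                }
                if class.is_empty() {
                    return Err(format!("empty class in glob {glob:?}"));
                }
                if class.starts_with('!') || class.starts_with('^') {
                    return Err(format!("class negation is not supported: {glob:?}"));
                }
                re.push('[');
                for m in class.chars() {
                    // Escape characters that are special inside a regex class.
                    if matches!(m, '\\' | '[' | '^' | '&' | '~') {
                        re.push('\\');
                    }
                    re.push(m);
                }
                re.push(']');
            }
            // Regex metacharacters that are plain literals in a glob.
            '.' | '+' | '(' | ')' | '|' | '^' | '$' | '{' | '}' | ']' | '\\' => {
                re.push('\\');
                re.push(c);
            }
            _ => re.push(c),
        }
    }
    re.push('$');
    Ok(re)
}
```

Because the matcher only ever sees a file name, `*` can translate to `.*` without worrying about path separators.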

Two deliberate design calls in the matcher:

Match on the filename, not the full path. fdlite '^src$' does not match every file under a src/ directory; it matches an entry whose filename is exactly src. This is what fd does by default, and it's what users expect, but it's also the kind of decision you could easily get wrong if you reached for Regex::is_match on the stringified path. I have a test for it (regex_matches_on_name_not_parent_dirs) specifically because the wrong behavior would be almost invisible on casual use.

Extension is a separate AND-clause, not folded into the regex. You can say fdlite '^test_' --extension rs and get exactly the files whose names start with test_ and end in .rs. If extension were folded into the regex you'd have to anchor both ends yourself, which is a trap.
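
Both calls lean on how std::path::Path splits a path, which is worth seeing with concrete inputs. The sketch below swaps the regex for a plain closure so it runs without any crates; name_and_ext_match is a hypothetical stand-in for Matcher::is_match, not fdlite code.

```rust
use std::path::Path;

/// Apply `name_pred` to the final path component only, then require
/// an exact extension match if one was requested.
pub fn name_and_ext_match(
    path: &Path,
    name_pred: impl Fn(&str) -> bool,
    ext: Option<&str>,
) -> bool {
    let Some(name) = path.file_name().and_then(|s| s.to_str()) else {
        return false;
    };
    if !name_pred(name) {
        return false;
    }
    match ext {
        // Path::extension is the part after the last dot, without the dot.
        Some(want) => path.extension().and_then(|s| s.to_str()) == Some(want),
        None => true,
    }
}
```

With a predicate of name == "src", the path src/main.rs does not match but project/src does. With --extension semantics, archive.tar.gz has the extension gz (not tar.gz), and a dotfile such as .gitignore has no extension at all.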

Tests: 36 of them, a mix of unit and integration

The most valuable tests are the integration ones. Each builds a temporary directory tree with tempfile::tempdir(), invokes the real fdlite binary via CARGO_BIN_EXE_fdlite, and asserts on stdout, stderr, and the exit code:

fn bin() -> PathBuf {
    PathBuf::from(env!("CARGO_BIN_EXE_fdlite"))
}

#[test]
fn exit_zero_and_lists_matches_with_extension_filter() {
    let dir = build_fixture();
    let out = Command::new(bin())
        .args(["", "--extension", "rs"])
        .arg(dir.path())
        .output()
        .unwrap();
    assert_eq!(out.status.code(), Some(0));
    let stdout = String::from_utf8(out.stdout).unwrap();
    assert!(stdout.lines().any(|l| l.ends_with("main.rs")));
    assert!(!stdout.contains("README.md"));
}

CARGO_BIN_EXE_<name> is a built-in Cargo feature I wish more people knew about. Cargo sets this environment variable automatically when it compiles integration tests, which is why env!() can read it at compile time, so you never have to guess where the binary lives. It also means your integration tests run on whatever platform CI is using, against a freshly-built binary, without a custom test harness.

The 36 tests break down as: 21 unit tests (matcher regex/glob/extension/filename edge cases, walker gitignore/depth/exclude/type/hidden, formatter newline vs NUL), 12 integration tests (the whole binary under every flag combination plus all three exit codes), and 3 extra pure-matcher tests. One integration test is #[cfg(unix)]-gated because it creates a symlink with std::os::unix::fs::symlink; symlink creation on Windows needs a capability that CI doesn't always have.

Tradeoffs, honestly

If you are reaching for fdlite in production, stop. Use fd. Here is why, specifically:

No parallelism. fd uses a work-stealing thread pool from the ignore crate's parallel walker. fdlite is single-threaded because the code is more readable that way. On a large monorepo fd is a lot faster.
No colour. fd detects the terminal, picks colours from LS_COLORS, and highlights matches. fdlite prints plain UTF-8.
No smart case. fd is case-insensitive unless your pattern contains an uppercase letter. fdlite is always case-sensitive, because it hands the pattern to the regex engine unchanged and never sets the case-insensitive flag.
No -x/-X. You cannot ask fdlite to run a command per match. Pipe to xargs -0 with --print0 if you need that.
Glob grammar is a subset. No **, no {a,b} brace expansion, no class negation [!abc].
Only the filename is matched. fd also has a --full-path / -p flag that opts into matching against the entire path. fdlite does not.

None of this is hard to add. Most of it is explicitly not-added because the goal is a readable afternoon project, not a daily driver.
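
Smart case is a good example of how little code a first approximation needs. The sketch below is a naive version, not how fd implements it: it scans the raw pattern text, so an escape like \S or \W counts as "contains uppercase". A more careful version would inspect only the literals of the parsed pattern. The function names are hypothetical; (?i) is the regex crate's inline case-insensitive flag.

```rust
/// Naive smart-case rule: be case-insensitive unless the pattern text
/// contains at least one uppercase character.
pub fn smart_case_insensitive(pattern: &str) -> bool {
    !pattern.chars().any(char::is_uppercase)
}

/// Prefix the pattern with the inline `(?i)` flag when the rule says
/// the match should ignore case; otherwise return it unchanged.
pub fn apply_smart_case(pattern: &str) -> String {
    if smart_case_insensitive(pattern) {
        format!("(?i){pattern}")
    } else {
        pattern.to_string()
    }
}
```

The returned string would be what gets passed to Regex::new in place of the user's raw pattern.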

Try it in 30 seconds

git clone https://github.com/sen-ltd/fdlite && cd fdlite
docker build -t fdlite .

mkdir -p /tmp/demo/src/sub
echo a > /tmp/demo/src/hello.txt
echo b > /tmp/demo/src/README.md
echo c > /tmp/demo/src/sub/nested.txt
echo "*.log" > /tmp/demo/.gitignore
echo x > /tmp/demo/src/debug.log

docker run --rm -v /tmp/demo:/work fdlite '\.txt$' /work
docker run --rm -v /tmp/demo:/work fdlite '' /work --extension md
docker run --rm -v /tmp/demo:/work fdlite '' /work --type d
docker run --rm -v /tmp/demo:/work fdlite '' /work --no-gitignore

The runtime image is about 10 MB: alpine:3.20 plus a stripped, LTO'd musl binary.

What I actually learned

Writing fdlite did not teach me how to parse .gitignore. The ignore crate did that for me, and it did it correctly in situations I would not have thought to test. What it taught me instead was how carefully the fd / ripgrep authors have packaged up the hard parts so that someone else can build a small tool without redoing the work. WalkBuilder is a beautiful piece of API: you can pick hidden, gitignore, require_git, max_depth, parents independently, in any combination, and they compose. That is not an accident; that is @BurntSushi thinking hard about the shape of the problem so that the next person doesn't have to.

The second thing it taught me was how much of a CLI you get for free from clap + one derive struct. The entire cli.rs file is one struct with documented fields, and clap generates the whole help output, including --help vs -h behavior, value parsing, ValueEnum for --type, repeatable flags, and error messages. That file is 80 lines, most of them doc comments, and it took ten minutes to write.

The third thing (this one is less technical) is that small teaching reimplementations are an under-used way to understand a piece of software. I've read the fd source before. I did not internalize how WalkBuilder's knobs compose until I had written a program that had to pick which combination to use when, and had been bitten by require_git's default the way every first-time user gets bitten. The act of having to make the call forces the learning.

If you want to learn fd better, don't read its source. Build a worse version, one evening, one file at a time. You'll see the shape of the thing from the inside, and the real fd will look simpler afterwards, not more complicated.

Source: https://github.com/sen-ltd/fdlite · MIT licensed · About 500 lines of Rust.
