DEV Community

Cover image for How I Wrote the Fastest Directory Crawler Ever
Abdullah Atta
Abdullah Atta

Posted on

How I Wrote the Fastest Directory Crawler Ever

GitHub logo thecodrr / fdir

⚡ The fastest directory crawler & globbing library for NodeJS. Crawls 1m files in < 1s

The Fastest Directory Crawler & Globber for NodeJS

The Fastest: Nothing similar (in the NodeJS world) beats fdir in speed. It can easily crawl a directory containing 1 million files in < 1 second.

💡 Stupidly Easy: fdir uses expressive Builder pattern to build the crawler increasing code readability.

🤖 Zero Dependencies*: fdir only uses NodeJS fs & path modules.

🕺 Astonishingly Small: < 2KB in size gzipped & minified.

🔥 All Node Versions Supported: Unlike other similar libraries that have dropped support for Node versions < 10, fdir supports all versions >= 6.

🖮 Hackable: Extending fdir is extremely simple now that the new Builder API is here. Feel free to experiment around.

* picomatch must be installed manually by the user to support globbing.

Support

Do you like this project? Support me by donating, creating an issue, becoming a stargazer, or opening a pull request. Thanks.

If it's not fast, the work is not yet complete.

Said no one ever.

Writing code fast and writing fast code are two very different things. One could even say they are opposites. If you are in the habit of writing code very fast there's a great chance that it will be slow. Writing fast code is not just about choosing the fastest language, the fastest platform, the fastest libraries etc. Anyone could do that. What makes code truly fast is the little things; the loops, the conditionals, the assignments, function calls etc.

Introduction

I woke up Thursday morning, groggy, upset and very, very sleepy. My head was hurting. I had been coding all night and I had finally finished the first version of fdir; the fastest directory crawler for NodeJS. I opened my laptop after a good breakfast, splendid tea and a nice walk; ran the benchmark again: fdir was up against 13 other contenders. Always fdir would come out on top in both synchronous and asynchronous crawling. But it was not ready yet...

The First Try

Writing fast code starts from the very first line.

The purpose of fdir is simple; crawl as many directories as possible in as short a time as possible. The first version of fdir used recursion; it went something like this:

Note: This is an overly simplified version.

function sync(dir) {
  const dirents = fs.readdirSync(dir, { withFileTypes: true });
  const paths = [];

  dirents.forEach(dirent => {
    const res = `${dir}${path.sep}${dirent.name}`;
    if (dirent.isDirectory()) {
     sync(res).forEach(push.bind(paths));
    } else {
      paths.push(res);
    }
  });
  return paths;
}
Enter fullscreen mode Exit fullscreen mode

This could already beat almost everything out there. There's nothing special in it. Just some loops, recursion etc. etc. So what made it faster than everything?

The first line.

withFileTypes: true to be specific. This allowed me to skip the fs.lstatSync syscall for each item in a directory. Yup. You can imagine the speed boost.

This line must be making you jump out of your underwear. Why didn't you use path.join?!!

....
const res = `${dir}${path.sep}${dirent.name}`;
....
Enter fullscreen mode Exit fullscreen mode

Because it's slow. It's much slower than just using path.sep. I benchmarked it. It's about 50% slower.

v1 Benchmark:

v1 Benchmark

As you can see, only rrdir.sync comes even close to fdir and that's because it uses a similar approach.

The Defenders Arrive

Saturday evening, I was posting about fdir on Reddit. 2 hours later, the author of rrdir opened a PR to update his library to improve async performance. I was heavily refactoring fdir and adding support for Node version < 10 so his PR couldn't be merged. After an hour, though, I manually updated his library and ran the benchmarks again.

fix benchmark, bump rrdir, add rrdir.stream #2

async rrdir shouldn't look so bad anymore with this

Results:

Async:

Async benchmarks

Sync:

Async benchmarks

Two hours after that, the author of fs-recursive opened a PR to include his library in the benchmark. The title of the PR was: "I am fastest now". And it was. By quite the margin (50%). Of course I merged it.

Now I'm the fastest #3

simov avatar
simov posted on

Thanks for the benchmark 👍

The Rewrite

And of course, I couldn't let fs-recursive take the first place. I had spent "one whole night" writing the fastest crawler. I couldn't back down now. So I rewrote the whole algorithm. From top to bottom. It removed recursion (from fdir.sync), stopped array recreation, only used a single Promise per fdir.async call etc. etc. The code now looked like this:

function sync(dir, options) {
    const paths = [];
    const dirs = [dir];
    var i = 0;
    while (i < dirs.length) {
        const dir = dirs[i];
        const dirents = fs.readdirSync(dir, readdirOpts);
        dirents.forEach(function(dirent) {
            let fullPath = `${dir}${path.sep}${dirent.name}`;
            if (dirent.isDirectory()) {
                dirs.push(fullPath);
            } else {
                paths.push(fullPath);
            }
        });
        ++i;
    }
    return paths;
}
Enter fullscreen mode Exit fullscreen mode

The code is pretty self-explanatory; we keep adding directories to the dirs array so the loop never ends until there are no more directories. But fdir.sync was already the first so I didn't really need to optimize it further but I couldn't resist. Removing the multiple array initialisation, recursion gave a good speed boost and overall made the code quite clean (imo).

The real challenge was optimizing the async version. As you all know, looping with asynchronous/callback functions is quite a PITA. So after everything this came into being:

function async(dir, options) {
  return new Promise(function(resolve) {
    const paths = [];
    const dirs = [dir];
    let cursor = 0;
    let readCount = 0;
    let currentDepth = options.maxDepth;
    function walk() {
      // cache the total directories before starting the walk
      let total = dirs.length;
      for (; cursor < total; ++cursor) {
        const dir = dirs[cursor];
        fs.readdir(dir, readdirOpts, function(_, dirents) {
          dirents.forEach(function(dirent) {
            let fullPath = `${dir}${path.sep}${dirent.name}`;
            if (dirent.isDirectory()) {
                dirs.push(fullPath);
            } else {
                paths.push(fullPath);
            }
          });
          // check if we have walked all the directories we had
          if (++readCount === total) {
            // check if we got any new ones
            if (dirs.length === cursor) {
              resolve(paths);
            } else {
            // walk again if we have new directories.
              walk();
            }
          }
        });
      }
    }
    walk();
  });
}
Enter fullscreen mode Exit fullscreen mode

The concept is quite similar to fdir.sync but we retained recursion (although a new version of it). I couldn't find a way to reliably remove recursion.

The Results

And, fdir was back on top.

Async:

Async Benchmark

Sync:

Sync Benchmark

The Takeaway

The moment you have all been waiting for. The takeaways. What I learned. What I didn't. Etc. etc. However, I don't have "don't use X but Y" kind of lessons for you. I am sorry. The reason is, performance is dependent upon use-case.

  1. Don't write code fast. You will have to rewrite it again and again. And if it's a large codebase, it will very soon become a PITA. So write it carefully, take all the precautions, do all the optimizations.
  2. A single millisecond matters. Oftentimes, we do not add an optimization just because it adds only a millisecond. But "drop by drop a river is born" right?
  3. NodeJS is very fast, you just have to write honest code. Don't make it complex just for the hell of it. Keep it simple, keep it fast.
  4. Benchmark. Benchmark. Benchmark. JavaScript has a lot of ways of doing one thing, multiple loops, iterators etc. You won't know what's fastest until you benchmark. I ran benchmarks for each line of my code that could have an alternative. Remember, every millisecond matters.

But I am gonna give a few "use X not Y" lessons anyway.

  1. Use as few conditionals as possible. Each branch adds an overhead and although the engine optimizes it, you have to be careful.
  2. Prepare for errors beforehand. Try-catch is expensive. Be careful.
  3. for, forEach and array.reduce are all very fast. Use whichever works for you. Actually, use them all and see which one makes your code faster.
  4. Research the API before using it. More often than note, there's something in the API that will reduce unnecessary calls, bootstrapping, error-checks etc. Like withFileTypes: true.
  5. Use string methods as less as possible. Actually, use strings as less as possible. Pushing a string into an array is much slower than pushing an int. (I didn't get to apply this).

So What Happens Now?

Well, I will keep benchmarking and finding ways to make it faster. I will try using WebAssembly, Workers etc etc. Innovation, my friend, innovation. Currently, fdir can crawl about 1 million files in ~900ms but I want to reduce it down to 500ms. The current code is as optimized as it can get. So let's see what I try.

GitHub logo thecodrr / fdir

⚡ The fastest directory crawler & globbing library for NodeJS. Crawls 1m files in < 1s

The Fastest Directory Crawler & Globber for NodeJS

The Fastest: Nothing similar (in the NodeJS world) beats fdir in speed. It can easily crawl a directory containing 1 million files in < 1 second.

💡 Stupidly Easy: fdir uses expressive Builder pattern to build the crawler increasing code readability.

🤖 Zero Dependencies*: fdir only uses NodeJS fs & path modules.

🕺 Astonishingly Small: < 2KB in size gzipped & minified.

🔥 All Node Versions Supported: Unlike other similar libraries that have dropped support for Node versions < 10, fdir supports all versions >= 6.

🖮 Hackable: Extending fdir is extremely simple now that the new Builder API is here. Feel free to experiment around.

* picomatch must be installed manually by the user to support globbing.

Support

Do you like this project? Support me by donating, creating an issue, becoming a stargazer, or opening a pull request. Thanks.

Support fdir on ProductHunt

Thanks for reading,
thecodrr

Top comments (5)

Collapse
 
tunnckocore profile image
Charlike Mike Reagent • Edited

The funny thing is that fast-glob with dot:true is faster than others, except yours and rrdir. The recursive-fs is also competing for the async crown.

There's also an interesting thing to note that the different ones get different number of files.

In fact there's 5104 files in node_modules (after installing tiny-glob and fast-glob), as per

ls -l node_modules/**/* | grep "^-" | wc -l

But that's what libs found

sync:

fastGlob 2605
tinyGlob 2926
fdir 2605
allFilesInTree 2605
fsReadDirRecursive 2552
klaw 2926
recurReadDir undefined
walkSync 2926
rrdir 2926

async:

fastGlob 2605
tinyGlob 2926
fdir 2605
recursiveFs 322
getAllFiles 2605
recurReadDir undefined
recursiveFiles 28
recursiveReadDir 2605
rrdir 2926

edit: ah yea, there's includeDirs option

Collapse
 
thecodrr profile image
Abdullah Atta • Edited

Hey, thanks for the info. I will definitely investigate and see why fdir is missing files. When I tested I confirmed fdir with my system's file manager's file count (KDE's Dolphin) but I will use some other tool to confirm. Thanks again.

Edit: KDE's Dolphin is giving file count as: 2,407 files (in node_modules) while the command you gave is: 1973. So... either you command is flawed or something else is up.

Collapse
 
tunnckocore profile image
Charlike Mike Reagent

Okay, I see why it was 4700, it was doubled.

Updated the command ls -al node_modules/**/* | grep "^-" | grep -v "node_modules" | wc -l

It basically is 2398, including the dot files, that's closer to your report. Soo.. I don't know.

Thread Thread
 
thecodrr profile image
Abdullah Atta

Yes I tried various other commands. Fdir comes really close and since it cannot handle symlinks (yet) the difference maybe because of that. But overall, fdir wins. 😁😁😁

Collapse
 
tunnckocore profile image
Charlike Mike Reagent • Edited

I don't know. Mine (pcmanfm, on archlinux) is showing 2680, while the command 4716 - git clone, yarn, and then the command.

I more believe on the ls command. The ls -l node_modules/**/* returns all files AND directories. But the grep is getting only the files.

node_modules/urix:
total 20
drwxr-xr-x 2 charlike users 4096 Mar 20 08:38 test/
-rw-r--r-- 1 charlike users  308 Jan  6 23:04 index.js
-rw-r--r-- 1 charlike users 1079 Jan  6 23:04 LICENSE
-rw-r--r-- 1 charlike users  494 Jan  6 23:04 package.json
-rw-r--r-- 1 charlike users  812 Jan  6 23:04 readme.md

So above will give us 4 here, instead of 5.

edit: In anyway good job. The difference of 2600 to 2900 is most probably because they (rrdir for example and others) include the dirs in the final results array. Both fast-glob and fdir report same count, because both don't include dirs. So, don't think that one too much. The interesting thing is why only ~2k are reported and not 4k which is more likely the real number of files.