My website’s search uses ripgrep under the hood. You can try it out by visiting the search page and typing a simple regex. As a disclaimer, this is mostly an experiment and in no way a real alternative to Full-Text Search… but IT IS cool 😎✨
ripgrep is an interesting piece of software that lets you run regex-based searches over files and directories recursively. The most important part is that it’s fast. Like, really fast. So, could it be possible to make it work over HTTP instead of the filesystem? And what about using it in the browser?
It turns out that it actually IS possible to use it over HTTP, since there is a Rust crate with all the essential code to drive ripgrep programmatically. The “work in the browser” part is a bit more complicated. Since we are talking about a Rust library, the most common way to use it is through WebAssembly (WASM). The ripgrep codebase is mostly compatible, with some exceptions which I had to manually fix inside a fork.
So, now that we have everything sorted out, let's go a bit deeper!
The netgrep library is divided into two main parts: a WASM binary that interacts with ripgrep’s internals and a TypeScript library that manages the bindings and the exposed API. I also wanted to try nx as a build system, which works quite well for a Rust + TS codebase.
After dealing with the WASM compatibility issue, which was actually quite simple to fix, I had to choose the architecture of the library. Analysing ripgrep a bit, we can summarise its work in two stages:
- Discovery, which is the act of navigating a directory and listing all the files recursively;
- Search or: “look for the given pattern inside that file”.
For now I just wanted to release netgrep with only the Search feature, leaving the job of providing a list of files to analyse to the user. Taking this into consideration, and knowing that a WASM binary can only use the native browser APIs for networking (so fetch and XMLHttpRequest), I decided to handle just the search function inside the binary.
More specifically, the `search_bytes` function exposed from the search package uses the `search_slice` method from the `grep` crate to analyse a slice of bytes, returning a boolean indicating whether the given pattern has been found. This allows for a great deal of flexibility: for example, we can check for a pattern while a file is being downloaded rather than only after, leveraging one of the most useful features of ripgrep even over HTTP.
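To make the boundary between the two parts concrete, here is a minimal sketch of how that export could look from the TypeScript side; the module name and exact signature generated by wasm-bindgen are assumptions based on the description above, not the library’s actual typings:

```typescript
// Assumed shape of the wasm-bindgen generated binding (hypothetical module name).
// `search_bytes` takes a chunk of bytes plus a regex pattern and reports
// whether the pattern matched anywhere inside that chunk.
declare module "netgrep-wasm" {
  export function search_bytes(bytes: Uint8Array, pattern: string): boolean;
}
```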
The netgrep package is the one responsible for exposing the final API to the user, and the “core” function used to build all the other methods is `Netgrep.search()`. This just executes a `fetch` request toward an endpoint and runs the `search_bytes` function on every batch of bytes downloaded, until a match has been found. When this happens it will just resolve the returned `Promise` with a positive result. My first attempt at reading the content incrementally was an `XMLHttpRequest` with an `onprogress` event, but I noticed that I couldn’t actually read the content being downloaded. Trying to read the response’s value was a dead end-ish too, since as stated in the official documentation:
> […] The value is null if the request is not yet complete or was unsuccessful, with the exception that when reading text data using a responseType of "text" or the empty string (""), the response can contain the response so far while the request is still in the LOADING readyState (3).
Even though this is an interesting tradeoff, there is a better (this is opinionated, obviously) approach using `ReadableStream`, which allows us to read a network response “chunk by chunk”. I copied the example and implemented it inside the `search` method here.
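As a rough illustration of that approach, here is a minimal sketch of a streaming search, assuming a `search_bytes(bytes, pattern)` binding like the one sketched earlier; the names and the fallback behaviour are assumptions, not the actual implementation:

```typescript
// Hypothetical binding exported by the WASM module (assumed signature).
declare function search_bytes(bytes: Uint8Array, pattern: string): boolean;

// Stream a response and test each downloaded chunk against the pattern.
async function streamingSearch(url: string, pattern: string): Promise<boolean> {
  const response = await fetch(url);

  if (!response.body) {
    // Streaming not available: fall back to searching the whole body at once.
    const buffer = new Uint8Array(await response.arrayBuffer());
    return search_bytes(buffer, pattern);
  }

  const reader = response.body.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) {
        return false; // Whole file downloaded, no match found.
      }
      if (value && search_bytes(value, pattern)) {
        return true; // Match found: stop early, skip the rest of the download.
      }
    }
  } finally {
    // Cancelling lets the browser abort the remaining transfer.
    await reader.cancel().catch(() => {});
  }
}
```

One caveat of a naive per-chunk check like this is that a match spanning two chunks would be missed; a real implementation would need to keep some overlap (or buffer whole lines) between consecutive chunks to handle that.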
All the other methods, like `searchBatchWithCallback`, are utility functions built on top of `search` that will (or at least I hope) provide a nice dev experience when using this library.
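To give a feel for what that looks like in practice, here is a short usage sketch; the constructor, result shape, and callback signature are assumptions paraphrased from the description above rather than the library’s documented API:

```typescript
import { Netgrep } from "netgrep";

// Hypothetical setup: the real constructor options may differ.
const ng = new Netgrep();

async function example(): Promise<void> {
  // Search a single file served over HTTP for a pattern.
  const single = await ng.search("https://example.com/posts/hello.md", "ripgrep");
  console.log(single); // Assumed shape: something like { url, result: boolean }.

  // Search a batch of files, receiving each result as soon as it is ready.
  ng.searchBatchWithCallback(
    [
      "https://example.com/posts/first-post.md",
      "https://example.com/posts/second-post.md",
    ],
    "WASM",
    (result) => console.log(result),
  );
}

example().catch(console.error);
```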
Well, as I said, this was just an experiment to play a bit with WASM and with integrating a library that is completely outside the “web” scope. This means that even though I wrote it with performance in mind, it’s not the best way to do Full-Text Search. It could work for small file-based databases (like this blog), ideally behind a server supporting HTTP/2 in order to leverage multiplexing. Anything bigger than that will probably require a more “scalable” approach like an Index-Based Search Engine.
See ya in the next article 👋