Easy, relevant, efficient and semi-scalable static site search with InfiSearch

Hi! At its core, InfiSearch aims to be a tool that helps you deliver client-side static site search easily and effectively.

The problem statement(s): You want to add page search to your static site, but:

- Don't have the time to pick a client-side search library (e.g. lunr.js), write the indexing code, hook up data to an existing UI, design a new UI, etc.
- Can't use a free-tier SaaS service like Algolia DocSearch (e.g. private site, your use case doesn't fit the ToS, etc.).
- Don't want to forgo the amazing search features and relevance offered by such SaaS services.
- Don't have the time to implement an accessible UI.
- Want to be able to support a sizeable collection of documents (roughly < 500MB of raw text).

Here is a high-level overview of how you would set up InfiSearch (check out the repo for details):

1. Download the CLI binary or install it through the Rust toolchain.
2. Use the binary to index a folder of your HTML files. Links to your HTML files are generated based on relative file paths.
3. Add <script> and <link> tags pointing to InfiSearch's resources to your site.
4. Call a single function, infisearch.init, which takes:
- URL of the index output directory.
- A link base URL, for concatenating with the relative file paths, giving the full link of your pages.
- id of an input element on your page.
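
To make the "link base URL" parameter concrete, here is a minimal sketch of how a base URL and a relative file path could be joined into a full page link. The helper name buildPageLink and the example URLs are made up for illustration. This is not InfiSearch's actual code, only the general idea.

```typescript
// Hypothetical helper (not part of InfiSearch): join a link base URL with a
// file path that is relative to the indexed folder.
function buildPageLink(linkBaseUrl: string, relativeFilePath: string): string {
  // Remove trailing slashes from the base so exactly one "/" is inserted.
  const base = linkBaseUrl.replace(/\/+$/, "");
  // Normalise Windows-style separators, then drop a leading "./" or "/".
  const relative = relativeFilePath.replace(/\\/g, "/").replace(/^\.?\/+/, "");
  return `${base}/${relative}`;
}

// "https://example.com/docs/guide/intro.html"
const exampleLink = buildPageLink("https://example.com/docs/", "./guide/intro.html");
```

Because links are derived purely from relative paths, moving the site to another domain should only require changing the base URL, not re-indexing.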

What about "effective" search? That is a broad term, and can be further broken down:

Accessibility: The search UI implements the standard combobox keyboard controls (arrow keys, Home, End, Enter) and ARIA attributes, with keyboard navigability and screen reader usability in mind. This is a critical design goal for the UI. If you find any bugs, or feel strongly about a design choice, please don't hesitate to open an issue.

Scalability: Client-side search often runs into scalability issues very quickly, because most tools generate a single monolithic index file and text store for all documents. InfiSearch alleviates and defers this problem to a large degree by optionally splitting the index and/or text stores into smaller fragments, so that searches use as little bandwidth as possible. See the website for an example, which demos a 520MB, 50,000-document Gutenberg collection of raw text (no HTML soup). Many low-level compression techniques are also applied to the index file itself, and where available, persistent caching is done through the Cache API that backs service workers.
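
To illustrate the fragmentation idea, here is a rough sketch of packing per-term index data into size-limited fragments and then working out which fragments a query needs. All names, the greedy packing strategy and the byte sizes are assumptions for illustration. InfiSearch's real index format is more involved and is not described in this post.

```typescript
// Hypothetical shapes, not InfiSearch's real format.
interface PostingList { term: string; bytes: number }
interface FragmentPlan {
  fragments: string[][];                      // terms stored in each fragment
  termToFragment: { [term: string]: number }; // small dictionary loaded up front
}

// Greedily pack posting lists (in sorted term order) into fragments of at most maxBytes.
function planFragments(lists: PostingList[], maxBytes: number): FragmentPlan {
  const sorted = lists.slice().sort((a, b) => (a.term < b.term ? -1 : a.term > b.term ? 1 : 0));
  const fragments: string[][] = [];
  const termToFragment: { [term: string]: number } = {};
  let current: string[] = [];
  let currentBytes = 0;
  for (const list of sorted) {
    // Start a new fragment when adding this list would exceed the budget.
    if (current.length > 0 && currentBytes + list.bytes > maxBytes) {
      fragments.push(current);
      current = [];
      currentBytes = 0;
    }
    termToFragment[list.term] = fragments.length;
    current.push(list.term);
    currentBytes += list.bytes;
  }
  if (current.length > 0) fragments.push(current);
  return { fragments, termToFragment };
}

// A query only has to download the fragments that hold its terms.
function fragmentsForQuery(plan: FragmentPlan, queryTerms: string[]): number[] {
  const needed: number[] = [];
  for (const term of queryTerms) {
    if (!Object.prototype.hasOwnProperty.call(plan.termToFragment, term)) continue;
    const id = plan.termToFragment[term];
    if (needed.indexOf(id) === -1) needed.push(id);
  }
  return needed.sort((a, b) => a - b);
}
```

With a 100-byte budget and lists for apple (40), banana (70), cherry (30) and date (50), this produces three fragments, and a query for "cherry apple" touches only the first two.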

Efficiency: Using less bandwidth is good, but the code running the query (ranking, processing boolean operators, etc.) must itself also be extremely efficient. InfiSearch 1) uses WebAssembly compiled from Rust to run the queries quickly and 2) runs the expensive work inside a Web Worker to avoid blocking the UI thread.

Search Relevance: A scalable search solution is not very useful if the results returned are not relevant. Searchers won't have the patience to dig through hundreds or thousands of results.

As a baseline, InfiSearch borrows and implements some industry-standard ideas: 1) BM25 ranking (this article provides an excellent overview); 2) a (soft) disjunctive maximum over the BM25 scores of a document's fields (titles, h1s, headings, other text), with configurable field weights and sensible defaults; 3) a straightforward score boost for documents that match more query terms over ones that match fewer. These are good baselines, but they can still be improved upon when dealing with thousands of results.
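
For readers unfamiliar with these two ideas, here is a small sketch of the textbook BM25 term score and one common way to compute a "soft" disjunctive maximum (the maximum field score plus a fraction of the remaining ones, as in Lucene-style dis-max with a tie breaker). The parameter defaults and the tie-breaker formulation are assumptions. The exact formulas InfiSearch uses are not given in this post.

```typescript
// Standard BM25 score of one term in one field, using the non-negative IDF variant.
// k1 and b are the usual BM25 parameters; 1.2 and 0.75 are common defaults.
function bm25(
  termFreq: number, docFreq: number, numDocs: number,
  fieldLen: number, avgFieldLen: number,
  k1: number = 1.2, b: number = 0.75,
): number {
  const idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
  const lengthNorm = 1 - b + b * (fieldLen / avgFieldLen);
  return idf * (termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
}

// Soft disjunctive maximum over already-weighted field scores:
// tieBreaker = 0 gives a pure max, tieBreaker = 1 gives a plain sum.
function softDisMax(weightedFieldScores: number[], tieBreaker: number): number {
  if (weightedFieldScores.length === 0) return 0;
  let max = -Infinity;
  let sum = 0;
  for (const score of weightedFieldScores) {
    if (score > max) max = score;
    sum += score;
  }
  return max + tieBreaker * (sum - max);
}

// Example: title, heading and body scores (after field weights) of 3, 1 and 2.
const combined = softDisMax([3, 1, 2], 0.5); // 3 + 0.5 * (1 + 2) = 4.5
```

The appeal of the soft maximum is that the best-matching field dominates the score, while matches in other fields still nudge the document upwards.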

The key way InfiSearch expands on these is by factoring in the proximity and order of query terms with respect to one another, using the term positions it stores by default. This greatly improves contextual relevance. To give a real-world example, if you run the query "sunny weather", you will likely want to see these two terms fairly close together and in order, and not in completely separate paragraphs, as in "last week was sunny ... lots of text in between ... weather forecast".
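
As a rough illustration of proximity and order, the sketch below finds the smallest in-order span covering all query terms from their stored positions, and turns it into a score multiplier. The function names and the boost formula are hypothetical. InfiSearch's actual proximity ranking is not detailed here.

```typescript
// positions[i] holds the sorted, non-negative token positions of the i-th query
// term in one document. Returns the smallest distance between the first and last
// term over all in-order occurrences, or Infinity if the terms never appear in
// query order.
function minOrderedSpan(positions: number[][]): number {
  if (positions.length === 0) return Infinity;
  let best = Infinity;
  for (const start of positions[0]) {
    let prev = start;
    let complete = true;
    for (let i = 1; i < positions.length; i++) {
      // Greedily take the earliest occurrence of term i after the previous term.
      let next = -1;
      for (const p of positions[i]) {
        if (p > prev) { next = p; break; }
      }
      if (next === -1) { complete = false; break; }
      prev = next;
    }
    if (complete) best = Math.min(best, prev - start);
  }
  return best;
}

// Hypothetical boost: 2x when the terms are adjacent, decaying towards 1x
// (no boost) as more unrelated tokens sit in between.
function proximityBoost(span: number, numTerms: number): number {
  const gap = span - (numTerms - 1);
  return 1 + 1 / (1 + gap);
}

// Query "sunny weather": adjacent terms beat terms 117 tokens apart.
const closeBoost = proximityBoost(minOrderedSpan([[3], [4]]), 2);   // 2
const farBoost = proximityBoost(minOrderedSpan([[3], [120]]), 2);   // just above 1
```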

For term tokenization, InfiSearch performs an automatic prefix search (e.g. run* searches for run, running) on the last query term, which the user is still typing, to increase search recall. Stemming is not enabled by default, as it decreases the precision/relevance of searches. You may recognise this behaviour from Algolia DocSearch; InfiSearch borrows this idea/observation.
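
The automatic prefix search can be pictured with the following sketch. It assumes that a query ending without whitespace means the last word is still being typed, and it expands that word against a plain list of known terms. Both the rule and the names are illustrative assumptions, not InfiSearch's implementation.

```typescript
interface QueryTerm { text: string; isPrefix: boolean }

// Split a raw query into lowercased terms, marking the last one as a prefix
// search when the user appears to still be typing it (no trailing whitespace).
function tokenizeQuery(query: string): QueryTerm[] {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const stillTyping = query.length > 0 && !/\s$/.test(query);
  return words.map((text, i) => ({
    text,
    isPrefix: stillTyping && i === words.length - 1,
  }));
}

// Expand a prefix against the index vocabulary (here just an array of terms).
function expandPrefix(prefix: string, vocabulary: string[]): string[] {
  return vocabulary.filter((term) => term.indexOf(prefix) === 0);
}

// "run" expands to run, running and rung, but not sun.
const expanded = expandPrefix("run", ["run", "running", "rung", "sun"]);
```

Unlike stemming, this only widens the match for the one term that is incomplete, so completed terms still have to match exactly.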

(Manual) Search Relevance: The prior section discussed automatic strategies for boosting relevance. As relevance is such a key part of a good search experience, providing manual ways for the user to refine searches is also important (e.g. Google's boolean and phrase query syntaxes). InfiSearch supports exact/phrase queries ("weather forecast"), boolean operators (+mandatory -subtraction ~inversion), and prefix searches (run*). Restricting a search to titles/headings only is also possible.

Query syntaxes are not the most intuitive, so you can also set up multi-select dropdowns in the UI for filtering by any kind of "categories" using your own data. Numeric filters are also available, for filtering and sorting results by numbers and datetimes you provide in your pages.

Please check out the repo here, and drop a star if you found this tool useful.
