DEV Community

Hyeseong Kim
Hyeseong Kim

Posted on • Updated on

Dealing with Unicode strings, done right and better.

TL;DR

I created a library called "unicode-segmenter" to handle Unicode grapheme clusters with good performance and reasonable bundle size. Check it out!

This article is not just about promoting my library but also about the process of creating a high-quality library, including optimization and testing.

Intro

Have you ever heard of the term "grapheme cluster"? If you're using non-English languages or supporting emoji in your service, you probably have.

One key lesson from Unicode is that strings often require more space than they appear to.

If we used UCS-4 (or UTF-32), it would be sufficient to represent every character ever invented, making it easy to count each character. However, this would waste a lot of space.

Therefore, most programming languages use more efficient encodings like UTF-8 or UTF-16. JavaScript uses UTF-16 to represent characters like "👋" as surrogate pairs (two characters). So in JavaScript, its .length property is 2.

Some characters can be even longer by using special characters called "joiners".

  • "🇰🇷" displays as 1, but has a .length of 4
  • "👩‍👩‍👦‍👦" displays as 1, but has a .length of 11
  • "अनुच्छेद" displays as 4, but has a .length of 8
  • Some complex characters displays as 6, but has a .length of 75! 🤯

A unit internally expressed as multiple characters but logically treated as one character is called a "Grapheme Cluster."

Unicode® standardizes how to handle these grapheme clusters in UAX#29, the "Unicode Segmentation" rules.

The problem

Handling graphemes is crucial when creating custom inputs, text editors, or code analyzers that count characters.

There is the Web's Intl.Segmenter API that supports the Unicode Segmentation standard. It's available not only in Web browsers like Chrome and Safari, but also in JS runtimes like Node.js, Deno, and Bun.

However, it's so new that I couldn't use it for my company's app, Karrot. For more context, Karrot is a massive app with over 18M monthly active users in South Korea, so supporting users on older versions is essential. Given the update frequency of mobile users, versions like Chrome 87 and Safari 14.5 are still not sufficient.

One day, my colleague was analyzing the bundle size of their production and reported an issue with our design system library as it seemed unexpectedly large.

Bundle size of graphemer, stat size is 459.49 KB and parsed size is 89.08 KB

It was the "graphemer" library inside, more than 95% of the size of the problematic input component, and is a large enough portion of the overall app that it's even visually noticeable.

Today, graphemer is the most popular way to handle grapheme clusters in JavaScript. (20M weekly downloads on NPM !!)

But graphemer is an old library and is no longer actively maintained. This may result in differences with the latest Unicode version (mostly in Hindi), and no ESM support.

Custom implementations for Unicode are not trivial in bundle size because they include Unicode data. Also, graphemer generates tons of if-else statements for performance reasons. (spoiler: it's not performant)

Well, isn't there a better alternative?

Can WebAssembly (and Rust) save us?

Probably YES!

I already knew that there was a good quality library unicode-segmentation in Rust, and Rust has a great WebAssembly toolchain called wasm-bindgen.

So I created a simple binding using wasm-bindgen.

use unicode_segmentation::UnicodeSegmentation;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn count(val: &str) -> usize {
    val.graphemes(true).count()
}

#[wasm_bindgen]
pub fn collect(val: &str) -> Vec<String> {
    val.graphemes(true).map(|s| s.to_string()).collect()
}
Enter fullscreen mode Exit fullscreen mode

The above code generates a WASM binary with JavaScript binding. It was surprisingly easy.

After a few manual optimizations, I got a WASM binary in 52KB size. When combined with the bindings, it totals around 55KB. Also, WASM binaries compress very well, coming in at a total of 23KB when compressed by gzip!

This is already smaller than the parsed size of the graphemer library, and since WASM supports streaming compilation, it appeared to be a better option to me.

However, feedback from my colleagues wasn’t positive. The reason was that the size was still too large, and using WebAssembly was not yet feasible for us.

RIIJS (Rewrite It In JavaScript)

Rust people have a cultural idiom called RIIR (Rewrite It In Rust).

Although WebAssembly is known for its performance, its size can sometimes be unacceptable in mobile apps, or the performance might not meet expectations due to binding overhead.

A high-quality JavaScript library is still valuable, so I decided to do the opposite: Rewrite It in JavaScript.

First, I reviewed the Rust library’s license. It's dual-licensed under MIT and Apache 2.0, I chose MIT.

Then I began by modifying its code generator, which was written in Python, to produce JavaScript instead of Rust. Since I’m not familiar with Python code, this was the most time-consuming task.

By utilizing my knowledge of Rust and JavaScript, I ported the implementation of its Iterator trait into JavaScript’s iteration protocol.

Since JavaScript doesn't have a pattern-matching like Rust, it could be hard to replicate the same logic. I used the ReScript compiler to maintain the original logic as much as possible. It made me able to port it confidently. Specifically check_pair function can be converted into this.

After some test-fix loops, I got the first working version and the result was way better than expected. It was 3x smaller in size and 2.5x faster than the graphemer 😲

(It was the initial result and even better now!)

Reducing bundle size

The major win of the library comes from choosing the simplest structure to store the Unicode data table. This wins in both size and performance. Not only that but there are also various techniques and efforts to achieve an even smaller size.

Using JavaScript generator

JavaScript has a syntax for stateful logic known as "Generator". Generator is highly expressive, making them 40% smaller (1682 bytes into 1011 bytes) than an equivalent class with the same functionality!

You can see both class version and generator version in old commits.

Considering compression

JavaScript code is mostly transmitted with compression like gzip or brotli (zstd soon).

The minified graphemer library is 89.08 KB, but after gzip compression, it reduces to 13.15 KB. This is because the library size is significantly inflated by a bunch of if-else statements, and compression algorithms excel at such repetitive stuff.

unicode-segmenter aims to be small even before compression, but it also carefully considers the size after compression.

For instance, when dealing with character data like “\u{1F680}”, a minifier like Terser tries to unescape it to "🚀". Unescaping can reduce code size since escape sequences require additional characters. However, in the case of data tables with a significant number of escaped characters, gzip compression is highly effective, making it better not to unescape them. The results are summarized in this pull request.

Enabling tree-shaking

JavaScript production apps use bundlers. These tools can exclude unused modules from the final bundle or completely remove unused code through static analysis.

The one goal of unicode-segmenter is to provide a complete polyfill for Intl.Segmenter. However, if only grapheme segmentation is needed, it adopts a well-considered modular structure to ensure that unnecessary parts are not included.

The exposed APIs have independent entries based on their respective topics.

  • unicode-segmenter/emoji (1KB): Provides emoji matchers
  • unicode-segmenter/general (5.9KB): Provides alphabet/number matchers
  • unicode-segmenter/grapheme (9.6KB): Provides text segmentation by extended grapheme cluster
  • unicode-segmenter/intl-polyfill (9.9KB): Provide Intl.Segmenter (currently grapheme only) polyfill
  • unicode-segmenter/utils (300B): Provide utilities for UTF-16 and Unicode code points.

Optimizing performance

Thanks to the unicode-rs' excellent implementation I referenced, it was already 2.5x better than graphemer in its first version. It's mostly a hand-written binary search in a compressed Unicode data table. It's simple but very efficient.

However, for even better results, it is important to fully consider the characteristics of the language.

Leverage the compiler

As mentioned above, the ReScript compiler was very helpful in the initial porting process.

I ended up rewriting it by hand for the best performance, but until then it allowed me to focus on other areas first, as it always ensures accuracy and sufficient efficiency.

Using the optimal data type

JavaScript engines also have a compilation process called JIT(Just-In-Time compilation). One well-known thing is that they are very efficient when dealing with 32-bit integers (aka SMI; small integers).

A Unicode data is a list of ranges of Unicode code points, which are 32-bit integers. Therefore, I was able to make all operations internally based on SMI.

Unlike Rust, where the compiler checks data types, in JavaScript, this responsibility falls entirely on the developer. I managed to do it with careful attention.

As a result, performance more than doubled in the second release.

Explicit bindings

When using generators or regular functions instead of classes, it is natural to implicitly capture external variables through "closures". However, implicit bindings can negatively impact optimization and garbage collection.

To avoid this, simply hoisting functions resulted in an additional 10% improvement. (See the change)

Avoiding copying and constructing structs

When dealing with more than one piece of data, there’s a desire to handle it as a structure.

However, JavaScript lacks efficient records/tuples, and both Object and Array always come with a cost. While this isn’t an issue for most applications, it can be critical in the micro-benchmarks of libraries.

It is also very important to avoid implicit copying. If necessary, pass references directly and perform the copy explicitly.

// retained `cache` reference
function cat(cp, cache) {
  ...
    if (cp < cache[0] || cp > cache[1]) {
      let result = searchGraphemeCategory(cp);
      // update its values by explicit copying
      cache[0] = result[0];
      cache[1] = result[1];
      cache[2] = result[2];
    }
    return cache[2];
  }
};
Enter fullscreen mode Exit fullscreen mode

Continuous benchmarking and efforts!

What helped getting faster mostly was the benchmark tool that was set up from the very beginning and the continuous measurements.

I use mitata (pretiously tinybench). If you care about performance, I highly recommend you do it.

However, poorly designed benchmarks can be as dangerous as having none. They might be completely invalidated by the compiler, too small to reflect real-world scenarios, or fail to identify scalability issues.

To avoid these pitfalls, it is crucial to design benchmarks to be as close to the real world as possible. unicode-segmenter gets help from ChatGPT to construct tests that resemble real code!

Benchmarks can yield very different results depending on the environment. Therefore, I run it not only in my local environment but also across various versions of Node.js and Bun, on various OS and CPU architectures, and I also configure them to run in browsers to verify performance in both Chrome and Safari. Through this process, I confirmed that the performance of Intl.Segmenter has improved significantly in recent versions of Node.js, Chrome, and Safari. This is very exciting news!

But efforts are still worth it since not everyone is using the latest version or ideal environment :)

Testing strategy

Testing is crucial. Since Unicode involves characters that might be unfamiliar, extensive testing was necessary.

I set up initial test suites with Node.js test runner (It's great!) and manually wrote a few known cases, but it was never enough.

Property-based testing

I tried a more precise method called "Property-based testing".

fast-check is the most popular PBT tool in the JavaScript ecosystem. With fast-check, I can easily write test code that defines properties and verifies them based on automated inputs.

Then, how is the verification done? The idea is to check if the results match those of Intl.Segmenter, which is already available at the Node.js runtime!

test('graphemeSegments', async t => {
  await t.test('unicode string', () => {
    // Intl.Segmenter is available on Node.js!
    const intlSegmenter = new Intl.Segmenter();

    fc.assert(
      fc.property(fc.fullUnicodeString(), input => {
        // assert through automated input
        assertObjectContaining(
          [...graphemeSegments(input)],
          [...intlSegmenter.segment(input)],
        );
      }),
    );
  });
});
Enter fullscreen mode Exit fullscreen mode

Through PBT, I found and fixed numerous bugs in the initial implementation. The first bug PBT identified was an empty string (""). 😅 This is a surprisingly common mistake.

I completely switched my primary testing method to PBT and set up additional tests to debug the counterexamples found.

Generating test suites

Unicode provides official test data. I used the data to generate test suites and check for spec compliance.

During this process, I discovered that the Rust library I referenced had not yet implemented the “GB9c” rule, so I implemented it myself. (It seems they implemented it recently)

Production testing

I understand that despite all these efforts, it might still be insufficient. Reality can sometimes be more extreme than theory.

I tried to create migration PRs directly to major dependents of graphemer. I found a few issues related to implementation, and many others related to TypeScript configuration, sourcemap setup, etc.

By supporting both large and small-scale projects, I was able to update the library and gain confidence in its final quality.

Conclusion

As a result, my app achieved a smaller bundle size and better performance.

Bundle size of unicode-segmenter/grapheme, stat size is 42.83 KB,  parsed size is 29.5 KB, gzipped size is 9.15 KB

I also learned that what it does is not that simple. This should be a temporary alternative to Intl.Segmenter.

I recommend to use Intl.Segmenter where possible. The unicode-segmenter library might become another graphemer after 10 years. Who knows!

But if you are in a special environment where Intl.Segmenter is not available, try unicode-segmenter. It provides good performance in a reasonable size.

Also, there are more coming shortly

  • It will soon provide a full Intl.Segmenter polyfill, including word/sentence support. (issue #25)
  • It will support more advanced patterns that Intl.Segmenter doesn't, like backward iteration. (issue #26)

Are you struggling with the quality of the library in another area? Consider taking some time to investigate. We still have many opportunities for further improvement in the JavaScript ecosystem.

Top comments (0)