The Right Way to Split a String Into Words in JavaScript

#javascript #nlp #chinese #thai

When processing text, one of the most common tasks is breaking a string into an array of words. In a hurry? Jump to the correct way.

Wrong way

The quick and dirty way to do it is to call the built-in JavaScript string method split with a space character as the separator:

"Hello world!".split(" ")
// ['Hello', 'world!']
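
As a quick check of how the same call behaves on less tidy input, here is a minimal sketch. The sample strings and variable names are illustrative and not from the original example:

```javascript
// Two consecutive spaces produce an empty string in the result.
const doubled = "Hello  world!".split(" ");
// ['Hello', '', 'world!']

// Splitting on runs of whitespace removes the empty entry,
// but the punctuation is still glued to the word.
const collapsed = "Hello  world!".split(/\s+/);
// ['Hello', 'world!']
```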

This approach, however, does not handle double spaces or punctuation: consecutive spaces produce empty strings, and punctuation stays attached to the words. One way to improve it is to use the built-in string method match to find all runs of word characters (\w+), which excludes punctuation:

"Let's try again: hello world!".match(/\w+/g);
// ['Let', 's', 'try', 'again', 'hello', 'world']

Whoops, the apostrophe broke the word Let's into two words.

You could try to fix the regular expression (for example, with [\w']+), but as you tackle more edge cases, it becomes harder and harder to make match work correctly.
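
To illustrate how quickly the edge cases pile up, here is a small sketch of the patched pattern on a few inputs. The test strings are my own, chosen only to show typical failures:

```javascript
const pattern = /[\w']+/g;

// The contraction now survives.
const contraction = "Let's try again".match(pattern);
// ["Let's", 'try', 'again']

// \w only covers ASCII letters, digits and underscore, so accented
// letters split the words. (\u00ef is "ï", \u00e9 is "é".)
const accented = "na\u00efve caf\u00e9".match(pattern);
// ['na', 've', 'caf']

// Single quotes used as quotation marks are now treated as part of the word.
const quoted = "she said 'hello'".match(pattern);
// ['she', 'said', "'hello'"]
```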

I18n

If you expect other people to use your code, you should also know that a number of languages do not use word separators at all: Chinese, Japanese, Thai, and others. In total, about 1.5 billion people speak these languages, or about 20% of the world's population.
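
A short sketch of what both approaches from the previous section do with such text. The sample strings are illustrative:

```javascript
// No spaces in the text, so split returns the whole string as one "word".
const thaiParts = "สวัสดีชาวโลก".split(" ");
// thaiParts.length is 1

// \w does not match Chinese characters, so there is no match at all.
const chineseWords = "你好世界".match(/\w+/g);
// null
```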

When our company, BestKru, was looking for an NLP solution to work with Thai text, we found that many popular libraries, frameworks, and apps support only space-separated text, which meant we could not use them at all! I'm writing this post to raise awareness of this problem and to encourage developers to use a better way of breaking text into words.

Correct way

To make your code support different languages and work correctly with complex punctuation, use the standard built-in JavaScript object Intl.Segmenter, which applies the Unicode Standard's text segmentation rules (Unicode Standard Annex #29).

This is how you use it:

const text = "What's up, world? 你好世界！ こんにちは世界！สวัสดีชาวโลก";
const segmenter = new Intl.Segmenter([], { granularity: 'word' });
const segmentedText = segmenter.segment(text);
const words = [...segmentedText].filter(s => s.isWordLike).map(s => s.segment);
// ["What's", 'up', 'world', '你好', '世界', 'こんにちは', '世界', 'สวัสดี', 'ชาว', 'โลก']
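
If you need this in more than one place, the snippet above could be wrapped in a small helper. This is only a sketch: the function name splitIntoWords and its optional locale parameter are my own choices, not part of any library. It assumes a runtime that ships Intl.Segmenter.

```javascript
// Hypothetical helper: returns only the word-like segments of a string.
// `locale` is optional; when omitted, the runtime's default locale is used.
function splitIntoWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Spaces and punctuation come back as segments too, with isWordLike === false.
    if (isWordLike) words.push(segment);
  }
  return words;
}

const english = splitIntoWords("Let's try again:  hello world!", "en");
// ["Let's", 'try', 'again', 'hello', 'world']
```

Both problems from the "Wrong way" section are handled here: the double space yields no empty entries, and the apostrophe stays inside the word. Segmentation of Chinese, Japanese, and Thai relies on the data bundled with the JavaScript engine, so exact word boundaries for those languages may vary slightly between environments.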