When processing text, one of the most common tasks is breaking a string into an array of words. In a hurry? Jump to the correct way.
Wrong way
The quick and dirty way to do it is to use the built-in JavaScript function split with a space character as the separator:
"Hello world!".split(" ")
// ['Hello', 'world!']
This approach, however, does not account for double spaces or punctuation. One way to improve it is to use the built-in function match to find all runs of word characters (\w+), which excludes punctuation:
"Let's try again: hello world!".match(/\w+/g);
// ['Let', 's', 'try', 'again', 'hello', 'world']
Whoops, the apostrophe broke the word Let's into two words.
You could try to fix the regular expression (for example, by using [\w\']+), but as you tackle more edge cases, it becomes increasingly difficult to make match work correctly.
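To make the edge-case problem concrete, here is a small sketch using sample strings of my own choosing (they are not from any real dataset). It applies the patched pattern [\w']+ and shows two inputs it still gets wrong: in JavaScript, \w matches only ASCII letters, digits, and the underscore, so accented letters and hyphens both break words apart.

```javascript
// \w covers only [A-Za-z0-9_], so accented letters act as separators.
// The escapes spell "naïve café" with precomposed characters.
const accented = "na\u00EFve caf\u00E9".match(/[\w']+/g);
// ['na', 've', 'caf']

// Hyphenated words are split at the hyphen as well.
const hyphenated = "a well-known trick".match(/[\w']+/g);
// ['a', 'well', 'known', 'trick']
```

Each of these could be patched with a longer character class, but every patch tends to expose the next exception.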
I18n
If you expect other people to use your code, you should also know that a number of languages don't use any word separators at all: Chinese, Japanese, Thai, and others. In total, about 1.5 billion people speak these languages, or roughly 20% of the world's population.
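As a quick illustration, here is what whitespace splitting does with a sentence that contains no spaces. The sample string is the Chinese greeting used later in this post; the variable names are mine.

```javascript
// "Hello world" in Chinese: two words, no whitespace between them.
const chinese = "你好世界";

// Splitting on whitespace finds nothing to split on.
const parts = chinese.split(/\s+/);
// parts is ['你好世界']: a single element holding the whole sentence
```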
When our company, BestKru, was looking for an NLP solution to work with Thai text, we found that many popular libraries, frameworks, and apps support only space-separated text, which meant we could not use them at all! I'm writing this post to raise awareness of this problem and to encourage developers to use a better way of breaking text into words.
Correct way
To make your code support different languages and work correctly with complex punctuation, use the standard built-in JavaScript object Intl.Segmenter which utilizes the Unicode Standard's segmentation rules.
This is how you use it:
const text = "What's up, world? 你好世界! こんにちは世界!สวัสดีชาวโลก";
const segmenter = new Intl.Segmenter([], { granularity: 'word' });
const segmentedText = segmenter.segment(text);
const words = [...segmentedText].filter(s => s.isWordLike).map(s => s.segment);
// ["What's", 'up', 'world', '你好', '世界', 'こんにちは', '世界', 'สวัสดี', 'ชาว', 'โลก']
As you can see, the punctuation, the apostrophe, and all four languages are handled correctly.
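If you need this in more than one place, the same steps can be wrapped in a small helper. The sketch below is only one possible shape: the name splitIntoWords and its locale parameter are hypothetical, and the availability check is my own addition for runtimes that do not ship Intl.Segmenter.

```javascript
// Hypothetical helper (not part of any library): returns the word-like
// segments of `text`. `locale` is optional; when it is undefined, the
// runtime's default locale is used.
function splitIntoWords(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is not available in this runtime');
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

const englishWords = splitIntoWords("What's up, world?", 'en');
// ["What's", 'up', 'world']
```

Iterating with for...of avoids building the intermediate array of all segments, which may matter for long texts.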
You can also use Intl.Segmenter to break a text into sentences, but that is a story for another time.
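For the curious, here is a minimal sketch of what that looks like, assuming an English sample text of my own. The only change from the word example is the granularity option; sentence segments keep their trailing whitespace, so this sketch trims them.

```javascript
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });

// Each segment is one sentence, including any trailing space.
const sentences = [...sentenceSegmenter.segment("Hello world! How are you? Fine.")]
  .map(s => s.segment.trim());
```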
Top comments (3)
This is really helpful, thank you!
Shouldn't sentence.split(/\s+/) just do the trick?
Not all languages use spaces as word delimiters.