<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kamonwan Achjanis</title>
    <description>The latest articles on DEV Community by Kamonwan Achjanis (@kamonwan).</description>
    <link>https://dev.to/kamonwan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1132866%2F9137318c-3f99-4954-aba6-7acd50716845.jpg</url>
      <title>DEV Community: Kamonwan Achjanis</title>
      <link>https://dev.to/kamonwan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kamonwan"/>
    <language>en</language>
    <item>
      <title>The Right Way to Split String Into Words in JavaScript</title>
      <dc:creator>Kamonwan Achjanis</dc:creator>
      <pubDate>Fri, 04 Aug 2023 10:17:12 +0000</pubDate>
      <link>https://dev.to/kamonwan/the-right-way-to-break-string-into-words-in-javascript-3jp6</link>
      <guid>https://dev.to/kamonwan/the-right-way-to-break-string-into-words-in-javascript-3jp6</guid>
      <description>&lt;p&gt;When processing text, one of the common tasks is breaking a string into an array of words. In a hurry? Jump to the "Correct way" section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrong way
&lt;/h2&gt;

&lt;p&gt;The quick and dirty way is to use the built-in JavaScript method &lt;code&gt;split&lt;/code&gt; with a space character as the separator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello world!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// ['Hello', 'world!']&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach, however, does not account for double spaces or punctuation. One way to improve it is to use the built-in method &lt;code&gt;match&lt;/code&gt; with the pattern &lt;code&gt;\w+&lt;/code&gt; to find all words while excluding punctuation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Let's try again: hello world!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\w&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// ['Let', 's', 'try', 'again', 'hello', 'world']&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whoops, the apostrophe broke the word &lt;code&gt;Let's&lt;/code&gt; into two words.&lt;/p&gt;

&lt;p&gt;You could try to fix the regular expression (for example, with &lt;code&gt;[\w']+&lt;/code&gt;), but as you tackle more edge cases, it becomes increasingly difficult to make &lt;code&gt;match&lt;/code&gt; work correctly.&lt;/p&gt;
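
&lt;p&gt;To make a few of those edge cases concrete, here is a small sketch. The sample strings are made up for illustration. It shows a double space leaking an empty string out of &lt;code&gt;split&lt;/code&gt;, and &lt;code&gt;\w+&lt;/code&gt; cutting an accented word apart, because &lt;code&gt;\w&lt;/code&gt; only matches ASCII letters, digits and the underscore.&lt;/p&gt;

```javascript
// Sample inputs are illustrative assumptions, not real-world data.

// A double space yields an empty string between the two words.
const doubleSpace = "Hello  world!".split(" ");
console.log(doubleSpace); // ['Hello', '', 'world!']

// \w is ASCII-only ([A-Za-z0-9_]), so the accented letter acts as a break.
// \u00e9 is the precomposed letter e with an acute accent.
const accented = "caf\u00e9 au lait".match(/\w+/g);
console.log(accented); // ['caf', 'au', 'lait']

// Adding the apostrophe keeps "Let's" whole but still truncates the last word.
const withApostrophe = "Let's go to the caf\u00e9".match(/[\w']+/g);
console.log(withApostrophe); // ["Let's", 'go', 'to', 'the', 'caf']
```

&lt;p&gt;Each fix to the pattern handles one case and leaves the others untouched, which is why the regular expression keeps growing.&lt;/p&gt;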

&lt;h2&gt;
  
  
  I18n
&lt;/h2&gt;

&lt;p&gt;If you expect other people to use your code, you should also know that a number of languages do not use word separators at all: Chinese, Japanese, Thai, and others. In total, about 1.5 billion people speak these languages, or about 20% of the world's population.&lt;/p&gt;
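
&lt;p&gt;A quick way to see the problem, as a minimal sketch with two sample phrases: splitting on spaces returns such text as a single element, because there are no spaces to split on.&lt;/p&gt;

```javascript
// Sample phrases: "hello world" written in Chinese and in Thai.
const chinese = "你好世界";
const thai = "สวัสดีชาวโลก";

// Neither string contains a space, so split(" ") finds no word boundaries
// and returns the whole string as the only element.
const chineseParts = chinese.split(" ");
const thaiParts = thai.split(" ");

console.log(chineseParts.length); // 1
console.log(thaiParts.length);    // 1
```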

&lt;p&gt;When our company &lt;a href="https://bestkru.com/"&gt;BestKru&lt;/a&gt; was looking for an NLP solution to work with Thai text, we found that many popular libraries, frameworks and apps support only space-separated text, which meant we could not use them at all! I'm writing this post to raise awareness of this problem and to encourage developers to use a better way of breaking text into words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correct way
&lt;/h2&gt;

&lt;p&gt;To make your code support different languages and work correctly with complex punctuation, use the standard built-in JavaScript object &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter"&gt;Intl.Segmenter&lt;/a&gt;, which applies the Unicode Standard's segmentation rules.&lt;/p&gt;

&lt;p&gt;This is how you use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's up, world? 你好世界！ こんにちは世界！สวัสดีชาวโลก&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Segmenter&lt;/span&gt;&lt;span class="p"&gt;([],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;granularity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;word&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segmentedText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;segmenter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;segmentedText&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isWordLike&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// ["What's", 'up', 'world', '你好', '世界', 'こんにちは', '世界', 'สวัสดี', 'ชาว', 'โลก']&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the punctuation, the apostrophe, and all four languages are handled correctly.&lt;/p&gt;
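
&lt;p&gt;If you need this in more than one place, the same steps can be wrapped in a small helper. This is only a sketch: the name &lt;code&gt;splitWords&lt;/code&gt; and its optional &lt;code&gt;locale&lt;/code&gt; parameter are assumptions made for illustration, not part of any library.&lt;/p&gt;

```javascript
// splitWords is a hypothetical helper name, not part of any library.
// locale is optional; undefined lets the runtime use its default locale.
function splitWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  // segment() returns an iterable of objects that include
  // the segment text and an isWordLike flag.
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

console.log(splitWords("What's up, world?", "en"));
// ["What's", 'up', 'world']
```

&lt;p&gt;The loop avoids building an intermediate array of every segment, which may matter for long texts. Segmentation of languages without spaces relies on data shipped with the JavaScript engine, so results for such text may differ slightly between environments.&lt;/p&gt;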

&lt;p&gt;You can also use &lt;code&gt;Intl.Segmenter&lt;/code&gt; to break a text into sentences, but that is a story for another time.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>nlp</category>
      <category>chinese</category>
      <category>thai</category>
    </item>
  </channel>
</rss>
