Giuseppe Ciullo

Posted on • Originally published at dev.giuseppeciullo.it

Stop Splitting Strings the Wrong Way: Discover Intl.Segmenter

How many times have you done this?

const text = "I love pizza and pasta";
text.split(" ");
// ["I", "love", "pizza", "and", "pasta"]

Looks fine — until you deal with punctuation, multiple spaces, or languages that don’t even use spaces.
Try splitting a French sentence with apostrophes, or a Japanese one with no whitespace at all.
Suddenly .split(" ") feels more like a guess than a rule.
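To make those failure modes concrete, here is a quick sketch of what `.split(" ")` returns for input with punctuation and a double space, and for a Japanese sentence. The sample strings are illustrative only.

```javascript
// Punctuation stays glued to words, and a double space yields an empty string.
const messy = "Pizza,  pasta and ice cream!";
const naive = messy.split(" ");
console.log(naive);
// ["Pizza,", "", "pasta", "and", "ice", "cream!"]

// Japanese has no spaces, so the whole sentence comes back as one "word".
const japanese = "私は寿司が好きです。";
const naiveJa = japanese.split(" ");
console.log(naiveJa.length); // 1
```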

That’s where Intl.Segmenter comes in.


What Is Intl.Segmenter?

Intl.Segmenter is part of the JavaScript Internationalization API.
It splits text into human-perceived units — words, sentences, or characters — using the rules of a specific language.

In other words, it doesn’t just separate symbols.
In practice, engines base the boundaries on the Unicode text segmentation rules (UAX #29) plus locale data, so breaks land where readers expect them.


Understanding Granularity

The secret sauce of Intl.Segmenter lies in its granularity — the level at which text is broken down.
You can choose between three modes:

  • grapheme → splits text into visible characters (user-perceived symbols)
  • word → splits into words, respecting language-specific rules
  • sentence → splits into sentences, automatically detecting punctuation and spacing

Each level serves a different purpose:
use graphemes for counting characters, words for tokenization, and sentences for text analysis.
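As a rough illustration of the three levels, the sketch below segments one short string at each granularity. The helper name `segmentAt` and the sample string are made up for this example.

```javascript
// Hypothetical helper: segment `text` at the given granularity.
function segmentAt(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return [...segmenter.segment(text)];
}

const sample = "Hi there. Bye!";

const graphemes = segmentAt(sample, "grapheme");
const wordsOnly = segmentAt(sample, "word").filter(s => s.isWordLike);
const sentenceList = segmentAt(sample, "sentence");

console.log(graphemes.length);               // 14
console.log(wordsOnly.map(s => s.segment));  // ["Hi", "there", "Bye"]
console.log(sentenceList.length);            // 2

// When no granularity is given, the default is "grapheme".
console.log(new Intl.Segmenter("en").resolvedOptions().granularity); // "grapheme"
```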

Here’s the beauty: the same phrase can be segmented differently depending on the locale.


Example: One Sentence, Three Languages

Let’s see how Intl.Segmenter handles the same idea across English, French, and Japanese.

const sentences = {
  en: "I love sushi.",
  fr: "J’aime les sushis.",
  ja: "私は寿司が好きです。"
};

for (const [locale, text] of Object.entries(sentences)) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [...segmenter.segment(text)]
    .filter(s => s.isWordLike)
    .map(s => s.segment);

  console.log(locale, words, `(${words.length} words)`);
}

Output:

en [ 'I', 'love', 'sushi' ] (3 words)
fr [ 'J’aime', 'les', 'sushis' ] (3 words)
ja [ '私', 'は', '寿司', 'が', '好き', 'です' ] (6 words)

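Same idea, different segmentation, adapted to each language’s rules.
Japanese has no spaces between words, so the exact boundaries come from the engine’s language data and may differ slightly between engines and versions.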
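If you want to check what a given runtime actually does with a locale, `Intl.Segmenter.supportedLocalesOf()` and `resolvedOptions()` are the places to look. A small sketch; the results depend on the runtime's locale data, so no output is shown for them here.

```javascript
// Ask the runtime which of the requested locales it can segment.
const requested = ["en", "fr", "ja"];
const supported = Intl.Segmenter.supportedLocalesOf(requested);
console.log(supported); // depends on the runtime's locale data

// resolvedOptions() reports the locale and granularity actually in use.
const resolved = new Intl.Segmenter("ja", { granularity: "word" }).resolvedOptions();
console.log(resolved.locale, resolved.granularity);
```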


Example: Word Segmentation (English)

const text = "Pizza, pasta and ice cream!";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

const words = [...segmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(words);
// ["Pizza", "pasta", "and", "ice", "cream"]

Before the .map() step, each item yielded by segment() is not just a string: it’s an object with metadata (isWordLike is only present with word granularity):

{
  segment: "Pizza",
  index: 0,
  input: "Pizza, pasta and ice cream!",
  isWordLike: true
}
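The `index` field makes it easy to map a segment back to a range in the input, and the object returned by `segment()` also has a `containing()` method for looking up the segment at a given position. A minimal sketch, assuming a cursor sitting at index 8:

```javascript
const text = "Pizza, pasta and ice cream!";
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment(text);

// containing(i) returns the segment that covers code unit index i.
const atCursor = segments.containing(8); // index 8 is the first "a" in "pasta"
console.log(atCursor.segment, atCursor.index, atCursor.isWordLike);
// pasta 7 true

// `index` plus the segment length gives the [start, end) range in the input.
const start = atCursor.index;
const end = start + atCursor.segment.length;
console.log(text.slice(start, end)); // "pasta"

// Punctuation is a segment too, just not a word-like one.
console.log(segments.containing(5).isWordLike); // false
```

This is the kind of lookup a "select word under cursor" feature could build on.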

Example: Grapheme Segmentation

Need to count or slice user-visible characters correctly?

const text = "Café résumé";
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

const graphemes = [...graphemeSegmenter.segment(text)].map(s => s.segment);
console.log(graphemes);
// ["C", "a", "f", "é", " ", "r", "é", "s", "u", "m", "é"]

Each accented character counts as one grapheme, even when it is stored as a base letter plus a combining accent (two code units).
Perfect for counters, highlighting, or text selection logic.
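The difference shows up with decomposed accents and emoji sequences, where `.length` and grapheme counts disagree. The sketch below uses two hypothetical helpers, `countGraphemes` and `truncateGraphemes`, and writes the test strings with escapes so the code points are explicit.

```javascript
// One shared segmenter instance, reused across calls.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: number of user-perceived characters.
function countGraphemes(text) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(text)) count++;
  return count;
}

// Hypothetical helper: keep at most `max` graphemes without cutting a cluster in half.
function truncateGraphemes(text, max) {
  let out = "";
  let n = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    if (n++ >= max) break;
    out += segment;
  }
  return out;
}

const decomposed = "Cafe\u0301";                           // "e" + combining acute accent
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";  // family emoji joined with ZWJ

console.log(decomposed.length, countGraphemes(decomposed)); // 5 4
console.log(family.length, countGraphemes(family));         // 8 1
console.log(truncateGraphemes(decomposed, 4) === decomposed); // true
```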


Example: Sentence Segmentation

const paragraph = "I love pizza. Pasta is great too! Let's eat.";
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

for (const segment of sentenceSegmenter.segment(paragraph)) {
  console.log(segment.segment);
}

Output:

I love pizza.
Pasta is great too!
Let's eat.

Boundaries are detected from punctuation and spacing automatically; each segment includes the whitespace that follows it.
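Since sentence segments carry their trailing whitespace, a small helper that trims them is often handy. A minimal sketch; the name `splitSentences` is hypothetical.

```javascript
// Hypothetical helper: split into sentences and strip surrounding whitespace.
function splitSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  return [...segmenter.segment(text)]
    .map(s => s.segment.trim())
    .filter(s => s.length > 0);
}

const paragraph = "I love pizza. Pasta is great too! Let's eat.";
const raw = [...new Intl.Segmenter("en", { granularity: "sentence" }).segment(paragraph)];

console.log(JSON.stringify(raw[0].segment)); // "I love pizza. " (note the trailing space)
console.log(splitSentences(paragraph));
// ["I love pizza.", "Pasta is great too!", "Let's eat."]
```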


Example: Real-world Use — Word Counter

You can build a word counter that works across many languages:

function countWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return [...segmenter.segment(text)].filter(s => s.isWordLike).length;
}

console.log(countWords("Pizza, pasta and ice cream!")); // 5
console.log(countWords("J’aime le chocolat et le café.", "fr")); // 6
console.log(countWords("私は寿司が好きです。", "ja")); // 6

No regex, no guesswork — fully locale-aware.
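Not every runtime ships Intl.Segmenter, so code that must run in older environments may want a guard. The sketch below is one possible approach; `countWordsSafe` is a hypothetical name, and the whitespace fallback is an assumption that only holds for space-delimited text.

```javascript
// Hypothetical wrapper: use Intl.Segmenter when present, else a rough fallback.
function countWordsSafe(text, locale = "en") {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    let count = 0;
    for (const s of segmenter.segment(text)) {
      if (s.isWordLike) count++;
    }
    return count;
  }
  // Fallback (assumption: whitespace-delimited text); not accurate for CJK.
  return text.split(/\s+/).filter(Boolean).length;
}

console.log(countWordsSafe("Pizza, pasta and ice cream!")); // 5
```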


Conclusion

Splitting text isn’t just a technical task — it’s a linguistic one.
Intl.Segmenter gives JavaScript a built-in way to find grapheme, word, and sentence boundaries instead of splitting on raw characters.

No regex. No libraries. No fragile logic.

To explore more, visit the official MDN Web Docs.
