Giuseppe Ciullo

Posted on Nov 11 • Originally published at dev.giuseppeciullo.it

Stop Splitting Strings the Wrong Way: Discover Intl.Segmenter


How many times have you done this?

const text = "I love pizza and pasta";
text.split(" ");
// ["I", "love", "pizza", "and", "pasta"]

Looks fine — until you deal with punctuation, multiple spaces, or languages that don’t even use spaces.
Try splitting a French sentence with apostrophes, or a Japanese one with no whitespace at all.
Suddenly .split(" ") feels more like a guess than a rule.
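To make those failure cases concrete, here is a small sketch of what a plain space split returns for messy input. The sample strings are illustrative assumptions, chosen only to trigger the problems described above.

```javascript
// Punctuation stays glued to words, and a double space creates an empty string.
const messy = "Pizza,  pasta and ice cream!";
const naiveWords = messy.split(" ");
console.log(naiveWords);
// ["Pizza,", "", "pasta", "and", "ice", "cream!"]

// Japanese has no spaces, so the whole sentence comes back as one "word".
const japanese = "私は寿司が好きです。";
const naiveJapanese = japanese.split(" ");
console.log(naiveJapanese.length); // 1
```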

That’s where Intl.Segmenter comes in.

What Is `Intl.Segmenter`?

Intl.Segmenter is part of the JavaScript Internationalization API.
It splits text into human-perceived units — words, sentences, or characters — using the rules of a specific language.

In other words, it doesn’t just separate symbols.
It understands how people actually read and write.

Understanding Granularity

The secret sauce of Intl.Segmenter lies in its granularity — the level at which text is broken down.
You can choose between three modes:

grapheme → splits text into visible characters (user-perceived symbols)
word → splits into words, respecting language-specific rules
sentence → splits into sentences, automatically detecting punctuation and spacing

Each level serves a different purpose:
use graphemes for counting characters, words for tokenization, and sentences for text analysis.
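As a rough side-by-side, the sketch below runs one string through all three granularities. The helper name segmentsOf and the sample text are assumptions made for illustration; only Intl.Segmenter and its granularity option come from the API itself.

```javascript
// Hypothetical helper: returns the raw segment strings for a given granularity.
function segmentsOf(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const sample = "Hi there. Bye!";

// One entry per user-perceived character.
console.log(segmentsOf(sample, "grapheme").length); // 14

// Words, plus the spaces and punctuation between them.
console.log(segmentsOf(sample, "word"));
// ["Hi", " ", "there", ".", " ", "Bye", "!"]

// Sentences, each keeping its trailing whitespace.
console.log(segmentsOf(sample, "sentence"));
// ["Hi there. ", "Bye!"]
```

Note that word granularity also yields the separators; the examples in this post drop them by filtering on isWordLike.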

Here’s the beauty: the same phrase can be segmented differently depending on the locale.

Example: One Sentence, Three Languages

Let’s see how Intl.Segmenter handles the same idea across English, French, and Japanese.

const sentences = {
  en: "I love sushi.",
  fr: "J’aime les sushis.",
  ja: "私は寿司が好きです。"
};

for (const [locale, text] of Object.entries(sentences)) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [...segmenter.segment(text)]
    .filter(s => s.isWordLike)
    .map(s => s.segment);

  console.log(locale, words, `(${words.length} words)`);
}

Output:

en [ 'I', 'love', 'sushi' ] (3 words)
fr [ 'J’aime', 'les', 'sushis' ] (3 words)
ja [ '私', 'は', '寿司', 'が', '好き', 'です' ] (6 words)

Same idea, different segmentation.
Each result follows the word-boundary rules that suit its language.
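How well a given locale is supported depends on the locale data shipped with the runtime. A minimal sketch for checking this, assuming you want to verify the three locales used above before relying on them:

```javascript
// Ask the runtime which of the requested locales it has segmentation data for.
const requested = ["en", "fr", "ja"];
const supported = Intl.Segmenter.supportedLocalesOf(requested);
console.log(supported); // result depends on the runtime's locale data

// resolvedOptions() reports the locale and granularity actually in use.
const resolved = new Intl.Segmenter("fr", { granularity: "word" }).resolvedOptions();
console.log(resolved.locale, resolved.granularity);
```

If a requested locale is missing, the segmenter falls back to another locale rather than throwing, so checking resolvedOptions() is a cheap way to see what you actually got.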

Example: Word Segmentation (English)

const text = "Pizza, pasta and ice cream!";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

const words = [...segmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(words);
// ["Pizza", "pasta", "and", "ice", "cream"]

Before the .map() call, each element produced by segment() is not just a string; it's an object with metadata:

{
  segment: "Pizza",
  index: 0,
  input: "Pizza, pasta and ice cream!",
  isWordLike: true
}
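That metadata is useful on its own. The sketch below, using the same sentence, looks up the word under a cursor position with containing() and derives start and end offsets from index. The variable names are illustrative assumptions.

```javascript
const menu = "Pizza, pasta and ice cream!";
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const segments = wordSegmenter.segment(menu);

// containing() returns the segment that covers a given code unit index.
const atCursor = segments.containing(7); // index 7 is the "p" in "pasta"
console.log(atCursor.segment, atCursor.index, atCursor.isWordLike);
// "pasta" 7 true

// index plus the segment's length gives each word's [start, end) range.
const ranges = [...segments]
  .filter((s) => s.isWordLike)
  .map((s) => [s.index, s.index + s.segment.length]);
console.log(ranges[0]); // [0, 5]
```

Ranges like these are what you would feed into highlighting or double-click selection logic.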

Example: Grapheme Segmentation

Need to count or slice user-visible characters correctly?

const text = "Café résumé";
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

const graphemes = [...graphemeSegmenter.segment(text)].map(s => s.segment);
console.log(graphemes);
// ["C", "a", "f", "é", " ", "r", "é", "s", "u", "m", "é"]

Each accented character counts as one grapheme, even when it is stored as a base letter plus a combining accent, where .length would report two code units.
Perfect for counters, highlighting, or text selection logic.
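The difference from .length shows up most clearly with combining marks and emoji. Here is a minimal sketch; countGraphemes and truncateGraphemes are hypothetical helpers written for this example, and the strings use escapes so the code points are explicit.

```javascript
const clusterSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: counts user-perceived characters.
function countGraphemes(text) {
  let count = 0;
  for (const _ of clusterSegmenter.segment(text)) count++;
  return count;
}

// Hypothetical helper: keeps at most `max` graphemes without cutting one in half.
function truncateGraphemes(text, max) {
  let out = "";
  let taken = 0;
  for (const { segment } of clusterSegmenter.segment(text)) {
    if (taken === max) break;
    out += segment;
    taken++;
  }
  return out;
}

const decomposed = "Cafe\u0301";   // "e" followed by a combining acute accent
const flag = "\u{1F1EE}\u{1F1F9}"; // regional indicators I + T (Italian flag)

console.log(decomposed.length, countGraphemes(decomposed)); // 5 4
console.log(flag.length, countGraphemes(flag));             // 4 1
console.log(truncateGraphemes(decomposed, 3));              // "Caf"
```

A character counter built on .length would report 5 and 4 here, which is not what the user sees on screen.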

Example: Sentence Segmentation

const paragraph = "I love pizza. Pasta is great too! Let's eat.";
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

for (const segment of sentenceSegmenter.segment(paragraph)) {
  console.log(segment.segment);
}

Output:

I love pizza.
Pasta is great too!
Let's eat.

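Punctuation is handled automatically.
Note that each segment keeps its trailing whitespace (the first one is "I love pizza. " with a final space), so call .trim() if you need clean strings.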
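If you want the sentences as a clean array rather than logged output, a small sketch along these lines should work. The variable names are assumptions for illustration.

```javascript
const paragraphText = "I love pizza. Pasta is great too! Let's eat.";
const bySentence = new Intl.Segmenter("en", { granularity: "sentence" });

// Raw segments cover the whole input, whitespace included.
const rawSentences = Array.from(bySentence.segment(paragraphText), (s) => s.segment);
console.log(JSON.stringify(rawSentences[0])); // "I love pizza. "

// Trim each one and drop anything that was only whitespace.
const sentences = rawSentences.map((s) => s.trim()).filter((s) => s.length > 0);
console.log(sentences);
// ["I love pizza.", "Pasta is great too!", "Let's eat."]
```

Because the raw segments are contiguous, joining them back together reproduces the original string exactly.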

Example: Real-world Use — Word Counter

You can build a word counter that works in any language:

function countWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return [...segmenter.segment(text)].filter(s => s.isWordLike).length;
}

console.log(countWords("Pizza, pasta and ice cream!", "en")); // 5
console.log(countWords("J’aime le chocolat et le café.", "fr")); // 6
console.log(countWords("私は寿司が好きです。", "ja")); // 6

No regex, no guesswork — fully locale-aware.
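In production you may want two extras: reuse segmenter instances instead of creating one per call, and degrade gracefully on runtimes that lack Intl.Segmenter. The sketch below is one possible shape, not the only one; segmenterCache, getWordSegmenter, and countWordsSafe are hypothetical names, and the fallback is a deliberately rough whitespace split.

```javascript
// Hypothetical cache so repeated calls reuse one segmenter per locale.
const segmenterCache = new Map();

function getWordSegmenter(locale) {
  if (!segmenterCache.has(locale)) {
    segmenterCache.set(locale, new Intl.Segmenter(locale, { granularity: "word" }));
  }
  return segmenterCache.get(locale);
}

function countWordsSafe(text, locale = "en") {
  // Fallback for runtimes without Intl.Segmenter: approximate, space-based only.
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return text.trim().split(/\s+/).filter(Boolean).length;
  }
  let count = 0;
  for (const s of getWordSegmenter(locale).segment(text)) {
    if (s.isWordLike) count++;
  }
  return count;
}

console.log(countWordsSafe("Pizza, pasta and ice cream!")); // 5
console.log(countWordsSafe("")); // 0
```

The fallback will be wrong for languages without spaces, so treat it as a last resort rather than an equivalent.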

Conclusion

Splitting text isn’t just a technical task; it’s a linguistic one.
Intl.Segmenter gives JavaScript a built-in way to find grapheme, word, and sentence boundaries as each language defines them, instead of just cutting on characters.

No regex. No libraries. No fragile logic.

To explore more, visit the official MDN Web Docs.
And connect on LinkedIn for more updates and insights.
