
SEN LLC

A Text Statistics Tool That Classifies Hiragana, Katakana, and Kanji Separately

Word count tools work fine for English. For Japanese they're mostly useless — there's no space between words, so "word count" doesn't exist in the normal sense. But character count broken down by script (hiragana / katakana / kanji / ASCII) is useful: it tells you the vocabulary mix, the reading difficulty, and the writing style.

Paste a paragraph. Get character counts, word counts (where meaningful), reading time, speaking time, character frequency distribution, social media limit checks, and a Japanese script breakdown. All client-side, all live, zero dependencies.

🔗 Live demo: https://sen.ltd/portfolio/text-count/
📦 GitHub: https://github.com/sen-ltd/text-count

[Screenshot]

Features:

  • Characters (with/without spaces), words, sentences, paragraphs, lines
  • Reading time (200 wpm / 400 wpm)
  • Speaking time (150 wpm)
  • Japanese script breakdown (hiragana / katakana / kanji / ASCII / digit / other)
  • Top 10 words by frequency
  • Unique word count, average word/sentence length
  • Character frequency chart (top 15)
  • Platform limits (Twitter 280, SMS 160, Instagram bio 150, etc.)
  • Japanese / English UI
  • Zero dependencies, 73 tests

Character classification by Unicode range

export function classifyChar(c) {
  const code = c.charCodeAt(0);
  if (code >= 0x3041 && code <= 0x309F) return 'hiragana';  // ぁ-ゟ
  if (code >= 0x30A0 && code <= 0x30FF) return 'katakana';  // ゠-ヿ
  if (code >= 0x4E00 && code <= 0x9FFF) return 'kanji';     // CJK Unified
  if (/\s/.test(c)) return 'space'; // must run before the ASCII check: U+0020 falls inside the printable range
  if (code >= 0x30 && code <= 0x39) return 'digit';
  if (code >= 0x20 && code <= 0x7E) return 'ascii';
  return 'other';
}

The Unicode ranges:

  • Hiragana: U+3041-U+309F (95 codepoints)
  • Katakana: U+30A0-U+30FF (96 codepoints)
  • Kanji (CJK Unified Ideographs): U+4E00-U+9FFF (20,992 codepoints)

Everything else is categorized as digit, ASCII, whitespace, or other. This gives you a breakdown like "200 chars total: 120 hiragana, 40 kanji, 30 katakana, 10 ASCII" — which describes Japanese prose more usefully than "200 characters".
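The breakdown itself is just a tally over classifyChar. Here's a minimal sketch — the helper name scriptBreakdown is mine, not necessarily the repo's, and classifyChar is condensed from above so the snippet runs standalone:

```javascript
// Condensed copy of classifyChar so this snippet is self-contained.
function classifyChar(c) {
  const code = c.charCodeAt(0);
  if (code >= 0x3041 && code <= 0x309F) return 'hiragana';
  if (code >= 0x30A0 && code <= 0x30FF) return 'katakana';
  if (code >= 0x4E00 && code <= 0x9FFF) return 'kanji';
  if (/\s/.test(c)) return 'space';
  if (code >= 0x30 && code <= 0x39) return 'digit';
  if (code >= 0x20 && code <= 0x7E) return 'ascii';
  return 'other';
}

// Hypothetical aggregator: tally each script category across a string.
function scriptBreakdown(text) {
  const counts = {};
  for (const c of text) {
    const cat = classifyChar(c);
    counts[cat] = (counts[cat] || 0) + 1;
  }
  return counts;
}

console.log(scriptBreakdown("ひらカタ漢ab1"));
// → { hiragana: 2, katakana: 2, kanji: 1, ascii: 2, digit: 1 }
```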

The word-count problem

In English, "words" are space-delimited. In Japanese, they aren't:

The quick brown fox   → 4 words
素早い茶色の狐が      → ? words

Japanese word segmentation needs a morphological analyzer (MeCab, Kuromoji) with a dictionary. That's way too heavy for a client-side tool. Instead, I fall back to treating each CJK character as an independent "token":

export function tokenize(text) {
  const tokens = [];
  let buffer = '';
  for (const c of text) {
    const cat = classifyChar(c);
    if (cat === 'hiragana' || cat === 'katakana' || cat === 'kanji') {
      if (buffer) { tokens.push(buffer); buffer = ''; }
      tokens.push(c);
    } else if (cat === 'space') {
      if (buffer) { tokens.push(buffer); buffer = ''; }
    } else {
      buffer += c;
    }
  }
  if (buffer) tokens.push(buffer);
  return tokens;
}

This isn't "correct" Japanese tokenization, but it produces reasonable frequency counts for character-level analysis. The UI makes it clear that for mixed Japanese text, "word count" really means character-level tokens.
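Character-level tokens are still enough for a top-10 frequency table. A hypothetical sketch (the name topTokens is mine; it just takes any array of token strings, such as the output of tokenize above):

```javascript
// Hypothetical helper: count token frequencies and return the
// top N as [token, count] pairs, most frequent first.
function topTokens(tokens, n = 10) {
  const freq = new Map();
  for (const t of tokens) {
    freq.set(t, (freq.get(t) || 0) + 1);
  }
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1]) // stable sort keeps first-seen order on ties
    .slice(0, n);
}

console.log(topTokens(["猫", "が", "猫", "を", "猫"], 2));
// → [["猫", 3], ["が", 1]]
```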

Social media limits

Every major platform has a character limit, and losing track of them while drafting a post is frustrating. A progress bar per platform:

const LIMITS = {
  twitter: 280,
  sms: 160,
  instagramBio: 150,
  linkedinHeadline: 220,
  youtubeTitle: 100,
};

export function getLimits(text) {
  const len = [...text].length; // code point count, not UTF-16 units
  const result = {};
  for (const [key, max] of Object.entries(LIMITS)) {
    result[key] = { current: len, max, ok: len <= max, percent: (len / max) * 100 };
  }
  return result;
}

[...text].length counts Unicode code points, not UTF-16 code units. For emoji and other supplementary plane characters this matters: "😀".length === 2 in JavaScript, but it's a single code point. (Twitter's real algorithm is weighted — most emoji and CJK characters count as two toward the 280 limit — so the plain code point count is an approximation.)
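A quick console check of the two ways of counting:

```javascript
// UTF-16 code units vs Unicode code points in JavaScript.
const s = "😀ok";             // U+1F600 lives outside the BMP
console.log(s.length);        // 4 — the emoji is a surrogate pair (2 units)
console.log([...s].length);   // 3 — string iteration walks code points
```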

Reading time math

The conventional figures are 200 words per minute for comfortable reading, 400 for skimming, and 150 for speaking aloud:

export function readingTime(wordCount, wpm = 200) {
  return Math.ceil((wordCount / wpm) * 60); // seconds
}

For Japanese where "words" are character tokens, the number isn't perfectly accurate, but it's within a factor of 2 — enough for a "this post will take ~3 minutes to read" estimate.
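To make the math concrete, here's readingTime restated (so the snippet runs standalone) with a 600-token text at the three rates:

```javascript
// Restated from above: seconds needed at a given words-per-minute rate.
function readingTime(wordCount, wpm = 200) {
  return Math.ceil((wordCount / wpm) * 60); // seconds
}

console.log(readingTime(600));      // 180 s → "~3 min read"
console.log(readingTime(600, 400)); // 90 s at skim speed
console.log(readingTime(600, 150)); // 240 s spoken aloud
```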

Series

This is entry #76 in my 100+ public portfolio series.
