A Text Statistics Tool That Classifies Hiragana, Katakana, and Kanji Separately
Word count tools work fine for English. For Japanese they're mostly useless — there's no space between words, so "word count" doesn't exist in the normal sense. But character count broken down by script (hiragana / katakana / kanji / ASCII) is useful: it tells you the vocabulary mix, the reading difficulty, and the writing style.
Paste a paragraph. Get character counts, word counts (where meaningful), reading time, speaking time, character frequency distribution, social media limit checks, and a Japanese script breakdown. All client-side, all live, zero dependencies.
🔗 Live demo: https://sen.ltd/portfolio/text-count/
📦 GitHub: https://github.com/sen-ltd/text-count
Features:
- Characters (with/without spaces), words, sentences, paragraphs, lines
- Reading time (200 wpm / 400 wpm)
- Speaking time (150 wpm)
- Japanese script breakdown (hiragana / katakana / kanji / ASCII / digit / other)
- Top 10 words by frequency
- Unique word count, average word/sentence length
- Character frequency chart (top 15)
- Platform limits (Twitter 280, SMS 160, Instagram bio 150, etc.)
- Japanese / English UI
- Zero dependencies, 73 tests
Character classification by Unicode range
export function classifyChar(c) {
  // codePointAt reads a full code point, so supplementary-plane characters are not split
  const code = c.codePointAt(0);
  if (code >= 0x3041 && code <= 0x309F) return 'hiragana'; // ぁ-ゟ
  if (code >= 0x30A0 && code <= 0x30FF) return 'katakana'; // ゠-ヿ
  if (code >= 0x4E00 && code <= 0x9FFF) return 'kanji';    // CJK Unified Ideographs
  // Whitespace must be tested before the ASCII range, which includes U+0020 (space)
  if (/\s/.test(c)) return 'space';
  if (code >= 0x30 && code <= 0x39) return 'digit';
  if (code >= 0x20 && code <= 0x7E) return 'ascii';
  return 'other';
}
The Unicode ranges:
- Hiragana: U+3041-U+309F (95 codepoints; the full block, U+3040-U+309F, has 96)
- Katakana: U+30A0-U+30FF (96 codepoints)
- Kanji (CJK Unified Ideographs): U+4E00-U+9FFF (20,992 codepoints)
Everything else is categorized as digit, ASCII, whitespace, or other. This gives you a breakdown like "200 chars total: 120 hiragana, 40 kanji, 30 katakana, 10 ASCII" — which describes Japanese prose more usefully than "200 characters".
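As a rough illustration of how the per-script totals could be produced, the sketch below tallies categories over code points. `scriptBreakdown` and the inlined `classify` helper are hypothetical names for this example, not necessarily what the repository uses; `classify` covers only the three Japanese ranges listed above and lumps everything else into 'other'.

```javascript
// Hypothetical helper: a cut-down classifier for the three Japanese ranges.
function classify(ch) {
  const code = ch.codePointAt(0);
  if (code >= 0x3041 && code <= 0x309F) return 'hiragana';
  if (code >= 0x30A0 && code <= 0x30FF) return 'katakana';
  if (code >= 0x4E00 && code <= 0x9FFF) return 'kanji';
  return 'other';
}

// Hypothetical name: counts characters per script category.
function scriptBreakdown(text) {
  const counts = { hiragana: 0, katakana: 0, kanji: 0, other: 0 };
  // for...of iterates by code point, so surrogate pairs are counted once
  for (const ch of text) counts[classify(ch)] += 1;
  return counts;
}

const sample = scriptBreakdown('東京でコーヒーを飲む');
// sample → { hiragana: 3, katakana: 4, kanji: 3, other: 0 }
```

Note that the long vowel mark ー (U+30FC) falls inside the katakana range, which is why コーヒー contributes four katakana characters.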
The word-count problem
In English, "words" are space-delimited. In Japanese, they aren't:
The quick brown fox → 4 words
素早い茶色の狐が → ? words
Japanese word segmentation needs a morphological analyzer (MeCab, Kuromoji) with a dictionary. That's way too heavy for a client-side tool. Instead, I fall back to treating each hiragana, katakana, or kanji character as an independent "token":
export function tokenize(text) {
  const tokens = [];
  let buffer = '';
  for (const c of text) {
    const cat = classifyChar(c);
    if (cat === 'hiragana' || cat === 'katakana' || cat === 'kanji') {
      if (buffer) { tokens.push(buffer); buffer = ''; }
      tokens.push(c);
    } else if (cat === 'space') {
      if (buffer) { tokens.push(buffer); buffer = ''; }
    } else {
      buffer += c;
    }
  }
  if (buffer) tokens.push(buffer);
  return tokens;
}
This isn't "correct" Japanese tokenization, but it produces reasonable frequency counts for character-level analysis. The UI is clear that for mixed Japanese text, "word count" approximates character-based tokens.
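To show how a token list like this might feed the "top words by frequency" feature, here is a minimal sketch. `topTokens` is a hypothetical name, and the lowercase folding is an assumption of this example rather than something the source confirms.

```javascript
// Hypothetical helper: returns the n most frequent tokens as [token, count] pairs.
function topTokens(tokens, n = 10) {
  const counts = new Map();
  for (const t of tokens) {
    const key = t.toLowerCase(); // assumption: fold case so "The" and "the" share a count
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Sort by count, descending. Array.prototype.sort is stable (ES2019),
  // so tokens with equal counts keep their first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}

const top = topTokens(['の', '猫', 'の', '目', 'the', 'The'], 2);
// top → [['の', 2], ['the', 2]]
```

With character-level tokens, the top entries for Japanese text will usually be particles such as の or は, which is expected given the fallback described above.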
Social media limits
Every major platform has a character limit, and losing track of them while drafting a post is frustrating. The tool shows a progress bar per platform:
const LIMITS = {
  twitter: 280,
  sms: 160,
  instagramBio: 150,
  linkedinHeadline: 220,
  youtubeTitle: 100,
};
export function getLimits(text) {
  const len = [...text].length; // code point count, not UTF-16 units
  const result = {};
  for (const [key, max] of Object.entries(LIMITS)) {
    result[key] = { current: len, max, ok: len <= max, percent: (len / max) * 100 };
  }
  return result;
}
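[...text].length counts Unicode code points, not UTF-16 code units. For emoji and other supplementary-plane characters this matters: "😀".length === 2, but a reader sees one character. Treat the result as an approximation, though, because platforms apply their own counting rules. X/Twitter, for example, uses a weighted count in which emoji and CJK characters each count as 2 toward the 280 limit.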
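The difference between these measures is easy to see side by side. The sketch below is illustrative only: `lengths` is a hypothetical helper, and the grapheme count assumes a runtime that provides `Intl.Segmenter` (recent browsers and Node.js 16+). It goes one step beyond what the tool itself is described as doing.

```javascript
// Hypothetical helper: three ways to measure the "length" of a string.
function lengths(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf16Units: text.length,                      // what String.prototype.length reports
    codePoints: [...text].length,                 // what getLimits uses
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

const smiley = lengths('\u{1F600}'); // 😀
// smiley → { utf16Units: 2, codePoints: 1, graphemes: 1 }

// Family emoji: three emoji joined by two zero-width joiners (U+200D)
const family = lengths('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}');
// family → { utf16Units: 8, codePoints: 5, graphemes: 1 }
```

Code point counting handles single emoji well, but joined sequences still count as several characters, so the progress bars can read slightly high for emoji-heavy text.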
Reading time math
The tool uses conventional rates: 200 words per minute for comfortable reading, 400 for skimming, and 150 for speaking aloud:
export function readingTime(wordCount, wpm = 200) {
  return Math.ceil((wordCount / wpm) * 60); // seconds
}
For Japanese where "words" are character tokens, the number isn't perfectly accurate, but it's within a factor of 2 — enough for a "this post will take ~3 minutes to read" estimate.
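As a worked example, here is how a 600-token text comes out at the three rates. The formula is repeated so the snippet runs on its own; `readingSeconds` and `formatDuration` are hypothetical names, and the `m:ss` display format is an assumption of this sketch.

```javascript
// Same formula as readingTime, repeated so this snippet is self-contained.
function readingSeconds(wordCount, wpm = 200) {
  return Math.ceil((wordCount / wpm) * 60);
}

// Hypothetical formatter: turns seconds into "m:ss".
function formatDuration(totalSeconds) {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${String(seconds).padStart(2, '0')}`;
}

const reading = formatDuration(readingSeconds(600));       // "3:00" at 200 wpm
const skimming = formatDuration(readingSeconds(600, 400)); // "1:30" at 400 wpm
const speaking = formatDuration(readingSeconds(600, 150)); // "4:00" at 150 wpm
```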
Series
This is entry #76 in my 100+ public portfolio series.
- 📦 Repo: https://github.com/sen-ltd/text-count
- 🌐 Live: https://sen.ltd/portfolio/text-count/
- 🏢 Company: https://sen.ltd/
