SEN LLC

Auto-Furigana in the Browser — Lazy-Loading kuromoji.js's 4 MB Dictionary from a CDN to Annotate Japanese Kanji With Their Readings

Furigana are the small hiragana annotations that sit above kanji to show how they should be read. Schoolbooks, children's manga, and language-learning material all rely on them. The problem: typing them by hand is slow, and most online services that auto-annotate require sending your text to a server you don't control. This is a tool that does it entirely in the browser, by lazy-loading the 4 MB kuromoji.js dictionary from a CDN — about 350 lines including tests.

kanji-yomi UI: dark theme. The textarea contains the opening line of Natsume Sōseki's I Am a Cat; each kanji has its hiragana reading rendered above it in a teal accent colour ("吾輩" → "わがはい", "猫" → "ねこ", "名前" → "なまえ", etc.). A status bar at the bottom shows "tokens 42 / annotated 12 / kanji chars 17 / coverage 100% / dict load 1.98s".

🌐 Demo: https://sen.ltd/portfolio/kanji-yomi/
📦 GitHub: https://github.com/sen-ltd/kanji-yomi

What it does

You paste Japanese text into a textarea. Each kanji-bearing word gets wrapped in <ruby>...<rt>reading</rt></ruby>, which browsers render with the hiragana reading floating above the kanji. No server is involved. The dictionary lives on jsDelivr's CDN and the tokenizer runs in the page.

The morphological analyzer is kuromoji.js, a JavaScript port of MeCab + the IPA dictionary. The dictionary is 12 .dat.gz files totalling about 4 MB; kuromoji.builder({ dicPath }).build(...) fetches them in sequence. First load is 1–3 seconds; subsequent loads come from the browser cache and are basically instant.
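
The page only needs to trigger that build once. The shape is a memoised promise; this is a sketch, and `makeLazyTokenizer` and the injected `buildTokenizer` are illustrative names, not the repo's:

```javascript
// Build the tokenizer at most once and share the in-flight promise, so a
// second caller during the 1-3 s download reuses it instead of starting
// another fetch. `buildTokenizer` is any async loader; on the real page
// it would wrap kuromoji.builder({ dicPath }).build in a Promise.
function makeLazyTokenizer(buildTokenizer) {
  let promise = null;
  return function getTokenizer() {
    if (promise === null) promise = buildTokenizer();
    return promise;
  };
}
```

The first keystroke awaits `getTokenizer()` and then tokenizes; every later call resolves from the already-cached promise.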

Separating the pure logic from kuromoji

kuromoji.js can't load under node --test — it expects an XHR-able dictionary path, and pulling 4 MB of dict files for unit tests is overkill anyway. The fix is to make sure all the logic that operates on tokens lives in a DOM-free module that can take synthetic tokens.

// yomi.js — DOM-free, kuromoji-free
// CJK Unified Ideographs (U+4E00–U+9FFF), Extension A (U+3400–U+4DBF),
// and Compatibility Ideographs (U+F900–U+FAFF).
const KANJI_RE = /[一-鿿㐀-䶿豈-﫿]/;

export function hasKanji(s) {
  return KANJI_RE.test(s);
}

// kuromoji emits readings in fullwidth katakana. Convert to hiragana
// for ruby display by shifting U+30A1..U+30F6 down by 0x60. Long-vowel
// mark ー (U+30FC) sits outside that range and is preserved as-is —
// so "コーヒー" becomes "こーひー", which is what readers expect.
export function katakanaToHiragana(s) {
  return s.replace(/[ァ-ヶ]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60),
  );
}

export function shouldRuby(token) {
  if (!token || !token.reading || token.reading === "*") return false;
  if (!hasKanji(token.surface_form)) return false;
  const hiraReading = katakanaToHiragana(token.reading);
  return hiraReading !== token.surface_form;
}

// Minimal HTML escaper; the surface text is user input and must not be
// able to break out of the generated markup.
function escapeHtml(s) {
  return s.replace(/[&<>"']/g, (ch) =>
    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" })[ch],
  );
}

export function renderToken(token) {
  if (token.surface_form === "\n") return "<br>";
  if (shouldRuby(token)) {
    const reading = katakanaToHiragana(token.reading);
    return `<ruby>${escapeHtml(token.surface_form)}<rt>${escapeHtml(reading)}</rt></ruby>`;
  }
  return escapeHtml(token.surface_form);
}

shouldRuby is the small predicate that decides whether a token is worth annotating. It returns false in three cases:

  1. Reading is * or missing. kuromoji's "I don't know this word" sentinel: proper nouns, Latin-script words, emoji.
  2. No kanji in the surface form. Annotating a hiragana word like "ねこ" with the reading "ねこ" is just visual noise.
  3. Reading equals surface form. Sometimes kuromoji falls back to setting the reading to the surface itself for OOV tokens. The annotation would be redundant; skip it.

This is the kind of small predicate that gets refactored later and silently breaks because nobody notices the boundary cases. The 19 unit tests pin all three.
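
Those three cases can be exercised directly with synthetic tokens, no kuromoji needed. The functions are reproduced from yomi.js (exports dropped) so the snippet runs standalone:

```javascript
// Reproduced from yomi.js for a self-contained demo.
const KANJI_RE = /[一-鿿㐀-䶿豈-﫿]/;
const hasKanji = (s) => KANJI_RE.test(s);
const katakanaToHiragana = (s) =>
  s.replace(/[ァ-ヶ]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) - 0x60));

function shouldRuby(token) {
  if (!token || !token.reading || token.reading === "*") return false;
  if (!hasKanji(token.surface_form)) return false;
  return katakanaToHiragana(token.reading) !== token.surface_form;
}

shouldRuby({ surface_form: "猫", reading: "ネコ" });    // true: kanji + real reading
shouldRuby({ surface_form: "ねこ", reading: "ネコ" });  // false: no kanji in surface
shouldRuby({ surface_form: "グーグル", reading: "*" }); // false: unknown-word sentinel
```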

The katakana-to-hiragana shift, with one footgun

The conversion is a single Unicode-codepoint shift: U+30A1–U+30F6 (fullwidth katakana) → U+3041–U+3096 (hiragana). Subtract 0x60 from each codepoint.

"カタカナ"   → "かたかな"    // each char shifted by -0x60
"トウキョウ" → "とうきょう"

The footgun is which characters you don't want to convert:

  • Long-vowel mark ー (U+30FC) is outside the range and stays. "コーヒー" becomes "こーひー", not "こうひい" — because users see the same long bar they'd write themselves.
  • ヴ (U+30F4) is in range and shifts to ゔ (U+3094) — small kana for /v/, occasionally used in loanwords.
  • ヵ (U+30F5), ヶ (U+30F6) are in range and shift to ゕ, ゖ — used in counters like 一ヶ月.

The class regex [ァ-ヶ] covers exactly the right range. [ァ-ヴ] would miss ヵ and ヶ; [ア-ン] would miss the small kana.
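
The boundary behaviour is easy to check directly; the function is the same one-liner as in yomi.js:

```javascript
// Same shift as yomi.js's katakanaToHiragana; boundary cases below.
const katakanaToHiragana = (s) =>
  s.replace(/[ァ-ヶ]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) - 0x60));

katakanaToHiragana("コーヒー");     // "こーひー": ー (U+30FC) passes through
katakanaToHiragana("ヴァイオリン"); // "ゔぁいおりん": ヴ shifts to ゔ
katakanaToHiragana("トウキョウ、"); // "とうきょう、": punctuation untouched
```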

Why I lazy-load the dictionary instead of bundling it

The dictionary is 4 MB. I could bundle it with the deploy and have everything come from the same origin, but the CDN approach won out for two reasons:

  • The deploy is small. I'm shipping ~100 portfolio entries from the same CloudFront distribution; each one being 4 MB heavier blows up sync time and bucket size for no reason.
  • The cache is shared. If you visit any other site that uses kuromoji@0.1.2 from jsDelivr, your browser already has the dictionary. I get to free-ride on whatever cache hit rate jsDelivr earned.

The trade-off is "if jsDelivr is down, the tool is broken." That's a small probability and the failure mode is loud — the page shows a red status message, not a silent hang:

kuromoji.builder({ dicPath: KUROMOJI_DICT_PATH }).build((err, t) => {
  if (err) {
    setStatus("error", "Dictionary load failed", ` — ${err.message ?? err}`);
    return;
  }
  tokenizer = t;
  buildDurationMs = performance.now() - buildStartedAt;
  setStatus("ready", "Dictionary loaded", ` (${(buildDurationMs / 1000).toFixed(2)}s)`);
  renderInput();
});

I show the load time in the status bar ((1.98s) etc.) so the cache effect is visible: reload the page and it drops to roughly nothing.

Coverage as a feedback signal

The status row at the bottom of the page reads tokens 42 / annotated 12 / kanji chars 17 / coverage 100% / dict load 1.98s. The fields:

  • tokens — count of segments kuromoji produced. Includes punctuation, particles, newlines.
  • annotated — number of tokens that received ruby (passed shouldRuby).
  • kanji chars — number of characters in the input that match the kanji regex.
  • coverage — kanji characters in tokens-with-readings divided by all kanji characters.
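
A sketch of how that number falls out of the token stream. The token shape follows kuromoji's, but `coverage` and `countKanjiChars` here are illustrative, not the repo's exact code:

```javascript
// Coverage = kanji chars inside tokens that have a usable reading,
// divided by all kanji chars in the input. Hypothetical helper names.
const KANJI_RE = /[一-鿿㐀-䶿豈-﫿]/g;
const countKanjiChars = (s) => (s.match(KANJI_RE) || []).length;

function coverage(tokens) {
  let all = 0;
  let covered = 0;
  for (const t of tokens) {
    const n = countKanjiChars(t.surface_form);
    all += n;
    if (t.reading && t.reading !== "*") covered += n;
  }
  return all === 0 ? 1 : covered / all;
}

// 名前 (2 kanji, known) + 無い (1 kanji, OOV) → 2 of 3 covered
coverage([
  { surface_form: "名前", reading: "ナマエ" },
  { surface_form: "は", reading: "ハ" },
  { surface_form: "まだ", reading: "マダ" },
  { surface_form: "無い", reading: "*" },
]); // ≈ 0.667
```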

When coverage is below 100%, it's almost always one of:

  • Proper nouns (person/place names) absent from the IPA dictionary. Modern surnames, recent place names, and brand names are common offenders.
  • OCR'd text with character recognition errors.
  • Coined words or weird stylistic forms.

The unannotated kanji render as plain text in the same flow, so a reader can spot exactly which words the dictionary missed and patch them by hand. For textbook-prep workflows, that's the right level of feedback — auto-annotation handles the bulk and the human cleans the tail.

Side-effects you can use

The token data kuromoji emits has more in it than just readings:

  • Part-of-speech tags: pos and pos_detail_1 give you "noun" / "verb" / "particle" / etc. Not used here, but a "highlight all the verbs in red" mode is a few lines away.
  • Base form: basic_form normalises conjugations, so "歩いた" (walked) reduces to "歩く" (to walk). Useful for search indexing.
  • Pronunciation vs reading: kuromoji emits both reading and pronunciation; the latter applies the actual sound shifts (e.g. mora geminations) used in TTS pipelines.
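
For instance, a hypothetical verb filter over the same token stream, with synthetic tokens; the IPA dictionary's pos values are Japanese strings such as 名詞 (noun), 動詞 (verb), 助詞 (particle):

```javascript
// Hypothetical "collect the verbs" pass over kuromoji-shaped tokens.
function verbSurfaces(tokens) {
  return tokens
    .filter((t) => t.pos === "動詞") // IPA tag for "verb"
    .map((t) => t.surface_form);
}

verbSurfaces([
  { surface_form: "猫", pos: "名詞" },
  { surface_form: "が", pos: "助詞" },
  { surface_form: "歩い", pos: "動詞", basic_form: "歩く" },
  { surface_form: "た", pos: "助動詞" },
]); // → ["歩い"]
```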

So this is less a "ruby annotator" and more a "browser-side morphological analyser with a ruby-rendering UI."

What I deliberately didn't do

  • Disambiguating heteronyms. "行った" can read as "いった" (went) or "おこなった" (carried out). kuromoji returns the dictionary's most-frequent reading without considering context. A real solution would offer the user a choice for ambiguous tokens — that needs a UI that's substantially more than this entry's scope.
  • Marking unannotated kanji. I should visually flag the kanji where coverage failed (e.g. a faint underline) so the user can find them quickly. Not done; the workaround is to read carefully.
  • Image OCR input. Combined with Tesseract.js this would give you "photo → annotated HTML" in one tool. That's a separate entry.

Tests

$ npm test
✔ hasKanji recognises CJK ideographs
✔ hasKanji catches Extension A and Compatibility blocks
✔ katakanaToHiragana shifts U+30A1–U+30F6 by 0x60
✔ katakanaToHiragana preserves chōonpu and punctuation
✔ shouldRuby keeps kanji tokens with a real reading
✔ shouldRuby skips hiragana-only tokens
✔ shouldRuby skips katakana-only tokens
✔ shouldRuby skips when reading is missing or unknown
✔ shouldRuby skips when reading equals surface (would be redundant)
✔ renderToken wraps a kanji token in <ruby><rt>...</rt></ruby>
✔ renderToken converts newline tokens to <br>
✔ renderToken HTML-escapes surface and reading
✔ renderTokens preserves order and concatenates results
✔ stripRuby removes <rt> annotations and <ruby> wrappers
✔ countAnnotations matches what renderTokens would mark
✔ countKanjiChars counts CJK ideographs only
ℹ tests 19  ℹ pass 19  ℹ fail 0

The tests don't load kuromoji. They cover only the responsibilities of yomi.js — the layer that has to keep working even if kuromoji starts emitting weirder tokens than it does today.

Try it

The demo at https://sen.ltd/portfolio/kanji-yomi/ ships with four sample buttons (the openings of I Am a Cat; Run, Melos!; Snow Country; and Night on the Galactic Railroad) so you can click and see the annotator working without typing. Or paste your own text — first load is a couple of seconds while the dictionary streams in; after that everything is instant.

Source: https://github.com/sen-ltd/kanji-yomi — MIT, ~350 lines total, 19 unit tests, no build step. Open yomi.js first; it's the part that's unit-tested and the part you'd reuse.


🛠 Built by SEN LLC as part of an ongoing series of small, focused developer tools. Browse the full portfolio for more.
