A searchable corpus of CJK and Unicode bugs in open-source libraries

#japan #i18n #webdev #unicode

A Japanese user types into your search box. They write とうきょう, press Space to convert it to 東京, then press Enter to confirm the candidate. The search fires. The query that went through was the half-finished one, before the conversion committed.

This is the most common internationalization bug I run into, and it is almost always one line to fix. The Enter that confirms an IME conversion is the same Enter your keydown handler is listening for. The guard is to skip the handler while a composition is still active: check event.isComposing, with keyCode === 229 as the legacy fallback. In React you have to read it off event.nativeEvent.isComposing, because the synthetic event does not expose the field.
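A minimal sketch of that guard, written against plain event-like objects so it runs outside a browser. The names isImeKeydown and onSearchKeydown are hypothetical, picked for illustration, and treating keyCode 229 as a fallback is an assumption about which browsers you need to cover.

```javascript
// Hypothetical helper: true when this keydown belongs to an IME composition.
// Accepts a native KeyboardEvent or a React synthetic event (via nativeEvent).
function isImeKeydown(event) {
  const native = event.nativeEvent ?? event;
  return native.isComposing === true || native.keyCode === 229;
}

// Hypothetical search-box handler: runs the search only on a "real" Enter.
// Returns true when the search fired, false when the key was ignored.
function onSearchKeydown(event, runSearch) {
  if (event.key !== 'Enter') return false;
  if (isImeKeydown(event)) return false; // Enter is confirming a candidate
  runSearch();
  return true;
}
```

Keeping the check in one small predicate means the same function can sit behind a native listener or a React onKeyDown prop without changes.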

I kept hitting variations of this across different libraries, so I started writing them down. That list is now a small public reference.

CJK / Unicode Failure Corpus

https://greymoth-jp.github.io/cjk-failure-corpus/

It is a searchable list of real CJK, IME, and Unicode text-handling bugs in open-source libraries. For each entry there is a one-line symptom, a minimal repro, the library it hits, and the fix. Right now it has 89 entries across 84 libraries. Fifteen of the fixes have merged; the rest are open or were closed.

The point is to have something to reach for when one of these bites you. Search the library or the symptom, get the repro and the one-line fix that already worked somewhere else. Most of these are the same handful of mistakes, made over and over, in code that works fine in English.

A few entries are not my PRs. They are cited upstream issues from the wider ecosystem that document the same failure, marked cited and linked to the original report. Everything else is a PR I opened, with the title, repo, URL, and merge status pulled from the GitHub API rather than written from memory. The build refuses to publish an entry that does not point at a real PR or issue, so the page cannot claim a fix it cannot link to.

Three entries, to show the shape

The IME Enter, in naive-ui (Vue, merged). In n-dynamic-tags, pressing Enter to confirm a kana-to-kanji conversion creates a tag from the in-progress text instead of just finishing the conversion. Repro: render <n-dynamic-tags>, focus the input, type とうきょう with a Japanese IME, Space to get 東京, then Enter to pick the candidate. A tag gets added from the unconfirmed text. Fix: skip tag creation while e.isComposing is true, and only act on the Enter that fires after compositionend. This exact category shows up across React, Vue, Svelte, and Angular, so the corpus tracks it as one pattern with per-framework notes (React needs nativeEvent.isComposing; Svelte exposes the native event directly; Safari and Chromium even disagree on whether the commit keydown reports isComposing or keyCode 229).
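To make the repro concrete, here is a small model of a tag input with the guard in place. This is a sketch under stated assumptions, not naive-ui's actual code: createTagInput is a hypothetical name, and the two event objects are hand-built stand-ins for what a browser would send.

```javascript
// Hypothetical tag-input model; NOT naive-ui's implementation.
function createTagInput() {
  const tags = [];
  return {
    tags,
    // `draft` is the text currently shown in the input.
    onKeydown(event, draft) {
      if (event.key !== 'Enter') return;
      // Skip the Enter that only confirms an IME candidate.
      // keyCode 229 is included as an assumed cross-browser fallback.
      if (event.isComposing || event.keyCode === 229) return;
      const text = draft.trim();
      if (text) tags.push(text);
    },
  };
}

const input = createTagInput();
// Enter pressed to pick the candidate: composition still active.
input.onKeydown({ key: 'Enter', isComposing: true, keyCode: 229 }, '東京');
// Enter pressed again after compositionend: this one should create the tag.
input.onKeydown({ key: 'Enter', isComposing: false, keyCode: 13 }, '東京');
console.log(input.tags); // [ '東京' ]
```

Without the guard line, the first call would already have pushed a tag, which is the symptom described above.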

A dropped apostrophe, in hepburn (kana to romaji). Katakana ン before a vowel or a Y gets romanized without the syllabic-n apostrophe, unlike hiragana ん. So シンヨウ comes out as SHINYOU when it should be SHIN'YOU, which makes it indistinguishable from シニョウ, whose correct romanization is also SHINYOU.

const { fromKana } = require('hepburn')
fromKana('しんよう') // SHIN'YOU
fromKana('シンヨウ') // SHINYOU  <- apostrophe dropped

Round-trip is the oracle here: kana to romaji and back should be stable, and the hiragana sibling already did it right. The fix is to map katakana ン the same way.
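The rule itself is small enough to sketch on its own. The snippet below is not hepburn's implementation; apostrophePositions is a hypothetical helper that only reports where a syllabic n is followed by a vowel or Y kana, treating ん and ン identically.

```javascript
// Hypothetical helper, not part of the hepburn package.
const SYLLABIC_N = new Set(['ん', 'ン']);
// Full-size vowel and Y kana, in both hiragana and katakana.
const VOWEL_OR_Y = new Set([...'あいうえおやゆよアイウエオヤユヨ']);

// Returns the indexes of syllabic-n characters that need an apostrophe
// after them when romanized.
function apostrophePositions(kana) {
  const chars = [...kana];
  const positions = [];
  chars.forEach((ch, i) => {
    if (SYLLABIC_N.has(ch) && VOWEL_OR_Y.has(chars[i + 1])) positions.push(i);
  });
  return positions;
}

console.log(apostrophePositions('しんよう')); // [ 1 ]
console.log(apostrophePositions('シンヨウ')); // [ 1 ]
console.log(apostrophePositions('シニョウ')); // []
```

The small ョ in シニョウ is deliberately absent from the set: it is part of the ニョ syllable, and there is no ン in that word to begin with.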

A locale that cannot parse its own output, in date-fns. This one is not even CJK, which is exactly why it is in the list. In the Galician (gl) locale, June formats as xuño, but the June parse pattern is /^xun/i. That matches the abbreviation xun and not the wide form, because the third character is ñ, not n. So format then parse fails, for June only:

const { format, parse } = require('date-fns');
const { gl } = require('date-fns/locale');
const s = format(new Date(2021, 5, 1), 'MMMM', { locale: gl }); // 'xuño'
parse(s, 'MMMM', new Date(), { locale: gl });                   // Invalid Date

The locale's own test snapshot already records Invalid Date for June while the other eleven months parse fine. Fix: widen the pattern to /^xu[nñ]/i, the way Catalan already folds diacritics into its patterns. It belongs next to the CJK entries because it is the same class of bug: a text round trip that nobody tested with characters outside ASCII.
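The two patterns can be compared directly, without date-fns installed. This sketch assumes the ñ in the formatted month is the precomposed character U+00F1, which is why it is written as an escape; the variable names are illustrative.

```javascript
const current = /^xun/i;      // the June pattern as described above
const widened = /^xu[nñ]/i;   // the proposed fix

const abbreviated = 'xun';
const wide = 'xu\u00F1o';     // 'xuño', with a precomposed ñ (assumption)

for (const name of [abbreviated, wide]) {
  console.log(name, current.test(name), widened.test(name));
}
// xun true true
// xuño false true
```

Only the wide form fails under the current pattern, which matches the symptom: formatting works, and parsing the formatted output does not.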

What it is not

It is not a linter and not a guarantee. It tells you that a specific bug existed and how it was fixed. Whether your code has the same one is still something you have to check. The detection is mechanical, the judgment is yours.

And not every PR landed. A few were closed, because the maintainer fixed the bug another way or did not want the change. Those stay in the list, marked closed, because a closed PR is still a documented failure with a repro attached.

If you maintain a library that takes text input and you want to know whether it has one of these, the fastest path is to search the corpus for your stack and skim the IME and locale-data sections first. That is where most of the bodies are buried.

There is also a companion repo that turns the repros into CI fixtures, so the regressions can be caught automatically instead of rediscovered: https://github.com/greymoth-jp/cjk-agent-fixtures

Corpus: https://greymoth-jp.github.io/cjk-failure-corpus/