DEV Community

greymoth


I cataloged 93 CJK and Unicode bugs in open source. Most are the same five mistakes.

I keep a Japanese keyboard on while reading other people's code. Not for any noble reason at first, it's just my keyboard. But after a while you start seeing the same small breakages over and over, in libraries that are otherwise excellent and work perfectly in English. So I started writing them down. The list is now 93 entries across 87 libraries, and it's public:

https://greymoth-jp.github.io/cjk-failure-corpus

It's built like caniuse, except instead of "does this browser support X" it's "here is a real text-handling bug, the library it's in, a minimal repro, and the fix." Every row links to an actual pull request or issue. I'll get to why that matters at the end.

The thing I didn't expect: 93 bugs, but they're not 93 different problems. They cluster into about five.

One bug is more than a third of the list

36 of the 93 are the same bug. When you type Japanese, Chinese, or Korean, you don't type final characters directly. In Japanese, for example, you type romaji, the IME shows you a preedit, and you press Enter to confirm the conversion into kanji. That confirming Enter is the same physical Enter your form is listening for.

So a user is mid-word, hits Enter to pick the right kanji, and the handler fires onSearch or commitName or handleSave on text that isn't finished. No error, no stack trace, CI green. It only reproduces with an IME on, which most maintainers don't have, so it lives forever.

The fix is one property. While a composition is active, isComposing is true:

// before
if (e.key === 'Enter') commit();

// after
if (e.key === 'Enter' && !e.nativeEvent.isComposing) commit();

The interesting part isn't the fix, it's where it's missing. Codebases usually already know about this. They just stopped one input short. In LibreChat the main message textarea was guarded and there was even a comment explaining it; the prompt-name field, the labels form, and the tag input next to it weren't. Trilium already had an isIMEComposing helper used by the note editor; the board view's card and column editors just never imported it. Same repo, same knowledge, one screen over.

So it's not "teams don't know about IME." The guard lives on the input everyone tests, and the secondary inputs are the ones nobody types Japanese into during review. Search box, inline rename, tag input, modal. Four shapes, over and over.

(One fiddly note if you go fix your own: in React you reach through to e.nativeEvent.isComposing rather than trusting the synthetic event, and || e.keyCode === 229 is a legacy fallback for code paths that report 229 instead of setting the flag. There's a genuinely annoying edge right when composition ends: depending on the browser, isComposing can already read false on the confirming Enter. I haven't found one rule that holds everywhere; checking both is what's survived for me.)
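Here is a minimal sketch of the "check both" guard from the note above. The helper names isImeKeyEvent and shouldCommit are hypothetical, not taken from any of the libraries mentioned, and the sketch assumes the handler receives either a React synthetic event (with nativeEvent) or a plain DOM keydown event.

```javascript
// Hypothetical helper: true when this keydown looks like part of an IME composition.
function isImeKeyEvent(e) {
  // React wraps the DOM event; plain DOM listeners receive it directly.
  const native = e.nativeEvent ?? e;
  // isComposing is the standard flag; keyCode 229 is the legacy signal some
  // code paths report instead of setting the flag.
  return native.isComposing === true || native.keyCode === 229;
}

// Hypothetical wrapper: commit only on an Enter that is not confirming a conversion.
function shouldCommit(e) {
  return e.key === 'Enter' && !isImeKeyEvent(e);
}
```

A handler would then read if (shouldCommit(e)) commit(); on every input, not only the main one.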

The other four

After IME, the list thins out into four more shapes.

Locale leftovers (24). A key exists in en and never made it to ja, so a string silently falls back to English. select2 had removeItem and search in every locale except ja.js; screen readers read those aloud, so a Japanese user heard English. Or it's a formatter that writes a date but can't read its own output back, because the diacritic or the era character got dropped. A 和暦 library I looked at produced 令和元年5月1日 and then refused to parse it, because the year matcher was [0-9]{1,2} and 元 (as in 元年, gannen, "first year") isn't a digit.
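To make the 元年 case concrete, here is a small sketch of a year matcher that accepts 元 alongside digits. It is not the code from the library in question; the pattern, the three-era table, and the parseWareki name are all assumptions for illustration, and it does no calendar validation.

```javascript
// Accept either 1-2 ASCII digits or the character 元 for the era year.
const WAREKI = /^(令和|平成|昭和)(元|[0-9]{1,2})年([0-9]{1,2})月([0-9]{1,2})日$/;

// Hypothetical era table: the Gregorian year that contains each era's year 1.
const ERA_START = { '令和': 2019, '平成': 1989, '昭和': 1926 };

// Hypothetical parser: returns { year, month, day } or null if the text does not match.
function parseWareki(text) {
  const m = WAREKI.exec(text);
  if (!m) return null;
  const [, era, yearText, month, day] = m;
  // 元 means "year one" of the era.
  const eraYear = yearText === '元' ? 1 : Number(yearText);
  return {
    year: ERA_START[era] + eraYear - 1,
    month: Number(month),
    day: Number(day),
  };
}
```

The useful regression test is the round trip: whatever the formatter emits for the first year of an era should go back through the parser unchanged.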

Surrogate and grapheme (11). Code that walks text by code unit instead of grapheme cluster. Surrogate pairs split down the middle, ZWJ emoji get mis-counted, combining marks drift off their base, variation selectors get dropped. Anything that does str[i] or .length on user text is a candidate.
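A short sketch of the difference, assuming a runtime that ships Intl.Segmenter (current Node and browsers do). The function names graphemes and truncateGraphemes are made up for this example.

```javascript
// Split text into user-perceived characters (grapheme clusters).
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Truncate by grapheme cluster, so a surrogate pair or ZWJ sequence is never cut in half.
function truncateGraphemes(text, max) {
  return graphemes(text).slice(0, max).join('');
}

// U+20BB7 is a kanji outside the BMP: one character, two UTF-16 code units.
const kanji = '\u{20BB7}';
console.log(kanji.length, graphemes(kanji).length);

// A family emoji built from three emoji joined by ZWJ (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(family.length, graphemes(family).length);
```

Slicing by index, as in text.slice(0, 1), would return half of a surrogate pair for the first string; truncateGraphemes(text, 1) returns the whole character.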

Kana and romaji (8). Transliteration tables that drop or reverse a kana. The clean test is a round-trip: convert and convert back, you should land where you started. One library could decompose ヷ and ヺ but passed ヸ and ヹ straight through, the other half of the same wa-row family.
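The round-trip test can be sketched in a few lines. The tables below are toy data with one deliberately swapped pair, not any real library's tables, and toRomaji, toKana, and roundTripFailures are hypothetical names that handle a single kana at a time.

```javascript
// Toy forward table (kana -> romaji). Hypothetical data for illustration only.
const TO_ROMAJI = new Map([
  ['し', 'si'],
  ['じ', 'zi'],
  ['ち', 'ti'],
  ['ぢ', 'di'],
]);

// Toy reverse table with the zi/di entries deliberately swapped.
const TO_KANA = new Map([
  ['si', 'し'],
  ['zi', 'ぢ'],
  ['ti', 'ち'],
  ['di', 'じ'],
]);

// Unknown input passes through unchanged, as many real converters do.
const toRomaji = (kana) => TO_ROMAJI.get(kana) ?? kana;
const toKana = (romaji) => TO_KANA.get(romaji) ?? romaji;

// Return every input that does not survive forward-then-backward conversion.
function roundTripFailures(inputs, forward, backward) {
  return inputs.filter((input) => backward(forward(input)) !== input);
}

console.log(roundTripFailures([...TO_ROMAJI.keys()], toRomaji, toKana));
```

One caveat: a character that both directions pass through untouched survives a round trip, so for a gap like that it may be worth also asserting that the forward output actually differs from the input for every character the table is supposed to cover.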

Width and normalization (5). A CJK character renders two cells wide in a monospace terminal, but .length says one. Table formatters and truncation that count characters instead of display width overflow the box every time the text is Japanese.
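A rough sketch of counting display cells instead of characters. The ranges in isWide are a deliberately small approximation covering kana, common kanji, Hangul syllables, and fullwidth forms; they are not the full Unicode East Asian Width data, and a real formatter would use a maintained width table. All three function names are hypothetical.

```javascript
// Approximate check for characters that occupy two terminal cells.
// NOT the complete East Asian Width table; a few common ranges only.
function isWide(codePoint) {
  return (
    (codePoint >= 0x3000 && codePoint <= 0x30ff) || // CJK punctuation, hiragana, katakana
    (codePoint >= 0x4e00 && codePoint <= 0x9fff) || // CJK Unified Ideographs
    (codePoint >= 0xac00 && codePoint <= 0xd7a3) || // Hangul syllables
    (codePoint >= 0xff01 && codePoint <= 0xff60) // fullwidth ASCII variants
  );
}

// Sum cell widths by code point (for...of iterates code points, not code units).
function displayWidth(text) {
  let width = 0;
  for (const ch of text) {
    width += isWide(ch.codePointAt(0)) ? 2 : 1;
  }
  return width;
}

// Pad with spaces up to a target number of cells rather than characters.
function padToWidth(text, cells) {
  return text + ' '.repeat(Math.max(0, cells - displayWidth(text)));
}
```

With this, '日本語'.length is 3 while displayWidth('日本語') is 6, which is the gap that makes a table column overflow.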

That's 84 of the 93 in five buckets. The long tail is numerals (kanji numbers, including the 大字 forms used in contracts), regex round-trips, and a byte-order mark one code path strips and its sibling leaves glued to the first field name.

Why every row links to a PR

The honest part. Most of these entries are pull requests I sent. I only mark one "merged" when the GitHub API says merged, not when I push it and not while it's in review. As I write this, 15 of the 93 have merged; the rest are open. A few entries aren't mine at all, they're cited from other people's bug reports that document the same failure, and those are marked cited and link to the original report.

I built it this way on purpose. The site is one Node script over a JSON file, and the build fails loudly if an entry doesn't point at a real PR or issue. So the page physically can't claim a fix it can't link to. That constraint is the whole value of it as a reference: you don't have to trust me, you click through.
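The idea of a build that refuses to run without a link can be sketched like this. This is not the site's actual script: the entry shape ({ id, url }), the GitHub-only URL pattern, and the function names are assumptions made for the example.

```javascript
// Assumed link shape: a GitHub pull request or issue URL.
const LINK = /^https:\/\/github\.com\/[^/]+\/[^/]+\/(pull|issues)\/\d+$/;

// Collect a message for every entry whose url is missing or malformed.
function validateEntries(entries) {
  const problems = [];
  for (const entry of entries) {
    if (typeof entry.url !== 'string' || !LINK.test(entry.url)) {
      problems.push(`${entry.id}: missing or malformed PR/issue link`);
    }
  }
  return problems;
}

// Fail loudly: throwing here stops the build before any page is written.
function assertValid(entries) {
  const problems = validateEntries(entries);
  if (problems.length > 0) {
    throw new Error(`Refusing to build:\n${problems.join('\n')}`);
  }
}
```

Running assertValid as the first step of the build means a row without a clickable source never reaches the page.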

What to do with it

If you maintain something with a text input, the ten-minute version is: switch your keyboard to a Japanese IME, then type into every input that does something on Enter and watch what fires before you've confirmed the word. The main composer is probably fine. Try the search box. Try the inline rename. Try the chip input buried in a settings panel.

If you'd rather grep: find every key === 'Enter' and count how many have a composition guard. The main one will. Count the rest.
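If you want something slightly more structured than grep, here is a line-based heuristic as a sketch. The name findUnguardedEnterChecks is hypothetical, and it only looks at one line at a time, so a guard written on a neighbouring line will show up as a false positive. Treat the output as a list to review, not a verdict.

```javascript
// Report lines that compare key to 'Enter' without a composition guard on the same line.
function findUnguardedEnterChecks(source) {
  const hits = [];
  source.split('\n').forEach((line, index) => {
    // Matches key == 'Enter' and key === "Enter" with either quote style.
    const checksEnter = /key\s*===?\s*['"]Enter['"]/.test(line);
    // Either the isComposing flag or the legacy keyCode 229 check counts as a guard.
    const hasGuard = /isComposing|keyCode\s*===?\s*229/.test(line);
    if (checksEnter && !hasGuard) {
      hits.push({ line: index + 1, text: line.trim() });
    }
  });
  return hits;
}
```

Feed it the contents of each source file (for example via fs.readFileSync) and compare the number of hits with the number of Enter checks overall.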

And if you hit a text-handling bug that isn't in the list, tell me and I'll add it. That's sort of the point of keeping a list instead of re-finding the same thing every month.

https://greymoth-jp.github.io/cjk-failure-corpus
