greymoth

Posted on Jul 2

Three ways CJK text breaks big open-source projects, over and over

#i18n #opensource #javascript #webdev

I keep a small corpus of Japanese/CJK bugs I've found in open-source projects while sending fixes upstream. At some point I stopped looking at them as individual bugs and started looking at them as a small set of repeating shapes. Three of them show up constantly, in codebases with nothing else in common: a federated social network, a CRM, a component library, a commerce platform, a local-AI desktop app, a data grid, a design system, a headless CMS. Different stacks, same failure.

None of these are exotic. Each one is a real merged fix, and each one is boring enough that it passed code review and CI without anyone noticing, sometimes for years. That's the actual finding: these bugs aren't hard to fix once you see them. They're hard to see, because the systems that would normally catch a regression (tests, linting, review) don't have Japanese input in them.

Pattern 1: IME composition treated as a keystroke

What it is. Typing Japanese, Chinese, or Korean doesn't produce final characters one key at a time. In Japanese, for example, you type romaji, an Input Method Editor (IME) shows a preedit string, and you press Enter to confirm the conversion into kanji. That confirming Enter is the same physical key most web apps bind to "submit."

If a keydown handler doesn't check composition state, the confirming Enter fires the handler mid-word: a chat message sends half-typed, a rename commits before the kanji conversion finished, a dropdown closes on the wrong item.

Why it's invisible. It only happens with an IME switched on. Most contributors and most CI runners never turn one on. The input works perfectly for every test that types plain ASCII, which is nearly all of them. No exception is thrown and nothing fails a snapshot test. The bug just silently eats or mangles the user's keystroke.

Real example. misskey-dev/misskey#17646, merged into a repo with over 11,000 stars: the chat composer's onKeydown checked ev.key === 'Enter' and sent the message, with no composition guard at all. Mid-conversion Enter sent a half-typed message. The fix is one line: if (ev.isComposing || ev.key === 'Process' || ev.keyCode === 229) return; before the send logic runs.

It's not a one-off oversight. twentyhq/twenty#22270, a CRM with over 52,000 stars, had the identical gap in two unrelated components at once: the attachment-rename input and the AI chat-thread rename input. Same missing guard, same fix, two files, same PR. And vuetifyjs/vuetify#22974, a component library with over 41,000 stars, already had a shared isComposingIgnoreKey helper elsewhere in the codebase for exactly this problem. VAutocomplete's keydown handler just never called it. The knowledge existed one file over. It didn't reach this one.
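For reference, here is a minimal sketch of the guard as a reusable helper. It mirrors the three checks quoted from the Misskey fix above; the names KeyLikeEvent, isImeComposing, and shouldSubmit are hypothetical and not taken from any of the projects mentioned, and the structural event type is an assumption made so the logic can run without a browser.

```typescript
// Minimal structural type: a DOM KeyboardEvent satisfies it, and so does a
// plain object in a unit test. Hypothetical name, not from any library.
interface KeyLikeEvent {
  key: string;
  isComposing?: boolean;
  keyCode?: number;
}

// True while an IME conversion is still in progress. The three checks are the
// ones quoted in the fix above: the isComposing flag, the "Process" key value,
// and the legacy keyCode 229.
function isImeComposing(ev: KeyLikeEvent): boolean {
  return ev.isComposing === true || ev.key === "Process" || ev.keyCode === 229;
}

// Hypothetical decision function for an Enter-to-send input.
function shouldSubmit(ev: KeyLikeEvent): boolean {
  if (isImeComposing(ev)) return false; // this Enter is confirming a conversion
  return ev.key === "Enter";
}
```

In a real handler the same check would sit at the top of onKeydown, before any send or commit logic. Keeping it in one shared function only helps if every Enter-reactive input actually calls it, which is exactly what went wrong in the Vuetify case.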

How to catch it. Switch your OS keyboard to a Japanese or Chinese IME. Type into every input that reacts to Enter or Escape, and watch what fires before you've confirmed the conversion. Or grep for key === 'Enter' across your codebase and check each hit for a composition guard. The primary composer usually has one. Count how many of the smaller inputs next to it don't.
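The grep step can be roughed out in code. This is a heuristic sketch, not a real linter: it assumes a guard, if present, sits on the same line as the Enter check or within a few lines above it, and it only recognizes guards that mention isComposing or compare keyCode to 229. The function name findUnguardedEnterChecks and the lookback parameter are my own inventions for illustration.

```typescript
interface UnguardedHit {
  line: number; // 1-based line number of the Enter check
  text: string; // trimmed source line
}

// Flags every `key === 'Enter'` check with no composition guard on the same
// line or in the `lookback` lines before it. Heuristic only: it can miss
// guards placed elsewhere and cannot tell whether a guard is actually reached.
function findUnguardedEnterChecks(source: string, lookback = 5): UnguardedHit[] {
  const enterCheck = /key\s*={2,3}\s*['"]Enter['"]/;
  const guard = /isComposing|keyCode\s*={2,3}\s*229/;
  const lines = source.split(/\r?\n/);
  const hits: UnguardedHit[] = [];
  lines.forEach((text, i) => {
    if (!enterCheck.test(text)) return;
    const context = lines.slice(Math.max(0, i - lookback), i + 1);
    if (!context.some((l) => guard.test(l))) {
      hits.push({ line: i + 1, text: text.trim() });
    }
  });
  return hits;
}
```

Treat the output as a list of places to test by hand with an IME on, not as a verdict. A helper whose name contains isComposing counts as a guard here, so a call to something like a shared composition helper would pass.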

Pattern 2: locale files silently fall behind

What it is. A product gets translated into Japanese once, then the English source keeps shipping new strings. Every string added to en.json after that point exists only in English until someone notices and backfills it. There's no build error, no lint rule, no CI check that a locale file has drifted, because a missing key isn't invalid JSON. It's just a hole.

Why it's invisible. The UI doesn't crash. With a typical configuration, i18next and most other i18n libraries fall back to the source-language string (usually English) or to the raw key. The product looks fully localized to anyone who isn't reading it in Japanese, including most of the team that shipped it.
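To make the fallback concrete, here is a toy resolver. It is not how i18next or any specific library is implemented; it assumes flat key-value dictionaries and a hypothetical translate function, and only shows why a missing key never surfaces as an error.

```typescript
type Dictionary = Record<string, string>;

// Hypothetical resolver: target locale first, then the fallback locale,
// then the raw key. Nothing in this chain can throw.
function translate(key: string, target: Dictionary, fallback: Dictionary): string {
  return target[key] ?? fallback[key] ?? key;
}

const en: Dictionary = {
  "orders.cancel": "Cancel order",
  "orders.fulfill": "Fulfill order", // added after the Japanese pass
};
const ja: Dictionary = {
  "orders.cancel": "注文をキャンセル",
};

// The Japanese UI quietly renders English for the newer key.
const label = translate("orders.fulfill", ja, en);
```

Every call returns a string, so tests that assert "the label is not empty" keep passing while the Japanese UI fills up with English.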

Real example. medusajs/medusa#15839, an e-commerce platform with roughly 34,900 stars: the admin dashboard's Japanese locale file was 511 keys behind English. Not mistranslated, just absent, across product options, inventory, order fulfillment, MFA settings, and permissions. Someone had done a full Japanese translation pass at some point; the product just kept growing past it.

Jan, a local-AI desktop client with over 43,000 stars, showed the same drift spread across multiple namespaces rather than one. settings.json alone was 69 keys short with 4 more still sitting in English (janhq/jan#8352), and common.json, the namespace backing search, the providers panel, and toast messages, was 109 strings behind (janhq/jan#8349). It took three separate PRs to bring ja back to parity because the drift had been accumulating across releases, not from one gap.

Sometimes the gap is a handful of keys, not hundreds. mui/mui-x#23001 found that four Data Grid locale strings, including the "no columns" overlay text, had already been translated for zh-CN and ko-KR but were left commented out for ja-JP since the feature shipped. Two other locales got the follow-up treatment. Japanese didn't.

How to catch it. Run a key-diff between your source locale and every target locale on every release, not just at translation time. If en.json has any leaf key that ja.json lacks, you already have this bug, whether or not anyone's filed it. Compare key sets rather than key counts, since stale keys left in the target file can hide missing ones.
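A key-diff is small enough to sketch in full. This version assumes locale files are nested JSON objects whose leaves are strings, and it uses dot-joined paths to name each leaf; LocaleTree, leafKeys, and missingKeys are hypothetical names, and real files with plural forms or arrays would need extra handling.

```typescript
type LocaleTree = { [key: string]: string | LocaleTree };

// Collects every leaf path ("a.b.c") in a nested locale object.
function leafKeys(tree: LocaleTree, prefix = ""): string[] {
  const keys: string[] = [];
  for (const name of Object.keys(tree)) {
    const value = tree[name];
    const path = prefix ? `${prefix}.${name}` : name;
    if (typeof value === "string") {
      keys.push(path);
    } else if (value) {
      keys.push(...leafKeys(value, path));
    }
  }
  return keys;
}

// Leaf keys present in the source locale but absent from the target,
// in source order.
function missingKeys(source: LocaleTree, target: LocaleTree): string[] {
  const present = new Set(leafKeys(target));
  return leafKeys(source).filter((key) => !present.has(key));
}
```

Wired into CI, a non-empty result from missingKeys(en, ja) could fail the build or just post a warning, depending on how strict you want releases to be. Either way the drift becomes visible on the day it starts instead of hundreds of keys later.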

Pattern 3: translated, but wrong

What it is. The key exists, the string isn't empty, and it's still broken, because the translation carries the wrong meaning into a UI context the translator wasn't shown. This is the pattern that key-diffing and automated QA can't catch at all, because nothing is missing. Everything renders. It's just incorrect.

Why it's invisible. A native Japanese speaker skimming the label in isolation, outside the UI, might not catch it either. The error only shows up when the word sits next to the control it's supposed to describe.

Real example. ant-design/ant-design#58563, a component library with over 98,000 stars: the Typography component's expand/collapse control was labeled 拡大する ("to enlarge/zoom in") for expand and 崩壊 ("collapse," as in a building collapsing or a system failing) for collapse. Both are real, dictionary-correct Japanese words. Neither means "show more text" or "show less text." The fix swapped them for 展開する and 折り畳む, the actual UI-collapse vocabulary.

strapi/strapi#26845, a headless CMS with over 72,000 stars, had the WYSIWYG editor's character counter labeled キャラクター, a loanword that means "character" in the fictional, personified sense (a cartoon character, a game character), not "character" as in a unit of text. The correct word for a text character in this context is 文字. Someone had translated the English word, not the meaning it carried in that specific control.

How to catch it. This one doesn't have a mechanical check. It needs a native speaker actually looking at the rendered UI, not a spreadsheet of key-value pairs, because the failure lives in the gap between a word's dictionary sense and the sense the interface needs at that exact spot.

The actual pattern is one level up

Stack these three next to each other and a shape appears. Composition-state handling, key-completeness checks, and meaning-in-context review are three different kinds of infrastructure, and English-only teams don't build any of them by default, because English doesn't need them. English text is typed one character at a time, English locale files are the source of truth so they can't drift behind themselves, and translation isn't a concept that applies to the language you already wrote the UI in.

So none of this is really about translation quality. Translation is a one-time act on strings. What actually breaks is the surrounding system: does the input layer understand non-Latin text entry, does the release process notice a locale falling behind, does anyone check meaning-in-context instead of string presence. Localization is what happens when all three of those hold at once, continuously, not just on the day someone did a translation pass. Every project above is a well-maintained, actively developed repo. The gap wasn't effort. It was infrastructure nobody had a reason to build until an outsider pointed at the specific line.

I keep a running, searchable corpus of bugs like these, CJK-specific breakage across open-source input handling, locale files, and Unicode edge cases, with repro cases and the fix for each: github.com/greymoth-jp/cjk-failure-corpus. If you maintain something with text input or a translated locale, it's a fast way to check whether your project already has one of these three shapes sitting in it.

More of this kind of thing: github.com/greymoth-jp · glovrex.com

Top comments (4)

Frank • Jul 2

Have you found any common patterns in how CJK text is handled in these projects that lead to bugs? I'd love to swap ideas on this.

greymoth • Jul 2

Yeah, and it's a smaller set than the count suggests. The 97 collapse into about five recurring mistakes. The biggest by far (nearly 40%) is treating the Enter that confirms an IME conversion as a submit or commit: the user is mid-word, their first attempt gets eaten, and no English test ever catches it. Under all of them is one assumption, that one character is one byte is one column in one encoding. The moment text stops obeying that (a 3-byte kanji, a full-width space, half-width katakana, a lone surrogate) whatever was hard-coded around it breaks. Happy to swap notes. Every row in the corpus links to a real PR if you want to see them in the wild.
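A quick way to see that assumption fail is to measure the same string three ways. This is a sketch with hypothetical names (measure, utf8Length); the byte count is computed from code points by hand so it runs anywhere, and it counts a lone surrogate as 3 bytes, which is the size of the replacement character a UTF-8 encoder would typically substitute for it.

```typescript
interface TextSizes {
  utf16Units: number; // what String.prototype.length reports
  codePoints: number; // Unicode code points
  utf8Bytes: number;  // bytes when encoded as UTF-8
}

// UTF-8 byte length of a single code point.
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Walks the string by code point and tallies all three sizes.
function measure(text: string): TextSizes {
  let codePoints = 0;
  let utf8Bytes = 0;
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) ?? 0;
    codePoints += 1;
    utf8Bytes += utf8Length(cp);
    i += cp > 0xffff ? 2 : 1; // astral code points span two UTF-16 units
  }
  return { utf16Units: text.length, codePoints, utf8Bytes };
}

const ascii = measure("a");        // 1 unit, 1 code point, 1 byte
const kanji = measure("\u6f22");   // 漢: 1 unit, 1 code point, 3 bytes
const astral = measure("\u{20BB7}"); // 𠮷: 2 units, 1 code point, 4 bytes
```

Display columns are a fourth number on top of these, and graphemes a fifth; neither is covered here. The point is only that the three sizes agree for ASCII and disagree for almost everything else.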

i18nagent • Jul 6

The IME-composition one is such a good example because it fails in the least visible way — an input handler firing mid-composition sees the romaji preedit, and something like "submit on Enter" or "search as you type" grabs a half-formed string while the user is still choosing kanji. The fix is cheap once you know it (gate on compositionstart/compositionend, or check isComposing on the event), but you're right that it survives review because there's no IME in anyone's test suite — Playwright and Cypress type final characters directly and skip the composition phase entirely, so even E2E coverage lies here. Worth noting it's not strictly CJK either: dead-key accented input and some emoji pickers go through the same composition path, so a European-only team can hit pattern 1 too. What are the other two — width/truncation and normalization?

Aldo • Jul 12

This resonates strongly. For anyone building SaaS, the intricacies of CJK text, and indeed global character sets in general, consistently expose assumptions made during initial development. I've been down the rabbit hole with utf8 vs utf8mb4 in MySQL more times than I care to admit, often finding that what seemed like a minor database configuration detail early on turns into a significant migration project later. It's not just about supporting the characters themselves; it's about the downstream implications for everything from indexing to collation, where subtle differences in how a database handles character sets can lead to silently incorrect results or even data corruption if you're not careful.

Beyond the database, string length validation and UI layout are constant battles. We often find that what's a 'character' in a Latin context (often one codepoint, roughly one glyph, and predictable byte length) is completely different for CJK. A single logical character or grapheme in Japanese might be composed of multiple codepoints, or visually take up double the horizontal space of an ASCII character. This means a VARCHAR(255) column that works fine for English names can truncate a Japanese name, or a frontend character counter gives a misleading number, leading to frustrating UX or even silent data loss. Explicitly implementing grapheme-aware length checks in our validation layers became a necessary complexity for a truly international product.