the one thing i've learned after fixing CJK input across 23 open-source PRs: if you don't have a regression fixture for IME composition, it breaks again within six months.
this isn't theoretical. i've watched the same compositionend-on-blur bug re-enter codebases that had no IME test at all. the fix goes in, the bug comes back, nobody notices until a Japanese or Chinese user files a report.
here's how to actually lock it down.
why IME input is different from regular keyboard input
when a user types Japanese on a standard western keyboard, they're not pressing letter keys one-to-one. they type romaji (the romanized phonetic form), and the operating system's IME intercepts those keystrokes to build a candidate string. the user sees a temporary underlined "composition string" and then confirms it -- either by pressing Enter/Space or by selecting a candidate.
during composition, the browser or terminal runtime fires a sequence of events:
- compositionstart -- composition begins, the IME takes control
- compositionupdate -- the candidate string changes
- compositionend -- the user confirmed; the final character(s) are committed
the key insight: between compositionstart and compositionend, keydown/keyup events still fire -- but they carry isComposing: true and typically keyCode 229 (the IME virtual keycode). code that handles Enter or Backspace without checking isComposing will act on keystrokes the IME is still consuming, breaking the composition mid-input.
that's how "pressing Enter to send a message" also confirms the IME and sends, all at once. annoying but fixable. what's harder to catch is the combination of bugs.
the five failure modes that actually show up
1. early Enter fire
handler checks for keyCode === 13 but doesn't check isComposing. the Enter that confirms Japanese input also triggers "submit" or "next line."
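a minimal sketch of the guard, assuming a keydown handler. the name isPlainEnter is made up for illustration, and the keyCode 229 check is a belt-and-braces assumption for browsers that report 229 on the confirming keystroke even after isComposing has gone back to false:

```javascript
// Hypothetical helper: returns true only when Enter should trigger "submit".
// `event` is a KeyboardEvent, or any object with the same three fields.
function isPlainEnter(event) {
  // isComposing is true while an IME composition session is active.
  if (event.isComposing) return false;
  // keyCode 229 marks a keystroke the IME handled; treat it as "not ours".
  if (event.keyCode === 229) return false;
  return event.key === 'Enter';
}

// Example wiring (assumes a hypothetical `input` element and `submit` function):
// input.addEventListener('keydown', (e) => { if (isPlainEnter(e)) submit(); });
```

keeping the check in a pure function means the fixture can feed it plain objects without a DOM.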
2. byte-slice crash
code that truncates or slices a string by raw byte index -- common in Rust/Go/C++ integrations, or old Node.js Buffer code -- hits the middle of a multi-byte sequence. most Japanese characters are 3 bytes in UTF-8 (rarer kanji outside the Basic Multilingual Plane are 4). a truncate-to-10 that means "10 characters" but counts bytes will silently corrupt text or panic at runtime.
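one way to avoid the mid-sequence cut is to spend a byte budget one code point at a time. this is a sketch, not a drop-in: truncateUtf8 is a hypothetical name, and it keeps code points whole but does not try to keep grapheme clusters (combining marks, emoji sequences) together.

```javascript
// Hypothetical helper: truncate `str` to at most `maxBytes` of UTF-8
// without cutting through a code point.
function truncateUtf8(str, maxBytes) {
  const encoder = new TextEncoder();
  let out = '';
  let used = 0;
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of str) {
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    out += ch;
    used += size;
  }
  return out;
}

// A raw 6-byte cut of "ok日本語test" would land inside "本";
// this returns "ok日" (5 bytes) instead.
const truncated = truncateUtf8('ok日本語test', 6);
```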
3. fullwidth width mismatch
terminal emulators and some custom text renderers assume each character takes exactly one column. fullwidth characters (most CJK, fullwidth latin) take two columns. if column math ignores this, the cursor ends up at the wrong position and the UI tears or wraps incorrectly.
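a rough sketch of width-aware column math. the range table below is an assumption that covers common wide characters only; it is not a complete East Asian Width table, and it ignores combining marks and control characters, so treat it as a fixture helper rather than a renderer.

```javascript
// Approximate list of code point ranges that occupy two terminal columns.
const WIDE_RANGES = [
  [0x1100, 0x115f],   // Hangul Jamo leading consonants
  [0x2e80, 0x303e],   // CJK radicals through CJK symbols and punctuation
  [0x3041, 0x33ff],   // Hiragana, Katakana, and neighboring CJK blocks
  [0x3400, 0x4dbf],   // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff],   // CJK Unified Ideographs
  [0xac00, 0xd7a3],   // Hangul syllables
  [0xf900, 0xfaff],   // CJK Compatibility Ideographs
  [0xff01, 0xff60],   // Fullwidth forms (including fullwidth latin)
  [0xffe0, 0xffe6],   // Fullwidth signs
  [0x20000, 0x3fffd], // Supplementary ideographic planes
];

function isWide(codePoint) {
  return WIDE_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

// Hypothetical helper: number of columns `str` occupies.
function displayWidth(str) {
  let width = 0;
  for (const ch of str) {
    width += isWide(ch.codePointAt(0)) ? 2 : 1;
  }
  return width;
}

// "ok日本語test" has 9 characters but occupies 12 columns.
const columns = displayWidth('ok日本語test');
```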
4. commit callback drop on focus shift
this one is subtle. the user is mid-composition when focus moves to another element -- say, a dropdown opens or a modal appears programmatically. some runtimes fire compositionend correctly. some don't. the pending composition string is either dropped silently or committed in a broken state. neither outcome is logged anywhere.
5. composition re-entry after blur/refocus
after the composition is committed and the field re-focuses, isComposing may still read as true on some older browser versions. code that early-exits on isComposing will then refuse all keyboard input until the user manually dismisses or clicks away and back.
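for failure modes 4 and 5, one option is to stop trusting the runtime's flag and keep an app-owned one that is force-reset on blur. this is a sketch under that assumption; CompositionTracker and its method names are hypothetical, and whether pending text gets committed or discarded on blur is left to the caller.

```javascript
// App-owned composition state. Wire each method to the matching DOM event.
class CompositionTracker {
  constructor() {
    this.composing = false;
    this.pending = '';
  }

  onCompositionStart() {
    this.composing = true;
    this.pending = '';
  }

  onCompositionUpdate(data) {
    this.pending = data;
  }

  onCompositionEnd() {
    this.composing = false;
    this.pending = '';
  }

  // Call on blur. Returns whatever text was still mid-composition so the
  // caller can commit or discard it explicitly instead of losing it silently.
  onBlur() {
    const interrupted = this.composing ? this.pending : '';
    this.composing = false; // never carry a stale flag across refocus
    this.pending = '';
    return interrupted;
  }
}
```

if the runtime does fire compositionend after the blur, onCompositionEnd just clears state that is already clear, so the late event is harmless.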
building a regression fixture
the goal is three minimal fixtures that can run headless in CI.
fixture 1: compose-confirm
this is the most important one. it simulates:
- fire compositionstart
- fire compositionupdate with a candidate string ("にほん")
- fire compositionend with the final committed value ("日本")
- assert: the committed value is in the buffer; no side effects (submit, navigation) fired
for browser-based editors, script this with CompositionEvent:
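    // dispatch the three composition events in the order a real IME would
    function simulateCompose(target, candidate, final) {
      target.dispatchEvent(new CompositionEvent('compositionstart', { bubbles: true }));
      // the candidate string is what the user sees underlined mid-composition
      target.dispatchEvent(new CompositionEvent('compositionupdate', {
        bubbles: true, data: candidate
      }));
      // data on compositionend is the text that actually gets committed
      target.dispatchEvent(new CompositionEvent('compositionend', {
        bubbles: true, data: final
      }));
    }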
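to show the assertion half of the fixture, here is a sketch that runs under plain Node. it makes two assumptions: a stand-in class replaces CompositionEvent where the global is missing, and createEditor is a hypothetical editor-under-test that appends committed text and counts submits. swap in your real component.

```javascript
// Stand-in for plain Node, which has Event/EventTarget but no CompositionEvent.
// Browsers and jsdom provide the real class.
const CompositionEventImpl = globalThis.CompositionEvent ?? class extends Event {
  constructor(type, init = {}) {
    super(type, init);
    this.data = init.data ?? '';
  }
};

// Same sequence as simulateCompose above, repeated so this snippet runs alone.
function simulateCompose(target, candidate, final) {
  target.dispatchEvent(new CompositionEventImpl('compositionstart', { bubbles: true }));
  target.dispatchEvent(new CompositionEventImpl('compositionupdate', { bubbles: true, data: candidate }));
  target.dispatchEvent(new CompositionEventImpl('compositionend', { bubbles: true, data: final }));
}

// Hypothetical editor under test: appends committed text, counts submits.
function createEditor(target) {
  const state = { buffer: '', submits: 0 };
  target.addEventListener('compositionend', (e) => { state.buffer += e.data; });
  target.addEventListener('submit', () => { state.submits += 1; });
  return state;
}

// Runs the compose-confirm sequence and returns the resulting editor state.
function runComposeConfirmFixture() {
  const target = new EventTarget();
  const state = createEditor(target);
  simulateCompose(target, 'にほん', '日本');
  return state;
}
```

the fixture passes when the returned state has buffer equal to "日本" and submits equal to 0.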
for terminal apps (Rust/Go/C++) that don't have a JS runtime, inject the raw byte sequence into the PTY or input buffer and assert the resulting text buffer state directly.
fixture 2: byte-boundary
take a string that mixes ASCII and CJK: "ok日本語test". pass it through every string operation your code performs -- truncate, pad, wrap, tokenize. assert that the result contains only complete codepoints.
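a cheap check in Node.js. it flags lone surrogates (half of a split character outside the Basic Multilingual Plane), so add such a character to the fixture string, for example "𠮷", if you want the check to exercise anything; text that was already mis-decoded from bytes shows up as U+FFFD instead:
    // true when str contains a lone surrogate: encoding to UTF-8 replaces it
    // with U+FFFD, so the round trip no longer matches the input
    function hasOrphanBytes(str) {
      return Buffer.from(str, 'utf8').toString('utf8') !== str;
    }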
or in Go:
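    import "unicode/utf8"

    // hasOrphanBytes reports whether s contains any invalid UTF-8 sequence,
    // which is what a slice through the middle of a character leaves behind.
    func hasOrphanBytes(s string) bool {
        return !utf8.ValidString(s)
    }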
run this assertion after every transform that touches string length or slice boundaries.
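a sketch of what "after every transform" can look like as one loop. the names here are hypothetical, the sample string is the one above plus an astral character ("𠮷", U+20BB7) so that a UTF-16 cut can actually split something, and the transforms map is a placeholder for your own string utilities.

```javascript
// Fixture string: ASCII + CJK + one character outside the BMP.
const SAMPLE = 'ok日本語test𠮷';

// True when `str` contains a lone surrogate (round trip through UTF-8 changes it).
function hasLoneSurrogate(str) {
  return Buffer.from(str, 'utf8').toString('utf8') !== str;
}

// Runs every transform on `sample` and returns the names of those whose
// output contains a split character or a U+FFFD replacement character.
function findBrokenTransforms(sample, transforms) {
  const broken = [];
  for (const [name, fn] of Object.entries(transforms)) {
    const out = fn(sample);
    if (hasLoneSurrogate(out) || out.includes('\uFFFD')) broken.push(name);
  }
  return broken;
}

// Placeholder transforms: one cuts by UTF-16 code unit, one by code point.
const broken = findBrokenTransforms(SAMPLE, {
  dropLastUnit: (s) => s.slice(0, s.length - 1),
  dropLastCodePoint: (s) => Array.from(s).slice(0, -1).join(''),
});
// broken is ['dropLastUnit']: the code-unit cut leaves half of "𠮷" behind.
```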
fixture 3: focus-shift composition drop
- start a composition (fire compositionstart)
- programmatically move focus away (call .blur() on the element or trigger a modal)
- assert: either the composition was cleanly cancelled (no partial text in buffer) or it was committed to its pre-blur state without corruption
the acceptable outcomes differ by app contract. the test doesn't enforce which -- it enforces that the outcome is one of the two valid states, not a third corrupted one.
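the "one of two valid states" assertion can be sketched as a small predicate. isValidFocusShiftOutcome is a hypothetical name, and the sketch assumes the composition happens at the end of the buffer; an editor with a mid-buffer caret would need the insertion offset as well.

```javascript
// `before`  : buffer contents before the composition started
// `pending` : last composition string seen before the blur
// `after`   : buffer contents once the blur has been handled
function isValidFocusShiftOutcome(before, pending, after) {
  const cancelled = after === before;           // no partial text left behind
  const committed = after === before + pending; // pre-blur state kept intact
  return cancelled || committed;
}
```

anything else, such as a buffer holding only part of the pending string, is the third, corrupted state the fixture exists to catch.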
wiring these into CI
if your test suite is Jest or Vitest, these are plain unit tests. jsdom dispatches composition events cleanly enough for fixtures 1 and 3. for fixture 2, you don't need a DOM at all -- just import the string util and assert.
if you're testing a terminal emulator or native editor, you're likely using a PTY-based harness. the principle is identical: inject the sequence, snapshot the buffer state, diff.
the test itself can be 20 lines. the value is having it in CI at all -- so the next refactor that breaks composition is caught in the pull request, not six months later when a user files a bug titled "Japanese input broken after update."
why english-native teams miss it
the honest answer: nobody on the team types Japanese. the IME code path is invisible in their daily workflow.
more specifically: the browser and OS hide the composition abstraction well enough that you can build a functional text editor entirely in English without ever triggering a compositionstart event. the bugs are invisible until a user who relies on an IME hits them.
there's also a framing problem. "i18n" gets treated as a translation layer -- add locale files, ship. the input layer is a separate problem and typically falls outside i18n tickets entirely. it lives in no one's backlog.
the fix isn't harder tooling. it's adding three fixtures to CI and putting "CJK input" on the review checklist alongside "keyboard accessibility" and "mobile viewport." if it's on the list, reviewers look for it. if it's not, it will never surface in review.
that's the whole playbook. three fixtures, one checklist item, and you've eliminated most of the regression surface for one of the most widely used families of writing systems on earth.
The fixtures from this post, runnable in CI (JS + Go, MIT): https://github.com/greymoth-jp/cjk-agent-fixtures
field notes on the Japan-shaped holes in global software · github.com/greymoth-jp