greymoth

a width check said the string was safe to cut. it split a kanji in half.

a name went into a terminal table and came out broken. the surname was 𠮷田. that first character is not the ordinary 吉 you get from the 吉 key, it is 𠮷 (U+20BB7), a rarer form that real people in Japan actually have on their family register. the table truncated the cell to fit a column, and what printed was 𠮷 followed by a replacement character. the kanji had been cut in half.

the interesting part is where the bug lived. not in the truncation loop. in a one-line shortcut that decided, before truncating, that this particular string was safe to cut by raw index. it was wrong, and it was wrong for a reason that only shows up on the exact character I just described.

three numbers that are usually the same, and one string where they aren't

a JavaScript string has more than one length depending on what you ask.

  • "𠮷".length is 2. .length counts UTF-16 code units, and 𠮷 lives outside the Basic Multilingual Plane, so it is stored as a surrogate pair: two code units, \uD842 and \uDFB7.
  • its code-point count is 1. [..."𠮷"].length is 1.
  • its display width, the number of terminal columns it occupies, is 2. it is an East Asian wide character.

for plain ASCII these all collapse to the same number. "abc" is 3 code units, 3 code points, 3 columns. that coincidence is what a lot of text code quietly leans on. it holds right up until a character makes two of those numbers agree for different reasons.

𠮷 is exactly that character. two code units because it is a surrogate pair. two columns because it is wide. same number, 2, arrived at two completely different ways. hold onto that, it is the whole bug.
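here is a small sketch that prints all three numbers side by side. roughWidth and measure are names made up for this example, and roughWidth is a deliberately crude stand-in for a real width library such as string-width: it only knows a handful of wide ranges, which is an assumption good enough for these samples and nothing more.

```javascript
// Hypothetical, deliberately rough width function (NOT string-width):
// a few East Asian wide / emoji ranges count as 2 columns, everything else as 1.
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {               // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x3040 && cp <= 0x30ff) ||   // hiragana, katakana
      (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
      (cp >= 0xac00 && cp <= 0xd7a3) ||   // Hangul syllables
      (cp >= 0x1f300 && cp <= 0x1f64f) || // a slice of the emoji range
      (cp >= 0x20000 && cp <= 0x3fffd);   // CJK Extension B and beyond
    width += wide ? 2 : 1;
  }
  return width;
}

// Hypothetical helper: the three "lengths" of one string.
function measure(str) {
  return {
    codeUnits: str.length,        // UTF-16 code units
    codePoints: [...str].length,  // spread splits by code point
    columns: roughWidth(str),     // approximate terminal columns
  };
}

console.log(measure('abc'));       // { codeUnits: 3, codePoints: 3, columns: 3 }
console.log(measure('\u{20BB7}')); // { codeUnits: 2, codePoints: 1, columns: 2 }
```

for 𠮷 the first and last numbers match, and the middle one is the only hint that something is off.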

the real code

this is the truncation helper in cli-table3, the library a lot of CLIs use to draw tables. strlen here is display width. it strips ANSI color codes and runs the string through string-width, which counts a wide CJK character as 2. so strlen answers "how many columns," not "how many characters."

function truncateWidth(str, desiredLength) {
  if (str.length === strlen(str)) {
    return str.substr(0, desiredLength);
  }

  while (strlen(str) > desiredLength) {
    str = str.slice(0, -1);
  }

  return str;
}

read the first branch as an optimization. "if the code-unit length equals the display width, then every character is one unit and one column, so there are no wide characters and nothing tricky, I can just cut by index with substr." for "abc" that is true, 3 === 3, cut away.

now feed it "𠮷𠮷". code-unit length is 4. display width is 4. 4 === 4, so the branch fires and it cuts by code unit:

"𠮷𠮷".substr(0, 3)   // "𠮷" + "\uD842"

substr(0, 3) takes three code units: the full first 𠮷, then the high surrogate of the second one. the low surrogate is left behind. you get one clean kanji followed by a lone high surrogate \uD842, which is not a character at all. terminals render it as the replacement box. that is the half a kanji in the table cell.

the shortcut was built for the case where length equals width because everything is one-to-one. a surrogate-pair wide character satisfies length === width too, 2 === 2, but for the opposite reason, both numbers are 2 because the character is doubled on both axes. it walks straight into the fast path and gets sliced by index, which is the one thing that path assumed it would never have to do.
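if you would rather check this in code than squint at a terminal, a sketch like the following works. hasLoneSurrogate is a helper name invented for this example, not something cli-table3 exports, and the cut uses slice where the library uses substr (same result for these arguments).

```javascript
// Hypothetical helper: true if the string contains a high surrogate with no
// low surrogate after it, or a low surrogate with no high surrogate before it.
// No `u` flag on purpose, so the regex looks at raw UTF-16 code units.
function hasLoneSurrogate(str) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(str);
}

const kanji = '\u{20BB7}';        // 𠮷, stored as \uD842 \uDFB7
const pair = kanji + kanji;       // 4 code units

const cut = pair.slice(0, 3);     // cut by code unit, like the fast path

console.log(cut.length);                       // 3
console.log(cut.charCodeAt(2).toString(16));   // d842
console.log(hasLoneSurrogate(pair));           // false
console.log(hasLoneSurrogate(cut));            // true
```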

why it survived

the obvious question is how a CJK bug survives in a table library that people clearly use with CJK. the answer is that ordinary Japanese and Chinese text never reaches this branch.

take 漢. it is U+6F22, inside the BMP, so "漢".length is 1. its width is 2. 1 === 2 is false, so 漢 skips the fast path entirely and goes to the while loop below. every common kanji, every kana, every Hangul syllable behaves this way: one code unit, two columns, length never equals width. they are all safe.

the fast path only misfires when a single character is a surrogate pair and wide. that intersection is small. it is CJK Extension B and beyond, the rare kanji that show up in personal names and place names, plus emoji, most of which are also non-BMP and width 2. so the library worked for years of 東京 and 漢字 and quietly mangled 𠮷田 and anything with an emoji in a narrow column. the common case took a different branch, so the shortcut looked safe.
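the branch selection can be laid out as a small sketch. the column widths here are typed in by hand rather than computed, and takesFastPath and fastPathIsSafe are illustrative names, not functions from the library.

```javascript
// Widths are supplied by hand (assumed, not measured by a width library).
const samples = [
  { text: 'abc', columns: 3 },        // ASCII
  { text: '\u6F22', columns: 2 },     // 漢, BMP, wide
  { text: '\u{20BB7}', columns: 2 },  // 𠮷, non-BMP, wide
  { text: '\u{1F600}', columns: 2 },  // 😀, non-BMP, wide
];

// The original shortcut: code-unit length equals display width.
function takesFastPath(sample) {
  return sample.text.length === sample.columns;
}

// Cutting by index is only safe if there is no surrogate pair to split.
function fastPathIsSafe(sample) {
  return !/[\uD800-\uDBFF]/.test(sample.text);
}

for (const s of samples) {
  console.log(s.text, takesFastPath(s), fastPathIsSafe(s));
}
// abc  true  true    -> fast path, fine
// 漢   false true    -> never reaches the fast path
// 𠮷   true  false   -> fast path, unsafe
// 😀   true  false   -> fast path, unsafe
```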

the slow path had a milder version of the same disease, by the way. str.slice(0, -1) removes one code unit, not one character. hand the loop a string ending in a surrogate pair and it lops off a low surrogate on the first pass and leaves the high one dangling. same family, quieter symptom.
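the difference between trimming a code unit and trimming a code point fits in a few lines. this is only a sketch of one loop iteration, not the library's loop.

```javascript
const s = 'a\u{20BB7}';   // "a" followed by 𠮷: 3 code units, 2 code points

// What the old slow path did on each pass: drop one code unit.
const byCodeUnit = s.slice(0, -1);

// Dropping one code point instead: split by code point, drop the last element.
const byCodePoint = Array.from(s).slice(0, -1).join('');

console.log(byCodeUnit.length);                      // 2
console.log(byCodeUnit.charCodeAt(1).toString(16));  // d842 (dangling high surrogate)
console.log(byCodePoint);                            // a
```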

the fix

two changes. guard the fast path so it refuses any string that contains a high surrogate, and make the slow path trim whole code points instead of code units.

function truncateWidth(str, desiredLength) {
  // `str.length === strlen(str)` is also true for surrogate-pair characters
  // (e.g. CJK Extension B or emoji), which count as 2 code units and 2 columns.
  // `substr`/`slice` cut by code unit, so exclude them here and trim by code
  // point below to avoid splitting a surrogate pair into a lone surrogate.
  if (str.length === strlen(str) && !/[\uD800-\uDBFF]/.test(str)) {
    return str.substr(0, desiredLength);
  }

  let chars = Array.from(str);
  while (strlen(chars.join('')) > desiredLength) {
    chars.pop();
  }

  return chars.join('');
}

Array.from(str) iterates by code point, so Array.from("𠮷𠮷") is a two-element array, each element a whole kanji. pop() removes one whole character. the loop can no longer stop in the middle of a surrogate pair because there is no middle to stop in. the fast path stays for the genuinely simple case, ASCII and other strings with no surrogates, where substr is both correct and cheaper.

worth naming the tools. Array.from and the spread operator both split by code point, which fixes surrogate pairs. they do not split by grapheme, so a flag emoji or a family emoji built from several code points joined with zero-width joiners will still come apart. if you need whole user-perceived characters, that is Intl.Segmenter with granularity: 'grapheme'. code point was the right level here because the unit of width is the code point, but know which one you are reaching for.
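to make the code point versus grapheme distinction concrete, here is a sketch using Intl.Segmenter, which is a standard API in current Node.js and browsers. graphemes is a helper name chosen for this example, and the exact clustering depends on the Unicode data your runtime ships, so treat the counts as what a reasonably recent runtime should report.

```javascript
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: split a string into user-perceived characters.
function graphemes(str) {
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

const kanji = '\u{20BB7}';                                 // 𠮷
const flag = '\u{1F1EF}\u{1F1F5}';                         // two regional indicators
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';  // three emoji joined by ZWJ

for (const text of [kanji, flag, family]) {
  console.log(text.length, Array.from(text).length, graphemes(text).length);
}
// code units, code points, graphemes:
// 𠮷      2 1 1
// flag    4 2 1
// family  8 5 1
```

Array.from is enough for 𠮷, but it would hand the flag and the family back in pieces.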

the failing fixture

this is the test that goes red before the fix and green after. it is the whole point, because the fix is only a few lines and the value is keeping it fixed, not finding it once.

it('does not split a surrogate-pair wide char (CJK Ext B)', function () {
  let kanji = String.fromCodePoint(0x20bb7);          // 𠮷
  expect(truncate('a' + kanji + 'bc', 4)).toEqual('a' + kanji + '…');
  expect(truncate('a' + kanji + 'bc', 3)).toEqual('a…');
  expect(truncate(kanji + kanji, 3)).toEqual(kanji + '…');
});

it('does not split a surrogate-pair wide char (emoji)', function () {
  let emoji = String.fromCodePoint(0x1f600);          // 😀
  expect(truncate('a' + emoji + 'bc', 3)).toEqual('a…');
  expect(truncate('x' + emoji + emoji + 'y', 4)).toEqual('x' + emoji + '…');
});

note the inputs are built with String.fromCodePoint, not pasted glyphs. that keeps the test readable in any editor and makes the code point explicit, so nobody later "cleans up" 𠮷 into 吉 and deletes the coverage without noticing. the assertion that matters most is truncate(kanji + kanji, 3): a width budget that lands between the two columns of the second character. the old code returned a lone surrogate there. that is the exact spot the bug lives.

the check, for the next one

the general shape is bigger than one library. any code that truncates, pads, aligns, or measures text is juggling three different numbers for one string, and it is only correct if it uses the same one throughout:

string | code units (.length) | code points | display columns
abc    | 3 | 3 | 3
漢字   | 2 | 2 | 4
𠮷     | 2 | 1 | 2
😀     | 2 | 1 | 2

the failure mode is always the same: measure by one number, cut by another. cli-table3 measured width, then cut by code unit, and the two disagreed on the one character where they happened to be equal for different reasons. so the check is a habit, not a rule. when you slice a string with substr, slice, or a bare index, ask what unit that index is in. it is code units. then ask whether the length you compared it against was in the same unit. if you measured display width or code points and then cut by index, you have this bug, and it is invisible until a non-BMP character walks through.
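one way to keep the unit consistent is to measure and cut in the same pass. the sketch below is an alternative shape, not the cli-table3 fix: truncateToColumns and toyWidth are names invented here, and toyWidth is a toy assumption (anything at or above U+2E80 counts as 2 columns) that you would replace with a real width function.

```javascript
// Walk the string by code point, add up columns, stop before the budget is
// exceeded. The index never touches code units, so a pair cannot be split.
function truncateToColumns(str, maxColumns, widthOfChar) {
  let out = '';
  let used = 0;
  for (const ch of str) {               // one whole code point per iteration
    const w = widthOfChar(ch);
    if (used + w > maxColumns) break;   // this character would not fit
    out += ch;
    used += w;
  }
  return out;
}

// Toy stand-in for a real width function (assumption, see above).
const toyWidth = (ch) => (ch.codePointAt(0) >= 0x2e80 ? 2 : 1);

const kanji = String.fromCodePoint(0x20bb7);

console.log(truncateToColumns('abc', 2, toyWidth) === 'ab');           // true
console.log(truncateToColumns(kanji + kanji, 3, toyWidth) === kanji);  // true
console.log(truncateToColumns('a' + kanji + 'bc', 2, toyWidth));       // a
```

a budget of 3 columns against 𠮷𠮷 yields one whole kanji and one unused column, which is the honest answer when the next character is two columns wide.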

and test it deliberately. one CJK Extension B character, String.fromCodePoint(0x20bb7), and one emoji, at a width that lands mid-character. ASCII will never show you this. you have to hand the function the input it is quietly afraid of.

this one is a single entry in a corpus of 97 real CJK, IME, and Unicode failures I have been collecting, most of them one-line fixes hiding in libraries that work perfectly in English. the same split-a-code-point shape shows up in opentype.js clamping cmap character codes (open), in slate keeping Indic conjuncts together (open), and in web UI truncation and a markdown smart-quotes pass where I filed the same fix and it did not land (clerk and markdown-it, both closed). the corpus and a runnable fixture suite in JS and Go are linked below. don't take my word for the diagnosis, the cli-table3 diff is public, read it and decide if it holds.

— greymoth (@greymoth__)
