FEConf

Posted on Jun 29

Wait, Why Is This Character Here? (1)

#frontend #webdev

This article summarizes the FEConf2024 presentation, <Wait, Why Is This Character Here? (Subtitle: A Complete, Sometimes-Useful Guide to Hangul Unicode)>. The content is divided into two parts. In Part 1, we'll learn about Unicode and how Hangul is represented in it. In Part 2, we'll delve into why the replacement character (�) appears and how to fix it. All images in this article are from the presentation slides of the same name and are not separately cited.

'Wait, Why Is This Character Here? (Subtitle: A Complete, Sometimes-Useful Guide to Hangul Unicode)'
Jaehan Jae, CTO at Denier

Hello. I'm Jaehan Jae, and I'll be presenting on the topic, "Wait, Why Is This Character Here?" I previously worked on the Word and Slide products at Naver Office, and I'm currently the CTO at Denier, where we run a community for dentists and build platforms for medical professionals.

I've packed this article with valuable information so that you can easily solve this problem when you encounter it and give an outstanding answer if it comes up in a technical interview. I hope you find it useful.

Here's what we'll cover in this article:

The structure of Hangul Syllables in Unicode
The composition and limitations of Hangul characters in euc-kr/cp949
UTF-8 and UCS-2 in everyday life
Understanding composed and precomposed characters
Troubleshooting Hangul and decoding broken UTF characters

Unicode

Unicode Hell

This presentation began with the image below. Let's take a look. A manager I worked with sent me this Slack message.

"When I'm writing a long piece of text in the editor, a strange question mark character keeps appearing."

"When I edit the text to delete it and upload it again, it just shows up somewhere else. It's like playing whack-a-mole!"

The manager who sent this message described this phenomenon as "Unicode hell." The bottom right of the image shows the actual broken text they experienced. Now, let's find out why this replacement character (�) appears and embark on a journey to solve this problem.

What is Unicode?

First, let's learn about Unicode. Unicode is a standard whose character set is kept in sync with ISO/IEC 10646. In other words, Unicode is a consistent standard for representing all the world's characters, and the name is also used for the organization that manages it, the Unicode Consortium. The Unicode code space is divided into several large areas called planes.

The first area is called the Basic Multilingual Plane (BMP), which covers the code points U+0000 to U+FFFF. This "U+" notation is the common way to write Unicode code points: the hexadecimal digits following "U+" are the value of the code point. The "U+" prefix is likely used by convention to avoid confusion with plain numbers.
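As a small illustration of the "U+" notation, the sketch below formats a character's code point the way it is written above. The helper name toUPlus is made up for this example; codePointAt, toString, and padStart are standard JavaScript.

```javascript
// Format the first code point of a string in "U+XXXX" notation.
// toUPlus is a hypothetical helper name used only for this example.
function toUPlus(ch) {
  const codePoint = ch.codePointAt(0);
  // Code points are conventionally written with at least four hex digits.
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toUPlus('A'));  // U+0041
console.log(toUPlus('가')); // U+AC00
```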

The second area is the Supplementary Multilingual Plane (SMP), which holds supplementary scripts as well as many of the emoji and symbols you use. Finally, the third area is the Supplementary Ideographic Plane (SIP), which contains additional ideographs such as less common Hanja (Chinese characters).
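To make the three areas concrete, here is a minimal sketch that works out which plane a character belongs to. Each plane spans 0x10000 code points, so the plane number is the code point divided by 0x10000. The function name planeOf is an assumption for illustration.

```javascript
// Return the plane number of the first code point in a string.
// Plane 0 is the BMP, plane 1 the SMP, plane 2 the SIP.
// planeOf is a hypothetical helper name.
function planeOf(ch) {
  return Math.floor(ch.codePointAt(0) / 0x10000);
}

console.log(planeOf('가'));        // 0 (U+AC00, in the BMP)
console.log(planeOf('\u{1F600}')); // 1 (an emoji at U+1F600)
console.log(planeOf('\u{20000}')); // 2 (an ideograph at U+20000)
```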

The image above shows the Basic Multilingual Plane. As you can see, it contains a vast number of characters. Each small square represents a block of 256 code points. Rows 00 to 02 contain Roman letters, and beyond that, past the blue European characters, you can see CJK characters starting from row 34. CJK stands for Chinese, Japanese, and Korean, and this block contains the Hanja used in these three countries. The Hangul characters we use are in the red-marked East Asian script area, which covers rows 11, 30, and 31, and rows A0 to D7.

The BMP is a 16-bit space, representing code points from U+0000 to U+FFFF, which allows for a total of 65,536 code points. Unicode states that most of the world's modern writing systems are encoded in this range.

Hangul in Unicode

So, where exactly does the Hangul we use fit in? As mentioned, it's included in the East Asian script block, starting from the section that begins with AC. The image below shows a partial list of Hangul in Unicode.

The Hangul we use is included in the range from U+AC00 to U+D7A3, starting with '가' (ga) and ending with '힣' (hih). This is the same range you use in regular expressions, from '가' to '힣', to check if a character is Hangul. In Unicode, these code points are the precomposed syllables of the Hangul Syllables block.
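A minimal sketch of that regular-expression check follows. The function name isHangulSyllable is made up for this example, and the range is written with escapes so it is clear that '가' to '힣' means U+AC00 to U+D7A3.

```javascript
// True if the string is exactly one precomposed Hangul syllable.
// /^[\uAC00-\uD7A3]$/ is the same range as /^[가-힣]$/.
// isHangulSyllable is a hypothetical helper name.
function isHangulSyllable(ch) {
  return /^[\uAC00-\uD7A3]$/.test(ch);
}

console.log(isHangulSyllable('한')); // true
console.log(isHangulSyllable('A'));  // false
// A standalone jamo such as 'ㄱ' (U+3131) lies outside this range.
console.log(isHangulSyllable('\u3131')); // false
```

Note that this check only recognizes precomposed syllables, so text stored as separate jamo would not match it.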

So how many Hangul characters are in this range? The calculation is based on 19 initial consonants (choseong), 21 medial vowels (jungseong), and 27 final consonants (jongseong), plus the case for syllables with no final consonant. The formula 19 × 21 × (27 + 1) yields 11,172 possible syllables. A system's ability to render all 11,172 characters determines whether it fully supports modern Hangul.
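The same arithmetic also determines where each syllable sits in the block. The sketch below assumes the standard layout, in which a syllable's code point is 0xAC00 + (initial × 21 + medial) × 28 + final, with the final index 0 meaning "no final consonant". The function and constant names are made up for this example.

```javascript
const SYLLABLE_BASE = 0xAC00; // code point of '가'
const INITIAL_COUNT = 19;     // choseong
const MEDIAL_COUNT = 21;      // jungseong
const FINAL_COUNT = 28;       // 27 jongseong + 1 for "no final consonant"

const SYLLABLE_COUNT = INITIAL_COUNT * MEDIAL_COUNT * FINAL_COUNT;

// Build a syllable from zero-based jamo indices (hypothetical helper).
function composeSyllable(initial, medial, final = 0) {
  const offset = (initial * MEDIAL_COUNT + medial) * FINAL_COUNT + final;
  return String.fromCodePoint(SYLLABLE_BASE + offset);
}

// Split a precomposed syllable back into its jamo indices (hypothetical helper).
function decomposeSyllable(syllable) {
  const offset = syllable.codePointAt(0) - SYLLABLE_BASE;
  return {
    initial: Math.floor(offset / (MEDIAL_COUNT * FINAL_COUNT)),
    medial: Math.floor((offset % (MEDIAL_COUNT * FINAL_COUNT)) / FINAL_COUNT),
    final: offset % FINAL_COUNT,
  };
}

console.log(SYLLABLE_COUNT);            // 11172
console.log(composeSyllable(18, 0, 4)); // 한 (ㅎ = 18, ㅏ = 0, ㄴ = 4)
console.log(decomposeSyllable('한'));   // { initial: 18, medial: 0, final: 4 }
```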

There are also Hangul characters not included here. As shown below, old Hangul characters like '아래 아' (arae-a), characters without an initial consonant, and characters without a medial vowel are not included in this precomposed Hangul syllables block.

Hangul Sorting Standard

Next, let's look at the Hangul sorting standard. The standard order, from '가나다라마바사' (ganadaramabasa) to '아자차카타파하' (ajachakatapaha), hasn't been established for very long. The order was finalized through a Ministry of Education notice in 1988, based on discussions about where to place double consonants and complex vowels during the computerization of Hangul. The Unicode standard reflects the order established by that notice.

Further research revealed that this order had been proposed several times before, and prior to the 1980s, some dictionaries used a different order, such as placing double consonants at the end. The order determined through these various discussions is recorded in the standard KS X 1026-1.
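Because the precomposed syllables are laid out in this order, sorting them by code point already yields dictionary order. The sketch below relies on JavaScript's default sort, which compares UTF-16 code units. It holds only for precomposed syllables, and the sample words are made up for illustration.

```javascript
// Sample syllables in arbitrary order (illustrative data).
const words = ['하', '까', '나', '가'];

// The default sort compares code units, which for precomposed
// Hangul syllables matches the standard order (ㄲ comes right after ㄱ).
const sorted = [...words].sort();

console.log(sorted); // [ '가', '까', '나', '하' ]
```

For mixed or decomposed text, a locale-aware comparison such as Intl.Collator is usually a safer choice than raw code point order.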

Now, let's take a quick look at Hangul in North Korea. The sorting order used in North Korea is slightly different from ours. As shown in the image below, double consonants come at the end, and the consonant 'ㅇ' (ieung) is also placed last. Therefore, when North Korean developers represent Hangul and use it in development, they have to implement a custom sorting order that doesn't align with Unicode.

In 1999, North Korea proposed a revised standard as shown below, but the Unicode Consortium did not accept it. This was because South Korea had already registered the sorting standard for Hangul in 1988.

So, how much of the Unicode space does Hangul occupy? Of the 65,536 code points in the BMP, Hangul accounts for 11,172, roughly one-sixth (about 17%). This led to discussions within the Unicode Consortium about whether to accept such a large number of characters. As a result of these discussions, some characters were omitted.

Because of this, the Unicode 1.0 spec included fewer Hangul characters than we have now. It wasn't until the Unicode 2.0 spec in 1996 that all 11,172 Hangul characters were included.

Combinable Unicode: Hangul Jamo

There's another way to represent Hangul: using a combinable form of Unicode known as Hangul Jamo. This method allows you to display characters by combining initial consonants, medial vowels, and final consonants.

Cheot-ga-kkeut Unicode

This method, known as '첫가끝' (Cheot-ga-kkeut), can create the character '한' (han) from my name by combining 'ㅎ' (h), 'ㅏ' (a), and 'ㄴ' (n). With this combination method, you can represent not only the 11,172 modern Hangul characters but also old Hangul characters as a single character (though to render old Hangul properly, you'd need a font that fully supports it).
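As a minimal sketch of this combination, the snippet below joins the three conjoining jamo for '한' and lets the standard normalize method compose them. The code points U+1112, U+1161, and U+11AB come from the Hangul Jamo block; the constant names are made up for this example.

```javascript
// Conjoining jamo from the Hangul Jamo block (hypothetical constant names).
const INITIAL_HIEUH = '\u1112'; // initial ㅎ
const MEDIAL_A = '\u1161';      // medial ㅏ
const FINAL_NIEUN = '\u11AB';   // final ㄴ

// Three separate code points that render as one syllable.
const jamoSequence = INITIAL_HIEUH + MEDIAL_A + FINAL_NIEUN;
console.log(jamoSequence.length); // 3

// Composing the sequence yields the single precomposed character U+D55C.
const composed = jamoSequence.normalize('NFC');
console.log(composed.length);          // 1
console.log(composed === '\uD55C');    // true
```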

The image below shows an edge case involving composed text. We have a variable t with the value '한재는 발표중!' ("Hanjae is presenting!"). If we use substring to get the first two characters, we'd expect to get '한재'. However, the actual result is '하'.

A JavaScript string is indexed by UTF-16 code units, not by the characters you see on screen. In decomposed form (NFD), each Hangul syllable is stored as a sequence of separate jamo: an initial consonant, a medial vowel, and an optional final consonant. Since substring took the first two units of this sequence, we got '하' (ㅎ + ㅏ).
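The edge case can be reproduced with a short sketch. It assumes the text arrived in decomposed form, which is simulated here by calling normalize('NFD') on the sample string; the variable names other than t are made up for this example.

```javascript
// Simulate text that arrived in decomposed form (NFD).
const t = '한재는 발표중!'.normalize('NFD');

// The string looks like 8 characters but holds 18 UTF-16 code units.
console.log(t.length);                  // 18
console.log(t.normalize('NFC').length); // 8

// substring counts code units, so the first two units are ㅎ and ㅏ.
const firstTwoUnits = t.substring(0, 2);
console.log(firstTwoUnits.normalize('NFC')); // 하

// Composing first gives the expected result.
const firstTwoSyllables = t.normalize('NFC').substring(0, 2);
console.log(firstTwoSyllables); // 한재
```

Normalizing to NFC before slicing is one possible workaround, shown here only as an illustration rather than as the fix recommended in the talk.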

Cheot-ga-kkeut Unicode in the Wild

This type of composed text, Cheot-ga-kkeut Unicode, can often be seen in the wild, for example, in files created on a Mac.

When you open a file with a Hangul name that was created on a Mac on a Windows machine, it looks like this. Read together, the filename says '개발자_매뉴얼' (Developer_Manual). macOS uses this decomposed text format for filenames, but Windows generally does not. That's why each syllable is not recognized as a single character and is displayed as broken-down components.

So, how well do browsers support Cheot-ga-kkeut Unicode? Fortunately, most modern browsers do. However, this can lead to a problem where Hangul that displays correctly in the browser appears broken when downloaded and viewed on a Windows OS.
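One common mitigation, offered here as an assumption rather than something the talk prescribes, is to compose filenames to NFC before offering them for download. The sketch below simulates a Mac-style decomposed name with normalize('NFD'); toNfcFileName is a hypothetical helper name.

```javascript
// Compose a filename so each syllable is a single precomposed character.
// toNfcFileName is a hypothetical helper name.
function toNfcFileName(fileName) {
  return fileName.normalize('NFC');
}

// Simulate a decomposed filename like the one in the example above.
const macStyleName = '개발자_매뉴얼'.normalize('NFD');
console.log(macStyleName.length); // 15 (separate jamo plus the underscore)

const composedName = toNfcFileName(macStyleName);
console.log(composedName.length); // 7
```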

Typing Cheot-ga-kkeut Unicode: The Sebeolsik Keyboard

A key characteristic of Hangul Jamo is that even when an initial and a final consonant look the same, they are distinct characters with different code points.
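The difference is easy to see in code. In this sketch, the initial ㄱ is U+1100 and the final ㄱ is U+11A8, both from the Hangul Jamo block; the constant names are made up for this example.

```javascript
// The same-looking consonant has two different code points.
const INITIAL_KIYEOK = '\u1100'; // ㄱ as an initial consonant
const FINAL_KIYEOK = '\u11A8';   // ㄱ as a final consonant
const MEDIAL_A_JAMO = '\u1161';  // medial ㅏ

console.log(INITIAL_KIYEOK === FINAL_KIYEOK); // false

// initial + medial + final composes into the single syllable 각 (U+AC01).
const withFinal = (INITIAL_KIYEOK + MEDIAL_A_JAMO + FINAL_KIYEOK).normalize('NFC');
console.log(withFinal.length); // 1

// Using the initial form in the final position does not compose:
// the result is 가 followed by a leftover initial ㄱ.
const withWrongJamo = (INITIAL_KIYEOK + MEDIAL_A_JAMO + INITIAL_KIYEOK).normalize('NFC');
console.log(withWrongJamo.length); // 2
```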

Of course, on the keyboards we commonly use, you can't input initial and final consonants separately. However, there is a keyboard that allows for this: the Sebeolsik keyboard. As you can see in the image below, the Sebeolsik keyboard allows you to input initial consonants, medial vowels, and final consonants separately.

When using the common Dubeolsik keyboard, you might often encounter typos like the one below. You intend to write '옷이 없어요' (I have no clothes), but in your haste, it comes out as '옷이 ㅇ벗어요'. The Dubeolsik keyboard doesn't distinguish between initial and final consonants; it determines a character's role based on input timing, which can lead to such errors.

Even though they look the same to us, the Sebeolsik keyboard treats initial and final consonants as different characters. This provides a unique advantage and technique: Sebeolsik moa-chigi (gathering strokes). When typing the character '않' (anh), because the Sebeolsik keyboard knows the difference between initial and final consonants, you can complete the character by typing in the order of 'ㅏ' → 'ㄴ' → 'ㅎ' → 'ㅇ'. In other words, you can input the final consonant or the last sound first and still form the character correctly.

Combination Methods: NFC / NFD

Of course, the Cheot-ga-kkeut system we've discussed so far is not easy to understand, and figuring out how to combine and use the jamo is even more difficult. To address this difficulty, Unicode defines normalization forms, including NFC (Normalization Form Canonical Composition) and NFD (Normalization Form Canonical Decomposition).

The term "canonical" essentially means "the correct form." Unicode introduces these forms to provide a standard rule for correctly composing and decomposing characters.

For example, the character '한' can be canonically decomposed into its three jamo (U+1112, U+1161, and U+11AB), and those can be canonically composed back into the single character '한'. The parenthesized character '㈀' (U+3200) works differently: it has only a compatibility decomposition into '(', 'ㄱ', and ')', so it is split by the compatibility forms (NFKD and NFKC) rather than by NFD, and the three characters are not composed back into '㈀'.

In JavaScript, these normalization forms are available through the String.prototype.normalize method. You can check it out at this link.
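Here is a minimal sketch of normalize in action. It decomposes and recomposes '한', and it also checks how the parenthesized '㈀' (U+3200) behaves under the canonical form NFD versus the compatibility form NFKD. The variable names are made up for this example.

```javascript
const syllable = '\uD55C'; // 한

// NFD splits the syllable into its three conjoining jamo.
const decomposedForm = syllable.normalize('NFD');
const hexCodePoints = [...decomposedForm].map((ch) => ch.codePointAt(0).toString(16));
console.log(hexCodePoints); // [ '1112', '1161', '11ab' ]

// NFC composes them back into the original character.
console.log(decomposedForm.normalize('NFC') === syllable); // true

// U+3200 has a compatibility decomposition only,
// so NFD leaves it alone while NFKD splits it.
const parenthesized = '\u3200';
console.log(parenthesized.normalize('NFD') === parenthesized); // true
console.log(parenthesized.normalize('NFKD').length);           // 3
```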

Hangul Compatibility Jamo

The last thing we'll look at is Hangul Compatibility Jamo. This is the third type of Hangul representation we've discussed.

Hangul Compatibility Jamo are the jamo characters that correspond to the keys on the Dubeolsik keyboards we commonly use. The difference from the Cheot-ga-kkeut notation is that there's no distinction between initial and final consonants. When you type 'ㄱ' on a Dubeolsik keyboard, it corresponds to the character at U+3131 in the Hangul Compatibility Jamo block.
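The sketch below compares the three code points that can all look like 'ㄱ', and shows that compatibility jamo typed one by one are not merged by NFC. The constant names are made up for this example.

```javascript
// Three different characters that can all look like ㄱ.
const COMPAT_KIYEOK = '\u3131';    // Hangul Compatibility Jamo (Dubeolsik key)
const CHOSEONG_KIYEOK = '\u1100';  // Hangul Jamo, initial position
const JONGSEONG_KIYEOK = '\u11A8'; // Hangul Jamo, final position

console.log(COMPAT_KIYEOK === CHOSEONG_KIYEOK);  // false
console.log(COMPAT_KIYEOK === JONGSEONG_KIYEOK); // false

// Compatibility jamo ㅎ, ㅏ, ㄴ carry no initial/final role,
// so NFC leaves them as three separate characters.
const compatSequence = '\u314E\u314F\u3134';
console.log(compatSequence.normalize('NFC').length); // 3
```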

So far, we've learned about Unicode and Hangul within Unicode.

In the next article, we'll build on what we've covered to explore why the replacement character (�) appears and how to solve the problem.