FEConf

Posted on Jun 29

Wait, Why Is This Character Here? (2)

#frontend #webdev

This article is a summary of the FEConf2024 presentation titled <Wait, Why Is This Character Here? (Subtitle: A Complete, Sometimes-Useful Guide to Hangul Unicode)>. The content is divided into two parts. Part 1 covered Unicode and Hangul's representation within it. Part 2 explores why the replacement character (�) appears and how to resolve the issue. All images in this article are from the presentation slides of the same name and are not separately cited.

'Wait, Why Is This Character Here? (Subtitle: A Complete, Sometimes-Useful Guide to Hangul Unicode)'
Jaehan Jae, CTO at Denier

In the last article, we discussed Unicode and how Hangul is represented within it. In this article, we'll dive into why the strange replacement character (�) appears and introduce methods for solving this problem.

Problem Analysis - Why Does � Appear?

Now, let's get to the bottom of why this mysterious character appears.

Is Windows the Problem?: 'euc-kr'

First, let's start with the assumption that the problem is caused by the way Windows encodes text. Korean Windows has traditionally stored text in an encoding called euc-kr rather than UTF-8. The character set behind it is standardized as KS X 1001 (formerly KS C 5601-1987), which defines an arrangement of Hangul and Hanja characters. The 1987 specification included only 2,350 Hangul characters. This set was compiled from the most commonly used syllables, but out of the 11,172 possible syllables it covers only about 21%. With so many characters missing, various problems arise when representing Hangul.
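As a quick check of the figures above: the 11,172 total comes from combining 19 initial consonants, 21 vowels, and 28 final-consonant slots (27 finals plus "no final"), the composition covered in Part 1. The constant names in this sketch are only illustrative.

```javascript
// Modern Hangul syllables = initials × vowels × finals (including "no final")
const INITIALS = 19;
const VOWELS = 21;
const FINALS = 28;

const totalSyllables = INITIALS * VOWELS * FINALS;
const ksx1001Syllables = 2350; // Hangul syllables in the 1987 standard
const coverage = ksx1001Syllables / totalSyllables;

console.log(totalSyllables); // 11172
console.log((coverage * 100).toFixed(1) + "%"); // 21.0%
```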

Encoding vs. Character Set

Before we dive into the various issues, let's briefly touch on the difference between an encoding and a character set. A character set is a table that maps each character to a number. Unicode, for example, is a standard that defines which character corresponds to which number (its code point). An encoding, on the other hand, is the method of representing those numbers as bytes when storing them in memory or transmitting them over a network.

Sometimes, encoding and character sets are explained without distinction, as many character sets also function as encodings themselves. Since euc-kr both maps Hangul characters to numbers and provides a method for storing those numbers directly, it functions as both a character set and an encoding.

UTF-8 is a prime example of an encoding, and Unicode is a prime example of a character set.
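The distinction can be seen with a single character. The following is a minimal sketch using the standard `codePointAt`, `charCodeAt`, and `TextEncoder` APIs; the sample character '한' and the variable names are chosen here purely for illustration.

```javascript
const ch = "한";

// Character set: Unicode assigns this character the number U+D55C
const codePoint = ch.codePointAt(0);
console.log("U+" + codePoint.toString(16).toUpperCase()); // U+D55C

// Encoding: UTF-8 stores that number as three bytes
const utf8Bytes = Array.from(new TextEncoder().encode(ch), (b) =>
  b.toString(16).padStart(2, "0")
);
console.log(utf8Bytes.join(" ")); // ed 95 9c

// Encoding: UTF-16 stores the same number as a single 16-bit unit
const utf16Unit = ch.charCodeAt(0).toString(16);
console.log(utf16Unit); // d55c
```

The number (the code point) stays the same in both cases; only the bytes used to store it differ.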

Issues with Hangul Code Standardization: KSC-5601 / euc-kr

The 2,350 Hangul characters mentioned earlier were the first to be standardized, and most Hangul text on computer networks at the time was represented based on this standard. Soon, this led to various problems.

The story of a person named '서설믜' (Seo Seol-mui) is a classic example. Because the character '믜' (mui) was not included in the euc-kr range, this person reportedly had trouble not only in online environments but also when opening a bank account or writing their name on their college entrance exam application.

Another interesting case is the 'Hangul 815' editor. euc-kr includes the character '쓩' (ssyung) but not '쓔' (ssyu). Because of this, other editors at the time had a bug where typing 'ㅆ' (ss) and 'ㅠ' (yu) consecutively would cause the input to freeze, as there was no way to represent the intermediate syllable '쓔', which in turn made '쓩' impossible to type. The Hangul 815 editor even advertised that it could represent the character '쓩'.

cp949

To solve these problems with euc-kr, a character set called cp949 was introduced. It's a character set that includes the Hangul characters that were excluded from euc-kr's original 2,350. However, cp949 is not the official name. Although the HTML standard specifies using euc-kr, most machines actually process it as cp949.

cp949 works by keeping the positions of the original 2,350 characters and inserting the remaining characters in between. As a result, the numeric order of the codes no longer matches Korean alphabetical (ganada) order. In the image below, the yellow section at the beginning represents symbols, numbers, and Latin letters. The part after that, KS X 1001, contains the initial 2,350 characters. The remaining characters are inserted into the purple area. Sorting cp949 strings alphabetically therefore requires a separate step to achieve the correct order.

This is why when an euc-kr document's encoding is broken or mishandled, it results in question marks or strange, unreadable Hangul characters.

euc-kr in HTML

Pages built with euc-kr are usually processed as cp949. However, the actual meta tag is saved with euc-kr as shown below. This might raise another question. If you read the textContent of a DOM on an euc-kr page, is it actually processed as euc-kr?

Before answering that, let's first look at UTF-8.

UTF-8

UTF-8 was devised in 1992. Considering that Unicode was introduced in '91 and the Hangul standard in '87, UTF-8 is a relatively new technology. UTF-8 is a variable-length encoding: the ASCII range takes 1 byte, and the rest of the BMP takes 2 or 3 bytes. Hangul syllables therefore take 3 bytes each in UTF-8. Characters beyond the BMP, such as most emojis, take 4 bytes.

*Note: The original UTF-8 design could represent sequences of up to 5 or 6 bytes, but a 2003 revision (RFC 3629) declared that encodings longer than 4 bytes would not be used.
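The byte counts described above can be checked with `TextEncoder`, which always encodes to UTF-8. This is a small sketch; `utf8ByteLength` is a helper name made up for this example.

```javascript
const encoder = new TextEncoder(); // always produces UTF-8

// Returns how many bytes a string occupies once encoded as UTF-8
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

console.log(utf8ByteLength("A")); // 1 (ASCII)
console.log(utf8ByteLength("\u00E9")); // 2 (é, U+00E9)
console.log(utf8ByteLength("한")); // 3 (Hangul syllable in the BMP)
console.log(utf8ByteLength("\u{1F986}")); // 4 (duck emoji, beyond the BMP)
```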

Strings in JavaScript

So how does JavaScript represent strings? JavaScript processes strings as UCS-2 or UTF-16. UCS-2 represents every character with a fixed 2 bytes, so it can only represent characters within the Basic Multilingual Plane (BMP). To represent a wider range, UTF-16 is used. UTF-16 keeps the UCS-2 format for BMP characters and represents characters beyond the BMP, like emojis, with a pair of 2-byte units (a surrogate pair), 4 bytes in total.

With this understanding of UTF-8 and how JavaScript handles strings, let's return to our earlier question.

What happens if you read an euc-kr document and store its text in a JavaScript variable? The text is converted and stored as UTF-16. In other words, browsers read documents in many different encodings, but whenever the text is handled by JavaScript, it is converted to and stored as UTF-16.

If so many web documents are encoded in UTF-8, why was JavaScript designed to use UTF-16? UTF-8 represents Latin letters, symbols, and the rest of ASCII in a single byte each, while Hangul and many other characters require 2 or 3 bytes. For text that is mostly ASCII, this makes UTF-8 a compact and efficient encoding for data transmission.

Despite this advantage, a variable-length encoding makes it expensive to calculate a string's length or to jump to the character at a given index, because the bytes have to be scanned from the start. To process strings of any length quickly, JavaScript represents them internally as fixed-size 16-bit units (UTF-16 or UCS-2).

So how does JavaScript actually report length? The length property counts UTF-16 code units, not characters. In the first example in the image below, the special character made of three horizontal bars is represented by two UTF-16 code units, so its length is 2. A duck emoji, U+1F986 (beyond the BMP), also has a length of 2. Finally, a family emoji, which consists of four person emojis joined by three zero-width joiner characters (U+200D), has a length of 11.
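The duck and family lengths can be reproduced as follows. The emojis are written as escape sequences so that the code points are visible; the family sequence used here (man, woman, girl, boy) is an assumption about which family emoji appeared on the slide.

```javascript
const duck = "\u{1F986}"; // 🦆, one code point beyond the BMP

// Family emoji: man, woman, girl, boy joined by three zero-width joiners (U+200D)
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// length counts UTF-16 code units
console.log(duck.length); // 2
console.log(family.length); // 11 = 4 emojis × 2 units + 3 joiners × 1 unit

// Spreading a string iterates by code point rather than by code unit
console.log([...duck].length); // 1
console.log([...family].length); // 7 = 4 emojis + 3 joiners
```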

When explaining UCS-2 earlier, I mentioned that JavaScript processes all strings as 2-byte units. That's why charAt, which retrieves a character at a specific index in a string, operates on 16-bit (2-byte) code units.

Applying charAt(0) to a duck emoji returns the high-surrogate character \uD83E. On the other hand, using codePointAt(0) to get the Unicode code point returns the hexadecimal value 1F986.
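A short sketch of the difference, which also shows how the two surrogate halves combine into the code point. The arithmetic at the end is the standard UTF-16 surrogate formula; the variable names are illustrative.

```javascript
const duck = "\u{1F986}"; // 🦆

// charAt and charCodeAt work on 16-bit code units
const high = duck.charCodeAt(0);
const low = duck.charCodeAt(1);
console.log(duck.charAt(0) === "\uD83E"); // true (high surrogate only)
console.log(high.toString(16), low.toString(16)); // d83e dd86

// codePointAt reads the whole surrogate pair
const codePoint = duck.codePointAt(0);
console.log(codePoint.toString(16)); // 1f986

// Combining the surrogate pair by hand gives the same value
const combined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
console.log(combined === codePoint); // true
```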

So, What Is �?

So far, we've looked at euc-kr, UTF-8, UTF-16, and string representation in JavaScript. As it turns out, this problem isn't caused only by Windows using euc-kr. The same issue occurs on macOS as well.

U+FFFD

We need to take another look at the replacement character (�) itself. Looking up this character in the Unicode standard, I found that it's assigned to the code point U+FFFD, located near the very end of the BMP. The Unicode standard describes this character as a substitute for a character that is unrecognizable or unrepresentable. Looking closer at Unicode spec 3.0, it recommends three ways to handle a byte sequence that cannot be interpreted as a valid UTF character.

The recommended actions are: return an error, delete the invalid sequence, or insert a U+FFFD marker to indicate that a character was malformed.
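Two of these behaviors are exposed directly by the standard `TextDecoder` API: with `fatal: true` it reports an error, and by default it inserts U+FFFD. The sketch below assumes a runtime that supports the `fatal` option (browsers, and Node.js builds with ICU); `decodeStrict` and `decodeLenient` are names made up for this example.

```javascript
// "한" is ED 95 9C in UTF-8; dropping the last byte leaves an invalid sequence
const truncated = new Uint8Array([0xed, 0x95]);

// Option: return an error when the bytes are not valid UTF-8
function decodeStrict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Option: insert U+FFFD in place of the malformed bytes (the default)
function decodeLenient(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

let strictError = null;
try {
  decodeStrict(truncated);
} catch (err) {
  strictError = err;
}
console.log(strictError !== null); // true: the strict decoder refused the input

const lenient = decodeLenient(truncated);
console.log(lenient.includes("\uFFFD")); // true: the broken bytes became U+FFFD
```

The default is the lenient one, which is why a broken byte sequence usually surfaces as � rather than as an exception.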

Finally, we know the cause of the replacement character: a character was broken during UTF-8 processing. Based on this hint, I checked the code I had written.

Solving the Problem

I checked whether the problem was in the editor, but the HTTP response was fine. I also confirmed the issue wasn't the database itself, although I did find that U+FFFD characters were being stored in it. Finally, I looked at the backend code to see if there was a problem. Soon, I found code that was processing a UTF-8 data stream by splitting it into 1,000-byte chunks.

When a 3-byte UTF-8 Hangul character straddled a chunk boundary in this code, it was split in the middle. Then, when toString was executed on each chunk, the incomplete bytes on either side of the boundary were decoded as U+FFFD.
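The original backend code is not shown in the talk summary, so the following is only a reproduction sketch. It assumes a Node.js backend where each chunk is a `Buffer`, and the sample text and `CHUNK_SIZE` constant are made up for illustration.

```javascript
// Each "가" is 3 bytes in UTF-8 (EA B0 80), so this text is 1,200 bytes
const text = "가".repeat(400);
const bytes = Buffer.from(text, "utf8");

// 1,000 is not a multiple of 3, so the cut lands inside a character
const CHUNK_SIZE = 1000;
const first = bytes.subarray(0, CHUNK_SIZE).toString("utf8");
const second = bytes.subarray(CHUNK_SIZE).toString("utf8");

console.log(first.endsWith("\uFFFD")); // true: a lone lead byte at the end
console.log(second.startsWith("\uFFFD")); // true: orphaned continuation bytes
console.log(first + second === text); // false: the split character is lost
```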

The manager who reported the problem mentioned that the characters were only breaking in Hangul text. This fits the diagnosis, since single-byte ASCII characters can never be split across chunks, and it gave me confidence that this was indeed the problem.

After identifying this issue, I modified the backend logic to prevent multi-byte characters from being split during the chunking process, and the problem was resolved.
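The actual fix is not shown in the talk summary. One possible approach, sketched here under that caveat, is to move each cut back until it no longer lands on a UTF-8 continuation byte (a byte of the form 10xxxxxx). `splitUtf8` is a hypothetical helper, not the author's code.

```javascript
// Hypothetical helper: split UTF-8 bytes into chunks of at most `size` bytes
// without cutting a multi-byte character in half.
function splitUtf8(bytes, size) {
  const chunks = [];
  let start = 0;
  while (start < bytes.length) {
    let end = Math.min(start + size, bytes.length);
    // If the byte right after the cut is a continuation byte (10xxxxxx),
    // the cut is inside a character, so back up to that character's lead byte.
    while (end > start && end < bytes.length && (bytes[end] & 0xc0) === 0x80) {
      end--;
    }
    // Guard against sizes smaller than one character: fall back to the raw cut
    if (end === start) {
      end = Math.min(start + size, bytes.length);
    }
    chunks.push(bytes.subarray(start, end));
    start = end;
  }
  return chunks;
}

const original = "가".repeat(400); // 1,200 bytes in UTF-8
const chunks = splitUtf8(new TextEncoder().encode(original), 1000);
const decoded = chunks.map((chunk) => new TextDecoder().decode(chunk));

console.log(chunks.map((chunk) => chunk.length)); // [ 999, 201 ]
console.log(decoded.join("") === original); // true
```

If the chunks only need to be decoded rather than stored separately, a streaming decoder (`TextDecoder` with `stream: true`, or Node's `string_decoder` module) is another option, since it buffers incomplete bytes until the next chunk arrives.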

In Conclusion

To understand why the replacement character (�) appeared, we've explored Unicode, examined the structure of Hangul characters, and looked into euc-kr and UTF. Through this process, we traced the issue to the U+FFFD specification in the Unicode standard and were able to resolve the problem.

I hope this article helps you remember how Hangul is stored and how it's represented in the browser. I also hope that by following my problem-solving process and checking for issues in your own backend encoding logic, you'll be able to solve similar problems more easily.
