Decoding Character Encoding

#todayilearned #codenewbie #beginners

There was a time when...

...If my grandmother had typed out a message, "哈囉" (meaning "hello" in Chinese) and sent it out from her computer in Taiwan to mine in the US, I wouldn't have been able to read it.

You might think, But, of course! She was writing in Chinese, and your Chinese skills would've been too terrible back then to be able to read it!

(...which I suppose could have been true...)

...except that the message would've been unrecognizable and unreadable by anyone, including her, once it had reached me! Where I should've seen the message she sent me, "哈囉," I might've instead seen a jumble of unrelated symbols that clearly wouldn't have read like Chinese. The same might've happened if a message had been sent the other way around.

And that wouldn't have been my fault! Or my grandmother's!

For that, we could blame character encoding.

Before the standardization of character encoding in electronic communication, computers made by different manufacturers utilized different encoding schemas, which meant that even users communicating in the same language would've been presented with messy strings of nonsense (also known as mojibake) when receiving a message from someone else. By using different encoding schemas, their computers weren't "speaking the same language," and so they read the same data differently.
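To see mojibake in action, here is a minimal Python sketch. It assumes, purely for illustration, that the sender's computer encodes text as UTF-8 and the receiver's computer decodes it as Windows-1252. Those two schemas are stand-ins I picked for the demo, not the ones my grandmother's computer or mine would actually have used back then.

```python
# Illustration only: UTF-8 and Windows-1252 are assumed stand-ins for two
# computers that use different encoding schemas.
message = "哈囉"

# The sender's computer turns the characters into bytes.
sent_bytes = message.encode("utf-8")

# The receiver's computer reads the very same bytes with a different schema.
garbled = sent_bytes.decode("cp1252")

# Undoing the mistake with the matching schemas recovers the original text.
recovered = garbled.encode("cp1252").decode("utf-8")

print(garbled)    # mojibake: nothing like the original message
print(recovered)  # the original message again
```

The bytes never change in transit. Only the table used to interpret them differs, which is why reading them with the right schema brings the message back.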

We take it for granted now, but with present standards for character encoding like Unicode and ASCII, problems like the one I just described are easily prevented. These standards help us type, send, and receive data the way it was intended to be written and read, and they are what make our computing and internet experiences productive and enjoyable.

What is character encoding?

Character encoding is the use of an agreed-upon system to represent text characters, such as letters, numbers, punctuation, and spaces, in another form. One of the most famous and recognizable examples of character encoding is Morse Code, a system in which each letter and number is encoded as a sequence of dots and dashes. We could send an encoded message using those dots and dashes, and the recipient of that message could then translate it back into their own language.
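The Morse Code idea can be sketched in a few lines of Python. This is only a toy: the table below holds just the four letters needed for one word, and the function names `to_morse` and `from_morse` are ones I made up for this example.

```python
# A tiny, partial Morse code table (only the letters needed for this example).
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}

# Reverse table for turning dots and dashes back into letters.
LETTERS = {code: letter for letter, code in MORSE.items()}


def to_morse(text):
    """Encode each letter, separating the codes with spaces."""
    return " ".join(MORSE[letter] for letter in text.upper())


def from_morse(signal):
    """Decode a space-separated Morse signal back into letters."""
    return "".join(LETTERS[code] for code in signal.split(" "))


encoded = to_morse("hello")
decoded = from_morse(encoded)
print(encoded)  # .... . .-.. .-.. ---
print(decoded)  # HELLO
```

Both sides need the same table. If the recipient used a different one, the dots and dashes would arrive intact but decode into the wrong letters.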

In the context of computers, ASCII and Unicode are two encoding schemas that utilize the binary number system. Computers can't read human languages, but they read binary numbers super well, so both schemas assign characters to binary numbers so that computers can read the binary sequence and convert it into a language we can understand.

How does it work?

Figure: The ASCII conversion table, with a code point assigned to each character

Let's assume that our computers both use ASCII's encoding schema. ASCII defines 128 different characters, and each character is assigned a code point. The code point can be written as a unique decimal number between 0 and 127** or as the equivalent binary number. The decimal code point for the character "A" is 65, which is represented as 1000001 in binary.***
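You can check the "A" example yourself in Python. A small caveat: Python's built-in `ord` and `chr` work with Unicode code points, but for the characters covered by ASCII those values are the same. The variable names here are just my own.

```python
letter = "A"

code_point = ord(letter)             # decimal code point of the character
binary = format(code_point, "07b")   # the same number as 7 binary digits

print(code_point)  # 65
print(binary)      # 1000001

# Going the other way: binary digits -> decimal number -> character.
restored = chr(int(binary, 2))
print(restored)    # A
```

The decimal 65 and the binary 1000001 are two ways of writing the same number, so converting between them never changes which character is meant.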

So, if I were to type out the letter "A" on my keyboard, my computer would store it as the binary sequence 1000001, and it would then look up that code point in the ASCII table to display the character "A" on my screen.

If I then sent that "A" to your computer, it wouldn't be sent as the character "A", but as its binary code point. When your computer received the message, it would also use the ASCII table to decode the binary code and display the expected "A" on your screen.
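Here is a rough sketch of that round trip, assuming both computers use ASCII. The `send` and `receive` functions are hypothetical helpers I wrote for illustration. Real networking involves far more than this, but the encode-then-decode idea is the same.

```python
def send(text):
    """Sender: convert each character to its 7-bit ASCII code point."""
    return [format(byte, "07b") for byte in text.encode("ascii")]


def receive(bits):
    """Receiver: convert each binary code point back into a character."""
    return bytes(int(b, 2) for b in bits).decode("ascii")


wire = send("A")
shown = receive(wire)
print(wire)   # ['1000001']
print(shown)  # A
```

What travels between the two computers is only the list of binary code points. The letter reappears on the other side because both ends look it up in the same table.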

If, however, your computer had utilized a different encoding schema where code point 65 was assigned differently (let's just assume it was assigned to "Z"), you would have read "Z" because the code point was converted according to that specific schema rather than the ASCII schema.
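The mismatch can be sketched like this. The second schema is entirely made up for the example: it behaves like ASCII except that code point 65 is assigned to "Z", and the function names are my own.

```python
def decode_ascii(code_point):
    """Look up the character the ASCII table assigns to this code point."""
    return chr(code_point)


def decode_other(code_point):
    """Hypothetical schema: same as ASCII, except 65 is assigned to "Z"."""
    return "Z" if code_point == 65 else chr(code_point)


received = int("1000001", 2)  # the code point that arrives: 65

print(decode_ascii(received))  # A
print(decode_other(received))  # Z
```

The data that arrived is identical in both cases. The only thing that changed is the table used to read it.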

Why is character encoding important?

If the reasons aren't obvious by now, character encoding, especially standardized character encoding, is important because it helps all parties understand each other. It helps us as users to communicate with our computers, and it helps our computers transmit and receive accurate data so that we can also communicate with each other!

Notes About This Series

This is the first blog of my new series on standardized character encoding and Unicode. Over the following month and a half, I'll be explaining how the binary system works, diving into ASCII and Unicode, and discussing why Unicode is especially important to our global community.

Character encoding itself is pretty straightforward, but I believe that as we dive deeper into it together, you'll be very surprised by how multifaceted and fascinating the topic can be! This series deeply involves many of my personal passions: not just technology, but also language, culture, and social issues.

I'm so excited to share these upcoming blogs with you, and I really hope you'll follow along!

Footnotes:

** The ASCII range has 128 code points, numbered 0 to 127. Code point 0 is the null character, and it is one of 33 non-printing control characters, which leaves 95 printable characters.
*** It's worth noting that decimal code points are meant for human readability, but binary code points are for computers.