Dev 101: Unicode

Lauren ・4 min read

What is it?

Unicode is a standard for character sets and their encodings, maintained by the Unicode Consortium. It is the most widely used such standard, it makes internationalization easier, and it is the reason we can have emoji 😍

Why do we need it?

If you gather, display, transmit, or otherwise use strings on the internet, in desktop software, or in mobile apps, then you need to encode them. The Unicode Consortium maintains the most popular encoding, UTF-8 [1]. Understanding it is key to being a competent developer [2]. (It was also named one of the 12 Things Every Junior Developer Should Learn [3].) Failing to understand and properly use it can result in those empty boxes, odd-looking question marks, and general frustration for you and whoever wants to use whatever it is you've coded up.
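Here is a minimal Python sketch of how those garbled characters can come about. The sample string and variable names are made up for illustration; the point is only that bytes written with one encoding and read with another stop making sense.

```python
# Hypothetical sample text; any string with a non-ASCII character would do.
text = "café"

# Encode to bytes with UTF-8, as most of the web does.
data = text.encode("utf-8")

# Decoding with the matching codec round-trips cleanly.
correct = data.decode("utf-8")

# Decoding with the wrong codec silently produces garbled characters.
garbled = data.decode("latin-1")

# Decoding as ASCII with errors="replace" swaps each unknown byte for U+FFFD,
# the "odd-looking question mark" replacement character.
replaced = data.decode("ascii", errors="replace")

print(correct)   # café
print(garbled)   # cafÃ©
print(replaced)  # caf followed by two U+FFFD replacement characters
```

The fix is nearly always the same: find out which encoding the bytes were written in, and decode with that one.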

Where did it come from?

Let's back up and talk about where letters come from. Spoken language is a way we humans have figured out how to encode meaning in sound. If I say "Hello" you understand that I'm greeting you. Written language encodes those sounds, so when I write "World", you understand I'm talking about our planet and all of humanity. That is, if you understand the spoken and written language I'm using.

If you don't understand the language I'm using, then my sounds or arrangements of symbols mean nothing. Computers don't understand human languages; they operate using binary. So if we want to use a computer to gather, display, transmit, or otherwise understand a human language, we have to give it instructions in binary.

Encodings that map binary values to characters make this possible. Different types of computers used to have different encodings, and this worked just fine until we wanted to share information between computers. To do that, both computers need to share an encoding standard, so that when a sequence of bits is received the correct letters are displayed or saved. In 1963 the American Standard Code for Information Interchange (ASCII) became that standard.
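To make the mapping between characters and binary concrete, here is a small Python sketch. The sample word is arbitrary; `ord`, `format`, and `str.encode` are standard Python built-ins.

```python
# Map each character of a sample word to its ASCII code and 8-bit pattern.
word = "Hello"

codes = [ord(ch) for ch in word]                # character -> number
bits = [format(code, "08b") for code in codes]  # number -> binary string

for ch, code, pattern in zip(word, codes, bits):
    print(ch, code, pattern)
# H 72 01001000
# e 101 01100101
# l 108 01101100
# l 108 01101100
# o 111 01101111

# Encoding the string as ASCII yields bytes with exactly those numbers.
encoded = word.encode("ascii")
print(list(encoded))  # [72, 101, 108, 108, 111]
```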

ASCII has uppercase and lowercase letters, digits, punctuation, symbols, and control codes. It lacked accented characters like è, non-American symbols like £, and any non-Latin characters, but it got the job done. ASCII defined 128 characters, which need only 7 bits each; stored in 8-bit bytes, that left another 128 values free. This led to a hybrid of standard and pre-standard encodings: the first 128 values followed the ASCII standard, while the last 128 were used by different groups to encode different symbols and letters.
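As a rough illustration of how the same byte above 127 could mean different things, the sketch below decodes one byte with three legacy code pages that ship with Python. The choice of byte and code pages is mine, picked only as an example.

```python
# One byte with a value above 127, so it falls outside ASCII.
raw = bytes([0xE8])

# The same byte under three different legacy 8-bit encodings.
as_western = raw.decode("latin-1")   # Western European
as_greek = raw.decode("cp1253")      # Windows Greek
as_cyrillic = raw.decode("cp1251")   # Windows Cyrillic

print(as_western, as_greek, as_cyrillic)  # è θ и

# Strict ASCII has no meaning for this byte at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)  # False
```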

Clashes between the ways different machines used those last 128 values happened, but it was good enough, and ASCII was the standard for three decades. The need to support many more languages and their characters led to the publication of Unicode in 1991. Unicode has evolved a bit over the years, from a fixed 16 bits per character to today's most common encoding, UTF-8, which is a variable-length encoding.

In UTF-8 the first 128 characters are encoded exactly as they are in ASCII. This means that the most commonly used characters on the internet still use just one byte per character. Beyond those, characters aren't limited to one byte; they can take two, three, or four! Having the characters supported by ASCII take up the least space in UTF-8 makes sense given that most of the internet uses UTF-8 (93% of websites), and it still gives us access to all the other characters that Unicode supports.
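A quick way to see the variable-length behavior is to count the bytes UTF-8 needs for a few characters. This is just a sketch with sample characters I picked, one for each length.

```python
# Sample characters chosen to need 1, 2, 3, and 4 bytes in UTF-8.
samples = ["A", "è", "€", "😍"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(ch, n)
# A 1
# è 2
# € 3
# 😍 4

# Plain ASCII text produces identical bytes in both encodings.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)  # True
```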

So what doesn't Unicode do well? Most Unicode issues stem from Unicode providing single characters which are then displayed in different ways depending on the font. Characters that are different but look similar make homograph attacks possible. Chinese, Japanese, and Korean all share a character set in Unicode, relying on different fonts to differentiate between the way each language displays those characters. In languages like Arabic and Vietnamese, single characters are connected with ligatures to make glyphs, meaning that any given character might look different depending on what characters it's connected to (think cursive writing in English). For these languages Unicode and fonts aren't enough, and secondary processing needs to be done to display them correctly.
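To show what makes a homograph attack possible, here is a small Python sketch comparing a Latin letter with a Cyrillic look-alike. The domain names are hypothetical and exist only to illustrate the comparison; `unicodedata.name` is part of Python's standard library.

```python
import unicodedata

latin_a = "a"           # U+0061
cyrillic_a = "\u0430"   # U+0430, drawn almost identically in many fonts

# They look alike on screen but are different characters.
same_character = latin_a == cyrillic_a
latin_name = unicodedata.name(latin_a)
cyrillic_name = unicodedata.name(cyrillic_a)

# Hypothetical domain names, used only to illustrate the comparison.
real_domain = "example.com"
spoof_domain = "ex" + cyrillic_a + "mple.com"

print(same_character)               # False
print(latin_name)                   # LATIN SMALL LETTER A
print(cyrillic_name)                # CYRILLIC SMALL LETTER A
print(real_domain == spoof_domain)  # False
```

To a person the two domains can look the same; to a computer they are entirely different strings.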

And last, but not least, the new international language, emoji, is subject to this font issue too. Emoji are a bit more consistent than they used to be, but differences still exist between platforms. The most notable difference is the dizzy face, which reads more like death on some platforms.

[Image: seven yellow emoji faces in a row. They are images of the same Unicode character, 😵, but they look different. The first has x eyes, the next three swirl eyes, the next x eyes, then swirl eyes, and finally x eyes again.]

Summary

  • You should care about character encoding because it is how we store and display language on computers.
  • Unicode, and specifically UTF-8, is the most widely used character encoding on the internet.
  • It uses variable-length encoding to give us fast loading for what we use most, while also providing us with more than 137,000 other characters.
  • Unicode, like ASCII before it, is character encoding only, relying on fonts for language differentiation and emoji style.
  • Emoji are 🎉 🙌🏼 💻 🔥 😍

Discuss
What's your fav emoji? What are your top used emoji? What do they mean to you/what are you trying to say with them when you use them?

Posted by:

Lauren

@lgranger

📱 vim efficiency, emoji style ❤️ android, ios, react native 💻

Discussion


Emoji are kind of an accident. There were some Japanese character sets, and boards, that used a bunch of them. Unicode wanted to ensure it could actually encode all those existing sets, and thus added those funny symbols. Of course, it didn't take long before use of them exploded and they gained their own life.

They're not even part of the basic multilingual plane, in Unicode, since they weren't deemed essential enough. This is still a problem today, since languages like Java don't handle them correctly.

 

Thanks for the comment Edaqa!

I'd like to do a follow up (Dev 102?) about some of the specific ways in which Unicode isn't handled well by various programming languages. In my research for the above I read a bit about languages (PHP I believe was one) which assume all Unicode characters are 16 bit, and they calculate string length based on this assumption. I'm sure there are others!

From a psychology and linguistics point of view I find the history and adoption of emoji fascinating, and I'd love to know more about how people interpret the different emoji meanings, and how that impacts communication.

(PS I checked out your cooking blog and it looks amazing!! Gonna give the soy seitan a try 😋)

 

Let me know if you will write that article, as I have one I could update. "The string type is broken" goes into some of the common problems that exist with Unicode in languages. I've been curious if it still applies, and it also needs a good editing. :)

 

I don't know how to pick a favorite, but I'm quite fond of these: ☕️ 🤦🏻‍♂️ 😂

My top used emoji are different on my work MacBook than my iPhone.

MacBook:
🙏🏼 - Thank you!
🤦🏻‍♂️ - Doh! My mistake!
🚢 - Ship it!!
iPhone:
❤️ - Love you
😭 - I read the news article you sent me and now I'm distraught over the state of the world
👍 - Good job!