A Historical Perspective on Character Encodings 🕰️

Hey, developers! 🎉 Let's cut the confusion around character sets and encodings once and for all. Ever wondered about the significance of that mysterious "Content-Type" tag you should include in your HTML? Or why sometimes your emails show up with "?????" instead of meaningful text? Let's talk about why understanding Unicode is not just important but absolutely essential if you want your apps to work globally. 🌍

Source: *The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)* by Joel Spolsky

Alright, let’s talk about the mysterious Content-Type tag you’ve probably seen floating around in HTML headers. 🌐 Think of it as a little “instruction manual” for browsers and servers on how to read and display your content. When you include something like `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` in your HTML, you’re essentially saying, “Hey browser, this page is using UTF-8 encoding, so interpret it that way!” But wait, Content-Type goes even further—it's a full package of info that tells the browser not just the encoding, but also what type of file it’s dealing with (like text/html for web pages, or application/json for JSON data). So, if the browser doesn’t know the content type, it might misinterpret the data, causing weird symbols, broken images, or just a total page meltdown. 💥 Including Content-Type is like setting a dress code for your content—everyone knows how to show up, looking exactly as they should! 😎
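To make that concrete, here’s a minimal sketch in Python (standard library only) of a tiny server that announces both the media type and the character encoding through the Content-Type header. The port number, page text, and handler name are just illustrative choices, not anything prescribed by the article:

```python
# A minimal sketch: a tiny HTTP server that tells the browser both the media
# type ("text/html") and the character encoding ("charset=utf-8").
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<html><body><p>Olá, mundo! 🌍</p></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")  # turn the text into UTF-8 bytes
        self.send_response(200)
        # The header below is the "dress code": media type + character encoding.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

Without that `charset=utf-8`, the browser has to guess how to turn those bytes back into text—and guessing is exactly where the weird symbols come from.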

To understand this story of character sets, the best way is to go back in time! Without going too far back, let's jump straight to when Unix was emerging and the K&R guys invented the C programming language. Back then, things were simple: there was ASCII, a code that used the numbers 32 to 127 for the printable characters – enough for the letters of the English alphabet, the digits, and some punctuation – while the codes below 32 were reserved for control characters.

In the ASCII table, for example, space was 32, "A" was 65, and so on. And since ASCII only needed 7 bits while computers used 8-bit bytes, there was one bit left over for people to use as they wished. With that, creative (or devious, depending on your point of view) uses for this extra bit emerged. A famous example is WordStar, which used the extra bit to mark the last letter of each word – a kind of “shortcut” at the time.
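A quick sketch of those numbers and the “spare” eighth bit, in Python:

```python
# ASCII codes for a few characters: space=32, 'A'=65, and so on.
for ch in (" ", "A", "a", "~"):
    print(f"{ch!r} -> {ord(ch)} (0x{ord(ch):02X})")

# Every ASCII code fits in 7 bits, so the top bit of a byte is free.
code = ord("A")        # 0b01000001
marked = code | 0x80   # 0b11000001 – the kind of trick WordStar played with the high bit
print(bin(code), bin(marked))
```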

ASCII was great… if you only spoke English! 😅

What if the World Wants More? 🌎
Since a byte has 8 bits, people soon realized that the codes from 128 to 255 could be used for more things. But then a mess started: each group wanted this “extra” space for its own characters. Then came the famous OEM character set, popular on the IBM PC, with special characters such as accented letters and line-drawing symbols, which were all the rage at the time for drawing tables and windows on the screen.

However, each country or region began to create its own “OEM” character set. In the US, a given code might be an accented letter, but in Israel the same number represented a Hebrew letter, and in Greece, a Greek one. Sending documents from one place to another became total chaos. 📄 ➔ 🌀

The ANSI Attempt – A Pseudo-Solution 💼

To provide some organization, the ANSI standard emerged: everyone agreed to keep the first 128 characters identical to ASCII and to use the rest according to the region, in what became known as "code pages". For example, code page 862 was for Hebrew, while 737 was for Greek. Even so, mixing languages such as Hebrew and Greek in the same text was still impossible, unless you wrote a custom program just for that. 😩
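Here’s a tiny sketch of why that was painful, using Python, which still ships codecs for those old DOS code pages: the very same byte decodes to a completely different character depending on which code page you assume.

```python
# One byte, three meanings – the code-page problem in a nutshell.
raw = bytes([0x80])

print(raw.decode("cp437"))  # 'Ç' – US/Western OEM code page
print(raw.decode("cp862"))  # 'א' – Hebrew (DOS) code page
print(raw.decode("cp737"))  # 'Α' – Greek (DOS) code page
```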

Meanwhile, in Asia, with languages that have thousands of characters, things were even more complicated. DBCS (Double-Byte Character Set) emerged, a system in which some characters occupied one byte and others two. This was so confusing that even today many programmers cringe just remembering DBCS. 🫣
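A small sketch of the DBCS idea, using Shift_JIS (a classic double-byte encoding for Japanese) through Python's built-in codec – ASCII letters stay at one byte each while kanji take two:

```python
# Character count and byte count stop matching under a double-byte encoding.
text = "ab日本"
encoded = text.encode("shift_jis")

print(len(text), "characters ->", len(encoded), "bytes")  # 4 characters -> 6 bytes
for ch in text:
    print(ch, "->", len(ch.encode("shift_jis")), "byte(s)")
```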

The Internet Arrives and the Mess Goes Global 🌐

All this worked… until the day people started exchanging files between countries over the internet. Suddenly, the confusion of languages became global, and local solutions such as ANSI code pages were no longer enough. That's when Unicode came on the scene, with the bold proposal of representing all the languages and symbols of the world in a single system.

Unicode: The Revolution

Unicode was a global effort to create a single character set encompassing every writing system on the planet – people even proposed made-up ones like Klingon! Each character in Unicode is represented by a magic number called a code point, written like this: U+0041 for the letter A, for example. This system of unique codes treats characters as "ideal", abstract entities, independent of how they will actually be stored on the computer.

Unicode's code space is enormous (a bit over a million possible code points, from U+0000 to U+10FFFF), so there was room to include not only traditional characters, but also emojis, symbols and even ancient alphabets.
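Code points really are just numbers – a quick sketch with Python's `ord()` and `chr()`:

```python
# Map a few characters to their Unicode code points and back.
for ch in ("A", "é", "日", "🎉"):
    print(f"{ch} -> U+{ord(ch):04X}")
# A -> U+0041, é -> U+00E9, 日 -> U+65E5, 🎉 -> U+1F389

print(chr(0x0041))          # from the code point back to the character: 'A'
print("\N{PARTY POPPER}")   # characters can also be spelled by name: 🎉
```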

Unicode Encodings: How to Represent in Memory 💾

Now that Unicode had a “universal code” for each letter and symbol, it was necessary to decide how to store all of this in the computer’s memory. And that’s when Unicode encodings came into being.

UCS-2 and UTF-16
Initially, Unicode was stored using two main encodings, UCS-2 and later UTF-16, which kept each character in two bytes (16 bits). But people soon realized that not every machine “thought” the same way, and there was confusion about which byte should come first (big-endian or little-endian). To solve this, a Byte Order Mark (BOM) was placed at the beginning of each Unicode string to indicate the correct order. This solved the “order” issue, but not the space issue – after all, for English texts, two bytes per character were a waste. 😳
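A short sketch of the byte-order problem and the BOM, using Python's `codecs` module:

```python
# The same character, two byte orders, and the marker that tells them apart.
import codecs

print("A".encode("utf-16-be"))  # b'\x00A' – big-endian: high byte first
print("A".encode("utf-16-le"))  # b'A\x00' – little-endian: low byte first

print(codecs.BOM_UTF16_BE)      # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)      # b'\xff\xfe'

# The plain "utf-16" codec writes a BOM so a reader can tell which order was used.
data = "A".encode("utf-16")
print(data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```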

The Magic Solution: UTF-8 🧙‍♂️

That's when UTF-8 came along, a brilliant scheme that stores the common English characters in a single byte each and only uses more bytes when it needs to represent more complex characters. This means that, for plain English text, UTF-8 takes up exactly the same space as ASCII – which is why many people switched to Unicode without even noticing.

In UTF-8, the characters from 0 to 127 look exactly like ASCII, and it's only when you use accented letters, Greek letters, Japanese characters, and so on that UTF-8 expands to two, three, or four bytes.
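A quick sketch of how UTF-8 grows only when it has to:

```python
# Byte counts per character under UTF-8: plain ASCII stays at one byte.
for ch in ("A", "é", "日", "🎉"):
    encoded = ch.encode("utf-8")
    print(f"{ch} -> {len(encoded)} byte(s): {encoded!r}")
# A -> 1 byte(s): b'A'
# é -> 2 byte(s): b'\xc3\xa9'
# 日 -> 3 byte(s): b'\xe6\x97\xa5'
# 🎉 -> 4 byte(s): b'\xf0\x9f\x8e\x89'

print(len("hello".encode("utf-8")))  # 5 – English text costs the same as ASCII
```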

Conclusion: The Connected World Needs Unicode 🧩

Today, Unicode and UTF-8 are the dominant standards on the web, because they let text be read and written consistently anywhere in the world without compatibility issues. With them we can write in any language, exchange emojis, and still be sure our text will be read correctly, even when it has to coexist with legacy ASCII or ANSI data. Unicode unified all of this, and UTF-8 made the transition feel smooth.

Now, when you see those strange marks in place of letters (�), you know it probably means the bytes were decoded with the wrong encoding – that � is Unicode's own "replacement character", U+FFFD, which decoders emit when they hit bytes they can't make sense of. Unicode and UTF-8 have cured a lot of headaches, but we still depend on well-implemented software for everything to work 100%.
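If you want to see it happen on purpose, here's a tiny Python sketch that manufactures both kinds of garbage – mojibake from the wrong decoder, and the � replacement character from a broken byte sequence:

```python
# Bytes meet the wrong (or incomplete) decoder.
data = "café".encode("utf-8")  # b'caf\xc3\xa9'

print(data.decode("cp1252"))                         # 'cafÃ©' – classic mojibake
print(data[:-1].decode("utf-8", errors="replace"))   # 'caf�' – truncated UTF-8 sequence
```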

And so, the world of character encodings went from being a chaotic puzzle to something that, much of the time, "just works." 🎉
