Unicode

#computerscience #software #encoding #i18n

Unicode is an international character encoding standard. It assigns a unique number (a code point) to every character, regardless of platform, program, or language. It is also the most widely used character encoding standard today.

ASCII

ASCII (American Standard Code for Information Interchange) is one of the first widely used character encoding standards. People from the telecommunication and computing industries in America created it during the 1960s. As a 7-bit code, it supports 128 (i.e. 2⁷) characters: 95 printable characters (counting the space) and 33 control characters (counting DEL). That was sufficient to encode digits, some special characters, and the letters of the English alphabet.
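As a small illustration of the 7-bit limit, the sketch below encodes a string to ASCII and checks that every byte value stays below 2⁷. The helper name `is_ascii_encodable` is made up for this example; it only wraps Python's built-in `str.encode`.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


encoded = "Hello, 123!".encode("ascii")

# Each ASCII character becomes one byte whose value is below 128 (2**7).
print(list(encoded))
print(all(b < 2 ** 7 for b in encoded))

# Anything outside English letters, digits and basic punctuation does not fit.
print(is_ascii_encodable("Hello"))
print(is_ascii_encodable("naïve"))
```

The last call fails the check because "ï" has no ASCII code, which is the kind of gap the later encodings tried to fill.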

However, the spread of computing and the Internet created a need for other characters as well. As computers used 8-bit bytes, some manufacturers decided to use the remaining 8th bit and thus expand the number of characters to 256. These 8-bit encodings are often referred to as “Extended ASCII” or “8-bit ASCII”. As the number of different, mutually incompatible 8-bit encodings grew, data exchange became complicated and error-prone. That was a sign that it was necessary to find a universal solution that would work for all languages and cover all the special characters.
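To see why this made data exchange error-prone, here is a minimal sketch that decodes the same single byte with a few 8-bit encodings that ship with Python. The choice of byte (0xE9) and of these four code pages is just an assumption for the example; any byte above 127 would show a similar effect.

```python
# One byte with the 8th bit set; its meaning depends on the code page.
raw = bytes([0xE9])

# A few 8-bit encodings from Python's standard codec registry.
code_pages = ["latin-1", "cp437", "iso8859-7", "koi8-r"]

decoded = {name: raw.decode(name) for name in code_pages}

for name, ch in decoded.items():
    print(f"{name:10} -> {ch!r}")
```

The receiver has to know which code page the sender used. If it guesses wrong, the text still decodes without an error, but the characters are different.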

Unicode

Unicode provides a unique code point for every character, in every language, in every program, on every platform. It enables a single document to contain text from different writing systems, which was nearly impossible with the earlier 8-bit encodings. Moreover, Unicode supports emojis, which are an indispensable part of communication today.
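A short sketch of what "a unique code for every character" looks like in practice: the snippet below prints the code point and the official Unicode name of four characters from different writing systems, using Python's built-in `ord` and the standard `unicodedata` module. The sample string is arbitrary.

```python
import unicodedata

# Latin, Cyrillic, CJK and an emoji in a single string.
text = "Aж中😀"

code_points = [ord(ch) for ch in text]

for ch, cp in zip(text, code_points):
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{ch}  U+{cp:04X}  {unicodedata.name(ch)}")
```

Each character maps to exactly one number, and `chr` goes the other way, so `chr(0x1F600)` gives back the emoji.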

Unicode Transformation Formats

Unicode defines several transformation formats, also known as UTFs (Unicode Transformation Formats). These transformation formats define how each code point is represented as a sequence of bytes. Below is a brief overview of the three UTFs that the Unicode Standard defines.

UTF-8
- variable-length character encoding that uses from 1 to 4 bytes (from 8 to 32 bits)
- backward compatible with ASCII
- the most common encoding on the web (~98% of all web pages)
UTF-16
- variable-length character encoding that uses 2 or 4 bytes (16 or 32 bits)
- internally used by Microsoft Windows, Java, JavaScript, etc.
UTF-32
- fixed-length character encoding that uses 4 bytes (32 bits)
- simple to index by code point, but uses more memory and bandwidth than UTF-8 or UTF-16 for most text
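The size differences listed above can be checked with a small sketch. The helper `encoded_sizes` is a name invented for this example; it uses the big-endian codec variants so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # "-be" avoids the 2-byte BOM
        len(ch.encode("utf-32-be")),  # "-be" avoids the 4-byte BOM
    )


for ch in ["A", "é", "€", "😀"]:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"{ch!r}: UTF-8={utf8}  UTF-16={utf16}  UTF-32={utf32} bytes")

# Backward compatibility: ASCII text is byte-for-byte identical in UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 switches from 2 to 4 bytes for characters such as emojis, and UTF-32 always uses 4.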

Final thoughts

Thanks to Unicode, today's software works across a variety of languages and platforms. That was hard to imagine a few decades ago. Today's software localization would be impossible without such an encoding standard.

You can find more details about Unicode in the original post.

DEV Community
