DEV Community

Paweł bbkr Pabian


UTF variable length to the rescue

In the previous post of this series I explained how extending base 7-bit ASCII led to encoding chaos of biblical proportions. The issue was the method used - everyone was trying to jam non-ASCII characters into the free space of a single byte, causing conflicts and compatibility issues.

The UTF people took a different approach - store all characters* in a single encoding and dynamically extend the available space by using multiple bytes for less common characters.
[*] They are actually codepoints, but that will be explained later, let's call them characters for now.

Note: They were not the first to have this idea. Multi-byte CJK encodings (such as Big5 and GB2312) for Chinese and Shift-JIS for Japanese were already hacking around the single-byte limitation out of necessity, because even a whole byte could not fit those writing systems. If you like clever algorithms, read about Shift-JIS - it is mind-blowing. Also, UCS can be considered the real technical predecessor of UTF.

Back to UTF - it comes in 3 variants:

  • UTF-8 stores a character using 1, 2, 3 or 4 bytes.
  • UTF-16 stores a character using 2 or 4 bytes (it is a common misconception that "16" means 16 bits / 2 bytes only).
  • UTF-32 stores a character using 4 bytes.
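To see this variable length in action, here is a minimal sketch (using Python, which is an assumption on my part - the post itself names no language). It encodes the same characters in all three variants and prints the byte counts:

```python
# Compare how many bytes each UTF variant needs for the same character.
# The "-le" codecs skip the byte order mark, so we see pure payload sizes.
for ch in ["a", "é", "€", "🐪"]:
    print(
        ch,
        len(ch.encode("utf-8")),      # 1, 2, 3 or 4 bytes
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )
```

Note that the camel emoji takes 4 bytes even in UTF-16 - exactly the case the "16 bits only" misconception gets wrong.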

So the theoretical maximum capacity is 2^32 = 4_294_967_296 characters. The real capacity is way lower, because many byte values are used for namespace organization purposes. For UTF-8 it is 1_112_064, but that can still be considered "unlimited" compared to the 128 characters of 7-bit ASCII or the 256 of the various encodings mentioned in the previous post.
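The 1_112_064 figure can be double-checked with a bit of arithmetic: Unicode codepoints run from 0x0 to 0x10FFFF, and 2048 of them (the surrogate range 0xD800-0xDFFF, reserved for UTF-16 machinery) are not encodable. A quick sanity check in Python:

```python
# Total Unicode codepoints minus the reserved surrogate range.
total = 0x110000               # codepoints 0x0 .. 0x10FFFF
surrogates = 0xE000 - 0xD800   # 2048 values reserved for UTF-16 surrogates
print(total - surrogates)      # 1112064
```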

UTF-16 and UTF-32

Before I focus on UTF-8 I'd like to briefly talk about the other two. UTF-16 was an abomination and never got popular. Its main issues were weird character organization, lack of backward compatibility with 7-bit ASCII, and the use of 0x00 (null) bytes. Null bytes terminate strings in the C language, so reading this encoding required extra care with memory management. Here is a rare picture of an unaware C programmer reading his first UTF-16 string ;)

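The null-byte problem is easy to demonstrate (a small Python sketch - the language choice is mine, not the post's): plain ASCII text encoded as UTF-16 is riddled with 0x00 bytes, which C's `strlen()` would treat as string terminators.

```python
# ASCII text encoded as UTF-16: every other byte is 0x00,
# which a naive C string routine would read as end-of-string.
data = "Hi".encode("utf-16-le")
print(data)  # b'H\x00i\x00'
```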

UTF-32 has the same flaws and, on top of that, seems like a huge waste of space - you need 4 times more bytes to store a simple a than in ASCII. However, sometimes having a predictable, fixed characters-to-bytes ratio is so beneficial that it outweighs the additional space cost. For example, if you need to access the 128th character in a string, it starts at the 127*4+1th byte in memory (assuming composed form, which will be explained later).
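That fixed-width random access can be sketched like this (again in Python, my assumption; real-world code would do this pointer arithmetic in C for speed). The n-th character (0-indexed) always starts at byte n * 4, no matter what precedes it:

```python
# With UTF-32 every character occupies exactly 4 bytes,
# so the n-th character (0-indexed) starts at byte n * 4 -
# even after multi-"byte-hungry" characters like emoji.
text = "🐪🐪🐪abc"
raw = text.encode("utf-32-le")  # -le variant skips the byte order mark
n = 3
ch = raw[n * 4:(n + 1) * 4].decode("utf-32-le")
print(ch)  # 'a'
```

Doing the same in UTF-8 would require scanning the string from the start, because you cannot know where character 128 begins without decoding everything before it.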

Coming up next: Genius design of UTF-8.



