DEV Community

Paweł bbkr Pabian

Unicode vs UTF

Hi and welcome to the series that will explain various aspects of UTF encoding.

Let's start with a common misconception: Unicode and UTF. Many people use those terms interchangeably and say that "This text has Unicode encoding". However, these are not synonyms.

Unicode is a consortium: a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data. Here is their logo:

[Image: Unicode Consortium logo]

They created and maintain the Unicode Standard, which catalogues all characters used worldwide. Version 15.0, current at the time of writing, contains 149,186 characters.
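To make the idea of a catalogue concrete, here is a minimal Python sketch that looks up the number (code point) and the official name the standard assigns to a character. It uses Python's built-in `unicodedata` module; the helper name `describe` is made up for this illustration, and the Unicode version it consults depends on your Python release.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and the catalogue name of a single character."""
    code_point = ord(char)          # the character's number in the catalogue
    name = unicodedata.name(char)   # the official name assigned by the standard
    return f"U+{code_point:04X} {name}"


print(describe("ł"))  # U+0142 LATIN SMALL LETTER L WITH STROKE
print(describe("A"))  # U+0041 LATIN CAPITAL LETTER A
```

Notice that nothing here involves bytes: a catalogue entry is just a number and a description.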

UTF stands for Unicode Transformation Format, and it is the technical implementation of the Unicode Standard: it defines how to represent all those catalogued characters as bytes. It comes in UTF-8, UTF-16 and UTF-32 variants (which will be explained later). Less common encodings such as BOCU-1 and SCSU implement the same standard too, but they are binary incompatible with UTF.
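As a rough illustration, the Python sketch below encodes the same character with each UTF variant and shows that the resulting bytes differ. The function name `encode_variants` is hypothetical, and the big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here only to keep the output free of a byte order mark.

```python
def encode_variants(text: str) -> dict:
    """Encode the same text with each UTF variant and return the raw bytes."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: text.encode(name) for name in encodings}


# One character from the catalogue, three different byte representations.
for name, data in encode_variants("ł").items():
    print(f"{name:<9} {len(data)} bytes: {data.hex()}")

# utf-8     2 bytes: c582
# utf-16-be 2 bytes: 0142
# utf-32-be 4 bytes: 00000142
```

The character is the same in all three cases; only its byte representation changes.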

So if you refer to a specific byte representation of a text (like a document on disk or a variable in memory), you should say precisely "This text has UTF-8 encoding".

Coming up next: Madness before UTF - a short history lesson about dark times.
