What are supplementary characters?

#unicode #webdev #programming #javascript

This article explains what supplementary characters are, how UTF-16 represents them, where developers make mistakes, and what you should do instead.

Start with the basic model

Unicode assigns a unique code point to each character.

Many commonly used characters fit in the range from U+0000 to U+FFFF. This range is called the Basic Multilingual Plane, or BMP.

Unicode also includes characters above U+FFFF. These are called supplementary characters, and they live in the supplementary planes (planes 1 through 16).

Examples include many historic scripts, many musical notation symbols, many emoji, and some CJK ideographs.

The fact that not every Unicode character fits in 16 bits is the source of most confusion.

A supplementary character is any Unicode character with a code point from U+10000 to U+10FFFF. For example:

U+1F600 😀
U+1D11E 𝄞
U+20000 𠀀

Why does this matter? Because some systems, languages, and APIs still expose text through UTF-16 code units.

That means a single character may occupy more than one code unit, so lengths and indexes counted in code units can disagree with the number of characters.
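A minimal JavaScript sketch of that mismatch, assuming a runtime with ES2015 string iteration and `codePointAt`. The variable names are illustrative only:

```javascript
// JavaScript strings expose UTF-16 code units.
const grin = "😀"; // U+1F600, a supplementary character

// .length counts code units, not characters.
const codeUnitCount = grin.length;

// The string iterator (used by spread) walks code points instead.
const codePointCount = [...grin].length;

// codePointAt(0) reads the whole code point that starts at index 0.
const codePointHex = grin.codePointAt(0).toString(16).toUpperCase();

console.log(codeUnitCount);  // 2
console.log(codePointCount); // 1
console.log(codePointHex);   // 1F600
```

Counting code points is still not the same as counting what a user perceives as one character, but it does avoid splitting a surrogate pair.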

How UTF-16 represents them

UTF-16 uses 16-bit code units. For BMP characters, one code unit is enough.

For supplementary characters, UTF-16 uses two code units. This pair is called a surrogate pair, and it consists of a high surrogate followed by a low surrogate.

For example, the character 😀 (U+1F600) is represented in UTF-16 as the code units 0xD83D and 0xDE00.

Together they encode one supplementary character.
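The arithmetic behind a surrogate pair can be sketched in a few lines. This is an illustration of the standard UTF-16 mapping, not something you normally need to write yourself, and `toSurrogatePair` is a hypothetical helper name:

```javascript
// Split a supplementary code point into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper, used here only for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits go in the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go in the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));     // d83d de00
console.log(String.fromCharCode(high, low) === "😀"); // true
```

In practice, `String.fromCodePoint(0x1F600)` builds the same string directly.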

If you inspect UTF-16 data directly, remember these ranges:

High surrogates: U+D800 to U+DBFF
Low surrogates: U+DC00 to U+DFFF

These code points are reserved for the UTF-16 encoding mechanism. They are not Unicode scalar values, so a surrogate on its own never represents a character.

A lone surrogate is therefore invalid in normal Unicode text processing.

If your system allows isolated surrogates to leak into strings, you may see replacement characters (U+FFFD), encoding errors, or corrupted text further down the pipeline.
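One way to catch the problem early is to scan for unpaired surrogates before passing text along. The following is a minimal sketch using the ranges above. `hasLoneSurrogate` is a hypothetical helper name, and newer runtimes that implement `String.prototype.isWellFormed` offer a built-in check instead:

```javascript
// Return true if the string contains a surrogate that is not part of a valid pair.
// hasLoneSurrogate is a hypothetical helper, used here only for illustration.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip the low surrogate
      } else {
        return true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no high surrogate in front of it.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate("😀"));       // false
console.log(hasLoneSurrogate("\uD83D"));   // true
console.log(hasLoneSurrogate("a\uDE00b")); // true
```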

Regular expressions

In some environments, regex tokens operate on UTF-16 code units unless you enable a Unicode-aware mode.

In JavaScript, for example, the Unicode flag matters:

const s = "😀";
console.log(/^.$/.test(s));   // false: without the u flag, . matches a single UTF-16 code unit
console.log(/^.$/u.test(s));  // true

When matching Unicode text, enable the Unicode-aware regex mode your platform provides.

Then test with real supplementary characters, not only ASCII.
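As a rough sketch of what such a test can look like in JavaScript, the checks below compare the same patterns with and without the `u` flag on a string that mixes ASCII and a supplementary character. The sample string and variable names are made up for illustration:

```javascript
const sample = "a😀b"; // 3 code points, 4 UTF-16 code units

// Without the u flag, . matches each half of the surrogate pair separately.
const matchesWithoutFlag = sample.match(/./g).length;
// With the u flag, . matches whole code points.
const matchesWithFlag = sample.match(/./gu).length;

console.log(matchesWithoutFlag); // 4
console.log(matchesWithFlag);    // 3

// Character classes show the same split: without u, the class holds
// four separate surrogate code units rather than two characters.
const classWithoutFlag = /^[😀𝄞]$/.test("😀");
const classWithFlag = /^[😀𝄞]$/u.test("😀");

console.log(classWithoutFlag); // false
console.log(classWithFlag);    // true
```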
