Martin Licht

Posted on Mar 14

Universal character names in C and C++

#programming #c #cpp #coding

Universal character names (UCNs) are a mechanism in C and C++ to designate characters from the ISO/IEC 10646 character set (Unicode).
The C language is not restricted to the English alphabet, and non-ASCII characters can appear in source code. For example, German umlauts or emojis may appear in character and string literals.

char c = '😂';
const char *s3 = "Hätte 😂 Würde 😍 Könnte 😀";

They can even appear in identifiers (e.g., variable names) or in names of preprocessor macros.

// variable names
int größe = 10;
int länge = 20;
int 水 = 1;
double π = 3.14159;

// macro names
#define müll 42
#define 火 100

The difficulty is portability: different platforms use different source encodings. Universal character names provide a portable way to encode non-ASCII characters.
This blog post explores universal character names (UCNs) in C and C++.

Syntax

A universal character name encodes a Unicode code point using only ASCII characters. The character is identified by its code point in hexadecimal form.
UCNs start with a single \ followed by either u or U and a fixed number of hexadecimal digits. A lowercase u requires exactly four hexadecimal digits, an uppercase U exactly eight:

\uABCD     // 4 hex digits, code point up to U+FFFF
\U0001F602 // 8 hex digits, code point up to U+10FFFF

The examples above can be written as:

char c = '\U0001F602'; // 😂
const char *s3 = "H\u00E4tte \U0001F602 W\u00FCrde \U0001F60D K\u00F6nnte \U0001F600";

int gr\u00F6\u00DFe = 10; // größe
int l\u00E4nge = 20;      // länge
int \u6C34 = 1;           // 水
double \u03C0  = 3.14159; // π

#define m\u00FCll 42      // müll
#define \u706B 100        // 火

Note that the leading \ must be single. A succession of \ will not be interpreted as a UCN:

printf("\u03C0");  // prints "π"
printf("\\u03C0"); // prints "\u03C0"
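The difference can also be checked at the byte level. The following is a minimal sketch, not a definitive implementation: hex_dump, pi_ucn and pi_escaped are hypothetical names introduced for this illustration, and the u8 prefix (C11) is used so that the bytes do not depend on the compiler's execution character set.

```c
#include <stdio.h>

/* u8 literals are always encoded as UTF-8. */
static const char pi_ucn[]     = u8"\u03C0";  /* one UCN: the character U+03C0 */
static const char pi_escaped[] = u8"\\u03C0"; /* escaped backslash, then the text "u03C0" */

/* Hypothetical helper: writes the bytes of s into out as space-separated
   uppercase hex. Stops early if out cannot hold another byte. */
void hex_dump(const char *s, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return;
    out[0] = '\0';
    for (; *s != '\0' && used + 4 <= out_size; ++s) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 used ? " " : "", (unsigned char)*s);
    }
}
```

Dumping pi_ucn gives the two UTF-8 bytes CF 80, while pi_escaped gives the six ASCII bytes 5C 75 30 33 43 30, that is, a backslash followed by u03C0.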

Comparison with escape sequences

The syntax of universal character names looks similar to that of escape sequences, and is clearly inspired by it, but the mechanism is different.
Universal character names are conceptually handled in an earlier translation phase than escape sequences, when source characters are mapped into the character set that the compiler uses internally. This is why they may also appear outside of literals, for example in identifiers.
By comparison, escape sequences (like \n or \x41) are processed only when string or character literals are interpreted.

Hexadecimal escape sequences serve a similar purpose but are less portable: a hexadecimal escape sequence produces a raw value in the execution character set (the character set in which character and string literals are stored), so its meaning depends on that character set, and the execution character set might be too small for the intended character. UCNs instead designate an unambiguous character according to Unicode.
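A small sketch may make the contrast concrete. It assumes a C11 compiler and uses u8 literals so that the encoding is fixed to UTF-8; the names by_escape and by_ucn are made up for this example.

```c
#include <assert.h> /* static_assert (C11) */

/* The hex escape inserts the raw byte 0xE4, whatever that byte means
   in the target encoding. On its own it is not valid UTF-8. */
static const char by_escape[] = u8"\xE4";

/* The UCN names the character U+00E4 (ä); the compiler encodes it,
   here as the two UTF-8 bytes C3 A4. */
static const char by_ucn[] = u8"\u00E4";

static_assert(sizeof by_escape == 2, "one byte plus terminating NUL");
static_assert(sizeof by_ucn == 3, "two UTF-8 bytes plus terminating NUL");
```

The hex escape fixes the byte and leaves the character to the encoding, whereas the UCN fixes the character and leaves the bytes to the compiler.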

Comparison with trigraphs

Universal character names are close in spirit to trigraphs, an ancient language feature that has been removed in C++17 and C23. Trigraphs were introduced to substitute for certain special characters, such as #, when these were not available on the keyboard. Each such character could be written as a three-character trigraph sequence instead.

For example, the trigraph ??= acts like # at any position in the source code.

Trigraphs are obsolete because most keyboards now support the C basic character set. UCNs serve a similar purpose but will remain relevant much longer, since no keyboard can cover the full range of characters used in the world.

Usage

Universal character names can appear in most places in the source code. They provide a portable way to include non-ASCII characters regardless of source file encoding. Whether the program compiles and executes as intended depends on which characters are allowed in identifiers and on how the characters are represented in the execution character set.

For example, identifiers (such as the names of variables, functions, and macros) can include many non-English characters, such as German umlauts or Chinese hanzi, but they cannot include symbols such as emojis.

How a non-ASCII character in a character or string literal is interpreted is a separate matter. For example, a character literal such as '火', which is equivalent to '\u706B', will probably not behave as expected.

In C, an ordinary character constant has type int. If the character cannot be represented as a single byte in the execution character set, as is typically the case for '\u706B', the value is implementation-defined. Assigning it to a char variable may lose information: some compilers issue a warning, and some keep only one byte of the value (for example the low byte 0x6B, which is 'k').
In C++, an ordinary character literal has type char if the character fits into a single char. Otherwise the literal is only conditionally supported, has type int, and has an implementation-defined value.

However, both u'火' and u'\u706B' will produce the same UTF-16 character literal of type char16_t.
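As a hedged illustration, the following sketch assumes a C11 implementation that provides <uchar.h> and encodes char16_t and char32_t literals as UTF-16 and UTF-32 (GCC and Clang do). The variable names are invented for this example.

```c
#include <assert.h> /* static_assert (C11) */
#include <uchar.h>  /* char16_t, char32_t */

/* U+706B lies in the basic multilingual plane: one UTF-16 code unit. */
static const char16_t fire16 = u'\u706B';

/* U+1F602 needs the eight-digit form and fits one UTF-32 code unit. */
static const char32_t laugh32 = U'\U0001F602';

/* In a UTF-16 string the same character becomes a surrogate pair. */
static const char16_t laugh16[] = u"\U0001F602";

static_assert(sizeof laugh16 / sizeof laugh16[0] == 3,
              "two surrogates plus terminating zero");
```

Under these assumptions fire16 holds 0x706B, laugh32 holds 0x1F602, and laugh16 holds the code units 0xD83D and 0xDE02 followed by the terminator.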

Restrictions

There are some constraints: not every 32-bit value may appear in a UCN. The excluded cases are:

Code points within the surrogate range of Unicode (D800 through DFFF) are disallowed.
Values that are not valid Unicode code points are disallowed. That means the numerical value cannot be greater than 10FFFF.
Outside of character and string literals, UCNs cannot denote basic source characters. The 96 characters in the basic source character set, such as ASCII letters, digits, and punctuation, must always be written literally there. Identifiers must still obey the general rules: they cannot start with a digit, cannot clash with keywords, etc.

Furthermore, they cannot appear in escape sequences inside character constants or strings where they would be ambiguous with existing escape rules.
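The first two restrictions amount to requiring a Unicode scalar value, which can be sketched as a small check. The function name is hypothetical, and the sketch deliberately ignores the context-dependent rule about basic source characters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if cp is a Unicode scalar value, i.e. a code point
   that is neither above U+10FFFF nor a surrogate. */
bool is_unicode_scalar_value(uint_least32_t cp)
{
    if (cp > 0x10FFFF)
        return false; /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false; /* surrogate range, reserved for UTF-16 */
    return true;
}
```

For instance, the check accepts 0x706B and 0x1F602 but rejects 0xD800 and 0x110000.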

Applications

There are several reasons why this language feature is useful.

Embedding non-ASCII text in character/string literals: UCNs allow unambiguous portable inclusion of characters in string or character literals.
Ensuring identifiers with non-ASCII names are valid C: Variable names such as größe can be written as gr\u00F6\u00DFe, ensuring validity even if the source editor defaults to plain ASCII.
Precise specification of characters: UCNs make explicit which Unicode scalar value is intended, avoiding confusion in visually similar glyphs (e.g., Greek alpha vs. Latin a).
Portability of source code across encodings: source files may be saved in various encodings such as ASCII, UTF-8, or a legacy code page. UCNs are pure ASCII, so the compiler always interprets them the same way. For example, \u00FC is always ü, regardless of whether the file is opened in UTF-8, Latin-1, or Windows-1252.
Substitution by editors: editors may offer reading and modifying the source code in any desired character set but then save any non-ASCII characters as UCNs. This preserves the convenience of editing with a broader character set but ensures portability because the saved source code uses only ASCII characters.
Interoperability with generated code: C code generators may output identifiers or strings containing international text. To guarantee correct compilation, they can emit UCNs instead of raw characters.
Unicode beyond the basic multilingual plane: Characters like emojis or rare scripts (U+1F600 😀, U+20000 CJK ideographs) require \UXXXXXXXX. These cannot be represented with simple \xNN escapes.
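An editor or code generator that emits UCNs needs little more than a formatting routine. The following is a minimal sketch under stated assumptions: format_ucn is a hypothetical helper, and it conservatively rejects every code point below U+00A0, on the assumption that such characters are written literally or with ordinary escape sequences.

```c
#include <stdint.h>
#include <stdio.h>

/* Writes the UCN spelling of cp into out, using the four-digit form
   where possible. Returns the number of characters written (without
   the terminating NUL), or -1 if cp is rejected or out is too small. */
int format_ucn(uint_least32_t cp, char *out, size_t out_size)
{
    int n;

    if (cp < 0xA0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1; /* not emitted as a UCN in this sketch */

    if (cp <= 0xFFFF)
        n = snprintf(out, out_size, "\\u%04lX", (unsigned long)cp);
    else
        n = snprintf(out, out_size, "\\U%08lX", (unsigned long)cp);

    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

With a sufficiently large buffer, the code point 0xFC is spelled as a backslash followed by u00FC, and 0x1F600 as a backslash followed by U0001F600.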
