
Martin Licht

Character types in C

We want to continue the discussion of escape sequences. As preparation, we review the different character types in C.

The source code, with the possible exception of character and string literals, is written in the source character set. This set includes the basic source character set, which suffices to write any C program. The basic source character set consists of the upper- and lowercase Latin letters, the decimal digits, several white space characters (space, horizontal tab, vertical tab, form feed), and the following special characters:

_ # { } [ ] ( ) = < > + - * / % ? : , ; . ^ & | ~ ! \ " '

String and character literals may comprise additional characters beyond the basic source character set. These depend on the source file's encoding. It is implementation-defined how these are mapped to internal character representations.

The execution character set is the set of characters used at runtime to represent values of type char and related character types. Character and string literals are translated into this set during compilation in an implementation-defined manner.

Code units, code points, and glyphs

Let us review the terminology, which is not necessarily consistent across areas.

A character is an abstract unit of text such as a letter, a digit, or an emoji. Printable characters are rendered as glyphs according to a font. Non-printable characters such as tab or newline generally do not correspond to a distinct glyph but may affect the layout.

An encoding specifies which byte sequences represent which characters. It does so in two steps: a character set assigns a number to each character, and the encoding defines how those numbers are stored as bytes.

A code point is the numerical value assigned to a character in a character set. For example, the Unicode code point for 'A' is U+0041.

A code unit is the smallest unit of data used by a character encoding. Its size depends on the encoding; in practice it ranges from one to four bytes. Depending on the encoding, one or several code units make up a single code point. The code units are the elements of a string in C: strings of type char* consist of code units of one byte each (at least 8 bits), whereas strings of type wchar_t*, char16_t*, or char32_t* typically use wider code units.

In fixed-length encodings, one code unit always represents one code point. One example is UTF-32. By contrast, in variable-length encodings, a code point may take one or more code units. Examples are UTF-8 and UTF-16. In a variable-length encoding, the number of code points in a string is therefore generally not the same as the number of code units. As you can imagine, string algorithms on fixed-length encodings are considerably simpler than on variable-length encodings.

So code units are the basic building blocks of code points, which in turn identify the actual characters. Next, we look at how this plays out for the character types in C.

Fixed-length encodings

A char holds a code unit of the narrow execution encoding. The encoding and exact semantics are implementation-defined; common choices are UTF-8, which is variable-length, and single-byte legacy encodings such as ISO-8859 or Windows-1252. The width of char is at least 8 bits. Here, we see that the terminology is somewhat inconsistent: for single-byte legacy encodings, characters, code points, and code units coincide, and "character" is used interchangeably with the other two terms.

The type char has served many different purposes since the early days of C. It can be signed or unsigned, depending on the implementation, but it is always a distinct type from signed char and unsigned char.

A wchar_t holds a code unit of the wide execution encoding. Once again, its size and encoding are implementation-defined. On Windows, it is typically 16 bits wide and holds UTF-16 code units, so code points above U+FFFF need two of them; on most POSIX systems, it is 32 bits wide and holds UTF-32. Only the latter is a true fixed-length representation, which simplifies indexing but sacrifices space efficiency.

A char32_t holds a single 32-bit code unit that directly encodes one Unicode code point. Each code unit maps one-to-one with a code point, which is why this constitutes a fixed-length encoding for Unicode.

The character and string literals for these types are marked with prefixes:

const char* narrow = "Hello World!\n"; // no prefix indicates a narrow string
const wchar_t* wide = L"Hello World!\n"; // L indicates a wide character string
const char32_t* utf32 = U"Hello World!\n"; // U indicates a UTF-32 string

Generally speaking, we use char for raw byte data or single-byte text encodings (such as ASCII or ISO-8859). The type char32_t, introduced in C11, can represent every Unicode code point in a single code unit, at the expense of four times the memory for ASCII text. The wide character type wchar_t sits in between on platforms where it is 16 bits wide, but its implementation-defined width makes it less portable.

Variable-length encodings

A char8_t holds a code unit of the UTF-8 encoding, which is one byte in size. UTF-8 is a variable-length encoding where code points are represented by one to four 8-bit code units. The ASCII characters occupy a single byte each, whereas higher code points require two to four bytes.

Similarly, a char16_t holds a code unit of the UTF-16 encoding, where each code unit is composed of 16 bits. UTF-16 is a variable-length encoding where each code point is stored as one or two code units. Characters in the Basic Multilingual Plane (BMP) are stored as one 16-bit code unit, while supplementary characters (above U+FFFF) use a pair of 16-bit code units called a surrogate pair.

const char8_t* utf8 = u8"Hello World!\n"; // u8 indicates UTF-8
const char16_t* utf16 = u"Hello World!\n"; // u indicates UTF-16

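The type char16_t was introduced in C11, whereas char8_t was introduced in C23. The u8 prefix itself dates back to C11, where it produced arrays of char.
Notably, UTF-8 is backward-compatible with ASCII: the code points U+0000–U+007F are encoded as the same single bytes.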

Summary

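| Encoding | Storage type | Width (bits) | Variable-length? | Literal prefix |
|---|---|---|---|---|
| Narrow execution | char | ≥ 8 | implementation-defined | "..." |
| Wide execution | wchar_t | implementation-defined | implementation-defined | L"..." |
| UTF-8 | char8_t | 8 | Yes | u8"..." |
| UTF-16 | char16_t | 16 | Yes (surrogates) | u"..." |
| UTF-32 | char32_t | 32 | No | U"..." |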
