Why No Modern Programming Language Should Have a 'Character' Data Type

Andrew (he/him) ・6 min read

Photo by Henry & Co. from Pexels


Standards are useful. They quite literally allow us to communicate. If there were no standard grammar, no standard spelling, and no standard pronunciation, there would be no language. Two people expressing the same ideas would be unintelligible to one another. Similarly, without standard encodings for digital communication, there could be no internet, no world-wide web, and no DEV.to.

When digital communication was just beginning, competing encodings abounded. When all we can send along a wire are 1s and 0s, we need a way of encoding characters, numbers, and symbols within those 1s and 0s. Morse Code did this, Baudot codes did it in a different way, FIELDATA in a third way, and dozens -- if not hundreds -- of other encodings came into existence between the middle of the 19th and the middle of the 20th centuries, each with their own method for grouping 1s and 0s and translating those groups into the characters and symbols relevant to their users.

Some of these encodings, like Baudot codes, used 5 bits (binary digits, 1s and 0s) to express up to 2^5 == 32 different characters. Others, like FIELDATA, used 6 or 7 bits. Eventually, the term byte came to represent this grouping of bits, and a byte reached the modern de facto standard of the 8-bit octet. Books could be written about this slow development over decades (and many surely have been), but for our purposes, this short history will suffice.

It was this baggage that the ANSI committee (then called the American Standards Association, or ASA) had to manage while defining their new American Standard Code for Information Interchange (ASCII) encoding in 1963, as computing was quickly gaining importance for military, research, and even civilian use. ANSI decided on a 7-bit, 128-character ASCII standard, to allow plenty of space for the 52 letters (upper and lowercase) of the English alphabet, 10 digits, and many control codes and punctuation characters.

Even though ASCII was defined as a 7-bit encoding, the popularity of 8-bit bytes meant that ASCII characters commonly included a high 8th bit which went unused. In some applications, that bit acted as a toggle to make text italic.

In spite of this seeming embarrassment of riches with regard to defining symbols and control codes for English typists, there was one glaring omission: the remainder of the world's languages.

And so, as computing became more widespread, computer scientists in non-English-speaking countries needed their own standards. Some of them, like ISCII and VISCII, simply extended ASCII by putting the eighth bit to use, keeping the familiar ASCII letters, digits, and punctuation in place. Logographic writing systems, like written Chinese, require thousands of individual characters. Defining a standard encompassing multiple logographic languages could require multiple additional bytes tacked onto ASCII.

Computer scientists realised early on that this would be a problem. On the one hand, it would be ideal to have a single, global standard encoding. On the other hand, if 7 bits worked fine for all English-language purposes, those additional 1, 2, or 3 bytes would simply be wasted space most of the time ("zeroed out"). When these standards were being created, disk space was at a premium, and spending three quarters of it on zeroes for a global encoding was out of the question. For a few decades, different parts of the world simply used different standards.

But in the late 1980s, as the world was becoming more tightly connected and global internet usage expanded, the need for a global standard grew. What would become the Unicode consortium began in 1987 as a collaboration between engineers at Xerox and Apple, who proposed a 2-byte (16-bit) standard character encoding as a "wide-body ASCII":

Unicode aims in the first instance at the characters published in modern text... whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

And so Unicode fell into the same trap as ASCII in its early days: by over-narrowing its scope (focusing only on "modern-use characters") and prioritising disk space, Unicode's opinionated 16-bit standard -- declaring by fiat what would be "generally useful" -- was predestined for obsolescence.

This 2-byte encoding, "UTF-16", is still used for many applications. It's the string encoding in JavaScript and the String encoding in Java. It's used internally by Microsoft Windows. But even 16 bits' worth (65536) of characters quickly filled up, and Unicode had to be expanded to include "generally useless" characters. The encoding transformed from a fixed-width one to a variable-width one as new characters were added to Unicode.
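As a rough sketch of how that variable-width scheme works: a code point above U+FFFF is split into two 16-bit code units, known as a surrogate pair. The helper name `toUtf16CodeUnits` is made up for illustration, and the sketch does no input validation; the arithmetic itself follows the standard UTF-16 algorithm.

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    // Fits in a single 16-bit code unit (the original fixed-width range).
    return [codePoint];
  }
  // Subtract 0x10000, leaving a 20-bit value, then split it into two halves.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

console.log(toUtf16CodeUnits(0x41));    // [ 65 ] -- "A" needs one code unit
console.log(toUtf16CodeUnits(0x1F600)); // [ 55357, 56832 ] -- 0xD83D, 0xDE00
```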

Modern Unicode consists of over 140,000 individual characters, requiring at least 18 bits to represent. This, of course, creates a dilemma. Do we use a fixed-width 32-bit (4-byte) encoding? Or a variable-width encoding? With a variable-width encoding, how can we tell whether a sequence of 8 bytes is eight 1-byte characters or four 2-byte characters or two 4-byte characters or some combination of those?
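The bit counts here are easy to check. The sketch below uses a made-up helper, `bitsNeeded`, to find the smallest number of bits that can give every item in a set its own value. The second call uses the size of the full Unicode code space (0 through 0x10FFFF), which is larger than the number of characters assigned so far.

```javascript
// Hypothetical helper: smallest number of bits that can give each of
// `count` items a distinct value.
function bitsNeeded(count) {
  let bits = 0;
  while (2 ** bits < count) bits += 1;
  return bits;
}

console.log(bitsNeeded(140000));       // 18
console.log(bitsNeeded(0x10FFFF + 1)); // 21
```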

UTF-8, the most widely used variable-width encoding of Unicode, is actually a code-within-a-code. The bit sequence in the first byte of a multi-byte character encodes within it the number of bytes in that sequence.
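A minimal sketch of that "code-within-a-code", assuming well-formed input: the number of leading 1 bits in the first byte gives the length of the sequence. The function name `utf8SequenceLength` is invented for this example.

```javascript
// Hypothetical helper: length of a UTF-8 sequence, judged from its first byte.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

const bytes = new TextEncoder().encode("\u00E9"); // UTF-8 bytes for U+00E9 (e with acute accent)
console.log(bytes.length, utf8SequenceLength(bytes[0])); // 2 2
```

A decoder can therefore read one byte, learn how many more belong to the same code point, and never confuse eight 1-byte sequences with two 4-byte ones.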

This is a complex problem. Because of its UTF-16 encoding, JavaScript will break apart any character that needs more than two bytes (that is, more than one 16-bit code unit) to encode:
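A minimal sketch of that behaviour, using an emoji picked here purely for illustration (U+1F600, written as an escape so the code point is unambiguous):

```javascript
const smiley = "\u{1F600}"; // one code point outside the 16-bit range

console.log(smiley.length);                     // 2 -- length counts UTF-16 code units
console.log(smiley.split("").length);           // 2 -- split("") cuts the pair in half
console.log(smiley.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(smiley.charCodeAt(1).toString(16)); // de00 (low surrogate)
console.log([...smiley].length);                // 1 -- iteration goes by code point
```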

Clearly, these are "characters" in the lay sense, but not according to UTF-16 strings. The entire body of terminology around characters in programming languages has now become overcomplicated: we have characters, code points, code units, glyphs, and graphemes, all of which mean slightly different things, except sometimes they don't.

Thanks to combining marks, a single grapheme -- the closest thing to the non-CS-literate person's definition of a "character" -- can contain a virtually unlimited number of UTF-16 "characters" (code units). There are multi-thousand-line libraries dedicated only to splitting text into graphemes. Any single emoji is a grapheme, but one emoji can sometimes consist of 7 or more individual UTF-16 code units.
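To see the gap between code units and graphemes, here is a small sketch that assumes a runtime with the built-in `Intl.Segmenter` (recent browsers and Node.js). The helper name `countGraphemes` and both sample strings are chosen for illustration.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const accented = "e\u0301"; // "e" followed by a combining acute accent
// Four emoji joined by zero-width joiners, displayed as one "family" emoji
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(accented.length, countGraphemes(accented)); // 2 code units, 1 grapheme
console.log(family.length, countGraphemes(family));     // 11 code units, 1 grapheme
```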

In my opinion, the only sensibly-defined entities in character wrangling as of today are the following:

  • "byte" -- a group of 8 bits
  • "code point" -- this is just a number, contained within the Unicode range 0x000000 - 0x10FFFF, which is mapped to a Unicode element; as a plain number, a code point needs between 1 and 3 bytes to represent
  • "grapheme" -- an element which takes up a single horizontal "unit" of space to display on a screen; a grapheme can consist of 1 or more code points

A code point encoded in UTF-32 is always four bytes wide and uniquely maps to a single Unicode element. A code point encoded in UTF-8 can be 1-4 bytes wide, and can compactly represent any one Unicode element. If there were no such thing as combining marks, either or both of those two standards should be enough for the foreseeable future. But the fact that combining marks can stack Unicode elements on top of each other in the same visual space blurs the definition of what a "character" really is.
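The size difference between the two encodings can be sketched as follows. The helper names are made up, and the UTF-32 figure is computed by counting code points rather than by producing actual UTF-32 bytes.

```javascript
// Hypothetical helpers: byte counts for a string under two encodings.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length; // TextEncoder always emits UTF-8
}

function utf32ByteLength(text) {
  // Iterating a string yields code points; UTF-32 spends 4 bytes on each.
  return [...text].length * 4;
}

// Samples: U+0041, U+00E9, U+20AC, U+1F600
for (const sample of ["A", "\u00E9", "\u20AC", "\u{1F600}"]) {
  console.log(sample, utf8ByteLength(sample), utf32ByteLength(sample));
}
// UTF-8 sizes are 1, 2, 3, and 4 bytes; UTF-32 is 4 bytes every time
```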

You can't expect a user to know -- or care about -- the difference between a character and a grapheme.

So what are we really talking about when we define a character data type in a programming language? Is it a fixed-width integer type, like in Java? In that case, it can't possibly represent all possible graphemes and doesn't align with the layperson's understanding of "a character". If an emoji isn't a single character, what is it?

Or is a character a grapheme? In which case, the memory set aside for it can't really be bounded, because any number of combining marks could be added to it. In this sense, a grapheme is just a string with some unusual restrictions on it.

Why do you need a character type in your programming language anyway? If you want to loop over code points, just do that. If you want to check for the existence of a code point, you can also do that without inventing a character type. If you want the "length" of a string, you'd better define what you mean -- do you want the horizontal visual space it takes up (number of graphemes)? Or do you want the number of bytes it takes up in memory? Something else maybe?
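None of these operations needs a character type. As a sketch in JavaScript (the sample string and variable names are invented for this example, and `Intl.Segmenter` is assumed to be available):

```javascript
const text = "cafe\u0301 \u{1F600}"; // "café" with a combining accent, a space, and an emoji

// Loop over code points: string iteration yields one code point per step.
for (const symbol of text) {
  console.log("U+" + symbol.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Check for the existence of a code point.
const hasEmoji = [...text].some((symbol) => symbol.codePointAt(0) === 0x1F600);

// Three different answers to "how long is this string?"
const codePoints = [...text].length;
const utf8Bytes = new TextEncoder().encode(text).length;
const graphemes = [...new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text)].length;

console.log({ hasEmoji, codePoints, utf8Bytes, graphemes });
// { hasEmoji: true, codePoints: 7, utf8Bytes: 11, graphemes: 6 }
```

The built-in `text.length` gives a fourth answer, 8, because it counts UTF-16 code units.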

Either way, the notion of a "character" in computer science has become so confused and disconnected from the intuitive notion, I believe it should be abandoned entirely. Graphemes and code points are the only sensible way forward.

Posted on May 27 by:

Andrew (he/him)

@awwsmm

Got a Ph.D. looking for dark matter, but not finding any. Now I code full-time. Je parle un peu français. dogs > cats

Discussion

 

In Swift, a Character is an extended grapheme cluster, which will consist of one-or-more Unicode scalar values. It's what a reader of a string will perceive as a single character. And a String consists of zero or more Characters.

 

This is, I think, the compromise that comes closest to making sense. Check out the examples at grapheme-splitter -- I think the resulting graphemes align closely with the intuitive definition of a "character". However, think about how you would access and manipulate these graphemes programmatically: one code point at a time (or even one byte at a time). There's a disconnect between the programmer's understanding of a character and the layperson's understanding of a character. What I'm arguing is that eliminating the term "character" should eliminate that ambiguity.

 

The API in Swift allows getting to a UTF-8 encoding unit, a UTF-16 encoding unit, or a UTF-32 code point, treating each as an index into an array of those sub-Character things (depending on what the developer is trying to do).

Swift and Python 3 both seem to have a good handle on Unicode strings.

Alas, I have to work with C++, which has somewhat underwhelming support for Unicode.

 

Another nice article related to the subject is mortoray.com/2013/11/27/the-string...

I found this article through the Elixir docs

 

@mortoray with the smart commentary, as usual 😎

 

You wrote:
"This 2-byte, fixed-width encoding, "UTF-16""
But UTF-16 is "encoded with one or two 16-bit code units" (cf. Wikipedia), hence it is a variable-length encoding of 2 or 4 bytes.
UTF-32 is a fixed-width encoding.
Also, UTF-8 sequences can be 1 to 4 bytes long; the 4-byte form represents code points U+10000 to U+10FFFF.

 

You're right. When UTF-16 was introduced, it was fixed-width. But -- to accommodate 4-byte-width characters -- it's now a variable-width encoding. I'll edit the text to clarify that. Thanks!

 

I learnt from Golang that there is a rune type.

Though, I am a little concerned about Byte vs String performance, if all you do is within 1 byte (e.g. ASCII / extended ASCII).

 

UTF-8 tries to straddle performance and usability. By using a variable-width encoding, you're minimising memory usage. It just means that a few of the bits in the leading byte "go unused" because they indicate the number of bytes in the multi-byte character. This is a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.

 

Hmm? What does it have to do with combined characters and Zalgo?

Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider

Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦

...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:

let ii = 0
const zalgo = "Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦"


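// charCodeAt(ii) returns the UTF-16 code unit at index ii, or NaN past the end
while (true) {
  const code = zalgo.charCodeAt(ii)
  if (Number.isNaN(code)) break // ran off the end of the string
  console.log(`${ii}: ${code}`)
  ii += 1
}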

The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".

 

I think a 4-byte UTF-8 sequence is possible (the maximum isn't necessarily 3 bytes).

It is. A UTF-8 sequence can be up to 4 bytes long.

My point is that the terminology around what a "character" is has gotten so confusing that we should just stick to well-defined terms like "code point" and "grapheme". "Character" is sometimes confused with one or other of those (or something else entirely) and so I don't think it's a good name for a data type.

If you want to loop over "characters" in a string, you should loop over code points (each of which takes between 1 and 4 bytes in UTF-8). But why should someone ever want to loop over the individual bytes of a code point? This functionality could be provided, but not at the expense of clarity.