Remo Dentato

Posted on Jul 26, 2021 • Edited on Aug 6, 2021

UTF-8 strings in C (1/3)

#c #utf8 #encoding

Unicode

Today the World speaks Unicode. More precisely, today the World speaks UTF-8 encoded Unicode. So long to ISO-8859-x, KOI8 and Shift-JIS; and also to UCS-2 and other multibyte encodings!

7-bit ASCII is heartily welcome to stay, but let's do our best to promote the use of UTF-8 everywhere. It is worthwhile!

For us C programmers, the price to pay is letting go of the char type we all know and love: its built-in assumption that a character fits into a byte is no longer adequate.

C90 introduced wchar_t and a bunch of functions to help deal with non-ASCII encodings, but I always found them unnecessarily complex and confusing.

Of course, when dealing with Unicode strings, you can grab one of the Unicode libraries available for C (like ICU and libunistring) and go with it, but they are very complex and you may not need all the features they offer.

Thanks to the beauty of UTF-8 (pure genius), most likely you need very little or no code at all to handle UTF-8 encoded strings depending on what you have to do with those strings.

Let's delve deeper to understand when a full-fledged library is needed and when you can just use the tools you already have at your disposal.

The Encoding

Let's just remind ourselves how UTF-8 works. Actually, it's very simple: given a Unicode codepoint (let's call it a character even if we know it's not 100% accurate), its bits are spread across one or more bytes according to the following table:

range	Byte 1	Byte 2	Byte 3	Byte 4
0000 - 007F	0xxxxxxx
0080 - 07FF	110xxxxx	10xxxxxx
0800 - FFFF	1110xxxx	10xxxxxx	10xxxxxx
10000 - 10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

Additional rules for a valid UTF-8 encoding:

it must be minimal (it must use the smallest possible number of bytes)
codepoints U+D800 to U+DFFF (known as UTF-16 surrogates) are invalid and, hence, their encoding is invalid.

I'll deal with validating the encoding in a future post; for now, let's see what UTF-8 allows us to do by simply ignoring the fact that the string is, indeed, UTF-8 encoded!

Useful properties

The UTF-8 encoding has many useful properties. Most notably:

The first 128 characters occupy just one byte and their encoding is the same both in ASCII and UTF-8.
The two most significant bits of the first byte of a multibyte encoding are 11 (i.e. if (b & 0xC0) == 0xC0 the byte b is the first byte of a multibyte encoding);
The two most significant bits of the next bytes of a multibyte encoding are 10 (i.e. if (b & 0xC0) == 0x80 the byte b is part of a multibyte encoding);
No NUL character ('\0') is introduced as byproduct of the encoding, meaning that our convention that a string is 0 terminated, is safe.
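UTF-8 preserves ordering: comparing two encoded strings byte by byte gives the same result as comparing their codepoints, so the relative order of two encoded characters is the same as their unencoded order.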

The fact that any ASCII text is also valid UTF-8 encoded text greatly simplifies some tasks. For example, if you have to work with CSV (comma separated values) files, and you are not interested in the content of the fields, you can completely ignore the fact that the file is UTF-8 encoded since the separators are most likely ASCII characters too (',', ';', '\t', ...).

Also note that being able to easily identify the first byte of an encoded character makes it possible to move to the next or previous character in the string starting from any point, even from a byte in the middle of a multibyte encoding. This is a very desirable property for an encoding: one can quickly re-sync if something went wrong in decoding.

Nothing (or very little) to do here

As a result of the above-mentioned properties, many functions in the C standard library continue to work (possibly with some caveats):

strcpy(), strcmp(), strstr(), fgets(), and any other function that relies on ASCII terminators (\0, \n, \t, ...) are completely unaffected.
strtok(), strspn(), strchr(), will work as long as their other argument is within the ASCII range.
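strlen() still works, but it returns the size of the string in bytes, not the number of characters in it. Likewise, for strncpy() and other size-limited functions, the n parameter expresses the size (in bytes) of the buffer the string is in, not the number of characters in the string.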

In general, for any function you want to use, ask yourself if it makes any difference whether the characters are encoded as UTF-8 or not, and write only the minimal code you may need.

You may take advantage of the UTF-8 encoding to write simple functions like this:

// Returns the number of characters in a UTF-8 encoded string.
// (Does not check for encoding validity)
int u8strlen(const char *s)
{
  int len=0;
  while (*s) {
    if ((*s & 0xC0) != 0x80) len++ ;
    s++;
  }
  return len;
}

Or something more complex (but still not so complicated):

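#include <string.h>

// Like strncpy(), but avoids leaving a truncated multibyte UTF-8
// encoding at the end. As with strncpy(), dest may be left without
// a NUL terminator when src is n bytes long or longer.
char *u8strncpy(char *dest, const char *src, size_t n)
{
  size_t k, i;
  if (n) {
    k = n-1;
    dest[k] = 0;
    strncpy(dest,src,n);
    if (dest[k] & 0x80) { // Last byte is part of a multibyte encoding
      // Step back (3 bytes at most) looking for its first byte
      for (i=k; (i>0) && ((k-i) < 3) && ((dest[i] & 0xC0) == 0x80); i--) ;
      // Cut the encoding off unless all of its bytes made it into dest
      switch(k-i) {
        case 0:                                 dest[i] = '\0'; break;
        case 1:  if ( (dest[i] & 0xE0) != 0xC0) dest[i] = '\0'; break;
        case 2:  if ( (dest[i] & 0xF0) != 0xE0) dest[i] = '\0'; break;
        case 3:  if ( (dest[i] & 0xF8) != 0xF0) dest[i] = '\0'; break;
      }
    }
  }
  return dest;
}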

Conclusion

As a rule of thumb:

When you're asked to deal with UTF-8 encoded strings in C, ask yourself what aspect of the encoding really impacts your work. You may discover that being UTF-8 encoded is immaterial for the work you have to do!

Next steps

This post focused on the easy part to avoid scaring you away, but there are two major aspects that need to be discussed:

validation: how to determine if a sequence of bytes is really a UTF-8 encoded character;
folding: transforming characters between their uppercase and lowercase forms (if any). That's a very complex point, most relevant for case-insensitive comparison, which is a very common task.

I'll address them in the next posts on this topic.

Please let me know if I missed something or if it wasn't clear enough. Your feedback is what makes these posts worth writing.

Top comments (4)

Florian Pigorsch • Aug 16 '21

Very interesting series on Unicode in C. I wonder if properties of UTF-8, most importantly: never introduce a NULL in the encoding, were added specifically to allow for interoperability with C's NULL-terminated strings.

Remo Dentato • Aug 17 '21

I think this is the case, as UTF-8 is the brainchild of Ken Thompson and Rob Pike, two protagonists of the Unix world since the beginning.

Florian Pigorsch • Aug 17 '21

Ah, wasn't aware of the origin story - now it makes perfect sense...
