UTF-8 strings in C (3/3)

#c #utf8 #encoding

It's complicated

I said in the previous post that dealing with UTF-8 strings in C is easy, and I was not lying: dealing with UTF-8 encoding is just a matter of bit juggling. What is really complicated is dealing with the many existing characters and their use in different languages.

Consider the simple operation of converting a string to upper case. The uppercase form of the German word "straße" (street) is "STRASSE", going from a six-character string to a seven-character string! Even worse, going back to lower case would give "strasse", which is not the original string!

The Unicode standard specifies a set of rules to fold characters to a common (mostly lower case) form. If you open the case folding data file (CaseFolding.txt), you'll see that some rules are marked with "F" (Full) and others with "S" (Simple), while rules marked "C" are common to both sets; you may want to use one set or the other depending on your needs.

Take the word "straße" again. If you use the Full set of rules (which may make the string longer), you get:

fold("straße") -> "strasse"
fold("STRASSE") -> "strasse"

meaning that you can successfully perform a case-insensitive comparison. But if you use the Simple set of rules, you won't recognize that the two strings "straße" and "STRASSE" are the same apart from letter case.
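
As a minimal sketch of what Full folding means in code, assume we only care about ASCII letters and 'ß' and copy everything else unchanged. The names fold_full_sketch and equal_ignoring_case_sketch are made up for this example, and the 64-byte buffers are an arbitrary choice:

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: folds ASCII letters to lower case and expands
   'ß' (U+00DF, bytes C3 9F) to "ss". All other bytes are copied as is.
   Returns the folded length, or (size_t)-1 if dst is too small. */
size_t fold_full_sketch(const char *src, char *dst, size_t cap)
{
  const unsigned char *p = (const unsigned char *)src;
  size_t n = 0;

  if (cap == 0) return (size_t)-1;

  while (*p) {
    if (p[0] == 0xC3 && p[1] == 0x9F) {
      if (n + 3 > cap) return (size_t)-1;   /* two letters + terminator */
      dst[n++] = 's';
      dst[n++] = 's';
      p += 2;
    } else {
      if (n + 2 > cap) return (size_t)-1;   /* one byte + terminator */
      dst[n++] = ('A' <= *p && *p <= 'Z') ? (char)(*p | 0x20) : (char)*p;
      p++;
    }
  }
  dst[n] = '\0';
  return n;
}

/* Returns 1 if the two strings are equal after folding, 0 otherwise
   (also 0 if either string does not fit in the fixed buffers). */
int equal_ignoring_case_sketch(const char *a, const char *b)
{
  char fa[64], fb[64];

  if (fold_full_sketch(a, fa, sizeof fa) == (size_t)-1) return 0;
  if (fold_full_sketch(b, fb, sizeof fb) == (size_t)-1) return 0;
  return strcmp(fa, fb) == 0;
}
```

The point to notice is that the folded string can be longer than the input, so the folding cannot be done in place, character by character.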

Sorting is even more complicated: it is the realm of Unicode Collation, where you have to consider not only the strings themselves but also the language and the usage.
From the Unicode Collation Algorithm document (UTS #10):

in Swedish 'z' < 'ö' but in German 'ö' < 'z'
in a German Dictionary 'of' < 'öf' but in a German Phonebook 'öf' < 'of'
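
To see why a plain comparison cannot express these rules, here is a small sketch. strcmp() compares bytes as unsigned values and, for valid UTF-8, that gives the same order as comparing code points. The name byte_order_less is made up for this example:

```c
#include <string.h>

/* Returns nonzero when a sorts before b in plain byte order,
   which for valid UTF-8 is the same as code point order. */
int byte_order_less(const char *a, const char *b)
{
  return strcmp(a, b) < 0;
}
```

With this function "z" sorts before "ö" and "of" sorts before "öf", whatever the language: there is no way to get the German 'ö' < 'z' or the phonebook 'öf' < 'of' out of it. The standard library route is strcoll() after setlocale(LC_COLLATE, ...), but the result depends on which locales are installed on the system.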

And let us not go down the rabbit hole of ispunct(), isdigit() and so on. The rather huge Unicode Character Database has all the answers, but you still have to be careful about what you need. Is 'Ⅸ' (roman numeral 9) a digit? Probably not, but what about the Cuneiform 9: '𒐞'?
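
As an illustration of "deciding what you need", here is a minimal sketch of a digit test that accepts only three sets of decimal digits. The choice of sets and the name u8chrisdigit_sketch are my assumptions for this example, and so is the typedef: u8chr_t is taken to be an integer holding the UTF-8 encoded bytes of a character.

```c
#include <stdint.h>

typedef uint32_t u8chr_t;  /* assumed: UTF-8 bytes packed in an integer */

/* Sketch only: accepts a hand-picked subset of decimal digits. */
int u8chrisdigit_sketch(u8chr_t c)
{
  return (0x30 <= c && c <= 0x39)            /* U+0030..U+0039 ASCII digits        */
      || (0xD9A0 <= c && c <= 0xD9A9)        /* U+0660..U+0669 Arabic-Indic digits */
      || (0xEFBC90 <= c && c <= 0xEFBC99);   /* U+FF10..U+FF19 fullwidth digits    */
}
```

With this definition 'Ⅸ' (U+2168, encoded as 0xE285A8) is not a digit. Whether that is the right answer depends entirely on your application.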

So, there is a good reason why full Unicode libraries are big and difficult to use: it's complicated!

Do we really need all of this?

If the task you have at hand requires the full power of the Unicode standard, your best option is to use a good library like ICU or libunistring. But if you just need to support a single script, or far fewer properties than Unicode specifies, you can easily code them yourself.

For example, say you just need to support case-insensitive comparison for the Latin script (and you are content with just Basic Latin, Latin-1 and Latin Extended-A). You can implement the Simple Folding rules very easily:

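u8chr_t u8chrfold(u8chr_t c)
{
  // Folding according to the Simple rules in Unicode 13.0.
  // c holds the UTF-8 encoded bytes of the character, not its code point.

  if ('A' <= c && c <= 'Z') return (c | 0x20); // Basic Latin

  // Latin-1: U+00C0..U+00DE, skipping U+00D7 (multiplication sign)
  if (0xC380 <= c && c <= 0xC39E && c != 0xC397) return (c | 0x20);

  if (0xC5B8 == c) return 0xC3BF; // LATIN CAPITAL LETTER Y WITH DIAERESIS
  // Note: this is the Turkic ("T") mapping; the default rules leave U+0130 unchanged.
  if (0xC4B0 == c) return 0x69;   // LATIN CAPITAL LETTER I WITH DOT ABOVE
  // U+013F -> U+0140 changes the lead byte, so c+1 would not work here.
  if (0xC4BF == c) return 0xC580; // LATIN CAPITAL LETTER L WITH MIDDLE DOT

  if (0xC480 <= c && c <= 0xC4B7 && !(c&1)) return c+1;  // Latin Extended-A
  if (0xC4B8 <= c && c <= 0xC588 &&  (c&1)) return c+1;  // Latin Extended-A
  if (0xC58A <= c && c <= 0xC5B8 && !(c&1)) return c+1;  // Latin Extended-A
  if (0xC5B9 <= c && c <= 0xC5BE &&  (c&1)) return c+1;  // Latin Extended-A

  return c;
}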

Adding other blocks of the Latin script is just a matter of adding some more ifs.
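
A folding function like this is meant to be called once per character while walking two strings. Here is a minimal sketch of such a comparison. The names u8pack_next, u8casecmp_sketch and ascii_fold are made up for this example, the typedef of u8chr_t is an assumption, and the input is assumed to be valid UTF-8:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u8chr_t;                 /* assumed: UTF-8 bytes packed in an integer */
typedef u8chr_t (*u8fold_fn)(u8chr_t);    /* any per-character folding function */

/* Hypothetical helper: packs the encoded bytes of the character at s
   into *c and returns how many bytes were consumed.
   Must not be called on the string terminator. */
size_t u8pack_next(const char *s, u8chr_t *c)
{
  const unsigned char *p = (const unsigned char *)s;
  size_t len = (p[0] < 0x80) ? 1 : (p[0] < 0xE0) ? 2 : (p[0] < 0xF0) ? 3 : 4;
  size_t i = 0;
  u8chr_t v = 0;

  do {
    v = (v << 8) | p[i];
    i++;
  } while (i < len && p[i] != 0);   /* never read past the terminator */

  *c = v;
  return i;
}

/* Returns 0 if a and b are equal after folding, otherwise -1 or 1. */
int u8casecmp_sketch(const char *a, const char *b, u8fold_fn fold)
{
  for (;;) {
    u8chr_t ca, cb;

    if (*a == '\0' || *b == '\0')
      return (*a != '\0') - (*b != '\0');

    a += u8pack_next(a, &ca);
    b += u8pack_next(b, &cb);
    ca = fold(ca);
    cb = fold(cb);
    if (ca != cb) return (ca < cb) ? -1 : 1;
  }
}

/* Smallest possible folding function, here only to make the sketch complete. */
u8chr_t ascii_fold(u8chr_t c)
{
  return ('A' <= c && c <= 'Z') ? (c | 0x20) : c;
}
```

Passing u8chrfold instead of ascii_fold extends the comparison to the Latin blocks that function covers. Since Simple folding maps one character to one character, no extra buffer is needed.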

To recognize extended blank characters you may write something like:

/* BLANK CHARACTERS
  0x000009  U+0009 tab
  0x000020  U+0020 space
  0x00C2A0  U+00A0 no-break space
  0xE19A80  U+1680 ogham space mark
  0xE28080  U+2000 en quad
  0xE28081  U+2001 em quad
  0xE28082  U+2002 en space
  0xE28083  U+2003 em space
  0xE28084  U+2004 three-per-em space
  0xE28085  U+2005 four-per-em space
  0xE28086  U+2006 six-per-em space
  0xE28087  U+2007 figure space
  0xE28088  U+2008 punctuation space
  0xE28089  U+2009 thin space
  0xE2808A  U+200A hair space
  0xE280AF  U+202F narrow no-break space
  0xE2819F  U+205F medium mathematical space
  0xE38080  U+3000 ideographic space
*/
int u8chrisblank(u8chr_t c)
{
  return (c == 0x20)
      || (c == 0x09)
      || (c == 0xC2A0)
      || (c == 0xE19A80)
      || ((0xE28080 <= c) && (c <= 0xE2808A))
      || (c == 0xE280AF)
      || (c == 0xE2819F)
      || (c == 0xE38080)
      ;
}
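
If you wonder where constants like 0xE19A80 come from: they are the UTF-8 bytes of the character, packed into an integer. A minimal sketch of how to compute them from a code point follows. The name u8pack_codepoint and the typedef are assumptions for this example, and there is no validation of surrogates or out-of-range values:

```c
#include <stdint.h>

typedef uint32_t u8chr_t;  /* assumed: UTF-8 bytes packed in an integer */

/* Sketch only: encodes the code point cp in UTF-8 and returns the
   encoded bytes packed in an integer (first byte most significant). */
u8chr_t u8pack_codepoint(uint32_t cp)
{
  if (cp < 0x80)                       /* 1 byte:  0xxxxxxx */
    return cp;

  if (cp < 0x800)                      /* 2 bytes: 110xxxxx 10xxxxxx */
    return ((0xC0u | (cp >> 6)) << 8)
         |  (0x80u | (cp & 0x3F));

  if (cp < 0x10000)                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    return ((0xE0u | (cp >> 12)) << 16)
         | ((0x80u | ((cp >> 6) & 0x3F)) << 8)
         |  (0x80u | (cp & 0x3F));

  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
  return ((0xF0u | (cp >> 18)) << 24)
       | ((0x80u | ((cp >> 12) & 0x3F)) << 16)
       | ((0x80u | ((cp >> 6) & 0x3F)) << 8)
       |  (0x80u | (cp & 0x3F));
}
```

For example, u8pack_codepoint(0x1680) gives 0xE19A80, the value used above for the ogham space mark.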

Conclusion

Is it worthwhile to write your own functions instead of relying on existing libraries? It depends (of course)!

If you need to perform complex operations on strings in unknown scripts, an existing library is your best option.

But if you're tight on memory, just need to handle a few character properties, and want to keep things simple, maybe it is worthwhile to develop your own set of functions.

What is, for sure, worthwhile is to know how the UTF-8 encoding of Unicode characters works, so that your C application can speak with the rest of the modern world.
