Rune struct

#textprocessing #csharp #dotnet

Char has a problem. You don't see it for what it is, and it hurts the poor little types feeling. So, we're gonna work on that. What does Char want you to know about what it is? It's right there in the documentation.

Represents a character as a UTF-16 code unit.

What does this mean, though? String happens to be a bunch of Char, so, String is a bunch of UTF-16 code units; it is UTF-16 encoded text.

Now, I could explain plenty about encodings and all that, but, no. That gets super boring. Instead it's easier to show the problem with perception and then explain things.

Char @char = "𝄞 (g clef)"[0];

What is @char? '𝄞'? No, not even close. @char is '�'. That's not an error with your font, your system is rendering everything correctly. Well this isn't very clear about what is going on, just that something unexpected is going on.

Let's do the same thing, but this time around, I'll escape every single character.

Char @char =  "\ud834\udd1e\u0020\u0028\u0067\u0020\u0063\u006c\u0065\u0066\u0029"[0];

So now it should be a bit more clear what's going on, at least with the indexer. [0] would be the first element, which is U+D834, or '�'. So what that and why is it not '𝄞'?

This is why Char is confused and misunderstood.

Remember the documentation?

Represents a character as a UTF-16 code unit.

The issue is, the indexer, and really everything we think of when it comes to working with String work with UTF-16 encoding (code) units, not characters. The naming is really confusing, but bear with me: Char is not a character. It was, historically, but UNICODE and its history is... complicated.

So what did we wind up getting with the indexer? A UTF-16 code unit, sure, but why is it often what we want, and why not now? UTF-16 uses 16-bit units for encoding, but there's 21-bits or something, reserved by UNICODE. So how would you represent the higher characters? Well, use two code units. UTF-8 does a similar thing, uses 1-4 byte units. '𝄞' is encoded as U+D834, U+DD1E, as it got us the first code unit. But note that [1] wouldn't get us the character we want either, it instead gets us U+DD1E!

So how does one actually get the characters from a String as we think of them?

This isn't a simple answer, actually. Text is extremely complicated. But there's something that Microsoft introduced with .NET Core 3.0, that helps a ton: Rune. So what's this? Basically an analogue to Char, but it represents a UNICODE scalar value rather than a UTF-16 code unit. The Rune API provides all sorts of almost identical methods to what's in Char, which is fantastic because it means code you already wrote can easily be adapted, and it's far easier to learn. Now let's consider that analogue.

Rune rune = "𝄞 (g clef)".GetRuneAt(0);

What's rune? Well, good news, it's 𝄞 (U+01D11E).

But wait, is this just some trickery where GetRuneAt() reads two chars and combines them? No, not at all. It reads enough of a UTF-16 sequence to get a UNICODE scalar value, and that's that. Try it, you'll see.

Now, as I said, Rune doesn't solve everything. There's combining marks and ligatures and other things it doesn't handle but we still think of as a "character". But it gets us into a lot better situation. Code using Rune will be more likely to support users of non-Latin scripts (most of the world).

But I said .NET Core 3.0, right? So if you can't update your app because of constraints, or you're a library author like myself who's targeting .NET Standard instead, what's the options?

library author like myself

Good news guys. Stringier.Rune is part of a large set of libraries I develop and maintain. Only in this case, I didn't develop it. The .NET Foundation does. But I backported it all the way to .NET Standard 1.3. Yeah, that far back. So almost anything you could still be using, could be updated to use Rune.

DEV Community

Rune struct

Oldest comments (0)