Char has a problem. You don't see it for what it is, and it hurts the poor little type's feelings. So, we're gonna work on that. What does Char want you to know about what it is? It's right there in the documentation.

Represents a character as a UTF-16 code unit.

What does this mean, though? String happens to be a bunch of Char, so String is a bunch of UTF-16 code units; it is UTF-16 encoded text.

Now, I could explain plenty about encodings and all that, but, no. That gets super boring. Instead it's easier to show the problem with perception and then explain things.
Char @char = "𝄞 (g clef)"[0];
What is @char? '𝄞'? No, not even close. @char is '�'. That's not an error with your font; your system is rendering everything correctly. But this isn't very clear about what is going on, just that something unexpected is going on.

Let's do the same thing, but this time around, I'll escape every single character.
Char @char = "\ud834\udd1e\u0020\u0028\u0067\u0020\u0063\u006c\u0065\u0066\u0029"[0];
So now it should be a bit clearer what's going on, at least with the indexer. [0] gets the first element, which is U+D834, or '�'. So what is that, and why is it not '𝄞'?
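If you don't believe me, it's a quick check. Here's a minimal sketch (just a console app printing the numeric value of @char) showing exactly what the indexer handed back:

Char @char = "𝄞 (g clef)"[0];

// The indexer returns a single UTF-16 code unit; 0xD834 on its own
// isn't a complete character, which is why it renders as '�'.
Console.WriteLine($"U+{(int)@char:X4}"); // U+D834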
This is why Char is confused and misunderstood. Remember the documentation?

Represents a character as a UTF-16 code unit.
The issue is that the indexer, and really everything we think of when it comes to working with String, works with UTF-16 code units, not characters. The naming is really confusing, but bear with me: Char is not a character. It was, historically, but UNICODE and its history is... complicated.
So what did we wind up getting with the indexer? A UTF-16 code unit, sure, but why is it often what we want, and why not now? UTF-16 uses 16-bit units for encoding, but there's 21-bits or something, reserved by UNICODE. So how would you represent the higher characters? Well, use two code units. UTF-8 does a similar thing, uses 1-4 byte units. '𝄞'
is encoded as U+D834, U+DD1E, as it got us the first code unit. But note that [1]
wouldn't get us the character we want either, it instead gets us U+DD1E
!
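Nothing is hidden here, either; both code units are sitting right there in the String, and the standard library can even stitch them back together. A minimal sketch, using Char.ConvertToUtf32 (which has been around since long before Rune):

string clef = "𝄞"; // stored as the surrogate pair 0xD834, 0xDD1E

Console.WriteLine($"U+{(int)clef[0]:X4}"); // U+D834 (high surrogate)
Console.WriteLine($"U+{(int)clef[1]:X4}"); // U+DD1E (low surrogate)

// Recombining the pair gives back the actual UNICODE scalar value.
int scalar = Char.ConvertToUtf32(clef[0], clef[1]);
Console.WriteLine($"U+{scalar:X}"); // U+1D11E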
So how does one actually get the characters from a String as we think of them?

This isn't a simple answer, actually. Text is extremely complicated. But there's something that Microsoft introduced with .NET Core 3.0 that helps a ton: Rune. So what's this? Basically an analogue to Char, but it represents a UNICODE scalar value rather than a UTF-16 code unit. The Rune API provides all sorts of almost identical methods to what's in Char, which is fantastic because it means code you already wrote can easily be adapted, and it's far easier to learn. Now let's consider that analogue.
Rune rune = "𝄞 (g clef)".GetRuneAt(0);
What's rune? Well, good news, it's 𝄞 (U+1D11E).

But wait, is this just some trickery where GetRuneAt() reads two chars and combines them? No, not at all. It reads enough of a UTF-16 sequence to get a UNICODE scalar value, and that's that. Try it, you'll see.
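If you want to try it with just what ships in the box, the static Rune.GetRuneAt and the EnumerateRunes() extension in System.Text (.NET Core 3.0+) cover it. A minimal sketch:

using System;
using System.Text;

string text = "𝄞 (g clef)";

// Decodes however many code units it needs (one or two) starting at the
// given index, and hands back the scalar value.
Rune rune = Rune.GetRuneAt(text, 0);
Console.WriteLine($"U+{rune.Value:X} '{rune}'"); // U+1D11E '𝄞'
Console.WriteLine(rune.Utf16SequenceLength);     // 2

// Walks the text scalar-by-scalar instead of code-unit-by-code-unit.
foreach (Rune r in text.EnumerateRunes())
{
    Console.Write($"U+{r.Value:X} ");
}

One thing to watch: the index is still a code unit index, so asking for GetRuneAt(text, 1) lands in the middle of the pair and will complain rather than hand you half a character.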
Now, as I said, Rune doesn't solve everything. There's combining marks and ligatures and other things it doesn't handle but that we still think of as a "character". But it gets us into a much better situation. Code using Rune will be more likely to support users of non-Latin scripts (most of the world).
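To make that limitation concrete, here's a minimal sketch contrasting Rune with StringInfo from System.Globalization, which works in grapheme clusters ("text elements") instead:

using System;
using System.Globalization;
using System.Text;

// 'e' followed by a combining acute accent: one "character" to a reader,
// but two scalar values, so two Runes.
string text = "e\u0301";

foreach (Rune r in text.EnumerateRunes())
{
    Console.Write($"U+{r.Value:X4} "); // U+0065 U+0301
}
Console.WriteLine();

// StringInfo counts grapheme clusters, which matches the reader's view.
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 1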
But I said .NET Core 3.0, right? So if you can't update your app because of constraints, or you're a library author like myself who's targeting .NET Standard instead, what are the options?
Good news, guys. Stringier.Rune is part of a large set of libraries I develop and maintain. Only in this case, I didn't develop it. The .NET Foundation does. But I backported it all the way to .NET Standard 1.3. Yeah, that far back. So almost anything you could still be using could be updated to use Rune.