UTF-8 tries to straddle performance and usability. By using a variable-width encoding, it minimises memory usage; the cost is that a few bits in the leading byte "go unused", because they encode how many bytes the multi-byte sequence contains. That's a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.
Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider
Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦
...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:
The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".
I learnt from Golang that there is rune.
Though, I am a little concerned about byte vs string performance when everything you process fits in 1 byte (e.g. ASCII / extended ASCII).
Hmm? What does that have to do with combining characters and Zalgo?