Probably easiest if we build up from base concepts, so (note, all Unicode code points listed below are in decimal to specifically avoid any ambiguity involved in the discussion of encodings):
Unicode
Unicode is, quite simply, a method of representing (almost) every grapheme (character, in common terminology) used in the world's known writing systems, alongside a very large number of symbols and a lot of different control codes, as unsigned integers known as 'code points'. Originally code points were 32-bit integers, but it very quickly became obvious that this was beyond excessive, and when UTF-16 finally gained traction the standard was revised so that code points are now only 21-bit unsigned integers (0 through 1114111).
All of the code points are sorted into groups of 65536 known as 'planes'. The plane a code point belongs to is indicated by the value of the top two bytes of its overall value, with the other two bytes representing where in that plane it sits. Seven of these planes actually have code points assigned to them (specifically planes 0-3 and 14-16), but about 99% of actual usage involves only the first two planes: the Basic Multilingual Plane (which holds the vast majority of the most common writing systems, aside from certain less frequently used Hanzi characters and some less common Brahmic-derived scripts) and the Supplementary Multilingual Plane (which includes those less common Brahmic-derived scripts, an assortment of historical scripts, a number of supplemental characters for scripts in the BMP, and a large number of symbols).
Of specific note for later discussion, the first 128 code points (numbered 0-127) have the exact same meaning as the bytes with the same value interpreted as 7-bit ASCII or any of the ISO-8859 character encodings (and a number of other encodings as well).
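Since the plane is just the high-order part of the code point's value, you can compute it directly. Here's a minimal Python sketch (the helper names are made up for illustration, not a standard API) that splits a code point into its plane and in-plane offset and shows the ASCII overlap:

```python
# Illustrative helpers: split a code point into its plane number
# and its position within that plane.
def plane_of(codepoint: int) -> int:
    return codepoint >> 16          # value of the top bytes

def offset_in_plane(codepoint: int) -> int:
    return codepoint & 0xFFFF       # value of the low two bytes

print(plane_of(ord("A")), offset_in_plane(ord("A")))    # 0 65     -> BMP
print(plane_of(ord("𝄞")), offset_in_plane(ord("𝄞")))    # 1 53534  -> SMP (MUSICAL SYMBOL G CLEF)
print(ord("A"))                                          # 65, the same value as the ASCII byte for 'A'
```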
UTF-32
Of course, Unicode is kind of pointless unless you have a way for computers to represent those characters. The most naive encoding for this is to just have each character be four bytes interpreted as an integer which represents the code point. This has three major issues:
- It's extremely space-inefficient: the overwhelming majority of real-world text only uses code points from the BMP, so in practice most of those four bytes per character are zeroes.
- Because each character is a multi-byte integer, the byte order matters, so the same text has two incompatible serializations depending on endianness.
- It's incompatible with most existing byte-oriented software. A lot of that software assigns special meaning to certain byte values (such as /, \, and %), and those byte values may appear as part of actual UTF-32 characters.

Obviously there's got to be a better solution...
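To see the space waste and the embedded 'special' bytes concretely, here's a minimal Python sketch (using the big-endian form just to keep the byte order explicit):

```python
text = "Hi/"
data = text.encode("utf-32-be")   # big-endian UTF-32, no byte-order mark

# Four bytes per character, most of them zero.
print(len(text), len(data))       # 3 12
print(data.hex(" "))              # 00 00 00 48 00 00 00 69 00 00 00 2f

# The byte for '/' (47) appears in the stream; it would appear just as easily
# inside the four-byte encoding of many unrelated code points.
print(data.count(b"/"))           # 1
```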
UTF-16
UTF-16 evolved out of an interest in dealing with the efficiency issues of UTF-32. Instead of using a sequence of 32-bit integers, it uses a sequence of 16-bit integers. Code points in the BMP are represented simply as their exact value as a single 16-bit integer. Code points beyond the BMP, up through plane 16 (the highest plane defined by the current standard), are represented by a special pair of 16-bit values known as a surrogate pair. Anything beyond plane 16 cannot be represented (though this is no longer an issue, as the Unicode standard has been revised, as mentioned above, so that no such code points exist).
This sounds complicated, and it kind of is, but for everyday text it's more efficient than UTF-32 by a significant margin: anything in the BMP takes half the space, and nothing takes more. It's also greatly simplified by the fact that the surrogate values used for encoding things beyond the BMP are specifically reserved for that purpose, so there's no guesswork involved in figuring out whether a given 16-bit value is a character by itself or part of a surrogate pair.
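As an illustrative sketch (Python, with code points shown in decimal to match the rest of this comment), a BMP character is a single 16-bit unit while a plane 1 character becomes a surrogate pair:

```python
# 'A' is code point 65 (BMP); '😀' is code point 128512 (plane 1).
for ch in ("A", "😀"):
    data = ch.encode("utf-16-be")
    # Every two bytes form one 16-bit code unit.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    print(ch, ord(ch), units)

# A 65 [65]
# 😀 128512 [55357, 56832]   <- both values fall in the reserved surrogate range
```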
UTF-16 has its own issues though: it still isn't compatible with byte-oriented, ASCII-based software (even plain English text ends up full of zero bytes), and, like UTF-32, it depends on byte order, so a stream needs a byte-order mark or out-of-band knowledge to be interpreted correctly.
Microsoft Windows uses UTF-16 internally, as do a handful of other systems.
UTF-8
UTF-8, in turn, evolved out of an interest in fixing the compatibility issue that UTF-16 and UTF-32 both have. It's much more complicated than either, and works roughly as follows:

- Code points 0-127 are encoded as a single byte holding that exact value, so any pure ASCII text is already valid UTF-8.
- Every other code point is encoded as a sequence of two to four bytes: the leading byte's high bits say how long the sequence is, and each following 'continuation' byte carries six more bits of the code point.
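A small sketch (Python, purely illustrative) of how the byte count grows with the code point, and how the leading-bit patterns mark the sequence length:

```python
samples = ["A", "é", "€", "😀"]   # code points 65, 233, 8364, 128512

for ch in samples:
    data = ch.encode("utf-8")
    # 0xxxxxxx = ASCII byte; 110xxxxx/1110xxxx/11110xxx = start of a 2/3/4 byte
    # sequence; 10xxxxxx = continuation byte.
    print(ord(ch), len(data), [f"{b:08b}" for b in data])

# 65     1 ['01000001']
# 233    2 ['11000011', '10101001']
# 8364   3 ['11100010', '10000010', '10101100']
# 128512 4 ['11110000', '10011111', '10011000', '10000000']
```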
This scheme allows representation of the same set of characters as UTF-16, with the following specific benefits:

- It's fully backwards compatible with ASCII: ASCII text is already valid UTF-8, and none of the bytes in a multi-byte sequence fall in the ASCII range, so byte values like /, \, and % only ever appear when they actually mean those characters.
- There are no byte-order issues, because it's defined as a sequence of individual bytes rather than larger integers.
- It's self-synchronizing: if you land in the middle of a multi-byte sequence, you can always scan to the start of the next character.
There are still some downsides to UTF-8, though they're different from those of UTF-16 and UTF-32:

- It's a variable-width encoding, so you can't jump straight to the Nth character, and encoding and decoding are more involved than for the fixed-width forms; some characters end up with a comparatively complicated long form of up to four bytes.
- Characters outside the ASCII range take more space than they do in UTF-16: most of the BMP beyond ASCII (including Hanzi and the Brahmic scripts) needs three bytes per character in UTF-8 versus two in UTF-16.
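To put rough numbers on that size trade-off, here's an illustrative Python comparison (the sample strings are arbitrary):

```python
samples = {
    "English": "Hello, world",
    "Greek":   "Γειά σου Κόσμε",
    "Chinese": "你好，世界",
}

# ASCII-heavy text is smallest in UTF-8; CJK-heavy text is smaller in UTF-16.
for name, text in samples.items():
    print(name, len(text), "chars:",
          len(text.encode("utf-8")), "bytes UTF-8,",
          len(text.encode("utf-16-be")), "bytes UTF-16,",
          len(text.encode("utf-32-be")), "bytes UTF-32")
```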
Combining characters, normalization, and special symbols.
Of course, the 'complicated long form' part of UTF-8 is not really as specific to it as it seems. One of the more interesting aspects of Unicode from an implementer's perspective is the concept of 'combining characters'. These are special code points that work similarly to how old typewriters could over-type on top of existing text (used for things like underlines and diacritical marks). The most common examples are diacritical marks, but others exist.
What makes this interesting is that many of the characters you can produce by applying combining characters to other characters also have their own pre-combined forms assigned to separate code points. This, in turn, results in a 1:N mapping of actual characters to code point sequences, which makes parsing Unicode text more complicated no matter what encoding you're using. Many systems prefer one form or the other for these characters, and will 'normalize' any text they process into that form.
This gets even more interesting though when you consider that Unicode actually differentiates characters used as letters versus the same characters used as symbols, which in turn can lead to an even larger number of ways to represent a character.
An easy example of this is 'Å'. It's a character used in a number of Nordic languages, and it's also used as the symbol for the unit of measurement known as the ångström. In Unicode, it's got three representations:

- code point 197 (LATIN CAPITAL LETTER A WITH RING ABOVE), the pre-combined letter;
- code point 8491 (ANGSTROM SIGN), the same glyph assigned separately as a unit symbol;
- the sequence of code point 65 (LATIN CAPITAL LETTER A) followed by code point 778 (COMBINING RING ABOVE).
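A quick sketch using Python's standard unicodedata module shows that the three forms render identically but only compare equal after normalization, which is exactly why systems normalize:

```python
import unicodedata

precomposed = "\u00c5"    # code point 197, LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom    = "\u212b"    # code point 8491, ANGSTROM SIGN
combining   = "A\u030a"   # code points 65 + 778, 'A' plus COMBINING RING ABOVE

print(precomposed, angstrom, combining)                     # all render as Å
print(precomposed == angstrom, precomposed == combining)    # False False

# NFC normalization folds all three into the single pre-combined code point 197.
forms = {unicodedata.normalize("NFC", s) for s in (precomposed, angstrom, combining)}
print(forms)                                                # {'Å'}
```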