These are some personal notes about why character encoding is still a problem in 2022. Like most of my postings, these are written so that I can email a link to various people. In this case it is my own abbreviated distillation of facts mixed with opinions, so it will include some generalisations. Nonetheless, commenters should feel free to nitpick and discuss.
Note: this will be given from an English language perspective, for two de facto reasons:
- that's what most of the tech world uses;
- the full multi-language situation was and remains complicated, and I wanted to write a simpler summary.
The main basic encodings
Here are the headings I'm going to write under:
- ASCII
- IBM PC Extended ASCII
- So Called ANSI
- Unicode the Idea
- Unicode As Done By Microsoft
- Unicode UTF-8
Those are the minimum that I think programmers and data scripters need to know about.
ASCII is defined as a 7-bit encoding, therefore 128 codes, with about 96 "printable" ones and about 32 "control characters".
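To pin down the "about" in those counts, here is a minimal sketch in Python. It leans on str.isprintable(), which is just one way of drawing the line: it treats the space as printable and DEL (127) as a control code, giving 95 and 33. Counting the space or DEL differently gives slightly different numbers.

```python
# ASCII is 7-bit, so the code values run from 0 to 127.
all_codes = [chr(i) for i in range(128)]

# str.isprintable() is False for the control codes (0-31 and DEL at 127)
# and True for everything else, including the space at 32.
printable = [c for c in all_codes if c.isprintable()]
control = [c for c in all_codes if not c.isprintable()]

print(len(all_codes), len(printable), len(control))
```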
Originally many of the control characters had meaning (e.g. as serial transmission controls) but gradually these fell out of use. In practice by now, the only ones in consistent use are:
- Line Feed (LF) aka New Line (NL)
- Carriage Return (CR)
Even so, there is a schism between the Microsoft world and everyone else.
- Microsoft uses a CRLF pair to mark line ends in text files;
- Unix uses a single NL to mark line ends in text files - ergo so do Linux, Android, macOS and iOS et al.
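As a small illustration of that schism, here is a sketch of converting between the two conventions at the byte level. The function names to_lf and to_crlf are my own, made up for this example.

```python
# The same two lines of text, as saved on Windows and on Unix-like systems.
windows_bytes = b"first line\r\nsecond line\r\n"
unix_bytes = b"first line\nsecond line\n"


def to_lf(data: bytes) -> bytes:
    """Convert CRLF pairs to a single LF; lone LFs are left alone."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert to CRLF, normalising first so existing CRLFs are not doubled."""
    return to_lf(data).replace(b"\n", b"\r\n")


print(to_lf(windows_bytes) == unix_bytes)
print(to_crlf(unix_bytes) == windows_bytes)
```

Working on bytes rather than decoded text keeps the line-ending question separate from the character-encoding question, at least for ASCII-compatible 8-bit encodings and UTF-8.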
While ASCII only defined a 7-bit encoding, when the Personal Computer revolution later occurred, the 8-bit Byte was the standard way of holding an ASCII character. This meant there was room for doing things with the extra 128 values a byte could store.
IBM PC Extended ASCII
As the IBM PC took over the market through the 1980s, a character set supplied in it became commonly used to draw things that looked like borders for boxes and windows etc. A rudimentary style of GUI became common, especially as the full graphical display of the Macintosh became widely known.
This set also added enough extra Latin characters to cover most of the Western European languages as well as a few currency symbols (e.g. the British Pound and Japanese Yen).
So Called ANSI
As Windows grew in market share - essentially with the Windows 3 release - the set that it came with took over the mind share of what the second 128 characters looked like.
Because Windows didn't need the border block characters any more, places were freed to have other uses. The selection was one of Microsoft's choosing so it doesn't exactly match any other "standard" character encoding.
While it was always a misnomer, many of us came to refer to that set as "the ANSI set". See the link for its more precise name.
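To see how differently the two sets treat the upper 128 values, here is a sketch that decodes the same three bytes both ways. I am assuming that Python's "cp437" codec matches the original IBM PC set and that "cp1252" matches the Western European Windows set described above.

```python
# Three byte values from the upper half (128-255) of an 8-bit set.
raw = bytes([0xC9, 0xCD, 0xBB])

as_ibm_pc = raw.decode("cp437")    # assumed: the original IBM PC set
as_windows = raw.decode("cp1252")  # assumed: the Windows "ANSI" set

print(as_ibm_pc)   # box-drawing pieces: a top-left corner, a bar, a top-right corner
print(as_windows)  # accented letters and a closing guillemet
```

The bytes never change; only the table used to interpret them does, which is why a box border drawn by a DOS program turns into a run of accented letters in a Windows editor.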
Unicode the Idea
The very mixed up multi-language side of character encoding was a continual frustration and eventually some people got together to sort out a better approach.
A crucial thing to understand about Unicode is that it defines the meaning of each "code point" and then there are various ways of encoding those for actual storage and/or transmission.
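A quick sketch of that split between meaning and storage: one code point, three different byte sequences. The choice of the Euro sign and of the little-endian variants is arbitrary on my part.

```python
char = "\u20ac"  # EURO SIGN, code point U+20AC

# One code point, several possible encodings for storage or transmission.
encoded = {
    name: char.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}

print(f"U+{ord(char):04X}")
for name, data in encoded.items():
    print(name, data.hex(" "))
```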
That still leads to a lot of change issues and human politics that I'm not going to cover. Here we'll just take it as read that there is a single current definition of Unicode.
Unicode As Done By Microsoft
Which is to say "UCS-2" - because despite whatever you may read, for the most part Microsoft - i.e. Windows - does not use UTF-16.
- Microsoft were quick to adopt Unicode and went ahead with a 16-bit encoding. They deserve praise for that, as this probably helped Unicode gain wide usage.
- Alas, when better encodings were soon settled upon, they just ignored that and kept using their own.
Alas, no single link for this, so interpretation will require reading between the lines of:
- UCS-2 - an obsolete Universal Character Set encoding
- UTF-16 - which for example has this:
- "Older Windows NT systems (prior to Windows 2000) only support UCS-2. Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings."
- Yet my observation is that most Windows applications including those by Microsoft still just use UCS-2.
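The practical difference between the two shows up only for code points above U+FFFF. Here is a sketch of that: UTF-16 spends two 16-bit units (a surrogate pair) on such a character, whereas a fixed 16-bit scheme has nowhere to put it. The helper fits_in_ucs2 is a name I made up for the illustration.

```python
bmp_char = "\u00e9"          # e-acute, U+00E9, fits in 16 bits
astral_char = "\U0001F600"   # an emoji, U+1F600, does not

bmp_bytes = bmp_char.encode("utf-16-le")
astral_bytes = astral_char.encode("utf-16-le")


def fits_in_ucs2(text: str) -> bool:
    """True if every code point fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(len(bmp_bytes), fits_in_ucs2(bmp_char))
print(len(astral_bytes), fits_in_ucs2(astral_char))
```

Software that assumes "one 16-bit unit equals one character" will count, index or truncate the second example wrongly, and that is the behaviour I am describing as UCS-2.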
Unicode UTF-8
By 1993 the concepts for UTF-8 were settled and its run as the most commonly used encoding of Unicode began.
- "UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022."
This is a variable-byte encoding method, so that for most English texts only one byte is needed per character. Depending on its Unicode code point, a character may be encoded as 1, 2, 3 or 4 bytes.
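A sketch of the 1, 2, 3 or 4 byte cases, using four sample characters of my own choosing:

```python
samples = {
    "A": "\u0041",            # plain ASCII letter
    "e-acute": "\u00e9",      # Latin letter with an accent
    "euro": "\u20ac",         # currency symbol
    "emoji": "\U0001F600",    # outside the first 65,536 code points
}

# Number of bytes each character needs once encoded as UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print(lengths)
```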
A bit less obvious is that not all byte sequences make for valid UTF-8 - there are sequences that simply should not happen - see "Invalid sequences and error handling" at the Wikipedia page. This has ramifications when UTF-8 is used as a declared part of some other standard - e.g. XML, which takes it as the default.
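To make that concrete, here is a sketch of what happens when bytes written in an 8-bit encoding are read as UTF-8. I am assuming Python's "cp1252" codec for the 8-bit side. A strict decode refuses the data, while a lenient decode silently swaps in the U+FFFD replacement character.

```python
# "cafe" with an e-acute, saved in an 8-bit Windows encoding.
latin_bytes = "caf\u00e9".encode("cp1252")

try:
    latin_bytes.decode("utf-8")  # strict mode is the default
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The lenient route loses information instead of raising.
lenient = latin_bytes.decode("utf-8", errors="replace")

print(decoded_ok)
print(lenient)
```

A strict parser, as an XML one is supposed to be, takes the first route and rejects the whole document.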
Byte Order Marking
As much as I'd like to skip mentioning this, it may come up at times. This is a feature of the UTF style encodings. Interestingly it is both sensible and yet problematic.
I'll defer to Wikipedia for the details, but for this article, here is the key quote:
- "BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream."
I've mentioned it partly because the tools that I suggest later on can show whether the byte order mark (BOM) was noticed and what it indicated.
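For the curious, here is a minimal sketch of BOM sniffing using the constants in Python's codecs module. The function name sniff_bom and the table ordering are my own. Note the built-in weakness: a UTF-16 little-endian file whose first real character is NUL would be mistaken for UTF-32.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```

When reading, Python's "utf-8-sig" codec strips a leading UTF-8 BOM if one is present, which sidesteps the "unexpected non-ASCII bytes at the start" problem from the quote.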
Let's recap the sequence of events:
- ASCII was defined and people used it
- the IBM PC came out and programmers used its extended set to draw boxes and do simple GUI combinations
- Windows came along with fully drawn graphics and the so called ANSI set was used to cover most of the Latin variation languages (i.e. European)
- Unicode came along and Microsoft implemented a 16-bit character encoding for it
- as Unicode settled down, UTF-8 was defined and became the standard used by almost everyone
- except Microsoft who stubbornly stuck to their own 16-bit encoding.
Not quite in that list is that even though Microsoft changed to a 16-bit encoding, most programs still had to deal with 8-bit encoded files. Whether a given tool would save/store using 8-bit or 16-bit could/can be hard to guess.
In 2022 it's still a mess
The example that has prompted me to write this piece is that I emailed an exported source code file from my Windows-based workplace to my home. When I went to open the file on my Linux system, the text editor complained that it couldn't make sense of the character encoding. To handle the file it would require me to specify the appropriate encoding for reading the file.
This raised the question: did I know for sure what encoding the file really had? On Windows, the file just seemed like any other text file.
In truth I've been through this step many times, so this wasn't a surprise. However, this time I had planned to inspect the file, maybe edit it a little, then email it to someone else. I have no idea what computer platform that person uses.
Ergo, I had no idea what encoding they would want/need to receive as a text file.
In the spirit of this piece I'm going to ignore the many very technical options and just suggest two tools as good for a general user:
- Notepad++ - for Windows
- Notepadqq - for Linux
Both of these do a very good job of:
- recognising the character set in a file as they open it
- letting you tell which encoding the file should be interpreted as
- letting you tell it to convert the encoding to something else
Those features will probably cover most situations.
From my experience, the feature still missing is:
- an analyser or verifier of the encoding formats.
An example of this topic is this posted question and the type of answers it received:
Note that the unanswered part of the question is claiming that the suggested solution still only "reports invalid UTF-8 files but doesn't report which lines/characters to fix." (emphasis mine).
Caveat: as I'm old school, when I hit problems of this type I usually break out a hex editor/viewer and look at the raw bytes at issue. But this is usually because I've already spotted where there is an issue to resolve. I don't have a go-to tool for proving that an encoding is consistent and showing exactly where it is not.
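To show the kind of thing I have in mind, here is a rough sketch of a UTF-8 checker that reports where the bad bytes are rather than just that they exist. It is not a polished tool: the name find_invalid_utf8 is my own, it only knows about UTF-8, and it re-slices the data after every error, so it would be slow on a large file with many problems.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset, bad_bytes) for each invalid sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onwards is valid
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            # Lines are numbered from 1 by counting the LFs before the error.
            line = data.count(b"\n", 0, start) + 1
            problems.append((line, start, data[start:end]))
            pos = end  # resume just past the bad bytes
    return problems


sample = b"ok line\nbad \xe9 here\nfine \xc3\xa9\nworse \xff\xfe\n"
for line, offset, bad in find_invalid_utf8(sample):
    print(f"line {line}, byte offset {offset}: {bad!r}")
```

The line and offset are exactly what I would otherwise go hunting for in a hex viewer.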
As noted above, most references to Microsoft's support of Unicode talk of it using "UTF-16" even though it doesn't. Microsoft are not the only case of this. At my workplace I've had to point out that Teradata also uses invalid terminology. In their documentation Teradata say both that they use UTF-16 and that UTF-16 is a fixed-two-byte format (it's not; like UTF-8, it is a variable length encoding), so one of those two claims is false.
Another example seen by this writer is the blind usage of UCS-2 encodings embedded inside something for which they are invalid, such as UTF-8. I saw this in a mass of XML, which led the XML to fail schema validation - that being an important feature of XML.
Hence, one thing competent programmers will have to deal with is the outcomes caused by incompetent ones.
Browsers versus Native tools
This is really only an issue on Windows, where programs compiled for the platform still predominantly use UCS-2.
However, web browsers work with the Internet, where the predominant usage is UTF-8, and that provides a point of discord.
If you highlight text in the browser and then paste it into a Windows application, what form of encoding gets copied? Is that determined by the browser (in placing content onto the clipboard) or by the receiving application (as it pastes from the clipboard)?
Ditto for saving from the browser to a file: if the content was UTF-8, does the browser save to that or to UCS-2? And if saving as UTF-8, does it set the byte order mark?
Arrays versus Lists
Yes finally after all the above, here is the sticking point for programmers:
- do variable length encodings, such as UTF-8, affect the ability or efficiency when handling text strings?
As usual, the answer is: it depends.
Fundamentally this is about when they get treated as arrays versus as lists.
- Now to be pedantic, what we mean here by an array is a form of internal storage in which the place in memory for an item can be simply calculated: take the position of the array, then add the size of an element multiplied by the item's index. Some of you, and some languages, have been using the term "array" for other things. Please stop doing that; you're part of the problem, not part of the solution, to anything.
How important this is, is not quite clear. Indeed it's easier to pose questions than it is to answer them.
Key questions about usage are:
- How often when writing code do we actually want to jump to the nth character?
- How often do we want to step through all the characters in the string?
- How often will the characters be outside the ASCII set?
Key questions about implementations are:
- What are the methods available?
- How important is the amount of memory or storage required?
- How important is predictability for the amount of memory or storage?
As an example, it would be possible to use an array of variable length encoded characters.
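Here is a sketch of the difference in Python terms. With a fixed-width encoding (UTF-32 here) the nth character sits at a calculated offset, which is the array case. With UTF-8 the bytes have to be walked from the start, which is the list case. Both function names are my own, and the UTF-8 one assumes the data is already known to be valid.

```python
def nth_fixed_width(data32: bytes, n: int) -> str:
    """Array style: the offset is simply n times the element size."""
    return data32[4 * n:4 * n + 4].decode("utf-32-le")


def nth_by_scanning(data8: bytes, n: int) -> str:
    """List style: step over each UTF-8 sequence until the n-th is reached."""
    index = 0
    pos = 0
    while pos < len(data8):
        lead = data8[pos]
        if lead < 0x80:
            width = 1          # 0xxxxxxx
        elif lead >> 5 == 0b110:
            width = 2          # 110xxxxx
        elif lead >> 4 == 0b1110:
            width = 3          # 1110xxxx
        else:
            width = 4          # 11110xxx, assuming valid input
        if index == n:
            return data8[pos:pos + width].decode("utf-8")
        pos += width
        index += 1
    raise IndexError(n)


text = "a\u00e9\u20ac\U0001F600z"
data8 = text.encode("utf-8")
data32 = text.encode("utf-32-le")

print(len(data8), len(data32))
print(nth_by_scanning(data8, 3) == nth_fixed_width(data32, 3))
```

The trade is plain enough: the fixed-width form costs more memory for mostly-ASCII text but gives constant-time indexing, while the UTF-8 form is compact but needs a scan (or a side index) to find the nth character.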
And so, having posed more questions than answers, I'm skipping to the next heading! No links for this as there's a bazillion web forum questions and even more attempts at answering them.
And of course, there are other matters such as:
- character collation (e.g. Unicode collation algorithm)
- uppercase/lowercase concepts
- left-to-right versus right-to-left order
- and ... well the list is endless really.
And again, the question of whether these are logical issues and/or are matters about the actual character encodings can be hard to pin down.
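A few quick Python illustrations of why these feel like logical issues rather than encoding ones. None of the following touches bytes at all, yet the results may surprise: uppercasing can change the length of a string, the default sort is by code point rather than by any human alphabet, and two strings that look identical may be built from different code points.

```python
import unicodedata

# Case mapping: the German sharp s has no single-character uppercase form here.
word = "Stra\u00dfe"
upper = word.upper()
folded = word.casefold()

# Default sorting compares code points, so an accented initial sorts after "z".
ordered = sorted(["zebra", "\u00e9clair", "apple"])

# The same visible letter as one code point, or as a letter plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(upper, folded, len(word), len(upper))
print(ordered)
print(equal_raw, equal_nfc)
```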
Re all the above, some quick comments about the situation in Python.
- Python 2 - yes that had problems, well documented, go see about that elsewhere
- Python 3 - works with Unicode by default and has explicit features for bringing in from and exporting to various encodings.
- in theory there is no need for a Python programmer to know how CPython actually handles text strings internally.
From which I think there are three takeaways:
- handling Unicode strings in Python should "just work"
- they are to be treated by the programmers as ordered lists of characters
- if the speed or space worries you, it's time to get or write a library in some other language for Python to call upon.
p.s. yes, you can go read the current CPython source code to see how it currently does it. That may or may not help for use of Jython et al.
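As a sketch of those "explicit features" in Python 3, here is a small file converter: bytes become str on the way in, and str becomes bytes on the way out, with the encoding named at each boundary. The function name convert_file and the demo file names are my own, and I am assuming "cp1252" for the source side. Passing newline="" is there so that line endings pass through untouched.

```python
import tempfile
from pathlib import Path


def convert_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Re-encode a text file, leaving its line endings as they were."""
    with open(src, encoding=src_encoding, newline="") as fin:
        text = fin.read()            # bytes -> str happens here
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)             # str -> bytes happens here


# Demo: a Windows-style file with an accent, a Euro sign and a CRLF ending.
with tempfile.TemporaryDirectory() as folder:
    src = Path(folder) / "from_work.txt"
    dst = Path(folder) / "for_home.txt"
    src.write_bytes(b"caf\xe9 \x80 5\r\n")
    convert_file(src, dst, "cp1252")
    converted = dst.read_bytes()

print(converted)
```

Inside the program the text is just a str; which encoding it came from or is going to only matters at those two open() calls.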
This article has deliberately glossed over a lot of other details, for which you are encouraged to do further reading. Some example points:
- the CRLF pair instead of NL was inherited by Microsoft from CP/M, which was itself a world of "teletype" terminals, including those that literally printed onto paper, so the distinction between Carriage Return and Line Feed was actually useful. Indeed so was the FF (Form Feed) character - for jumping to the start of the next page/sheet.
- there were and are many distinct approaches used in various countries around the world
- ditto the support and use of "code pages"
- even currency symbols have had complex usage patterns, e.g. the UK placement of their Pound Sterling symbol and later the introduction of the Euro with its symbol to be accommodated.
- yes, because Unicode also covers the idea of national flags as symbols and as that is done as a pair of Unicode characters then in theory such a symbol could be 8 bytes long in UTF-8.
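The flag arithmetic can be checked in a couple of lines of Python. The choice of flag is arbitrary: the pair below is the two "regional indicator" code points for G and B.

```python
# One flag as seen on screen, built from two regional indicator code points.
flag = "\U0001F1EC\U0001F1E7"

code_points = len(flag)
utf8_bytes = len(flag.encode("utf-8"))

print(code_points, utf8_bytes)
```

So one symbol on screen is two code points, and each of those takes four bytes in UTF-8.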
Top comments (2)
Nice post! One thing I had to learn the hard way: some languages do not use a single codepoint to display characters, but glyphs - multiple codepoints representing parts of a symbol that are combined if codepoints are arranged in the right sequence.
After writing the above I thought I should at least try some of the available tools on Linux. Here is how that went.
Note: neither 'uchardet' nor 'enca' was present on my Xubuntu setup, but both were easily added from the stock Ubuntu repositories.
Among the things that some searching brought up was this one, written in Python. The documentation is worth a read even without trying the program.