
Grrr Character Encoding

These are some personal notes about why character encoding is still a problem in 2022. Like most of my postings these are written to be something I can email a link to various people. In this case it is my own abbreviated distillation of facts mixed with opinions so it will include some generalisations. Nonetheless, commenters should feel free to nitpick and discuss.

Note: this will be given from an English language perspective, for two pragmatic reasons:

  • that's what most of the tech world uses;
  • the full multi-language situation was and remains complicated, and I wanted to write a simpler summary.

The main basic encodings

Here are the headings I'm going to write under:

  • ASCII
  • IBM PC Extended ASCII
  • So Called ANSI
  • Unicode the Idea
  • Unicode As Done By Microsoft
  • Unicode UTF-8

Those are the minimum that I think programmers and data scripters need to know about.

ASCII

Defined as a 7-bit encoding, therefore 128 codes - 95 "printable" ones (counting the space) and 33 "control characters".

Originally many of the control characters had meaning (e.g. as serial transmission controls) but gradually these fell out of use. In practice by now, the only ones in consistent use are:

  • Tab
  • Line Feed (LF) aka New Line (NL)
  • Carriage Return (CR)

Even so, there is a schism between the Microsoft world and everyone else.

  • Microsoft uses a CRLF pair to mark line ends in text files;
  • Unix uses a single LF to mark line ends in text files - ergo so do Linux, Android, macOS and iOS et al. (a small detection sketch follows this list)
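
By way of illustration, here is a minimal Python sketch for telling which convention a text file follows, by counting the two byte patterns. The file name is just a placeholder.

# count line-ending styles in a file, reading it as raw bytes
from pathlib import Path

data = Path("example.txt").read_bytes()   # hypothetical file name
crlf = data.count(b"\r\n")                # Windows-style endings
lf_only = data.count(b"\n") - crlf        # bare LF endings (Unix-style)
print(f"CRLF: {crlf}, bare LF: {lf_only}")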

While ASCII only defined a 7-bit encoding, when the Personal Computer revolution later occurred, the 8-bit Byte was the standard way of holding an ASCII character. This meant there was room for doing things with the extra 128 values a byte could store.


IBM PC Extended ASCII

As the IBM PC took over the market through the 1980s, the character set supplied with it became commonly used to draw things that looked like borders for boxes and windows etc. A rudimentary style of text-mode GUI became common, especially as the fully graphical display of the Macintosh became widely known.

This set also added enough extra Latin characters to cover most of the Western European languages, as well as a few currency symbols (e.g. the British Pound and Japanese Yen).

So Called ANSI

As Windows grew in market share - essentially with the Windows 3 release - the character set it shipped with took over the mindshare of what the second 128 characters looked like.

Because Windows didn't need the border and block characters any more, those positions were freed up for other uses. The selection was of Microsoft's own choosing, so it doesn't exactly match any other "standard" character encoding.

While it was always a misnomer, many of us came to refer to that set as "the ANSI set". Its more precise name is Windows-1252 (also known as CP-1252).
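
A quick way to see that this set is Microsoft's own choice: the same byte decodes differently under Windows-1252 (the "cp1252" codec in Python) and under the ISO standard Latin-1. A minimal sketch:

# the byte 0x93 is a "smart quote" in Windows-1252 but a C1 control code in ISO-8859-1
b = bytes([0x93])
print(b.decode("cp1252"))          # U+201C LEFT DOUBLE QUOTATION MARK
print(repr(b.decode("latin-1")))   # '\x93' - an unprintable control character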

Unicode the Idea

The very mixed up multi-language side of character encoding was a continual frustration and eventually some people got together to sort out a better approach.

A crucial thing to understand about Unicode is that it defines the meaning of each "code point" and then there are various ways of encoding those for actual storage and/or transmission.
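
For example, in Python one and the same code point can be written out under several of those encodings - the code point is the meaning, the bytes are just a chosen representation. A small sketch:

# one code point (U+00E9, 'é'), three different byte representations
ch = "\u00e9"
print(ch.encode("utf-8"))      # b'\xc3\xa9'         - 2 bytes
print(ch.encode("utf-16-le"))  # b'\xe9\x00'         - 2 bytes, but different ones
print(ch.encode("utf-32-le"))  # b'\xe9\x00\x00\x00' - 4 bytes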

That still leads to a lot of change issues and human politics that I'm not going to cover. Here we'll just take it as read that there is a single current definition of Unicode.

Unicode As Done By Microsoft

Which is to say "UCS-2" - because despite whatever you may read, for the most part Microsoft - i.e. Windows - does not use UTF-16.

In short

  • Microsoft were quick to adopt Unicode and went ahead with a 16-bit encoding. They deserve praise for that, as this probably helped Unicode gain wide usage.
  • Alas, when better encodings soon settled, they just ignored that and kept using their own.

Alas, no single link for this, so interpretation will require reading between the lines of:

  • UCS-2 - an obsolete Universal Character Set encoding
  • UTF-16 - which for example has this:
  • "Older Windows NT systems (prior to Windows 2000) only support UCS-2. Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings."
  • Yet my observation is that most Windows applications, including those by Microsoft, still just use UCS-2 (the practical difference is illustrated just below).
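
The practical difference only shows up for code points beyond the 16-bit range (above U+FFFF): UTF-16 encodes those as a surrogate pair of two 16-bit units, whereas UCS-2 simply cannot represent them. A small Python check:

# U+1F600 (a grinning-face emoji) is beyond the 16-bit range
ch = "\U0001F600"
utf16 = ch.encode("utf-16-le")
print(len(utf16))    # 4 - i.e. two 16-bit units (a surrogate pair)
print(utf16.hex())   # 3dd800de - high then low surrogate, little-endian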

Unicode UTF-8

By 1993 the concepts for UTF-8 were settled and its run as the most commonly used encoding of Unicode began.

To quote:

  • "UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022."

This is a variable-length encoding, so for most English text only one byte is needed per character. Depending on its Unicode code point, a character may be encoded as 1, 2, 3 or 4 bytes.

A bit less obvious is that not all byte sequences make for valid UTF-8 - there are sequences that simply should not happen - see "Invalid sequences and error handling" at the Wikipedia page. This has ramifications when UTF-8 is used as a declared part of some other standard - e.g. XML, which takes it as the default.
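
Both points can be seen directly in Python: the encoded length varies from 1 to 4 bytes per character, and a byte sequence that breaks the UTF-8 rules simply refuses to decode. A minimal sketch:

# how many bytes UTF-8 needs for characters of increasing code point value
for ch in ("A", "é", "€", "\U0001F600"):
    print(hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")

# a lone continuation byte is one of the sequences that should not happen
try:
    b"\x80abc".decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8 at byte offset", err.start)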

Byte Order Marking

As much as I'd like to skip mentioning this, it may come up at times. This is a feature of the UTF style encodings. Interestingly it is both sensible and yet problematic.

I'll defer to Wikipedia for the details, but for this article, here is the key quote:

  • "BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream."

I've mentioned it partly because the tools that I mention later on can show whether a byte order mark (BOM) was noticed and what it indicated.
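
If needed, a BOM is easy to check for yourself by looking at the first few bytes of a file. A rough Python sketch (the file name is just a placeholder):

# look at the first bytes of a file for a known byte order mark
import codecs
from pathlib import Path

head = Path("example.txt").read_bytes()[:4]   # hypothetical file name
if head.startswith(codecs.BOM_UTF8):
    print("UTF-8 BOM (EF BB BF)")
elif head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
    print("UTF-32 BOM")
elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    print("UTF-16 BOM")
else:
    print("no BOM found")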

What Happened

Let's recap the sequence of events:

  • ASCII was defined and people used it
  • the IBM PC came out and programmers used its extended set to draw boxes and build simple GUI-like interfaces
  • Windows came along with fully drawn graphics, and the so-called ANSI set was used to cover most of the Latin-based (i.e. Western European) languages
  • Unicode came along and Microsoft implemented a 16-bit character encoding for it
  • as Unicode settled down, UTF-8 was defined and became the standard used by almost everyone
  • except Microsoft who stubbornly stuck to their own 16-bit encoding.

Not quite in that list is that even though Microsoft changed to a 16-bit encoding, most programs still had to deal with 8-bit encoded files. Whether a given tool would save/store using 8-bit or 16-bit could/can be hard to guess.

In 2022 it's still a mess

The example that has prompted me to write this piece is that I emailed an exported source code file from my Windows-based workplace to my home. When I went to open the file on my Linux system, the text editor complained that it couldn't make sense of the character encoding. To handle the file it would require me to specify the appropriate encoding for reading the file.

This raised the question: did I know for sure what encoding the file really had? On Windows, the file just seemed like any other text file.

In truth I've been through this step many times, so this wasn't a surprise. However, this time I had planned to inspect the file, maybe edit it a little, then email it to someone else. I have no idea what computer platform that person uses.

Ergo, I had no idea what encoding they would want/need to receive as a text file.

Good Tools

In the spirit of this piece I'm going to ignore the many very technical options and just suggest two tools as good for a general user:

  • Notepad++ - for Windows
  • Notepadqq - for Linux

Both of these do a very good job of:

  • recognising the character encoding of a file as they open it
  • letting you specify which encoding the file should be interpreted as
  • letting you convert the encoding to something else

Those features will probably cover most situations.
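
For scripted cases, roughly the same re-encoding that those editors offer can be done in a few lines of Python. The file names and encodings here are just placeholders:

# re-encode a text file: read it as Windows-1252, write it back out as UTF-8
with open("input.txt", "r", encoding="cp1252") as src:      # hypothetical input file
    text = src.read()
with open("output.txt", "w", encoding="utf-8") as dst:      # hypothetical output file
    dst.write(text)
# note: text mode also normalises line endings; pass newline= to control that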

Lacking Tools

From my experience, the feature still missing is:

  • an analyser or verifier of the encoding formats.

An example of this topic is this posted question and the type of answers it received:

Note that the unanswered part of the question is claiming that the suggested solution still only "reports invalid UTF-8 files but doesn't report which lines/characters to fix." (emphasis mine).

Caveat: as I'm old school, when I hit problems of this type I usually break out a hex editor/viewer and look at the raw bytes at issue. But this is usually because I've already spotted where there is an issue to resolve. I don't have a go-to tool for proving that an encoding is consistent and for showing exactly where it is not.
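
Absent such a tool, a small script can at least do the missing part: say where in a file the bytes stop being valid UTF-8. A rough sketch along those lines (not a polished verifier - it reports the first problem on each offending line):

# report the line number and byte offset of invalid UTF-8 sequences in a file
import sys

def report_invalid_utf8(path):
    with open(path, "rb") as f:
        # splitting on b"\n" is safe here: that byte never occurs inside a multi-byte UTF-8 sequence
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8", errors="strict")
            except UnicodeDecodeError as err:
                print(f"{path}: line {lineno}, byte {err.start}: {raw[err.start:err.end]!r}")

if __name__ == "__main__":
    report_invalid_utf8(sys.argv[1])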

False combinations

As noted above, most references to Microsoft's support of Unicode talk of it using "UTF-16" even though it doesn't. Microsoft are not the only case of this. At my workplace I've had to point out that Teradata also uses invalid terminology. In their documentation Teradata say both that they use UTF-16 and that UTF-16 is a fixed two-byte format (it's not; like UTF-8, it is a variable-length encoding), so one of those two claims is false.

Another example seen by this writer is the blind usage of UCS-2 encodings embedded inside something for which they are invalid, such as UTF-8. I saw this in a mass of XML, which led the XML to fail schema validation - that being an important feature of XML.

Hence, something that competent programmers will have to deal with, is the outcomes caused by incompetent ones.

Browsers versus Native tools

This is really only an issue on Windows, where the predominant usage for natively compiled programs is still UCS-2.

However, web browsers live on the Internet, where the predominant usage is UTF-8 - and that provides a point of discord.

If you highlight text in the browser and then paste it into a Windows application, what form of encoding gets copied? Is that determined by the browser (in placing content onto the clipboard) or by the receiving application (as it pastes from the clipboard)?

Ditto for saving from the browser to a file: if the content was UTF-8, does the browser save it as that or as UCS-2? And if saving as UTF-8, does it set the byte order mark?

Arrays versus Lists

Yes finally after all the above, here is the sticking point for programmers:

  • do variable-length encodings, such as UTF-8, affect the ability to handle text strings, or the efficiency of doing so?

As usual, the answer is: it depends.

Fundamentally this is about when they get treated as arrays versus as lists.

  • Now to be pedantic, here what we mean by an array is a form of internal storage in which the place in memory for an item can be simply calculated by taking the base address of the array and adding the size of an element multiplied by its index. Some of you, and some languages, have been using the term "array" for other things. Please stop doing that; you're part of the problem, not part of the solution, to anything.

How important this is, is not quite clear. Indeed it's easier to pose questions than it is to answer them.

Key questions about usage are:

  • How often when writing code do we actually want to jump to the nth character?
  • How often do we want to step through all the characters in the string?
  • How often will the characters be outside the ASCII set?

Key questions about implementations are:

  • What are the methods available?
  • How important is the amount of memory or storage required?
  • How important is predictability for the amount of memory or storage?

As an example, it would be possible to use an array of variable length encoded characters.
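
To make the trade-off concrete: indexing into a UTF-8 byte buffer by byte position is cheap but does not necessarily land on a character boundary, whereas indexing by character needs either a walk from the start or some fixed-width (or indexed) internal representation. A small illustration in Python:

text = "naïve €10"               # a mix of 1-, 2- and 3-byte characters in UTF-8
encoded = text.encode("utf-8")

print(text[2])                   # 'ï' - indexing by character (Python strings behave this way)
print(encoded[2], encoded[3])    # 195 175 - the two bytes that together encode 'ï'
print(len(text), len(encoded))   # 9 characters versus 12 bytes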

And so, having posed more questions than answers, I'm skipping to the next heading! No links for this as there's a bazillion web forum questions and even more attempts at answering them.

Etc

And of course, there are other matters such as:

  • character collation (e.g. Unicode collation algorithm)
  • uppercase/lowercase concepts
  • left-to-right versus right-to-left order
  • and ... well the list is endless really.

And again, the question of whether these are logical issues and/or are matters about the actual character encodings can be hard to pin down.

Python

Re all the above, some quick comments about the situation in Python.

  • Python 2 - yes that had problems, well documented, go see about that elsewhere
  • Python 3 - works with Unicode by default and has explicit features for bringing text in from, and exporting it out to, various encodings (a brief sketch follows this list).
  • in theory there is no need for a Python programmer to know how CPython actually handles text strings internally.
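
As a small illustration of those explicit features (the source encoding and output file name are just placeholders):

# decoding bytes in, encoding text out - the encoding is always named explicitly
raw = b"caf\xe9"                  # bytes as they might arrive from a cp1252 source
text = raw.decode("cp1252")       # bringing it in: bytes -> str
print(text)                       # café

with open("out.txt", "w", encoding="utf-8") as f:   # hypothetical output file
    f.write(text)                 # exporting: str -> UTF-8 bytes on disk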

From which I think there are three takeaways:

  • handling Unicode strings in Python should "just work"
  • they are to be treated by the programmers as ordered lists of characters
  • if the speed or space worries you, it's time to get or write a library in some other language for Python to call upon.

p.s. yes, you can go read the current CPython source code to see how it currently does it. That may or may not help for use of Jython et al.

Post Script

This article has deliberately glossed over a lot of other details, for which you are encouraged to do further reading. Some example points:

  • the CRLF pair instead of a lone LF was inherited by Microsoft from CP/M, which was itself a world of "teletype" terminals, including those that literally printed onto paper, so the distinction between Carriage Return and Line Feed was actually useful. Indeed so was the FF (Form Feed) character - for jumping to the start of the next page/sheet.
  • there were and are many distinct approaches used in various countries around the world
  • ditto the support and use of "code pages"
  • even currency symbols have had complex usage patterns, e.g. the UK placement of their Pound Sterling symbol and later the introduction of the Euro with its symbol to be accommodated.
  • yes - because Unicode also covers the idea of national flags as symbols, and a flag is done as a pair of Unicode characters, such a symbol can in theory be 8 bytes long in UTF-8 (see the small check after this list).
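
That last point is easy to check in Python (the choice of flag here is arbitrary):

# a flag is a pair of "regional indicator" code points, each needing 4 bytes in UTF-8
flag = "\U0001F1E6\U0001F1FA"       # regional indicators A + U
print(len(flag))                    # 2 code points
print(len(flag.encode("utf-8")))    # 8 bytes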

Post-post-post-post-script

An excellent article on UTF-8 can now be seen at:

Comments (2)

geraldew

After writing the above I thought I should at least try some of the available tools on Linux. Here is how that went.

Using file

$ file String_General.bas
String_General.bas: ISO-8859 text, with CRLF line terminators

$ file -bi String_General.bas
text/plain; charset=iso-8859-1

Using encguess

$ encguess String_General.bas
String_General.bas  unknown

Using uchardet

$ uchardet String_General.bas
WINDOWS-1252

Using enca

$ enca -g -L none -d String_General.bas
String_General.bas: Unrecognized encoding
  Failure reason: Multibyte tests failed, language contains no 8bit charsets.

Note: neither 'uchardet' nor 'enca' were present on my Xubuntu setup, but both were easily added from the stock Ubuntu repositories.

Chardet

Among the things that some searching brought up was this one, written in Python. The documentation is worth a read even without trying the program.

Detects:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR, Johab (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)


Oliver Tacke

Nice post! One thing I had to learn the hard way: some languages do not use a single codepoint to display characters, but glyphs - multiple codepoints representing parts of a symbol that are combined if codepoints are arranged in the right sequence.