Jonathan Bowman

Posted on Oct 4, 2020 • Edited on Feb 9 • Originally published at bowmanjd.com

Character Encodings and Detection with Python, chardet, and cchardet

#python

If your name is José, you are in good company. José is a very common name. Yet, when dealing with text files, sometimes José will appear as JosÃ©, or as some other mangled array of symbols and letters. Or, in some cases, Python will fail to convert the file to text at all, complaining with a UnicodeDecodeError.

Unless they deal only with numerical data, every data jockey or software developer needs to face the problem of encoding and decoding characters.

Why encodings?

Ever heard or asked the question, "why do we need character encodings?" Indeed, character encodings cause heaps of confusion for software developer and end user alike.

But ponder for a moment, and we all have to admit that the "do we need character encoding?" question is nonsensical. If you are dealing with text and computers, then there has to be encoding. The letter "a", for instance, must be recorded and processed like everything else: as a byte (or multiple bytes). Most likely (but not necessarily), your text editor or terminal will encode "a" as the number 97. Without the encoding, you aren't dealing with text and strings. Just bytes.
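A quick way to see that number for yourself, as a minimal sketch using only Python built-ins (the variable names are just for illustration):

```python
# Look up the number behind a character, and the character behind a number.
letter = "a"
code_point = ord(letter)          # the integer code for the character
encoded = letter.encode("ascii")  # the same character as a bytes object

print(code_point)       # 97
print(list(encoded))    # [97]
print(chr(code_point))  # a
```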

Encoding and decoding

Think of character encoding like a top secret substitution cipher, in which every letter has a corresponding number when encoded. No one will ever figure it out!


a: 61	g: 67	m: 6d	s: 73	y: 79
b: 62	h: 68	n: 6e	t: 74	z: 7a
c: 63	i: 69	o: 6f	u: 75
d: 64	j: 6a	p: 70	v: 76
e: 65	k: 6b	q: 71	w: 77
f: 66	l: 6c	r: 72	x: 78

Let's do the encoding with a table like the above and write everything as numbers:

print("\x73\x70\x61\x6d")

The above 4 character codes are hexadecimal: 73, 70, 61, 6d (the escape code \x is Python's way of designating a hexadecimal literal character code). In decimal, that's 115, 112, 97, and 109. Try the above print statement in a Python console or script and you should see our beloved "spam". It was automatically decoded in the Python console, printing the corresponding letters (characters).

But let's be more explicit, creating a byte string of the above numbers, and specifying the ASCII encoding:

b"\x73\x70\x61\x6d".decode("ascii")

Again, "spam". A canned response, if I ever heard one.

We are encoding and decoding! There you have it.
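So far we have only decoded. Going the other way, from text to bytes, might look like this small sketch (variable names are illustrative):

```python
# Encode text to bytes, inspect the hexadecimal codes, then decode again.
text = "spam"
encoded = text.encode("ascii")

# Format each byte as two hexadecimal digits, matching the table above.
hex_codes = [format(byte, "02x") for byte in encoded]
print(hex_codes)  # ['73', '70', '61', '6d']

round_trip = encoded.decode("ascii")
print(round_trip == text)  # True
```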

The complex and beautiful world beyond ASCII

What happens, however, with our dear friend José? In other words, what is the number corresponding to the letter "é"? Depends on the encoding. Let's try number 233 (hexadecimal e9), as somebody told us that might work:

b"\x4a\x6f\x73\xe9".decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)

That didn't go over well. The error complains that 233 is not in the 0-127 range that ASCII uses.
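To see which bytes fall outside that range, something like the following sketch may help. It also shows the errors="replace" option of decode(), which swaps undecodable bytes for the Unicode replacement character instead of raising. That is useful for inspection, but it loses data.

```python
data = b"\x4a\x6f\x73\xe9"

# Collect every byte value that ASCII (0-127) cannot represent.
outside_ascii = [byte for byte in data if byte > 127]
print(outside_ascii)   # [233]
print(data.isascii())  # False (bytes.isascii() needs Python 3.7+)

# Lossy decode: the offending byte becomes U+FFFD, the replacement character.
lossy = data.decode("ascii", errors="replace")
print(ascii(lossy))    # 'Jos\ufffd'
```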

No problem. We heard of this thing called Unicode, specifically UTF-8. One encoding to rule them all! We can just use that:

b"\x4a\x6f\x73\xe9".decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data

Still, no dice! After much experimentation, we find the ISO-8859-1 encoding. This is a Latin (i.e. European-derived) character set, but it works in this case, as the letters in "José" are all Latin.

b"\x4a\x6f\x73\xe9".decode("iso-8859-1")

'José'

So nice to have our friend back in one piece.
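Incidentally, the mangled "JosÃ©" from the introduction is what tends to happen in the opposite direction: text saved as UTF-8, then decoded with a Latin encoding. A minimal sketch, assuming cp1252 as the mistaken encoding:

```python
name = "José"

# In UTF-8, "é" is stored as two bytes: c3 a9.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'Jos\xc3\xa9'

# Decoding those two bytes as cp1252 yields two separate characters.
mangled = utf8_bytes.decode("cp1252")
print(mangled)     # JosÃ©

# If nothing else has touched the text, reversing the mistake repairs it.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired)    # José
```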

ISO-8859-1 works if all you speak is Latin.

That is not José. It is a picture of another friend, who speaks Latin.

UTF-8 is our friend

Once upon a time, everyone spoke "American" and character encoding was a simple translation of 128 characters to codes and back again (the ASCII character encoding, a subset of which is demonstrated above). The problem is, of course, that if this situation ever did exist, it was the result of a then U.S.-dominated computer industry, or simple short-sightedness, to put it kindly (ethnocentric and complacent may be more descriptive and accurate, if less gracious). Reality is much more complex. And, thankfully, the world is full of a wide range of people and languages.

Good thing that Unicode has happened, and there are character encodings that can represent a wide range of the characters used around the world. You can see non-ASCII names such as "Miloš" and "María", as well as 张伟. One of these encodings, UTF-8, is common. It is used on this web page, and has been the default encoding for Python source code, str.encode(), and bytes.decode() since Python version 3.

With UTF-8, a character may be encoded as a sequence of 1, 2, 3, or 4 bytes. This covers a wealth of characters, including ♲, 水, Ж, and even 😀. UTF-8, being variable width, is even backwards compatible with ASCII. In other words, "a" is still encoded as the single byte 97.
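The variable width is easy to check. Here is a small sketch that counts the UTF-8 bytes for one sample character of each width (the sample list is my own choice):

```python
# One sample character for each UTF-8 width.
samples = ["a", "Ж", "水", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
widths = {char: len(char.encode("utf-8")) for char in samples}

for char, width in widths.items():
    # Print the code point rather than the character, to stay console-safe.
    print(f"U+{ord(char):04X} takes {width} byte(s) in UTF-8")

# ASCII compatibility: "a" encodes to the same single byte either way.
ascii_compatible = "a".encode("utf-8") == "a".encode("ascii")
print(ascii_compatible)  # True
```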

Character encoding detection

While ubiquitous, UTF-8 is not the only character encoding, as José so clearly discovered above.

For instance, dear Microsoft Excel often saves CSV files in a Latin encoding (unless you have a newer version and explicitly select UTF-8 CSV).

How do we know what to use?

The easiest way is to have someone decide, and communicate clearly. If you are the one doing the encoding, select an appropriate version of Unicode, UTF-8 if you can. Then always decode with UTF-8. This is usually the default in Python since version 3, although open() falls back to the platform's locale encoding (often cp1252 on Windows), so it is safer to pass encoding="utf-8" explicitly. If you are saving a CSV file from Microsoft Excel, know that the "CSV UTF-8" format uses the character encoding "utf-8-sig" (a byte order mark, or BOM, is written at the start of the file to mark it as UTF-8). If using the more traditional and painful Microsoft Excel CSV format, the character encoding is likely "cp1252", which is a Latin encoding.
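To make the difference between these encodings concrete, here is a minimal sketch using only the standard library. The sample name is an assumption; real Excel output will of course contain whole rows of data.

```python
import codecs

name = "José"

# "utf-8-sig" prepends the three-byte UTF-8 signature (BOM) when encoding.
with_bom = name.encode("utf-8-sig")
print(with_bom)                               # b'\xef\xbb\xbfJos\xc3\xa9'
print(with_bom.startswith(codecs.BOM_UTF8))   # True

# Plain "utf-8" keeps the BOM as an invisible first character;
# "utf-8-sig" strips it when decoding.
plain_decode = with_bom.decode("utf-8")
sig_decode = with_bom.decode("utf-8-sig")
print(ascii(plain_decode))  # '\ufeffJos\xe9'
print(ascii(sig_decode))    # 'Jos\xe9'

# The traditional Excel CSV format would more likely store one byte for "é".
legacy = name.encode("cp1252")
print(legacy)               # b'Jos\xe9'
```

Decoding BOM-prefixed data with plain "utf-8" does not fail, but the stray "\ufeff" tends to end up glued to the first column header, which is a common source of confusing bugs.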

Don't know? Ask.

But what happens if the answer is "I don't know"? Or, more commonly, "we don't use character encoding" (🤦). Or even "probably Unicode?"

These all should be interpreted as "I don't know."

chardet, the popular Python character detection library

If you do not know what the character encoding is for a file you need to handle in Python, then try chardet.

pip install chardet

Use something like the above to install it in your Python virtual environment.

Character detection with chardet works something like this:

import chardet
name = b"\x4a\x6f\x73\xe9"
detection = chardet.detect(name)
print(detection)
encoding = detection["encoding"]
print(name.decode(encoding))

That may have worked for you, especially if the name variable contains a lot of text with many non-ASCII characters. In this case, it works on my machine with just "José" but it cannot be very confident, and chardet might get it wrong in other similar situations. Summary: give it plenty of data, if you can. Even b'Jos\xe9 Gonz\xe1lez' will result in more accuracy.

Did you see in response to print(detection), that there is a confidence level? That can be helpful.

Two ways to use character detection

There are two ways I might use the chardet library.

First, I could use chardet.detect() in a one-off fashion on a text file, to determine once what the character encoding will be on subsequent engagements. Let's say there is a source system that always exports a CSV file with the same character encoding. When I contact the ever-helpful support line, they kindly inform me that they have no clue what character encoding even is, so I know I am left to my own devices. Good thing the device I have is chardet. I use it on a large source file, and determine that the encoding is cp1252 (no big surprise). Then I write my code to always use with open("filename.csv", encoding="cp1252") as filehandle: and go on my merry way. I don't need character detection anymore.
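Once the encoding is known, reading the export needs nothing beyond the standard library. The sketch below writes a tiny stand-in file first so that it runs on its own. The file name, its contents, and the KNOWN_ENCODING constant are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Determined once, with chardet, on a large sample from the source system.
KNOWN_ENCODING = "cp1252"

with tempfile.TemporaryDirectory() as tmpdir:
    # Stand-in for the real export: cp1252 bytes, one byte per accented letter.
    csv_path = Path(tmpdir) / "export.csv"
    csv_path.write_bytes(b"name,city\r\nJos\xe9,M\xe1laga\r\n")

    # Always open with the known encoding; no detection needed anymore.
    with open(csv_path, encoding=KNOWN_ENCODING, newline="") as filehandle:
        rows = list(csv.reader(filehandle))

print(rows)  # [['name', 'city'], ['José', 'Málaga']]
```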

The second scenario is more complex. What if I am creating a tool to handle arbitrary text files, and I will never know in advance what the character encoding is? In these cases, I will always want to import chardet and then use chardet.detect(). I may want to throw an error or warning, though, if the confidence level is below a certain threshold. If confident, I will use the suggested encoding when opening and reading the file.
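One way to sketch that threshold check is below. The helper decode_if_confident, the LowConfidenceError class, and the 0.6 default are my own hypothetical choices, not part of chardet. The detector is passed in as an argument, and a stand-in is used here so the example runs without chardet installed. In real use you would pass chardet.detect.

```python
class LowConfidenceError(ValueError):
    """Raised when the detector is not sure enough (hypothetical helper)."""


def decode_if_confident(blob, detect, threshold=0.6):
    """Decode blob with the encoding detect(blob) suggests, if confident."""
    detection = detect(blob)
    encoding = detection.get("encoding")
    # Treat a missing or None confidence as zero.
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < threshold:
        raise LowConfidenceError(
            f"Detected {encoding!r} with confidence {confidence}"
        )
    return blob.decode(encoding)


def fake_detect(blob):
    """Stand-in for chardet.detect, returning a fixed, made-up result."""
    return {"encoding": "cp1252", "confidence": 0.73}


text = decode_if_confident(b"Jos\xe9", fake_detect)
print(text)  # José
```

Raising an exception rather than printing a warning is a design choice: it forces the caller to decide what to do with a doubtful file instead of silently processing mangled text.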

cchardet, the crazy-fast Python character detection library

In the second scenario above, I may appreciate a performance boost, especially if it is an operation that is repeated frequently.

Enter cchardet, a faster chardet. It is a drop-in replacement.

Install it with something like:

pip install cchardet

Import it thusly, for compatibility with chardet:

import cchardet as chardet
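If cchardet might not be installed everywhere your code runs, one possible pattern is to prefer it and fall back to chardet. The load_detector helper below is a hypothetical name of my own. It returns None when neither library is available, so the sketch runs either way.

```python
import importlib


def load_detector():
    """Return cchardet if importable, else chardet, else None."""
    for module_name in ("cchardet", "chardet"):
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue  # try the next candidate
    return None


detector = load_detector()
if detector is None:
    print("Neither cchardet nor chardet is installed.")
else:
    print(f"Using {detector.__name__} for detection.")
```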

A simple command line tool

Here is a full example using cchardet, with the ability to read a filename from the command line:

"""A tool for reading text files with an unknown encoding."""

from pathlib import Path
import sys

import cchardet as chardet


def read_confidently(filename):
    """Detect encoding and return decoded text, encoding, and confidence level."""
    filepath = Path(filename)

    # We must read as binary (bytes) because we don't yet know encoding
    blob = filepath.read_bytes()

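    detection = chardet.detect(blob)
    encoding = detection["encoding"]
    confidence = detection["confidence"]
    if encoding is None:
        # Detection can fail outright, for example on an empty file
        raise ValueError(f"Could not detect the encoding of {filename}")
    text = blob.decode(encoding)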

    return text, encoding, confidence


def main():
    """Command runner."""
    filename = sys.argv[1]  # assume first command line argument is filename
    text, encoding, confidence = read_confidently(filename)
    print(text)
    print(f"Encoding was detected as {encoding}.")
    if confidence < 0.6:
        print(f"Warning: confidence was only {confidence}!")
        print("Please double-check output for accuracy.")


if __name__ == "__main__":
    main()

You can also download this code from GitHub here.

Save the above as detect.py in an appropriate directory, along with a text file. Then, from the terminal, in that directory, something like the following (use python instead of python3 if necessary) should work:

python3 detect.py somefile.csv

Do you see output and detected encoding?

I welcome comments below. Feel free to suggest additional use cases, problems you encounter, or affirmation of the cute pig picture above.

You are welcome to view and test the code along with some text file samples at the associated GitHub repo. Some variation of the following should get you up and running:

git clone https://github.com/bowmanjd/python-chardet-example.git
cd python-chardet-example/
python3 -m venv .venv
. .venv/bin/activate
pip install cchardet
python detect.py sample-latin1.csv

Enjoy the characters.
