Darth Espressius

3 Ways to Handle non UTF-8 Characters in Pandas

We've all gotten that error: you download a CSV from the web, or your manager emails it to you wanting analysis done ASAP, and there's a card in your Kanban board labelled URGENT. So you open up VSCode, import Pandas, and type the following: pd.read_csv('some_important_file.csv').

Now, instead of the actual import happening, you get a near-uninterpretable stacktrace, ending in something like:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
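You can reproduce the same kind of error with nothing but the standard library (the byte values here are just an illustration):

```python
# 0x92 is a curly apostrophe in Windows-1252, but an invalid start byte in UTF-8
data = b"you\x92re"
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x92 ...
```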

What does that even mean?! And what the heck is utf-8? As a brief primer/crash course: your computer, like all computers, stores everything as bits (series of ones and zeros). In order to represent human-readable characters from those ones and zeros, the American Standards Association came up with the ASCII mapping, which assigns codes (in base-10, so numbers) to byte patterns, each code representing a character. For example, 00111111 is the binary for 63, which is the code for ?.

These characters then come together to form words, which form sentences. The number of unique characters ASCII can handle is limited by the number of unique byte values available: 8 bits allow for 256 unique characters, which is nowhere close to handling every single character from every single language. This is where Unicode comes in; Unicode assigns each character a "code point", written in hexadecimal. For example, U+1F602 maps to 😂. This way there are over a million possible code points, far broader than the original ASCII.
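You can see this mapping directly in Python, where ord and chr convert between a character and its code point:

```python
# The emoji above sits at code point U+1F602
print(hex(ord("😂")))  # -> 0x1f602
print(chr(0x1F602))    # -> 😂
```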

UTF-8

UTF-8 translates Unicode characters to a unique binary string, and vice versa. However, as its name suggests, UTF-8 uses 8-bit units: the most common characters (the ASCII range) fit in a single byte, while rarer characters take two, three, or four. This is similar in spirit to Huffman coding, which represents the most-used characters or tokens with the shortest codes; we can afford to give the least-used characters the longer byte sequences, since they appear less often. If every character were stored in 4 bytes instead, most text files you have would take up roughly four times the space.
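A quick sketch of that variable-width behaviour: an ASCII character encodes to a single byte, while characters further up the Unicode range take progressively more.

```python
# Each character's UTF-8 encoding grows with its code point
for ch in ("A", "é", "€", "😂"):
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, and 4 bytes respectively
```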

Caveat

However, there is a catch: UTF-8 can only decode files that were actually written as UTF-8. A file saved in any other encoding (Latin-1, UTF-16, EUC-JP, and so on) will contain byte sequences that are invalid UTF-8, and the read fails. This raises a key limitation, especially in the field of data science: sometimes we either don't need the non-UTF-8 characters, can't process them, or need to save on space. Therefore, here are three ways I handle non-UTF-8 characters for reading into a Pandas dataframe:

Find the Correct Encoding Using Python

Pandas, by default, assumes utf-8 encoding every time you do pandas.read_csv, and figuring out the correct encoding can feel like staring into a crystal ball. Your first bet is vanilla Python:

with open('file_name.csv') as f:
    print(f)  # the file object's repr includes the encoding Python chose

Most of the time, the output resembles the following (strictly speaking, this is the encoding Python will use to open the file, i.e. the platform default, so treat it as a hint rather than a true detection):

<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf16'>


If that fails, we can move on to the second option.

Find the Encoding Using Python's Chardet

chardet is a library for detecting character encodings; once installed (pip install chardet), you can use the following to determine a file's encoding:

import chardet

# chardet works on raw bytes, so open the file in binary mode
with open('file_name.csv', 'rb') as f:
    print(chardet.detect(f.read()))

The output should resemble the following:

{'encoding': 'EUC-JP', 'confidence': 0.99}

Finally

The last option is the Linux CLI (fine, I lied when I said three methods using Pandas):

iconv -f utf-8 -t utf-8 -c filepath -o CLEAN_FILE
  1. The first utf-8 (after -f) is what we think the original file's encoding is
  2. -t is the target encoding we wish to convert to (in this case utf-8)
  3. -c skips invalid sequences
  4. -o writes the fixed file to an actual filepath (instead of the terminal)
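Putting the flags together on a throwaway file (the file names here are just for illustration):

```shell
# Create a file containing a lone 0xE9 byte, which is invalid UTF-8
printf 'id,name\n1,caf\xe9\n' > dirty.csv

# Strip anything that is not valid UTF-8 and write a clean copy
iconv -f utf-8 -t utf-8 -c dirty.csv -o clean.csv

cat clean.csv  # the invalid byte is gone; the rest of the file survives
```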

Now that you have your encoding, you can go on to read your CSV file successfully by specifying it in your read_csv call, for example:

pd.read_csv("some_csv.txt", encoding="utf-16")  # substitute whichever encoding you found
