Benjamin Levin

Python strings are not UTF-8

Over the past ten years, UTF-8 has become the dominant encoding for Unicode text. Virtually every modern website is served as UTF-8, and most modern programming languages tout first-class support for it.

For those who don't know what UTF-8 is about, it's a variable-width binary encoding for Unicode code points. Many other write-ups have been done on it, which you should absolutely consult for more details. Right now, my focus is the variable-width aspect, which is very useful but has two annoying consequences, both demonstrated below:

  1. Code points cannot be randomly indexed in constant time.

  2. The number of code points in a string cannot be determined in constant time.
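Both consequences come from the same fact: a single code point can take anywhere from one to four bytes in UTF-8. A quick REPL session makes that concrete:

>>> [len(c.encode("utf-8")) for c in "aéअ🎉"]
[1, 2, 3, 4]
>>> len("🎉"), len("🎉".encode("utf-8"))    # one code point, four bytes
(1, 4)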

If you've worked with Rust or Go, these limitations are familiar. Those languages make working with UTF-8 much less verbose and error-prone than, say, C, but they do not hide these two annoyances. Instead, they let you either work directly with the raw bytes or use syntax that explicitly iterates over the code points of a string to count or index them.

But what about Python? In Python, we regularly do things like:

>>> s = 'foobar'
>>> s[3]
'b'
>>> len(s)
6

It's believable that Python might implement both of these operations as linear-time iteration over the code points of the string; Python hides stuff like this under the hood all the time. We can check this unscientifically by timing how long it takes to index into different positions of a large string:

>>> import timeit
>>> timeit.timeit('assert s[3] == "b"', 's = "foobar" * 100_000', number=1_000_000)
0.04597386400200776
>>> timeit.timeit('assert s[500001] == "b"', 's = "foobar" * 100_000', number=1_000_000)
0.04409515299994382

If Python were iterating over code points to do string indexing, we'd see a noticeable difference in the runtimes of these two snippets, but we don't. So what's going on? The short answer is that Python strings aren't UTF-8.

Since Python 3.0, UTF-8 has been the default source code encoding, and if you construct a string from a bytes object, the default encoding is UTF-8. So you could be forgiven for assuming, as I did, that Python's str type is UTF-8.
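Both directions of that conversion default to UTF-8:

>>> "é".encode()           # str -> bytes defaults to UTF-8
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode()   # bytes -> str defaults to UTF-8 too
'é'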

So what is it actually? I found out by looking through the source code for Python's Unicode implementation, but it's also documented in PEP 393. Basically, under the hood, modern versions of Python keep tabs on the contents of every str object. If every code point in it fits in the Latin-1 range (8 bits), it uses that one-byte-per-character representation in memory; otherwise, if every code point fits in 16 bits, it uses a fixed two-byte (UCS-2) representation; otherwise it falls back to four bytes per code point (UCS-4).
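You can watch the representation switch widths with sys.getsizeof. The exact byte counts below are what a typical 64-bit CPython build reports (they'll differ slightly across versions), but the per-character growth is the point:

>>> import sys
>>> sys.getsizeof("a" * 1000)     # every code point fits in Latin-1: 1 byte each
1049
>>> sys.getsizeof("я" * 1000)     # Cyrillic needs the 2-byte representation
2074
>>> sys.getsizeof("🎉" * 1000)    # emoji forces 4 bytes per code point
4076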

This fixed-width encoding is not bad. If you're mostly working with homogeneous European text, it's just as space-efficient as UTF-8, but it gives you constant-time indexing and length computation. There are two trade-offs, of course:

  1. Text with even a single character outside the Latin-1 range instantly doubles in memory usage, and doubles again with a single character outside the 16-bit range.

  2. Modifying a string can potentially require reallocating the entire string to a new code point width.

Python strings are immutable, so I'm sure the devs felt the second concern wasn't important: anyone "modifying" a string has already signed up for a fresh allocation anyway. In reality, though, I think most Python users implicitly consider mystring = mystring + 'foo' to be an inexpensive operation. That leads to our next benchmark:

>>> timeit.timeit('s + "foobar"', 's = "foobar" * 100_000', number=100_000)
2.1425487420019635
>>> timeit.timeit('s + "🏴󠁧󠁢󠁳󠁣󠁴󠁿"', 's = "foobar" * 100_000', number=100_000)
10.080080990999704

Starting with a 600,000-character Latin-1 string, it takes roughly 5x longer to concatenate the Scottish flag emoji than to concatenate another 6 Latin-1 characters. To be fair, the Scottish flag emoji weighs in at a whopping 7 code points, but it's the fact that they force the 4-byte representation that causes this, not that there are 7 of them.

>>> sys.getsizeof("foobar" * 100_000)
600049
>>> sys.getsizeof("foobar" * 100_000 + "🏴󠁧󠁢󠁳󠁣󠁴󠁿")
2400104

Adding that one emoji quadruples the size of the object in memory.
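If you want to convince yourself that the flag really is 7 code points, all of them outside the 16-bit range, you can spell it out with escape sequences (the tag characters are invisible otherwise):

>>> flag = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"   # black flag + tags g, b, s, c, t + cancel tag
>>> len(flag)
7
>>> min(ord(c) for c in flag) > 0xFFFF    # every code point needs the 4-byte representation
True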

Every serious Python programmer is well aware that these kinds of performance traps exist, particularly when you do something the language's designers didn't anticipate as a common use case. But characters that require one or more 32-bit Unicode code points to represent are becoming more and more common, and programming language support is not as great as I thought it would be.

Don't believe me? Then I challenge you to write me a Python function that recognizes palindromes. That's basically grade-school stuff, right? Here are my test cases:

>>> is_palindrome("🏴󠁧󠁢󠁳󠁣󠁴󠁿🤷🏾‍♂️🏴󠁧󠁢󠁳󠁣󠁴󠁿")
True
>>> is_palindrome("🇬🇧🇧🇬")
False
>>> is_palindrome("👩‍🍼🍼‍👩")
False
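For reference, here's a sketch of the naive grade-school attempt, reversing the code points. It fails all three cases: reversal scrambles the flag's tag sequence in the first one, and leaves the flag pairs and ZWJ sequences "intact" in the other two. A correct version has to reverse grapheme clusters instead, e.g. with the third-party regex module's \X pattern.

# Naive attempt: reverse the code points. Not a solution -- it gets every
# test case above wrong, because code points are not what humans perceive
# as characters (grapheme clusters).
def naive_is_palindrome(s: str) -> bool:
    return s == s[::-1]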

Leave a comment if you think you have a nifty solution to this.

There's a lot of mishandling and misinformation surrounding Unicode out there on the web, so if there's interest, I may write a few more posts like this clarifying some pitfalls.

Top comments (1)

deveringham

🙋‍♂️ As a Python dev who regularly uses emojis everywhere for clarity (variable/function/parameter names, debug print statements, file names, data file headers, etc.), this is 💯. I hope this article makes these practices accessible to more devs! 😁👌