Summary: Set the PYTHONUTF8=1 environment variable.
On macOS and Linux, UTF-8 is the standard encoding already.
But Windows still uses a legacy encoding (e.g. cp1252 or cp932) as the system encoding.
Python handles file names and console I/O very well (e.g. it uses the ~W Unicode APIs). But the legacy system encoding is still used as the default encoding for text files and pipes.
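To see which encodings your own machine uses, here is a small sketch. It assumes that locale.getpreferredencoding(False) reflects what open() uses when no encoding is given, which is my understanding of how Python 3 behaves. The variable names are only for illustration.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given
# (assumption: this matches the text-file default on your Python version).
default_text_encoding = locale.getpreferredencoding(False)

# Encoding used for file names; on Windows this is normally "utf-8"
# even when the text-file default is a legacy code page.
filesystem_encoding = sys.getfilesystemencoding()

print("default text encoding:", default_text_encoding)
print("file system encoding: ", filesystem_encoding)
```

On a typical Windows setup without UTF-8 mode, the first value is likely to be a legacy code page while the second is UTF-8.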
Omitting the encoding="utf-8" option is a very common mistake. Developers who use macOS or Linux don't have any trouble with this mistake, so it often goes unnoticed.
For example, even the packaging tutorial on packaging.python.org uses this code snippet:
    with open("README.md", "r") as fh:
        long_description = fh.read()
When README.md contains non-ASCII characters (e.g. Unicode emoji), setup.py will fail on Windows. Windows users cannot install the package when a wheel is not provided. (I have already sent a pull request to fix this example code.)
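The fix is to pass the encoding explicitly. The sketch below is only an illustration, not the exact code from the tutorial or my pull request: it writes a hypothetical README.md into a temporary directory and reads it back with encoding="utf-8", so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical README content containing a non-ASCII character (an emoji).
readme_text = "My package \U0001F389\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "README.md")

    # Write with an explicit encoding so the bytes on disk are UTF-8
    # regardless of the system encoding.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(readme_text)

    # Read with the same explicit encoding instead of relying on the default.
    with open(path, "r", encoding="utf-8") as fh:
        long_description = fh.read()
```

With the explicit encoding, the same code behaves identically on Windows, macOS and Linux.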
I believe many Python programmers suffer from this default text file encoding on Windows because:
- The default encoding of Python source code is UTF-8
- UTF-8 is the standard encoding of the Web
- Modern text editors like VS Code use UTF-8 by default. Even notepad.exe now uses UTF-8 as its default encoding!
But it is difficult to change the default encoding of text files because it is a backward-incompatible change. It would break legacy applications that rely on the legacy encoding.
But there is good news: Python 3.7 introduced the "UTF-8 mode". (Thanks to Victor Stinner!!)
When UTF-8 mode is enabled, Python uses UTF-8 as the default encoding for text files instead of the system encoding.
You can then live in a world where "UTF-8 is the default; other legacy encodings are used only when explicitly specified", just like on macOS and Linux.
To enable UTF-8 mode:
- Set the PYTHONUTF8=1 environment variable, or
- Use the -X utf8 command line option.
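If you want to confirm that UTF-8 mode is really active, the following sketch starts a child interpreter in each of the two ways and asks it to report sys.flags.utf8_mode and its preferred encoding. It assumes Python 3.7 or later. The helper name report_utf8_mode is made up for this example.

```python
import os
import subprocess
import sys

# One-liner run in the child interpreter: print the UTF-8 mode flag
# and the encoding used as the default for text files.
REPORT = (
    "import sys, locale; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)


def report_utf8_mode(extra_args=(), extra_env=None):
    """Hypothetical helper: run a child Python and return (enabled, encoding)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a clean state
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", REPORT],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    flag, encoding = result.stdout.split()
    # Normalize case because the reported spelling differs between versions.
    return flag == "1", encoding.lower()


# Enable UTF-8 mode through the environment variable.
via_env = report_utf8_mode(extra_env={"PYTHONUTF8": "1"})

# Enable UTF-8 mode through the command line option.
via_option = report_utf8_mode(extra_args=["-X", "utf8"])

print("PYTHONUTF8=1:", via_env)
print("-X utf8:     ", via_option)
```

Both calls should report that UTF-8 mode is enabled and that the default text encoding is UTF-8, whatever the system encoding is.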