curioustore

Posted on • Originally published at var.gg

Python 3.15's UTF-8 Default (PEP 686): I Verified the Silent Byte Changes on Windows

I ran the same Python script on two versions. Both print 문서 to the screen, identically. But the moment I redirected that output to a file with > out.bin, the bytes left on disk diverged. One side wrote 4 bytes, the other 6.

python 3.14.6:  b9 ae bc ad
python 3.15.0b2: eb ac b8 ec 84 9c

Not a single line of code changed. I only bumped the interpreter version, yet the same characters 문서 get written to disk as different bytes. This is the defining scene of the change coming in Python 3.15, which ships its final release on October 1, 2026. The contract isn't the characters you see on screen — it's the bytes that actually get stored.
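The divergence can be reproduced on a single interpreter by encoding the same string explicitly with each codec. This is a minimal sketch; the variable names are illustrative and not part of the original experiment.

```python
# The same two characters, serialized under the two defaults discussed above.
text = "문서"

cp949_bytes = text.encode("cp949")   # what a CP949-default stream writes
utf8_bytes = text.encode("utf-8")    # what a UTF-8-default stream writes

print(cp949_bytes.hex(" "))  # b9 ae bc ad
print(utf8_bytes.hex(" "))   # eb ac b8 ec 84 9c
```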

This article is a hands-on record of touching that change directly. On a Korean-language Windows environment, I pulled down two real builds, Python 3.14.6 and 3.15.0b2, and ran the same code A/B to verify — byte by byte — what changes and what doesn't, and where the silent breakage that's more dangerous than an outright error shows up.

[!NOTE]
Every measurement in this article was obtained by the author personally running two of python.org's embedded (portable) builds — python-3.14.6-embed-amd64 and python-3.15.0b2-embed-amd64 — under a Windows 11 Korean locale. 3.15.0b2 is the second beta, released on June 2, 2026; the feature is fully implemented, but this is not yet a production-recommended version.

Before we begin: encoding as a "translation dictionary"

Before the main discussion, let me plainly define a few terms you'll need to follow this article all the way through. Readers who already know them can skip ahead.

Computers don't understand characters. What gets stored on disk is, ultimately, a sequence of numbers (bytes). So we need an agreed-upon rule for which character gets written as which bytes, and that rule is called an encoding. Think of it as a translation dictionary that converts back and forth between characters and bytes.

  • UTF-8 — the de facto standard encoding of today's internet and modern text files, representing every character in the world in 1–4 bytes. A single Hangul character is usually 3 bytes.
  • CP949 — the Korean code page (the Windows-style character table) Microsoft uses. A single Hangul character is 2 bytes. Old Windows Korean-language programs, .txt files saved from Notepad, Excel CSVs, and the like commonly use this encoding. (It has the aliases ms949 and uhc, and while it overlaps heavily with its cousin euc-kr, the two are not the same standard.)
  • Locale — the operating system's bundle of regional settings (language, country, date format, and so on). Traditionally, the locale also decided the default for "what to use when no encoding is specified." The default code page on Korean Windows is usually 949, i.e. CP949.
  • UTF-8 mode — a runtime switch decided when Python starts. When it's on, Python uses UTF-8 instead of the locale for several defaults. PEP 686 is exactly what flips this switch's default.

If I had to state the crux in one sentence, it's this.

Python 3.15 does not convert your existing CP949 files to UTF-8. It only swaps the default translation dictionary that Python picks on your behalf when you don't specify an encoding — from "regional convention (CP949)" to "UTF-8."

When a package has no label describing its contents, the recipient has to guess what's inside by convention. Code that omits encoding= is like a box with that label torn off. PEP 686 is a policy that changes the default convention to "if an unlabeled box arrives, assume UTF-8." The data inside the box itself doesn't change.

Exactly what changes — the defaults I measured directly

First I launched both interpreters with no options and printed out the defaults. This is an environment where the Korean ACP (ANSI code page) is 949.

| Item | Python 3.14.6 | Python 3.15.0b2 | Note |
| --- | --- | --- | --- |
| sys.flags.utf8_mode | 0 | 1 | the master switch flips |
| open(path) default encoding | cp949 | utf-8 | the core change |
| sys.stdout.encoding (file/pipe) | cp949 | utf-8 | changed |
| locale.getpreferredencoding(False) | cp949 | utf-8 | changed |
| locale.getencoding() | cp949 | cp949 | unchanged |
| sys.getfilesystemencoding() | utf-8 | utf-8 | already UTF-8 |
| encoding="locale" explicit | cp949 | cp949 | unchanged |
| encoding="utf-8" / "cp949" explicit | as-is | as-is | unchanged |
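A probe along the following lines could collect the same defaults on any machine. It is a sketch rather than the script used for the measurements above; the function name `probe_defaults` and the dictionary keys are hypothetical, and `locale.getencoding()` is only called where it exists (Python 3.11+).

```python
import locale
import os
import sys
import tempfile


def probe_defaults() -> dict:
    """Collect the text-encoding defaults this interpreter picked at startup."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as f:  # deliberately no encoding=, to observe the default
            open_default = f.encoding
    finally:
        os.remove(path)
    return {
        "version": sys.version.split()[0],
        "utf8_mode": sys.flags.utf8_mode,
        "open_default": open_default,
        "stdout": getattr(sys.stdout, "encoding", None),
        "getpreferredencoding(False)": locale.getpreferredencoding(False),
        "getencoding": locale.getencoding() if hasattr(locale, "getencoding") else None,
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in probe_defaults().items():
    print(f"{name:30} {value}")
```

Running the same file under both interpreters and diffing the output gives a quick A/B view of which rows move.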

Let me start with the point that's most often misunderstood. The filesystem encoding (how file names and paths are handed to the OS) is already UTF-8 in both versions. On Windows it has been ever since PEP 529 in Python 3.6. In other words, the ability to find a file name like C:\자료\한글.txt does not change this time. What changes is the default dictionary used when moving the contents of that file between bytes and strings.

These are two completely separate questions.

  • Can you locate a file named C:\자료\한글.txt? → path encoding (unchanged)
  • Do you read that file's contents as CP949 or as UTF-8? → text I/O encoding (changed this time)

Same program, different bytes

open(p, "w").write("한글 데이터: 매출 1234원") — I ran this write, with no encoding specified, on both versions and compared the bytes left on disk.

3.14.6 (cp949):  24 bytes  c7 d1 b1 db 20 ... bf f8
3.15.0b2 (utf-8): 32 bytes  ed 95 9c ea b8 80 20 ... ec 9b 90

The write succeeds without error on both sides. I only bumped the interpreter, yet the output file's bytes silently change. The breakage surfaces not here, but later, in whoever else reads that file — an Excel that expects CP949, an older Notepad, the neighboring team's legacy script.
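Pinning the encoding makes the bytes on disk independent of the interpreter version. The sketch below writes the same string under each explicit contract and reads the raw bytes back; `bytes_on_disk` is a hypothetical helper introduced only for illustration.

```python
import os
import tempfile

TEXT = "한글 데이터: 매출 1234원"


def bytes_on_disk(text: str, encoding: str) -> bytes:
    """Write text with an explicit encoding and return the raw bytes stored."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


legacy = bytes_on_disk(TEXT, "cp949")
modern = bytes_on_disk(TEXT, "utf-8")
print(len(legacy), len(modern))  # 24 32
```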

Distinguishing the three layers

Why does the "screen is fine but only the file changed" situation arise? Because Windows text I/O is actually split across three different boundaries.

📊 The diagram for this section renders in the original article on var.gg →

A directly attached modern Windows console (Windows Terminal and the like) has used the Unicode API since PEP 528, so Hangul already prints fine even on 3.14. So if you see print("한글") come out cleanly on screen and conclude "the default encoding must be the same," you'd be wrong. When you speak to the screen, a translator was already standing by; what changed this time is the envelope spec used when you mail a letter to a file or a pipe. The 문서 redirect experiment at the very top of this article, where the on-screen output matched but the redirected bytes on disk diverged, happened for exactly this reason.
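One way to see which boundary a script is talking to is to report whether stdout is a console and which codec wraps it. The helper below is a hypothetical sketch; it writes its report to stderr so the result stays visible when stdout is redirected.

```python
import sys


def describe_stdout() -> dict:
    """Report whether stdout is a console and which codec wraps it."""
    stream = sys.stdout
    isatty = getattr(stream, "isatty", None)
    return {
        "is_console": bool(isatty()) if callable(isatty) else False,
        "encoding": getattr(stream, "encoding", None),
    }


# stderr is used so the report survives `> out.txt` redirection of stdout.
print(describe_stdout(), file=sys.stderr)
```

Running it once directly and once with `> out.txt` shows the console case and the file case side by side on the same interpreter.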

Why now — nine years in the making

PEP 686 isn't a brand-new feature that popped up out of nowhere. It's a carefully pre-announced compatibility change that flips only the default of a UTF-8 mode that has been available for years already.

| Point in time | Change | Role |
| --- | --- | --- |
| Python 3.6 | PEP 528, PEP 529 | Modernized the Windows console and filesystem paths around UTF-8 (already complete) |
| Python 3.7 | PEP 540 | Introduced the switch called UTF-8 mode, but default OFF, activated manually with -X utf8 |
| Python 3.10 | PEP 597 | Introduced EncodingWarning and encoding="locale", tools to find where implicit encoding is used |
| Python 3.11 | preparatory work | Corrected behavior so that even in UTF-8 mode, encoding="locale" uses the real locale |
| Python 3.15 | PEP 686 | Flips the default of that switch to ON |

The interesting part is the timeline itself. PEP 686 was first discussed targeting Python 3.13, but the Steering Council asked for 3.15 to allow a longer preparation period, and it was approved on that condition in June 2022. It was deliberately brought in slowly, as a change whose compatibility ripple effects were well understood.

Whether it actually landed in 3.15.0b2 cross-checks three ways. (1) PEP 686's status is Final and its target is 3.15. (2) The 3.15.0b2 release notes call it out as a major change. (3) The "What's New in Python 3.15" document states that I/O which omits the encoding — like open() — uses UTF-8. And above all, the measurements in the table above are direct evidence.

"Silent breakage" — more dangerous than an error

This is where the real trap of this change begins. It's easy to think, "So if I read a CP949 file with 3.15, it'll throw an error, right?" That's only half right. And the case where it does error is actually the lucky one.

Common Hangul fails loudly (the lucky case)

Reading 한글 saved as CP949 with 3.15's default open() (UTF-8):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0

Conversely, reading 한글 saved as UTF-8 with 3.14's default open() (CP949) also throws UnicodeDecodeError. The Hangul byte patterns of CP949 and UTF-8 are, for the most part, mutually "grammatically impossible combinations," so reading them the wrong way blows up immediately with an exception. It's loud. Loud is good. It lands in the logs, gets caught by try/except, and is discovered before deployment.
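Both loud failures can be reproduced on one interpreter by decoding in memory with the wrong codec. This is a small sketch; `try_decode` is a hypothetical helper, not code from the original experiment.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Decode strictly; on failure, return a short description of the error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason} at position {exc.start}"


cp949_hangul = "한글".encode("cp949")
utf8_hangul = "한글".encode("utf-8")

# Each payload is decoded with the other side's codec; both attempts fail.
cp949_as_utf8 = try_decode(cp949_hangul, "utf-8")
utf8_as_cp949 = try_decode(utf8_hangul, "cp949")
```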

But some bytes silently become different characters

The problem is the set of byte sequences that are valid in both CP949 and UTF-8. Such bytes raise no error whichever way you read them. They just become a different character.

I counted by brute force, and among 2-byte sequences there were 1027 cases that decode "without error in either codec, yet differently from each other." I reproduced two representative examples by actually reading them from files.

| Disk bytes | 3.14.6 default (CP949) read | 3.15.0b2 default (UTF-8) read | Error? |
| --- | --- | --- | --- |
| c2 af | '짱' | '¯' (macron symbol) | No error in either |
| eb ac b8 ec 84 9c | '臾몄꽌' | '문서' | No error in either |

Same file, same single line of open(p).read(). Only the interpreter differs, yet the resulting string is completely different — and there's no exception anywhere. No log, no traceback. The wrong data simply loads into memory as-is and gets re-saved or re-transmitted as-is. This is why it's more dangerous than UnicodeDecodeError: an error can be caught, but silent mojibake — a state where the bytes were successfully interpreted but the characters are garbled — gives you not even a clue to catch.
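A brute-force count of this kind might look like the sketch below. It is a reconstruction under the assumption that "2-byte sequences" means every pair of byte values; the function name `ambiguous_pairs` is hypothetical.

```python
def ambiguous_pairs() -> list:
    """Find 2-byte sequences that both codecs accept but decode differently."""
    found = []
    for hi in range(256):
        for lo in range(256):
            raw = bytes((hi, lo))
            try:
                as_cp949 = raw.decode("cp949")
                as_utf8 = raw.decode("utf-8")
            except UnicodeDecodeError:
                continue  # loud failure in at least one codec: not the silent case
            if as_cp949 != as_utf8:
                found.append((raw, as_cp949, as_utf8))
    return found


pairs = ambiguous_pairs()
print(len(pairs))  # the article reports 1027 for this count
```

Every hit is a byte pair that strict error handling accepts under either default, which is exactly the case no exception will ever flag.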

[!WARNING]
The default error handling for open() is strict in both versions. But what strict checks is only "is this byte valid under this encoding's grammar," not "are these the right characters that were originally intended." The c2 af → ¯ case above is perfectly valid UTF-8 grammatically, so strict lets it pass too.

Where it's costlier and surfaces later: writes and appends

The silent breakage on reads is scary, but the costlier accident in a migration actually happens with writes and appends — because the damage surfaces much later.

Suppose 3.15 appends one line, with its default (UTF-8), to a log file that had already been accumulating as CP949. Here's what I reproduced directly.

# Append to an existing CP949 log with 3.15's default open(..., "a")
with open("mixed.log", "wb") as f:
    f.write("기존=한글\n".encode("cp949"))
with open("mixed.log", "a") as f:   # 3.15 default = UTF-8
    f.write("추가=문서\n")

Append doesn't inspect the existing file contents. It just tacks UTF-8 bytes onto the end. The result is a mixed-encoding file that's CP949 in the front and UTF-8 in the back.

raw: b1 e2 c1 b8 3d c7 d1 b1 db 0a | ec b6 94 ea b0 80 3d eb ac b8 ec 84 9c 0d 0a
     └──────── CP949 ───────────┘   └──────────── UTF-8 ─────────────────┘

If you then try to read this file in its entirety:

cp949 full decode → UnicodeDecodeError
utf-8 full decode → UnicodeDecodeError

It has become a file that can't be read by any single encoding. Yet at the moment of appending there was no error at all. The accident surfaces only weeks later, in log analysis, a backup restore, or a downstream ETL. By then both the original and the exact transition point are hazy.
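The mixed file can be rebuilt in memory on any Python version, which also makes it easy to test a salvage strategy. The per-line fallback below is only a heuristic sketch, not a recovery method from the original article: it assumes each line was written in exactly one encoding, and it can still misread lines that happen to be valid under both codecs. `decode_lines_with_fallback` is a hypothetical name.

```python
# CP949 line followed by a UTF-8 line with the CRLF that Windows text mode writes.
mixed = "기존=한글\n".encode("cp949") + "추가=문서\r\n".encode("utf-8")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if the whole payload decodes strictly under one codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def decode_lines_with_fallback(data: bytes, encodings=("utf-8", "cp949")) -> list:
    """Decode line by line, trying each codec in order; tag each line with the codec used."""
    result = []
    for line in data.splitlines(keepends=True):
        for enc in encodings:
            try:
                result.append((enc, line.decode(enc)))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No codec fit: keep the line visible, with replacement characters.
            result.append((None, line.decode("utf-8", errors="replace")))
    return result


recovered = decode_lines_with_fallback(mixed)
```

Trying UTF-8 first is a deliberate choice here: it rejects most CP949 Hangul, so fewer lines slip through under the wrong codec.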

There's one more, similar trap. CP949 can't represent every Unicode character (emoji, certain symbols, and so on). So on 3.14, f.write("상태🙂") could fail early with a UnicodeEncodeError. 3.15, being UTF-8, writes this quietly and successfully. On the surface it's an improvement, but if the consumer is still CP949, the old early warning has vanished and the compatibility problem has simply been pushed downstream.
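If the consumer is still CP949, the lost early warning can be restored with an explicit check before writing. A minimal sketch, with `unencodable_chars` as a hypothetical helper name:

```python
def unencodable_chars(text: str, encoding: str = "cp949") -> list[str]:
    """Return the characters in text that the target encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


problems = unencodable_chars("상태🙂")
if problems:
    # ascii() keeps the report printable on any console encoding.
    print("not representable in cp949:", ascii(problems))
```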

To summarize the risk types:

| Type | Example | Does strict block it? | Risk |
| --- | --- | --- | --- |
| Wrong read, byte grammar also mismatches | CP949 한글 read as UTF-8 | Blocks (exception) | Found immediately (the lucky case) |
| Wrong read, valid under both grammars | CP949 c2 af (짱) read as UTF-8 (¯) | Doesn't block | Silent mojibake |
| New file written in a different encoding | 3.15 generates a UTF-8 log | Doesn't block | Found at the consumption stage |
| UTF-8 append to a CP949 file | Mixed-encoding file | Doesn't block | Delayed failure, harder to recover |
| Re-saving a wrongly-read string | Permanently storing mojibake | Doesn't block | Hard to recover without the original |

subprocess: when you capture the tool next door

Silent change isn't confined to files. It also affects subprocess.run(..., text=True), which calls an external program and receives its output as text, because a fair number of Korean Windows console tools still emit CP949 bytes.

I created a child process that outputs CP949 and had the parent capture its output with text=True.

  • 3.14.6 parent: normal capture (rc 0).
  • 3.15.0b2 parent: because the parent's pipe decoder is UTF-8, the moment it meets a CP949 byte the reader thread crashes with UnicodeDecodeError.

The subprocess API itself wasn't redesigned in 3.15. The default codec of the TextIOWrapper the parent wraps around the pipe follows sys.flags.utf8_mode, and this is a chain reaction caused by that default flipping 0 → 1. Looking at the CPython source, the logic is identical across both versions.

# Lib/subprocess.py (3.14.6 == 3.15.0b2)
if sys.flags.utf8_mode:
    return "utf-8"
return locale.getencoding()

[!WARNING]
A subtle trap: if you bump both the parent and the child to 3.15, both switch to UTF-8 together and the test passes. But that success does not prove compatibility with actual CP949 legacy tools. So when verifying, you must make the child emit fixed encoding bytes.
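A test child that emits fixed bytes could be sketched as follows. The child source is pure ASCII and writes the CP949 bytes for "한글" plus a newline straight to its binary stdout, so its output does not depend on the child's own version or UTF-8 mode. The parent states the pipe's contract with `encoding="cp949"`. `CHILD_SOURCE` and `capture_legacy_tool` are hypothetical names standing in for a real legacy tool.

```python
import subprocess
import sys

# Stand-in for a legacy console tool: always emits the CP949 bytes c7 d1 b1 db 0a.
CHILD_SOURCE = 'import sys; sys.stdout.buffer.write(bytes.fromhex("c7d1b1db0a"))'


def capture_legacy_tool() -> str:
    """Capture the child's output, decoding the pipe with an explicit codec."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        encoding="cp949",  # the pipe's contract, stated instead of inherited
        check=True,
    )
    return result.stdout


captured = capture_legacy_tool()
```

Dropping the `encoding="cp949"` argument turns this into the failing case described above once the parent runs in UTF-8 mode.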

The safety net is "explicit," not "automatic"

So how do you prepare? The core conclusion first: encoding= is not an optional argument but a protocol declaration. In every place where bytes cross a process boundary — files, pipes, external tools — declaring the actual data contract is the only version- and OS-independent safety net.

The precise meaning of encoding="locale"

Here's a common misunderstanding. "To keep the old behavior, can't I just use encoding="locale"?" Only partly right. encoding="locale" (PEP 597, 3.10+) means "use this machine's locale encoding right now," not "use CP949." I confirmed directly that on 3.15 this expression still behaves as intended and read CP949 files correctly on a Korean machine. But run the same code on Japanese or English Windows, or on a UTF-8 ACP environment, and it changes again.

The right expression differs per data contract.

| Data contract | Recommended expression |
| --- | --- |
| This format is UTF-8 everywhere | encoding="utf-8" |
| This legacy format is CP949 everywhere | encoding="cp949" |
| The file is designed to follow the current user locale | encoding="locale" |
| External data of unknown encoding | Don't auto-guess; check the metadata/protocol, or receive it as binary and validate |

If you're dealing with a fixed CP949 format, encoding="cp949" is a stronger contract than "locale".
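When a fixed CP949 format is eventually migrated, both sides of the conversion can state their contract explicitly. The helper below is an illustrative sketch with a hypothetical name; it passes `newline=""` so existing line endings are preserved rather than translated.

```python
def convert_cp949_to_utf8(src: str, dst: str) -> None:
    """Re-encode a known-CP949 text file as UTF-8, keeping line endings intact."""
    # Read side: the legacy contract is CP949, regardless of interpreter defaults.
    with open(src, encoding="cp949", newline="") as f:
        text = f.read()
    # Write side: the new contract is UTF-8, also stated explicitly.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Because neither call relies on a default, the function behaves the same on 3.14, on 3.15, and on a non-Korean machine.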

Toggles and precedence

The default changed, but the option didn't disappear. Here's the precedence I confirmed directly.

-X utf8 / -X utf8=0   (command line, highest priority)
   >  PYTHONUTF8=1 / =0   (environment variable)
   >  version default  (3.14 Windows = 0, 3.15 = 1)
  • On 3.15, PYTHONUTF8=0 or -X utf8=0 → immediately rolls back to the old CP949 behavior (reconfirmed utf8_mode 0, getpreferredencoding cp949).
  • On 3.14, PYTHONUTF8=1 or -X utf8 → rehearse the 3.15 behavior in advance. Without a separate 3.15 install, you can pre-check whether your current codebase breaks under UTF-8 mode.
  • PYTHONIOENCODING applies with priority only to the standard streams; it cannot change the defaults of ordinary open() or subprocess pipes. Don't overestimate its scope.
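The precedence can be checked by starting child interpreters with different combinations of the flag and the environment variable. This is a sketch of such a check, not the author's script; `utf8_mode_of_child` is a hypothetical helper.

```python
import os
import subprocess
import sys

PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_of_child(x_option=None, env_value=None) -> int:
    """Start a child interpreter and return the sys.flags.utf8_mode it reports."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a clean slate
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable]
    if x_option is not None:
        cmd += ["-X", x_option]
    cmd += ["-c", PROBE]
    out = subprocess.run(
        cmd, env=env, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return int(out.stdout.strip())


# The command-line option wins over the environment variable.
flag_beats_env = utf8_mode_of_child("utf8=0", env_value="1")
```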

Migration priorities

This is the order I'd recommend, based on the behavior I verified directly.

  1. For each file and pipe, confirm what encoding the actual producer and consumer use. (First, and most often skipped.)
  2. Pin the formats we newly create with encoding="utf-8".
  3. Pin actual CP949 contracts with encoding="cp949", then draw up a separate migration plan.
  4. Use encoding="locale" only when following the user locale is a genuine requirement.
  5. Audit where implicit encoding is used with -X warn_default_encoding (or PYTHONWARNDEFAULTENCODING=1); add -W error::EncodingWarning to turn each warning into an exception.
  6. A system-wide PYTHONUTF8=0 is not a root fix — keep it only as a last resort for dodging an outage.

Configuration files, logs, caches, CSV exports, and external CLI calls in particular are more easily missed in audits than the main code. And if a third-party library internally calls open(config) or subprocess.run(..., text=True) without an encoding, you're affected even if your own code is clean. Here too, the top priority is finding the actual execution path with EncodingWarning.

Closing — the direction is right, but the lesson is the opposite

The direction of PEP 686 itself is sound. UTF-8 is already the de facto standard of the internet and modern text, and it narrows the long-standing asymmetry where an open("README.md") that ran fine on Linux broke only on Windows. For a cross-platform project starting fresh, it's clearly less painful.

But the compatibility cost is much wider than "reading a CP949 file throws an error." Compressing the conclusions I reached firsthand:

  • Some CP949 bytes are also valid UTF-8 and silently become different characters (c2 af: 짱 → ¯).
  • The bytes of new logs and export files change to UTF-8 without error.
  • Appending to an existing CP949 file produces a mixed file that's readable by no single codec.
  • If both parent and child move up to 3.15 together, subprocess tests can pass without proving anything about real CP949 tools.
  • Because the console and filesystem are already UTF-8, a simple on-screen test misses the change.

So the practical lesson of this change is not "you can now omit encoding=." Quite the opposite.

Python 3.15 made the implicit default more reasonable. But for long-term archival files, logs, pipes, and external-program protocols, explicit encoding is the only real safety net.

Characters showing up fine on screen and the correct bytes being written to disk are different problems. 3.15 reminds us — quietly, but unmistakably — that the two are not the same.


This article follows the same methodology — firsthand verification that pulls down two versions directly and contrasts their behavioral differences in bytes and measurements — and is a sibling to the direct TypeScript native compiler (tsgo) benchmark and the PostgreSQL 19 deep-dive verification.
