Felipe Gasper

Posted on Jan 29, 2021 • Edited on Jan 24, 2022

Perl, Unicode, and Bytes

Felipe Gasper · 2021-01-29T00:29:50Z


#perl #unicode

Wide character in print at Foo/Bar.pm line 27.

We’ve all been here: that maddening “wide character” warning. Why does it happen? How can we fix it? How can we prevent it in the future? Let’s take a look.

Lots of early Perl adopters were C programmers. C strings are arrays of bytes, which allow code points up to 255, and that’s it. Perl used that model for many years.

Along came Unicode, and with it a need for Perl to store code points that exceed 255 (i.e., “wide characters”). The solution—which Perl retains today—was to give Perl a 2nd way of storing a string: in addition to C-style “byte strings”, Perl can store strings in an internal, Unicode-compatible encoding. Thus, a Perl string can now natively store any Unicode code point.

Of course, programs don’t generally receive “wide characters” as inputs. They receive bytes, then decode those bytes into “characters”. Then they encode the characters back into bytes for output. In general, then, each program:

1) … receives bytes as input,
2) … decodes those bytes to characters,
3) … does something with those characters,
4) … encodes its output characters to bytes,
5) … and outputs those bytes.
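The five steps above can be sketched in a few lines of Perl. This is only an illustration, not code from the original post: `shout_utf8` is a hypothetical helper, and it uses the core Encode module (with `FB_CROAK`, so that invalid input dies) purely to stay self-contained.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode ();

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes, uppercased.
sub shout_utf8 {
    my ($bytes) = @_;    # step 1: bytes arrive as input

    # Step 2: decode bytes to characters. FB_CROAK makes invalid UTF-8 fatal;
    # LEAVE_SRC keeps Encode from modifying its argument.
    my $chars = Encode::decode( 'UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );

    # Step 3: do something with the characters.
    my $loud = uc $chars;

    # Step 4: encode the characters back to bytes.
    return Encode::encode( 'UTF-8', $loud,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );
}

# "caf\xc3\xa9" is the UTF-8 byte sequence for "café".
my $out = shout_utf8("caf\xc3\xa9");

# Step 5: output the bytes.
print $out, "\n";
```

Skipping step 2 or step 4 in a helper like this is exactly what produces the two problems discussed in the rest of this post.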

Here’s the trick: lots of Perl programs simply don’t care about “characters”; for example, if all you’re doing is piping a stream from one filehandle to another, there’s no reason to decode bytes to characters since we’re just going to re-encode those characters to bytes right away. For such programs, Perl’s pre-Unicode, a-byte-is-a-character-is-a-byte model works just fine.

Let’s call these two workflows “character-oriented” and “byte-oriented”. Most character encoding problems in Perl arise from a conflict between these two.

Byte-Oriented Data in a Character-Oriented World

Suppose we omit step 2 above. Consider the following:

> perl -MJSON::PP -E'my $s = "…"; say JSON::PP::encode_json([$s])'
["â¦"]

To grok the above, first consider $s. Most folks nowadays probably use UTF-8 terminals, which means … takes 3 bytes: 0xe2 0x80 0xa6. Our one-liner doesn’t decode $s, so as far as Perl’s concerned $s is 3 characters: 0xe2 0x80 0xa6.

encode_json(), though, expects its input strings to be decoded. It also outputs a byte sequence; thus, it applies a UTF-8 encode to each of $s’s 3 characters, which yields 6 bytes: 0xe2 becomes 0xc3 0xa2, 0x80 becomes 0xc2 0x80, and 0xa6 becomes 0xc2 0xa6.
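To watch that double encoding happen without JSON in the picture, here is a small sketch using only the built-in `utf8::encode`. The helper name `hex_codepoints` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: render each code point of a string as two hex digits.
sub hex_codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $string;
}

# The three UTF-8 bytes of U+2026, never decoded: Perl sees 3 characters.
my $undecoded = "\xe2\x80\xa6";

# UTF-8 encode a copy in place, as a character-expecting encoder would.
my $double = $undecoded;
utf8::encode($double);

print hex_codepoints($undecoded), "\n";    # e2 80 a6
print hex_codepoints($double),    "\n";    # c3 a2 c2 80 c2 a6
```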

To fix this, we can do one of:

A) Decode the input, e.g.:

my $s = "…";
$s = Encode::Simple::decode_utf8($s);
say JSON::PP::encode_json([$s]);

B) Provide a “pre-decoded” string:

my $s = "\x{2026}";
say JSON::PP::encode_json([$s]);

C) Make the JSON encoder forgo character encoding, e.g.:

my $s = "…";
say JSON::PP->new()->utf8(0)->encode([$s]);

CAVEAT: This last approach can yield invalid JSON.
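For comparison, here are the three options side by side as one script. This is a sketch under a couple of assumptions: it uses the core Encode module in place of Encode::Simple so that it runs without CPAN installs, and it writes the … as explicit escapes so the result doesn't depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode   ();
use JSON::PP ();

my $bytes = "\xe2\x80\xa6";    # undecoded UTF-8 for U+2026

# A) Decode first, then let encode_json() do the UTF-8 encode.
my $decoded = Encode::decode( 'UTF-8', $bytes,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );
my $json_a = JSON::PP::encode_json( [$decoded] );

# B) Start from an already-decoded string.
my $json_b = JSON::PP::encode_json( ["\x{2026}"] );

# C) Keep the bytes undecoded and tell the encoder not to UTF-8 encode.
my $json_c = JSON::PP->new->utf8(0)->encode( [$bytes] );

# All three hold the same code points: [ " e2 80 a6 " ]
my $expected = qq(["\xe2\x80\xa6"]);
for my $json ( $json_a, $json_b, $json_c ) {
    print $json eq $expected ? "ok\n" : "MISMATCH\n";
}
```

Option C only matches here because the input happened to be valid UTF-8; nothing in that code path checks.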

Character-Oriented Data in a Byte-Oriented World

The opposite problem—omitting step 4 in our 5-step workflow above—is a bit more interesting:

> perl -MJSON::PP -E'say JSON::PP::decode_json(q<["…"]>)->[0]'
Wide character in print at -e line 1.
…

Unlike before, where the mangled characters in the output reveal a palpable problem, here the program actually prints the right thing; it’s just throwing a warning along the way. What gives?

Just as encode_json() does a UTF-8 encode on its input, decode_json() does a UTF-8 decode. That means that decode_json(q<["…"]>)->[0] is a single character, 0x2026. So before we print it we’re supposed to encode it. Indeed, once we do that, the warning goes away:

> perl -MEncode::Simple -MJSON::PP -E'say encode_utf8( JSON::PP::decode_json(q<["…"]>)->[0])'
…

So can I just ignore that warning?

Maybe. But don’t.

As we know, Perl can store strings as “byte strings”: simple sequences of code points 0-255. Perl can also, though, store strings in an “upgraded”, abstract Unicode encoding. Such an “upgraded” string falls into one of two categories:

1) “Bytes-compatible”: All code points fall in the 0-255 range. In other words, Perl could store this string “downgraded”, but for whatever reason isn’t.

2) “Bytes-incompatible”: One or more code points exceed 255.

When outputting upgraded strings, Perl follows these rules:

1) If the string is bytes-compatible: output the string’s “downgraded” form.

2) Otherwise: Output the code points encoded to UTF-8, and “complain”: if we’re syswrite()ing, Perl throws an exception, but if we’re say()ing or print()ing then Perl just warns.

Of course, lots of applications output UTF-8 anyway, in which case #2 above happens to be “the right thing”. But Perl would rather you be explicit: encode your strings before outputting them.
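Both output rules can be observed with an in-memory filehandle and a warning handler. This is a sketch; `print_and_count_warnings` is a hypothetical helper written for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: print a string to an in-memory filehandle and
# return the bytes written plus the number of warnings Perl emitted.
sub print_and_count_warnings {
    my ($string) = @_;

    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $buffer = '';
    open my $fh, '>', \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh or die "close: $!";

    return ( $buffer, scalar @warnings );
}

# Rule 1: upgraded but bytes-compatible, so Perl writes the downgraded form.
my $compatible = "\xe9";
utf8::upgrade($compatible);
my ( $out_compat, $warn_compat ) = print_and_count_warnings($compatible);

# Rule 2: bytes-incompatible, so Perl writes UTF-8 and warns.
my ( $out_wide, $warn_wide ) = print_and_count_warnings("\x{2026}");

# Being explicit: encode first, and there is nothing to complain about.
my $encoded = "\x{2026}";
utf8::encode($encoded);
my ( $out_enc, $warn_enc ) = print_and_count_warnings($encoded);

printf "compatible: %d byte(s), %d warning(s)\n", length $out_compat, $warn_compat;
printf "wide:       %d byte(s), %d warning(s)\n", length $out_wide,   $warn_wide;
printf "encoded:    %d byte(s), %d warning(s)\n", length $out_enc,    $warn_enc;
```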

That Encoding Behind the Curtain …

Perl’s “internal Unicode encoding” is, in fact, just UTF-8. (Actually a “loose” variant thereof, but we digress.) It’s really better to forget this unless you’re maintaining Perl itself—even XS modules shouldn’t care!—but for the sake of a concrete understanding we’ll look at a few examples here.

Perl Internals: Wide Characters

Compare the following:

> perl -MDevel::Peek -MEncode::Simple -e'my $s = "…"; $s = decode_utf8($s); Dump $s'
SV = PV(0x7fc992804c70) at 0x7fc992816348
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7fc9927006d0 "\342\200\246"\0 [UTF8 "\x{2026}"]
  CUR = 3
  LEN = 10

… versus:

> perl -MDevel::Peek -e'my $s = "…"; Dump $s'
SV = PV(0x7f9e5e804c70) at 0x7f9e5e8162a0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x7f9e5e40bbe0 "\342\200\246"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 1

The important piece here is that [UTF8 "\x{2026}"] bit that we see only in the top example. This is the string’s content as Perl code sees it: a single character with code point 0x2026.

Perl Internals: UTF8-Invariant Characters

Now consider:

> perl -MEncode::Simple -MDevel::Peek -e'Dump( decode_utf8("abc") )'
SV = PV(0x7f81bc004d30) at 0x7f81bc0042a8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7f81bbf46770 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10

A special feature of UTF-8 is that, unlike other Unicode encodings (UTF-16 & al.), it encodes code points 0-127 identically to US-ASCII and ISO-8859-1. We call these code points “UTF8-invariant” because Perl stores them as the same bytes regardless of whether the string is upgraded or not.

Watch this, though:

> perl -MDevel::Peek -MEncode -e'my $s = "abc"; utf8::decode($s) or die "bad"; Dump $s'
SV = PV(0x7fa09a004c70) at 0x7fa09a016348
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7fa099e01540 "abc"\0
  CUR = 3
  LEN = 10

This is the same logic as we achieved with Encode::Simple, but with a twist: Perl did not upgrade the string! What gives??

It turns out that upgraded strings are slower than their downgraded forms: to do much of anything with upgraded strings you have to parse each Unicode character out of the buffer. For this reason, utf8::decode will (like its parallel C API function) leave strings downgraded unless the input contains a multi-byte UTF-8 sequence, i.e., unless the decoded string has a code point above 127. Encode::Simple, by contrast, always upgrades, even for bytes-compatible strings. (Unicode::UTF8 does the same.)

This is why we can’t just say “Perl stores text strings as UTF-8.” Some character decoders do work that way, but Perl’s own internal decoder doesn’t.
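If you'd rather not read Devel::Peek dumps, `utf8::is_utf8` reports the same upgraded/downgraded state. Here is a small sketch (the helper `decode_report` is hypothetical) that repeats the two `utf8::decode` cases from above. As with the dumps, this is for understanding only; application code shouldn't branch on this flag.

```perl
use strict;
use warnings;

# Hypothetical helper: utf8::decode() a copy of the input, then report
# the decoded length and whether Perl upgraded the string internally.
sub decode_report {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::decode($copy) or die "invalid UTF-8";
    return ( length($copy), utf8::is_utf8($copy) ? 1 : 0 );
}

# UTF8-invariant input: 3 characters, left downgraded.
my ( $ascii_len, $ascii_upgraded ) = decode_report("abc");

# The UTF-8 bytes of U+2026: 1 character, necessarily upgraded.
my ( $wide_len, $wide_upgraded ) = decode_report("\xe2\x80\xa6");

print "abc:    length=$ascii_len upgraded=$ascii_upgraded\n";
print "U+2026: length=$wide_len upgraded=$wide_upgraded\n";
```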

Perl Internals: The Really Messy Part

We’ve looked at how Perl stores bytes-incompatible (>255) code points and UTF8-invariant ones (0-127). What about the 128-255 range?

Here’s where it gets dicey: these code points are bytes-compatible but not UTF8-invariant. Perl can thus store these either downgraded or upgraded, but this time it matters which they are.

Recall our example above where we looked at the Dump() of undecoded …. Compare that to:

> perl -MDevel::Peek -e'my $s = "…"; utf8::upgrade($s); Dump $s'
SV = PV(0x7feb80004c70) at 0x7feb800162a0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7feb7fc04930 "\303\242\302\200\302\246"\0 [UTF8 "\x{e2}\x{80}\x{a6}"]
  CUR = 6
  LEN = 10

utf8::upgrade() internally encodes the formerly-downgraded $s as UTF-8. As far as Perl code goes it’s the same string; only its internal representation changes. Since $s was already a UTF-8 sequence, what Perl stores in memory is double-encoded; however, to the Perl application it actually makes no difference because anything that accesses that string will see 3 characters (0xe2 0x80 0xa6), not Perl’s internally-double-encoded stuff. This includes outputting the string, e.g.:

> perl -E'my $s = "…"; say $s; utf8::upgrade($s); say $s'
…
…
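The same point can be checked in code rather than by eye. In this sketch, only `utf8::is_utf8` (which peeks at the internals) can tell the two strings apart; every ordinary string operation treats them as identical.

```perl
use strict;
use warnings;

my $down = "\xe2\x80\xa6";    # 3 code points, stored downgraded

my $up = $down;
utf8::upgrade($up);           # same 3 code points, stored upgraded

# Ordinary string operations see no difference.
my $same_string = ( $down eq $up )                 ? 1 : 0;
my $same_length = ( length($down) == length($up) ) ? 1 : 0;

# Only the internals differ.
my $down_flag = utf8::is_utf8($down) ? 1 : 0;
my $up_flag   = utf8::is_utf8($up)   ? 1 : 0;

print "eq: $same_string, same length: $same_length\n";
print "upgraded? down=$down_flag up=$up_flag\n";
```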

It’s sometimes surprising which interfaces return upgraded strings and which don’t. For example, JSON::PP’s encode() returns an upgraded string, even if we disable character encoding:

> perl -MDevel::Peek -MJSON::PP -E'Dump( JSON::PP->new()->utf8(0)->encode(["…"]) )'
SV = PV(0x7fd786004ff0) at 0x7fd78909e4f8
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x7fd78826a8a0 "[\"\303\242\302\200\302\246\"]"\0 [UTF8 "["\x{e2}\x{80}\x{a6}"]"]
  CUR = 10
  LEN = 13
  COW_REFCNT = 0

REMINDER: Nothing to See Here!

The above Devel::Peek examples are a purely-informational “peek behind the curtain” at Perl’s internals. Unless you’re altering Perl itself—again, even XS modules should ignore Perl internals—ignore Perl’s internal encoding.

Our Way Forward

Most modern programming languages use different types to represent “binary strings” and “character strings”. Perl, for better or for worse, does not; like the difference between a string and a number, we have to track that ourselves.

Here, then, are the best things we Perl programmers can do for ourselves and for each other to prevent character encoding problems:

Consider Perl to have one type of string: a character string. Perl wants you to ignore its internal encoding; don’t fight that. (Technically Perl could change its internal encoding scheme, and well-behaved modules, whether pure-Perl or XS, would keep working.)
Document whether your modules expect strings to be character-decoded or not. Do likewise for returned strings. (Maybe even provide functions for both, as Mojo::JSON does.)
Prefer Encode::Simple over alternatives like Encode, utf8, and Unicode::UTF8. Encode::Simple, by default, throws an exception when it encounters invalid data, which means you’ll catch errors up-front rather than deep in your code. The others all accept invalid input by default.
For XS authors: When working with PVs (strings), always differentiate between the two encodings. Macros like SvPVbyte, SvPVutf8, and their variants are your friends!
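To make the "catch errors up-front" point concrete, here is a sketch of lenient versus strict decoding. It uses the core Encode module so that it runs anywhere: passing `FB_CROAK` approximates the strict-by-default behavior described above for Encode::Simple. `strict_decode_utf8` is a hypothetical wrapper, not part of any module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: decode UTF-8, dying on invalid input.
sub strict_decode_utf8 {
    my ($bytes) = @_;
    return Encode::decode( 'UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );
}

my $bad = "abc\xff";    # 0xff can never appear in valid UTF-8

# Encode's default: quietly substitute U+FFFD and carry on.
my $lenient = Encode::decode( 'UTF-8', $bad );

# Strict: the problem surfaces right where the bad data enters.
my $died = eval { strict_decode_utf8($bad); 1 } ? 0 : 1;

printf "lenient result ends in U+%04X\n", ord substr( $lenient, -1 );
print "strict decode died: $died\n";
```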

Top comments (5)

Dan • Feb 8 '21 • Edited

A very difficult topic to cover, well done, and 100% agree with the conclusions.

Minor nit: I would refer to Perl's internal upgraded encoding as "approximately UTF-8" - it follows all of the same structure as UTF-8, so all valid UTF-8 is valid in Perl's internal encoding, but the reverse is not necessarily true, because Perl's internal encoding does not have restrictions on noncharacters, surrogates, or code points over U+10FFFF; indeed it allows storing any ordinal, because Perl strings don't necessarily represent Unicode characters until they're used as such.

And more importantly, unless you are writing XS code you should not depend on it being UTF-8 adjacent anyway - Perl could switch its internal string encoding to UTF-16LE and correctly-written pureperl code would work the same.

Felipe Gasper • Feb 9 '21

Thank you! I updated the post a bit to address these points.

Evan Carroll • Feb 14 '21

This is a great article. This is probably the best article I've read on the subject. And I agree with Dan Brook that it's not an easy topic to cover.

Consider Perl to have one type of string: a character string.

That's pretty much the bottom line to me. If you're writing a library that maintains utf8 bytes, I'd consider that a bug. It's not even exceptionally difficult to solve that. But people have to come to that conclusion. The notion of "utf8 byte strings" shouldn't be something you maintain in your program through a convention. Decode it so others can continue to pretend like Perl has one type of string.

As far as pt 2 here or revision, I would cover the practical solution to what you started with,

use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

Will work totally fine without any forethought so long as you remember to use utf8,

use utf8;
use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

And there is no reason to ever not use utf8 in your source files. Why would you ever want to put unencoded non-unicode bytes in your perl source code. Doesn't make sense to me. Put the blob outside your source file.

Felipe Gasper • Feb 21 '21

Decode it so others can continue to pretend like Perl has one type of string.

This isn’t a pretense, though; it’s the literal truth. What defines a Perl string is its sequence of code points. Nothing more.

And there is no reason to ever not use utf8 in your source files.

Source-decode by default makes some sense. I would personally rather it be deferred, though, until Perl can tell whether a string is decoded or not. There’s enough Perl out there already that screws this stuff up; changing recommended defaults without providing any additional “guard rails” seems likely to confuse.

I’m also—as I related in a thread on a recent article Dan wrote proposing that use utf8 be part of use v7—a bit worried about STDIN, pipes, and the like still defaulting to undecoded when the source code auto-decodes. If we’re going to source-decode, I’d rather we go the extra mile and make inputs/outputs default to UTF-8, or maybe ape node.js and require that an encoding be specified in order to create a filehandle.

Dan • Feb 8 '21

Make sure to add the #perl tag to your Perl posts! :)