Discussion on: Perl, Unicode, and Bytes

This is a great article, probably the best I've read on the subject. And I agree with Dan Brook that it's not an easy topic to cover.

Consider Perl to have one type of string: a character string.

That's pretty much the bottom line to me. If you're writing a library that maintains utf8 bytes, I'd consider that a bug. It's not even exceptionally difficult to solve that. But people have to come to that conclusion. The notion of "utf8 byte strings" shouldn't be something you maintain in your program through a convention. Decode it so others can continue to pretend like Perl has one type of string.
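A minimal sketch of what "decode it" could look like at a library boundary. The function name `read_name` is hypothetical and only for illustration; `Encode::decode` and `Encode::FB_CROAK` are from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical library entry point: takes UTF-8 bytes from the outside
# world and hands back a character string, so callers never have to track
# a "this is really UTF-8 bytes" convention.
sub read_name {
    my ($raw_bytes) = @_;    # a copy, so the caller's buffer is untouched
    # FB_CROAK makes malformed input die instead of being silently replaced.
    return decode('UTF-8', $raw_bytes, Encode::FB_CROAK);
}

my $bytes = "\xC3\xA9";      # the two UTF-8 bytes that encode U+00E9
my $name  = read_name($bytes);

printf "%d byte(s) in, %d character(s) out\n", length($bytes), length($name);
# prints "2 byte(s) in, 1 character(s) out"
```

Everything past `read_name` can then treat `$name` as an ordinary character string.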

As for a part 2 or a revision, I would cover the practical solution to the problem you started with:

use feature 'say';
use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

This will work fine without any forethought, so long as you remember to use utf8:

use utf8;            # decode string literals in this file as UTF-8
use feature 'say';
use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

And there is no reason to ever not use utf8 in your source files. Why would you ever want to put unencoded, non-Unicode bytes in your Perl source code? It doesn't make sense to me. Put the blob outside your source file.
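To make the difference between the two snippets above concrete, here is a sketch that does not depend on how the source file itself is saved. It assumes the literal is é (U+00E9) and spells out by hand what Perl sees in each case: without `use utf8` the literal arrives as two bytes, with it as one character.

```perl
use strict;
use warnings;
use JSON::PP ();

# What a literal "é" looks like to Perl without `use utf8`:
# two separate characters, one per UTF-8 byte.
my $undecoded = "\xC3\xA9";

# What the same literal looks like with `use utf8`: one character, U+00E9.
my $decoded = "\x{E9}";

# encode_json always UTF-8 encodes its output, so the undecoded
# string gets encoded a second time.
my $double = JSON::PP::encode_json([$undecoded]);
my $single = JSON::PP::encode_json([$decoded]);

print unpack('H*', $double), "\n";   # prints 5b22c383c2a9225d
print unpack('H*', $single), "\n";   # prints 5b22c3a9225d
```

The first line of output is the double-encoded form that shows up as "Ã©" in a UTF-8 terminal; the second is the expected `["é"]`.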

Felipe Gasper • Feb 21 '21

Decode it so others can continue to pretend like Perl has one type of string.

This isn’t a pretense, though; it’s the literal truth. What defines a Perl string is its sequence of code points. Nothing more.
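A small sketch of that point: two strings with different internal storage but the same code points compare equal. The variable names are only illustrative; `utf8::upgrade` and `utf8::is_utf8` are built-in functions that change and inspect the internal representation.

```perl
use strict;
use warnings;

my $narrow = "caf\x{E9}";    # stored internally as one byte per character
my $wide   = $narrow;
utf8::upgrade($wide);        # same code points, now stored internally as UTF-8

# The internal flag differs between the two...
printf "flag: %s vs %s\n",
    (utf8::is_utf8($narrow) ? 'on' : 'off'),
    (utf8::is_utf8($wide)   ? 'on' : 'off');
# prints "flag: off vs on"

# ...but as strings they are indistinguishable.
printf "equal: %s, lengths: %d and %d\n",
    ($narrow eq $wide ? 'yes' : 'no'), length($narrow), length($wide);
# prints "equal: yes, lengths: 4 and 4"
```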

And there is no reason to ever not use utf8 in your source files.

Source-decode by default makes some sense. I would personally rather it be deferred, though, until Perl can tell whether a string is decoded or not. There’s enough Perl out there already that screws this stuff up; changing recommended defaults without providing any additional “guard rails” seems likely to confuse.

I’m also a bit worried about STDIN, pipes, and the like still defaulting to undecoded when the source code auto-decodes, as I related in a thread on a recent article Dan wrote proposing that use utf8 be part of use v7. If we’re going to source-decode, I’d rather we go the extra mile and make inputs/outputs default to UTF-8, or maybe ape Node.js and require that an encoding be specified in order to create a filehandle.
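For illustration, a sketch of the current default on filehandles. It reads the same bytes twice from an in-memory filehandle (standing in for STDIN or a pipe): once with no layer and once with an explicit `:encoding(UTF-8)` layer.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA9\n";    # UTF-8 bytes for U+00E9, plus a newline

# Default: no decoding layer, so each byte comes back as its own character.
open my $raw, '<', \$bytes or die "open failed: $!";
chomp(my $undecoded = <$raw>);
close $raw;

# Explicit layer: the handle decodes UTF-8 into characters as it reads.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
chomp(my $decoded = <$dec>);
close $dec;

printf "no layer: %d, with layer: %d\n", length($undecoded), length($decoded);
# prints "no layer: 2, with layer: 1"

# For the standard handles the same opt-in looks like either of:
#   binmode(STDIN, ':encoding(UTF-8)');
#   use open qw(:std :encoding(UTF-8));
```

Today the decoding has to be requested per handle or per lexical scope, which is the gap between source-decoding and I/O that the comment above is worried about.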