DEV Community

Discussion on: Perl, Unicode, and Bytes

Evan Carroll

This is a great article. This is probably the best article I've read on the subject. And I agree with Dan Brook that it's not an easy topic to cover.

Consider Perl to have one type of string: a character string.

That's pretty much the bottom line to me. If you're writing a library that hands around UTF-8 bytes as if they were text, I'd consider that a bug. It's not even exceptionally difficult to solve, but people have to come to that conclusion themselves. The notion of "utf8 byte strings" shouldn't be something you maintain in your program through a convention. Decode it so others can continue to pretend like Perl has one type of string.
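
To make "decode it" concrete, here is a minimal sketch of decoding once at the boundary using the core Encode module. The input bytes are a made-up sample (the UTF-8 encoding of "é"), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 encoding of U+00E9, as it might arrive
# from a socket or a filehandle opened without an encoding layer.
my $bytes = "\xC3\xA9";

# Decode once, where the data enters the program. FB_CROAK makes
# malformed input die; LEAVE_SRC keeps $bytes from being modified.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encode again only when the string leaves the program.
my $out = encode('UTF-8', $chars);
```

Everything between the `decode` and the `encode` can then treat `$chars` as an ordinary character string.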


As for a part 2 or a revision, I would cover the practical solution to the problem you started with:

use feature 'say';
use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

This will work totally fine without any forethought, so long as you remember to use utf8:

use utf8;
use feature 'say';
use JSON::PP ();
my $s = "…";
say JSON::PP::encode_json([$s]);

And there is no reason to ever not use utf8 in your source files. Why would you ever want to put unencoded, non-Unicode bytes in your Perl source code? That doesn't make sense to me. Put the blob outside your source file.
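
A rough sketch of what the pragma changes, assuming the script itself is saved as UTF-8. The literal "é" is a stand-in sample, not the string from the article, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP ();

# Assumes this file is saved as UTF-8. "é" is only a sample literal.
my ($plain, $decoded);
{
    no utf8;    # the literal stays as its 2 raw source bytes
    $plain = "é";
}
{
    use utf8;   # the literal is decoded to 1 character
    $decoded = "é";
}

# encode_json expects characters and emits UTF-8 bytes.
my $json_plain   = JSON::PP::encode_json([$plain]);
my $json_decoded = JSON::PP::encode_json([$decoded]);

printf "without utf8: %d chars in, %d bytes out\n",
    length($plain), length($json_plain);
printf "with utf8:    %d chars in, %d bytes out\n",
    length($decoded), length($json_decoded);
```

Without the pragma the two source bytes are treated as two characters and each gets encoded again, which is the double-encoding the article starts from. With it, the single character is encoded exactly once.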

Felipe Gasper

Decode it so others can continue to pretend like Perl has one type of string.

This isn’t a pretense, though; it’s the literal truth. What defines a Perl string is its sequence of code points. Nothing more.
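
One way to see this, as a small sketch: change a string's internal storage with `utf8::upgrade` and check that Perl still considers it the same string. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Two strings holding the same single code point, U+00E9.
my $downgraded = "\xE9";
my $upgraded   = "\xE9";

# Switch the second one to Perl's internal UTF-8 storage.
# This changes the representation, not the code points.
utf8::upgrade($upgraded);

my $same = $downgraded eq $upgraded;

printf "equal: %s, lengths: %d and %d\n",
    ($same ? "yes" : "no"), length($downgraded), length($upgraded);
printf "internal flag: %s and %s\n",
    (utf8::is_utf8($downgraded) ? "on" : "off"),
    (utf8::is_utf8($upgraded)   ? "on" : "off");
```

The internal flag differs between the two, yet `eq` and `length` agree, because both operate on code points and not on the storage format.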

And there is no reason to ever not use utf8 in your source files.

Source-decode by default makes some sense. I would personally rather it be deferred, though, until Perl can tell whether a string is decoded or not. There’s enough Perl out there already that screws this stuff up; changing recommended defaults without providing any additional “guard rails” seems likely to confuse.

I’m also a bit worried (as I related in a thread on a recent article Dan wrote proposing that use utf8 be part of use v7) about STDIN, pipes, and the like still defaulting to undecoded when the source code auto-decodes. If we’re going to source-decode, I’d rather we go the extra mile and make inputs/outputs default to UTF-8, or maybe ape Node.js and require that an encoding be specified in order to create a filehandle.
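
For comparison, this is roughly what the mismatch looks like today. The sketch reads the same sample bytes twice through in-memory filehandles, once with no layer and once with an explicit `:encoding(UTF-8)` layer; the data and variable names are made up. The existing opt-in for defaults is the `open` pragma, e.g. `use open qw(:std :encoding(UTF-8));`.

```perl
use strict;
use warnings;

# Hypothetical input: "é" plus a newline, as UTF-8 bytes.
my $raw = "\xC3\xA9\n";

# No layer given: the filehandle hands back undecoded bytes.
open my $byte_fh, '<', \$raw or die "open failed: $!";
my $as_bytes = <$byte_fh>;
chomp $as_bytes;
close $byte_fh;

# Explicit encoding layer: the filehandle hands back characters.
open my $char_fh, '<:encoding(UTF-8)', \$raw or die "open failed: $!";
my $as_chars = <$char_fh>;
chomp $as_chars;
close $char_fh;

printf "no layer: %d, with layer: %d\n", length($as_bytes), length($as_chars);
```

Requiring the layer argument, as in the second `open`, is approximately what the Node.js-style idea would enforce.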