Felipe Gasper

Posted on Mar 27, 2021

use Sys::Binmode;

Character encoding is an often-misunderstood aspect of Perl, and Perl itself has significant bugs in this area. I recently published a CPAN module called Sys::Binmode that fixes most of them. If you write Perl, you should probably use it in all new code.

I know that’s a “tall” claim, but …

Check this out:

my $foo = "\xff\x{100}";
chop $foo;
print $foo, $/;
exec "echo", $foo;

This looks like it ought to print two identical lines, right? But in fact, it prints:

�
ÿ

Why? To answer that we have to learn a bit of Perl’s internals. Read on!

Background: What’s in a string?

In theory, Perl strings store code points, nothing more. They don’t store “bytes” or “characters”, but just code points—i.e., unsigned integers. (In that sense, Perl is more like JavaScript than C!)
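
To make the "code points, nothing more" idea concrete, here is a minimal sketch that lists the integers a string holds. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Two code points: 255 and 256 (the same string the demo program starts with).
my $str = "\xff\x{100}";

# split // yields one element per code point; ord gives its integer value.
my @code_points = map { ord } split //, $str;

print "length: ", length($str), "\n";    # counts code points, not bytes
print "code points: @code_points\n";
```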

That, of course, is just an abstraction: all programming languages use bytes internally to store strings. How does Perl decide which bytes to use for which code points? As it happens, Perl can do that in either of two formats: a “narrow” format that can store code points 0-255 only, and a “wide” format that can store any arbitrary code point. Which of those formats Perl uses for a given string is up to Perl; things that aren’t Perl generally shouldn’t care about it.

For this abstraction to work, whether Perl stores a given string as “narrow” or “wide” must make no difference to a Perl program. And indeed, if you print $foo you’ll get the same result regardless of which internal format Perl uses to store $foo. Same for syswrite and send.
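
As a rough illustration of that abstraction, the sketch below stores the same code point both ways. It uses the core functions utf8::upgrade() and utf8::is_utf8() purely as debugging tools to switch and inspect the internal format; ordinary code should not need them.

```perl
use strict;
use warnings;

my $narrow = "\xff";      # code point 255; Perl stores this literal "narrow"
my $wide   = "\xff";
utf8::upgrade($wide);     # same code point, now stored "wide"

# utf8::is_utf8() reports the internal storage format and nothing else.
my $narrow_is_wide = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_is_wide   = utf8::is_utf8($wide)   ? 1 : 0;

print "narrow copy stored wide? $narrow_is_wide\n";
print "wide copy stored wide?   $wide_is_wide\n";

# To Perl code, the two strings are indistinguishable.
print "strings equal? ", ($narrow eq $wide ? "yes" : "no"), "\n";
print "lengths: ", length($narrow), " and ", length($wide), "\n";
```

Both copies compare equal and have length 1; only the storage flag differs.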

Most of Perl’s built-ins that talk to the operating system, though (exec, open, and mkdir, for example), don’t work this way.

Look again at our 4-line program above. In line 1 we create a string with 2 code points: 255 and 256. In line 2 we chop off the latter code point, so now $foo just has 255. In line 3 we print that string and a newline; in line 4 we run echo to do the same thing. Ideally lines 3 and 4 should achieve the same output. But for you they probably didn’t. Why?

Let’s rerun that program but this time pipe it to xxd to see exactly what’s being output:

> pbpaste | perl | xxd
00000000: ff0a c3bf 0a

0a is just the newline character. So line 3 printed a single byte, 0xff (plus newline), while echo on line 4 printed 2 bytes—0xc3 0xbf (and a newline). Line 3 is correct: a string that contains code point 255 should output byte 255 (i.e., 0xff). What’s going on with line 4?

Recall that, of Perl’s internal string-storage formats, only the “wide” one can handle code points above 255. Since $foo on line 1 contains code point 256, Perl stores that string in “wide” format. Then in line 2 we get rid of code point 256. Now we have just 255. Perl could thus switch our string to its “narrow” (i.e., 0-255) format, but it happens—as of Perl 5.32, anyway—not to.
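
One way to watch this happen is to check the storage flag before and after the chop. This is a small sketch using the core utf8::is_utf8() function as a debugging aid; the result for the second check reflects the Perl versions discussed here and is not a documented guarantee.

```perl
use strict;
use warnings;

my $foo = "\xff\x{100}";
my $wide_before = utf8::is_utf8($foo) ? 1 : 0;

chop $foo;    # removes code point 256; only 255 remains
my $wide_after = utf8::is_utf8($foo) ? 1 : 0;

printf "length=%d ord=%d\n", length($foo), ord($foo);
print "stored wide before chop? $wide_before\n";
print "stored wide after chop?  $wide_after\n";
```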

This should make no difference since those internal storage details are behind Perl’s string abstraction. That’s the case with print, but exec misbehaves: it outputs Perl’s raw internal buffer rather than the proper code-point-to-byte conversion that print uses. Perl, though, doesn’t publicly define the contents of that internal buffer. Thus we have undefined behaviour, aka “nasal demons”, built directly into Perl!
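
For the curious, the internal buffer can be inspected from Perl itself. The sketch below uses a hypothetical helper, internal_hex(), built on Encode::_utf8_off(), a function the Encode documentation marks as internal-use. It is meant for poking around, not for production code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical debugging helper: hex dump of the bytes in Perl's internal
# buffer for a string. Encode::_utf8_off() only clears the "wide" flag on
# our private copy, which exposes the raw buffer to unpack.
sub internal_hex {
    my ($copy) = @_;
    Encode::_utf8_off($copy);
    return unpack 'H*', $copy;
}

my $narrow = "\xff";         # stored "narrow"
my $wide   = "\xff\x{100}";  # stored "wide" because of code point 256
chop $wide;                  # now holds only code point 255

print "same string? ", ($narrow eq $wide ? "yes" : "no"), "\n";
print "narrow buffer: ", internal_hex($narrow), "\n";
print "wide buffer:   ", internal_hex($wide), "\n";
```

The two strings are equal as far as Perl code can tell, yet their buffers differ, and it is the buffer that exec hands to the operating system.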

This is a leak in Perl’s string-storage abstraction, and it’s what Sys::Binmode fixes.

(Extra credit: remove the \x{100} and line 2 from our program above, and rerun it. The two lines should now be the same. Why?)

Enter Sys::Binmode

Sys::Binmode fixes exec and many other Perl built-ins by force-converting those built-ins’ arguments to Perl’s internal “narrow” string storage format. This fixes the abstraction leak: now, no matter how these strings are stored, Perl gives them to the operating system the same way.

Try it: do cpan Sys::Binmode, then rerun our program with perl -MSys::Binmode. It’ll now print two identical lines.
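
To get a feel for what that conversion means, here is a by-hand approximation using only core Perl. The narrowed() helper is hypothetical and is not part of Sys::Binmode; it just applies utf8::downgrade() to a copy of one argument, whereas the module takes care of this for the affected built-ins automatically.

```perl
use strict;
use warnings;

# Hypothetical helper: return a "narrow"-stored copy of a string.
# utf8::downgrade() dies if the string holds any code point above 255.
sub narrowed {
    my ($copy) = @_;
    utf8::downgrade($copy);
    return $copy;
}

my $foo = "\xff\x{100}";
chop $foo;                  # still stored "wide"

my $arg = narrowed($foo);

print "original stored wide? ", (utf8::is_utf8($foo) ? "yes" : "no"), "\n";
print "copy stored wide?     ", (utf8::is_utf8($arg) ? "yes" : "no"), "\n";
print "still equal? ", ($foo eq $arg ? "yes" : "no"), "\n";
```

Passing $arg rather than $foo to exec "echo" would hand echo the single byte 0xff, matching what print outputs.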

Special Case: Non-POSIX OSes (e.g., Windows)

Windows programmers may see a problem here: Perl’s “narrow” string storage format can only store bytes, so whenever we want to give arbitrary Unicode characters to the operating system, we’re stuck. (That need doesn’t arise on POSIX OSes like Linux, whose system calls take bytes.)

As it happens, though, Perl doesn’t actually use the Windows APIs that would allow sending arbitrary Unicode characters anyway. If Perl ever changed that, Sys::Binmode would need an update, but for now it can work the same way as on POSIX OSes without compromising any functionality.

Use in Existing Code?

Note that I say to use Sys::Binmode in new code, not all code. This is because existing code may actually depend on Perl’s abstraction leak.

Look again at exec’s broken behaviour above. For code point 255 it printed the bytes of Perl’s “wide” storage format, which for that string was 2 bytes: 0xc3 and 0xbf. Notice that this broken behaviour actually made our terminal print something useful: ÿ. As it happens, those 2 bytes from Perl’s internals are the UTF-8 encoding of code point 255. That’s because Perl’s “wide” internal format is actually just (a “lax” variant of) UTF-8, so anything that outputs Perl’s internals will output UTF-8 if Perl stores the string in “wide” format.

Ordinarily, to output a string in UTF-8 you encode it explicitly, e.g., encode('UTF-8', $str). exec appears to be automatically encode()ing for us, but it’s actually just outputting whatever Perl happens to store internally. So if Perl decides to store a string “wide”, it’ll give UTF-8 to exec … but if Perl decides to store that string “narrow”, then you’ll get something else! We could try to second-guess Perl’s internal decision-making, but that’s dangerous: how Perl decides to store its strings is undocumented and always subject to change.

Sys::Binmode will suppress that unreliable “auto-encode” behaviour, which forces us to encode our strings properly before giving them to exec and friends. Of course, that’s what we should have done all along!
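
Encoding properly might look like the sketch below, which uses the core Encode module. It encodes the same code point from both storage formats to show that the resulting bytes do not depend on Perl’s internals; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

my $narrow = "\xff";      # code point 255, stored "narrow"
my $wide   = "\xff";
utf8::upgrade($wide);     # same code point, stored "wide"

# Explicit encoding turns code points into UTF-8 bytes, whatever the
# internal storage of the input happens to be.
my $from_narrow = Encode::encode('UTF-8', $narrow);
my $from_wide   = Encode::encode('UTF-8', $wide);

print "from narrow: ", unpack('H*', $from_narrow), "\n";
print "from wide:   ", unpack('H*', $from_wide), "\n";
```

Either encoded string can then be given to exec "echo", which receives the bytes 0xc3 0xbf in both cases.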

Conclusion

I can think of no situation where Sys::Binmode effects any undesirable change to Perl in new code. It surely fixes bugs like the one in our exec demo program. Assuming that I’m correct that, for new code, this module only avoids problems without introducing any, it should be used in all new code.

Convinced? :-)

DEV Community