If you’ve read my Perl, Unicode, and Bytes or Sys::Binmode posts, you know about the complexities of character encoding in Perl. A bit after I wrote that first post I had a little epiphany I thought worth sharing.
One day I noticed that URI::XSEscape was mangling its output: I’d pass in
épée and get out
%C3%83%C2%A9p%C3%83%C2%A9e. I recognized this as an extra UTF-8 encode: rather than URI-encoding my 6 bytes of
épée, it was UTF-8-encoding them first (yielding 10 bytes) and then URI-encoding the result.
I pulled out Devel::Peek and saw that something prior to the URI-encoding step had “upgraded” my string’s internal storage: Perl itself stored my string as 10 bytes, even though the Perl scalar still consisted of 6 characters. Ordinarily this is nothing of importance since Perl code doesn’t need to care how Perl itself stores its strings.
… until it does need to care, that is.
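The double encoding described above can be reproduced outside Perl. What follows is only a minimal sketch in plain C, not URI::XSEscape's actual code: `upgrade_to_utf8` and `uri_escape_bytes` are hypothetical helper names, and the "upgrade" is modelled as UTF-8-encoding each input byte as if it were a code point in the 0-255 range.

```c
#include <stddef.h>

/* Hypothetical model of an "upgrade": treat each input byte as a code
 * point (0-255) and write that code point's UTF-8 encoding.
 * `out` must have room for 2 * len bytes. Returns the bytes written. */
size_t upgrade_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                                   /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));  /* continuation */
        }
    }
    return n;
}

/* Hypothetical URI escaper: percent-encodes every byte that is not an
 * RFC 3986 "unreserved" character. `out` must have room for
 * 3 * len + 1 bytes. Returns the length of the NUL-terminated result. */
size_t uri_escape_bytes(const unsigned char *in, size_t len, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') ||
                         c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out[n++] = (char)c;
        } else {
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        }
    }
    out[n] = '\0';
    return n;
}
```

Escaping the six bytes C3 A9 70 C3 A9 65 directly gives %C3%A9p%C3%A9e. Running them through `upgrade_to_utf8` first produces 10 bytes, and escaping those gives the mangled %C3%83%C2%A9p%C3%83%C2%A9e from the example above.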
Perl’s C API (the set of macros and functions available to work with Perl from C) is a classic C API: lots of different ways to do almost the same thing. To translate a Perl scalar to a C signed integer, for example, you can use SvIV or one of its variants (IV here signifying an “integer value”). A similar set of macros exists for unsigned integers (SvUV and its variants).
Converting a Perl scalar to a C string is similar. There are many tools available, but the 3 “fundamental” ones are:
SvPVbyte: Takes the code points of your Perl string and gives back a C buffer whose bytes match those code points. Thus, any code point that exceeds 255 doesn’t work, and an exception is thrown.
SvPVutf8: Like SvPVbyte, but gives the UTF-8-encoded bytes for your Perl string’s code points. This works for any code point that Perl can store, but for code points 128-255 it’ll give different results from
SvPVbyte. (cf. perldoc perlunicode)
SvPV: Gives you the Perl string’s internal buffer, aka its “PV” (“pointer value”). It could be bytes, or it could be UTF-8. It’s like a C analogue to Perl’s
SvPV, of course, is what URI::XSEscape was using.
For SvPV to be meaningful, it has to be used in tandem with
SvUTF8, a macro that tells you which form the PV is: bytes, or UTF-8. So if
SvUTF8 is true, then
SvPV’s output is UTF-8; otherwise
SvPV’s output is bytes. But URI::XSEscape wasn’t checking
SvUTF8; it was just URI-encoding SvPV’s output, whichever form that happened to be.
The big problem with
SvPV is that the number of contexts other than Perl where it’s sensible to have a C string that could be bytes or UTF-8 is … small. Nevertheless, uses of this macro (and its variants) to interact with contexts outside Perl are all over CPAN.
URI::XSEscape, like its pure-Perl counterpart, presents interfaces appropriate for both “byte-oriented” and “character-oriented” Perl code (cf. Perl, Unicode, and Bytes). Since the byte-oriented interface is what I was using, switching URI::XSEscape from SvPV to
SvPVbyte was the simple fix to this problem.
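To make the difference concrete, here is a rough plain-C model of a byte-semantics read. It is a sketch under stated assumptions rather than Perl's implementation: `model_str` and `model_bytes` are hypothetical names, the struct stands in for a scalar's buffer plus its UTF-8 flag, and the function mimics the behaviour described above for SvPVbyte (one output byte per code point, failure for any code point above 255).

```c
#include <stddef.h>

/* Hypothetical stand-in for a Perl string: an internal buffer plus a
 * flag saying whether that buffer holds plain bytes or UTF-8. */
typedef struct {
    const unsigned char *buf;
    size_t len;
    int is_utf8;
} model_str;

/* Byte-semantics read: writes one byte per code point into `out`
 * (which must have room for s->len bytes) and returns the count.
 * Returns -1 if a code point exceeds 255 or the UTF-8 is malformed,
 * where the real macro would throw an exception instead. */
long model_bytes(const model_str *s, unsigned char *out)
{
    size_t n = 0;

    if (!s->is_utf8) {
        /* Buffer bytes already equal the code points. */
        for (size_t i = 0; i < s->len; i++)
            out[n++] = s->buf[i];
        return (long)n;
    }

    for (size_t i = 0; i < s->len; i++) {
        unsigned char c = s->buf[i];
        if (c < 0x80) {
            out[n++] = c;                      /* ASCII */
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < s->len &&
                   (s->buf[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence for a code point in 128-255. */
            out[n++] = (unsigned char)(((c & 0x03) << 6) |
                                       (s->buf[i + 1] & 0x3F));
            i++;
        } else {
            return -1;                         /* > 255 or malformed */
        }
    }
    return (long)n;
}
```

With this model, the one-byte buffer E9 (flag off) and the two-byte buffer C3 A9 (flag on) both read back as the single byte E9, which is the storage-independent result a byte-oriented caller expects. Reading `buf` directly, as a bare SvPV-style read would, returns different bytes for the two.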
In essence, C code like URI::XSEscape should approach Perl strings the same way that pure-Perl code does, without caring about Perl’s internal string storage. Most C code should thus avoid
SvPV for the same reason that most Perl should not
A quick scan through some popular XS modules showed more occurrences of this problem:
- DNS::Unbound (mea culpa!)
- Socket (a core module!)
These offer a non-default mode that auto-encodes to UTF-8, but their default setup has the same bug:
There are likely many more; those are just ones I’ve found.
As for why this bug is so widespread, I suspect it’s that:
SvPV is the shortest of the above-named methods for converting a Perl scalar to a C string. Thus, it’s easier to type and looks less “intimidating”.
Historically, Perl’s documentation favoured
SvPV in its examples of scalar-to-string conversion; the other two were seldom discussed. I fixed this recently, but it’ll be years before everyone’s local
perldoc reflects that change.
Perl’s default XS typemap uses SvPV_nolen (without checking
SvUTF8) to convert a scalar to a string. Thus, the following XSUB, called as
void
printstr (const char *str)
    CODE:
        fprintf(stdout, "%s", str);
… prints Perl’s internals, which a Perl caller isn’t supposed to care about. Ideally language defaults like this would be the “safe” ways to do things, but this particular one is nonsensical.
A simple way to test for this problem is to
utf8::upgrade your strings before you give them to the tested code—ensuring, of course, that you’re testing with some code points in the 128-255 range. Your test should verify that your program’s behaviour is the same with
utf8::upgraded strings as with non-upgraded strings.
You wouldn’t normally upgrade strings manually in production (since it makes your Perl code think about Perl’s internals, which it shouldn’t do), but for testing it’s fine and useful.
For example, I found the URI::XSEscape problem by doing:
my $foo = "épée"; utf8::upgrade($foo); print URI::XSEscape::uri_escape($foo);
The worst part of all this is that modules like CDB_File can’t replace
SvPV without breaking existing applications that may depend on that
use bytes-ish behaviour. So there’s not much to do except build new, corrected interfaces, deprecating the old ones … which of course will eventually necessitate changes to existing code. For Perl “gurus” that may be simple, but for everyone else changing existing code could be expensive, painful, and even harmful to Perl’s reputation as a language that prizes backward compatibility.
XS code isn’t the only place where this bug appears; Perl itself has it, too! Read all about it at “use Sys::Binmode;”.
I think most code that uses
SvPV to convert a Perl string to a C string intends for Perl code points to correspond to bytes in the C string; thus, such code should actually use
SvPVbyte or one of its variants. (UTF-8-aware C code, of course, would use
SvPVutf8.) Toward that end, we MUST discourage further use of
SvPV. I propose to the Perl community, then, a few changes: some that don’t break anything, and others that will probably break some things:
1) Discourage SvPV and friends. We can’t remove them, but we can create longer, “scarier-looking” aliases for them and use those names in the documentation. I propose
2) Make xsubpp warn when it sees SvPV or variants in a typemap.
3) Use Sys::Binmode in all new code to fix Perl’s own buggy behaviour.
4) Submit bug reports! Audit the XS modules that you use, and if you find different behaviour between upgraded and downgraded strings, let the maintainers know—ideally by sending them patches!
You can’t make an omelet without breaking some eggs, and you often can’t fix things like this without breaking some current applications. Nevertheless …
5) Make char * and
const char * in Perl’s default typemap use
SvPVbyte_nolen, but hey.) For the vast majority of XS modules this probably would be just a bug fix, though for apps that depend on a
use bytes-ish status quo there would be breakage. Thankfully, though: a) the most widely-used XS modules (e.g., MIME::Base64, JSON::XS) where this could be a problem don’t appear to be vulnerable, and b) any breakage would be easy to fix: module authors merely have to adopt
SvPVutf8 if that’s what they want, optionally creating separate functions if support for both is desired.
6) Make Sys::Binmode’s behaviour Perl’s own behaviour. This is more contentious because it sidesteps the much larger problem of Perl’s lacklustre support for Windows filesystems; still, Sys::Binmode-type behaviour is no worse than Perl’s status quo, and it fixes a significant leak in Perl’s string abstraction.
7) Perl needs to differentiate byte sequences from text strings. This would fix a plethora of “shin-bumpers” that afflict users of the language. This is a fairly difficult problem to solve, but I don’t think it’s insurmountable.
Absent fixes like the above, we just have to avoid this issue. You’ll always have consistent behaviour if you send encoded strings to the operating system and downgrade them prior to output; this way Perl doesn’t store any strings as UTF-8, so
SvPV and SvPVbyte give the same result.
IMPORTANT: If you don’t decode your strings, then by definition they’re already encoded, so in this case don’t encode them manually, or you’ll mangle your output.