DEV Community: Felipe Gasper

Prose for Coding Pros

Felipe Gasper — Thu, 29 Jul 2021 02:37:39 +0000

The ability to express yourself verbally is a useful but oft-overlooked skill for software developers. Here I’m going to explore some guidelines I try to apply in my own writing.

(In the following I’m using incomplete sentences per convention in software status messages.)

DISCLAIMER: I am not a professional writer. YMMV.

1. Avoid “could not”.

Consider this phrase:

Could not open the file “foo”.

This means that, at some point in the (assumedly recent) past, the “speaker”—assumedly some computer somewhere—lacked the ability to open foo.

But does that mean:

1) The computer observed its inability to open foo, and didn’t bother trying?

2) The computer tried, and failed, to open foo?

This is a significant difference, and an important one to disambiguate.

#1 connotes a sense of ongoing inability: “I can’t do this thing, no matter how many times I try.” For example, I can’t jump over my house. No matter how many times I try, it’s just not going to happen. It was also true last night; thus, “last night I couldn’t jump over my house” is a true statement.

#2, by contrast, reports on the result of one specific operation at one specific time. Let’s say I did try last night, for some reason, to jump over my house. This is probably what folks will infer if I say “last night I couldn’t jump over my house”, but logically it’s not the only meaning; thus, the statement is imprecise. We can do better.

(Linguistically-minded folks may observe a certain parallel with the distinction between “imperfect” and “preterite” past-tense forms.)

Consider expressions like:

Cannot open the file “foo”.

This connotes ongoing inability, which is the logical way to report cases where, e.g., filesystem permissions impede access to a given resource. (To go back to the house analogy, this would be “I can’t jump over my house.”)

Failed to open the file “foo”.

This states unambiguously that “I tried, but failed.” It doesn’t attempt to communicate ongoing inability; it just gives the outcome of a specific attempt to do something. (Also: “Last night I failed to jump over my house.”)
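As a rough illustration of how the two wordings might map onto code, here is a minimal Perl sketch. The open_status() helper and its exact messages are hypothetical. The point is only that “Cannot” reports a check made without trying, while “Failed to” reports the outcome of an actual attempt.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any library): describes one attempt to read $path.
sub open_status {
    my ($path) = @_;

    # No attempt is made here: we merely observe that read access is lacking.
    if ( -e $path && !-r $path ) {
        return qq{Cannot open the file "$path".};
    }

    # Here we really do try, so a failure is the outcome of this one attempt.
    open( my $fh, '<', $path )
        or return qq{Failed to open the file "$path": $!};

    close $fh;
    return qq{Opened the file "$path".};
}

print open_status('/no/such/dir/foo'), "\n";
```

Appending $! to the “Failed to” message also tells the reader why that one attempt failed.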

2. Prefer antonyms over negation.

Consider this phrase:

Not Authorized

We could interpret this at least two ways:

1) I cannot access the resource.

2) I might have access to the resource, but nothing has explicitly authorized me.

While native English speakers may intuitively understand “Not Authorized” to imply meaning #1, a nonnative speaker may not. Logically, either meaning is an accurate interpretation of the phrase.

We can be more precise thus:

Access Forbidden

Sometimes following this principle will lead you to some over-awkward wordings. For example, you could replace …

You do not have a file named “foo”.

… with:

You lack a file named “foo”.

… but that sounds kind of goofy. Depending on context that may be OK, though!

(cf. Strunk & White, The Elements of Style, which says to “Put statements in positive form.”)

3. Prefer passive voice to active voice when it makes more sense.

We should prefer active voice to passive voice in most cases: it’s clearer to state that an actor “does” an action, rather than that the action “is done” by the actor.

Oftentimes, though, I find myself describing scenarios where the “doer” is either unknown or irrelevant. In those cases, it’s often more logical to use the passive voice than to contort your phrasing to allow active voice.

Consider the following:

If the file has been opened in the last hour, delete the file.

That “has been opened” is classic passive voice: a form of “to be” plus a past participle. We might try to convert it to active voice thus:

If anyone has opened the file in the last hour, delete the file.

This, though, subtly changes the meaning: now our condition—i.e., the “if block” of our phrase—stipulates a “doer”: “anyone”. It’s a pretty generic expression, but might it lead someone to think that it refers to a person? What if what opens the file is a script?

OK, so we rephrase it again:

If anyone or anything has opened the file in the last hour, delete the file.

This works logically, but to my sense it’s getting a bit awkward. The original passive form of the expression is simpler and clearer.

(cf. Yellowlees Douglas, The Reader’s Brain)

4. Rather than “of”, consider a possessive.

In German, Igor Stravinsky’s “Rite of Spring” is “Frühlingsopfer”: literally “Spring’s Rite”. We wouldn’t say that in English because it just sounds awkward, but in other contexts it can yield a slight gain in concision. Consider these:

the file’s change time

… versus …

the change time of the file

The first one is just a bit more concise.

Another advantage of the possessive form is its “object-oriented” syntax: the word order corresponds to how that might look in code: file.ctime. If your target audience is developers, that might make the text a bit easier to understand, though it’s subtle enough that they may not notice.

5. Describe causal relationships explicitly.

Compare the following sentences:

You cannot save “foo”. You have exceeded your quota.

You cannot save “foo” because you have exceeded your quota.

Most folks will probably interpret the above phrases the same way, but note that only the second form unambiguously gives the relationship between the two ideas (i.e., inability to save “foo” and quota excess). Being explicit reduces the likelihood of misunderstanding.

Note, however, that the second form includes one larger sentence in lieu of two smaller ones. All else being equal, longer sentences impede comprehension; thus, one might fear that the loss in comprehensibility outweighs the gain in clarity that “because” yields.

A third option may yield a “happy medium”:

You cannot save “foo”. This is because you have exceeded your quota.

Now there’s more verbiage overall, but we’re back to two small sentences rather than one large one, and we still describe the causal relationship explicitly.

6. Write “properly”.

Like it or not, some people will judge you—maybe even subconsciously—for neglecting widely-accepted grammar conventions. (But hey, that’s life!) Overall, it’s best to avoid that. Learn and apply the difference between “its” and “it’s”, “who” and “whom”, etc.

7. Beware “the”.

“The” is essentially a “global”: it lacks context, which means whatever it references applies universally, unless something else around it restricts its scope. Consider:

The submit button doesn’t work.

There are lots of submit buttons in the world … which one do we mean? Assumedly there’d be context around this phrase that restricts it, but as with “could not” above, it’s better to minimize reliance on context. Consider instead, then:

Our login page’s submit button doesn’t work.

8. Avoid “error” as a verb.

“Error”—in American English, anyhow—isn’t a verb. Say “fail” instead—or, if you must, “err”! :)

9. Avoid “not $this because $that”.

Consider this phrase:

I’m not going hiking because of the weather.

All this means is that the phrase “I am hiking because of the weather” is false. Thus, either of these could be what the phrase means:

It’s too cold for me to go hiking.
Weather isn’t why I’m hiking; I’m hiking to find buried treasure!

To avoid this ambiguity, do one of the following:

Turn the phrase around: “Because of the weather, I’m not going hiking.”
Express the consequence positively, e.g., “I’m skipping the hiking trip because of the weather.”

What difference does all this really make?

Remember the last time you debugged something that just did not make sense? How many times did you stare at whatever error messages you had, comb through logs looking for clues, or the like? At some point you likely realized that you had misunderstood something—or, at the very least, you spent time wondering if you had.

This is the value of unambiguous messaging. If I see “failed to open the file”, I know an attempt was made to do so; if I just see “could not open the file”, though, I may spend time wondering whether that means an attempt was made, or merely that a stat() revealed improper permissions or ownership.

Unambiguous communication saves time and frustration for you and your colleagues, present and future. What’s not to love about that?

SQLite, Perl, and a Boolean

Felipe Gasper — Sun, 30 May 2021 16:27:02 +0000

I’ve written several articles now about the trials and tribulations of character encoding in Perl. Having gained the knowledge I have, I’ve also been finding bugs in libraries we use at $work and sending patches to their maintainers.

The latest one is DBD::SQLite, CPAN’s self-contained SQLite binding. It’s a great library that I’ve used for years, but I recently noted two problems in it:

1) In its default configuration it used the SvPV macro to translate Perl strings to C strings, which is bad for reasons I detailed in “Perl’s SvPV Menace”.

2) In its (non-default) “unicode” configuration it used a “naïve” method of UTF-8 decoding that neglects validation. This mechanism can corrupt Perl’s internals by making it mistake invalid UTF-8 sequences for valid ones.

Neither of these is trivial to fix: applications may depend on the SvPV problem—what one coworker of mine calls a “load-bearing bug” 😀—while adding UTF-8 validation entails a performance hit.

In reality, DBD::SQLite needed at least 4 modes of translating between Perl and C strings:

1) The current (“load-bearing-buggy”) default.
2) Same as #1, but use SvPVbyte to avoid the SvPV bug.
3) Current “naïve unicode” behaviour.
4) A “non-naïve unicode” mode that validates incoming UTF-8.

(I eventually made two variants of this last one: one that just warns on invalid data, and the other that throws an exception.)

There was another problem, though: DBD::SQLite’s interface for controlling this was a boolean. That meant only two modes were even possible!

This exemplifies a principle a mentor of mine taught me years back: avoid boolean parameters. They restrict your ability to add additional configurations.

(And for pity’s sake, abhor unnamed booleans in particular! What does the 0 in open_file($path, 0) mean??)
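One way to picture the alternative is a named option that takes a mode string rather than a true/false value. The following is only a sketch: the string_mode option and the four mode names are hypothetical and are not DBD::SQLite’s actual interface.

```perl
use strict;
use warnings;

# Hypothetical mode names for illustration; NOT DBD::SQLite's actual interface.
my %KNOWN_MODE = map { $_ => 1 } qw(
    legacy
    bytes
    unicode_naive
    unicode_strict
);

# Hypothetical helper: validates a named "string_mode" option.
sub string_mode {
    my (%opts) = @_;

    my $mode = $opts{string_mode} // 'legacy';
    die qq{Unknown string_mode "$mode"\n} if !$KNOWN_MODE{$mode};

    return $mode;
}

print string_mode( string_mode => 'unicode_strict' ), "\n";
```

With this shape, adding a fifth mode later means adding one more name, and the call site says what is being chosen.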

To fix this my pull request had to deprecate the existing sqlite_unicode parameter. It’s an unfortunate step that’ll produce new warnings in existing applications, but the “omelet” here justifies the “broken egg”.

Perl’s SvPV Menace

Felipe Gasper — Thu, 27 May 2021 02:42:30 +0000

If you’ve read my Perl, Unicode, and Bytes or Sys::Binmode posts, you know about the complexities of character encoding in Perl. A bit after I wrote that first post I had a little epiphany I thought worth sharing.

One day I noticed that URI::XSEscape was mangling its output: I’d pass in épée and get out %C3%83%C2%A9p%C3%83%C2%A9e. I recognized this as an extra UTF-8 encode: rather than URI-encoding my 6 bytes of épée, it was UTF-8 encoding—so now 10 bytes—then URI-encoding that.

I pulled out Devel::Peek and saw that something prior to the URI-encoding step had “upgraded” my string’s internal storage: Perl itself stored my string as 10 bytes, even though the Perl scalar still consisted of 6 characters. Ordinarily this is nothing of importance since Perl code doesn’t need to care how Perl itself stores its strings.

… until it does need to care, that is.
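The “6 characters, 10 bytes” situation can be reproduced with core Perl alone. This is a small sketch rather than the original debugging session: storage_sizes() is a helper made up for illustration, and it peeks at the internal buffer via the bytes pragma, which ordinary code should avoid.

```perl
use strict;
use warnings;

# Returns (number of code points, number of bytes in Perl's internal buffer).
sub storage_sizes {
    my ($str) = @_;

    my $code_points    = length $str;
    my $internal_bytes = do { use bytes; length $str };

    return ( $code_points, $internal_bytes );
}

my $epee = "\xc3\xa9p\xc3\xa9e";    # the 6 UTF-8 bytes of "épée"
printf "before upgrade: %d code points, %d internal bytes\n", storage_sizes($epee);

utf8::upgrade($epee);               # changes storage only, not the code points
printf "after upgrade:  %d code points, %d internal bytes\n", storage_sizes($epee);
```

After the upgrade the string still has 6 code points, but its internal buffer holds 10 bytes, which is what code that reads the raw buffer ends up seeing.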

What is SvPV?

Perl’s C API (the set of macros and functions available to work with Perl from C) is a classic C API: lots of different ways to do almost the same thing. To translate a Perl scalar to a C signed integer, for example, you can use SvIV, SvIV_nomg, SvIVX, or SvIVx. (IV here signifies an “integer value”.) A similar set of macros exists for unsigned integers (UVs).

Converting a Perl scalar to a C string is similar. There are many tools available, but the 3 “fundamental” ones are:

SvPVbyte: Takes the code points of your Perl string and gives back a C buffer whose bytes match those code points. Thus, any code point that exceeds 255 doesn’t work, and an exception is thrown.
SvPVutf8: Like SvPVbyte but gives the UTF-8-encoded bytes for your Perl string’s code points. This works for any code point that Perl can store, but for code points 128-255 it’ll give different results from SvPVbyte. (cf. perldoc perlunicode)
SvPV: Gives you the Perl string’s internal buffer, aka its “PV” (“pointer value”). It could be bytes, or it could be UTF-8. It’s like a C analogue to Perl’s use bytes.

SvPV, of course, is what URI::XSEscape was using.

For SvPV to be meaningful it has to be used in tandem with SvUTF8, a macro that tells you which form the PV is: bytes, or UTF-8. So if SvUTF8 is true, then SvPV’s output is UTF-8; otherwise SvPV’s output is bytes. But URI::XSEscape wasn’t checking SvUTF8; it was just URI-encoding SvPV directly.

The big problem with SvPV is that the number of contexts other than Perl where it’s sensible to have a C string that could be bytes or UTF-8 is … small. Nevertheless, uses of this macro (and its variants) to interact with contexts outside Perl are all over CPAN.

URI::XSEscape, like its pure-Perl counterpart, presents interfaces appropriate for both “byte-oriented” and “character-oriented” Perl code (cf. Perl, Unicode, and Bytes). Since the byte-oriented interface is what I was using, switching URI::XSEscape from SvPV to SvPVbyte was the simple fix to this problem.

In essence, C code like URI::XSEscape should approach Perl strings the same way that pure-Perl code does, without caring about Perl’s internal string storage. Most C code should thus avoid SvPV for the same reason that most Perl should not use bytes.

The plot thickens …

A quick scan through some popular XS modules showed more occurrences of this problem:

These offer a non-default mode that auto-encodes to UTF-8, but their default setup has the same bug:

There are likely many more; those are just ones I’ve found.

How did this come to be?

I suspect it’s that:

SvPV is the shortest of the above-named methods for converting a Perl scalar to a C string. Thus, it’s easier to type and looks less “intimidating”.
Historically, Perl’s documentation favoured SvPV in its examples of scalar-to-string conversion; the other two were seldom discussed. I fixed this recently, but it’ll be years before everyone’s local perldoc reflects that change.
Perl’s default XS typemap uses SvPV (without consulting SvUTF8) to convert a scalar to a string. Thus, the following XSUB, called as printstr($mystr):

void
printstr (const char *str)
  CODE:
    /* Pass str as data, never as the format string. */
    fprintf(stdout, "%s", str);

… prints Perl’s internals, which a Perl caller isn’t supposed to care about. Ideally language defaults like this would be the “safe” ways to do things, but this particular one is nonsensical.

Does this problem affect your code?

A simple way to test for this problem is to utf8::upgrade your strings before you give them to the tested code—ensuring, of course, that you’re testing with some code points in the 128-255 range. Your test should verify that your program’s behaviour is the same with utf8::upgraded strings as with non-upgraded strings.

You wouldn’t normally upgrade strings manually in production (since it makes your Perl code think about Perl’s internals, which it shouldn’t do), but for testing it’s fine and useful.

For example, I found the URI::XSEscape problem by doing:

use URI::XSEscape;

my $foo = "épée";
utf8::upgrade($foo);    # same code points, but Perl now stores them "upgraded"
print URI::XSEscape::uri_escape($foo);
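The same check can be written as a regression test. Below is a minimal sketch using the core Test::More module. Here function_under_test() is a pure-Perl stand-in assumed for illustration, so you would swap in a call to whatever XS function you are auditing.

```perl
use strict;
use warnings;
use Test::More;

# Stand-in for the code under test (a pure-Perl escaper, assumed for illustration).
# Replace it with a call into the XS module that you want to audit.
sub function_under_test {
    my ($str) = @_;
    return join q{}, map { sprintf( '%%%02X', ord $_ ) } split //, $str;
}

my $downgraded = "\xe9p\xe9e";      # includes code points in the 128-255 range
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);           # same code points, different internal storage

is(
    function_under_test($upgraded),
    function_under_test($downgraded),
    'upgraded and downgraded strings give the same output',
);

done_testing();
```

A function that reads Perl’s internal buffer without consulting the UTF-8 flag would be expected to fail this comparison.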

Not just any old bug …

The worst part of all this is that modules like CDB_File can’t replace SvPV without breaking existing applications that may depend on that use bytes-ish behaviour. So there’s not much to do except build new, corrected interfaces, deprecating the old ones … which of course will eventually necessitate changes to existing code. For Perl “gurus” that may be simple, but for everyone else changing existing code could be expensive, painful, and even harmful to Perl’s reputation as a language that prizes backward compatibility.

But that’s not all …

XS code isn’t the only place where this bug appears; Perl itself has it, too! Read all about it at “use Sys::Binmode;”.

How can we fix this?

I think most code that uses SvPV to convert a Perl string to a C string intends for Perl code points to correspond to bytes in the C string; thus, such code should actually use SvPVbyte or one of its variants. (UTF-8-aware C code, of course, would use SvPVutf8.) Toward that end, we MUST discourage further use of SvPV. I propose to the Perl community, then, a few changes: some that don’t break anything, and others that will probably break some things:

Fixing this: The easy parts!

1) Rename SvPV and friends. We can’t remove them, but we can create longer, “scarier-looking” aliases for them and use those names in the documentation. I propose SvPVinternal, SvPVinternal_const, etc.

2) Make xsubpp warn when it sees SvPV or variants in a typemap.

3) Use Sys::Binmode in all new code to fix Perl’s own buggy behaviour.

4) Submit bug reports! Audit the XS modules that you use, and if you find different behaviour between upgraded and downgraded strings, let the maintainers know—ideally by sending them patches!

Fixing this: The hard part …

You can’t make an omelet without breaking some eggs, and you often can’t fix things like this without breaking some current applications. Nevertheless …

5) Make char * and const char * in Perl’s default typemap use SvPVbyte. (Actually SvPVbyte_nolen, but hey.) For the vast majority of XS modules this probably would be just a bug fix, though for apps that depend on a use bytes-ish status quo there would be breakage. Thankfully, though: a) the most widely-used XS modules (e.g., MIME::Base64, JSON::XS) where this could be a problem don’t appear to be vulnerable, and b) any breakage would be easy to fix: module authors merely have to adopt SvPVutf8 if that’s what they want, optionally creating separate functions if support for both is desired.

6) Make Sys::Binmode’s behaviour Perl’s own behaviour. This is more contentious because it sidesteps the much larger problem of Perl’s lacklustre support for Windows filesystems; still, Sys::Binmode-type behaviour is no worse than Perl’s status quo, and it fixes a significant leak in Perl’s string abstraction.

Fixing this: The moon-shot …

7) Perl needs to differentiate byte sequences from text strings. This would fix a plethora of “shin-bumpers” that afflict users of the language. This is a fairly difficult problem to solve, but I don’t think it’s insurmountable.

In the meantime …

Absent fixes like the above, we just have to avoid this issue. You’ll always have consistent behaviour if you send encoded strings to the operating system and downgrade them prior to output; this way Perl doesn’t store any strings as UTF-8, so SvPV and SvPVbyte give the same result.

IMPORTANT: If you don’t decode your strings, then by definition they’re already encoded, so in this case don’t encode them manually, or you’ll mangle your output.
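For decoded text, that workaround might look like the following sketch. The to_os_bytes() helper is hypothetical; it relies on the core utf8::encode function, which converts a string in place to its UTF-8 bytes and leaves the result stored downgraded.

```perl
use strict;
use warnings;

# Hypothetical helper: turns decoded text into UTF-8 bytes for the OS.
sub to_os_bytes {
    my ($text) = @_;

    my $bytes = $text;
    utf8::encode($bytes);    # in place: one code point per UTF-8 byte, stored downgraded

    return $bytes;
}

my $text = "\xe9p\xe9e";     # decoded text: 4 code points
utf8::upgrade($text);        # simulate a string that Perl happens to store upgraded

my $bytes = to_os_bytes($text);
printf "%d bytes, upgraded: %s\n", length $bytes, utf8::is_utf8($bytes) ? 'yes' : 'no';
```

Because the encoded result is never stored upgraded, code that reads the raw buffer and code that converts code points to bytes see the same thing.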

Perling and Curling

Felipe Gasper — Fri, 07 May 2021 04:36:06 +0000

Most of us probably know curl as a quick and easy way to send HTTP requests from the command line.

That tool, though, is just an interface to the curl project’s real gold: the libcurl API. Using this API, applications in all sorts of languages have easy access to the awesome power that libcurl provides. This article will discuss how to use that power in Perl.

A Quick Example

use Net::Curl::Easier;

my $easy = Net::Curl::Easier->new(
    url => 'http://perl.org',
    followlocation => 1,
)->perform();

print $easy->head(), $easy->body();

Let’s talk about what just happened.

Net::Curl::Easier is a thin wrapper around Net::Curl’s “easy” interface—“easy” is what libcurl calls it!—that smooths over some rough edges in Net::Curl.

(Full disclosure: I am Net::Curl::Easier’s maintainer.)

Once we create our “Easier” object, having given it the proper URL and told it to follow HTTP redirects (followlocation refers to HTTP’s Location header), we run perform() on the Easier object.

After that, we print the HTTP response headers and body, and we’re done!

Why not just use HTTP::Tiny?

Indeed. Well, error reporting, for one. Consider:

Net::Curl::Easier->new(
    url => 'http://blahblah',
)->perform();

If you run this you’ll probably just see Couldn't resolve host name printed to standard error. But if you dig deeper you’ll see something nifty:

use Net::Curl::Easier;
use Data::Dumper;

eval {
    Net::Curl::Easier->new(
        url => 'http://blahblah',
    )->perform();
};
print Dumper $@;

It turns out that that error isn’t just a string; it’s an exception object.

In large systems I often want to handle certain failure types differently from others. HTTP::Tiny’s errors are just strings, so type-specific failure handling with HTTP::Tiny entails parsing strings, which is brittle. What if someone decides to reword some error message for clarity, thus breaking my string parser?

With Net::Curl I can look for specific numeric error codes, documentation for which the curl project itself maintains. This is much more robust.

Don’t care. What else you got?

OK. How about this:

my $easy = Net::Curl::Easier->new(
    username => 'hal',
    password => 'itsasecret',
    url => 'imap://mail.example.com/INBOX/;UID=123',
)->perform();

I just queried … an email inbox?!?

Curl doesn’t just speak HTTP; it speaks many other protocols including IMAP, LDAP, SCP, and MQTT. To see the full list of protocols that your curl supports, run curl --version.

Concurrency

Curl can also run concurrent queries. To do that I recommend using Net::Curl::Promiser. (Full disclosure: I also maintain this module.)

Example, assuming use of Mojolicious:

use Net::Curl::Easier;
use Net::Curl::Promiser::Mojo;
use Mojo::Promise;

my $easy1 = Net::Curl::Easier->new(
    url => 'http://perl.org',
    followlocation => 1,
);

my $easy2 = Net::Curl::Easier->new(
    username => 'hal',
    password => 'itsasecret',
    url => 'imap://mail.example.com/INBOX/;UID=123',
);

my $easy3 = Net::Curl::Easier->new(
    username => 'hal',
    password => 'itsasecret',
    url => 'scp://tty.example.com/path/to/file',
);

my $promiser = Net::Curl::Promiser::Mojo->new();

Mojo::Promise->all_settled(
    $promiser->add_handle($easy1)->then( sub {
        print $easy1->head(), $easy1->body();
    } ),
    $promiser->add_handle($easy2)->then( sub {
        # ... whatever you want with the IMAP result
    } ),
    $promiser->add_handle($easy3)->then( sub {
        # ... whatever you want with the SCP result
    } ),
)->wait();

We just grabbed a web page, queried a mailbox, and downloaded a file via SCP, all in parallel!

Note, too, that this method interfaces seamlessly with other promises. So if you have existing Mojo::UserAgent-based code, you can add requests for other protocols alongside it.

Net::Curl::Promiser also works natively with AnyEvent and IO::Async, should those be of greater interest to you. It also provides a convenience layer for custom select-based event loops, in case that’s how you roll.

Other Modules

Some alternatives to modules presented above:

AnyEvent::YACurl: A newer library than Net::Curl that simplifies the interface a bit. It assumes use of AnyEvent, though, so if you’re not using AE then this may not be for you.
WWW::Curl: The library of which Net::Curl is a fork. It can do much of what Net::Curl does but lacks access to libcurl’s MULTI_SOCKET interface, which is faster and more flexible than curl’s internal select-based manager for concurrent requests.
Net::Curl::Simple: A wrapper by Net::Curl’s original author. It provides some of the same conveniences as Net::Curl::Promiser and Net::Curl::Easier but uses callbacks rather than promises.

Closing Thoughts

Curl exposes an awesome breadth of functionality, of which the above examples have just scratched the surface. Check it out!

use Sys::Binmode;

Felipe Gasper — Sat, 27 Mar 2021 10:34:01 +0000

Character encoding is an often-misunderstood aspect of Perl. Perl itself has significant bugs in the area. I recently published a CPAN module called Sys::Binmode which fixes most of those. If you write Perl you probably should use it in all new code.

I know that’s a “tall” claim, but …

Check this out:

my $foo = "\xff\x{100}";
chop $foo;
print $foo, $/;
exec "echo", $foo;

This looks like it ought to print two identical lines, right? But in fact, it prints:

�
ÿ

Why? To answer that we have to learn a bit of Perl’s internals. Read on!

Background: What’s in a string?

In theory, Perl strings store code points, nothing more. They don’t store “bytes” or “characters”, but just code points—i.e., unsigned integers. (In that sense, Perl is more like JavaScript than C!)

That, of course, is just an abstraction: all programming languages use bytes internally to store strings. How does Perl decide which bytes to use for which code points? As it happens, Perl can do that in either of two formats: a “narrow” format that can store code points 0-255 only, and a “wide” format that can store any arbitrary code point. Which of those formats Perl uses for a given string is up to Perl; things that aren’t Perl generally shouldn’t care about it.

For this abstraction to work, whether Perl stores a given string as “narrow” or “wide” must make no difference to a Perl program. And indeed, if you print $foo you’ll get the same result regardless of which internal format Perl uses to store $foo. Same for syswrite and send.

Most of Perl’s built-ins, though—e.g., exec, open, mkdir, etc.—don’t work this way.

Look again at our 4-line program above. In line 1 we create a string with 2 code points: 255 and 256. In line 2 we chop off the latter code point, so now $foo just has 255. In line 3 we print that string and a newline; in line 4 we run echo to do the same thing. Ideally lines 3 and 4 should achieve the same output. But for you they probably didn’t. Why?

Let’s rerun that program but this time pipe it to xxd to see exactly what’s being output:

> pbpaste | perl | xxd
00000000: ff0a c3bf 0a

0a is just the newline character. So line 3 printed a single byte, 0xff (plus newline), while echo on line 4 printed 2 bytes—0xc3 0xbf (and a newline). Line 3 is correct: a string that contains code point 255 should output byte 255 (i.e., 0xff). What’s going on with line 4?

Recall that, of Perl’s internal string-storage formats, only the “wide” one can handle code points above 255. Since $foo on line 1 contains code point 256, Perl stores that string in “wide” format. Then in line 2 we get rid of code point 256. Now we have just 255. Perl could thus switch our string to its “narrow” (i.e., 0-255) format, but it happens—as of Perl 5.32, anyway—not to.

This should make no difference since those internal storage details are behind Perl’s string abstraction. That’s the case with print, but exec misbehaves: it outputs Perl’s raw internal buffer rather than the proper code-point-to-byte conversion that print uses. Perl, though, doesn’t publicly define the contents of that internal buffer. Thus we have undefined behaviour, aka “nasal demons”, built directly into Perl!

This is a leak in Perl’s string-storage abstraction, and it’s what Sys::Binmode fixes.
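You can watch the storage format from Perl itself with the core utf8::is_utf8 function. The sketch below uses a storage_format() helper invented for illustration. It is meant for poking at internals while debugging, not for production logic.

```perl
use strict;
use warnings;

# Reports which of Perl's two internal formats currently stores the string.
sub storage_format {
    return utf8::is_utf8( $_[0] ) ? 'wide' : 'narrow';
}

my $foo = "\xff\x{100}";
chop $foo;
printf "code point %d, stored %s\n", ord $foo, storage_format($foo);

utf8::downgrade($foo);    # ask Perl to switch this string to "narrow" storage
printf "code point %d, stored %s\n", ord $foo, storage_format($foo);
```

The code point is 255 on both lines; only the storage differs. After utf8::downgrade the string is stored “narrow”, so its internal buffer is the single byte 0xff.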

(Extra credit: remove the \x{100} and line 2 from our program above, and rerun it. The two lines should now be the same. Why?)

Enter Sys::Binmode

Sys::Binmode fixes exec and many other Perl built-ins by force-converting those built-ins’ arguments to Perl’s internal “narrow” string storage format. This fixes the abstraction leak: now, no matter how these strings are stored, Perl gives them to the operating system the same way.

Try it: do cpan Sys::Binmode, then rerun our program with perl -MSys::Binmode. It’ll now print two identical lines.

Special Case: Non-POSIX OSes (e.g., Windows)

Windows programmers may see a problem here: Perl’s “narrow” string storage format can only store bytes, so any time we want to give arbitrary Unicode characters to the operating system (a need that doesn’t arise on POSIX OSes like Linux), we’re stuck.

As it happens, though, Perl doesn’t actually use the Windows APIs that would allow sending arbitrary Unicode characters anyway. If Perl ever changed that, Sys::Binmode would need an update, but for now it can work the same way as on POSIX OSes without compromising any functionality.

Use in Existing Code?

Note that I say to use Sys::Binmode in new code, not all code. This is because existing code may actually depend on Perl’s abstraction leak.

Look again at exec’s broken behaviour above. For code point 255 it printed the bytes of Perl’s “wide” storage format, which for that string was 2 bytes: 0xc3 and 0xbf. Notice that that broken behaviour actually made our terminal print something useful: ÿ. As it happens, those 2 bytes from Perl’s internals are UTF-8 for 255. That’s because Perl’s “wide” internal format is actually just (a “lax” variant of) UTF-8, so anything that outputs Perl’s internals will output UTF-8 if Perl stores the string in “wide” format.

Ordinarily to output a string in UTF-8 you encode it thus explicitly, e.g., encode('UTF-8', $str). exec appears to be automatically encode()ing for us, but it’s actually just outputting whatever Perl happens to store internally. So if Perl decides to store a string “wide”, it’ll give UTF-8 to exec … but if Perl decides to store that string “narrow”, then you’ll get something else! We could try to second-guess Perl’s internal decision-making, but that’s dangerous: how Perl decides to store its strings is undocumented and always subject to change.

Sys::Binmode will suppress that unreliable “auto-encode” behaviour, which forces us to encode our strings properly before giving them to exec and friends. Of course, that’s what we should have done all along!
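Encoding explicitly could look like this sketch, which uses the core Encode module. The command_arg() helper is hypothetical, and the exec call is left as a comment so that the snippet only prints the bytes it would send.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encodes decoded text for use as a command argument.
sub command_arg {
    my ($text) = @_;
    return Encode::encode( 'UTF-8', $text );
}

my $arg = command_arg("\xff");
print join( ' ', map { sprintf( '%02x', ord $_ ) } split //, $arg ), "\n";

# The encoded bytes can then go to the operating system, e.g.:
#     exec 'echo', $arg;
```

This prints c3 bf, the UTF-8 encoding of code point 255, regardless of how Perl stored the original string.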

Conclusion

I can think of no situation where Sys::Binmode effects any undesirable change to Perl in new code. It surely fixes bugs like in our exec demo program. Assuming that I’m correct that, for new code, this module only avoids problems without introducing any, it should be used in all new code.

Convinced? :-)

Perl, Unicode, and Bytes

Felipe Gasper — Fri, 29 Jan 2021 00:29:50 +0000

Wide character in print at Foo/Bar.pm line 27.

We’ve all been here: that maddening “wide character” warning. Why does it happen? How can we fix it? How can we prevent it in the future? Let’s take a look.

Lots of early Perl adopters were C programmers. C strings are arrays of bytes, which allow code points up to 255, and that’s it. Perl used that model for many years.

Along came Unicode, and with it a need for Perl to store code points that exceed 255 (i.e., “wide characters”). The solution—which Perl retains today—was to give Perl a 2nd way of storing a string: in addition to C-style “byte strings”, Perl can store strings in an internal, Unicode-compatible encoding. Thus, a Perl string can now natively store any Unicode code point.

Of course, programs don’t generally receive “wide characters” as inputs. They receive bytes, then decode those bytes into “characters”. Then they encode the characters back into bytes for output. In general, then, each program:

… receives bytes as input,
… decodes those bytes to characters,
… does something with those characters,
… encodes its output characters to bytes,
… and outputs those bytes.
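Those five steps can be sketched with the core Encode module. The shout() function is a made-up example whose “something” is upper-casing, and the sketch assumes that input and output are both UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical example logic: upper-cases UTF-8 input.
sub shout {
    my ($input_bytes) = @_;    # 1. receive bytes

    # 2. decode the bytes to characters (dies on invalid UTF-8)
    my $chars = Encode::decode( 'UTF-8', $input_bytes, Encode::FB_CROAK() );

    # 3. do something with the characters
    my $result = uc $chars;

    # 4. encode the output characters to bytes
    my $output_bytes = Encode::encode( 'UTF-8', $result, Encode::FB_CROAK() );

    return $output_bytes;      # 5. output bytes
}

binmode STDOUT;    # raw byte output
print shout("\xc3\xa9p\xc3\xa9e"), "\n";
```

Given the UTF-8 bytes of “épée”, this writes the UTF-8 bytes of “ÉPÉE”. Upper-casing only handles “é” correctly because step 2 turned the bytes into characters first.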

Here’s the trick: lots of Perl programs simply don’t care about “characters”; for example, if all you’re doing is piping a stream from one filehandle to another, there’s no reason to decode bytes to characters since we’re just going to re-encode those characters to bytes right away. For such programs, Perl’s pre-Unicode, a-byte-is-a-character-is-a-byte model works just fine.

Let’s call these two workflows “character-oriented” and “byte-oriented”. Most character encoding problems in Perl arise from a conflict between these two.

Byte-Oriented Data in a Character-Oriented World

Suppose we omit step 2 above. Consider the following:

> perl -MJSON::PP -E'my $s = "…"; say JSON::PP::encode_json([$s])'
["â€¦"]

To grok the above, first consider $s. Most folks nowadays probably use UTF-8 terminals, which means … takes 3 bytes: 0xe2 0x80 0xa6. Our one-liner doesn’t decode $s, so as far as Perl’s concerned $s is 3 characters: 0xe2 0x80 0xa6.

encode_json(), though, expects its input strings to be decoded. It also outputs a byte sequence; thus, it applies a UTF-8 encode to each of $s’s 3 characters, which yields 6 bytes: 0xe2 becomes 0xc3 0xa2, 0x80 becomes 0xc2 0x80, and 0xa6 becomes 0xc2 0xa6.
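
We can reproduce that arithmetic without JSON::PP at all. The sketch below uses the built-in utf8::encode as a stand-in for the encoder’s UTF-8 step; the variable names are just for illustration.

```perl
use strict;
use warnings;

# The 3 UTF-8 bytes of U+2026, left undecoded: Perl sees 3 characters.
my $s = "\xe2\x80\xa6";

# UTF-8-encode those 3 "characters", as a character-expecting encoder would.
my $double = $s;
utf8::encode($double);

# Show each resulting byte in hex.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $double;
print length($s), " -> ", length($double), " bytes: $hex\n";
```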

To fix this, we can do one of:

A) Decode the input, e.g.:

my $s = "…";
$s = Encode::Simple::decode_utf8($s);
say JSON::PP::encode_json([$s]);

B) Provide a “pre-decoded” string:

my $s = "\x{2026}";
say JSON::PP::encode_json([$s]);

C) Make the JSON encoder forgo character encoding, e.g.:

my $s = "…";
say JSON::PP->new()->utf8(0)->encode([$s]);

CAVEAT: This last approach can yield invalid JSON.
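
For this particular input, all three approaches should produce the same output. Here is a small script that checks this; note that it substitutes the core Encode module for Encode::Simple so that it runs without extra installs, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode   ();
use JSON::PP ();

my $raw = "\xe2\x80\xa6";    # undecoded UTF-8 bytes for U+2026

# A) Decode the input first, then let the encoder UTF-8-encode it.
my $decoded = Encode::decode( 'UTF-8', $raw,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );
my $out_a = JSON::PP::encode_json( [$decoded] );

# B) Start from a character string.
my $out_b = JSON::PP::encode_json( ["\x{2026}"] );

# C) Leave the bytes alone and disable the encoder's UTF-8 step.
my $out_c = JSON::PP->new()->utf8(0)->encode( [$raw] );

print "all three match\n" if $out_a eq $out_b && $out_b eq $out_c;
```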

Character-Oriented Data in a Byte-Oriented World

The opposite problem—omitting step 4 in our 5-step workflow above—is a bit more interesting:

> perl -MJSON::PP -E'say JSON::PP::decode_json(q<["…"]>)->[0]'
Wide character in print at -e line 1.
…

Unlike before, where the mangled characters in the output reveal a palpable problem, here the program actually prints the right thing; it’s just throwing a warning along the way. What gives?

Just as encode_json() does a UTF-8 encode on its input, decode_json() does a UTF-8 decode. That means that decode_json(q<["…"]>)->[0] is a single character, 0x2026. So before we print it we’re supposed to encode it. Indeed, once we do that, the warning goes away:

> perl -MEncode::Simple -MJSON::PP -E'say encode_utf8( JSON::PP::decode_json(q<["…"]>)->[0])'
…
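
In script form, the same fix might look like this. It is only a sketch: it uses the core Encode module in place of Encode::Simple, and it spells out the input bytes explicitly so that it doesn’t depend on the source file’s own encoding.

```perl
use strict;
use warnings;
use Encode   ();
use JSON::PP ();

# JSON input as UTF-8 bytes: an array holding one string, U+2026.
my $json_bytes = qq<["\xe2\x80\xa6"]>;

# decode_json() UTF-8-decodes, so we get back a single character.
my $char = JSON::PP::decode_json($json_bytes)->[0];

# Encode that character to bytes before printing.
my $bytes = Encode::encode( 'UTF-8', $char,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );

binmode STDOUT;
print $bytes, "\n";
```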

So can I just ignore that warning?

Maybe. But don’t.

As we know, Perl can store strings as “byte strings”: simple sequences of code points 0-255. Perl can also, though, store strings in an “upgraded”, abstract Unicode encoding. Such an “upgraded” string falls into one of two categories:

1) “Bytes-compatible”: All code points fall in the 0-255 range. In other words, Perl could store this string “downgraded”, but for whatever reason isn’t.

2) “Bytes-incompatible”: One or more code points exceed 255.

When outputting upgraded strings, Perl follows these rules:

1) If the string is bytes-compatible: output the string’s “downgraded” form.

2) Otherwise: Output the code points encoded to UTF-8, and “complain”: if we’re syswrite()ing, Perl throws an exception, but if we’re say()ing or print()ing then Perl just warns.

Of course, lots of applications output UTF-8 anyway, in which case #2 above happens to be “the right thing”. But Perl would rather you be explicit: encode your strings before outputting them.
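
To watch both rules in action, we can print to an in-memory filehandle and collect any warnings. This is a sketch under the assumption that in-memory handles follow the same output rules as ordinary ones; the variable names are illustrative.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# An in-memory filehandle with no encoding layer.
open my $fh, '>', \my $buf or die "open: $!";

# Bytes-compatible but upgraded: a single code point, 0xe9.
my $compatible = "\xe9";
utf8::upgrade($compatible);
print {$fh} $compatible;

# Bytes-incompatible: a single code point above 255.
print {$fh} "\x{2026}";

close $fh or die "close: $!";

my $hex = join ' ', map { sprintf '%02x', ord } split //, $buf;
print "wrote: $hex\n";
print "warnings: ", scalar(@warnings), "\n";
```

The first print should contribute a single byte, while the second should contribute three UTF-8 bytes and trigger the warning.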

That Encoding Behind the Curtain …

Perl’s “internal Unicode encoding” is, in fact, just UTF-8. (Actually a “loose” variant thereof, but we digress.) It’s really better to forget this unless you’re maintaining Perl itself—even XS modules shouldn’t care!—but for the sake of a concrete understanding we’ll look at a few examples here.

Perl Internals: Wide Characters

Compare the following:

> perl -MDevel::Peek -MEncode::Simple -e'my $s = "…"; $s = decode_utf8($s); Dump $s'
SV = PV(0x7fc992804c70) at 0x7fc992816348
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7fc9927006d0 "\342\200\246"\0 [UTF8 "\x{2026}"]
  CUR = 3
  LEN = 10

… versus:

> perl -MDevel::Peek -e'my $s = "…"; Dump $s'
SV = PV(0x7f9e5e804c70) at 0x7f9e5e8162a0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x7f9e5e40bbe0 "\342\200\246"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 1

The important piece here is that [UTF8 "\x{2026}"] bit that we see only in the top example. This is the string’s content as Perl code sees it: a single character with code point 0x2026.

Perl Internals: UTF8-Invariant Characters

Now consider:

> perl -MEncode::Simple -MDevel::Peek -e'Dump( decode_utf8("abc") )'
SV = PV(0x7f81bc004d30) at 0x7f81bc0042a8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7f81bbf46770 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10

A special feature of UTF-8 is that, unlike other Unicode encodings (UTF-16 & al.), it encodes code points 0-127 identically to US-ASCII and ISO-8859-1. We call these code points “UTF8-invariant” because Perl stores them as the same bytes regardless of whether the string is upgraded or not.

Watch this, though:

> perl -MDevel::Peek -MEncode -e'my $s = "abc"; utf8::decode($s) or die "bad"; Dump $s'
SV = PV(0x7fa09a004c70) at 0x7fa09a016348
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7fa099e01540 "abc"\0
  CUR = 3
  LEN = 10

This is the same logic as we achieved with Encode::Simple, but with a twist: Perl did not upgrade the string! What gives??

It turns out that upgraded strings are slower than their downgraded forms: to do much of anything with upgraded strings you have to parse each Unicode character out of the buffer. For this reason, utf8::decode will (like its parallel C API function) leave a string downgraded when the input contains no multi-byte UTF-8 sequences, i.e., when every code point is UTF8-invariant. Encode::Simple, by contrast, always upgrades, even for bytes-compatible strings. (Unicode::UTF8 does the same.)

This is why we can’t just say “Perl stores text strings as UTF-8.” Some character decoders do work that way, but Perl’s own internal decoder doesn’t.
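
If you’d rather check this without reading Devel::Peek dumps, utf8::is_utf8() reports whether a string is currently upgraded. The sketch below is for peeking only (application logic shouldn’t branch on this), and its variable names are illustrative.

```perl
use strict;
use warnings;

# Pure ASCII input: decoding succeeds, and the string stays downgraded.
my $ascii = "abc";
utf8::decode($ascii) or die "bad UTF-8";

# The 3 UTF-8 bytes of U+2026: decoding yields one wide character,
# so Perl has to upgrade the string.
my $wide = "\xe2\x80\xa6";
utf8::decode($wide) or die "bad UTF-8";

printf "abc:    upgraded=%d length=%d\n",
    ( utf8::is_utf8($ascii) ? 1 : 0 ), length $ascii;
printf "U+2026: upgraded=%d length=%d\n",
    ( utf8::is_utf8($wide) ? 1 : 0 ), length $wide;
```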

Perl Internals: The Really Messy Part

We’ve looked at how Perl stores bytes-incompatible (>255) code points and UTF8-invariant ones (0-127). What about the 128-255 range?

Here’s where it gets dicey: these code points are bytes-compatible but not UTF8-invariant. Perl can thus store these either downgraded or upgraded, but this time it matters which they are.

Recall our example above where we looked at the Dump() of undecoded …. Compare that to:

> perl -MDevel::Peek -e'my $s = "…"; utf8::upgrade($s); Dump $s'
SV = PV(0x7feb80004c70) at 0x7feb800162a0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7feb7fc04930 "\303\242\302\200\302\246"\0 [UTF8 "\x{e2}\x{80}\x{a6}"]
  CUR = 6
  LEN = 10

utf8::upgrade() internally encodes the formerly-downgraded $s as UTF-8. As far as Perl code goes it’s the same string; only its internal representation changes. Since $s was already a UTF-8 sequence, what Perl stores in memory is double-encoded; however, to the Perl application it actually makes no difference because anything that accesses that string will see 3 characters (0xe2 0x80 0xa6), not Perl’s internally-double-encoded stuff. This includes outputting the string, e.g.:

> perl -E'my $s = "…"; say $s; utf8::upgrade($s); say $s'
…
…
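
The same point can be checked from Perl code: upgrading changes the storage, not the string. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $down = "\xe2\x80\xa6";    # 3 characters, stored downgraded

my $up = $down;
utf8::upgrade($up);           # same 3 characters, stored upgraded

# Perl code sees no difference between the two.
print "equal\n"       if $up eq $down;
print "same length\n" if length($up) == length($down);

# Only an internals peek tells them apart.
print "storage differs\n" if utf8::is_utf8($up) && !utf8::is_utf8($down);
```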

It’s sometimes surprising which interfaces return upgraded strings and which don’t. For example, JSON::PP’s encode() returns an upgraded string, even if we disable character encoding:

> perl -MDevel::Peek -MJSON::PP -E'Dump( JSON::PP->new()->utf8(0)->encode(["…"]) )'
SV = PV(0x7fd786004ff0) at 0x7fd78909e4f8
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x7fd78826a8a0 "[\"\303\242\302\200\302\246\"]"\0 [UTF8 "["\x{e2}\x{80}\x{a6}"]"]
  CUR = 10
  LEN = 13
  COW_REFCNT = 0

REMINDER: Nothing to See Here!

The above Devel::Peek examples are a purely-informational “peek behind the curtain” at Perl’s internals. Unless you’re altering Perl itself—again, even XS modules should ignore Perl internals—ignore Perl’s internal encoding.

Our Way Forward

Most modern programming languages use different types to represent “binary strings” and “character strings”. Perl, for better or for worse, does not; like the difference between a string and a number, we have to track that ourselves.

Here, then, are the best things we Perl programmers can do for ourselves and for each other to prevent character encoding problems:

Consider Perl to have one type of string: a character string. Perl wants you to ignore its internal encoding; don’t fight that. (Technically Perl could change its internal encoding scheme, and well-behaved modules, whether pure-Perl or XS, would keep working.)
Document whether your modules expect strings to be character-decoded or not. Do likewise for returned strings. (Maybe even provide functions for both, as Mojo::JSON does.)
Prefer Encode::Simple over alternatives like Encode, utf8, and Unicode::UTF8. Encode::Simple, by default, throws an exception when it encounters invalid data, which means you’ll catch errors up-front rather than deep in your code. The others all accept invalid input by default.
For XS authors: When working with PVs (strings), always differentiate between the two encodings. Macros like SvPVbyte, SvPVutf8, and their variants are your friends!
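
As one possible shape for the “document it, and maybe provide both” advice, here is a sketch of a module that offers a character-returning function alongside a byte-returning one. The package and function names are hypothetical, and the core Encode module stands in for Encode::Simple so the example runs without extra installs.

```perl
package My::Greeting;    # hypothetical module, for illustration only

use strict;
use warnings;
use Encode ();

# Returns a CHARACTER string (decoded).
# Callers must encode it before output.
sub greeting_chars {
    return "Hello\x{2026}";
}

# Returns a BYTE string (UTF-8-encoded), ready for output.
sub greeting_bytes {
    my $chars = greeting_chars();
    return Encode::encode( 'UTF-8', $chars,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );
}

package main;

binmode STDOUT;
print My::Greeting::greeting_bytes(), "\n";
```

Naming the two functions after what they return is one way to make up for Perl’s lack of separate string types: the caller can tell from the name alone whether an encode step is still owed.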