<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Felipe Gasper</title>
    <description>The latest articles on DEV Community by Felipe Gasper (@fgasper).</description>
    <link>https://dev.to/fgasper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F533009%2Fc093c4c4-2080-4c37-862f-6b8657f1d183.jpeg</url>
      <title>DEV Community: Felipe Gasper</title>
      <link>https://dev.to/fgasper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fgasper"/>
    <language>en</language>
    <item>
      <title>Prose for Coding Pros</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Thu, 29 Jul 2021 02:37:39 +0000</pubDate>
      <link>https://dev.to/fgasper/prose-for-coding-pros-1ck7</link>
      <guid>https://dev.to/fgasper/prose-for-coding-pros-1ck7</guid>
      <description>&lt;p&gt;The ability to express yourself verbally is a useful but oft-overlooked skill for software developers. Here I’m going to explore some guidelines I try to apply in my own writing.&lt;/p&gt;

&lt;p&gt;(In the following I’m using incomplete sentences per convention in software status messages.)&lt;/p&gt;

&lt;p&gt;DISCLAIMER: I am not a professional writer. YMMV.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Avoid “could not”.
&lt;/h2&gt;

&lt;p&gt;Consider this phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Could not open the file “foo”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that, at some point in the (assumedly recent) past, the “speaker”—assumedly some computer somewhere—lacked the ability to open &lt;code&gt;foo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But does that mean:&lt;/p&gt;

&lt;p&gt;1) The computer &lt;em&gt;observed&lt;/em&gt; its inability to open &lt;code&gt;foo&lt;/code&gt;, and didn’t bother trying?&lt;/p&gt;

&lt;p&gt;2) The computer tried, and failed, to open &lt;code&gt;foo&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;This is a significant difference, and an important one to disambiguate.&lt;/p&gt;

&lt;p&gt;#1 connotes a sense of ongoing inability: “I can’t do this thing, no matter how many times I try.” For example, I can’t jump over my house. No matter how many times I try, it’s just not going to happen. It was also true last night; thus, “last night I couldn’t jump over my house” is a true statement.&lt;/p&gt;

&lt;p&gt;#2, by contrast, reports on the result of one specific operation at one specific time. Let’s say I &lt;em&gt;did&lt;/em&gt; try last night, for some reason, to jump over my house. This is &lt;em&gt;probably&lt;/em&gt; what folks will infer if I say “last night I couldn’t jump over my house”, but logically it’s not the only meaning; thus, the statement is imprecise. We can do better.&lt;/p&gt;

&lt;p&gt;(Linguistically-minded folks may observe a certain parallel with the distinction between “imperfect” and “preterite” past-tense forms.)&lt;/p&gt;

&lt;p&gt;Consider expressions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cannot open the file “foo”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This connotes ongoing inability, which is the logical way to report cases where, e.g., filesystem permissions impede access to a given resource. (To go back to the house analogy, this would be “I can’t jump over my house.”)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Failed to open the file “foo”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This states unambiguously that “I tried, but failed.” It doesn’t attempt to communicate ongoing inability; it just gives the outcome of a specific attempt to do something. (Also: “Last night I failed to jump over my house.”)&lt;/p&gt;
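
&lt;p&gt;Here is a minimal Perl sketch of how code might choose between the two phrasings. The helper name &lt;code&gt;open_for_reading&lt;/code&gt; is hypothetical, made up for illustration: a check that never attempts the operation reports “cannot”, while an attempt that fails reports “failed to”.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub open_for_reading {
    my ($path) = @_;

    # Checked without trying: report ongoing inability.
    if ( -e $path and not -r $path ) {
        die qq{Cannot open the file "$path": it is not readable.\n};
    }

    # Tried and failed: report the outcome of this one attempt.
    open( my $fh, '&amp;lt;', $path )
        or die qq{Failed to open the file "$path": $!\n};

    return $fh;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;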

&lt;h2&gt;
  
  
  2. Prefer antonyms over negation.
&lt;/h2&gt;

&lt;p&gt;Consider this phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not Authorized&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We could interpret this at least two ways:&lt;/p&gt;

&lt;p&gt;1) I cannot access the resource.&lt;/p&gt;

&lt;p&gt;2) I might have access to the resource, but nothing has explicitly authorized me.&lt;/p&gt;

&lt;p&gt;While native English speakers may intuitively understand “Not Authorized” to imply meaning #1, a nonnative speaker may not. Logically, either meaning is an accurate interpretation of the phrase.&lt;/p&gt;

&lt;p&gt;We can be more precise thus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Access Forbidden&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes following this principle will lead you to some overly awkward wordings. For example, you &lt;em&gt;could&lt;/em&gt; replace …&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You do not have a file named “foo”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;… with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You lack a file named “foo”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;… but that sounds kind of goofy. Depending on context that may be OK, though!&lt;/p&gt;

&lt;p&gt;(cf. Strunk &amp;amp; White, &lt;em&gt;The Elements of Style&lt;/em&gt;, which says to “Put statements in positive form.”)&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prefer passive voice to active voice &lt;em&gt;when it makes more sense&lt;/em&gt;.
&lt;/h2&gt;

&lt;p&gt;We should prefer active voice to passive voice in most cases: it’s clearer to state that an actor “does” an action, rather than that the action “is done” by the actor.&lt;/p&gt;

&lt;p&gt;Oftentimes, though, I find myself describing scenarios where the “doer” is either unknown or irrelevant. In those cases, it’s often &lt;em&gt;more logical to use the passive voice&lt;/em&gt; than to contort your phrasing to allow active voice.&lt;/p&gt;

&lt;p&gt;Consider the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the file has been opened in the last hour, delete the file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That “has been opened” is classic passive voice: a form of “to be” plus a past participle. We might try to convert it to active voice thus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If anyone has opened the file in the last hour, delete the file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This, though, subtly changes the meaning: now our condition—i.e., the “if block” of our phrase—stipulates a “doer”: “anyone”. It’s a pretty generic expression, but might it lead someone to think that it refers to a &lt;em&gt;person&lt;/em&gt;? What if what opens the file is a script?&lt;/p&gt;

&lt;p&gt;OK, so we rephrase it again:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If anyone or anything has opened the file in the last hour, delete the file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This works logically, but to my sense it’s getting a bit awkward. The original passive form of the expression is simpler and clearer.&lt;/p&gt;

&lt;p&gt;(cf. Yellowlees Douglas, &lt;em&gt;The Reader’s Brain&lt;/em&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Rather than “of”, consider a possessive.
&lt;/h2&gt;

&lt;p&gt;In German, Igor Stravinsky’s “Rite of Spring” is “Frühlingsopfer”: literally “Spring’s Rite”. We wouldn’t say that in English because it just sounds awkward, but in other contexts it can yield a slight gain in concision. Consider these:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the file’s change time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;… versus …&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the change time of the file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first one is just a &lt;em&gt;bit&lt;/em&gt; more concise.&lt;/p&gt;

&lt;p&gt;Another advantage of the possessive form is its “object-oriented” syntax: the word order corresponds to how that might look in code: &lt;code&gt;file.ctime&lt;/code&gt;. If your target audience is developers, that might make the phrase slightly easier to understand, though it’s subtle enough that they may not notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Describe causal relationships explicitly.
&lt;/h2&gt;

&lt;p&gt;Compare the following sentences:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot save “foo”. You have exceeded your quota.&lt;/p&gt;

&lt;p&gt;You cannot save “foo” because you have exceeded your quota.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most folks will probably interpret the above phrases the same way, but note that only the second form &lt;em&gt;unambiguously&lt;/em&gt; gives the relationship between the two ideas (i.e., inability to save “foo” and quota excess). Being explicit reduces the likelihood of misunderstanding.&lt;/p&gt;

&lt;p&gt;Note, however, that the second form includes one larger sentence in lieu of two smaller ones. All else being equal, longer sentences impede comprehension; thus, one might fear that the loss in comprehensibility outweighs the gain in clarity that “because” yields.&lt;/p&gt;

&lt;p&gt;A third option may yield a “happy medium”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot save “foo”. This is because you have exceeded your quota.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now there’s more verbiage overall, but we’re back to two small sentences rather than one large one, and we still describe the causal relationship explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Write “properly”.
&lt;/h2&gt;

&lt;p&gt;Like it or not, some people &lt;em&gt;will&lt;/em&gt; judge you—maybe even subconsciously—for neglecting widely-accepted grammar conventions. (But hey, &lt;a href="https://www.youtube.com/watch?v=TnlPtaPxXfc"&gt;that’s life!&lt;/a&gt;) Overall, it’s best to avoid that. Learn and apply the difference between “its” and “it’s”, “who” and “whom”, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Beware “the”.
&lt;/h2&gt;

&lt;p&gt;“The” is essentially a “global”: it lacks context, which means whatever it references applies universally, unless something else around it restricts its scope. Consider:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The submit button doesn’t work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are lots of submit buttons in the world … which one do we mean? Assumedly there’d be context around this phrase that restricts it, but as with “could not” above, it’s better to minimize reliance on context. Consider instead, then:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our login page’s submit button doesn’t work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  8. Avoid “error” as a verb.
&lt;/h2&gt;

&lt;p&gt;“Error”—in American English, anyhow—isn’t a verb. Say “fail” instead—or, if you must, “err”! :)&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Avoid “not $this because $that”.
&lt;/h2&gt;

&lt;p&gt;Consider this phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’m not going hiking because of the weather.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All this means is that the phrase “I am hiking because of the weather” is false. Thus, either of these could be what the phrase means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s too cold for me to go hiking.&lt;/li&gt;
&lt;li&gt;Weather isn’t why I’m hiking; I’m hiking to find buried treasure!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid this ambiguity, do one of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Turn the phrase around: “Because of the weather, I’m not going hiking.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Express the consequence positively, e.g., “I’m skipping the hiking trip because of the weather.”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  What difference does all this really make?
&lt;/h1&gt;

&lt;p&gt;Remember the last time you debugged something that just &lt;em&gt;did not&lt;/em&gt; make sense? How many times did you stare at whatever error messages you had, comb through logs looking for clues, or the like? At some point you likely realized that you had misunderstood something—or, at the very least, you spent time &lt;em&gt;wondering&lt;/em&gt; if you had.&lt;/p&gt;

&lt;p&gt;This is the value of unambiguous messaging. If I see “failed to open the file”, I know an attempt was made to do so; if I just see “could not open the file”, though, I may spend time wondering whether that means an attempt was made, or merely that a &lt;code&gt;stat()&lt;/code&gt; revealed improper permissions or ownership.&lt;/p&gt;

&lt;p&gt;Unambiguous communication saves time and frustration for you and your colleagues, present and future. What’s not to love about that?&lt;/p&gt;

</description>
      <category>writing</category>
    </item>
    <item>
      <title>SQLite, Perl, and a Boolean</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Sun, 30 May 2021 16:27:02 +0000</pubDate>
      <link>https://dev.to/fgasper/sqlite-perl-and-a-boolean-30d8</link>
      <guid>https://dev.to/fgasper/sqlite-perl-and-a-boolean-30d8</guid>
      <description>&lt;p&gt;I’ve &lt;a href="https://dev.to/fgasper/use-sys-binmode-4e6o"&gt;written&lt;/a&gt; &lt;a href="https://dev.to/fgasper/perl-s-svpv-menace-5515"&gt;several&lt;/a&gt; &lt;a href="https://dev.to/fgasper/perl-unicode-and-bytes-5cg7"&gt;articles&lt;/a&gt; &lt;a href="https://www.perl.com/article/json-unicode-and-perl-oh-my-/"&gt;now&lt;/a&gt; about the trials and tribulations of character encoding in Perl. Having gained the knowledge I have, I’ve also been finding bugs in libraries we use at $work and sending patches to their maintainers.&lt;/p&gt;

&lt;p&gt;The latest one is &lt;a href="https://metacpan.org/pod/DBD::SQLite"&gt;DBD::SQLite&lt;/a&gt;, CPAN’s self-contained SQLite binding. It’s a great library that I’ve used for years, but I recently noted two problems in it:&lt;/p&gt;

&lt;p&gt;1) In its default configuration it used the &lt;code&gt;SvPV&lt;/code&gt; macro to translate Perl strings to C strings, which is bad for reasons I detailed in “&lt;a href="https://dev.to/fgasper/perl-s-svpv-menace-5515"&gt;Perl’s SvPV Menace&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;2) In its (non-default) “unicode” configuration it used a “naïve” method of UTF-8 decoding that neglects validation. This mechanism can corrupt Perl’s internals by making it mistake invalid UTF-8 sequences for valid ones.&lt;/p&gt;

&lt;p&gt;Neither of these is trivial to fix: applications may depend on the &lt;code&gt;SvPV&lt;/code&gt; problem—what one coworker of mine calls a “load-bearing bug” 😀—while adding UTF-8 validation entails a performance hit.&lt;/p&gt;

&lt;p&gt;In reality, DBD::SQLite needed at least 4 modes of translating between Perl and C strings:&lt;/p&gt;

&lt;p&gt;1) The current (“load-bearing-buggy”) default.&lt;br&gt;
2) Same as #1, but use &lt;code&gt;SvPVbyte&lt;/code&gt; to avoid the &lt;code&gt;SvPV&lt;/code&gt; bug.&lt;br&gt;
3) Current “naïve unicode” behaviour.&lt;br&gt;
4) A “non-naïve unicode” mode that validates incoming UTF-8.&lt;/p&gt;

&lt;p&gt;(I eventually made two variants of this last one: one that just warns on invalid data, and the other that throws an exception.)&lt;/p&gt;

&lt;p&gt;There was another problem, though: DBD::SQLite’s interface for controlling this was a boolean. That meant only two modes were even &lt;em&gt;possible&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;This exemplifies a principle a mentor of mine taught me years back: &lt;strong&gt;avoid boolean parameters.&lt;/strong&gt; They restrict your ability to add further configurations later.&lt;/p&gt;

&lt;p&gt;(And for pity’s sake, abhor &lt;em&gt;unnamed&lt;/em&gt; booleans in particular! What does the 0 in &lt;code&gt;open_file($path, 0)&lt;/code&gt; mean??)&lt;/p&gt;

&lt;p&gt;To fix this, &lt;a href="https://github.com/DBD-SQLite/DBD-SQLite/pull/80"&gt;my pull request&lt;/a&gt; had to deprecate the existing &lt;code&gt;sqlite_unicode&lt;/code&gt; parameter. It’s an unfortunate step that’ll produce new warnings in existing applications, but the “omelet” here justifies the “broken egg”.&lt;/p&gt;
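
&lt;p&gt;As a rough sketch of the “avoid boolean parameters” principle: every name below is hypothetical and is not DBD::SQLite’s actual interface. A named option with enumerated values leaves room to add another mode later, which a boolean can’t do.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

# Hypothetical mode names, loosely mirroring the four modes listed above.
my %STRING_MODES = map { $_ =&amp;gt; 1 } qw(
    legacy  bytes  unicode_naive  unicode_strict
);

# Hypothetical constructor: a named, enumerated option instead of a boolean.
sub new_connection {
    my (%opts) = @_;

    my $mode = $opts{string_mode} // 'legacy';
    die "Unknown string_mode: $mode\n" if !$STRING_MODES{$mode};

    return { string_mode =&amp;gt; $mode };
}

# The call site now says what it means, unlike open_file($path, 0).
my $conn = new_connection( string_mode =&amp;gt; 'unicode_strict' );
print "$conn-&amp;gt;{string_mode}\n";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;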

</description>
      <category>programming</category>
      <category>perl</category>
      <category>sql</category>
    </item>
    <item>
      <title>Perl’s SvPV Menace</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Thu, 27 May 2021 02:42:30 +0000</pubDate>
      <link>https://dev.to/fgasper/perl-s-svpv-menace-5515</link>
      <guid>https://dev.to/fgasper/perl-s-svpv-menace-5515</guid>
      <description>&lt;p&gt;If you’ve read my &lt;a href="https://dev.to/fgasper/perl-unicode-and-bytes-5cg7"&gt;Perl, Unicode, and Bytes&lt;/a&gt; or &lt;a href="https://dev.to/fgasper/use-sys-binmode-4e6o"&gt;Sys::Binmode&lt;/a&gt; posts, you know about the complexities of character encoding in Perl. A bit after I wrote that first post I had a little epiphany I thought worth sharing.&lt;/p&gt;

&lt;p&gt;One day I noticed that &lt;a href="https://metacpan.org/pod/URI::XSEscape"&gt;URI::XSEscape&lt;/a&gt; was mangling its output: I’d pass in &lt;code&gt;épée&lt;/code&gt; and get out &lt;code&gt;%C3%83%C2%A9p%C3%83%C2%A9e&lt;/code&gt;. I recognized this as an extra UTF-8 encode: rather than URI-encoding my 6 bytes of &lt;code&gt;épée&lt;/code&gt;, it was UTF-8 encoding—so now 10 bytes—then URI-encoding that.&lt;/p&gt;

&lt;p&gt;I pulled out &lt;a href="https://metacpan.org/pod/Devel::Peek"&gt;Devel::Peek&lt;/a&gt; and saw that something prior to the URI-encoding step had “upgraded” my string’s internal storage: Perl itself stored my string as 10 bytes, even though the Perl scalar still consisted of 6 &lt;em&gt;characters&lt;/em&gt;. Ordinarily this is nothing of importance since Perl code doesn’t need to care how Perl itself stores its strings.&lt;/p&gt;

&lt;p&gt;… until it &lt;em&gt;does&lt;/em&gt; need to care, that is.&lt;/p&gt;
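
&lt;p&gt;The mangling can be reproduced in pure Perl. This is only a sketch: &lt;code&gt;escape_bytes&lt;/code&gt; is a hypothetical stand-in for the URI-encoding step, and &lt;code&gt;utf8::encode&lt;/code&gt; stands in for reading an upgraded string’s internal storage directly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

# The 6 UTF-8 bytes of "épée", written as escapes to sidestep source encoding.
my $bytes = "\xc3\xa9p\xc3\xa9e";

# Hypothetical stand-in for the URI-encoding step: percent-encode
# everything outside URI "unreserved" characters.
sub escape_bytes {
    my ($buf) = @_;
    $buf =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $buf;
}

# Simulate what upgraded internal storage looks like: each of the
# 6 code points is stored as its own UTF-8 encoding.
my $internal = $bytes;
utf8::encode($internal);

print length($internal), "\n";          # 10
print escape_bytes($bytes), "\n";       # %C3%A9p%C3%A9e
print escape_bytes($internal), "\n";    # %C3%83%C2%A9p%C3%83%C2%A9e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;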

&lt;h1&gt;
  
  
  What is SvPV?
&lt;/h1&gt;

&lt;p&gt;Perl’s C API (the set of macros and functions available to work with Perl from C) is a classic C API: lots of different ways to do &lt;em&gt;almost&lt;/em&gt; the same thing. To translate a Perl scalar to a C signed integer, for example, you can use &lt;code&gt;SvIV&lt;/code&gt;, &lt;code&gt;SvIV_nomg&lt;/code&gt;, &lt;code&gt;SvIVX&lt;/code&gt;, or &lt;code&gt;SvIVx&lt;/code&gt;. (&lt;code&gt;IV&lt;/code&gt; here signifies an “integer value”.) A similar set of macros exists for unsigned integers (&lt;code&gt;UV&lt;/code&gt;s).&lt;/p&gt;

&lt;p&gt;Converting a Perl scalar to a C string is similar. There are many tools available, but the 3 “fundamental” ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SvPVbyte&lt;/code&gt;: Takes the code points of your Perl string and gives back a C buffer whose bytes match those code points. Thus, any code point that exceeds 255 doesn’t work, and an exception is thrown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SvPVutf8&lt;/code&gt;: Like &lt;code&gt;SvPVbyte&lt;/code&gt; but gives the &lt;em&gt;UTF-8-encoded&lt;/em&gt; bytes for your Perl string’s code points. This works for any code point that Perl can store, but for code points 128-255 it’ll give different results from &lt;code&gt;SvPVbyte&lt;/code&gt;. (cf. &lt;a href="https://perldoc.perl.org/perlunicode#The-%22Unicode-Bug%22"&gt;perldoc perlunicode&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SvPV&lt;/code&gt;: Gives you the Perl string’s internal buffer, aka its “PV” (“pointer value”). It could be bytes, or it could be UTF-8. It’s like a C analogue to Perl’s &lt;code&gt;use bytes&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SvPV&lt;/code&gt;, of course, is what URI::XSEscape was using.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;SvPV&lt;/code&gt; to be meaningful it has to be used in tandem with &lt;code&gt;SvUTF8&lt;/code&gt;, a macro that tells you which form the PV is: bytes, or UTF-8. So if &lt;code&gt;SvUTF8&lt;/code&gt; is true, then &lt;code&gt;SvPV&lt;/code&gt;’s output is UTF-8; otherwise &lt;code&gt;SvPV&lt;/code&gt;’s output is bytes. But URI::XSEscape wasn’t checking &lt;code&gt;SvUTF8&lt;/code&gt;; it was just URI-encoding &lt;code&gt;SvPV&lt;/code&gt; directly.&lt;/p&gt;

&lt;p&gt;The big problem with &lt;code&gt;SvPV&lt;/code&gt; is that the number of contexts &lt;em&gt;other&lt;/em&gt; than Perl where it’s sensible to have a C string that could be bytes &lt;em&gt;or&lt;/em&gt; UTF-8 is … small. Nevertheless, uses of this macro (and its variants) to interact with contexts outside Perl are all over CPAN.&lt;/p&gt;

&lt;p&gt;URI::XSEscape, like &lt;a href="https://metacpan.org/pod/URI::Escape"&gt;its pure-Perl counterpart&lt;/a&gt;, presents interfaces appropriate for both “byte-oriented” and “character-oriented” Perl code (cf. &lt;a href="https://dev.to/fgasper/perl-unicode-and-bytes-5cg7"&gt;Perl, Unicode, and Bytes&lt;/a&gt;). Since the byte-oriented interface is what I was using, switching URI::XSEscape from &lt;code&gt;SvPV&lt;/code&gt; to &lt;code&gt;SvPVbyte&lt;/code&gt; was the simple fix to this problem.&lt;/p&gt;

&lt;p&gt;In essence, C code like URI::XSEscape should approach Perl strings the same way that pure-Perl code does, without caring about Perl’s internal string storage. &lt;em&gt;Most&lt;/em&gt; C code should thus avoid &lt;code&gt;SvPV&lt;/code&gt; for the same reason that most Perl should not &lt;code&gt;use bytes&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The plot thickens …
&lt;/h1&gt;

&lt;p&gt;A quick scan through some popular XS modules showed more occurrences of this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/DBD::SQLite"&gt;DBD::SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/Net::Curl"&gt;Net::Curl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://metacpan.org/pod/DNS::Unbound"&gt;DNS::Unbound&lt;/a&gt; (mea culpa!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/DNS::LDNS"&gt;DNS::LDNS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/YAML::Syck"&gt;YAML::Syck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/HTTP::Parser::XS"&gt;HTTP::Parser::XS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://metacpan.org/pod/Socket"&gt;Socket&lt;/a&gt; (a core module!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These offer a non-default mode that auto-encodes to UTF-8, but their default setup has the same bug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/CDB_File"&gt;CDB_File&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metacpan.org/pod/LMDB_File"&gt;LMDB_File&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are likely many more; those are just ones I’ve found.&lt;/p&gt;

&lt;h1&gt;
  
  
  How did this come to be?
&lt;/h1&gt;

&lt;p&gt;I suspect it’s that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SvPV&lt;/code&gt; is the shortest of the above-named methods for converting a Perl scalar to a C string. Thus, it’s easier to type and looks less “intimidating”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Historically, Perl’s documentation favoured &lt;code&gt;SvPV&lt;/code&gt; in its examples of scalar-to-string conversion; the other two were seldom discussed. &lt;a href="https://github.com/Perl/perl5/commit/3c3f883d1ac1fc6048277d2d60015c66c211ac9b"&gt;I fixed this recently&lt;/a&gt;, but it’ll be years before everyone’s local &lt;code&gt;perldoc&lt;/code&gt; reflects that change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perl’s default &lt;a href="https://perldoc.perl.org/perlxstypemap"&gt;XS typemap&lt;/a&gt; uses &lt;code&gt;SvPV&lt;/code&gt; (without consulting &lt;code&gt;SvUTF8&lt;/code&gt;) to convert a scalar to a string. Thus, the following XSUB, called as &lt;code&gt;printstr($mystr)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void
printstr (const char *str)
  CODE:
    fprintf(stdout, "%s", str);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… prints Perl’s internals, which a Perl caller isn’t supposed to care about. Ideally language defaults like this would be the “safe” ways to do things, but this particular one is nonsensical.&lt;/p&gt;

&lt;h1&gt;
  
  
  Does this problem affect your code?
&lt;/h1&gt;

&lt;p&gt;A simple way to test for this problem is to &lt;code&gt;utf8::upgrade&lt;/code&gt; your strings before you give them to the tested code—ensuring, of course, that you’re testing with some code points in the 128-255 range. Your test should verify that your program’s behaviour is the same with &lt;code&gt;utf8::upgrade&lt;/code&gt;d strings as with non-upgraded strings.&lt;/p&gt;

&lt;p&gt;You wouldn’t normally upgrade strings manually in production (since it makes your Perl code think about Perl’s internals, which it shouldn’t do), but for testing it’s fine and useful.&lt;/p&gt;

&lt;p&gt;For example, I found the URI::XSEscape problem by doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $foo = "épée";
utf8::upgrade($foo);
print URI::XSEscape::uri_escape($foo);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
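
&lt;p&gt;The same check can be written as a regression test. This is a sketch using Test::More; &lt;code&gt;function_under_test&lt;/code&gt; is a hypothetical pure-Perl stand-in, to be swapped for whatever XS function is being audited.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;
use Test::More tests =&amp;gt; 1;

# Hypothetical stand-in for the code under test: hex-dump each code point.
sub function_under_test {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

my $plain = "\xe9p\xe9e";      # includes code points in the 128-255 range

my $upgraded = $plain;
utf8::upgrade($upgraded);      # same characters, different internal storage

is(
    function_under_test($upgraded),
    function_under_test($plain),
    'upgraded and non-upgraded strings behave the same',
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;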



&lt;h1&gt;
  
  
  Not just any old bug …
&lt;/h1&gt;

&lt;p&gt;The worst part of all this is that modules like CDB_File &lt;em&gt;can’t&lt;/em&gt; replace &lt;code&gt;SvPV&lt;/code&gt; without breaking existing applications that may depend on that &lt;code&gt;use bytes&lt;/code&gt;-ish behaviour. So there’s not much to do except build new, corrected interfaces, deprecating the old ones … which of course will &lt;em&gt;eventually&lt;/em&gt; necessitate changes to existing code. For Perl “gurus” that may be simple, but for everyone else changing existing code could be expensive, painful, and even harmful to Perl’s reputation as a language that prizes backward compatibility.&lt;/p&gt;

&lt;h1&gt;
  
  
  But that’s not all …
&lt;/h1&gt;

&lt;p&gt;XS code isn’t the only place where this bug appears; &lt;em&gt;Perl itself&lt;/em&gt; has it, too! Read all about it at “&lt;a href="https://dev.to/fgasper/use-sys-binmode-4e6o"&gt;use Sys::Binmode;&lt;/a&gt;”.&lt;/p&gt;

&lt;h1&gt;
  
  
  How can we fix this?
&lt;/h1&gt;

&lt;p&gt;I think &lt;em&gt;most&lt;/em&gt; code that uses &lt;code&gt;SvPV&lt;/code&gt; to convert a Perl string to a C string intends for Perl code points to correspond to bytes in the C string; thus, such code should actually use &lt;code&gt;SvPVbyte&lt;/code&gt; or one of its variants. (UTF-8-aware C code, of course, would use &lt;code&gt;SvPVutf8&lt;/code&gt;.) Toward that end, we &lt;strong&gt;MUST&lt;/strong&gt; discourage further use of &lt;code&gt;SvPV&lt;/code&gt;. I propose to the Perl community, then, a few changes, some that don’t break anything and others that probably will:&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing this: The easy parts!
&lt;/h2&gt;

&lt;p&gt;1) Rename &lt;code&gt;SvPV&lt;/code&gt; and friends. We can’t &lt;em&gt;remove&lt;/em&gt; them, but we can create longer, “scarier-looking” aliases for them and use those names in the documentation. I propose &lt;code&gt;SvPVinternal&lt;/code&gt;, &lt;code&gt;SvPVinternal_const&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;2) Make &lt;code&gt;xsubpp&lt;/code&gt; warn when it sees &lt;code&gt;SvPV&lt;/code&gt; or its variants in a typemap.&lt;/p&gt;

&lt;p&gt;3) Use &lt;a href="https://dev.to/fgasper/use-sys-binmode-4e6o"&gt;Sys::Binmode&lt;/a&gt; in all new code to fix Perl’s own buggy behaviour.&lt;/p&gt;

&lt;p&gt;4) Submit bug reports! Audit the XS modules that you use, and if you find different behaviour between upgraded and downgraded strings, let the maintainers know—ideally by sending them patches!&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing this: The hard part …
&lt;/h2&gt;

&lt;p&gt;You can’t make an omelet without breaking some eggs, and you often can’t fix things like this without breaking &lt;em&gt;some&lt;/em&gt; current applications. Nevertheless …&lt;/p&gt;

&lt;p&gt;5) Make &lt;code&gt;char *&lt;/code&gt; and &lt;code&gt;const char *&lt;/code&gt; in Perl’s default typemap use &lt;code&gt;SvPVbyte&lt;/code&gt;. (Actually &lt;code&gt;SvPVbyte_nolen&lt;/code&gt;, but hey.) For the vast majority of XS modules this probably would be just a bug fix, though for apps that depend on a &lt;code&gt;use bytes&lt;/code&gt;-ish status quo there would be breakage. Thankfully, though: a) the most widely-used XS modules (e.g., &lt;a href="https://metacpan.org/pod/MIME::Base64"&gt;MIME::Base64&lt;/a&gt;, &lt;a href="https://metacpan.org/pod/JSON::XS"&gt;JSON::XS&lt;/a&gt;) where this &lt;em&gt;could&lt;/em&gt; be a problem don’t appear to be vulnerable, and b) any breakage would be easy to fix: module authors merely have to adopt &lt;code&gt;SvPVutf8&lt;/code&gt; if that’s what they want, optionally creating separate functions if support for both is desired.&lt;/p&gt;

&lt;p&gt;6) Make &lt;a href="https://metacpan.org/pod/Sys::Binmode"&gt;Sys::Binmode&lt;/a&gt;’s behaviour Perl’s own behaviour. This is more contentious because it sidesteps the much larger problem of Perl’s lacklustre support for Windows filesystems; still, Sys::Binmode-type behaviour is no &lt;em&gt;worse&lt;/em&gt; than Perl’s status quo, and it fixes a significant leak in Perl’s string abstraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing this: The moon-shot …
&lt;/h2&gt;

&lt;p&gt;7) Perl needs to differentiate byte sequences from text strings. This would fix a plethora of “shin-bumpers” that afflict users of the language. This is a fairly difficult problem to solve, but I don’t think it’s insurmountable.&lt;/p&gt;

&lt;h1&gt;
  
  
  In the meantime …
&lt;/h1&gt;

&lt;p&gt;Absent fixes like the above, we just have to avoid this issue. You’ll always have consistent behaviour if you send &lt;em&gt;encoded&lt;/em&gt; strings to the operating system and &lt;em&gt;downgrade&lt;/em&gt; them prior to output; this way Perl doesn’t store any strings as UTF-8, so &lt;code&gt;SvPV&lt;/code&gt; and &lt;code&gt;SvPVbyte&lt;/code&gt; give the same result.&lt;/p&gt;

&lt;p&gt;IMPORTANT: If you don’t &lt;em&gt;decode&lt;/em&gt; your strings, then by definition they’re already encoded, so in this case &lt;em&gt;don’t&lt;/em&gt; encode them manually, or you’ll mangle your output.&lt;/p&gt;
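
&lt;p&gt;A small sketch of that workaround, using only Perl’s built-in &lt;code&gt;utf8::&lt;/code&gt; functions. The variable names and sample strings are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

# A decoded (character) string: 4 code points, two of them above 127.
my $text = "\x{e9}p\x{e9}e";

# Encode before handing it to the OS; the result is stored as plain bytes.
my $encoded = $text;
utf8::encode($encoded);
print length($encoded), "\n";    # 6

# A byte string that something upgraded along the way:
my $binary = "\xff\xfe";
utf8::upgrade($binary);

# Downgrade before output so the internal buffer matches the bytes.
utf8::downgrade($binary);
my $state = utf8::is_utf8($binary) ? 'upgraded' : 'downgraded';
print "$state\n";                # downgraded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;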

</description>
      <category>perl</category>
    </item>
    <item>
      <title>Perling and Curling</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Fri, 07 May 2021 04:36:06 +0000</pubDate>
      <link>https://dev.to/fgasper/perling-and-curling-2i10</link>
      <guid>https://dev.to/fgasper/perling-and-curling-2i10</guid>
      <description>&lt;p&gt;Most of us probably know &lt;a href="https://curl.se/"&gt;curl&lt;/a&gt; as a quick and easy way to send HTTP requests from the command line.&lt;/p&gt;

&lt;p&gt;That tool, though, is just an interface to the curl project’s real gold: the libcurl API. Using this API, applications in all sorts of languages have easy access to the awesome power that libcurl provides. This article will discuss how to use that power in Perl.&lt;/p&gt;

&lt;h1&gt;
  
  
  A Quick Example
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use Net::Curl::Easier;

my $easy = Net::Curl::Easier-&amp;gt;new(
    url =&amp;gt; 'http://perl.org',
    followlocation =&amp;gt; 1,
)-&amp;gt;perform();

print $easy-&amp;gt;head(), $easy-&amp;gt;body();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s talk about what just happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metacpan.org/pod/Net::Curl::Easier"&gt;Net::Curl::Easier&lt;/a&gt; is a thin wrapper around &lt;a href="https://metacpan.org/pod/Net::Curl"&gt;Net::Curl&lt;/a&gt;’s “easy” interface—&lt;a href="https://curl.se/libcurl/c/libcurl-easy.html"&gt;“easy” is what libcurl calls it!&lt;/a&gt;—that smooths over some rough edges in Net::Curl.&lt;/p&gt;

&lt;p&gt;(Full disclosure: I am Net::Curl::Easier’s maintainer.)&lt;/p&gt;

&lt;p&gt;Once we create our “Easier” object, having given it the proper URL and told it to follow HTTP redirects (&lt;code&gt;followlocation&lt;/code&gt; refers to &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Location"&gt;HTTP’s &lt;code&gt;Location&lt;/code&gt; header&lt;/a&gt;), we run &lt;code&gt;perform()&lt;/code&gt; on the Easier object.&lt;/p&gt;

&lt;p&gt;After that, we print the HTTP response headers and body, and we’re done!&lt;/p&gt;

&lt;h1&gt;
  
  
  Why not just use &lt;a href="https://metacpan.org/pod/HTTP::Tiny"&gt;HTTP::Tiny&lt;/a&gt;?
&lt;/h1&gt;

&lt;p&gt;Indeed. Well, error reporting, for one. Consider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Net::Curl::Easier-&amp;gt;new(
    url =&amp;gt; 'http://blahblah',
)-&amp;gt;perform();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run this you’ll probably just see &lt;code&gt;Couldn't resolve host name&lt;/code&gt; printed to standard error. But if you dig deeper you’ll see something nifty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use Net::Curl::Easier;
use Data::Dumper;

eval {
    Net::Curl::Easier-&amp;gt;new(
        url =&amp;gt; 'http://blahblah',
    )-&amp;gt;perform();
};
print Dumper $@;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turns out that that error isn’t just a string; it’s an exception &lt;em&gt;object&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In large systems I often want to handle certain failure types differently from others. HTTP::Tiny’s errors are just strings, so type-specific failure handling with HTTP::Tiny entails &lt;em&gt;parsing strings&lt;/em&gt;, which is brittle. What if someone decides to reword some error message for clarity, thus breaking my string parser?&lt;/p&gt;

&lt;p&gt;With Net::Curl I can look for specific numeric error codes, documentation for which &lt;a href="https://curl.se/libcurl/c/libcurl-errors.html"&gt;the curl project itself maintains&lt;/a&gt;. This is much more robust.&lt;/p&gt;
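
&lt;p&gt;A minimal sketch of what that can look like, assuming (per my reading of Net::Curl’s documentation) that the exception is a &lt;code&gt;Net::Curl::Easy::Code&lt;/code&gt; object that compares numerically against libcurl’s error constants. The warning text and the choice to rethrow everything else are just illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;
use Net::Curl::Easy ();
use Net::Curl::Easier;

my $ok = eval {
    Net::Curl::Easier-&amp;gt;new( url =&amp;gt; 'http://blahblah' )-&amp;gt;perform();
    1;
};

if ( !$ok ) {
    my $err = $@;

    # Compare numerically against libcurl's documented error code
    # instead of parsing the message text.
    if ( ref($err) eq 'Net::Curl::Easy::Code'
        and $err == Net::Curl::Easy::CURLE_COULDNT_RESOLVE_HOST() )
    {
        warn "DNS lookup failed: $err\n";
    }
    else {
        die $err;    # anything else is unexpected in this sketch
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If someone rewords libcurl’s message, the numeric comparison keeps working.&lt;/p&gt;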

&lt;h1&gt;
  
  
  Don’t care. What else you got?
&lt;/h1&gt;

&lt;p&gt;OK. How about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $easy = Net::Curl::Easier-&amp;gt;new(
    username =&amp;gt; 'hal',
    password =&amp;gt; 'itsasecret',
    url =&amp;gt; 'imap://mail.example.com/INBOX/;UID=123',
)-&amp;gt;perform();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I just queried … an email inbox?!?&lt;/p&gt;

&lt;p&gt;Curl doesn’t just speak HTTP; it speaks many other protocols including IMAP, LDAP, SCP, and MQTT. To see the full list of protocols that your curl supports, run &lt;code&gt;curl --version&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Concurrency
&lt;/h1&gt;

&lt;p&gt;Curl can also run concurrent queries. To do that I recommend using &lt;a href="https://metacpan.org/pod/Net::Curl::Promiser"&gt;Net::Curl::Promiser&lt;/a&gt;. (Full disclosure: I also maintain this module.)&lt;/p&gt;

&lt;p&gt;Example, assuming use of &lt;a href="http://metacpan.org/pod/Mojolicious"&gt;Mojolicious&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use Net::Curl::Easier;
use Net::Curl::Promiser::Mojo;
use Mojo::Promise;

my $easy1 = Net::Curl::Easier-&amp;gt;new(
    url =&amp;gt; 'http://perl.org',
    followlocation =&amp;gt; 1,
);

my $easy2 = Net::Curl::Easier-&amp;gt;new(
    username =&amp;gt; 'hal',
    password =&amp;gt; 'itsasecret',
    url =&amp;gt; 'imap://mail.example.com/INBOX/;UID=123',
);

my $easy3 = Net::Curl::Easier-&amp;gt;new(
    username =&amp;gt; 'hal',
    password =&amp;gt; 'itsasecret',
    url =&amp;gt; 'scp://tty.example.com/path/to/file',
);

my $promiser = Net::Curl::Promiser::Mojo-&amp;gt;new();

Mojo::Promise-&amp;gt;all_settled(
    $promiser-&amp;gt;add_handle($easy1)-&amp;gt;then( sub {
        print $easy1-&amp;gt;head(), $easy1-&amp;gt;body();
    } ),
    $promiser-&amp;gt;add_handle($easy2)-&amp;gt;then( sub {
        # ... whatever you want with the IMAP result
    } ),
    $promiser-&amp;gt;add_handle($easy3)-&amp;gt;then( sub {
        # ... whatever you want with the SCP result
    } ),
)-&amp;gt;wait();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We just grabbed a web page, queried a mailbox, and downloaded a file via SCP, all in parallel!&lt;/p&gt;

&lt;p&gt;Note, too, that this method interfaces seamlessly with other promises. So if you have existing &lt;a href="http://metacpan.org/pod/Mojo::UserAgent"&gt;Mojo::UserAgent&lt;/a&gt;-based code, you can add requests for other protocols alongside it.&lt;/p&gt;

&lt;p&gt;Net::Curl::Promiser also works natively with &lt;a href="http://metacpan.org/pod/AnyEvent"&gt;AnyEvent&lt;/a&gt; and &lt;a href="http://metacpan.org/pod/IO::Async"&gt;IO::Async&lt;/a&gt;, should those be of greater interest to you. It also provides a convenience layer for custom &lt;a href="https://perldoc.perl.org/perlfunc#select-RBITS%2CWBITS%2CEBITS%2CTIMEOUT"&gt;select&lt;/a&gt;-based event loops, in case that’s how you roll.&lt;/p&gt;

&lt;h1&gt;
  
  
  Other Modules
&lt;/h1&gt;

&lt;p&gt;Some alternatives to the modules presented above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://metacpan.org/pod/AnyEvent::YACurl"&gt;AnyEvent::YACurl&lt;/a&gt;: A newer library than Net::Curl that simplifies the interface a bit. It assumes use of &lt;a href="https://metacpan.org/pod/AnyEvent"&gt;AnyEvent&lt;/a&gt;, though, so if you’re not using AE then this may not be for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://metacpan.org/pod/WWW::Curl"&gt;WWW::Curl&lt;/a&gt;: The library of which Net::Curl is a fork. It can do much of what Net::Curl does but lacks access to libcurl’s &lt;a href="https://curl.se/libcurl/c/libcurl-multi.html"&gt;MULTI_SOCKET interface&lt;/a&gt;, which is faster and more flexible than curl’s internal &lt;a href="https://linux.die.net/man/2/select"&gt;select&lt;/a&gt;-based manager for concurrent requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://metacpan.org/pod/Net::Curl::Simple"&gt;Net::Curl::Simple&lt;/a&gt;: A wrapper by Net::Curl’s original author. It provides some of the same conveniences as Net::Curl::Promiser and Net::Curl::Easier but uses callbacks rather than promises.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Closing Thoughts
&lt;/h1&gt;

&lt;p&gt;Curl exposes an awesome breadth of functionality, of which the above examples have just scratched the surface. Check it out!&lt;/p&gt;

</description>
      <category>perl</category>
      <category>programming</category>
    </item>
    <item>
      <title>use Sys::Binmode;</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Sat, 27 Mar 2021 10:34:01 +0000</pubDate>
      <link>https://dev.to/fgasper/use-sys-binmode-4e6o</link>
      <guid>https://dev.to/fgasper/use-sys-binmode-4e6o</guid>
      <description>&lt;p&gt;Character encoding is an often-misunderstood aspect of Perl. Perl itself has significant bugs in the area. I recently published a CPAN module called &lt;a href="https://metacpan.org/pod/Sys::Binmode"&gt;Sys::Binmode&lt;/a&gt; which fixes most of those. If you write Perl you &lt;em&gt;probably&lt;/em&gt; should use it in all new code.&lt;/p&gt;

&lt;p&gt;I know that’s a “tall” claim, but …&lt;/p&gt;

&lt;h1&gt;
  
  
  Check this out:
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $foo = "\xff\x{100}";
chop $foo;
print $foo, $/;
exec "echo", $foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;em&gt;looks&lt;/em&gt; like it ought to print two identical lines, right? But in fact, it prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;�
ÿ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? To answer that we have to learn a bit of Perl’s internals. Read on!&lt;/p&gt;

&lt;h1&gt;
  
  
  Background: What’s in a string?
&lt;/h1&gt;

&lt;p&gt;In theory, Perl strings store &lt;em&gt;code points&lt;/em&gt;, nothing more. They don’t store “bytes” or “characters”, but just code points—i.e., unsigned integers. (In that sense, Perl is more like JavaScript than C!)&lt;/p&gt;

&lt;p&gt;That, of course, is just an abstraction: all programming languages use bytes &lt;em&gt;internally&lt;/em&gt; to store strings. How does Perl decide which bytes to use for which code points? As it happens, Perl can do that in either of two formats: a “narrow” format that can store code points 0-255 only, and a “wide” format that can store any arbitrary code point. Which of those formats Perl uses for a given string &lt;em&gt;is up to Perl&lt;/em&gt;; things that aren’t Perl &lt;em&gt;generally&lt;/em&gt; shouldn’t care about it.&lt;/p&gt;

&lt;p&gt;For this abstraction to work, whether Perl stores a given string as “narrow” or “wide” must make no difference to a Perl program. And indeed, if you &lt;code&gt;print $foo&lt;/code&gt; you’ll get the same result regardless of which internal format Perl uses to store &lt;code&gt;$foo&lt;/code&gt;. Same for &lt;code&gt;syswrite&lt;/code&gt; and &lt;code&gt;send&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most&lt;/em&gt; of Perl’s built-ins, though—e.g., &lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;mkdir&lt;/code&gt;, etc.—don’t work this way.&lt;/p&gt;

&lt;p&gt;Look again at our 4-line program above. In line 1 we create a string with 2 code points: 255 and 256. In line 2 we chop off the latter code point, so now &lt;code&gt;$foo&lt;/code&gt; just has 255. In line 3 we print that string and a newline; in line 4 we run &lt;code&gt;echo&lt;/code&gt; to do the same thing. Ideally lines 3 and 4 should achieve the same output. But for you they probably didn’t. Why?&lt;/p&gt;

&lt;p&gt;Let’s rerun that program but this time pipe it to &lt;code&gt;xxd&lt;/code&gt; to see exactly what’s being output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; pbpaste | perl | xxd
00000000: ff0a c3bf 0a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;0a&lt;/code&gt; is just the newline character. So line 3 printed a single byte, 0xff (plus newline), while &lt;code&gt;echo&lt;/code&gt; on line 4 printed 2 bytes—0xc3 0xbf (and a newline). Line 3 is correct: a string that contains code point 255 should output byte 255 (i.e., 0xff). What’s going on with line 4?&lt;/p&gt;

&lt;p&gt;Recall that, of Perl’s internal string-storage formats, only the “wide” one can handle code points above 255. Since &lt;code&gt;$foo&lt;/code&gt; on line 1 contains code point 256, Perl stores that string in “wide” format. Then in line 2 we get rid of code point 256. Now we have just 255. Perl &lt;em&gt;could&lt;/em&gt; thus switch our string to its “narrow” (i.e., 0-255) format, but it happens—as of Perl 5.32, anyway—not to.&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;should&lt;/em&gt; make no difference since those internal storage details are behind Perl’s string abstraction. That’s the case with &lt;code&gt;print&lt;/code&gt;, but &lt;code&gt;exec&lt;/code&gt; misbehaves: it outputs Perl’s raw internal buffer rather than the proper code-point-to-byte conversion that &lt;code&gt;print&lt;/code&gt; uses. Perl, though, doesn’t publicly define the &lt;em&gt;contents&lt;/em&gt; of that internal buffer. Thus we have &lt;a href="https://en.wikipedia.org/wiki/Undefined_behavior"&gt;undefined behaviour&lt;/a&gt;, aka “nasal demons”, built &lt;em&gt;directly&lt;/em&gt; into Perl!&lt;/p&gt;

&lt;p&gt;This is a leak in Perl’s string-storage abstraction, and it’s what Sys::Binmode fixes.&lt;/p&gt;
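
&lt;p&gt;The two storage formats can be observed from plain Perl without breaking the abstraction. The following is a small sketch using only core functions; &lt;code&gt;utf8::is_utf8()&lt;/code&gt; reports whether a string is currently stored “wide”, and the variable names are just for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

my $wide = "\xff\x{100}";
chop $wide;             # just code point 255 now, but still stored "wide"

my $narrow = "\xff";    # the same code point, stored "narrow"

# To Perl code the two strings are indistinguishable ...
print "equal\n"       if $wide eq $narrow;
print "same length\n" if length($wide) == length($narrow);

# ... even though their internal formats differ.
printf "wide upgraded? %s\n",   utf8::is_utf8($wide)   ? 'yes' : 'no';
printf "narrow upgraded? %s\n", utf8::is_utf8($narrow) ? 'yes' : 'no';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On the Perl versions discussed here, the strings should compare equal while only &lt;code&gt;$wide&lt;/code&gt; reports as upgraded. That hidden difference is exactly what &lt;code&gt;exec&lt;/code&gt; exposes.&lt;/p&gt;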

&lt;p&gt;(Extra credit: remove the &lt;code&gt;\x{100}&lt;/code&gt; and line 2 from our program above, and rerun it. The two lines should now be the same. Why?)&lt;/p&gt;

&lt;h1&gt;
  
  
  Enter Sys::Binmode
&lt;/h1&gt;

&lt;p&gt;Sys::Binmode fixes &lt;code&gt;exec&lt;/code&gt; and many other Perl built-ins by force-converting those built-ins’ arguments to Perl’s internal “narrow” string storage format. This fixes the abstraction leak: now, no matter how these strings are stored, Perl gives them to the operating system the same way.&lt;/p&gt;

&lt;p&gt;Try it: do &lt;code&gt;cpan Sys::Binmode&lt;/code&gt;, then rerun our program with &lt;code&gt;perl -MSys::Binmode&lt;/code&gt;. It’ll now print two identical lines.&lt;/p&gt;
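
&lt;p&gt;For a single call, much the same effect can be sketched by hand with core Perl’s &lt;code&gt;utf8::downgrade()&lt;/code&gt;. This only illustrates the idea of forcing the “narrow” format; it is not a description of how Sys::Binmode is implemented:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

my $foo = "\xff\x{100}";
chop $foo;    # code point 255, still stored "wide"

# utf8::downgrade() rewrites the string in place in the "narrow" format.
# It dies if any code point exceeds 255, because such a string has no
# one-byte-per-code-point form to hand to the operating system.
utf8::downgrade($foo);

print $foo, $/;
exec "echo", $foo;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Doing this before every affected built-in is tedious and easy to forget, which is the argument for letting a module do it lexically.&lt;/p&gt;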

&lt;h2&gt;
  
  
  Special Case: Non-POSIX OSes (e.g., Windows)
&lt;/h2&gt;

&lt;p&gt;Windows programmers may see a problem here: Perl’s “narrow” string storage format can only store bytes, so any time we want to give arbitrary Unicode characters to the operating system (a need that doesn’t arise on POSIX OSes like Linux, whose system calls take bytes) we’re stuck.&lt;/p&gt;

&lt;p&gt;As it happens, though, Perl doesn’t actually &lt;em&gt;use&lt;/em&gt; the Windows APIs that would allow sending arbitrary Unicode characters anyway. If Perl ever changed that, Sys::Binmode would need an update, but for now it can work the same way as on POSIX OSes without compromising any functionality.&lt;/p&gt;

&lt;h1&gt;
  
  
  Use in Existing Code?
&lt;/h1&gt;

&lt;p&gt;Note that I say to use Sys::Binmode in &lt;strong&gt;new&lt;/strong&gt; code, not &lt;em&gt;all&lt;/em&gt; code. This is because existing code may actually &lt;em&gt;depend&lt;/em&gt; on Perl’s abstraction leak.&lt;/p&gt;

&lt;p&gt;Look again at &lt;code&gt;exec&lt;/code&gt;’s broken behaviour above. For code point 255 it printed the bytes of Perl’s “wide” storage format, which for that string was 2 bytes: 0xc3 and 0xbf. Notice that that broken behaviour actually made our terminal print something useful: &lt;code&gt;ÿ&lt;/code&gt;. As it happens, those 2 bytes from Perl’s internals are UTF-8 for 255. That’s because Perl’s “wide” internal format is actually just (&lt;a href="https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8"&gt;a “lax” variant of&lt;/a&gt;) UTF-8, so anything that outputs Perl’s internals will output UTF-8 if Perl stores the string in “wide” format.&lt;/p&gt;

&lt;p&gt;Ordinarily, to output a string in UTF-8 you &lt;em&gt;encode&lt;/em&gt; it explicitly, e.g., &lt;code&gt;encode('UTF-8', $str)&lt;/code&gt;. &lt;code&gt;exec&lt;/code&gt; &lt;em&gt;appears&lt;/em&gt; to be automatically &lt;code&gt;encode()&lt;/code&gt;ing for us, but it’s actually just outputting whatever Perl happens to store internally. So if Perl decides to store a string “wide”, it’ll give UTF-8 to &lt;code&gt;exec&lt;/code&gt; … but if Perl decides to store that string “narrow”, then you’ll get something else! We could try to second-guess Perl’s internal decision-making, but that’s dangerous: how Perl decides to store its strings is undocumented and always subject to change.&lt;/p&gt;

&lt;p&gt;Sys::Binmode will suppress that unreliable “auto-encode” behaviour, which forces us to encode our strings properly before giving them to &lt;code&gt;exec&lt;/code&gt; and friends. Of course, that’s what we should have done all along!&lt;/p&gt;
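
&lt;p&gt;Concretely, encoding first might look like the sketch below. It assumes UTF-8 is what the receiving program expects, and it uses core Encode rather than any particular wrapper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;
use Encode ();

my $str   = "\x{2026}";                         # one character: U+2026
my $bytes = Encode::encode( 'UTF-8', $str );    # three bytes: e2 80 a6

printf "%d character, %d bytes\n", length($str), length($bytes);

# $bytes holds only code points 0-255, so what reaches the OS no longer
# depends on how Perl happened to store the original string.
exec 'echo', $bytes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;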

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I can think of no situation where Sys::Binmode effects any undesirable change to Perl in new code. It surely fixes bugs like in our &lt;code&gt;exec&lt;/code&gt; demo program. Assuming that I’m correct that, for new code, this module &lt;em&gt;only&lt;/em&gt; avoids problems without introducing any, it should be used in all new code.&lt;/p&gt;

&lt;p&gt;Convinced? :-)&lt;/p&gt;

</description>
      <category>perl</category>
    </item>
    <item>
      <title>Perl, Unicode, and Bytes</title>
      <dc:creator>Felipe Gasper</dc:creator>
      <pubDate>Fri, 29 Jan 2021 00:29:50 +0000</pubDate>
      <link>https://dev.to/fgasper/perl-unicode-and-bytes-5cg7</link>
      <guid>https://dev.to/fgasper/perl-unicode-and-bytes-5cg7</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wide character in print at Foo/Bar.pm line 27.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve all been here: that maddening “wide character” warning. Why does it happen? How can we fix it? How can we prevent it in the future? Let’s take a look.&lt;/p&gt;

&lt;p&gt;Lots of early Perl adopters were C programmers. C strings are arrays of bytes, which allow code points up to 255, and that’s it. Perl used that model for many years.&lt;/p&gt;

&lt;p&gt;Along came Unicode, and with it a need for Perl to store code points that exceed 255 (i.e., “wide characters”). The solution—which Perl retains today—was to give Perl a 2nd way of storing a string: in addition to C-style “byte strings”, Perl can store strings in an internal, Unicode-compatible encoding. Thus, a Perl string can now natively store any Unicode code point.&lt;/p&gt;

&lt;p&gt;Of course, programs don’t generally &lt;em&gt;receive&lt;/em&gt; “wide characters” as inputs. They receive &lt;em&gt;bytes&lt;/em&gt;, then &lt;strong&gt;decode&lt;/strong&gt; those bytes into “characters”. Then they &lt;strong&gt;encode&lt;/strong&gt; the characters back into bytes for output. In general, then, each program:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;… receives bytes as input,&lt;/li&gt;
&lt;li&gt;… decodes those bytes to characters,&lt;/li&gt;
&lt;li&gt;… does something with those characters,&lt;/li&gt;
&lt;li&gt;… encodes its output characters to bytes,&lt;/li&gt;
&lt;li&gt;… and outputs those bytes.&lt;/li&gt;
&lt;/ol&gt;
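
&lt;p&gt;Those five steps can be sketched with core Encode as follows. The input bytes are hard-coded for illustration (a real program would read them from a file or socket), and &lt;code&gt;LEAVE_SRC&lt;/code&gt; is there because Encode may otherwise modify its input when given a check value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;
use Encode ();

# Die on invalid data, and leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $bytes_in  = "caf\xc3\xa9";                                   # 1. input bytes (UTF-8)
my $chars     = Encode::decode( 'UTF-8', $bytes_in, $check );    # 2. decode to characters
my $upper     = uc $chars;                                       # 3. work on characters
my $bytes_out = Encode::encode( 'UTF-8', $upper, $check );       # 4. encode to bytes
print $bytes_out, "\n";                                          # 5. output bytes

printf "%d bytes in, %d characters\n", length($bytes_in), length($chars);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The byte count and the character count differ (5 versus 4), which is the whole reason steps 2 and 4 exist.&lt;/p&gt;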

&lt;p&gt;Here’s the trick: lots of Perl programs simply &lt;em&gt;don’t care&lt;/em&gt; about “characters”; for example, if all you’re doing is piping a stream from one filehandle to another, there’s no reason to decode bytes to characters since we’re just going to re-encode those characters to bytes right away. For such programs, Perl’s pre-Unicode, a-byte-is-a-character-is-a-byte model works just fine.&lt;/p&gt;

&lt;p&gt;Let’s call these two workflows “character-oriented” and “byte-oriented”. Most character encoding problems in Perl arise from a conflict between these two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Byte-Oriented Data in a Character-Oriented World
&lt;/h2&gt;

&lt;p&gt;Suppose we omit step 2 above. Consider the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MJSON::PP -E'my $s = "…"; say JSON::PP::encode_json([$s])'
["â€¦"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To grok the above, first consider &lt;code&gt;$s&lt;/code&gt;. Most folks nowadays probably use UTF-8 terminals, which means &lt;code&gt;…&lt;/code&gt; takes 3 bytes: &lt;code&gt;0xe2 0x80 0xa6&lt;/code&gt;. Our one-liner doesn’t decode &lt;code&gt;$s&lt;/code&gt;, so as far as Perl’s concerned &lt;code&gt;$s&lt;/code&gt; is 3 characters: &lt;code&gt;0xe2 0x80 0xa6&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;encode_json()&lt;/code&gt;, though, expects its input strings to be decoded. It also outputs a byte sequence; thus, it applies a UTF-8 encode to each of &lt;code&gt;$s&lt;/code&gt;’s 3 characters, which yields &lt;em&gt;6&lt;/em&gt; bytes: &lt;code&gt;0xe2&lt;/code&gt; becomes &lt;code&gt;0xc3 0xa2&lt;/code&gt;, &lt;code&gt;0x80&lt;/code&gt; becomes &lt;code&gt;0xc2 0x80&lt;/code&gt;, and &lt;code&gt;0xa6&lt;/code&gt; becomes &lt;code&gt;0xc2 0xa6&lt;/code&gt;.&lt;/p&gt;
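
&lt;p&gt;To see those six bytes appear, the sketch below applies the same kind of UTF-8 encode by hand using core Encode. JSON::PP itself isn’t involved here; this only reproduces the arithmetic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;
use Encode ();

my $s     = "\xe2\x80\xa6";                  # the 3 undecoded UTF-8 bytes of U+2026
my $twice = Encode::encode( 'UTF-8', $s );   # each "character" gets encoded again

# Print each resulting byte in hex.
print join( ' ', map { sprintf( '%02x', ord($_) ) } split //, $twice ), "\n";
# prints: c3 a2 c2 80 c2 a6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;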

&lt;p&gt;To fix this, we can do one of:&lt;/p&gt;

&lt;p&gt;A) Decode the input, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $s = "…";
# decode_utf8() returns the decoded string; it leaves $s itself untouched.
$s = Encode::Simple::decode_utf8($s);
say JSON::PP::encode_json([$s]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;B) Provide a “pre-decoded” string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $s = "\x{2026}";
say JSON::PP::encode_json([$s]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;C) Make the JSON encoder forgo character encoding, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $s = "…";
say JSON::PP-&amp;gt;new()-&amp;gt;utf8(0)-&amp;gt;encode([$s]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CAVEAT:&lt;/strong&gt; This last approach can yield &lt;em&gt;invalid JSON.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Character-Oriented Data in a Byte-Oriented World
&lt;/h2&gt;

&lt;p&gt;The opposite problem—omitting step 4 in our 5-step workflow above—is a bit more interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MJSON::PP -E'say JSON::PP::decode_json(q&amp;lt;["…"]&amp;gt;)-&amp;gt;[0]'
Wide character in print at -e line 1.
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike before, where the mangled characters in the output reveal a palpable problem, here the program actually &lt;em&gt;prints the right thing&lt;/em&gt;; it’s just throwing a warning along the way. What gives?&lt;/p&gt;

&lt;p&gt;Just as &lt;code&gt;encode_json()&lt;/code&gt; does a UTF-8 encode on its input, &lt;code&gt;decode_json()&lt;/code&gt; does a UTF-8 &lt;em&gt;decode&lt;/em&gt;. That means that &lt;code&gt;decode_json(q&amp;lt;["…"]&amp;gt;)-&amp;gt;[0]&lt;/code&gt; is a &lt;em&gt;single&lt;/em&gt; character, &lt;code&gt;0x2026&lt;/code&gt;. So before we print it we’re supposed to &lt;em&gt;encode&lt;/em&gt; it. Indeed, once we do that, the warning goes away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MEncode::Simple -MJSON::PP -E'say encode_utf8( JSON::PP::decode_json(q&amp;lt;["…"]&amp;gt;)-&amp;gt;[0])'
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  So can I just ignore that warning?
&lt;/h2&gt;

&lt;p&gt;Maybe. But don’t.&lt;/p&gt;

&lt;p&gt;As we know, Perl can store strings as “byte strings”: simple sequences of code points 0-255. Perl can also, though, store strings in an “upgraded”, abstract Unicode encoding. Such an “upgraded” string falls into one of two categories:&lt;/p&gt;

&lt;p&gt;1) “Bytes-compatible”: All code points fall in the 0-255 range. In other words, Perl &lt;em&gt;could&lt;/em&gt; store this string “downgraded”, but for whatever reason isn’t.&lt;/p&gt;

&lt;p&gt;2) “Bytes-incompatible”: One or more code points exceed 255.&lt;/p&gt;

&lt;p&gt;When outputting upgraded strings, Perl follows these rules:&lt;/p&gt;

&lt;p&gt;1) If the string is bytes-compatible: output the string’s “downgraded” form.&lt;/p&gt;

&lt;p&gt;2) Otherwise: Output the code points encoded to UTF-8, and “complain”: if we’re &lt;code&gt;syswrite()&lt;/code&gt;ing, Perl throws an exception, but if we’re &lt;code&gt;say()&lt;/code&gt;ing or &lt;code&gt;print()&lt;/code&gt;ing then Perl just warns.&lt;/p&gt;

&lt;p&gt;Of course, lots of applications output UTF-8 anyway, in which case #2 above &lt;em&gt;happens&lt;/em&gt; to be “the right thing”. But Perl would rather you be explicit: encode your strings before outputting them.&lt;/p&gt;

&lt;h2&gt;
  
  
  That Encoding Behind the Curtain …
&lt;/h2&gt;

&lt;p&gt;Perl’s “internal Unicode encoding” is, in fact, just UTF-8. (Actually a “loose” variant thereof, but we digress.) It’s really &lt;strong&gt;better to forget this&lt;/strong&gt; unless you’re maintaining Perl itself—&lt;a href="https://dev.to/fgasper/perl-s-svpv-menace-5515"&gt;even XS modules shouldn’t care!&lt;/a&gt;—but for the sake of a concrete understanding we’ll look at a few examples here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perl Internals: Wide Characters
&lt;/h3&gt;

&lt;p&gt;Compare the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MDevel::Peek -MEncode::Simple -e'my $s = "…"; $s = decode_utf8($s); Dump $s'
SV = PV(0x7fc992804c70) at 0x7fc992816348
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7fc9927006d0 "\342\200\246"\0 [UTF8 "\x{2026}"]
  CUR = 3
  LEN = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… versus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MDevel::Peek -e'my $s = "…"; Dump $s'
SV = PV(0x7f9e5e804c70) at 0x7f9e5e8162a0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x7f9e5e40bbe0 "\342\200\246"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important piece here is that &lt;code&gt;[UTF8 "\x{2026}"]&lt;/code&gt; bit that we see only in the top example. This is the string’s content as Perl code sees it: a single character with code point 0x2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perl Internals: UTF8-Invariant Characters
&lt;/h3&gt;

&lt;p&gt;Now consider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MEncode::Simple -MDevel::Peek -e'Dump( decode_utf8("abc") )'
SV = PV(0x7f81bc004d30) at 0x7f81bc0042a8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7f81bbf46770 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A special feature of UTF-8 is that, unlike other Unicode encodings (UTF-16 &amp;amp; al.), it encodes code points 0-127 identically to US-ASCII and ISO-8859-1. We call these code points “UTF8-invariant” because Perl stores them as the same bytes regardless of whether the string is upgraded or not.&lt;/p&gt;

&lt;p&gt;Watch this, though:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MDevel::Peek -MEncode -e'my $s = "abc"; utf8::decode($s) or die "bad"; Dump $s'
SV = PV(0x7fa09a004c70) at 0x7fa09a016348
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7fa099e01540 "abc"\0
  CUR = 3
  LEN = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same logic as we achieved with &lt;a href="https://metacpan.org/pod/Encode::Simple"&gt;Encode::Simple&lt;/a&gt;, but with a twist: Perl &lt;em&gt;did not upgrade the string!&lt;/em&gt; What gives??&lt;/p&gt;

&lt;p&gt;It turns out that upgraded strings are slower than their downgraded forms: to do much of anything with upgraded strings you have to parse each Unicode character out of the buffer. For this reason, &lt;code&gt;utf8::decode&lt;/code&gt; will (like &lt;a href="https://perldoc.perl.org/perlapi#sv_utf8_decode"&gt;its parallel C API function&lt;/a&gt;) &lt;em&gt;leave strings downgraded&lt;/em&gt; unless the input contains a multi-byte UTF-8 sequence (i.e., unless the decoded string has a code point above 127). Encode::Simple, by contrast, &lt;em&gt;always upgrades&lt;/em&gt;, even for bytes-compatible strings. (&lt;a href="https://metacpan.org/pod/distribution/Unicode-UTF8/lib/Unicode/UTF8.pod"&gt;Unicode::UTF8&lt;/a&gt; does the same.)&lt;/p&gt;

&lt;p&gt;This is why we can’t just say “Perl stores text strings as UTF-8.” Some character decoders &lt;em&gt;do&lt;/em&gt; work that way, but Perl’s own internal decoder doesn’t.&lt;/p&gt;
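
&lt;p&gt;The difference is visible from plain Perl, too. Here is a sketch using only core &lt;code&gt;utf8&lt;/code&gt; functions; &lt;code&gt;utf8::is_utf8()&lt;/code&gt; reports the internal storage format, which is fine for a demonstration but not something application logic should normally consult:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use strict;
use warnings;

my $ascii = "abc";
utf8::decode($ascii) or die "bad UTF-8";

my $ellipsis = "\xe2\x80\xa6";    # the UTF-8 bytes of U+2026
utf8::decode($ellipsis) or die "bad UTF-8";

# Both strings are now decoded, but only one needed the upgraded format.
printf "abc: length %d, upgraded? %s\n",
    length($ascii), utf8::is_utf8($ascii) ? 'yes' : 'no';
printf "ellipsis: length %d, upgraded? %s\n",
    length($ellipsis), utf8::is_utf8($ellipsis) ? 'yes' : 'no';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first string should stay downgraded at length 3, while the second becomes a single upgraded character.&lt;/p&gt;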

&lt;h3&gt;
  
  
  Perl Internals: The &lt;em&gt;Really&lt;/em&gt; Messy Part
&lt;/h3&gt;

&lt;p&gt;We’ve looked at how Perl stores bytes-incompatible (&amp;gt;255) code points and UTF8-invariant ones (0-127). What about the 128-255 range?&lt;/p&gt;

&lt;p&gt;Here’s where it gets dicey: these code points are bytes-compatible but &lt;strong&gt;not&lt;/strong&gt; UTF8-invariant. Perl can thus store these either downgraded or upgraded, but this time &lt;em&gt;it matters&lt;/em&gt; which they are.&lt;/p&gt;

&lt;p&gt;Recall our example above where we looked at the &lt;code&gt;Dump()&lt;/code&gt; of undecoded &lt;code&gt;…&lt;/code&gt;. Compare that to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MDevel::Peek -e'my $s = "…"; utf8::upgrade($s); Dump $s'
SV = PV(0x7feb80004c70) at 0x7feb800162a0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7feb7fc04930 "\303\242\302\200\302\246"\0 [UTF8 "\x{e2}\x{80}\x{a6}"]
  CUR = 6
  LEN = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;utf8::upgrade()&lt;/code&gt; &lt;em&gt;internally&lt;/em&gt; encodes the formerly-downgraded &lt;code&gt;$s&lt;/code&gt; as UTF-8. As far as Perl code goes it’s the same string; only its internal representation changes. Since &lt;code&gt;$s&lt;/code&gt; was already a UTF-8 sequence, what Perl stores in memory is &lt;em&gt;double-encoded&lt;/em&gt;; however, to the Perl application it actually makes no difference because anything that &lt;em&gt;accesses&lt;/em&gt; that string will see 3 characters (&lt;code&gt;0xe2 0x80 0xa6&lt;/code&gt;), not Perl’s internally-double-encoded stuff. This includes outputting the string, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -E'my $s = "…"; say $s; utf8::upgrade($s); say $s'
…
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s sometimes surprising which interfaces return upgraded strings and which don’t. For example, JSON::PP’s &lt;code&gt;encode()&lt;/code&gt; returns an upgraded string, even if we disable character encoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; perl -MDevel::Peek -MJSON::PP -E'Dump( JSON::PP-&amp;gt;new()-&amp;gt;utf8(0)-&amp;gt;encode(["…"]) )'
SV = PV(0x7fd786004ff0) at 0x7fd78909e4f8
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x7fd78826a8a0 "[\"\303\242\302\200\302\246\"]"\0 [UTF8 "["\x{e2}\x{80}\x{a6}"]"]
  CUR = 10
  LEN = 13
  COW_REFCNT = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  REMINDER: Nothing to See Here!
&lt;/h3&gt;

&lt;p&gt;The above Devel::Peek examples are a &lt;strong&gt;purely-informational&lt;/strong&gt; “peek behind the curtain” at Perl’s internals. Unless you’re altering Perl itself—again, &lt;a href="https://dev.to/fgasper/perl-s-svpv-menace-5515"&gt;even XS modules should ignore Perl internals&lt;/a&gt;—ignore Perl’s internal encoding.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our Way Forward
&lt;/h1&gt;

&lt;p&gt;Most modern programming languages use different types to represent “binary strings” and “character strings”. Perl, for better or for worse, does not; like the difference between a string and a number, &lt;a href="https://perldoc.perl.org/perlunifaq#How-can-I-determine-if-a-string-is-a-text-string-or-a-binary-string?"&gt;we have to track that ourselves&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here, then, are the best things we Perl programmers can do for ourselves and for each other to prevent character encoding problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Consider Perl to have &lt;strong&gt;one type of string:&lt;/strong&gt; a &lt;em&gt;character&lt;/em&gt; string. Perl wants you to ignore its internal encoding; don’t fight that. (&lt;em&gt;Technically&lt;/em&gt; Perl could change its internal encoding scheme, and well-behaved modules, whether pure-Perl or XS, would keep working.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document whether your modules expect strings to be character-decoded or not. Do likewise for returned strings. (Maybe even provide functions for both, as &lt;a href="https://metacpan.org/pod/Mojo::JSON"&gt;Mojo::JSON&lt;/a&gt; does.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prefer &lt;a href="https://metacpan.org/pod/Encode::Simple"&gt;Encode::Simple&lt;/a&gt; over alternatives like &lt;a href="https://metacpan.org/pod/Encode"&gt;Encode&lt;/a&gt;, &lt;a href="https://metacpan.org/pod/utf8"&gt;utf8&lt;/a&gt;, and &lt;a href="https://metacpan.org/pod/Unicode::UTF8"&gt;Unicode::UTF8&lt;/a&gt;. Encode::Simple, by default, throws an exception when it encounters invalid data, which means you’ll catch errors up-front rather than deep in your code. The others all accept invalid input by default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For XS authors: When working with PVs (strings), &lt;em&gt;always&lt;/em&gt; differentiate between the two encodings. Macros like &lt;a href="https://perldoc.perl.org/perlapi#SvPVbyte"&gt;SvPVbyte&lt;/a&gt;, &lt;a href="https://perldoc.perl.org/perlapi#SvPVutf8"&gt;SvPVutf8&lt;/a&gt;, and their variants are your friends!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>perl</category>
      <category>unicode</category>
    </item>
  </channel>
</rss>
