<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Peter</title>
    <description>The latest articles on DEV Community by David Peter (@sharkdp).</description>
    <link>https://dev.to/sharkdp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F95033%2Fa9643c1c-3e0f-423b-8139-f0b193f75139.jpeg</url>
      <title>DEV Community: David Peter</title>
      <link>https://dev.to/sharkdp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharkdp"/>
    <language>en</language>
    <item>
      <title>Hacktoberfest 2020 - a retrospective</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 01 Nov 2020 09:58:00 +0000</pubDate>
      <link>https://dev.to/sharkdp/retrospective-hacktoberfest-2020-k91</link>
      <guid>https://dev.to/sharkdp/retrospective-hacktoberfest-2020-k91</guid>
<description>&lt;p&gt;This year's &lt;a href="https://hacktoberfest.digitalocean.com/" rel="noopener noreferrer"&gt;Hacktoberfest&lt;/a&gt; didn't have a great start. There was a lot of understandable &lt;a href="https://news.ycombinator.com/item?id=24643894" rel="noopener noreferrer"&gt;controversy&lt;/a&gt; around the fact that some people abused the system by sending spam pull requests in order to get a free shirt. Eventually, this led to some important &lt;a href="https://hacktoberfest.digitalocean.com/hacktoberfest-update" rel="noopener noreferrer"&gt;policy changes&lt;/a&gt;, one of which was to make Hacktoberfest opt-in for maintainers.&lt;/p&gt;

&lt;p&gt;While I think this is a good result overall, I also felt kind of sad because it put the whole event in such a bad light. Personally, I had great experiences with Hacktoberfest in the past years - both as a contributor and as a maintainer - which is why I decided to actively enable "Hacktoberfest" contributions on some of &lt;a href="https://github.com/sharkdp/" rel="noopener noreferrer"&gt;my projects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://github.com/sharkdp/bat" rel="noopener noreferrer"&gt;&lt;code&gt;bat&lt;/code&gt;&lt;/a&gt; specifically, I opened three tickets that included ideas and instructions for contributing &lt;a href="https://github.com/sharkdp/bat/issues/1211" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; &lt;a href="https://github.com/sharkdp/bat/issues/1213" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; &lt;a href="https://github.com/sharkdp/bat/issues/1216" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;. They were mainly targeted towards first-time contributors. In the following chart, you can see what kind of effect the opt-in strategy had on the number of contributions to the &lt;code&gt;bat&lt;/code&gt; repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27lnwjh1kdtda32mvni1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27lnwjh1kdtda32mvni1.png" alt="number of pull requests per month"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I should note that the large majority of these are small contributions: new test cases or documentation updates. But that does not mean they are not helpful for the project - quite the contrary. And more importantly, I don't think that's even the point of Hacktoberfest.&lt;/p&gt;

&lt;p&gt;The really great thing is that it motivates people to get started with open source work (or to rekindle their engagement). For &lt;code&gt;bat&lt;/code&gt;, we received a lot of contributions from newcomers. It is fantastic to see the excitement when their PR is being merged. Most of them are also really grateful for review comments and very happy to push further updates.&lt;/p&gt;

&lt;p&gt;We did not receive a single spam contribution. Sure, there is always a small fraction of PRs that are going to be rejected (3 out of 129 for &lt;code&gt;bat&lt;/code&gt;). But that is not specific to Hacktoberfest. As a maintainer, the amount of work you put into an average first-time contribution is definitely a bit higher than usual. But we have all started as a beginner once. Personally, I can definitely still remember my first contributions to the open source world and the kind of excitement I felt.&lt;/p&gt;

&lt;p&gt;To summarize: I still think that Hacktoberfest is a great initiative and I am definitely looking forward to future events.&lt;/p&gt;

&lt;p&gt;Thank you to all contributors and to my co-maintainers &lt;a href="https://github.com/eth-p" rel="noopener noreferrer"&gt;eth-p&lt;/a&gt; and &lt;a href="https://github.com/keith-hall" rel="noopener noreferrer"&gt;keith-hall&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>opensource</category>
    </item>
    <item>
      <title>An unexpected performance regression</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Mon, 16 Sep 2019 19:16:58 +0000</pubDate>
      <link>https://dev.to/sharkdp/an-unexpected-performance-regression-11ai</link>
      <guid>https://dev.to/sharkdp/an-unexpected-performance-regression-11ai</guid>
<description>&lt;p&gt;Performance regressions are something that I find rather hard to track in an automated way. For the past few years, I have been working on &lt;a href="https://github.com/sharkdp/fd"&gt;a tool called &lt;code&gt;fd&lt;/code&gt;&lt;/a&gt;, which aims to be a &lt;em&gt;fast&lt;/em&gt; and user-friendly (but not necessarily feature-complete) alternative to &lt;a href="https://www.gnu.org/software/findutils/"&gt;&lt;code&gt;find&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As you would expect from a file-searching tool, &lt;code&gt;fd&lt;/code&gt; is an I/O-heavy program whose performance is governed by external factors like filesystem speed, caching effects, as well as OS-specific aspects. To get reliable and meaningful timing results, I developed a &lt;a href="https://github.com/sharkdp/hyperfine"&gt;command-line benchmarking tool called &lt;code&gt;hyperfine&lt;/code&gt;&lt;/a&gt; which takes care of things like warmup runs (for hot-cache benchmarks) or cache-clearing preparation commands (for cold-cache benchmarks). It also performs an analysis across multiple runs and warns the user about outside interference by detecting statistical outliers¹.&lt;/p&gt;

&lt;p&gt;But this is just a small part of the problem. The real challenge is to find a suitable collection of benchmarks that tests different aspects of your program across a wide range of environments. To get a feeling for the vast number of factors that can influence the runtime of a program like &lt;code&gt;fd&lt;/code&gt;, let me tell you about one particular performance regression that I found recently².&lt;/p&gt;

&lt;p&gt;I keep a small collection of old &lt;code&gt;fd&lt;/code&gt; executables around in order to quickly run specific benchmarks across different versions. I noticed a significant performance regression between &lt;code&gt;fd-7.0.0&lt;/code&gt; and &lt;code&gt;fd-7.1.0&lt;/code&gt; in one of the benchmarks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8jPMW0L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ep14199yxpaswdcttop2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8jPMW0L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ep14199yxpaswdcttop2.png" alt="performance regression between 7.0 and 7.1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I quickly looked at the commits between 7.0 and 7.1 to see if there were any changes that could have introduced this regression. I couldn't find any obvious candidates.&lt;/p&gt;

&lt;p&gt;Next, I decided to perform a small binary search by re-compiling specific commits and running the benchmark. To my surprise, I wasn't able to reproduce the fast times that I had measured with the precompiled binaries of the old versions. Every single commit yielded slow results!&lt;/p&gt;
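&lt;p&gt;As an aside, this kind of binary search over commits can be automated with &lt;code&gt;git bisect run&lt;/code&gt;. Here is a self-contained toy demonstration - the repository, the tags, and the &lt;code&gt;grep&lt;/code&gt;-based "benchmark" predicate are all fabricated for illustration; a real setup would compile each commit and compare a benchmark time against a threshold:&lt;/p&gt;

```shell
# Fabricated toy repository: the "regression" is a marker file
# flipping from "fast" to "slow" in the last commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo fast > speed; git add speed; git commit -qm "baseline"
git tag v7.0.0                     # last known-fast release (toy)
echo tweak > other; git add other; git commit -qm "unrelated change"
echo slow > speed; git commit -qam "introduces the regression"
git tag v7.1.0                     # first known-slow release (toy)

# Bisect between the two tags; the "benchmark" here is just a grep.
git bisect start v7.1.0 v7.0.0     # bad ref first, then good ref
git bisect run grep -q fast speed | tee bisect.log
git bisect reset
```

&lt;p&gt;In the real scenario, of course, this would not have helped: as it turned out, the slowdown was not introduced by any commit at all.&lt;/p&gt;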

&lt;p&gt;There was only one way this could have happened: the old binaries were faster because they were compiled with an &lt;em&gt;older version of the Rust compiler&lt;/em&gt;. The version that came out shortly before the &lt;code&gt;fd-7.1.0&lt;/code&gt; release was &lt;a href="https://blog.rust-lang.org/2018/08/02/Rust-1.28.html"&gt;Rust 1.28&lt;/a&gt;. It made a significant change to how Rust binaries were built: it dropped &lt;code&gt;jemalloc&lt;/code&gt; as the default allocator.&lt;/p&gt;

&lt;p&gt;To make sure that this was the root cause of the regression, I re-enabled &lt;code&gt;jemalloc&lt;/code&gt; via the &lt;a href="https://crates.io/crates/jemallocator"&gt;jemallocator&lt;/a&gt; crate. Sure enough, this brought the time back down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kbi24sXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9kkq8xll17pexngmmkl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kbi24sXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9kkq8xll17pexngmmkl8.png" alt="Runtime back to normal in v7.4.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subsequently, I ran the whole "benchmark suite". I found a consistent speedup of up to 40% by switching from the system allocator to jemalloc (see results below). The recently released &lt;a href="https://github.com/sharkdp/fd/releases"&gt;&lt;code&gt;fd-7.4.0&lt;/code&gt;&lt;/a&gt; now re-enables jemalloc as the allocator for &lt;code&gt;fd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, I still don't have a good solution for automatically keeping track of performance regressions - but I would be very interested in your feedback and ideas.&lt;/p&gt;

&lt;h3&gt;Benchmark results&lt;/h3&gt;

&lt;p&gt;Simple pattern, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;252.5 ± 1.4&lt;/td&gt;
&lt;td&gt;250.6&lt;/td&gt;
&lt;td&gt;255.5&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;201.1 ± 2.4&lt;/td&gt;
&lt;td&gt;197.6&lt;/td&gt;
&lt;td&gt;207.0&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Simple pattern, hidden and ignored files, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;748.4 ± 6.1&lt;/td&gt;
&lt;td&gt;739.9&lt;/td&gt;
&lt;td&gt;755.0&lt;/td&gt;
&lt;td&gt;1.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;526.5 ± 4.9&lt;/td&gt;
&lt;td&gt;520.2&lt;/td&gt;
&lt;td&gt;536.6&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;File extension search, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI -e jpg ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;758.4 ± 23.1&lt;/td&gt;
&lt;td&gt;745.7&lt;/td&gt;
&lt;td&gt;823.0&lt;/td&gt;
&lt;td&gt;1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI -e jpg ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;542.6 ± 2.7&lt;/td&gt;
&lt;td&gt;538.3&lt;/td&gt;
&lt;td&gt;546.1&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;File-type search, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI --type l ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;722.5 ± 3.9&lt;/td&gt;
&lt;td&gt;716.2&lt;/td&gt;
&lt;td&gt;729.5&lt;/td&gt;
&lt;td&gt;1.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI --type l ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;526.1 ± 6.8&lt;/td&gt;
&lt;td&gt;517.6&lt;/td&gt;
&lt;td&gt;539.1&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Simple pattern, cold cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [s]&lt;/th&gt;
&lt;th&gt;Min [s]&lt;/th&gt;
&lt;th&gt;Max [s]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.728 ± 0.005&lt;/td&gt;
&lt;td&gt;5.723&lt;/td&gt;
&lt;td&gt;5.733&lt;/td&gt;
&lt;td&gt;1.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.532 ± 0.009&lt;/td&gt;
&lt;td&gt;5.521&lt;/td&gt;
&lt;td&gt;5.539&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;small&gt;¹ For example, I need to close Dropbox and Spotify before running &lt;code&gt;fd&lt;/code&gt; benchmarks as they have a significant influence on the runtime.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;² As stated in the beginning, I don't have a good way to automatically track this. So it took me some time to spot this regression :-(&lt;/small&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>performance</category>
    </item>
    <item>
      <title>The difference between "binary" and "text" files</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 30 Dec 2018 15:34:04 +0000</pubDate>
      <link>https://dev.to/sharkdp/what-is-a-binary-file-2cf5</link>
      <guid>https://dev.to/sharkdp/what-is-a-binary-file-2cf5</guid>
      <description>&lt;p&gt;This article explores the topic of "binary" and "text" files. What is the difference between the two (if any)? Is there a clear definition for what constitutes a "binary" or a "text" file?&lt;/p&gt;

&lt;p&gt;We start our journey with two candidate files whose content we would intuitively categorize as "text" and "binary" data, respectively:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 bash
echo "hello 🌍" &amp;gt; message
convert -size 1x1 xc:white png:white


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We have created two files: A file named &lt;code&gt;message&lt;/code&gt; with the textual content &lt;em&gt;"hello 🌍"&lt;/em&gt; (including the Unicode symbol &lt;a href="https://unicode-table.com/en/1F30D/" rel="noopener noreferrer"&gt;"Earth Globe Europe-Africa"&lt;/a&gt;) and a PNG image with a single white pixel called &lt;code&gt;white&lt;/code&gt;. File extensions are deliberately left out.&lt;/p&gt;

&lt;p&gt;To demonstrate that some programs distinguish between "text" and "binary" files, check out how &lt;code&gt;grep&lt;/code&gt; changes its behavior:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ grep -R hello            
message:hello 🌍

▶ grep -R PNG
Binary file white matches


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;diff&lt;/code&gt; does something similar:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ echo "hello world" &amp;gt; other-message
▶ diff other-message message 
1c1
&amp;lt; hello world
---
&amp;gt; hello 🌍

▶ convert -size 1x1 xc:black png:black
▶ diff black white
Binary files black and white differ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;How do these programs distinguish between "text" and "binary" files?&lt;/p&gt;

&lt;p&gt;Before we answer this question, let us first try to come up with a definition. Clearly, on a fundamental file-system level, every file is just a collection of bytes and could therefore be viewed as binary data. On the other hand, a distinction between "text" and "non-text" (hereafter: "binary") data seems helpful for programs like &lt;code&gt;grep&lt;/code&gt; or &lt;code&gt;diff&lt;/code&gt;, if only not to mess up the output of your terminal emulator.&lt;/p&gt;

&lt;p&gt;So maybe we can start by defining "text" data. It seems reasonable to begin with an abstract notion of text as being a sequence of &lt;a href="https://en.wikipedia.org/wiki/Unicode" rel="noopener noreferrer"&gt;Unicode code points&lt;/a&gt;. Examples of code points are characters like &lt;code&gt;k&lt;/code&gt;, &lt;code&gt;ä&lt;/code&gt; or &lt;code&gt;א&lt;/code&gt;, as well as special symbols like &lt;code&gt;%&lt;/code&gt;, &lt;code&gt;☢&lt;/code&gt; or &lt;code&gt;🙈&lt;/code&gt;. To store a given text as a sequence of bytes, we need to choose an &lt;em&gt;encoding&lt;/em&gt;. If we want to be able to represent the whole Unicode range, we typically choose UTF-8, sometimes UTF-16 or UTF-32. Historically, encodings which support just a part of today's Unicode range are also important. The most prominent ones are US-ASCII and Latin1 (ISO 8859-1), but there are many more. And all of these look different on a byte level.&lt;/p&gt;

&lt;p&gt;Given just the contents of a file (not the history on how it was created), we can therefore try the following definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A file is called "text file" if its content consists of an encoded sequence of Unicode code points.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are two practical problems with this definition. First, we would need a list of &lt;em&gt;all possible&lt;/em&gt; encodings. Second, in order to test whether the contents of a file are valid in a given encoding, we would have to decode the &lt;em&gt;whole&lt;/em&gt; contents of the file and see if that succeeds¹. The whole process would be really slow.&lt;/p&gt;
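&lt;p&gt;To make the cost of that second step concrete, here is what "decode the whole contents and see if it succeeds" looks like for a single candidate encoding, sketched with GNU &lt;code&gt;iconv&lt;/code&gt; (the file names and contents are made up for the example):&lt;/p&gt;

```shell
# Validity check for one candidate encoding: attempt a full decode.
printf 'hello \360\237\214\215\n' > message   # "hello 🌍", UTF-8 encoded
printf '\377\377\377\n' > junk                # not valid UTF-8

# iconv exits with a non-zero status if the input does not decode.
if iconv -f UTF-8 -t UTF-32 message > /dev/null 2> /dev/null; then
  echo "message: valid UTF-8"
fi
if iconv -f UTF-8 -t UTF-32 junk > /dev/null 2> /dev/null; then
  echo "junk: valid UTF-8"
else
  echo "junk: not valid UTF-8"
fi
```

&lt;p&gt;Repeating such a full decode for every known encoding, on every file, is what makes the exhaustive approach slow.&lt;/p&gt;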

&lt;p&gt;It turns out that there is a much faster way to distinguish between text and binary files, but it comes at the cost of precision.&lt;/p&gt;

&lt;p&gt;To see how this works, let's go back to our two candidate files and explore their byte level content. I am using &lt;a href="https://github.com/sharkdp/hexyl" rel="noopener noreferrer"&gt;&lt;code&gt;hexyl&lt;/code&gt;&lt;/a&gt; as a hex viewer, but you can also use &lt;code&gt;hexdump -C&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1ycdbc17j64sppywnw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1ycdbc17j64sppywnw7.png" alt="Binary content of 'message' and 'white'"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that both files contain bytes within and outside of the ASCII range (&lt;code&gt;00&lt;/code&gt;…&lt;code&gt;7f&lt;/code&gt;). The four bytes &lt;code&gt;f0 9f 8c 8d&lt;/code&gt; in the &lt;code&gt;message&lt;/code&gt; file, for example, are the UTF-8 encoded version of the Unicode code point &lt;code&gt;U+1F30D&lt;/code&gt; (🌍). On the other hand, the bytes &lt;code&gt;50 4e 47&lt;/code&gt; at the beginning of the &lt;code&gt;white&lt;/code&gt; image are a simple ASCII-encoded version of the characters &lt;code&gt;PNG&lt;/code&gt;².&lt;/p&gt;

&lt;p&gt;So clearly, looking at bytes outside the ASCII range cannot be used as a method to detect "binary" files. However, there &lt;em&gt;is&lt;/em&gt; a difference between the two files. The image file contains a lot of NULL bytes (&lt;code&gt;00&lt;/code&gt;) while the short text message does not. It turns out that this can be turned into a simple &lt;em&gt;heuristic&lt;/em&gt; method to detect binary files, since most encoded text data does not contain any NULL bytes (even though they would be perfectly legal).&lt;/p&gt;

&lt;p&gt;In fact, this is exactly what &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; use to detect "binary" files. The following macro is included in &lt;a href="https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/src/io.c#L85-L88" rel="noopener noreferrer"&gt;&lt;code&gt;diff&lt;/code&gt;'s source code (&lt;code&gt;src/io.c&lt;/code&gt;)&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, the &lt;a href="https://linux.die.net/man/3/memchr" rel="noopener noreferrer"&gt;&lt;code&gt;memchr(const void *s, int c, size_t n)&lt;/code&gt;&lt;/a&gt; function is used to search the initial &lt;code&gt;size&lt;/code&gt; bytes of the memory region starting at &lt;code&gt;buf&lt;/code&gt; for the character &lt;code&gt;0&lt;/code&gt;. To speed this process up even more, typically only the first few bytes of the file are read into the buffer &lt;code&gt;buf&lt;/code&gt; (e.g. 1024 bytes). To summarize, &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;diff&lt;/code&gt; use the following heuristic approach:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A file is very likely to be a "text file" if the first 1024 bytes of its content do not contain any NULL bytes.&lt;/p&gt;
&lt;/blockquote&gt;
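&lt;p&gt;The heuristic is easy to replicate with standard tools. A minimal sketch as a shell function - using &lt;code&gt;od&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; rather than &lt;code&gt;memchr&lt;/code&gt;, with made-up file names:&lt;/p&gt;

```shell
# Classify a file as "text" or "binary": a NUL byte among the first
# 1024 bytes means "binary" (same heuristic as diff/grep above).
is_text() {
  if head -c 1024 "$1" | od -An -tx1 | grep -q ' 00'; then
    echo binary
  else
    echo text
  fi
}

printf 'hello world\n' > msg
printf 'PNG\000\000pixels' > blob
is_text msg     # prints "text"
is_text blob    # prints "binary"
```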

&lt;p&gt;Note that there are counterexamples where this fails. For example, even if unlikely, UTF-8-encoded text can legally contain NULL bytes. Conversely, some particular binary formats (like binary &lt;a href="https://en.wikipedia.org/wiki/Netpbm_format" rel="noopener noreferrer"&gt;PGM&lt;/a&gt;) do not contain NULL bytes. This method will also typically classify UTF-16 and UTF-32 encoded text as "binary", as they encode common Latin-1 code points with NULL bytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ iconv -f UTF-8 -t UTF-16 message &amp;gt; message-utf16
▶ hexdump -C message-utf16 
00000000  ff fe 68 00 65 00 6c 00  6c 00 6f 00 20 00 3c d8  |..h.e.l.l.o. .&amp;lt;.|
00000010  0d df 0a 00                                       |....|
00000014
▶ grep . message-utf16                            
Binary file message-utf16 matches


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nevertheless, this heuristic approach is very useful. I have written a &lt;a href="https://github.com/sharkdp/content_inspector" rel="noopener noreferrer"&gt;small library&lt;/a&gt; in Rust which uses a slightly refined version of this method to quickly determine whether a given file contains "binary" or "text" data. It is used in my program &lt;a href="https://github.com/sharkdp/bat" rel="noopener noreferrer"&gt;&lt;code&gt;bat&lt;/code&gt;&lt;/a&gt; to prevent "binary" files from being dumped to the terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6rhjy0tljzbns18fx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6rhjy0tljzbns18fx8.png" alt="bat, detecting binary files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Footnotes&lt;/h4&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
¹ Note that there are some encodings that write so-called &lt;a href="https://en.wikipedia.org/wiki/Byte_order_mark" rel="noopener noreferrer"&gt;byte order marks&lt;/a&gt; (BOM) at the beginning of a file to indicate the type of encoding. For example, the little-endian variant of UTF-32 uses &lt;code&gt;ff fe 00 00&lt;/code&gt;. These BOMs would help with the second point because we would not need to decode the &lt;em&gt;whole&lt;/em&gt; content of the file. Unfortunately, adding BOMs is optional and a lot of encodings do not specify one.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
² &lt;code&gt;50 4e 47&lt;/code&gt; is part of the &lt;a href="https://en.wikipedia.org/wiki/List_of_file_signatures" rel="noopener noreferrer"&gt;magic number&lt;/a&gt; of the PNG format. Magic numbers are similar to BOMs and a lot of binary formats use magic numbers at the beginning of the file to signal their type. Using magic numbers to detect certain types of "binary" files is a method that is used by the &lt;code&gt;file&lt;/code&gt; tool.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

</description>
      <category>binary</category>
      <category>text</category>
      <category>encoding</category>
      <category>unix</category>
    </item>
    <item>
      <title>My release checklist for Rust programs</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 28 Oct 2018 14:18:12 +0000</pubDate>
      <link>https://dev.to/sharkdp/my-release-checklist-for-rust-programs-1m33</link>
      <guid>https://dev.to/sharkdp/my-release-checklist-for-rust-programs-1m33</guid>
<description>&lt;p&gt;Releasing new versions of your projects is one of the more laborious tasks of an open source maintainer. There are many great tools that automate part of this process, but typically there are still a lot of manual steps involved. In addition, there are lots of things that can go wrong: new bugs might have been introduced, dependency updates can have unintended effects, or the automatic deployment might not work anymore.&lt;/p&gt;

&lt;p&gt;After some practice with three of my Rust projects (&lt;a href="https://github.com/sharkdp/fd"&gt;fd&lt;/a&gt;, &lt;a href="https://github.com/sharkdp/hyperfine"&gt;hyperfine&lt;/a&gt; and &lt;a href="https://github.com/sharkdp/bat"&gt;bat&lt;/a&gt;), my workflow has converged to something that works quite well and avoids many pitfalls that I have walked into in the past. My hope in writing this post is that this process can be useful for others as well.&lt;/p&gt;

&lt;p&gt;The following is my release checklist for &lt;a href="https://github.com/sharkdp/fd"&gt;fd&lt;/a&gt;, but I have very similar lists for other projects. It is important to take the steps in the given order.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Check and update &lt;strong&gt;dependencies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Use &lt;a href="https://github.com/kbknapp/cargo-outdated"&gt;&lt;code&gt;cargo outdated&lt;/code&gt;&lt;/a&gt; to check for outdated dependencies. &lt;a href="https://deps.rs/repo/github/sharkdp/fd"&gt;deps.rs&lt;/a&gt; can also be used to get the same information.&lt;br&gt;
b) Run &lt;code&gt;cargo update&lt;/code&gt; to update dependencies to the latest compatible (minor) version.&lt;br&gt;
c) If possible and useful, manually update to new major versions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As for updates to new major versions, take a look at the upstream changes and carefully evaluate if an update is necessary (now).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get the &lt;strong&gt;list of updates&lt;/strong&gt; since the last release.&lt;/p&gt;

&lt;p&gt;Go to GitHub -&amp;gt; Releases -&amp;gt; "&lt;em&gt;XX&lt;/em&gt; commits to master since this release" to get an overview of all changes since the last release.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: &lt;a href="https://github.com/sharkdp/fd/compare/v7.1.0...master"&gt;fd/compare/v7.1.0...master&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Update the &lt;strong&gt;documentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Review and update the &lt;code&gt;-h&lt;/code&gt; and &lt;code&gt;--help&lt;/code&gt; text.&lt;br&gt;
b) Update the README (program usage, document new features, update minimum required Rust version)&lt;br&gt;
c) Update the man page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install the latest &lt;code&gt;master&lt;/code&gt; locally and &lt;strong&gt;test new features&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Run &lt;code&gt;cargo install -f&lt;/code&gt;.&lt;br&gt;
b) Test the new features manually.&lt;br&gt;
c) Run benchmarks to avoid performance regressions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In an ideal world, we have written tests for all of the new code. These tests also run in our CI pipeline, so there is nothing to worry about, right? In my experience, there are always things that need to be reviewed manually. This is especially true for CLI tools that are more difficult to test due to their intricate dependencies on the interactive terminal environment.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clean up&lt;/strong&gt; the code base.&lt;/p&gt;

&lt;p&gt;a) Run &lt;code&gt;cargo clippy&lt;/code&gt; and review the suggested changes [Optional]&lt;br&gt;
b) Run &lt;code&gt;cargo fmt&lt;/code&gt; to auto-format your code.&lt;br&gt;
c) Run &lt;code&gt;cargo test&lt;/code&gt; to make sure that all tests still pass.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The last two steps are typically automated in my CI pipeline. They are included here for completeness.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bump &lt;strong&gt;version&lt;/strong&gt; information.&lt;/p&gt;

&lt;p&gt;a) Update the project version in &lt;code&gt;Cargo.toml&lt;/code&gt;&lt;br&gt;
b) Run &lt;code&gt;cargo build&lt;/code&gt; to update &lt;code&gt;Cargo.lock&lt;/code&gt;&lt;br&gt;
c) Search the whole repository for the old version and update as required (README, install instructions, build scripts, ..)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Forgetting to also update &lt;code&gt;Cargo.lock&lt;/code&gt; has prevented me from successfully publishing to &lt;a href="https://crates.io/"&gt;crates.io&lt;/a&gt; in the past.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dry run&lt;/strong&gt; for &lt;code&gt;cargo publish&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cargo publish --dry-run --allow-dirty&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Running &lt;code&gt;cargo publish&lt;/code&gt; is one of the last steps in the release process. Using the dry-run functionality at this stage can avoid later surprises.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Commit, push, and wait for CI to succeed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git push&lt;/code&gt; all the updates from the last steps and &lt;strong&gt;wait until CI has passed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I used to immediately tag my "version update" commit to start the automated deployment. Having this intermediate "wait for CI" step has definitely prevented some failed releases.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write &lt;strong&gt;release notes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While waiting for CI to finish, I start writing the &lt;a href="https://github.com/sharkdp/fd/releases"&gt;release notes&lt;/a&gt;. I go through the list of updates and categorize changes into "Feature", "Change", "Bugfix" or "Other". I typically include links to the relevant GitHub issues and try to credit the original authors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tag the latest commit and &lt;strong&gt;start deployment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git tag vX.Y.Z; git push --tags&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This assumes that the CI pipeline has been set up to take care of the actual deployment (uploading binaries to GitHub).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the &lt;strong&gt;release&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Create the actual release on GitHub and copy over the release notes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify&lt;/strong&gt; the deployment.&lt;/p&gt;

&lt;p&gt;Make sure that all binaries have been uploaded. Manually test the binaries, if possible.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Publish&lt;/strong&gt; to &lt;a href="https://crates.io/"&gt;crates.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Make sure that your repository is clean or clone a fresh copy of the repository. Then run&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cargo publish&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Do this after the &lt;code&gt;git push --tags&lt;/code&gt; step. A git tag can be deleted if something goes wrong with the &lt;code&gt;cargo publish&lt;/code&gt; call, but &lt;code&gt;cargo publish&lt;/code&gt; cannot be undone if the deployment via &lt;code&gt;git push --tags&lt;/code&gt; fails.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Notify &lt;strong&gt;package maintainers&lt;/strong&gt; about the update.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Arch Linux, for example, lets users &lt;a href="https://www.archlinux.org/packages/community/x86_64/fd/"&gt;flag packages as being "out of date"&lt;/a&gt;. Include the link to your release notes and highlight changes that are relevant for package maintainers (new files that need to be installed, new dependencies, changes in the build process).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
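&lt;p&gt;Step 6c above ("search the whole repository for the old version") is straightforward to script. A minimal sketch with GNU &lt;code&gt;grep&lt;/code&gt; - the version number and file contents below are made up for illustration:&lt;/p&gt;

```shell
OLD='7.1.0'                 # hypothetical previous version
demo=$(mktemp -d)
cd "$demo"
mkdir .git                  # stand-in for repo metadata, to show the exclude
printf 'version = "7.1.0"\n' > Cargo.toml
printf 'Install fd 7.1.0 from the releases page.\n' > README.md
printf 'fn main() {}\n' > main.rs

# List the files that still mention the old version:
grep -r -l --exclude-dir=.git -F "$OLD" .
```

&lt;p&gt;Each listed file can then be updated by hand (or with &lt;code&gt;sed&lt;/code&gt;) before committing the version bump.&lt;/p&gt;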

&lt;p&gt;Do you maintain similar release-checklists? If so, I'd love to hear about things you do differently or steps I might have missed.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>release</category>
      <category>deployment</category>
      <category>checklist</category>
    </item>
  </channel>
</rss>
