<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Peter</title>
    <description>The latest articles on DEV Community by David Peter (@sharkdp).</description>
    <link>https://dev.to/sharkdp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F95033%2Fa9643c1c-3e0f-423b-8139-f0b193f75139.jpeg</url>
      <title>DEV Community: David Peter</title>
      <link>https://dev.to/sharkdp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharkdp"/>
    <language>en</language>
    <item>
      <title>Hacktoberfest 2020 - a retrospective</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 01 Nov 2020 09:58:00 +0000</pubDate>
      <link>https://dev.to/sharkdp/retrospective-hacktoberfest-2020-k91</link>
      <guid>https://dev.to/sharkdp/retrospective-hacktoberfest-2020-k91</guid>
<description>&lt;p&gt;This year's &lt;a href="https://hacktoberfest.digitalocean.com/" rel="noopener noreferrer"&gt;Hacktoberfest&lt;/a&gt; didn't have a great start. There was a lot of understandable &lt;a href="https://news.ycombinator.com/item?id=24643894" rel="noopener noreferrer"&gt;controversy&lt;/a&gt; around the fact that some people abused the system by sending spam pull requests in order to get a free shirt. Eventually, this led to some important &lt;a href="https://hacktoberfest.digitalocean.com/hacktoberfest-update" rel="noopener noreferrer"&gt;policy changes&lt;/a&gt;, one of which was to make Hacktoberfest opt-in for maintainers.&lt;/p&gt;

&lt;p&gt;While I think this is a good result overall, I also felt kind of sad because it put the whole event in such a bad light. Personally, I had great experiences with Hacktoberfest in the past years - both as a contributor and as a maintainer - which is why I decided to actively enable "Hacktoberfest" contributions on some of &lt;a href="https://github.com/sharkdp/" rel="noopener noreferrer"&gt;my projects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://github.com/sharkdp/bat" rel="noopener noreferrer"&gt;&lt;code&gt;bat&lt;/code&gt;&lt;/a&gt; specifically, I opened three tickets that included ideas and instructions for contributing &lt;a href="https://github.com/sharkdp/bat/issues/1211" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; &lt;a href="https://github.com/sharkdp/bat/issues/1213" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; &lt;a href="https://github.com/sharkdp/bat/issues/1216" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;. They were mainly targeted towards first-time contributors. In the following chart, you can see what kind of effect the opt-in strategy had on the number of contributions to the &lt;code&gt;bat&lt;/code&gt; repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27lnwjh1kdtda32mvni1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F27lnwjh1kdtda32mvni1.png" alt="number of pull requests per month"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I should note that the large majority of these are small contributions: new test cases or documentation updates. But that does not mean they are not helpful for the project - quite the contrary. And more importantly, I don't think that's even the point of Hacktoberfest.&lt;/p&gt;

&lt;p&gt;The really great thing is that it motivates people to get started with open source work (or to rekindle their engagement). For &lt;code&gt;bat&lt;/code&gt;, we received a lot of contributions from newcomers. It is fantastic to see the excitement when their PR is being merged. Most of them are also really grateful for review comments and very happy to push further updates.&lt;/p&gt;

&lt;p&gt;We did not receive a single spam contribution. Sure, there is always a small fraction of PRs that are going to be rejected (3 out of 129 for &lt;code&gt;bat&lt;/code&gt;). But that is not specific to Hacktoberfest. As a maintainer, the amount of work you put into an average first-time contribution is definitely a bit higher than usual. But we have all started as a beginner once. Personally, I can definitely still remember my first contributions to the open source world and the kind of excitement I felt.&lt;/p&gt;

&lt;p&gt;To summarize: I still think that Hacktoberfest is a great initiative and I am definitely looking forward to future events.&lt;/p&gt;

&lt;p&gt;Thank you to all contributors and to my co-maintainers &lt;a href="https://github.com/eth-p" rel="noopener noreferrer"&gt;eth-p&lt;/a&gt; and &lt;a href="https://github.com/keith-hall" rel="noopener noreferrer"&gt;keith-hall&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>opensource</category>
    </item>
    <item>
      <title>An unexpected performance regression</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Mon, 16 Sep 2019 19:16:58 +0000</pubDate>
      <link>https://dev.to/sharkdp/an-unexpected-performance-regression-11ai</link>
      <guid>https://dev.to/sharkdp/an-unexpected-performance-regression-11ai</guid>
<description>&lt;p&gt;Performance regressions are something that I find rather hard to track in an automated way. For the past few years, I have been working on &lt;a href="https://github.com/sharkdp/fd"&gt;a tool called &lt;code&gt;fd&lt;/code&gt;&lt;/a&gt;, which aims to be a &lt;em&gt;fast&lt;/em&gt; and user-friendly (but not necessarily feature-complete) alternative to &lt;a href="https://www.gnu.org/software/findutils/"&gt;&lt;code&gt;find&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As you would expect from a file-searching tool, &lt;code&gt;fd&lt;/code&gt; is an I/O-heavy program whose performance is governed by external factors like filesystem speed, caching effects, as well as OS-specific aspects. To get reliable and meaningful timing results, I developed a &lt;a href="https://github.com/sharkdp/hyperfine"&gt;command-line benchmarking tool called &lt;code&gt;hyperfine&lt;/code&gt;&lt;/a&gt; which takes care of things like warmup runs (for hot-cache benchmarks) or cache-clearing preparation commands (for cold-cache benchmarks). It also performs an analysis across multiple runs and warns the user about outside interference by detecting statistical outliers¹.&lt;/p&gt;

&lt;p&gt;But this is just a small part of the problem. The real challenge is to find a suitable collection of benchmarks that tests different aspects of your program across a wide range of environments. To get a feeling for the vast number of factors that can influence the runtime of a program like &lt;code&gt;fd&lt;/code&gt;, let me tell you about one particular performance regression that I found recently².&lt;/p&gt;

&lt;p&gt;I keep a small collection of old &lt;code&gt;fd&lt;/code&gt; executables around in order to quickly run specific benchmarks across different versions. I noticed a significant performance regression between &lt;code&gt;fd-7.0.0&lt;/code&gt; and &lt;code&gt;fd-7.1.0&lt;/code&gt; in one of the benchmarks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8jPMW0L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ep14199yxpaswdcttop2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8jPMW0L4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ep14199yxpaswdcttop2.png" alt="performance regression between 7.0 and 7.1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I quickly looked at the commits between 7.0 and 7.1 to see if there were any changes that could have introduced this regression. I couldn't find any obvious candidates.&lt;/p&gt;

&lt;p&gt;Next, I decided to perform a small binary search by re-compiling specific commits and running the benchmark. To my surprise, I wasn't able to reproduce the fast times that I had measured with the precompiled binaries of the old versions. Every single commit yielded slow results!&lt;/p&gt;
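&lt;p&gt;As an aside, this kind of binary search over commits can be automated with &lt;code&gt;git bisect run&lt;/code&gt;. Here is a self-contained toy demonstration - the repository, the tags, and the &lt;code&gt;grep&lt;/code&gt;-based "benchmark" predicate are all fabricated for illustration; a real setup would compile each commit and compare a benchmark time against a threshold:&lt;/p&gt;

```shell
# Fabricated toy repository: the "regression" is a marker file
# flipping from "fast" to "slow" in the last commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo fast > speed; git add speed; git commit -qm "baseline"
git tag v7.0.0                     # last known-fast release (toy)
echo tweak > other; git add other; git commit -qm "unrelated change"
echo slow > speed; git commit -qam "introduces the regression"
git tag v7.1.0                     # first known-slow release (toy)

# Bisect between the two tags; the "benchmark" here is just a grep.
git bisect start v7.1.0 v7.0.0     # bad ref first, then good ref
git bisect run grep -q fast speed | tee bisect.log
git bisect reset
```

&lt;p&gt;In the real scenario, of course, this would not have helped: as it turned out, the slowdown was not introduced by any commit at all.&lt;/p&gt;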

&lt;p&gt;There was only one way this could have happened: the old binaries were faster because they were compiled with an &lt;em&gt;older version of the Rust compiler&lt;/em&gt;. The version that came out shortly before the &lt;code&gt;fd-7.1.0&lt;/code&gt; release was &lt;a href="https://blog.rust-lang.org/2018/08/02/Rust-1.28.html"&gt;Rust 1.28&lt;/a&gt;. It made a significant change to how Rust binaries were built: it dropped &lt;code&gt;jemalloc&lt;/code&gt; as the default allocator.&lt;/p&gt;

&lt;p&gt;To make sure that this was the root cause of the regression, I re-enabled &lt;code&gt;jemalloc&lt;/code&gt; via the &lt;a href="https://crates.io/crates/jemallocator"&gt;jemallocator&lt;/a&gt; crate. Sure enough, this brought the time back down:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kbi24sXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9kkq8xll17pexngmmkl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kbi24sXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9kkq8xll17pexngmmkl8.png" alt="Runtime back to normal in v7.4.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subsequently, I ran the whole "benchmark suite". I found a consistent speedup of up to 40% by switching from the system allocator to jemalloc (see results below). The recently released &lt;a href="https://github.com/sharkdp/fd/releases"&gt;&lt;code&gt;fd-7.4.0&lt;/code&gt;&lt;/a&gt; now re-enables jemalloc as the allocator for &lt;code&gt;fd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, I still don't have a good solution for automatically keeping track of performance regressions - but I would be very interested in your feedback and ideas.&lt;/p&gt;

&lt;h3&gt;Benchmark results&lt;/h3&gt;

&lt;p&gt;Simple pattern, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;252.5 ± 1.4&lt;/td&gt;
&lt;td&gt;250.6&lt;/td&gt;
&lt;td&gt;255.5&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;201.1 ± 2.4&lt;/td&gt;
&lt;td&gt;197.6&lt;/td&gt;
&lt;td&gt;207.0&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Simple pattern, hidden and ignored files, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;748.4 ± 6.1&lt;/td&gt;
&lt;td&gt;739.9&lt;/td&gt;
&lt;td&gt;755.0&lt;/td&gt;
&lt;td&gt;1.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;526.5 ± 4.9&lt;/td&gt;
&lt;td&gt;520.2&lt;/td&gt;
&lt;td&gt;536.6&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;File extension search, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI -e jpg ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;758.4 ± 23.1&lt;/td&gt;
&lt;td&gt;745.7&lt;/td&gt;
&lt;td&gt;823.0&lt;/td&gt;
&lt;td&gt;1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI -e jpg ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;542.6 ± 2.7&lt;/td&gt;
&lt;td&gt;538.3&lt;/td&gt;
&lt;td&gt;546.1&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;File-type search, warm cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [ms]&lt;/th&gt;
&lt;th&gt;Min [ms]&lt;/th&gt;
&lt;th&gt;Max [ms]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI --type l ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;722.5 ± 3.9&lt;/td&gt;
&lt;td&gt;716.2&lt;/td&gt;
&lt;td&gt;729.5&lt;/td&gt;
&lt;td&gt;1.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI --type l ''&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;526.1 ± 6.8&lt;/td&gt;
&lt;td&gt;517.6&lt;/td&gt;
&lt;td&gt;539.1&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Simple pattern, cold cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Mean [s]&lt;/th&gt;
&lt;th&gt;Min [s]&lt;/th&gt;
&lt;th&gt;Max [s]&lt;/th&gt;
&lt;th&gt;Relative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-sysalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.728 ± 0.005&lt;/td&gt;
&lt;td&gt;5.723&lt;/td&gt;
&lt;td&gt;5.733&lt;/td&gt;
&lt;td&gt;1.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fd-jemalloc -HI '.*[0-9]\.jpg$'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.532 ± 0.009&lt;/td&gt;
&lt;td&gt;5.521&lt;/td&gt;
&lt;td&gt;5.539&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;small&gt;¹ For example, I need to close Dropbox and Spotify before running &lt;code&gt;fd&lt;/code&gt; benchmarks as they have a significant influence on the runtime.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;² As stated in the beginning, I don't have a good way to automatically track this. So it took me some time to spot this regression :-(&lt;/small&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>performance</category>
    </item>
    <item>
      <title>The difference between "binary" and "text" files</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 30 Dec 2018 15:34:04 +0000</pubDate>
      <link>https://dev.to/sharkdp/what-is-a-binary-file-2cf5</link>
      <guid>https://dev.to/sharkdp/what-is-a-binary-file-2cf5</guid>
      <description>&lt;p&gt;This article explores the topic of "binary" and "text" files. What is the difference between the two (if any)? Is there a clear definition for what constitutes a "binary" or a "text" file?&lt;/p&gt;

&lt;p&gt;We start our journey with two candidate files whose content we would intuitively categorize as "text" and "binary" data, respectively:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 bash
echo "hello 🌍" &amp;gt; message
convert -size 1x1 xc:white png:white


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We have created two files: A file named &lt;code&gt;message&lt;/code&gt; with the textual content &lt;em&gt;"hello 🌍"&lt;/em&gt; (including the Unicode symbol &lt;a href="https://unicode-table.com/en/1F30D/" rel="noopener noreferrer"&gt;"Earth Globe Europe-Africa"&lt;/a&gt;) and a PNG image with a single white pixel called &lt;code&gt;white&lt;/code&gt;. File extensions are deliberately left out.&lt;/p&gt;

&lt;p&gt;To demonstrate that some programs distinguish between "text" and "binary" files, check out how &lt;code&gt;grep&lt;/code&gt; changes its behavior:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ grep -R hello            
message:hello 🌍

▶ grep -R PNG
Binary file white matches


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;diff&lt;/code&gt; does something similar:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ echo "hello world" &amp;gt; other-message
▶ diff other-message message 
1c1
&amp;lt; hello world
---
&amp;gt; hello 🌍

▶ convert -size 1x1 xc:black png:black
▶ diff black white
Binary files black and white differ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;How do these programs distinguish between "text" and "binary" files?&lt;/p&gt;

&lt;p&gt;Before we answer this question, let us first try to come up with a definition. Clearly, on a fundamental file-system level, every file is just a collection of bytes and could therefore be viewed as binary data. On the other hand, a distinction between "text" and "non-text" (hereafter: "binary") data seems helpful for programs like &lt;code&gt;grep&lt;/code&gt; or &lt;code&gt;diff&lt;/code&gt;, if only not to mess up the output of your terminal emulator.&lt;/p&gt;

&lt;p&gt;So maybe we can start by defining "text" data. It seems reasonable to begin with an abstract notion of text as being a sequence of &lt;a href="https://en.wikipedia.org/wiki/Unicode" rel="noopener noreferrer"&gt;Unicode code points&lt;/a&gt;. Examples of code points are characters like &lt;code&gt;k&lt;/code&gt;, &lt;code&gt;ä&lt;/code&gt; or &lt;code&gt;א&lt;/code&gt;, as well as special symbols like &lt;code&gt;%&lt;/code&gt;, &lt;code&gt;☢&lt;/code&gt; or &lt;code&gt;🙈&lt;/code&gt;. To store a given text as a sequence of bytes, we need to choose an &lt;em&gt;encoding&lt;/em&gt;. If we want to be able to represent the whole Unicode range, we typically choose UTF-8, sometimes UTF-16 or UTF-32. Historically, encodings which support just a part of today's Unicode range are also important. The most prominent ones are US-ASCII and Latin1 (ISO 8859-1), but there are many more. And all of these look different on a byte level.&lt;/p&gt;

&lt;p&gt;Given just the contents of a file (not the history on how it was created), we can therefore try the following definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A file is called "text file" if its content consists of an encoded sequence of Unicode code points.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are two practical problems with this definition. First, we would need a list of &lt;em&gt;all possible&lt;/em&gt; encodings. Second, in order to test whether the contents of a file are valid in a given encoding, we would have to decode the &lt;em&gt;whole&lt;/em&gt; contents of the file and see if that succeeds¹. The whole process would be really slow.&lt;/p&gt;
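&lt;p&gt;To make the cost of that second step concrete, here is what "decode the whole contents and see if it succeeds" looks like for a single candidate encoding, sketched with GNU &lt;code&gt;iconv&lt;/code&gt; (the file names and contents are made up for the example):&lt;/p&gt;

```shell
# Validity check for one candidate encoding: attempt a full decode.
printf 'hello \360\237\214\215\n' > message   # "hello 🌍", UTF-8 encoded
printf '\377\377\377\n' > junk                # not valid UTF-8

# iconv exits with a non-zero status if the input does not decode.
if iconv -f UTF-8 -t UTF-32 message > /dev/null 2> /dev/null; then
  echo "message: valid UTF-8"
fi
if iconv -f UTF-8 -t UTF-32 junk > /dev/null 2> /dev/null; then
  echo "junk: valid UTF-8"
else
  echo "junk: not valid UTF-8"
fi
```

&lt;p&gt;Repeating such a full decode for every known encoding, on every file, is what makes the exhaustive approach slow.&lt;/p&gt;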

&lt;p&gt;It turns out that there is a much faster way to distinguish between text and binary files, but it comes at the cost of precision.&lt;/p&gt;

&lt;p&gt;To see how this works, let's go back to our two candidate files and explore their byte level content. I am using &lt;a href="https://github.com/sharkdp/hexyl" rel="noopener noreferrer"&gt;&lt;code&gt;hexyl&lt;/code&gt;&lt;/a&gt; as a hex viewer, but you can also use &lt;code&gt;hexdump -C&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1ycdbc17j64sppywnw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1ycdbc17j64sppywnw7.png" alt="Binary content of 'message' and 'white'"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that both files contain bytes within and outside of the ASCII range (&lt;code&gt;00&lt;/code&gt;…&lt;code&gt;7f&lt;/code&gt;). The four bytes &lt;code&gt;f0 9f 8c 8d&lt;/code&gt; in the &lt;code&gt;message&lt;/code&gt; file, for example, are the UTF-8 encoded version of the Unicode code point &lt;code&gt;U+1F30D&lt;/code&gt; (🌍). On the other hand, the bytes &lt;code&gt;50 4e 47&lt;/code&gt; at the beginning of the &lt;code&gt;white&lt;/code&gt; image are a simple ASCII-encoded version of the characters &lt;code&gt;PNG&lt;/code&gt;².&lt;/p&gt;

&lt;p&gt;So clearly, looking at bytes outside the ASCII range cannot be used as a method to detect "binary" files. However, there &lt;em&gt;is&lt;/em&gt; a difference between the two files. The image file contains a lot of NULL bytes (&lt;code&gt;00&lt;/code&gt;) while the short text message does not. It turns out that this can be turned into a simple &lt;em&gt;heuristic&lt;/em&gt; method to detect binary files, since most encoded text data does not contain any NULL bytes (even though they would be perfectly legal).&lt;/p&gt;

&lt;p&gt;In fact, this is exactly what &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; use to detect "binary" files. The following macro is included in &lt;a href="https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/src/io.c#L85-L88" rel="noopener noreferrer"&gt;&lt;code&gt;diff&lt;/code&gt;'s source code (&lt;code&gt;src/io.c&lt;/code&gt;)&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, the &lt;a href="https://linux.die.net/man/3/memchr" rel="noopener noreferrer"&gt;&lt;code&gt;memchr(const void *s, int c, size_t n)&lt;/code&gt;&lt;/a&gt; function is used to search the initial &lt;code&gt;size&lt;/code&gt; bytes of the memory region starting at &lt;code&gt;buf&lt;/code&gt; for the character &lt;code&gt;0&lt;/code&gt;. To speed this process up even more, typically only the first few bytes of the file are read into the buffer &lt;code&gt;buf&lt;/code&gt; (e.g. 1024 bytes). To summarize, &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;diff&lt;/code&gt; use the following heuristic approach:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A file is very likely to be a "text file" if the first 1024 bytes of its content do not contain any NULL bytes.&lt;/p&gt;
&lt;/blockquote&gt;
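&lt;p&gt;The heuristic is easy to replicate with standard tools. A minimal sketch as a shell function - using &lt;code&gt;od&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; rather than &lt;code&gt;memchr&lt;/code&gt;, with made-up file names:&lt;/p&gt;

```shell
# Classify a file as "text" or "binary": a NUL byte among the first
# 1024 bytes means "binary" (same heuristic as diff/grep above).
is_text() {
  if head -c 1024 "$1" | od -An -tx1 | grep -q ' 00'; then
    echo binary
  else
    echo text
  fi
}

printf 'hello world\n' > msg
printf 'PNG\000\000pixels' > blob
is_text msg     # prints "text"
is_text blob    # prints "binary"
```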

&lt;p&gt;Note that there are counterexamples where this fails. For example, even if unlikely, UTF-8-encoded text can legally contain NULL bytes. Conversely, some particular binary formats (like binary &lt;a href="https://en.wikipedia.org/wiki/Netpbm_format" rel="noopener noreferrer"&gt;PGM&lt;/a&gt;) do not contain NULL bytes. This method will also typically classify UTF-16 and UTF-32 encoded text as "binary", as they encode common Latin-1 code points with NULL bytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

▶ iconv -f UTF-8 -t UTF-16 message &amp;gt; message-utf16
▶ hexdump -C message-utf16 
00000000  ff fe 68 00 65 00 6c 00  6c 00 6f 00 20 00 3c d8  |..h.e.l.l.o. .&amp;lt;.|
00000010  0d df 0a 00                                       |....|
00000014
▶ grep . message-utf16                            
Binary file message-utf16 matches


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nevertheless, this heuristic approach is very useful. I have written a &lt;a href="https://github.com/sharkdp/content_inspector" rel="noopener noreferrer"&gt;small library&lt;/a&gt; in Rust which uses a slightly refined version of this method to quickly determine whether a given file contains "binary" or "text" data. It is used in my program &lt;a href="https://github.com/sharkdp/bat" rel="noopener noreferrer"&gt;&lt;code&gt;bat&lt;/code&gt;&lt;/a&gt; to prevent "binary" files from being dumped to the terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6rhjy0tljzbns18fx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l6rhjy0tljzbns18fx8.png" alt="bat, detecting binary files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Footnotes&lt;/h4&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
¹ Note that there are some encodings that write so-called &lt;a href="https://en.wikipedia.org/wiki/Byte_order_mark" rel="noopener noreferrer"&gt;byte order marks&lt;/a&gt; (BOM) at the beginning of a file to indicate the type of encoding. For example, the little-endian variant of UTF-32 uses &lt;code&gt;ff fe 00 00&lt;/code&gt;. These BOMs would help with the second point because we would not need to decode the &lt;em&gt;whole&lt;/em&gt; content of the file. Unfortunately, adding BOMs is optional and a lot of encodings do not specify one.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
² &lt;code&gt;50 4e 47&lt;/code&gt; is part of the &lt;a href="https://en.wikipedia.org/wiki/List_of_file_signatures" rel="noopener noreferrer"&gt;magic number&lt;/a&gt; of the PNG format. Magic numbers are similar to BOMs and a lot of binary formats use magic numbers at the beginning of the file to signal their type. Using magic numbers to detect certain types of "binary" files is a method that is used by the &lt;code&gt;file&lt;/code&gt; tool.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

</description>
      <category>binary</category>
      <category>text</category>
      <category>encoding</category>
      <category>unix</category>
    </item>
    <item>
      <title>My release checklist for Rust programs</title>
      <dc:creator>David Peter</dc:creator>
      <pubDate>Sun, 28 Oct 2018 14:18:12 +0000</pubDate>
      <link>https://dev.to/sharkdp/my-release-checklist-for-rust-programs-1m33</link>
      <guid>https://dev.to/sharkdp/my-release-checklist-for-rust-programs-1m33</guid>
<description>&lt;p&gt;Releasing new versions of your projects is one of the more laborious tasks of an open source maintainer. There are many great tools that automate part of this process, but typically there are still a lot of manual steps involved. In addition, there are lots of things that can go wrong: new bugs might have been introduced, dependency updates can have unintended effects, or the automatic deployment might not work anymore.&lt;/p&gt;

&lt;p&gt;After some practice with three of my Rust projects (&lt;a href="https://github.com/sharkdp/fd"&gt;fd&lt;/a&gt;, &lt;a href="https://github.com/sharkdp/hyperfine"&gt;hyperfine&lt;/a&gt; and &lt;a href="https://github.com/sharkdp/bat"&gt;bat&lt;/a&gt;), my workflow has converged to something that works quite well and avoids many pitfalls that I have walked into in the past. My hope in writing this post is that this process can be useful for others as well.&lt;/p&gt;

&lt;p&gt;The following is my release checklist for &lt;a href="https://github.com/sharkdp/fd"&gt;fd&lt;/a&gt;, but I have very similar lists for other projects. It is important to take the steps in the given order.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Check and update &lt;strong&gt;dependencies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Use &lt;a href="https://github.com/kbknapp/cargo-outdated"&gt;&lt;code&gt;cargo outdated&lt;/code&gt;&lt;/a&gt; to check for outdated dependencies. &lt;a href="https://deps.rs/repo/github/sharkdp/fd"&gt;deps.rs&lt;/a&gt; can also be used to get the same information.&lt;br&gt;
b) Run &lt;code&gt;cargo update&lt;/code&gt; to update dependencies to the latest compatible (minor) version.&lt;br&gt;
c) If possible and useful, manually update to new major versions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As for updates to new major versions, take a look at the upstream changes and carefully evaluate if an update is necessary (now).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get the &lt;strong&gt;list of updates&lt;/strong&gt; since the last release.&lt;/p&gt;

&lt;p&gt;Go to GitHub -&amp;gt; Releases -&amp;gt; "&lt;em&gt;XX&lt;/em&gt; commits to master since this release" to get an overview of all changes since the last release.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: &lt;a href="https://github.com/sharkdp/fd/compare/v7.1.0...master"&gt;fd/compare/v7.1.0...master&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Update the &lt;strong&gt;documentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Review and update the &lt;code&gt;-h&lt;/code&gt; and &lt;code&gt;--help&lt;/code&gt; text.&lt;br&gt;
b) Update the README (program usage, document new features, update minimum required Rust version)&lt;br&gt;
c) Update the man page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install the latest &lt;code&gt;master&lt;/code&gt; locally and &lt;strong&gt;test new features&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;a) Run &lt;code&gt;cargo install -f&lt;/code&gt;.&lt;br&gt;
b) Test the new features manually.&lt;br&gt;
c) Run benchmarks to avoid performance regressions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In an ideal world, we have written tests for all of the new code. These tests also run in our CI pipeline, so there is nothing to worry about, right? In my experience, there are always things that need to be reviewed manually. This is especially true for CLI tools that are more difficult to test due to their intricate dependencies on the interactive terminal environment.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clean up&lt;/strong&gt; the code base.&lt;/p&gt;

&lt;p&gt;a) Run &lt;code&gt;cargo clippy&lt;/code&gt; and review the suggested changes [Optional]&lt;br&gt;
b) Run &lt;code&gt;cargo fmt&lt;/code&gt; to auto-format your code.&lt;br&gt;
c) Run &lt;code&gt;cargo test&lt;/code&gt; to make sure that all tests still pass.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The last two steps are typically automated in my CI pipeline. They are included here for completeness.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bump &lt;strong&gt;version&lt;/strong&gt; information.&lt;/p&gt;

&lt;p&gt;a) Update the project version in &lt;code&gt;Cargo.toml&lt;/code&gt;&lt;br&gt;
b) Run &lt;code&gt;cargo build&lt;/code&gt; to update &lt;code&gt;Cargo.lock&lt;/code&gt;&lt;br&gt;
c) Search the whole repository for the old version and update as required (README, install instructions, build scripts, ..)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Forgetting to also update &lt;code&gt;Cargo.lock&lt;/code&gt; has prevented me from successfully publishing to &lt;a href="https://crates.io/"&gt;crates.io&lt;/a&gt; in the past.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dry run&lt;/strong&gt; for &lt;code&gt;cargo publish&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cargo publish --dry-run --allow-dirty&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Running &lt;code&gt;cargo publish&lt;/code&gt; is one of the last steps in the release process. Using the dry-run functionality at this stage can avoid later surprises.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Commit, push, and wait for CI to succeed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git push&lt;/code&gt; all the updates from the last steps and &lt;strong&gt;wait until CI has passed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I used to immediately tag my "version update" commit to start the automated deployment. Having this intermediate "wait for CI" step has definitely prevented some failed releases.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write &lt;strong&gt;release notes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While waiting for CI to finish, I start writing the &lt;a href="https://github.com/sharkdp/fd/releases"&gt;release notes&lt;/a&gt;. I go through the list of updates and categorize changes into "Feature", "Change", "Bugfix" or "Other". I typically include links to the relevant GitHub issues and try to credit the original authors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tag the latest commit and &lt;strong&gt;start deployment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git tag vX.Y.Z; git push --tags&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This assumes that the CI pipeline has been set up to take care of the actual deployment (uploading binaries to GitHub).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the &lt;strong&gt;release&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Create the actual release on GitHub and copy over the release notes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify&lt;/strong&gt; the deployment.&lt;/p&gt;

&lt;p&gt;Make sure that all binaries have been uploaded. Manually test the binaries, if possible.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Publish&lt;/strong&gt; to &lt;a href="https://crates.io/"&gt;crates.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Make sure that your repository is clean or clone a fresh copy of the repository. Then run&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cargo publish&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Do this after the &lt;code&gt;git push --tags&lt;/code&gt; step. A git tag can be deleted if something goes wrong with the &lt;code&gt;cargo publish&lt;/code&gt; call, but &lt;code&gt;cargo publish&lt;/code&gt; cannot be undone if the deployment via &lt;code&gt;git push --tags&lt;/code&gt; fails.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Notify &lt;strong&gt;package maintainers&lt;/strong&gt; about the update.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Arch Linux, for example, lets users &lt;a href="https://www.archlinux.org/packages/community/x86_64/fd/"&gt;flag packages as being "out of date"&lt;/a&gt;. Include the link to your release notes and highlight changes that are relevant for package maintainers (new files that need to be installed, new dependencies, changes in the build process).&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
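&lt;p&gt;Step 6c above ("search the whole repository for the old version") is straightforward to script. A minimal sketch with GNU &lt;code&gt;grep&lt;/code&gt; - the version number and file contents below are made up for illustration:&lt;/p&gt;

```shell
OLD='7.1.0'                 # hypothetical previous version
demo=$(mktemp -d)
cd "$demo"
mkdir .git                  # stand-in for repo metadata, to show the exclude
printf 'version = "7.1.0"\n' > Cargo.toml
printf 'Install fd 7.1.0 from the releases page.\n' > README.md
printf 'fn main() {}\n' > main.rs

# List the files that still mention the old version:
grep -r -l --exclude-dir=.git -F "$OLD" .
```

&lt;p&gt;Each listed file can then be updated by hand (or with &lt;code&gt;sed&lt;/code&gt;) before committing the version bump.&lt;/p&gt;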

&lt;p&gt;Do you maintain similar release-checklists? If so, I'd love to hear about things you do differently or steps I might have missed.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>release</category>
      <category>deployment</category>
      <category>checklist</category>
    </item>
  </channel>
</rss>
