loading...

Migrating from quick-error to SNAFU: a story on revamped error handling in Rust

e_net4 profile image Eduardo Pinho Updated on ・11 min read

This post describes my more-than-a-month long story of refactoring existing error definitions and error handling code in my largest project written in Rust. DICOM-rs is a pure Rust implementation of the DICOM standard, the near-ubiquitous standard for representing and communicating medical imaging data. The refactor hereby described on this project was an immediately apparent improvement, enabling me to track down the cause of a recently discovered bug in one of its crates. Read on to learn more!

Note: Basic understanding of error handling in Rust is expected, as the post will be specifically about the thought process of refactoring all error handling from one library to another. Moreover, for a wider overview of modern error handling, I greatly recommend watching Jane Lusby's talk from RustConf 2020.

Why change this?

quick-error served its purpose when it was used in the DICOM-rs ecosystem, which was to facilitate the declaration of multi-variant error types, without the hassle of adding the necessary impl boilerplate of an error type.
It also made it easy (and as we'll see, too easy) to convert a source error into a dedicated variant of our own error type, so that such an error could be automatically converted when necessary.

A quick example (no pun intended):

quick_error! {
    pub enum Error {
        BadIo (e: std::io::Error) {
            display("I/O error: {}", e)
            from()
        }
    }
}

fn do_things() -> Result<(), Error> {
    // now if any function returns a standard I/O error,
    // `?` will just convert it to our `Error` type
    let data = read_my_file()?;                   // here
    std::fs::write("out.bin", processed(data))?;  // and here too
}

It turns out that this approach to error handling has a damaging side-effect: error values E in a Result<T, E> only hold the data that we put in them. They do not keep track of any places where the error was raised with the ? operator, unless we introduce that as part of the conversion into new error types. In this case however, the conversion made in this code introduces no additional information about the context where the operation failed, as I/O errors are always put into the same top-level error variant BadIo. The problem becomes evident when we run our do_things and the only error information we get is the following:

I/O Error: Permission denied

This could well be the display output of an Error::BadIo where the inner error has the kind PermissionDenied. Since this could happen either when reading from a file or when writing to "out.bin", we remain clueless about where the error truly happened.

This error provides poor informative value, which only gets worse as the number of layers in the program increases. The current top layer in DICOM-rs is dicom-object, then followed by two layers of abstraction in dicom-parser, then another one in dicom-encoding. dicom-core is at the bottom, and also has error types of its own. The way these errors were built led to very poor quality error outputs when errors occurred (e.g. PrematureEof could be the full error output when trying to read a file). Worse, since they did not contain a backtrace, the program would not give us the lines of code that led into the error, much unlike in a panic, where by default it would be provided, usually by setting the RUST_BACKTRACE environment variable.

It was clear enough for me that the error types in DICOM-rs
should be more informative, and should contain a backtrace if desired.

What else then?

The world of error handling in Rust has seen substantial
changes. In its early stages, quick-error and error-chain were commonly employed to create errors with ease. error-chain in particular was known for its emphasis on chaining error causes, pretty much as the name implies. failure emerged with a new error interface that would allow users
to cast down the cause of an error, something which was missing in the standard definition. Eventually, std::error::Error was in fact updated to accommodate this major benefit from failure.

In recent times, error handling libraries aim to provide utilities for:

  • declaring structural error types with automatic trait implementations, (usually via procedural macros, which were unstable back in 2016);
  • easier fact checking and/or error throwing through the program stack;
  • presenting errors to the user in a human readable fashion;
  • adding compatibility layers to the generation of backtraces, even with a stable Rust toolchain;
  • an alternative error handling syntax.

Some libraries are intended to be used exclusively, whereas others can complement with existing code (e.g. fehler). The ones most commonly heard about these days are thiserror, anyhow, eyre, and snafu.

Why SNAFU?

thiserror provides similar benefits to quick-error, with the plus that it uses nicely integrated procedural macros
instead of rule-based macros. Those who have played with them a lot will often find diagnostics to be substantially better in the former. It also means that we do not need to increase the compiler's recursion limit whenever our error type increases in size, a consequence of the quick_error! macro being defined in a recursive fashion. On the other hand,
there isn't much else about it, so I could easily be taking the risk of simply rewriting errors without gaining the intended enhancements.

anyhow and eyre are mostly designed for applications,
as they facilitate the aggregation of multiple error types into a single source of information about the error. This error can be expanded across the program stack via the extended context operator, hence gaining additional information. They deliberately employ a "kitchen-sink" error type, which is more complicated to take part in code flow if necessary. When it comes to libraries, structured error types are usually preferred. Using one of these in CLI tools under the domain of DICOM-rs (such as dcmdump) is not out of the question, but is currently left as a backlog task.

And then there's SNAFU, which comes with:

  • an opinionated procedural macro for declaring error types and variants;
  • compile-time generated constructs for context building and automatic backtrace generation;
  • support for both standard (unstable) and stable (crate-based) backtraces;
  • and a well established philosophy on how error types should be designed.

As a project which may grow in unexpected ways, I wanted a library which could also adapt to that scale. Although opinionated, SNAFU incentivizes all the way towards well structured and useful errors.

Things to keep in mind

Next is a set of topics which will hopefully transmit the experience that I gathered while doing this rewrite,
and may help bring new ideas into your own error handling strategy.

Follow the guidelines

This one is the most obvious. To make error types with SNAFU, one will have a better experience if you follow the advice given in the documentation.

One of the most relevant topics is to prefer module-level error types over crate-level error types. If your code is well organised in its module hierarchy, then it is also likely that those modules will face a different set of errors. For example, the dicom-encoding crate is focused on decoding and encoding primitive DICOM data structures from/into a medium (often a file or the network).
In either case, a std::io::Error could be the root cause of the error, but the circumstances of reading and writing are distinct and may often have their own kinds of errors (e.g. invariant checking when reading data headers). By creating an error type for decoding and another one for encoding, errors will be more focused and only need to be as complex as they need to be.

Go top-down, not bottom-up

When rewriting existing code, it is much easier to start with errors which emerge at a higher abstraction level. Every change in a crate's error types needs to be propagated into its dependents. If the same dependents are not structured in a way as to be prepared for this change, the amount of code that needs to be rewritten at the top will increase.

At some point during this refactor, I had something like the error type below at the DICOM object file reading module level. This is a high level abstraction over DICOM data sets, so as to be perceived as a dictionary of attributes.

#[derive(Debug, Snafu)]
pub enum Error {
    // ...
    #[snafu(display("Could not read data set token"))]
    ReadToken {
        source: dicom_parser::error::Error,
    },
    // ...
}

dicom_parser::error::Error is the crate-level error for the mid-level parser and printer of DICOM data, yet to be refactored. dicom-parser actually gathers two levels of abstraction here, something which definitely does not favour the decision to use a single error type for all modules in the crate.

Once we follow the advice of keeping one or more errors per module, we'll have more specific errors and the code above will fail. However, if the variants were well defined, the error to choose for the variant ReadToken will be quite intuitive, requiring no further changes other than the source type.

#[derive(Debug, Snafu)]
pub enum Error {
    // ...
    #[snafu(display("Could not read data set token"))]
    ReadToken {
        #[snafu(backtrace)]
        source: dicom_parser::dataset::read::Error,
    },
    // ...
}

I had tried to convert dicom-core to use SNAFU at least twice until I admitted that this wasn't the right way. It would blow up most of the other crates in the project and require me to either change the entire thing at once or make changes that would eventually be rewritten anyway.

Work with the error type selectors, not against them

Some parts of the documentation may be easy to miss out, because some of the methods do not exist in a single struct or trait. Rather, error type selectors are procedurally built, and hidden behind the Snafu macro.

As for a concrete example, if we have this:

#[derive(Debug, Snafu)]
pub enum Error {
    #[snafu(display(
        "Undefined value length of element tagged {} at position {}",
        tag,
        position
    ))]
    UndefinedValueLength {
        tag: Tag,
        position: u64,
        backtrace: Backtrace,
    }
}

One might think that we need to write this to build the error:

header
    .length()  // custom length type
    .get()     // Option<_>
    .ok_or_else(|| Error::UndefinedValueLength {
        position: self.bytes_read,
        tag: header.tag,
        backtrace: Backtrace::generate(),
    })

When in fact the same process is substantially easier and less error prone with selectors:

header
    .length()
    .get()
    .context(UndefinedValueLength {
        position: self.bytes_read,
        tag: header.tag,
    })

The context method encapsulates information about a specific error variant, while optionally encapsulating an error and generating backtraces. And all it takes to have this context method is to have our Snafu-built error type and import snafu::ResultExt and/or snafu::OptionExt.

Even when context isn't suitable (which is when you don't have a result nor an option), you can call .build() or.fail() on top of a selector.

let string = date_str();
if !is_valid_date(&string) {
    return InvalidDateValue {
        position: self.bytes_read,
        string,
    }
    .fail();
}

In fact, you should never really need to create errors manually.

Libraries, stabilize your APIs

As an ecosystem mostly comprised of libraries, it is admitted that changing any of these public error enum types leads to a breaking change, which would introduce a barrier to an eventual upgrade.

One easy step that gives more freedom to public error types is adding #[non_exhaustive]. This makes it so that matching on the value requires a "catch-all" match arm, which means that introducing new variants will not break existing code.

There usually isn't a need to exhaustively consider each error variant, so this change brings immediate benefits with little undesirable consequences. The same attribute can be applied to the struct variants themselves, so that adding more contextual fields to an error variant also does not lead to a major breaking change.

#[derive(Debug, Snafu)]
#[non_exhaustive]
pub enum Error {
    // ...
    #[snafu(display("Could not read data set token: {}", source))]
    #[non_exhaustive]
    ReadToken {
        #[snafu(backtrace)]
        source: dicom_parser::dataset::read::Error,
    },
    // ...
}

You can also choose to make an error opaque. Opaque errors are reported with the same information, except that their implementation is behind private constructs, therefore purposely limiting the interactions possible with that error. As users of the API cannot touch the implementation details of the error, any change will not break existing code, although it may change the errors displayed.

#[derive(Debug, Snafu)]
pub enum Error(InnerError);

#[derive(Debug, Snafu)]
#[non_exhaustive]
enum InnerError {  // no longer public
    // ...
}

Backtrace? What backtrace?

At some point before this refactor, I had this shallow idea that all I was missing in an error report was the backtrace to the part of the code that constructed the error. However, SNAFU advocates for semantic backtracing: to have good enough error types and variants, conveying enough information so that you don't need a backtrace in practice.

The current recommendation would be to introduce a backtrace on leaf error variants, either always generated or conditionally generated (by wrapping an Option around it), and see for yourself how useful they are. My early experience so far is that a backtrace makes reaching the problematic code a bit easier, but semantic backtracing already provides an outstanding clue of what the issue might be.

Ready to report?

Consider thinking about how you are going to report the errors to the user, and adjust your display implementations accordingly.

Most examples in the SNAFU guide show the following pattern:

#[derive(Debug, Snafu)]
#[non_exhaustive]
pub enum Error {
    #[snafu(display("Could not read data set token: {}", source))]
    #[non_exhaustive]
    ReadToken {
        #[snafu(backtrace)]
        source: dicom_parser::dataset::read::Error,
    },
}

Notice how the source field is chained to the tail of
the type's display implementation. This makes it so that a single line print is enough to present all information about the error (minus the backtrace).

eprintln!("[ERROR] {}", e);

One possible output:

[ERROR] Could not read data set token: Could not read item value: Undefined value length of element tagged (5533,5533) at position 3548

However, this will not integrate well with other error reporting facilities, such as anyhow or eyre, which are already prepared to traverse the chain of error causes and show them individually.

If we were to turn our error into an eyre::Report and print that:

eprintln!("[ERROR] {:?}", eyre::Report::from(e));

We get these maddening repetitions:

[ERROR] Could not read data set token: Could not read item value: Undefined value length of element tagged (5533,5533) at position 3548

Caused by:
   0: Could not read item value: Undefined value length of element tagged (5533,5533) at position 3548
   1: Undefined value length of element tagged (5533,5533) at position 3548

In this case, we would want a display message that does not print its source error.

This issue is not necessary specific to eyre, but will emerge any time that you use a different kind of error reporter. An equivalent report function in SNAFU would be quite similar:

fn report<E: 'static>(err: &E)
where
    E: std::error::Error,
    E: snafu::ErrorCompat,
    E: Send + Sync,
{
    eprintln!("[ERROR] {}", err);
    if let Some(source) = err.source() {
        eprintln!();
        eprintln!("Caused by:");
        for (i, e) in std::iter::successors(Some(source), |e| e.source()).enumerate() {
            eprintln!("   {}: {}", i, e);
        }
    }

    if let Some(backtrace) = ErrorCompat::backtrace(err) {
        eprintln!("Backtrace:");
        eprintln!("{}", backtrace);
    }
}

(Note: This reporter will always try to print an existing backtrace, regardless of environment variables set.)

So yes, you need to define upfront how your errors are supposed to be reported:

  • Implement error displays that show the full chain of errors, so that consumers only have to call a standard print function;
  • Or leave the task of showing the list of causes to a reporter, hence giving more flexibility to the consumer of the API.

There currently isn't an easy way to switch from one to the other. I chose the latter for DICOM-rs because I am interested in maintaining control of how errors are reported
in the project, and I can easily adjust all application crates accordingly. There are also chances of this pattern to become more common place in the wider Rust ecosystem.

Conclusion

In the end, I cannot claim that this was simply the best rewrite one could do, but I can say that rewriting it was a good decision. With a pull request comprising +4,694/−2,900 lines in 23 commits, the quick-error dependency was removed and snafu took over, much to the delight of future users of the DICOM-rs ecosystem.

As DICOM-rs may one day be a cornerstone of future generations of medical imaging software, this refactor came not a moment too soon. Future work will concentrate on other important matters that might be worth communicating as well. Among the topics in the project's roadmap, there is logging, lazy file loading, a data-oriented test harness, and a higher-level image abstraction.

Posted on by:

e_net4 profile

Eduardo Pinho

@e_net4

Software Engineer; expertise in medical imaging informatics; Rustacean since 2016.

Discussion

markdown guide