Michał Siedlaczek

Originally published at msiedlaczek.com

Protobuf code generation in Rust

Today, I learned how to correctly use Cargo build scripts. Or, more precisely, I learned how to do one particular thing correctly, but it was significant enough for me that I decided to write it down. Of course, had I read the Cargo Book more carefully beforehand, I would have saved myself some time, there would be no dramatic revelation, and no reason to write this post either. I guess what I am trying to say is: thank goodness my reading sucks.

Keywords: cargo, build scripts, code generation

Context

My problem arose while implementing a little library that reads from and writes to a Protocol Buffer stream. As described by the authors themselves:

Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.

The very first step is to define a message format in Protobuf's dedicated grammar. The message is parsed by the Protobuf library (available for various programming languages), and the corresponding code is generated for the target language (in our case, Rust). Each message is an object with a bunch of getters and setters (details differ depending on the language). These messages can then be read from or written to streams available in the Protobuf library. It is all quite straightforward. Take this simple message from the Protobuf web page:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

In Rust, we would use the following chunk of code to construct a Person:

let mut person = Person::new();
person.set_name("Butler");

The details are not very important here. What is important is that this code needs to be generated before the project is compiled, or else there would be no Person to speak of. And, if possible, we would like this step to be neatly integrated into our build system.

Cargo build scripts

Our situation is by no means unique. Probably the most common example is a library that provides Rust API bindings to some C library, such as libc, git2, and many others. Before compiling our Rust crate, we first need to compile the C code, and maybe even generate Rust FFI bindings from the C headers (see the bindgen crate). Because this is a common pattern and the Rust ecosystem is fantastic, there is a standard solution for it: build scripts.

Simply put, Cargo allows us to define a build script, by default named build.rs and located at the root of the project. This is essentially a regular executable (with a main function and all), except that Cargo provides a bunch of environment variables with useful build information. The build script in turn communicates back to Cargo by writing instructions to its standard output. For example, printing cargo:warning=MESSAGE will instruct Cargo to print a warning to the terminal. More on that a little later, but for now here is a simple example that compiles a C source file using the cc crate:

fn main() {
    cc::Build::new()
        .file("src/example.c")
        .compile("example");
}
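
To make that communication concrete, here is a minimal sketch of a build script that does nothing but talk back to Cargo (the file path and warning message are made up for illustration):

fn main() {
    // Rerun this script only when the listed file changes,
    // instead of whenever anything in the package changes.
    println!("cargo:rerun-if-changed=src/example.c");
    // Surface a warning in the build output.
    println!("cargo:warning=example.c is compiled with default flags");
}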

Note that there is a special section in Cargo.toml where you can define dependencies that are only used by the build script: [build-dependencies].
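
For the cc example above, that section might look like this (the version number is illustrative):

[build-dependencies]
cc = "1.0"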

Generating Protobuf code

Now that we have basic information about build scripts, we can take a shot at generating the Rust code for Protobuf messages. Luckily, there is already a crate that will make it very simple:

[build-dependencies]
protobuf-codegen-pure = "2.14" # Might be different by the time you read this

[dependencies]
protobuf = "2.14" # This will be needed to use the generated code as protobuf messages

A quick look at the documentation explains it all:

fn main() {
    protobuf_codegen_pure::Codegen::new()
        .out_dir("src/protos")
        .inputs(&["protos/person.proto"]),
        .include("protos")
        .run()
        .expect("Codegen failed.");
}

Let's break it down. We first create a Codegen object, which follows the builder pattern. Then, we define where the generated files should be created. Finally, we point to the input files that contain the message definitions, and to the directory containing these files. That's it. Simple, right?

Well, not so fast. There is a catch. See, Cargo doesn't want us to write to the src directory:

Build scripts may save any output files in the directory specified in the OUT_DIR environment variable. Scripts should not modify any files outside of that directory.

But why would they care? Well, it is a security concern. If a crate is built remotely, we don't want to allow what is effectively a user-defined program to write anywhere it wants. A good example is Docs.rs, which hosts API documentation for all crates available on crates.io. It limits the program's write permissions to a single directory and passes its path in the OUT_DIR environment variable. In fact, if you follow the instructions from the protobuf-codegen-pure crate, your documentation on Docs.rs will fail to build (this is precisely how I found out about all of this!).

Correcting the build script

So how do we fix our build script? Let's try this:

use std::env;
use std::path::Path;

fn main() {
    let out_dir_env = env::var_os("OUT_DIR").unwrap();
    let out_dir = Path::new(&out_dir_env);
    protobuf_codegen_pure::Codegen::new()
        .out_dir(out_dir)
        .inputs(&["protos/person.proto"])
        .include("protos")
        .run()
        .expect("Codegen failed.");
}

But where is our file now, and how do we use it? Rust provides the include! macro, which copies the contents of a file into the file it is invoked from. For example, here is a little snippet from lib.rs showcasing this:

include!(concat!(env!("OUT_DIR"), "/person.rs"));

fn new_person(name: &str) -> Person {
    let mut person = Person::new();
    person.set_name(name.to_string());
    person
}

Does it work now? Unfortunately, not quite. protobuf-codegen-pure takes the liberty of adding some module-level comments and attributes that suppress certain warnings, and these now fail to compile:

error: an inner attribute is not permitted in this context
 --> /home/elshize/dev/ciff/target/debug/build/ciff-e8fd3067377fd4eb/out/common_index_format_v1.rs:5:1
  |
5 | #![allow(unknown_lints)]
  | ^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files. Outer attributes, like `#[test]`, annotate the item following them.

This is caused by the indirection of include!. The good news is that this is a known problem, and chances are it has already been resolved by the time you read this. But since I don't have the luxury of travelling through time, I would like to find a workaround. Besides, it is a good opportunity to show how powerful build scripts really are. We are by no means limited to what the codegen library generates for us.

The objective is to get rid of those failing comments and attributes. On the other hand, I would still like to be able to suppress warnings from the generated code. To do that, we can create a person.rs module that simply defines the attributes and includes the generated code; its contents can later be re-exported from lib.rs. For example:

#![allow(unknown_lints)]
#![allow(clippy::all)]
#![allow(clippy::pedantic)]
#![allow(box_pointers)]
#![allow(dead_code)]
#![allow(missing_docs)]
#![allow(non_camel_case_types)]
#![allow(non_snake_case)]
#![allow(non_upper_case_globals)]
#![allow(trivial_casts)]
#![allow(unsafe_code)]
#![allow(unused_imports)]
#![allow(unused_results)]
include!(concat!(env!("OUT_DIR"), "/person.rs"));
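
With that module in place (assuming it is saved as src/person.rs), lib.rs only needs to declare it and re-export the generated type:

mod person;

pub use person::Person;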

Great, now the only thing that is left is to remove these from the generated file. This can be easily done in build.rs. Once the file is successfully generated, we can read it line by line and filter out any line that starts with #! or //!.

use std::env;
use std::fs::{read_to_string, File};
use std::io::{BufWriter, Write};
use std::path::Path;

fn main() {
    let out_dir_env = env::var_os("OUT_DIR").unwrap();
    let out_dir = Path::new(&out_dir_env);
    protobuf_codegen_pure::Codegen::new()
        .out_dir(out_dir)
        .inputs(&["protos/person.proto"])
        .include("protos")
        .run()
        .expect("Codegen failed.");
    // Resolve the path to the generated file.
    let path = out_dir.join("person.rs");
    // Read the generated code into a string.
    let code = read_to_string(&path).expect("Failed to read generated file");
    // Write the filtered lines back to the same file, dropping
    // module-level comments (`//!`) and inner attributes (`#![...]`).
    let mut writer = BufWriter::new(File::create(path).unwrap());
    for line in code.lines() {
        if !line.starts_with("//!") && !line.starts_with("#!") {
            writer.write_all(line.as_bytes()).unwrap();
            writer.write_all(&[b'\n']).unwrap();
        }
    }
}
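
One optional refinement: by default, Cargo reruns a build script whenever any file in the package changes. Since only the message definitions matter here, we could also print a rerun-if-changed instruction at the top of main:

println!("cargo:rerun-if-changed=protos/person.proto");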

Result? See for yourself.

Conclusions

I have a few takeaways from my little experiment. First, the Rust ecosystem, although rich and powerful, is not yet fully mature. Certain details are still being ironed out, such as those in protobuf-codegen-pure. But this is to be expected. What I find more important is that these libraries are out there, and that many people are actively working on making them better each day. But most of all, I am often blown away by how well thought out many features of Rust and Cargo are, especially compared to those available, say, in C++. Build scripts are one of those gems that elevate Rust to the great piece of technology it is.


Questions? Comments? I am @elshize on Twitter and @siedlaczek on Mastodon.Social. Feel free to say hi.

Comments (6)

Dion Dokter

Very nice article, thanks!

Andrew Watts

An alternative to this would be to copy the generated rust module from $OUT_DIR to $CARGO_MANIFEST_DIR/src. In the past, I've used github.com/danburkert/prost for protobuf things, and used something along these lines:

use std::{env, fs};
use std::path::PathBuf;

let proto_path: PathBuf = /* the path to the protobuf definitions */;
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
let src_dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("src");
let out_path = out_dir.join("messages.rs");
let module_path = src_dir.join("messages").join("generated.rs");

prost_build::Config::new()
    .compile_protos(
        &[proto_path.to_str().unwrap()],
        &[proto_path.parent().unwrap().to_str().unwrap()],
    )
    .expect("Failed to compile protobuf definitions");

fs::copy(out_path, module_path).unwrap();

My experience has generally been that the actual protobuf definitions don't change very often, and the generated code isn't build-dependent, so committing the generated file isn't a huge deal (or you could set it to be ignored). You can then shave a little bit of time off the compilation by only regenerating the code if the definitions are newer than the generated module.
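
A freshness check along those lines could look something like this (a sketch with a hypothetical helper, not the actual logic from the build script):

use std::fs;
use std::path::Path;

// Regenerate when the generated module is missing or older
// than the .proto definition it was built from.
fn needs_regeneration(proto: &Path, generated: &Path) -> bool {
    let proto_mtime = fs::metadata(proto).and_then(|m| m.modified());
    let generated_mtime = fs::metadata(generated).and_then(|m| m.modified());
    match (proto_mtime, generated_mtime) {
        (Ok(p), Ok(g)) => p > g,
        _ => true, // regenerate if either timestamp is unavailable
    }
}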

Michał Siedlaczek

Thanks for sharing. I didn't know about prost and would love to know how it compares with protobuf-codegen-pure if you know. For one, it seems to write to OUT_DIR by default.

Do I understand correctly that src_dir will point to your project/src directory? If so, we will experience the same problem of not being able to write to it on Docs.rs, right? I guess your suggestion would be to not run the generation as a build script but rather do it manually and commit the generated files, which certainly seems like a reasonable option for protobuf, which shouldn't change between builds.

Andrew Watts

As I recall, I used prost because at the time it offered me a way to customize derived traits for the generated types (I needed the serde traits), while the other protobuf crate I looked at did not. That might not be the case now.

CARGO_MANIFEST_DIR is the directory where Cargo.toml lives, so $CARGO_MANIFEST_DIR/src is the project's src directory.

In my case, the definitions were in another repository, and might not always be available if that repo had not been checked out in the right place on the machine doing the build, but I still wanted to be able to build in that case.

Thus, I do run the generator as a build script, but I also commit the output to source control, so rustdoc ought to see it. There's some logic in the build script (that I didn't post) that only runs the protobuf generator if the module is older than the definitions. If I need to force it to regenerate, I can always locally delete the generated module and build, and then commit any differences.

Michał Siedlaczek

There's some logic in the build script (that I didn't post) that only runs the protobuf generator if the module is older than the definitions.

Ok, that explains a lot, thanks for clarifying. Sounds like a sensible approach.

danieleades

what's the advantage of protobuf-codegen-pure over prost-build?