Speeding up all builds at Amazon by 40% with the help of Rust

#rust #devex #aws

Developer productivity is difficult to quantify. What a developer is ultimately trying to achieve is far removed from lines of code written or the speed of a release pipeline. Everyone agrees that developer productivity is one of the most important factors for success in a software company, yet leaders struggle to set business-driven goals and targets for it that tie directly to business outcomes. I have built and launched services for billions of users at Microsoft and Amazon, and in my experience iteration velocity is the strongest factor in how quickly a team delivers for customers. Time and again, though, I join teams where shortening iteration cycles is deprioritized in favor of short-term results.

One specific area that I have always worked to prioritize is build speed. Builds are the main bottleneck on iteration speed, and I find that shorter build times not only lead to faster feature iteration but also to deeper engagement, since I can stay in a development flow instead of waiting for a build to complete.

I had the privilege of being invited to join the builder tools team at Amazon from another service team. I had been investigating Brazil, the build system used at Amazon, and had come up with performance changes that sped up Amazon's dependency resolution tool by as much as 50% on larger dependency trees. On the builder tools team I took the lead on some sub-initiatives to reduce build times for all builds across Amazon. In 2024 we reduced our internal metric of developer build time by 40%, and I want to share the insights with the Rust and developer communities.

Learnings:

Metrics:
If you ever set out to improve developer productivity in your organization, the most important step is coming up with comprehensive metrics. To start this initiative, we collected every metric that could be related to builds. For every developer we could track how long builds took, which commands they were running, the characteristics of their machines, and even specific networking details such as how long it took to establish an SSH connection with a Git server. This is important for validating that you are focusing on the right things, but it is most impactful for showing the organization the effect your work is having, which is what allows you to keep prioritizing this type of work.
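As a rough illustration of the kind of record described above, the sketch below models one build invocation and summarizes build times per command as percentiles. The struct, its fields, and the function names are hypothetical and chosen for this example; they are not the schema or tooling used at Amazon.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// One build invocation as reported by a developer machine.
/// All field names are illustrative assumptions.
#[derive(Debug, Clone)]
pub struct BuildSample {
    pub command: String,    // e.g. "build" or "test"
    pub duration: Duration, // wall-clock time of the command
    pub cpu_cores: u32,     // one example of a machine characteristic
}

/// Nearest-rank percentile (p in 0..=100) of a set of durations.
/// Returns None for an empty input.
pub fn percentile(durations: &[Duration], p: u32) -> Option<Duration> {
    if durations.is_empty() {
        return None;
    }
    let mut sorted = durations.to_vec();
    sorted.sort();
    // Nearest rank: rank = ceil(p / 100 * n), clamped to [1, n].
    let n = sorted.len();
    let rank = (p as usize * n + 99) / 100;
    let idx = rank.clamp(1, n) - 1;
    Some(sorted[idx])
}

/// Groups samples by command and reports (p50, p90) for each command.
pub fn summarize(samples: &[BuildSample]) -> HashMap<String, (Duration, Duration)> {
    let mut by_command: HashMap<String, Vec<Duration>> = HashMap::new();
    for s in samples {
        by_command.entry(s.command.clone()).or_default().push(s.duration);
    }
    by_command
        .into_iter()
        .map(|(cmd, ds)| {
            // `ds` is never empty: every entry was created by a push.
            let p50 = percentile(&ds, 50).unwrap();
            let p90 = percentile(&ds, 90).unwrap();
            (cmd, (p50, p90))
        })
        .collect()
}
```

Percentiles are used here rather than averages because a handful of very slow builds can hide behind a healthy-looking mean; that choice is mine for the sketch, not something the original metric is known to do.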

One insight from this process was that the speed at which we could make improvements was limited by the caution required to avoid regressions. A robust client error collection system was just as important as all of the performance metrics, because it lets you ship changes and confirm that you are not impacting your users. This matters especially when modifying the dependency management system, since past changes there have taken down entire AWS services, such as S3 in the early days of AWS.
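A minimal sketch of how collected client errors could gate a rollout is shown below: it compares the error rate of a candidate tool version against the current baseline. The type, the function names, and the tolerance value are assumptions for illustration, and a plain threshold like this is far simpler than whatever checks a production system would need.

```rust
/// Outcome counts for one version of a client tool (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct Outcomes {
    pub runs: u64,
    pub errors: u64,
}

impl Outcomes {
    /// Fraction of runs that ended in an error (0.0 when there were no runs).
    pub fn error_rate(&self) -> f64 {
        if self.runs == 0 {
            0.0
        } else {
            self.errors as f64 / self.runs as f64
        }
    }
}

/// Returns true when the candidate's error rate exceeds the baseline's by
/// more than `tolerance`, an absolute difference (0.001 = 0.1 percentage points).
pub fn is_regression(baseline: Outcomes, candidate: Outcomes, tolerance: f64) -> bool {
    candidate.error_rate() - baseline.error_rate() > tolerance
}
```

With a baseline of 20 errors in 10,000 runs and a candidate at 10 errors in 1,000 runs, `is_regression(baseline, candidate, 0.001)` flags the candidate, since its error rate rose from 0.2% to 1%.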

Profiling:
Setting up profiling for our build tools was by far the most efficient way to identify improvements to build speeds. I dedicated time to looking up a profiler for each of the languages used in our build system stack, such as ruby-prof for Ruby and async-profiler for Java. What we realized early on was that we were missing a huge set of performance optimizations when looking at this per-language profiling data, because it could not show us the holistic performance of multiple subsystems written in different languages.

This is where the tool Samply shines. It can show timing information across multiple languages and subsystems, and for Rust and Java programs it can even tell you which lines your program is spending time on. This gave us the insights we needed to prioritize the parts of the system that were simplest to change and had the most impact on build times.
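To make the idea of a cross-subsystem view concrete, here is a small sketch that collects wall-clock time per build phase and ranks phases by their share of the total. This is coarse manual timing, not sampling profiling, and it is not how Samply works; the type and the phase names in the usage note are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Collects wall-clock time per named build phase.
#[derive(Debug, Default)]
pub struct PhaseTimings {
    phases: Vec<(String, Duration)>,
}

impl PhaseTimings {
    /// Runs `work`, records how long it took under `name`, and returns its result.
    pub fn time<T>(&mut self, name: &str, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.record(name, start.elapsed());
        out
    }

    /// Records an externally measured duration, e.g. one reported by a subprocess
    /// written in another language.
    pub fn record(&mut self, name: &str, elapsed: Duration) {
        self.phases.push((name.to_string(), elapsed));
    }

    /// Share of total recorded time per phase, largest first.
    pub fn shares(&self) -> Vec<(String, f64)> {
        let total: f64 = self.phases.iter().map(|(_, d)| d.as_secs_f64()).sum();
        let mut out: Vec<(String, f64)> = self
            .phases
            .iter()
            .map(|(name, d)| {
                let share = if total > 0.0 { d.as_secs_f64() / total } else { 0.0 };
                (name.clone(), share)
            })
            .collect();
        out.sort_by(|a, b| b.1.total_cmp(&a.1));
        out
    }
}
```

Recording, say, a "resolve" phase at 300ms, "compile" at 600ms, and "package" at 100ms puts "compile" first with a 60% share, which is the kind of ranking that tells you where a rewrite would pay off.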

Rust:
I joined the builder tools team just after building my first services in Rust, and I was very excited to hear that the team was eager to adopt more Rust in their services to further their performance goals. The builder tools org at Amazon is the best place to learn to work in Rust, since so many of its developers are involved in the Rust community; I would sometimes find the lead maintainer Niko at the agile desk beside me.

When it comes to building tooling and scripting, compiled languages are the way to go. With the metrics collected and the profiling infrastructure set up, the next step was to start rewriting the performance-critical sections in Rust.

The performance impact was unbelievable. We were seeing most use cases become 100-200x faster, with individual substeps dropping from 500ms to 20ms. With each rewritten subsection, overall build times dropped by a few more percentage points every week.
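It may seem odd that a large speedup in one substep moves the overall build by only a few percentage points. The arithmetic below, a form of Amdahl's law, shows why: the overall gain is capped by the share of the build that the substep represents. The 500ms and 20ms figures come from the text; the 5% share is an assumed number purely for illustration.

```rust
/// How many times faster a step became, given its time before and after.
pub fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Fraction by which total build time shrinks when a step that accounts for
/// `step_fraction` of the build (0.0..=1.0) becomes `speedup` times faster.
pub fn overall_reduction(step_fraction: f64, speedup: f64) -> f64 {
    // The step's share shrinks to step_fraction / speedup; the rest is unchanged.
    step_fraction * (1.0 - 1.0 / speedup)
}
```

Going from 500ms to 20ms is a 25x speedup for that substep. If the substep made up 5% of a build (assumed), `overall_reduction(0.05, 25.0)` gives 0.048, so the whole build gets about 4.8% faster. Even an infinitely fast substep could not save more than its 5% share, which is why the total improvement accumulates one subsection at a time.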

I know many in the Rust community are wary of agentic development, but rewriting from a scripting language to a compiled language is a perfect use case for it. We set up automation that took the original code, wrote unit tests and integration tests that made sure all of the features were covered, and then used those tests to guide the rewrite of the original code in Rust. Rust's type safety also greatly improved long-term maintainability, and in some cases the borrow checker even surfaced existing bugs.
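One way to picture the test-guided rewrite is a differential check: run the original implementation and the rewrite over the same inputs and report the first disagreement. The sketch below is an assumption about the general shape of such a check, not the automation we built; both implementations are passed in as closures, whereas a real harness would likely invoke the legacy script as a subprocess.

```rust
/// Describes the first input on which two implementations disagree.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub input: String,
    pub legacy: String,
    pub rewrite: String,
}

/// Runs both implementations over `inputs` and returns the first mismatch,
/// or None when they agree on every input.
pub fn first_mismatch<L, R>(legacy: L, rewrite: R, inputs: &[&str]) -> Option<Mismatch>
where
    L: Fn(&str) -> String,
    R: Fn(&str) -> String,
{
    inputs.iter().find_map(|&input| {
        let expected = legacy(input);
        let actual = rewrite(input);
        if expected == actual {
            None
        } else {
            Some(Mismatch {
                input: input.to_string(),
                legacy: expected,
                rewrite: actual,
            })
        }
    })
}
```

A check like this only proves agreement on the inputs you feed it, so the quality of the input corpus matters as much as the harness itself.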
