Profiling and Benchmarking

For the 0.2.0 release I focused on making rav1e faster.
Luca Bruno asked me to give more context, so this is the belated start of a series of blog posts about optimizing Rust code.

Optimizing

First, let's make clear what I mean by optimizing software, which kind of optimum we can hope to reach, and what the general process looks like.
The high-level process is more or less the following:

  1. Decide what you want to optimize for
    • Pick a target metric
  2. Pick a set of common use-cases
  3. Figure out which parts of the codebase impact the selected metric the most.
  4. Set up benchmarks to measure how your changes affect that metric.
  5. Change the code and go back to 3. until you are happy.

Target metrics

The first thing to decide is what you want to optimize for.

  • Most commonly you optimize for speed: making the same operation produce the same result, but in less time.
  • Sometimes you care more about how many resources the operation requires, and you want to keep them to a minimum.

Depending on your optimization strategy you may achieve more speed while also reducing resource usage, or you may have to trade the two off against each other, or trade one kind of resource for another (e.g. CPU time for memory).

To keep the discussion simpler we'll focus on the execution time of a single run and on memory usage.

Use cases

Once you have decided what to optimize for, you need to select a set of representative use-cases: they will be profiled to figure out what to speed up first, and used to measure the improvements through benchmarks.

There are many parallels with the debugging activity:

  • Benchmarks are no different from integration and unit tests, and test harnesses can usually double as benchmark harnesses or the other way round (e.g. checkasm); see the sketch below.
  • Resource profilers are quite similar to fault detectors such as ASan and Memcheck.

NOTE: valgrind offers a whole suite of resource profilers, fault detectors and more.
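To make the test/benchmark overlap concrete, here is a minimal sketch; the routine is a made-up stand-in, roughly in the spirit of the block-matching functions checkasm covers, and the same function the unit test checks for correctness is what a micro-benchmark later times.

```rust
// A made-up block-matching routine; the name and shape are mine, not from the post.
pub fn sad_8x8(a: &[u8; 64], b: &[u8; 64]) -> u32 {
    let mut acc = 0u32;
    for i in 0..64 {
        acc += (a[i] as i32 - b[i] as i32).unsigned_abs();
    }
    acc
}

#[cfg(test)]
mod tests {
    use super::*;

    // The unit test pins down correctness; a micro-benchmark can later call
    // the very same function on the same kind of inputs to time it.
    #[test]
    fn sad_is_zero_on_identical_blocks() {
        let block = [42u8; 64];
        assert_eq!(sad_8x8(&block, &block), 0);
    }
}
```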

You have the same code-coverage problems that you have when you are crafting regression and conformance tests.
You need to come up with something that:

  • represents the common usage model well
  • is reproducible
  • executes in an acceptable time, since you want to run it often, potentially as part of your CI and definitely as part of your release workflow
  • is easy to instrument, so you can extract different metrics with minimal manual intervention

Once you have selected your common use-cases, you can profile them first, figure out which hot paths would give you the largest gain, and craft benchmarks both for the overall execution and for the routines you want to optimize.

Profiling

To profile your binary there is a large number of tools that come in handy.

Some may give you information about multiple metrics (e.g. dtrace or bpftrace), others about a specific one (e.g. memory-profiler or uftrace); some work on unchanged binaries, others require them to be compiled with specific settings (e.g. debug information or mcount markers).

You want enough information to come up with about 5 candidates for optimization.

Depending on the tool you may simply pick the top of a list, or inspect a flamegraph or other interactive representations.

Some tools provide the information in multiple formats and really good tools provide means to compare profiling traces of different runs.

Execution time

The main metric you should care about is the cumulative amount of time spent on each function per run.

My favorite tools to get this information are:

  • cargo flamegraph: it produces pretty interactive graphs, using perf record or dtrace under the hood to gather timing information.
  • perf record + perf report: its tabular representation is usually all you need to get started, if you are on Linux.
  • uftrace: it is faster than perf and has built-in exporters for multiple formats.
  • cargo instruments: it is the most straightforward way to get profiling information on macOS.

Memory usage

You want to consider at least two pieces of information: the allocation count and the peak memory usage.
In many cases the peak memory usage (and the maximum resident set) is one of the main limiting factors if you plan to have concurrent runs.
Keeping the allocation count low is important since you might end up fragmenting memory, and allocations in a hot path also hurt execution speed.
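For reference, the maximum resident set can also be read from inside the process itself; here is a minimal sketch of that idea using the libc crate (Unix-only, my own example, not from the post — note that the unit of ru_maxrss differs between Linux and macOS).

```rust
// Minimal sketch: read the peak resident set size of the current process
// through getrusage(). On Linux ru_maxrss is in kilobytes, on macOS in bytes.
fn peak_rss() -> libc::c_long {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    // SAFETY: we pass a valid pointer to a zeroed rusage struct.
    unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    usage.ru_maxrss
}

fn main() {
    // Touch a sizeable buffer so the pages actually become resident.
    let buf = vec![1u8; 64 * 1024 * 1024];
    println!("peak RSS: {} (kB on Linux, bytes on macOS)", peak_rss());
    drop(buf);
}
```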

My favorite tools to get this information are:

  • malt: its web UI is really straightforward. (Linux-only)
  • memory-profiler: it is faster at collecting information, even if its web UI isn't as nice to navigate in my opinion. (Linux-only again)
  • On macOS, cargo instruments has you covered.
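If you only need a rough allocation count and do not want to reach for an external profiler, wrapping the global allocator works too; a minimal sketch, with type and counter names of my own choosing:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts every allocation going through the global allocator.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let v: Vec<u32> = (0..1024).collect();
    drop(v);
    println!("allocations so far: {}", ALLOCATIONS.load(Ordering::Relaxed));
}
```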

Benchmarking

Once you have your candidates from the profiling phase, the best next step is to write a set of micro-benchmarks for each of them and have a whole-application benchmark ready as well: sometimes speeding up a single function has unexpected side-effects, and it is better to catch the pathological cases early.

You may integrate them into your build system (e.g. cargo bench/criterion), use external tools such as perf report or hyperfine, or even directly add a call to getrusage in your binary.

                 criterion    hyperfine    perf report    getrusage (GNU time)
execution time   ✓            ✓            ✓              ✓
memory usage     –            –            –              ✓

NOTE: anything that can measure speed can be repurposed to measure throughput, even if it does not do so explicitly.
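As a concrete example of the cargo bench/criterion route, here is a minimal micro-benchmark sketch; sum_of_squares is just a stand-in for whatever routine profiling singled out.

```rust
// benches/hot_path.rs — built as a criterion benchmark (harness = false in Cargo.toml).
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the hot routine singled out by the profiler.
fn sum_of_squares(data: &[u64]) -> u64 {
    data.iter().map(|&x| x * x).sum()
}

fn bench_sum_of_squares(c: &mut Criterion) {
    let data: Vec<u64> = (0..4096).collect();
    c.bench_function("sum_of_squares/4096", |b| {
        b.iter(|| sum_of_squares(black_box(&data)))
    });
}

criterion_group!(benches, bench_sum_of_squares);
criterion_main!(benches);
```

criterion can also save named baselines (e.g. cargo bench -- --save-baseline before), which pairs nicely with critcmp below for comparing revisions.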

Half of the problem is extracting the benchmark data; the other half is storing it properly so you can compare between revisions.

  • critcmp makes it easier to store and compare different criterion-based benchmark runs.
  • hyperfine has means to export to multiple formats and some analysis scripts that may help you.

Once you have the benchmarks ready you can start optimizing the code.

What next

This post is an introduction to the topic, with just a list of tools I like to use and what I consider best practice when dealing with this kind of task.

In the next posts I'll try to discuss the actual strategies for doing the work and which crates may make your life much easier, depending on what you are trying to achieve.

Comments (2)

Daniel Thompson-Yvetot

Awesome writeup. We're planning our benchmarking and profiling approach for our polyglot (typescript + rust) project tauri.studio right now, and you offer some great insights. Any ideas about benching on Windows?

Luca Barbato

criterion works on Windows, and dtrace should work on Windows too, so cargo-flamegraph should be available.

Depending on what you need, VTune is as good as Instruments on macOS or better, or at least so I have been told :) (I do not use Windows that often).