De(v)lightful continuous benchmarks with Go

Omer Karjevsky — Thu, 05 Nov 2020 10:41:14 +0000

The following article is not going to tell you about the importance of benchmarking your Go applications.

Even when we understand its importance, benchmark tests are often skipped when developing new application features or improving existing ones. There can be many reasons for this, but the main one is the hassle involved in performing a meaningful benchmark.

Here, we will describe a method for making benchmarking an easy, low-friction step in the feature development lifecycle.

What we need

First, let’s describe the basic requirements when benchmarking an application feature:

The benchmark result should tell us something meaningful
The benchmark result should be reliable and reproducible
Benchmark tests should be isolated from one another

The benchmark tooling available in the Go testing package is a useful and accepted method to meet the aforementioned requirements, but in order to have the best workflow, we want to include additional nice-to-have ones as well:

The benchmark test should be (very) easy to write
The benchmark suite should run as fast as possible, so it can be integrated into CI
The benchmark results should be actionable

Easy, right?

At first glance, all of these requirements should also be covered by the existing tooling.
Looking at the example from Dave Cheney’s great blog post, writing a benchmark for a Fib(int) function should be very easy, and the code duplicated between this test and another would be negligible.

func BenchmarkFib10(b *testing.B) {
    // run the Fib function b.N times
    for n := 0; n < b.N; n++ {
        Fib(10)
    }
}

Not in the real world

For the purpose of this article, let’s take a look at a more challenging use case. This time, we want to benchmark a function which performs a SELECT statement on an SQL database.
The stateful nature of this flow requires us to perform a setup step before running the actual test.

What would such a test look like? Probably something similar to this:

func Benchmark_dao_FindByName(b *testing.B) {
    ctx, ctxCancel := context.WithCancel(context.Background())
    defer ctxCancel()

    // when working with multiple DB types, or randomized ports,
    // this can involve some logic, for example,
    // resolving the connection info from env var / config.
    dbDriver := resolveDatabaseDriver()
    dbURL := resolveDatabaseURL()

    db, err := sql.Open(dbDriver, dbURL)
    require.NoError(b, err)
    defer db.Close()

    dao := domain.NewDAO(db)

    // populate the db with enough entries for the test to be meaningful
    populateDatabaseWithTestData(b, dao)

    // ignore setup time from benchmark
    b.ResetTimer()

    // run the FindByName function b.N times
    for n := 0; n < b.N; n++ {
        _, err = dao.FindByName(ctx, "last_inserted_name")
        require.NoError(b, err)
    }
}

Yikes, even with the most simplistic example, look at all that setup code. Now imagine writing that for each new benchmark test in a real application codebase!

Another thing to note here is that if we forget (yes, mistakes happen) to reset the benchmark timer where required, or to run the tested code inside the b.N loop, the benchmark becomes invalid without any clear indication.
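To make the timer pitfall concrete, the sketch below records how much time the benchmark timer has accumulated right before and right after b.ResetTimer(). The 20ms sleep is only a stand-in for database setup, and the snippet assumes Go 1.20 or later, where b.Elapsed() is available.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// setupCost stands in for expensive preparation such as populating a database.
const setupCost = 20 * time.Millisecond

// measureTimer runs a tiny benchmark and reports how much time the benchmark
// timer had accumulated right before and right after b.ResetTimer().
func measureTimer() (beforeReset, afterReset time.Duration) {
	testing.Benchmark(func(b *testing.B) {
		// The timer is already running when the benchmark function starts,
		// so this setup time is counted unless the timer is reset.
		time.Sleep(setupCost)

		beforeReset = b.Elapsed()
		b.ResetTimer()
		afterReset = b.Elapsed()

		for n := 0; n < b.N; n++ {
			_ = n * n // placeholder for the code under test
		}
	})
	return beforeReset, afterReset
}

func main() {
	before, after := measureTimer()
	fmt.Printf("timer before ResetTimer: %v\n", before)
	fmt.Printf("timer after ResetTimer:  %v\n", after)
}
```

Without the reset, everything measured before the loop is attributed to the code under test, and nothing in the output hints that this happened.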

So, what can we do?

At JFrog, we want processes to be as pain-free as possible, so we can shift-left responsibilities to the dev teams without having it affect productivity and development times too much. An example of this mindset is writing benchmark tests as part of the application test suite, instead of being part of the end-to-end suite or the QA pipeline.

So, looking at the example above, we see that the experience of writing new benchmark tests would be painful for our developers - we need to fix that. Let’s look at how we can reduce the setup code as much as possible.

1. Wrap application setup and benchmark state

Most of the setup logic from the example above can be deferred to a wrapping utility.
This covers resolving the DB connection info and creating the application object, so that both are ready for use in the test.

type BenchSpec struct {
    Name string
    // Runs b.N times, after the benchmark timer is reset.
    // The provided context.Context and myapp.Application
    // are closed during b.Cleanup().
    Test func(t testing.TB, ctx context.Context, app myapp.Application)
}

func runBenchmark(b *testing.B, spec BenchSpec) {
    b.Helper()

    ctx, ctxCancel := context.WithCancel(context.Background())
    b.Cleanup(ctxCancel)

    // when working with multiple DB types, or randomized ports,
    // this can involve some logic, for example,
    // resolving the connection info from env var / config.
    dbDriver := resolveDatabaseDriver()
    dbURL := resolveDatabaseURL()

    db, err := sql.Open(dbDriver, dbURL)
    require.NoError(b, err)
    b.Cleanup(func() { _ = db.Close() })

    dao := domain.NewDAO(db)
    // did you try https://github.com/google/wire yet? ;)
    app := myapp.NewApplication(dao)

    // populate the db with enough entries for the test to be meaningful
    populateDatabaseWithTestData(b, app)

    b.Run(spec.Name, func(b *testing.B) {
        for n := 0; n < b.N; n++ {
            spec.Test(b, ctx, app)
        }
    })
}

func Benchmark_dao_FindByName(b *testing.B) {
    runBenchmark(b, BenchSpec{
        Name: "case #1",
        Test: func(t testing.TB, ctx context.Context, app myapp.Application) {
            _, err := app.Dao.FindByName(ctx, "last_inserted_name")
            require.NoError(t, err)
        },
    })
}

Now the actual test code is trivial: all we need to do is pass the method we want to benchmark into the runBenchmark helper function, which handles all the heavy lifting for us.

It also keeps us from making mistakes: note that the BenchSpec.Test argument t is narrowed to testing.TB instead of giving access to the full *testing.B object, so the test body has no direct access to the timer or to b.N.

The wrapper utility can, of course, be extended to accept multiple test specifications at once, so that we can run multiple tests against a single database/application initialization.

2. Use pre-populated DB Docker images

Populating the database in each test takes too much time. Using a pre-populated Docker image allows us to run our tests against a well-known DB state, without having to wait.

As an added bonus, since the DB container usually requires no initialization actions, we can use an image that starts up much faster.

Actionability

Now that we have an easy way to add benchmarks for our new features, what do we do with them? The output will look something like this:

% go test -run="^$" -bench="." ./... -count=5 -benchmem

PASS
Benchmark_dao_FindByName   242    4331132 ns/op     231763 B/op      4447 allocs/op

ok      jfrog.com/omerk/benchleft       3.084s

Simply looking at the output and deciding if performance is good enough is nice, but can we get more actionable results from our benchmark suite?

When working on a change to our application, we rely on our CI to make sure we don’t break anything. Why not use the same flow to make sure we don’t degrade our application performance?

So, let’s run the benchmark suite as part of our CI. But getting the results on our working branch is not enough; we also need to compare these results to a stable state. In order to do that, we can run the benchmark suite on both our working branch and a release branch or the main branch of our git repository.

By using a tool like benchstat, we can compare the results of two separate benchmark runs. The output of such a comparison looks along the lines of:

% benchstat "stable.txt" "head.txt"

name                                                                   old time/op    new time/op    delta
pkg:jfrog.com/omerk/benchleft goos:darwin goarch:amd64 _dao_FindByName 4.04ms ± 8%    5.43ms ±40%  +34.61%  (p=0.008 n=5+5)

name                                                                   old alloc/op   new alloc/op   delta
pkg:jfrog.com/omerk/benchleft goos:darwin goarch:amd64 _dao_FindByName 10.9kB ± 0%    10.9kB ± 0%   +0.19%  (p=0.032 n=5+5)

name                                                                   old allocs/op  new allocs/op  delta
pkg:jfrog.com/omerk/benchleft goos:darwin goarch:amd64 _dao_FindByName 8.60k ± 0%     8.60k ± 0%   +0.01%  (p=0.016 n=4+5)

As can be seen above, we get a clear understanding of the performance differences (if any) for time, memory and number of allocations per operation.

Storing the stable benchmark results in a bucket or artifact repository (such as JFrog Artifactory) is one option, but to keep performance variance to a minimum, we would prefer to run the suite on both branches on the current CI agent if possible (using the tips above to keep our tests fast is crucial).

Here is an example of a (simplistic) bash script that will help us get actionable results from our benchmark suite:

#!/usr/bin/env bash
set -e -o pipefail

BENCH_OUTPUT_DIR="${BENCH_OUTPUT_DIR:-out}"
DEGRADATION_THRESHOLD="${DEGRADATION_THRESHOLD:-5.00}"
MAX_DEGRADATION=0
THRESHOLD_REACHED=0

function doBench() {
   local outFile="$1"
   go test -run="^$" -bench="." ./... -count=5 -benchmem | tee "${outFile}"
}

function calcBench() {
   local metricName="$1"
   MAX_DEGRADATION=$(cat "${BENCH_OUTPUT_DIR}/result.txt" | grep -A 2 "${metricName}" | head -n 3 | tail -n 1 | awk 'match($0,/(\+[0-9]+\.[0-9]+%)/) {print substr($0,RSTART,RLENGTH)}' | tr -d "+%")
   if [[ "${MAX_DEGRADATION}" == "" ]]; then
       echo "Benchmark ${metricName} - no degradation"
       return
   fi
   if [[ "${THRESHOLD_REACHED}" == "0" ]]; then
       THRESHOLD_REACHED=$(echo "${MAX_DEGRADATION} > ${DEGRADATION_THRESHOLD}" | bc -l)
   fi
   echo "Benchmark ${metricName} degradation: ${MAX_DEGRADATION}% | threshold: ${DEGRADATION_THRESHOLD}%"
}

mkdir -p "${BENCH_OUTPUT_DIR}"

git checkout stablebranch && git pull
doBench "${BENCH_OUTPUT_DIR}/stable.txt" || { git checkout - && exit 1; }

git checkout -
doBench "${BENCH_OUTPUT_DIR}/head.txt"

benchstat -sort delta "${BENCH_OUTPUT_DIR}/stable.txt" "${BENCH_OUTPUT_DIR}/head.txt" | tee "${BENCH_OUTPUT_DIR}/result.txt"

calcBench "time/op"
calcBench "alloc/op"
calcBench "allocs/op"

exit "${THRESHOLD_REACHED}"

doBench() will run the full benchmark suite and output the results into the specified file. calcBench() takes the comparison result output and calculates if our specified degradation threshold has been reached.
benchstat is used with the -sort delta flag, so we take the highest delta into our calculation.

The script performs the benchmarks against “stablebranch” and then against our working branch, compares the two results, and calculates against the threshold.

The exit code can be used to fail our CI pipeline in case of a large enough performance degradation.

Further Considerations

It’s important to remember that your CI pipeline (such as JFrog Pipelines) is a tool meant to increase productivity and observability. Adding the benchmark step can be very helpful but does not entirely alleviate the developer’s responsibility.

When following the examples above, you may notice that newly added benchmark tests are not considered when automating the degradation calculation, as there is no stable result to compare against. For this case, developer discretion is required to decide on a good base value.

There will also inevitably be cases where changes would cause expected and acceptable performance degradation. The CI pipeline should support these cases.

Lastly, regarding database-dependent benchmarks, it would be best to keep the tests isolated from the specific data stored in your snapshots - such tests should be based on the amount of data, and not the contents. This would allow updating the database snapshot without having to worry about breaking previously written tests.

Conclusion

Using this methodology, we can reduce both the cognitive load of writing benchmark tests and the time it takes, which ultimately allows us to include them as part of our feature development lifecycle with very little added effort. On top of that, having actionable benchmark results as part of our CI pipelines allows us to catch performance issues much earlier in the development lifecycle.

Taking full ownership and responsibility of feature delivery becomes easier, letting our developers focus on development instead of verification.

Interested in working on high-performance products with team members who care about improving workflows while keeping high standards? Come join us at JFrog R&D!

DEV Community: Omer Karjevsky