I read the great article The compiler will optimize that away and wanted to quickly check how the "most naive data-oriented approach" would impact the performance of some simple Go code.
Type definitions
Let's say an "Ant" has eight fields, four ints and four strings:
// An Ant is a struct with a few arbitrary fields.
type Ant struct {
    Field1 int
    Field2 string
    Field3 int
    Field4 string
    Field5 int
    Field6 string
    Field7 int
    Field8 string
}
The numbers presented here hold for a 64-bit machine, where an int is 8 bytes and a string header is 16 bytes.
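These sizes are easy to double-check on your own machine; here is a tiny sketch of my own (assuming the Ant type above is in the same package):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    fmt.Println(unsafe.Sizeof(int(0))) // 8 on a 64-bit machine
    fmt.Println(unsafe.Sizeof(""))     // 16: a string header is a pointer plus a length
    fmt.Println(unsafe.Sizeof(Ant{}))  // 96: 4*8 + 4*16, no padding needed
}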
The straightforward way to represent a colony of ants is a contiguous slice of ants:
// An AntColony is a slice of Ants.
type AntColony []Ant
The simple data-oriented way to represent a colony of ants is to pack all the values of each field in its own slice:
// A DataOrientedAntColony is an alternative representation of
// an AntColony, where all the values of each field are stored
// in a contiguous slice, in a data-oriented manner.
type DataOrientedAntColony struct {
    Field1 []int
    Field2 []string
    Field3 []int
    Field4 []string
    Field5 []int
    Field6 []string
    Field7 []int
    Field8 []string
}
Here is the source of the type definitions.
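For illustration, one way to build the data-oriented representation from a plain AntColony could look like the function below (a sketch with my own naming; the linked source may construct it differently):

// ToDataOriented packs the values of each field of every Ant
// into the corresponding per-field slice.
func ToDataOriented(colony AntColony) DataOrientedAntColony {
    n := len(colony)
    d := DataOrientedAntColony{
        Field1: make([]int, n),
        Field2: make([]string, n),
        Field3: make([]int, n),
        Field4: make([]string, n),
        Field5: make([]int, n),
        Field6: make([]string, n),
        Field7: make([]int, n),
        Field8: make([]string, n),
    }
    for i, ant := range colony {
        d.Field1[i] = ant.Field1
        d.Field2[i] = ant.Field2
        d.Field3[i] = ant.Field3
        d.Field4[i] = ant.Field4
        d.Field5[i] = ant.Field5
        d.Field6[i] = ant.Field6
        d.Field7[i] = ant.Field7
        d.Field8[i] = ant.Field8
    }
    return d
}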
Benchmarks
Here is the source for the benchmarks.
My general expectation is that the "data-oriented" approach will be faster for some access patterns, because (as explained in the article) the bottleneck is memory access: when we only care about a subset of the fields, fetching fewer bytes from main memory should be faster.
For 1,000,000 ants:
- SearchByField1: traverse the whole colony, use only Field1 (an int). My expectation: data-oriented will be faster (a sketch of this pair follows the list).
- SearchByField2: traverse the whole colony, use only Field2 (a string). My expectation: data-oriented will be slightly faster (if at all), as the time will be dominated by following pointers to string contents to compare.
- Inspect: traverse the whole colony, use Field1, Field3, Field5, Field7. My expectation: data-oriented fetches less memory, but from 4 distinct slices in different memory locations, so I guess it will be overall slower.
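To make the first access pattern concrete, here is a rough sketch of what the SearchByField1 pair could look like (the actual benchmark source is linked above; the function names and bodies below are my assumptions, and the other benchmarks follow the same shape):

// SearchByField1 counts the ants whose Field1 equals target.
// Only Field1 is used, but the loop walks over full Ant structs.
func SearchByField1(colony AntColony, target int) int {
    count := 0
    for _, ant := range colony {
        if ant.Field1 == target {
            count++
        }
    }
    return count
}

// DOSearchByField1 does the same over the data-oriented layout,
// touching only the contiguous Field1 slice.
func DOSearchByField1(colony DataOrientedAntColony, target int) int {
    count := 0
    for _, v := range colony.Field1 {
        if v == target {
            count++
        }
    }
    return count
}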
Results
$ go test -bench=.
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkAntSearchByField1-4        172     6943993 ns/op
BenchmarkDOAntSearchByField1-4     2064      604584 ns/op
BenchmarkAntSearchByField2-4        100    13402776 ns/op
BenchmarkDOAntSearchByField2-4      244     4948108 ns/op
BenchmarkAntInspect-4                86    13767480 ns/op
BenchmarkDOAntInspect-4             100    10134642 ns/op
SearchByField1: a whopping 91% latency improvement (11x speedup). We're using only 8 bytes of each Ant out of 96 (8%). The total time is dominated by, and proportional to, the amount of memory fetched.
SearchByField2: a 63% latency improvement (3x speedup). We're using only 16 bytes of each Ant out of 96 (17%), and then chasing a pointer to the actual string contents. The total time is still dominated by the amount of Ant memory fetched, regardless of the extra string-contents fetch and comparison. I did not expect such a high speedup for the data-oriented version.
Inspect: a 26% latency improvement (1.4x speedup). We're using 32 bytes of each Ant out of 96 (33%). My intuition about the cost of dealing with memory scattered in 4 different slices is dispelled: the data-oriented version is faster again!
Conclusion
A simple data-oriented memory layout can provide dramatic speedups.
Top comments (3)
Note, your benchmark code ranges over the colony by value. However, that will make a copy of the ant on every iteration. Using a pointer to the element will be faster.
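The two loop forms in question are something like this (a reconstruction using the assumed names colony, target, and count; the commenter's exact snippets may differ):

// Value form: each iteration copies the whole 96-byte Ant into ant.
for _, ant := range colony {
    if ant.Field1 == target {
        count++
    }
}

// Pointer form: take the address of the element instead, so only
// the fields that are actually read get fetched.
for i := range colony {
    ant := &colony[i]
    if ant.Field1 == target {
        count++
    }
}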
Good catch, Egon!
I had to try and see it with my own eyes. I find this fascinating because it contradicts my mental model of a memory bus that has very high throughput only if the fetched data is truly contiguous. Even though data-oriented is still the fastest in my benchmark, I'm now taking my data-oriented insight with a grain of salt. Thanks for bringing up the pointer in the loop!
I'm even more surprised because, in unrelated experiments, I had never seen a compelling argument, based on performance numbers, for choosing between passing a small or medium object around by value or by pointer. In the Ants case, we see that passing by pointer can provide a significant speedup.
Now I'm curious, because I still think that fetching from memory is the bottleneck, and that everything else (CPU, L1 cache reads, etc.) should be negligible. This would mean (maybe) that the pointer approach lets us retrieve fewer bytes of memory by not fetching the full Ant instance, while still achieving high throughput when retrieving non-contiguous bytes. I'd need a bit more research to figure this out.
Very interesting! I've only seen struct optimizations through field reordering to reduce binary offsets so far.