I read the great article The compiler will optimize that away and wanted to quickly check how the "most naive data-oriented approach" would impact the performance of some simple Go code.
Type definitions
Let's say an "Ant" has eight fields, four ints and four strings:
// An Ant is a struct with a few arbitrary fields.
type Ant struct {
    Field1 int
    Field2 string
    Field3 int
    Field4 string
    Field5 int
    Field6 string
    Field7 int
    Field8 string
}
The numbers presented here hold for a 64-bit machine, where an int is 8 bytes and a string header is 16 bytes.
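These sizes are easy to double-check on your own machine; here is a tiny sketch of my own (assuming the Ant type above is in the same package):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    fmt.Println(unsafe.Sizeof(int(0))) // 8 on a 64-bit machine
    fmt.Println(unsafe.Sizeof(""))     // 16: a string header is a pointer plus a length
    fmt.Println(unsafe.Sizeof(Ant{}))  // 96: 4*8 + 4*16, no padding needed
}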
The straightforward way to represent a colony of ants is a contiguous slice of ants:
// An AntColony is a slice of Ants.
type AntColony []Ant
The simple data-oriented way to represent a colony of ants is to pack all the values of each field in its own slice:
// A DataOrientedAntColony is an alternative representation of
// an AntColony, where all the values of each field are stored
// in a contiguous slice, in a data-oriented manner.
type DataOrientedAntColony struct {
    Field1 []int
    Field2 []string
    Field3 []int
    Field4 []string
    Field5 []int
    Field6 []string
    Field7 []int
    Field8 []string
}
Here is the source of the type definitions.
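For illustration, one way to build the data-oriented representation from a plain AntColony could look like the function below (a sketch with my own naming; the linked source may construct it differently):

// ToDataOriented packs the values of each field of every Ant
// into the corresponding per-field slice.
func ToDataOriented(colony AntColony) DataOrientedAntColony {
    n := len(colony)
    d := DataOrientedAntColony{
        Field1: make([]int, n),
        Field2: make([]string, n),
        Field3: make([]int, n),
        Field4: make([]string, n),
        Field5: make([]int, n),
        Field6: make([]string, n),
        Field7: make([]int, n),
        Field8: make([]string, n),
    }
    for i, ant := range colony {
        d.Field1[i] = ant.Field1
        d.Field2[i] = ant.Field2
        d.Field3[i] = ant.Field3
        d.Field4[i] = ant.Field4
        d.Field5[i] = ant.Field5
        d.Field6[i] = ant.Field6
        d.Field7[i] = ant.Field7
        d.Field8[i] = ant.Field8
    }
    return d
}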
Benchmarks
Here is the source for the benchmarks.
My general expectation is that the "data-oriented" approach will be faster for some access patterns, because (as explained in the article) the bottleneck is memory access: when we only care about a subset of the fields, fetching fewer bytes from main memory should be faster.
For 1,000,000 ants:
- SearchByField1: traverse the whole colony, use only Field1 (an int). My expectation: data-oriented will be faster (a sketch of this pair follows the list).
- SearchByField2: traverse the whole colony, use only Field2 (a string). My expectation: data-oriented will be slightly faster (if at all), as the time will be dominated by following pointers to string contents to compare.
- Inspect: traverse the whole colony, use Field1, Field3, Field5, Field7. My expectation: data-oriented fetches less memory, but from 4 distinct slices in different memory locations, so I guess it will be overall slower.
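To make the first access pattern concrete, here is a rough sketch of what the SearchByField1 pair could look like (the actual benchmark source is linked above; the function names and bodies below are my assumptions, and the other benchmarks follow the same shape):

// SearchByField1 counts the ants whose Field1 equals target.
// Only Field1 is used, but the loop walks over full Ant structs.
func SearchByField1(colony AntColony, target int) int {
    count := 0
    for _, ant := range colony {
        if ant.Field1 == target {
            count++
        }
    }
    return count
}

// DOSearchByField1 does the same over the data-oriented layout,
// touching only the contiguous Field1 slice.
func DOSearchByField1(colony DataOrientedAntColony, target int) int {
    count := 0
    for _, v := range colony.Field1 {
        if v == target {
            count++
        }
    }
    return count
}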
Results
$ go test -bench=.
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkAntSearchByField1-4        172     6943993 ns/op
BenchmarkDOAntSearchByField1-4     2064      604584 ns/op
BenchmarkAntSearchByField2-4        100    13402776 ns/op
BenchmarkDOAntSearchByField2-4      244     4948108 ns/op
BenchmarkAntInspect-4                86    13767480 ns/op
BenchmarkDOAntInspect-4             100    10134642 ns/op
SearchByField1: a whopping 91% latency improvement (11x speedup). We're using only 8 bytes of each Ant out of 96 (8%). The total time is dominated by, and proportional to, the amount of memory fetched.
SearchByField2: a 63% latency improvement (3x speedup). We're using only 16 bytes of each Ant out of 96 (17%), and then chasing a pointer to the actual string contents. The total time is still dominated by the amount of Ant memory fetched, regardless of the extra string-contents fetch and comparison. I did not expect such a high speedup for the data-oriented version.
Inspect: a 26% latency improvement (1.4x speedup). We're using 32 bytes of each Ant out of 96 (33%). My intuition about the cost of dealing with memory scattered in 4 different slices is dispelled: the data-oriented version is faster again!
Conclusion
A simple data-oriented memory layout can provide dramatic speedups.
Top comments (3)
Note, your benchmark code ranges over the colony by value. However, that will make a copy of the ant on every iteration. Using a pointer to the element will be faster.
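The two loop forms in question are something like this (a reconstruction using the assumed names colony, target, and count; the commenter's exact snippets may differ):

// Value form: each iteration copies the whole 96-byte Ant into ant.
for _, ant := range colony {
    if ant.Field1 == target {
        count++
    }
}

// Pointer form: take the address of the element instead, so only
// the fields that are actually read get fetched.
for i := range colony {
    ant := &colony[i]
    if ant.Field1 == target {
        count++
    }
}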
Good catch, Egon!
I had to try and see it with my own eyes. I find this fascinating because it contradicts my mental model of a memory bus that has very high throughput only if the fetched data is truly contiguous. Even though data-oriented is still the fastest in my benchmark, I'm now taking my data-oriented insight with a grain of salt. Thanks for bringing up the pointer in the loop!
I'm even more surprised because, in unrelated experiments, I had never seen a compelling argument, based on performance numbers, for choosing between passing a small or medium object around by value or by pointer. In the Ants case, we see that passing by pointer can provide a significant speedup.
Now I'm curious, because I still think that fetching from memory is the bottleneck, and that everything else (CPU, L1 cache reads, etc.) should be negligible. This would mean (maybe) that the pointer approach lets us retrieve fewer bytes of memory by not fetching the full Ant instance, while still achieving high throughput when retrieving non-contiguous bytes. I'd need a bit more research to figure this out.
Very interesting! I've only seen struct optimizations through field reordering to reduce binary offsets so far.