The Surprising Benchmark
Imagine you and I are given a task: a file with 10,000 words in it - count the occurrences of each word and present the top 10 most common ones.
We come up with some ideas, write them out in an IDE, and now we benchmark them to see how blazingly fast our API is.
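Here is a rough sketch of the three ideas being benchmarked below. This is my reconstruction, not the exact benchmarked code - the function names and the return shape are assumptions. One version counts with a HashMap, one keeps a sorted Vec and binary-searches it, and one scans a Vec linearly.
use std::collections::HashMap;

// Sketch of the hashing approach: one HashMap lookup per word.
fn count_hashing(words: &[String]) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in words {
        *counts.entry(word.as_str()).or_insert(0) += 1;
    }
    let mut top: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(w, c)| (w.to_string(), c))
        .collect();
    top.sort_by(|a, b| b.1.cmp(&a.1)); // most common first
    top.truncate(10);
    top
}

// Sketch of the binary-search approach: keep a sorted Vec of (word, count)
// and binary-search it for every incoming word.
fn count_binary(words: &[String]) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> = Vec::new();
    for word in words {
        match counts.binary_search_by(|(w, _)| w.as_str().cmp(word)) {
            Ok(i) => counts[i].1 += 1,
            Err(i) => counts.insert(i, (word.clone(), 1)),
        }
    }
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    counts.truncate(10);
    counts
}

// Sketch of the linear approach: scan the whole Vec for every incoming word.
fn count_linear(words: &[String]) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> = Vec::new();
    for word in words {
        match counts.iter_mut().find(|(w, _)| w == word) {
            Some(entry) => entry.1 += 1,
            None => counts.push((word.clone(), 1)),
        }
    }
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    counts.truncate(10);
    counts
}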
word_counting_hashing time: [4.6504 ms 4.6591 ms 4.6686 ms]
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
word_counting_binary time: [29.797 ms 30.042 ms 30.304 ms]
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild
Benchmarking word_counting_linear: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.0s, or reduce sample count to 40.
word_counting_linear time: [120.82 ms 121.11 ms 121.39 ms]
end_to_end time: [4.8724 ms 4.8830 ms 4.8946 ms]
change: [−0.3363% +0.9093% +1.7798%] (p = 0.09 > 0.05)
No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
7 (7.00%) high mild
5 (5.00%) high severe
In essence:
- Linear: 121 ms
- Binary: 30 ms (4x improvement, right?)
- HashMap: 4.65 ms - What?! But how?
Fantastic question! Before I answer the how, we can both agree that binary search is still not bad, right? After all, it performs up to log(N) iterations rather than up to N.
So what's the deal here?
When you hear the word "cache", what's the first thing that comes to mind?
Browser's cache and cookies?
A physical safe-like box where you store something important so you can retrieve it later?
Something else?...
In any case, the idea is pretty much the same: a place to put something so it can be retrieved easily later.
In the technical sense, 'cache-friendly' code means structuring your data so the CPU can access it efficiently.
The difference? Milliseconds versus seconds.
Now, the secret lies in how CPUs actually access memory.
Let me show you what's happening under the hood - and why it matters for every Rust program you write.
How CPU Cache Actually Works
Cache Hierarchy
The cache is not just one area of fast storage for later retrieval. There are three levels, each with its own characteristics:
- L1: the fastest yet smallest (per core, typically 32 - 64 KB)
- L2: slower than L1, yet still much faster than RAM (on modern architectures also on the CPU die; roughly 512 KB - 2 MB per core)
- L3: typically shared between several CPU cores (commonly a few MB up to a few tens of MB)
- RAM: the slowest, yet the largest (nowadays dozens of GB); you typically want to avoid keeping all your hot operational data only here - every trip to RAM hurts performance
The better-designed your data access patterns are, the more of your working data stays close to L1 - giving you the highest possible throughput in the shortest possible time.
Cache Lines - The 64-Byte Unit
When the CPU fetches data to process (for example, while scanning the contents of a file), it does not grab that data byte by byte. It loads something called a cache line - a 64-byte chunk of memory containing the requested data together with its immediate neighbors - so it can be worked through quickly.
Since this chunk is only 64 bytes, it fits comfortably in L1, so working through it is practically instant (we are talking on the order of nanoseconds).
Take a concrete example: a vector of u32 integers, each 4 bytes wide (32 bits / 8 = 4 bytes). When you iterate over it, the CPU loads 64-byte cache lines, each holding a contiguous slice of the vector - up to 64 / 4 = 16 elements per cache line.
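As a quick sanity check (assuming the typical 64-byte cache line of mainstream x86-64 and ARM CPUs), you can compute how many elements fit per line:
use std::mem::size_of;

fn main() {
    // Assuming a 64-byte cache line (typical for mainstream CPUs).
    const CACHE_LINE_BYTES: usize = 64;
    println!("u32 per line: {}", CACHE_LINE_BYTES / size_of::<u32>()); // 16
    println!("u64 per line: {}", CACHE_LINE_BYTES / size_of::<u64>()); // 8
    println!("u8  per line: {}", CACHE_LINE_BYTES / size_of::<u8>());  // 64
}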
Spatial Locality
Now, when it comes to cache optimization, there are two kinds of locality that describe how data is located and accessed in memory (and therefore in the cache):
- Spatial locality
- Temporal locality
Spatial locality refers to data that sits next to each other in memory (like the elements of a vector). Since cache lines are loaded as contiguous 64-byte chunks, each load brings in the next batch of elements you are about to iterate through.
If you remember, our benchmark showed fairly decent latency for both the linear and the binary approaches to iterating over the words and counting their occurrences.
The reason is spatial locality. The CPU loads 64-byte chunks of the vector, processing multiple words per cache line instead of fetching each word individually. This gives us a high rate of so-called "cache hits" and avoids reloading a cache line for every single element.
Temporal Locality
Temporal locality means accessing the same memory location repeatedly over time. If you keep hitting the same data (each access being a cache hit), it stays hot in the cache.
The opposite is called a cache miss.
On a cache miss, the CPU has to fetch a fresh 64-byte cache line containing the requested data, usually evicting an existing line to make room for it.
Looking back at our benchmarks, the HashMap approach showed a dramatic jump in performance on the same word-counting task.
Let's dig deeper into why that is.
In this particular case, the text contained a lot of repeated common words such as "the", "a", "most", and so on.
A HashMap can be viewed as a table of memory buckets; each bucket holds the key together with its associated value (for String keys, the actual text lives on the heap, with the bucket holding a pointer to it).
The first time we look up a given word, we likely get a cache miss, and the bucket holding that key and its count is loaded into a cache line.
The more often the same word shows up again, the hotter that cache line stays - it keeps getting reused instead of evicted - and that is the key advantage of this approach here.
Sure, you could stick with an algorithm that favours spatial locality and walk through neighboring elements instead, but that would keep loading fresh cache lines even when we encounter the same words over and over.
Issue with the Binary Search
Even though binary search finished in around 30 ms, it is worth asking when this algorithm is actually beneficial and when it becomes an overhead for the CPU and for memory access.
Think about it: assuming a large vector, we probe the element in the middle of the collection. Depending on whether the target is greater or less than that middle element, we jump to a completely different memory region, load another cache line, and the procedure repeats.
Imagine how many jumps across various memory regions we need to perform and how many cache misses we encounter...
Sure - in terms of time complexity we still beat the linear approach. But is that always the case?
Sometimes the CPU is happier being fed a single vector and scanning it linearly, cache line by cache line, until the value is found.
In that case we only load a new cache line once the current one has been fully scanned without finding the result.
In essence, nearly every step in binary search triggers a cache miss because each comparison jumps to a distant memory location, loading a fresh cache line while discarding the previous one.
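To make those jumps concrete, here is a small illustrative sketch (my own example, not the benchmarked code) that records which indices a binary search probes in a large sorted collection - each probe lands far from the previous one, so each one likely touches a different cache line.
// Illustrative sketch: record the indices probed while binary-searching
// for the element at `target_pos` in a sorted collection of length `len`.
fn probed_indices(len: usize, target_pos: usize) -> Vec<usize> {
    let (mut lo, mut hi) = (0, len);
    let mut probes = Vec::new();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        probes.push(mid);
        if mid < target_pos {
            lo = mid + 1;
        } else if mid > target_pos {
            hi = mid;
        } else {
            break;
        }
    }
    probes
}

fn main() {
    // Searching near the start of a million-element collection:
    // the probes bounce across the whole index range before converging.
    println!("{:?}", probed_indices(1_000_000, 3));
}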
Cache Coherency and False Sharing
Now that we understand how the cache behaves in these simple word-counting examples, we still need to understand what happens when multiple threads are involved.
Let's take a look at a more complex scenario: blurring an image using multiple threads.
(Link to the project's slightly-messy GitHub repo here)
Cache Coherency: The Foundation
Imagine you want to speed up image processing by using multiple CPU cores. You split the work:
- Thread 1 (Core 1): Process the red channel
- Thread 2 (Core 2): Process the green channel
- Thread 3 (Core 3): Process the blue channel
Sounds perfect, right? But there's a catch.
Each CPU core has its own L1 cache. When Core 1 loads some red channel data into its L1 cache and modifies it, the other cores need to know about this change. Otherwise, Core 2 might read stale data that's no longer correct.
This is cache coherency - the mechanism that keeps all the caches in sync across multiple cores.
Here's what happens when Core 1 modifies data:
- Core 1 writes to its L1 cache
- The cache line is marked as "modified"
- Other cores that have this cache line must invalidate their copies
- The next time Core 2 needs that data, it must fetch the updated version
In essence, cache coherency ensures correctness in multi-threaded programs, but it comes with a performance cost - cores must communicate and synchronize their caches.
False Sharing: When Cache Lines Betray You
Now here's where things get tricky. Remember that the CPU loads entire 64-byte cache lines, not individual variables?
False sharing happens when two or more threads modify DIFFERENT variables that happen to live in the SAME cache line.
Here's a concrete example. Suppose we're tracking progress from multiple threads:
struct ThreadCounters {
thread1_count: u64, // Bytes 0-7
thread2_count: u64, // Bytes 8-15
thread3_count: u64, // Bytes 16-23
}
// All three counters fit in ONE 64-byte cache line!
Now watch what happens:
- Core 1 increments thread1_count → invalidates the cache line on Core 2 and Core 3
- Core 2 increments thread2_count → invalidates the cache line on Core 1 and Core 3
- Core 3 increments thread3_count → invalidates the cache line on Core 1 and Core 2
The threads aren't sharing data logically (each has its own counter),
but they're sharing a cache line physically. Hence the name: false sharing.
The result? Cache ping-pong.
Performance tanks even though the threads are supposedly independent.
The solution? Force each counter into its own cache line:
#[repr(align(64))]
struct CachePadded<T> {
value: T,
}
struct ThreadCounters {
thread1_count: CachePadded<u64>, // Gets its own 64-byte line
thread2_count: CachePadded<u64>, // Gets its own 64-byte line
thread3_count: CachePadded<u64>, // Gets its own 64-byte line
}
Now each thread can work independently without invalidating the others' cache lines.
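As a minimal usage sketch (my own example, not the blur project's code; it uses atomics so the counters can safely be touched from several threads):
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Each counter gets its own 64-byte-aligned slot, so the worker threads
// no longer invalidate each other's cache lines while incrementing.
#[repr(align(64))]
struct CachePadded<T> {
    value: T,
}

fn main() {
    let counters: [CachePadded<AtomicU64>; 3] = [
        CachePadded { value: AtomicU64::new(0) },
        CachePadded { value: AtomicU64::new(0) },
        CachePadded { value: AtomicU64::new(0) },
    ];

    thread::scope(|s| {
        for counter in &counters {
            s.spawn(move || {
                for _ in 0..1_000_000 {
                    // Relaxed is enough: each thread only touches its own counter.
                    counter.value.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    for (i, c) in counters.iter().enumerate() {
        println!("counter {i}: {}", c.value.load(Ordering::Relaxed));
    }
}
In real code you would typically reach for crossbeam_utils::CachePadded instead of hand-rolling the padded wrapper.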
How This Relates to Image Blur
You might be wondering: does our image blur code suffer from false sharing?
Actually, no! Here's why:
struct Image {
red: Vec<u8>,
green: Vec<u8>,
blue: Vec<u8>,
}
Each Vec is a separate heap allocation. When Thread 1 processes the red channel, it's modifying memory in a completely different region from where Thread 2 is processing green. They're not sharing cache lines.
But - and this is important - the choice between Array of Structs (AoS) and Struct of Arrays (SoA) does affect cache behavior in a single-threaded context, which is what we'll explore next.
Now that we've covered both cache coherency (how cores stay in sync) and false sharing (when proximity hurts), let's see how data layout choices affect our image blur performance.
Image Blur Optimization
A note to keep in mind before we dive into each approach.
The algorithm we will use as the foundation for blurring the image is a 3x3 box blur.
Each pixel's new value is the sum of that pixel and all 8 of its neighbors, divided by 9 (the number of pixels in the window).
For simplicity, we will not wrap around the edges toroidally to blur the border pixels - we will simply return them as they are.
For reference, see the latest benchmark results below (some may not seem intuitive, but we will cover them in a moment).
blur_naive time: [2.0894 ms 2.0918 ms 2.0942 ms]
change: [−0.2921% −0.1217% +0.0394%] (p = 0.15 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
blur_cache_optimized time: [2.2716 ms 2.2772 ms 2.2842 ms]
change: [−18.085% −17.836% −17.549%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
Benchmarking blur_separable: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
blur_separable time: [1.8354 ms 1.8374 ms 1.8395 ms]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
To simplify (on average):
- Naive: 2.09 ms
- Cache-optimized (standard): 2.27 ms
- Cache-optimized (separable filtering): 1.84 ms
Naive approach
For the naive approach, we went with an AoS data layout (Array of Structs).
This is what it looks like:
struct Pixel {
r: u8,
g: u8,
b: u8,
}
type ImageAoS = Vec<Pixel>;
That means that to blur the image, we loop over the vector once, read each pixel's red / green / blue values, and return a new vector of blurred pixel data.
The function signature:
fn blur_naive(image: &[Pixel]) -> Vec<Pixel> { /* the core logic */ }
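The body is elided in the post, but a minimal sketch of a 3x3 box blur over this layout might look like the following (the width/height parameters and the edge handling are my assumptions):
// Hypothetical sketch of the naive 3x3 box blur over the AoS layout above.
// Edge pixels are copied through unchanged, as described earlier.
fn blur_naive_sketch(image: &[Pixel], width: usize, height: usize) -> Vec<Pixel> {
    let mut out: Vec<Pixel> = image
        .iter()
        .map(|p| Pixel { r: p.r, g: p.g, b: p.b })
        .collect();

    for y in 1..height - 1 {
        for x in 1..width - 1 {
            let (mut r, mut g, mut b) = (0u32, 0u32, 0u32);
            // Sum the pixel and its 8 neighbors.
            for dy in -1i32..=1 {
                for dx in -1i32..=1 {
                    let idx = (y as i32 + dy) as usize * width + (x as i32 + dx) as usize;
                    r += image[idx].r as u32;
                    g += image[idx].g as u32;
                    b += image[idx].b as u32;
                }
            }
            out[y * width + x] = Pixel {
                r: (r / 9) as u8,
                g: (g / 9) as u8,
                b: (b / 9) as u8,
            };
        }
    }
    out
}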
Let's look at the pros and cons of this approach
Pros
- a single loop -> one sequential pass through memory, loading one 64-byte cache line at a time
- simple design
- Locality within a pixel: r → g → b of the same pixel are adjacent bytes, read back to back from the same cache line
- Cache line holds all three channels for ~21 pixels
Cons
- locating the neighbors: since we are dealing with a 1D vector, identifying the vertical and horizontal neighbors requires some index arithmetic (based on the provided width and height arguments)
- non-contiguous channel data: while the pixels as a whole are laid out contiguously in memory, the values of a single channel are not
Cache-optimized (standard)
Logically, if we want to process one entire channel at a time (all reds, then all greens, and so on), we want an SoA layout (Struct of Arrays).
As a gentle reminder from our False Sharing section, here is what that looks like:
#[derive(Debug)]
pub struct Image {
red: Vec<u8>,
green: Vec<u8>,
blue: Vec<u8>,
}
Each field is a vector of all the relevant channel's pixel values.
Let's take a quick look at the function signature:
fn blur_cache_optimized(image: &Image, width: usize, height: usize) -> Image { /* the core logic */ }
If we want to respect the intent of processing one channel at a time, we cannot have a single loop juggle THREE distinct vectors.
Not only would that mean hopping between three different memory regions on every iteration, but the function call would also show up as noticeably longer if we profiled it in a flamegraph.
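Sketched out (again my reconstruction, not the project's exact code), the per-channel version might look like this:
// Hypothetical sketch: blur a single channel, then apply it to each of the
// three channel vectors in its own pass.
fn blur_channel_3x3(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = src.to_vec();
    for y in 1..height - 1 {
        for x in 1..width - 1 {
            let mut sum = 0u32;
            for dy in -1i32..=1 {
                for dx in -1i32..=1 {
                    let idx = (y as i32 + dy) as usize * width + (x as i32 + dx) as usize;
                    sum += src[idx] as u32;
                }
            }
            out[y * width + x] = (sum / 9) as u8;
        }
    }
    out
}

fn blur_cache_optimized_sketch(image: &Image, width: usize, height: usize) -> Image {
    // One full pass (and one loop) per channel.
    Image {
        red: blur_channel_3x3(&image.red, width, height),
        green: blur_channel_3x3(&image.green, width, height),
        blue: blur_channel_3x3(&image.blue, width, height),
    }
}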
Let's look at some pros and cons:
Pros
- purely sequential access: each channel's loop loads one 64-byte cache line after another
- Spatial locality: within one channel vector, all elements are contiguous
- each cache line holds values from a single channel only (64 u8 values per line)
Cons
- three loops: instead of a single loop, we need three loops to handle each channel separately - our project is single-threaded, so they run one after another
- worse performance: going by the benchmark results above, the average latency is roughly 200 microseconds worse than the naive version - contradicting the original intent
Regarding that contradiction: it seems we overlooked another piece of the puzzle - one that doesn't look directly related, yet greatly affects the outcome and ties together all the components discussed so far.
Separable filtering
To understand what we mean by separable filtering, it helps to look at how a professional-grade implementation achieves the same goal.
Let's look at the source code of an actual, publicly available crate: imageproc.
What the author applies there aligns with separation of concerns: each piece of logic performs one task AND ONLY that task, without interfering with the underlying logic of any other component.
Concretely, the main function is composed of two components: filter_horizontal() (in our case, blur_horizontal()) and filter_vertical() (in our case, blur_vertical()).
Going back to the standard cache-optimized approach: there we still had to compute each average pixel value by adding up 9 elements.
Here, however, each pass only needs to average 3 values (horizontally: left-center-right; vertically: top-center-bottom).
Put together, not only do we touch 6 elements per pixel instead of 9, but we also jump across far fewer scattered memory regions than in the previous approach - hence the roughly quarter-millisecond improvement over the naive version.
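This is not imageproc's actual code, but a minimal sketch of the separable idea applied to one channel (the helper names mirror the ones mentioned above; the exact signatures are my assumptions):
// Horizontal pass: average left-center-right. All three values are adjacent
// in memory, so they almost always sit in the same cache line.
fn blur_horizontal(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = src.to_vec();
    for y in 0..height {
        for x in 1..width - 1 {
            let i = y * width + x;
            let sum = src[i - 1] as u32 + src[i] as u32 + src[i + 1] as u32;
            out[i] = (sum / 3) as u8;
        }
    }
    out
}

// Vertical pass: average top-center-bottom. The accesses are one row apart,
// but the stride is fixed and predictable, which hardware prefetchers handle well.
fn blur_vertical(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = src.to_vec();
    for y in 1..height - 1 {
        for x in 0..width {
            let i = y * width + x;
            let sum = src[i - width] as u32 + src[i] as u32 + src[i + width] as u32;
            out[i] = (sum / 3) as u8;
        }
    }
    out
}

// The full blur for one channel is just the two passes composed.
fn blur_separable_channel(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    blur_vertical(&blur_horizontal(src, width, height), width, height)
}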
So what does all of this tell us about caching, false sharing, and deciding which approach works best, and when?
Results & Lessons
Give me them numbers!
Word counting
- HashMap: 4.6 ms avg
- Vector - Binary search: 30 ms avg
- Vector - Linear: 121 ms avg
Image blurring
- Naive (AoS): 2.09 ms avg
- Cache-optimized (SoA, standard): 2.27 ms avg
- Cache-optimized (SoA, separable filtering): 1.84 ms avg
As you will see below, these results aren't only about how cache-friendly our data structure is; they are just as much about how our logic interacts with it and how the code the Rust compiler generates ends up touching that data at runtime.
What do the Rust compiler and our algorithm choices have to do with cache-friendliness?
If we are dealing with data that contains a great deal of repeated values, a HashMap is the most suitable choice: the hot buckets keep getting reused, so the cache lines holding them stay warm.
That is not to say linear or binary search should be ignored. Either approach is feasible when lookups are infrequent and/or the data set is small - then it becomes a question of whether those infrequent scans are worth speeding up at all (O(log N) vs O(N) time complexity).
The image-blurring project raises the same concern from a different angle that is equally important to keep in mind.
The benchmark results demonstrated that the choice of algorithm matters at least as much as picking between AoS and SoA - choosing a data structure is one part of the story; how our API traverses that data structure is the other.
What can I learn from all this?
Let me briefly go over the key lessons I cemented over the course of weeks 6 and 7 while covering cache coherence, cache optimisation, and how to interpret the relevant benchmark results:
Always measure, never assume - as practice showed twice in my case here, an approach that looks solid on paper will not always perform that way in the compiled binary. Code up each hypothesis and test it.
Study professional implementations (like imageproc's separable filter) - there's a reason why certain implementations work the way they do. Studying production code (like Rust crates on crates.io) teaches you patterns you wouldn't discover on your own.
Test across different scenarios (small vs large datasets, different patterns) - HashMap dominated my word-counting benchmark because the text had many duplicates. But what if the data were different - unique formulas, random UUIDs, or different sizes? The optimal choice changes with the workload. Test your actual use case.
Data structure choice isn't just about API ergonomics - it directly controls memory layout, which determines cache behavior. This makes it a first-class performance consideration, not an afterthought.
The North Star
If you want your Rust code to be as blazingly fast as some people online market it to be, you need to stay constantly aware of how your code actually interacts with the CPU and memory - yours AND those of anyone you distribute it to. Let that be your north star when picking the right algorithm, the right data structure, and the right optimisations.
No matter where you are in your learning journey of Rust programming (or any systems programming language) - you are in charge of how you want your code to perform.
No matter what results you see right now, there will always be at least one more way to reshape the API and shave off those extra milliseconds or microseconds. Your users (and your future debugging self) will thank you for it!