Since the 1990s, CPU performance has increased dramatically, while the performance of RAM chips has not kept up. To mitigate this gap, a caching mechanism was introduced.
During program execution, there is a good chance that the same code or data gets reused over short periods of time (temporal locality), and that the memory addresses being accessed are close to each other (spatial locality). Recognizing that this locality exists is key to the concept of CPU caches as we use them today.
A cache is a small but fast type of memory that stores recently accessed memory locations. There are several types of CPU caches: the data cache, the instruction cache, and the TLB (translation look-aside buffer), which caches the translation from a virtual page address to a physical page address.
In order to create efficient data structures and algorithms we should be aware of the cache and how to utilize it:
- Compact, localized code is going to be fastest.
- Compact, localized data structures that fit into the cache are going to be fastest.
- Data structure traversals that touch only cached data are going to be fastest.
Cache hierarchy
The cache is organized in several layers, where each cache layer is larger and slower than the previous one.
For example, on an Intel Core i7:
- L1 I-Cache (instruction cache): 32 KB per core.
- L1 D-Cache (data cache): 32 KB per core.
- L2 Cache: 256 KB per core.
- L3 Cache: 8 MB, shared between the cores.
The smallest unit of memory that is transferred between main memory and the cache is called a cache line.
When you read a byte from main memory, you actually read the entire cache line (64 bytes, for example) that contains that byte. When designing data structures or algorithms we should utilize as much of the cache line as possible. This is why row-major traversal is much faster than column-major traversal: with row-major traversal, accessing an element pulls in the entire cache line containing the subsequent elements, so while traversing we get a minimal number of cache misses because each cache line is utilized completely. Column-major traversal, on the other hand, has the potential of reading one element and discarding the rest of the cache line.
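As a minimal sketch of the difference (assuming an n-by-n matrix stored as a flat std::vector in row-major order; the function names are made up for this example):
#include <cstddef>
#include <vector>

// Sum all elements, visiting the matrix row by row.
// Each loaded cache line is fully consumed before moving on.
long long SumRowMajor(const std::vector<int>& m, size_t n) {
  long long sum = 0;
  for (size_t row = 0; row < n; ++row)
    for (size_t col = 0; col < n; ++col)
      sum += m[row * n + col];
  return sum;
}

// Same sum, visiting the matrix column by column.
// Consecutive accesses are n * sizeof(int) bytes apart, so most of
// each loaded cache line is evicted before it gets reused.
long long SumColMajor(const std::vector<int>& m, size_t n) {
  long long sum = 0;
  for (size_t col = 0; col < n; ++col)
    for (size_t row = 0; row < n; ++row)
      sum += m[row * n + col];
  return sum;
}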
The hardware detects access patterns and prefetches the next cache lines accordingly, so if you only use every second element in a row-major traversal, the speed will be pretty much the same. Predictable access patterns matter, as long as the pattern is linear and predictable.
From the hardware perspective, data structures that preserve locality and have predictable access patterns, such as arrays, are favorable. This also means that for some small enough N, traversing an array might be faster than searching a binary tree or a hash table.
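As a hedged sketch of such a comparison (the actual crossover point for N depends on the hardware and the element size):
#include <algorithm>
#include <set>
#include <vector>

// Linear scan over a contiguous array: predictable and prefetch-friendly.
bool ContainsLinear(const std::vector<int>& values, int key) {
  return std::find(values.begin(), values.end(), key) != values.end();
}

// Tree lookup: O(log N) comparisons, but each step is a pointer chase
// to a separately allocated node, i.e. a potential cache miss.
bool ContainsTree(const std::set<int>& values, int key) {
  return values.find(key) != values.end();
}
For a handful of elements, the O(N) scan can win simply because it stays within one or two cache lines.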
An Example
Let's explore the principles above using an example. Suppose we are writing a computer vision library with functionality to find matching points in a pair of images. Each point has a descriptor, and by comparing those descriptors we can find the most similar point in the second image.
We create a data structure that represents a point:
#include <array>
#include <cstddef>
#include <cstdint>

// A fixed-size descriptor attached to every point.
struct Descriptor {
  static constexpr size_t kDescriptorSize = 8;
  std::array<uint8_t, kDescriptorSize> data{0};
};

// A 2D image point with its descriptor.
struct Point {
  uint32_t x, y;
  Descriptor desc;
};
One way to organize the points is in a map, such that we could easily query a point by its coordinates in the image:
std::map<std::pair<uint32_t, uint32_t>, Point> points;
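For example (the coordinates here are arbitrary), inserting a point and querying it back by its coordinates could look like this:
Point p{120, 45, Descriptor{}};
points[{p.x, p.y}] = p;

// Look the point up again by its (x, y) coordinates.
auto it = points.find({120u, 45u});
if (it != points.end()) {
  const Point& found = it->second;
  // use found.desc to compare against a query descriptor
}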
In order to find the closest point, we compare the descriptors and look for the minimal L2 distance.
#include <cmath>
#include <limits>

// Euclidean (L2) distance between two descriptors.
float L2(const Descriptor& desc1, const Descriptor& desc2) {
  auto sqrd_diff{0.f};
  for (auto i(0u); i < Descriptor::kDescriptorSize; i++) {
    sqrd_diff += std::pow(desc1.data[i] - desc2.data[i], 2);
  }
  return std::sqrt(sqrd_diff);
}
// 'query' is the descriptor we are looking to match.
auto min_dist = std::numeric_limits<float>::max();
for (const auto& key_point : points) {
  auto l2 = L2(query, key_point.second.desc);
  min_dist = min_dist < l2 ? min_dist : l2;
}
When using this type of data structure, the points are not stored contiguously: each std::map node is allocated separately, so there is a high probability we get a lot of cache misses during the iterations of the loop.
Using the Google Benchmark library, here are the results of this implementation (with 65536 points):
Run on (4 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 1.42, 1.73, 1.97
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
CalculateL2CacheMissed/65536 8.93 ms 8.91 ms 68
We could consider a different implementation which makes better use of the cache by iterating over a contiguous block of memory. That way the hardware can prefetch the next cache line and the program spends less time waiting for memory.
Let's create an array of all descriptors, and a map between an index in the array and a point. We then iterate over the array of descriptors to find the minimal distance, and after all computation is done, return the selected point using the mapping above (a sketch of that last step follows the loop below).
#include <vector>

// All descriptors stored back to back in one contiguous block.
std::vector<Descriptor> vector_of_descriptors;

auto min_dist = std::numeric_limits<float>::max();
for (const auto& descriptor : vector_of_descriptors) {
  auto l2 = L2(query, descriptor);
  min_dist = min_dist < l2 ? min_dist : l2;
}
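To actually return the selected point, the loop also needs to remember which index produced the minimum. Here is a minimal sketch, assuming an index_to_point map (the name is illustrative) that was filled when the descriptors were pushed into the vector:
#include <unordered_map>

// Maps an index in vector_of_descriptors back to its Point.
std::unordered_map<size_t, Point> index_to_point;

auto min_dist = std::numeric_limits<float>::max();
size_t best_index = 0;
for (size_t i = 0; i < vector_of_descriptors.size(); ++i) {
  const auto l2 = L2(query, vector_of_descriptors[i]);
  if (l2 < min_dist) {
    min_dist = l2;
    best_index = i;
  }
}
// All the distance computations touch contiguous memory; only this
// final lookup goes back to the node-based container.
const Point& best_match = index_to_point.at(best_index);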
Comparing the performance of both implementations:
Run on (4 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 1.42, 1.73, 1.97
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
CalculateL2CacheMissed/65536 8.93 ms 8.91 ms 68
CalculateL2CacheHit/65536 4.42 ms 4.41 ms 157
By choosing a data structure that makes better use of the cache, we roughly double the performance in this case. To sum it up, computer architecture, and the caching mechanism specifically, is something to consider when the goal is to write efficient data structures and algorithms.
Below are some good resources that expand on this topic.