Armaan Chahal

Posted on Sep 6 • Originally published at amn.sh on Sep 4

Harmony Between OOP and Data-Oriented Design

#cpp #learning #architecture #softwareengineering

OOP For the Uninitiated

Data-Oriented Design (DOD) is easy to understand but hard to implement. It has no clear rules and no clear models to follow.
Object-Oriented Programming (OOP), by contrast, follows a model-the-world approach. You think about how something works in the real world,
and then you translate it into code. Take an example anyone can understand:
- You have a white chair
- You have a pink chair that is wider than the white one

Clearly, both of these things have something in common. They are both chairs!
This shared property is usually modeled with an abstract base class called Chair with properties such as width and color.
These properties can be overridden by the derived classes, such as PinkChair and WhiteChair.

#include <string>
#include <string_view>

class Chair {
public:
    virtual ~Chair() = default;
    virtual double getWidth() const = 0;
    virtual std::string_view getColor() const = 0;
};

class PinkChair : public Chair {
public:
    double getWidth() const override { return m_width; }
    std::string_view getColor() const override { return m_color; }
private:
    double m_width = 1.5;
    std::string m_color = "pink";
};

class WhiteChair : public Chair {
public:
    double getWidth() const override { return 1.0; }
    // The return type must match the base class, otherwise the override does not compile.
    std::string_view getColor() const override { return "white"; }
};

Inheritance is not a given of OOP, but it is a common pattern.

Something that OOP does emphasize is hidden data, better known as encapsulation.
This means that you expose functions and keep the implementation details and state under lock and key.
The client only uses your public API and generally has no idea of the underlying implementation.
Seems optimal, right?
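
As a rough illustration of that idea, here is a minimal sketch of a class whose state can only be reached through its public API. The Stool class and its validation rule are hypothetical and not part of the chair hierarchy above.

```cpp
#include <stdexcept>

// Hypothetical class for illustration: the width is private, and the only
// way to change it is through a function that validates the new value.
class Stool {
public:
    explicit Stool(double width) { setWidth(width); }

    double getWidth() const { return m_width; }

    void setWidth(double width) {
        if (width <= 0.0) {
            throw std::invalid_argument("width must be positive");
        }
        m_width = width;
    }

private:
    double m_width = 1.0;  // hidden state, reachable only via the public API
};
```

The client calls getWidth and setWidth and never learns how, or where, the width is stored.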

DOD For the Uninitiated

Data-Oriented Design takes a different approach, focusing on the data itself.
When you model your code, you think about the data and the transformations of that data.
This is a very abstract concept to grasp, especially when you are new to DOD.
You may be thinking, "Okay, but what does that actually mean in code?"

I had the same question. Let's revisit the Chair example from above.
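How would you store multiple chairs? You would likely do something like std::vector<Chair*>.
This is widespread in OOP, and it is a form of array of structures, or AOS: each element refers to one complete object.
Strictly speaking, a textbook AOS stores the objects by value in one contiguous array, while a vector of pointers adds an extra indirection on top.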

The alternative in DOD could look something like this:

struct Chairs final {
    std::vector<double> widths{};
    std::vector<std::string> colors{};
};


This is called a structure of arrays, or SOA.
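
To make the difference concrete, here is a minimal sketch that sums the widths in both layouts. ChairRecord, totalWidthAoS, and totalWidthSoA are hypothetical names chosen for illustration, and the AOS side stores plain structs by value rather than Chair pointers.

```cpp
#include <string>
#include <vector>

// Array of structures: each element carries every field.
struct ChairRecord {  // hypothetical name for illustration
    double width;
    std::string color;
};

// Structure of arrays: one contiguous array per field, as in Chairs above.
struct Chairs final {
    std::vector<double> widths{};
    std::vector<std::string> colors{};
};

// AOS: the loop steps over the colors even though it only reads widths.
double totalWidthAoS(const std::vector<ChairRecord>& chairs) {
    double total = 0.0;
    for (const ChairRecord& chair : chairs) {
        total += chair.width;
    }
    return total;
}

// SOA: the loop touches only the contiguous widths array.
double totalWidthSoA(const Chairs& chairs) {
    double total = 0.0;
    for (double width : chairs.widths) {
        total += width;
    }
    return total;
}
```

Both functions compute the same value; only the memory they walk through differs.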
As an aside, if you do a micro-benchmark on this, you will likely see varying results.
SOA is unlikely to be slower than AOS, but the two will probably be very close in terms of CPU cycles if you do something like this quick and dirty benchmark:

[Image: micro-benchmark comparing SOA and AOS]

In this micro-benchmark, SOA is a little faster than AOS, but we aren't using polymorphism on the AOS (Chair*).
Check out this talk for details about polymorphism and micro-benchmarks!

The Abstraction

Abstraction is often a point of confusion for people new to DOD and for traditional OOP programmers.
Are you even supposed to abstract? How do you abstract? Ultimately, the question should be:

"What is the philosophy of abstraction?"

The answer, as it always is with abstraction, is unclear. DOD purists will say that you should never abstract away the data,
and that the client should always be aware of the data transformations. This results in very traditional, C-style code. Consider this:

int* data = new int[100000];

for (int i = 0; i < 10000; i++) {
    data[i] = (i + 1) * -3;
}

Can you spot any problems?

They are as follows:

- The pointer to data is never deleted
- The range doesn't go through all the data (100000 vs 10000), though that could be intended

This may not seem like a big deal, but the problem scales as the code scales.

There are four (five) horsemen of the apocalypse in low-level programming:

- Bounds Safety
- Type Safety
- Lifetime Safety
- Initialization Safety
- (Undefined Behavior, sometimes)

The above example doesn't really account for any of these. At the very least it doesn't use malloc, but raw new should be avoided in modern C++, and the array's elements start out uninitialized, so initialization safety gets a big X.
Bounds safety is clearly not accounted for: we could copy-paste any bounds into the for loop. Lifetime safety is unaccounted for too: we could forget to delete the pointer.
Type safety may not seem like a problem here, but it could become one. What if you convert your data array to a different type, like uint8_t?
Sure, you may catch that in this example, but what if you are getting the data from a function that lives in a different file, maintained by a different team?

So, a better example is this:

std::vector<int> data(100000);

// A signed counter avoids mixing unsigned arithmetic with the negative multiplier.
int i = 0;
std::ranges::for_each(data, [&](auto& n) {
    n = ++i * -3;
});

This is much better and satisfies all the horsemen. However, this is not pure DOD. The for_each is an abstraction,
and though it is safer, it is hiding some of the code path/data transformation.
Consider if we wanted to accumulate all the values. For that, we could use a for loop, which is pure DOD,
but we could also use std::reduce or std::accumulate which, though safer, hides the code path and data transformation.
However, this is far better code. Why?
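
For reference, the two ways of accumulating mentioned above might look like this. It is only a sketch; sumWithLoop and sumWithAccumulate are hypothetical names, and the long long accumulator is an assumption made to leave headroom for large sums.

```cpp
#include <numeric>
#include <vector>

// Explicit loop: every step of the transformation is visible.
long long sumWithLoop(const std::vector<int>& data) {
    long long total = 0;
    for (int value : data) {
        total += value;
    }
    return total;
}

// std::accumulate: the same fold, stated declaratively. The initial value
// 0LL fixes the accumulator type to long long.
long long sumWithAccumulate(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}
```

The second version hides the loop, but the name says exactly which transformation is applied to the range.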

The Harmony

The number one piece of advice I would give you for when you decide to abstract is to always be aware of the code flow and check whether your
abstractions ever hide important data transformations that the client would otherwise be unaware of.

The next most important thing, and this could even be more important depending on what you're developing, is to be aware of the CPU.
DOD pays off largely because of the CPU cache. Check out this amazing talk by Scott Meyers for more details on the CPU cache.
If you want a brief explanation, the CPU cache is a small amount of memory that lives very close to the CPU.
It is the fastest memory to access, aside from registers, and is often used to store data that is accessed frequently.
How does the CPU decide what to keep in the cache? It relies on a property of programs called locality of reference, which consists of temporal locality and spatial locality.
Temporal locality is a program's tendency to access the same data repeatedly in a short time frame.
Spatial locality is a program's tendency to access data located near data it has already accessed (think contiguous memory in an array).
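
Spatial locality is easiest to see with a traversal order. The sketch below walks the same contiguous grid in two orders; Grid and both function names are hypothetical, and no timings are claimed here because they depend on the hardware and the grid size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major grid stored in one contiguous vector.
struct Grid {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> cells;  // cell (r, c) lives at index r * cols + c
};

// Visits memory in storage order: consecutive iterations touch neighbors.
long long sumRowByRow(const Grid& grid) {
    long long total = 0;
    for (std::size_t r = 0; r < grid.rows; ++r) {
        for (std::size_t c = 0; c < grid.cols; ++c) {
            total += grid.cells[r * grid.cols + c];
        }
    }
    return total;
}

// Same result, but each inner step jumps cols elements ahead in memory.
long long sumColumnByColumn(const Grid& grid) {
    long long total = 0;
    for (std::size_t c = 0; c < grid.cols; ++c) {
        for (std::size_t r = 0; r < grid.rows; ++r) {
            total += grid.cells[r * grid.cols + c];
        }
    }
    return total;
}
```

Both functions return the same sum. The row-by-row version has the better spatial locality because each access lands right next to the previous one.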

The CPU cache brings in data one cache line at a time, and a cache line is 64 bytes on most platforms. Whenever you access data,
remember that you are pulling in a whole cache line, so don't waste that limited space. (If you're wondering what happens when you access data larger than a cache line, the CPU uses multiple lines.)
This is why SOA works so well: it makes full use of each cache line. If you only need the colors of a Chair, you don't need to waste the cache line on the widths.
Keep in mind that a memory access that misses the cache can be far slower than an arithmetic operation such as division or square root.
Optimize for the hardware, as Mike Acton would say in his great talk on DOD.
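
The cache-line argument can be turned into simple arithmetic. This is a sketch that assumes a 64-byte line; ChairRecord and elementsPerCacheLine are hypothetical names, and the exact size of std::string depends on the standard library in use.

```cpp
#include <cstddef>
#include <string>

// Assumption: a 64-byte cache line, the common size mentioned above.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical AOS element holding both fields of a chair.
struct ChairRecord {
    double width;
    std::string color;
};

// How many whole elements of type T fit in one cache line.
template <typename T>
constexpr std::size_t elementsPerCacheLine() {
    return kCacheLineBytes / sizeof(T);
}

// SOA: a line loaded from the widths array holds only widths.
constexpr std::size_t kWidthsPerLine = elementsPerCacheLine<double>();

// AOS: each element also carries the color, so fewer widths arrive per line.
constexpr std::size_t kRecordsPerLine = elementsPerCacheLine<ChairRecord>();
```

On a typical platform where a double takes 8 bytes, a loop over the widths array gets 8 useful values per line, while a loop over ChairRecord elements gets far fewer because most of each line is taken up by the color.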

Going back to the harmony, OOP is not mutually exclusive with DOD.
You can, and should, use both. DOD does not mean using only SOA. It means that in performance-critical code paths, you
write code while being aware of the hardware. Ultimately, your code always executes on hardware, and you have to be aware of what is happening there.
That is the only way to write good, performant code.

Using traditional design patterns is not forbidden, and some of them are still useful. Sure, a std::variant isn't always the optimal way to have polymorphic data,
but it is still useful and a good abstraction. It may be just as good as a more DOD-style struct of function pointers that are initialized at runtime or compile time.
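
Here is a minimal sketch of the std::variant approach, assuming C++17. The plain PinkChair and WhiteChair structs, the AnyChair alias, and the helper functions are hypothetical and separate from the class hierarchy shown earlier.

```cpp
#include <string_view>
#include <variant>
#include <vector>

// Hypothetical plain structs with no shared base class.
struct PinkChair {
    double width = 1.5;
};
struct WhiteChair {
    double width = 1.0;
};

// A closed set of alternatives stored by value, so a vector of them is contiguous.
using AnyChair = std::variant<PinkChair, WhiteChair>;

// Visitor that maps each alternative to its color.
struct ColorOf {
    std::string_view operator()(const PinkChair&) const { return "pink"; }
    std::string_view operator()(const WhiteChair&) const { return "white"; }
};

std::string_view colorOf(const AnyChair& chair) {
    return std::visit(ColorOf{}, chair);
}

double totalWidth(const std::vector<AnyChair>& chairs) {
    double total = 0.0;
    for (const AnyChair& chair : chairs) {
        // Both alternatives expose a width member, so a generic lambda works.
        total += std::visit([](const auto& c) { return c.width; }, chair);
    }
    return total;
}
```

Compared with std::vector<Chair*>, the chairs live inside the vector's own storage, so there is no pointer to chase per element. The trade-off is that the set of chair types is fixed at compile time.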

Conclusion

DOD is not a silver bullet. But it is a mindset that everyone should have. Even in the world of web development,
having fast websites is still important. Performance should be a top priority for everyone.
Back in the day, people wrote code knowing roughly what every line was doing and how many CPU cycles it took.
Nowadays, we use tools with higher-level abstractions. This isn't wrong. This is a good step. Otherwise, all performance-critical code would still be written in assembly.
Of course, that isn't the case. A core philosophy of C++ is Zero-Overhead Abstractions. Rust has Zero-Cost Abstractions, which means the same thing in essence. Use abstractions when they are more declarative and have clear data transformations (e.g., accumulate should clearly just loop through a range and add up the values).

On the topic of encapsulation, it is still necessary to encapsulate your data for the sake of separation of data, though encapsulation in DOD is different from OOP. You're not encapsulating or abstracting for the sake of hiding data; you're doing it so that your data can scale up as your application scales up.
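
One way to picture this kind of encapsulation is a thin wrapper around the SOA layout. This is a sketch under assumptions: ChairStore and its member functions are hypothetical names, not an established pattern from any library.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper: the arrays are private so every insertion goes
// through add(), while read-only views keep the data layout visible.
class ChairStore {
public:
    // Appends one chair and returns its index in both arrays.
    std::size_t add(double width, std::string color) {
        m_widths.push_back(width);
        m_colors.push_back(std::move(color));
        return m_widths.size() - 1;
    }

    std::size_t size() const { return m_widths.size(); }

    // Contiguous, read-only access for hot loops.
    const std::vector<double>& widths() const { return m_widths; }
    const std::vector<std::string>& colors() const { return m_colors; }

private:
    std::vector<double> m_widths;
    std::vector<std::string> m_colors;
};
```

The wrapper does not hide how the data is laid out. It only gives the arrays a single place to grow from, which is what makes it easier to add fields or change storage later.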

DEV Community