We all know the Big-O complexity of basic data structures. Arrays are O(n) for search. Hash maps are O(1). Linked lists are... well, complicated. But when I set out to build hashbrowns, a C++17 benchmarking suite comparing arrays, linked lists, and hash maps, I discovered that theory and practice are very different beasts.
Here's what I learned building this project from scratch, and why you should probably benchmark before you optimize.
The Goal Was Simple (Ha!)
I wanted a clean, educational project that would:
- Implement dynamic arrays, linked lists, and hash maps from scratch
- Benchmark insert, search, and remove operations
- Find the "crossover points" where one structure beats another
- Export everything to CSV for analysis
Sounds straightforward, right? Four months later, I had written a custom memory tracker, implemented multiple hash map strategies, added statistical bootstrapping for confidence intervals, and learned more about CPU caches than I ever wanted to know.
Lesson 1: Polymorphism Has a Price (But It's Worth It)
My first architectural decision was creating a common DataStructure interface:
#include <cstddef>
#include <string>

class DataStructure {
public:
    virtual ~DataStructure() = default;  // polymorphic base needs a virtual destructor

    virtual void insert(int key, const std::string& value) = 0;
    virtual bool search(int key, std::string& value) const = 0;
    virtual bool remove(int key) = 0;
    virtual size_t memory_usage() const = 0;
    virtual std::string type_name() const = 0;
    // ...
};
This made benchmarking elegant, because I could write generic code that tested any data structure:
for (auto& structure : structures) {
    timer.start();
    structure->insert(key, value);
    timer.stop();
}
But virtual function calls have overhead. In tight loops, that vtable lookup adds up. I spent a whole weekend convinced my hash map was slower than expected... until I realized I was measuring the cost of polymorphism, not the data structure itself.
The fix? I kept the clean interface for the benchmarking harness but used templates internally where performance-critical code needed direct calls. The polymorphic interface was still worth it for maintainability and adding new structures easily.
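In practice that split looks something like this. It's only a sketch of the pattern (the function name is made up, not from hashbrowns): the harness holds DataStructure pointers, while the hot loop is templated on the concrete type so the compiler can inline the calls.

#include <cstddef>
#include <string>

// Illustrative sketch: a hot loop templated on the concrete structure type.
// When Structure is a concrete class, these calls are direct (and inlinable),
// with no vtable lookup inside the inner loop.
template <typename Structure>
void run_insert_loop(Structure& s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        s.insert(static_cast<int>(i), "value");
    }
}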
Lesson 2: Benchmarking Is Harder Than It Looks
My first timer was naive:
auto start = std::chrono::high_resolution_clock::now();
do_operation();
auto end = std::chrono::high_resolution_clock::now();
The numbers were all over the place. Some runs were 10x faster than others. What was going on?
Problem 1: Warm-up matters. The first few runs are always slower because CPU caches are cold and the branch predictor hasn't learned patterns yet. I added configurable warm-up runs:
#include <chrono>
#include <cstddef>
#include <thread>

template<typename Func>
void warmup(size_t warmup_count, Func&& operation) {
    // Run the operation a few times so caches and the branch predictor settle.
    for (size_t i = 0; i < warmup_count; ++i) {
        operation();
    }
    // Short pause between the warm-up and the measured runs.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
Problem 2: Outliers destroy your mean. That one run where your OS decided to run garbage collection? It'll skew everything. I implemented automatic outlier detection using Z-scores:
// Remove samples more than 2 standard deviations from mean
std::vector<duration> remove_outliers(const std::vector<duration>& data) const;
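Here's roughly what that looks like as a free function. It's a minimal sketch assuming duration is a floating-point std::chrono duration, not the exact hashbrowns implementation:

#include <chrono>
#include <cmath>
#include <vector>

using duration = std::chrono::duration<double, std::nano>;

// Z-score filtering sketch: drop samples more than 2 standard deviations
// from the mean. Illustrative only; the real implementation may differ.
std::vector<duration> remove_outliers(const std::vector<duration>& data) {
    if (data.size() < 3) return data;  // too few samples to judge outliers

    double sum = 0.0;
    for (const auto& d : data) sum += d.count();
    const double mean = sum / data.size();

    double sq_sum = 0.0;
    for (const auto& d : data) {
        const double diff = d.count() - mean;
        sq_sum += diff * diff;
    }
    const double stddev = std::sqrt(sq_sum / data.size());

    std::vector<duration> kept;
    for (const auto& d : data) {
        if (stddev == 0.0 || std::abs(d.count() - mean) <= 2.0 * stddev) {
            kept.push_back(d);
        }
    }
    return kept;
}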
Problem 3: CPU frequency scaling. Modern CPUs boost and throttle constantly. I added options to pin CPU affinity and (on Linux) attempt to lock the CPU governor.
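On Linux, pinning the current thread to one core is only a few lines with the POSIX scheduling API. A sketch of the idea (not necessarily how hashbrowns does it):

#ifdef __linux__
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for CPU_ZERO / CPU_SET and sched_setaffinity
#endif
#include <sched.h>

// Pin the calling thread to a single CPU so core migrations and per-core
// frequency differences perturb the measurements less.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
}
#endif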
Problem 4: You need more than the mean. Reporting mean ± stddev isn't enough for real analysis. I ended up implementing:
- Median and P95 percentiles
- Bootstrap confidence intervals
- Multiple output formats (CSV and JSON)
The JSON output now includes hardware metadata, git commit SHA, and the seed used for random patterns, because reproducibility matters.
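If you're curious what the bootstrap part involves: resample the measurements with replacement many times, take the mean of each resample, and read the confidence interval off the spread of those means. A rough sketch (parameter names and defaults are mine, not hashbrowns'):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Percentile bootstrap for the mean: resample with replacement, take the mean
// of each resample, and use the 2.5th / 97.5th percentiles as a ~95% CI.
std::pair<double, double> bootstrap_ci_mean(const std::vector<double>& samples,
                                            std::size_t resamples = 10000,
                                            std::uint64_t seed = 42) {
    if (samples.empty()) return {0.0, 0.0};

    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, samples.size() - 1);

    std::vector<double> means;
    means.reserve(resamples);
    for (std::size_t r = 0; r < resamples; ++r) {
        double sum = 0.0;
        for (std::size_t i = 0; i < samples.size(); ++i) {
            sum += samples[pick(rng)];
        }
        means.push_back(sum / samples.size());
    }

    std::sort(means.begin(), means.end());
    const double lo = means[static_cast<std::size_t>(0.025 * (resamples - 1))];
    const double hi = means[static_cast<std::size_t>(0.975 * (resamples - 1))];
    return {lo, hi};
}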
Lesson 3: Growth Strategies Are a Rabbit Hole
When implementing DynamicArray, I thought I'd just double the capacity when full. Classic amortized O(1) insertion. Done, right?
But then I wondered: what if we used 1.5x growth instead? What about Fibonacci growth? What about fixed-size increments?
So I implemented all four:
enum class GrowthStrategy {
    MULTIPLICATIVE_2_0, // Growth factor of 2.0
    MULTIPLICATIVE_1_5, // Growth factor of 1.5
    FIBONACCI,          // Fibonacci sequence growth
    ADDITIVE            // Fixed increment growth
};
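Each strategy boils down to a "next capacity" rule. Something along these lines; this is a sketch where the fixed increment and fallbacks are arbitrary choices of mine, not the hashbrowns constants:

#include <cstddef>

// Sketch of a "next capacity" rule per strategy (assumes current >= 1).
std::size_t next_capacity(GrowthStrategy strategy, std::size_t current) {
    switch (strategy) {
        case GrowthStrategy::MULTIPLICATIVE_2_0:
            return current * 2;
        case GrowthStrategy::MULTIPLICATIVE_1_5:
            return current + (current / 2 == 0 ? 1 : current / 2);  // ~1.5x
        case GrowthStrategy::FIBONACCI: {
            std::size_t a = 1, b = 2;
            while (b <= current) {   // walk the sequence past the current capacity
                const std::size_t next = a + b;
                a = b;
                b = next;
            }
            return b;
        }
        case GrowthStrategy::ADDITIVE:
            return current + 64;     // fixed increment
    }
    return current * 2;              // unreachable fallback to satisfy the compiler
}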
And then I benchmarked them. Here's what I found:
- 2.0x is fastest for pure insertion speed
- 1.5x uses ~25% less memory on average
- Fibonacci is basically 1.618x growth with extra complexity
- Additive is terrible for large arrays (as expected)
The interesting insight: for most real-world workloads where you're not just inserting millions of elements, the difference is negligible. The "obvious" choice of 2x doubling is usually fine.
Lesson 4: Hash Maps Have Hidden Complexity
I implemented two hash map strategies: open addressing (linear probing) and separate chaining. I figured open addressing would win because of better cache locality.
The reality was more nuanced:
enum class HashStrategy { OPEN_ADDRESSING, SEPARATE_CHAINING };
Open addressing wins when:
- Load factor is kept below ~0.7
- Keys are well-distributed
- You're doing mostly lookups
Separate chaining wins when:
- You have clustering from bad hash distribution
- Deletion is common (tombstones hurt open addressing)
- Load factor is high
The tombstone problem was particularly sneaky. When you delete from an open-addressed hash map, you can't just mark the slot empty, because lookups for keys that probed past that slot would stop at the empty slot and miss them. So you mark it as a "tombstone," but too many tombstones degrade performance as badly as too many collisions.
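In code, a tombstone is just a third slot state that search skips over and delete writes. A stripped-down linear-probing sketch (the names and slot layout are mine, not necessarily how hashbrowns structures its table):

#include <cstddef>
#include <vector>

// Stripped-down linear probing with tombstones (non-empty table assumed).
enum class SlotState { EMPTY, OCCUPIED, TOMBSTONE };

struct Slot {
    SlotState state = SlotState::EMPTY;
    int key = 0;
};

// Search keeps probing past tombstones; only a truly EMPTY slot ends the probe.
bool contains(const std::vector<Slot>& table, int key) {
    const std::size_t start = static_cast<std::size_t>(key) % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        const Slot& slot = table[(start + probes) % table.size()];
        if (slot.state == SlotState::EMPTY) return false;  // key was never inserted
        if (slot.state == SlotState::OCCUPIED && slot.key == key) return true;
        // TOMBSTONE: keep probing; the key may live further along the chain.
    }
    return false;
}

// Erase marks the slot as a tombstone instead of EMPTY so later searches
// don't stop early and miss keys that were probed past this slot.
bool erase(std::vector<Slot>& table, int key) {
    const std::size_t start = static_cast<std::size_t>(key) % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        Slot& slot = table[(start + probes) % table.size()];
        if (slot.state == SlotState::EMPTY) return false;
        if (slot.state == SlotState::OCCUPIED && slot.key == key) {
            slot.state = SlotState::TOMBSTONE;
            return true;
        }
    }
    return false;
}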
I ended up tracking probe counts per operation as a metric:
double insert_probes_mean{0.0};
double search_probes_mean{0.0};
double remove_probes_mean{0.0};
This made it obvious when the hash function wasn't distributing well.
Lesson 5: Memory Tracking Reveals Everything
Early on, I built a MemoryTracker singleton that hooks into all allocations:
class MemoryTracker {
public:
    void record_allocation(void* ptr, size_t size, const char* file, int line);
    void record_deallocation(void* ptr);
    bool check_leaks() const;
    // ...
};
Combined with a custom TrackedAllocator<T> that plugs into the standard containers (sketched after the list below), I could answer questions like:
- How much memory does each structure actually use?
- What's the peak memory during a benchmark run?
- Are there any leaks?
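The allocator shell is small: it forwards every allocation and deallocation to the tracker. Here's a minimal sketch, which assumes a MemoryTracker::instance() accessor that isn't shown above:

#include <cstddef>
#include <new>

// Minimal STL-compatible allocator that reports to the MemoryTracker above.
// MemoryTracker::instance() is an illustrative detail, not confirmed API.
template <typename T>
struct TrackedAllocator {
    using value_type = T;

    TrackedAllocator() = default;
    template <typename U>
    TrackedAllocator(const TrackedAllocator<U>&) noexcept {}  // rebind support

    T* allocate(std::size_t n) {
        void* ptr = ::operator new(n * sizeof(T));
        MemoryTracker::instance().record_allocation(ptr, n * sizeof(T), __FILE__, __LINE__);
        return static_cast<T*>(ptr);
    }

    void deallocate(T* ptr, std::size_t) {
        MemoryTracker::instance().record_deallocation(ptr);
        ::operator delete(ptr);
    }
};

template <typename T, typename U>
bool operator==(const TrackedAllocator<T>&, const TrackedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TrackedAllocator<T>&, const TrackedAllocator<U>&) { return false; }

Plugging it into a container is then just std::vector<int, TrackedAllocator<int>>.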
The memory overhead numbers were eye-opening:
| Structure | Elements | Memory | Overhead |
|---|---|---|---|
| DynamicArray | 10,000 | ~240 KB | ~24 bytes/element |
| SinglyLinkedList | 10,000 | ~400 KB | ~40 bytes/element |
| HashMap (OA) | 10,000 | ~380 KB | ~38 bytes/element |
| HashMap (SC) | 10,000 | ~520 KB | ~52 bytes/element |
Per the table, linked lists carry roughly 1.7x the memory footprint of arrays once you account for the pointer overhead per node. With my memory pool optimization (MemoryPool<Node>), I got that down somewhat, but arrays still win on density.
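The pool itself is essentially a free list carved out of larger chunks, so each linked-list node no longer costs its own heap allocation. A rough sketch of the idea (chunk size and layout here are arbitrary, not the hashbrowns implementation):

#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size node pool sketch: storage is carved out of large chunks and
// recycled through a free list. The caller placement-news a Node into the
// returned storage and destroys it before handing the slot back.
template <typename Node>
class MemoryPool {
public:
    explicit MemoryPool(std::size_t nodes_per_chunk = 1024)
        : nodes_per_chunk_(nodes_per_chunk) {}

    Node* allocate() {
        if (free_list_ == nullptr) grow();
        FreeNode* slot = free_list_;
        free_list_ = free_list_->next;
        return reinterpret_cast<Node*>(slot);  // raw, properly aligned storage
    }

    void deallocate(Node* node) {
        auto* slot = reinterpret_cast<FreeNode*>(node);
        slot->next = free_list_;
        free_list_ = slot;
    }

private:
    union FreeNode {
        FreeNode* next;                               // link while the slot is free
        alignas(Node) unsigned char storage[sizeof(Node)];
    };

    void grow() {
        chunks_.emplace_back(new FreeNode[nodes_per_chunk_]);
        FreeNode* chunk = chunks_.back().get();
        for (std::size_t i = 0; i < nodes_per_chunk_; ++i) {
            chunk[i].next = free_list_;
            free_list_ = &chunk[i];
        }
    }

    std::size_t nodes_per_chunk_;
    FreeNode* free_list_ = nullptr;
    std::vector<std::unique_ptr<FreeNode[]>> chunks_;
};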
Lesson 6: Reproducibility Is Non-Negotiable
Nothing is more frustrating than "it was faster yesterday." To make benchmarks reproducible, I added:
- Explicit RNG seeding: Every random-pattern run records its seed
- Pattern modes: Sequential, random, or mixed insertion patterns
- Hardware fingerprinting: JSON output includes CPU model, OS, and build info
- Baseline regression checks: Compare current run against stored baselines
#include <optional>

struct BenchmarkConfig {
    enum class Pattern { SEQUENTIAL, RANDOM, MIXED };
    Pattern pattern = Pattern::SEQUENTIAL;
    std::optional<unsigned long long> seed;
    bool seed_was_generated = false;
    // ...
};
Now when someone reports weird numbers, I can ask: "What was your seed? What pattern? What hardware?" and actually reproduce the issue.
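Resolving the seed is just a few lines: use the one the user passed, otherwise generate one and remember that fact so it lands in the output. A sketch, assuming the BenchmarkConfig above (the helper name is mine):

#include <random>

// If no seed was supplied, generate one and record that it was generated so
// the value still ends up in the CSV/JSON output.
unsigned long long resolve_seed(BenchmarkConfig& config) {
    if (!config.seed) {
        std::random_device rd;
        config.seed = (static_cast<unsigned long long>(rd()) << 32) | rd();
        config.seed_was_generated = true;
    }
    return *config.seed;
}

The resulting value can then seed whatever engine drives the random and mixed insertion patterns.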
The Crossover Analysis: The Whole Point
After all this infrastructure, I could finally answer the question I started with: when does one structure beat another?
The crossover analysis sweeps through sizes and finds where performance curves intersect:
Operation: search
Array vs HashMap crossover at N ≈ 150
(Below 150 elements, array linear search beats hash lookup!)
Yes, for small collections, the overhead of hashing often exceeds the benefit. Your hash map with 50 elements might be slower than a linear array scan. Cache locality is that powerful.
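Mechanically, the sweep is simple: measure both structures across a range of sizes and report the first size where the faster one flips. A sketch of that comparison (function names are illustrative):

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Given a measured cost function (e.g. mean search time in ns) for two
// structures, return the first size where structure B becomes faster than A.
std::optional<std::size_t> find_crossover(
        const std::vector<std::size_t>& sizes,
        const std::function<double(std::size_t)>& cost_a,   // e.g. array search
        const std::function<double(std::size_t)>& cost_b) { // e.g. hash map search
    for (std::size_t n : sizes) {
        if (cost_b(n) < cost_a(n)) {
            return n;                    // first size where B overtakes A
        }
    }
    return std::nullopt;                 // no crossover in the tested range
}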
Key Takeaways
Benchmark before optimizing. Your intuition about performance is probably wrong.
Big-O is necessary but not sufficient. Constants matter. Cache locality matters. Memory overhead matters.
Build measurement infrastructure first. You can't improve what you can't measure accurately.
Simple interfaces, complex implementations. A clean DataStructure base class made everything easier to extend and test.
Reproducibility is a feature. Seeds, hardware info, and baseline comparisons save debugging time.
The "best" data structure depends on your workload. There's no universal winner.
Try It Yourself
hashbrowns is open source and designed to be educational:
git clone https://github.com/aeml/hashbrowns.git
cd hashbrowns
scripts/build.sh -t Release --test
./build/hashbrowns --size 10000 --runs 10 --structures array,hashmap
The code is extensively commented, and there are tutorials for adding your own data structures. Sometimes the best way to understand performance is to measure it yourself.
What data structure assumptions have surprised you? Drop a comment below; I'd love to hear about your benchmarking adventures!