How Quadrupling a Buffer Let My Allocator Catch Up with mimalloc

#cpp #programming

TL;DR

Increasing the remote-free ring buffer in my custom allocator, hz3, from 256 to 1024 entries improved multi-threaded performance by 16–46%, enough to catch up with mimalloc and in some cases beat it.

Context

hz3 is an experimental memory allocator I'm developing under the hakozuna project. My recent focus has been optimizing "remote free" patterns (freeing memory allocated by a different thread) in multi-threaded environments.

One day, I noticed hz3 was lagging behind mimalloc in an extreme benchmark scenario: 32 threads with a 90% remote free rate.

Investigation

I enabled hz3's internal statistics flags and profiled a run to find the bottleneck. The culprit was clear:

overflow_sent = 943435

The "ring buffer," used to temporarily store remote frees before batch processing, was overflowing constantly.

What happens when it overflows:

Degradation: Execution falls back from batch processing to single-item processing.
Contention: Every single-item free needs its own atomic CAS (compare-and-swap), so the number of CAS operations rises sharply.
Cache Misses: CPU cache efficiency plummets.
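
To make the failure mode concrete, here is a minimal sketch of a per-thread stash ring with an overflow fallback. It is not hz3's actual code: `StashRing`, `StashEntry`, `push`, `flush`, `simulate_burst`, and the `owner` field are hypothetical names chosen for illustration, and the slow path is reduced to a counter that plays the role of `overflow_sent`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not hz3's real implementation.
constexpr std::size_t kRingSize = 256;  // stands in for HZ3_REMOTE_STASH_RING_SIZE

struct StashEntry {
    void*         ptr;    // block being freed by a non-owning thread
    std::uint64_t owner;  // assumed: id of the heap/thread that owns the block
};

struct StashRing {
    StashEntry  entries[kRingSize];
    std::size_t head = 0;           // total entries written so far
    std::size_t tail = 0;           // total entries drained so far
    std::size_t overflow_sent = 0;  // frees that could not be stashed

    // Returns true if the free was stashed for a later batch,
    // false if the ring was full and the slow path had to be taken.
    bool push(void* p, std::uint64_t owner) {
        if (head - tail == kRingSize) {
            // A real allocator would hand this one block to its owner
            // right away, typically with one atomic CAS per block.
            ++overflow_sent;
            return false;
        }
        entries[head % kRingSize] = {p, owner};
        ++head;
        return true;
    }

    // Drains everything that is stashed as one batch and returns its size.
    std::size_t flush() {
        std::size_t n = head - tail;
        tail = head;
        return n;
    }
};

struct BurstResult {
    std::size_t stashed;
    std::size_t overflowed;
    std::size_t drained;
};

// Pushes n_frees remote frees without flushing in between, then drains once.
BurstResult simulate_burst(std::size_t n_frees) {
    StashRing ring;
    static char blocks[4096];  // stand-in addresses; nothing is really freed
    std::size_t stashed = 0;
    for (std::size_t i = 0; i < n_frees; ++i) {
        if (ring.push(&blocks[i % sizeof(blocks)], /*owner=*/1)) ++stashed;
    }
    std::size_t drained = ring.flush();
    return {stashed, ring.overflow_sent, drained};
}
```

In this model a burst of 300 remote frees between two flushes stashes 256 entries and sends the remaining 44 down the slow path. A larger ring absorbs longer bursts before that happens.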

The Solution

The fix was literally a one-line change:

// Before
#define HZ3_REMOTE_STASH_RING_SIZE 256

// After
#define HZ3_REMOTE_STASH_RING_SIZE 1024

That's it.

Results

Here is the performance comparison after the change (T = thread count, R = remote free rate):

Condition	hz3	mimalloc	Diff
T=8 R=90%	193M	132M	+46% 🚀
T=16 R=50%	271M	209M	+30% 🚀
T=32 R=90%	199M	196M	+1.2%

T=32 R=90%: Essentially tied with mimalloc.
T=8 / T=16: hz3 wins by a wide margin (+46% and +30%).
Overflow Count: Dropped from 943K to 131K (-86%).

The Trade-off (Cost)

Of course, memory isn't free, but the cost was negligible compared to the speed gain:

TLS Memory: +12 KB per thread (16 bytes × 768 additional entries).
Total Impact: Approximately +384KB for 32 threads.
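
The same cost estimate can be written as compile-time arithmetic. This is a sketch: the constant and function names are hypothetical, and it assumes exactly one ring per thread and 16-byte entries, as stated above.

```cpp
#include <cstddef>

// Hypothetical names; the values come from the post.
constexpr std::size_t kEntryBytes  = 16;    // size of one ring entry
constexpr std::size_t kOldRingSize = 256;   // entries before the change
constexpr std::size_t kNewRingSize = 1024;  // entries after the change
constexpr std::size_t kThreads     = 32;    // thread count in the benchmark

// Extra TLS memory per thread, assuming one ring per thread.
constexpr std::size_t extra_bytes_per_thread() {
    return kEntryBytes * (kNewRingSize - kOldRingSize);
}

// Extra memory across all benchmark threads.
constexpr std::size_t extra_bytes_total() {
    return extra_bytes_per_thread() * kThreads;
}

static_assert(extra_bytes_per_thread() == 12 * 1024, "12 KB per thread");
static_assert(extra_bytes_total() == 384 * 1024, "384 KB for 32 threads");
```

The cost scales linearly with both ring size and thread count, so doubling the ring again to 2048 entries would add another 16 KB per thread under the same assumptions.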
