https://github.com/hakorune/hakozuna
TL;DR
By simply increasing the internal ring buffer size from 256 to 1024 entries in my custom allocator, hz3, multi-threaded performance improved by 16% to 46%, catching up to (and in some cases beating) mimalloc.
Context
hz3 is an experimental memory allocator I'm developing under the hakozuna project. My recent focus has been optimizing "remote free" patterns (freeing memory allocated by a different thread) in multi-threaded environments.
One day, I noticed hz3 was lagging behind mimalloc in an extreme benchmark scenario: 32 threads with a 90% remote free rate.
Investigation
I enabled hz3's internal statistics flags and profiled a run to find the bottleneck. The culprit was clear:
overflow_sent = 943435
The "ring buffer," used to temporarily store remote frees before batch processing, was overflowing constantly.
What happens when it overflows:
- Degradation: Execution falls back from batch processing to single-item processing.
- Contention: The frequency of Atomic CAS (Compare-And-Swap) operations increases drastically.
- Cache Misses: CPU cache efficiency plummets.
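To make the overflow path concrete, here is a minimal sketch of a per-thread stash ring that falls back to a per-item CAS push when it is full. This is not hz3's actual code: `stash_ring_t`, `owner_inbox`, `stash_remote_free`, `stash_flush`, and the power-of-two masking are all assumptions made for illustration. Only the ring size and the `overflow_sent` counter name come from the post.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for HZ3_REMOTE_STASH_RING_SIZE. */
#define STASH_RING_SIZE 256u
#define STASH_RING_MASK (STASH_RING_SIZE - 1u)

/* Assumption of this sketch: a power-of-two size so indices can be masked. */
static_assert((STASH_RING_SIZE & STASH_RING_MASK) == 0,
              "ring size must be a power of two");

/* A freed block is reused as an intrusive list node. */
typedef struct node { struct node *next; } node_t;

/* Thread-local stash of remote frees waiting to be sent in one batch. */
typedef struct {
    node_t *slots[STASH_RING_SIZE];
    size_t  head;           /* next slot to drain */
    size_t  tail;           /* next slot to fill  */
    size_t  overflow_sent;  /* frees that bypassed the ring */
} stash_ring_t;

/* Hypothetical shared list of blocks returned to the owning thread. */
static _Atomic(node_t *) owner_inbox;

/* Push an already linked chain [first..last] with a single CAS loop. */
static void inbox_push_chain(node_t *first, node_t *last) {
    node_t *old = atomic_load_explicit(&owner_inbox, memory_order_relaxed);
    do {
        last->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner_inbox, &old, first,
                 memory_order_release, memory_order_relaxed));
}

/* Fast path: store locally. Overflow path: one CAS loop per block. */
static void stash_remote_free(stash_ring_t *r, void *block) {
    node_t *n = (node_t *)block;
    if (r->tail - r->head < STASH_RING_SIZE) {
        r->slots[r->tail & STASH_RING_MASK] = n;
        r->tail++;
    } else {
        r->overflow_sent++;
        inbox_push_chain(n, n);
    }
}

/* Batch path: link every stashed block locally, then publish once. */
static size_t stash_flush(stash_ring_t *r) {
    size_t count = r->tail - r->head;
    if (count == 0) return 0;
    node_t *first = NULL, *last = NULL;
    while (r->head != r->tail) {
        node_t *n = r->slots[r->head & STASH_RING_MASK];
        r->head++;
        n->next = first;
        if (last == NULL) last = n;
        first = n;
    }
    inbox_push_chain(first, last);
    return count;
}
```

In this sketch a full ring costs one CAS loop per freed block, while a flush pays for one CAS loop per batch. That is the kind of difference a high overflow count would amplify under contention.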
The Solution
The fix was literally a one-line change:
// Before
#define HZ3_REMOTE_STASH_RING_SIZE 256
// After
#define HZ3_REMOTE_STASH_RING_SIZE 1024
That's it.
Results
Here is the performance comparison after the change:
| Condition | hz3 | mimalloc | Diff |
|---|---|---|---|
| T=8 R=90% | 193M | 132M | +46% |
| T=16 R=50% | 271M | 209M | +30% |
| T=32 R=90% | 199M | 196M | +1.2% |
- T=32 R=90%: Pulled level with mimalloc.
- T=8 / T=16: Beat mimalloc by a wide margin.
- Overflow Count: Dropped from 943K to 131K (-86%).
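For anyone who wants to recheck the percentages, here is a small sketch of the arithmetic. The helper `rel_change_tenths` and the self-check function are hypothetical. The inputs are the rounded figures quoted in this post (131K is taken as 131000), so results can differ slightly from numbers computed on unrounded data.

```c
#include <assert.h>

/* Relative change from `base` to `value`, in tenths of a percent,
 * truncated toward zero by integer division. */
static long long rel_change_tenths(long long value, long long base) {
    return (value - base) * 1000 / base;
}

/* Self-check against the rounded figures quoted in the post. */
static void check_post_figures(void) {
    /* T=8 R=90%: 193M vs 132M is about +46.2% */
    assert(rel_change_tenths(193, 132) == 462);
    /* T=16 R=50%: 271M vs 209M is about +29.6%, which rounds to +30% */
    assert(rel_change_tenths(271, 209) == 296);
    /* Overflow count: 943435 down to roughly 131000 is about -86.1% */
    assert(rel_change_tenths(131000, 943435) == -861);
}
```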
The Trade-off (Cost)
Of course, memory isn't free, but the cost was negligible compared to the speed gain:
- TLS Memory: +12KB per thread (16 bytes × 768 additional entries).
- Total Impact: Approximately +384KB for 32 threads.
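The memory figures above follow directly from the ring sizes. A minimal sketch of that calculation is below. The 16-byte entry size is taken from the post, while the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Ring sizes and entry size as quoted in the post; names are hypothetical. */
enum { OLD_RING_SIZE = 256, NEW_RING_SIZE = 1024, ENTRY_BYTES = 16 };

/* 768 extra entries at 16 bytes each is exactly 12 KiB per thread. */
static_assert((NEW_RING_SIZE - OLD_RING_SIZE) * ENTRY_BYTES == 12 * 1024,
              "extra TLS per thread should be 12 KiB");

/* Additional thread-local bytes each thread pays for the larger ring. */
static size_t extra_tls_bytes_per_thread(void) {
    return (size_t)(NEW_RING_SIZE - OLD_RING_SIZE) * ENTRY_BYTES;
}

/* Additional bytes across `threads` threads that each own one ring. */
static size_t extra_tls_bytes_total(size_t threads) {
    return threads * extra_tls_bytes_per_thread();
}
```

Calling `extra_tls_bytes_total(32)` gives 393216 bytes, which is the 384KB quoted above.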