CharmPic

Posted on Jan 22

From "Losing at T=16" to "Matching mimalloc": A Day with hz3 Lane16

#cpp

TL;DR

Fixed a performance gap where hz3 was losing to mimalloc at T=16 / R=90 by optimizing remote management.
Introduced Lane16 PageRemote, which initially caused a hang at T=2.
The culprit: A bug where tail->next was overwritten on CAS failure, creating a circular list.
After the fix: Performance improved by +9.6% at T=16 / R=90, reaching parity with or slightly outperforming mimalloc.
Discovered a massive 3.5x regression caused by an unnecessary atomic operation in the hot path.

1. The Initial Problem

At T=16 (16 threads), hz3 began to fall behind mimalloc.

While it maintained a lead at T=8, the performance dipped at T=16. I suspected L1 cache contention and an increased instruction count.

Analysis:

I observed a difference in instruction counts along with increased cache misses.
The real bottleneck wasn't the "hot path" but rather the remote/owner stash management.

2. Introducing Lane16 (PageRemote)

To counter this, I implemented a strategy inspired by mimalloc’s strengths (page-local management and delayed collection) as a "separate lane" within hz3.

Preserved the existing hot path.
Limited the places where the new lane touches existing logic to just two points.
Implemented as an opt-in feature via HZ3_LANE_T16_R90_PAGE_REMOTE.
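To make "page-local management and delayed collection" concrete, here is a minimal sketch in C11. It is not hz3's actual code: the names sketch_obj, sketch_page, sketch_remote_free, and sketch_collect are hypothetical, and the layout is an assumption made for illustration. Remote threads push freed objects onto a per-page atomic list, and the owner later detaches the whole list with one exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not hz3's real layout. */
typedef struct sketch_obj {
    struct sketch_obj *next;
} sketch_obj;

typedef struct sketch_page {
    _Atomic(sketch_obj *) remote_head; /* frees arriving from non-owner threads */
    sketch_obj *local_free;            /* touched by the owner thread only */
} sketch_page;

/* Remote free: a non-owner thread pushes one object onto the page's list. */
static void sketch_remote_free(sketch_page *page, sketch_obj *obj) {
    assert(page && obj);
    sketch_obj *old_head =
        atomic_load_explicit(&page->remote_head, memory_order_relaxed);
    do {
        /* Link to the head value the CAS is about to compare against.
           On failure the CAS refreshes old_head, so the next pass relinks. */
        obj->next = old_head;
    } while (!atomic_compare_exchange_weak_explicit(
        &page->remote_head, &old_head, obj,
        memory_order_release, memory_order_relaxed));
}

/* Delayed collection: the owner detaches the whole remote list with a single
   exchange and splices it into its private free list.
   Returns the number of objects collected. */
static size_t sketch_collect(sketch_page *page) {
    sketch_obj *list = atomic_exchange_explicit(
        &page->remote_head, (sketch_obj *)NULL, memory_order_acquire);
    size_t count = 0;
    while (list != NULL) {
        sketch_obj *next = list->next;
        list->next = page->local_free;
        page->local_free = list;
        list = next;
        count++;
    }
    return count;
}
```

In this sketch the owner pays for one atomic exchange per collection instead of one synchronized operation per remotely freed object, which is the general idea behind delayed collection.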

3. The T=2 Hang Incident

During testing, the system hung at T=2.

The Cause: Every time a CAS (Compare-And-Swap) failed, the code overwrote tail->next, which eventually corrupted the list into a cycle. Any traversal of that list then never terminated.

The Buggy Code

do {
    hz3_obj_set_next(tail, old_head);  // Overwritten on every retry
} while (!CAS(...));

The Fix

void* old_head = atomic_load(...);
for (;;) {
    hz3_obj_set_next(tail, old_head);
    if (CAS_strong(...)) break; // Success
}

This resolved the hang in every configuration with T > 1.
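For reference, here is a self-contained sketch of a chain push written with C11 atomics. It assumes that CAS_strong behaves like atomic_compare_exchange_strong, which writes the current head back into old_head when it fails. That is an assumption about hz3's wrapper, not something confirmed here. The names demo_node, demo_push_chain, and demo_length_bounded are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for illustration only. */
typedef struct demo_node {
    struct demo_node *next;
} demo_node;

/* Push the already-linked chain head..tail onto *stack with one CAS. */
static void demo_push_chain(_Atomic(demo_node *) *stack,
                            demo_node *head, demo_node *tail) {
    assert(head && tail);
    demo_node *old_head = atomic_load_explicit(stack, memory_order_relaxed);
    for (;;) {
        /* tail must point at exactly the value the CAS compares against. */
        tail->next = old_head;
        if (atomic_compare_exchange_strong_explicit(
                stack, &old_head, head,
                memory_order_release, memory_order_relaxed)) {
            break; /* published: head..tail now precedes the old list */
        }
        /* On failure, old_head was refreshed with the current head,
           so the next iteration relinks tail to the up-to-date value. */
    }
}

/* Walk a list with a step limit so a cycle is reported instead of hanging.
   Returns the length, or -1 if more than `limit` nodes were visited. */
static long demo_length_bounded(const demo_node *node, long limit) {
    long length = 0;
    for (; node != NULL; node = node->next) {
        if (++length > limit) {
            return -1;
        }
    }
    return length;
}
```

A bounded walk like demo_length_bounded is a cheap debugging aid for this class of bug: it turns a silent hang into a detectable error.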

4. Performance Benchmarks (A/B Test)

Config: T=16 / R=90 / RUNS=5

Baseline: 69.5M ops/s
Lane16: 76.1M ops/s
Improvement: +9.6%

Further measurements with warmup and taskset pinning showed that hz3 now stands toe-to-toe with mimalloc, occasionally leading by 5–7%.

5. Identifying a Major Regression

During this process, I noticed a staggering performance gap between the version at paper release and the current HEAD.

Version at paper release (e165faccc): 234M ops/s
Current HEAD: 65M ops/s (3.5x regression)

The culprit was an atomic_fetch_add placed inside the hot path:

static _Atomic uint32_t g_hz3_malloc_first = 0;
if (atomic_fetch_add_explicit(&g_hz3_malloc_first, 1, ...) == 0) { ... }

This made every single allocation perform an atomic Read-Modify-Write (RMW) on one shared global, so at high thread counts every thread contends for the same cache line.

The Solution:

Introduced HZ3_MALLOC_FIRST_LOG (disabled by default).
Removed the atomic operation from the hot path.
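One way to get this effect is a compile-time gate, sketched below. It assumes HZ3_MALLOC_FIRST_LOG is a 0/1 preprocessor macro, which the flag name suggests but which is not confirmed here. sketch_malloc, sketch_first_call_log, and g_sketch_first are hypothetical stand-ins, not hz3 symbols.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: the flag is a 0/1 macro that defaults to off. */
#ifndef HZ3_MALLOC_FIRST_LOG
#define HZ3_MALLOC_FIRST_LOG 0
#endif
static_assert(HZ3_MALLOC_FIRST_LOG == 0 || HZ3_MALLOC_FIRST_LOG == 1,
              "HZ3_MALLOC_FIRST_LOG must be 0 or 1");

/* Hypothetical counter standing in for g_hz3_malloc_first. */
static _Atomic uint32_t g_sketch_first = 0;

/* With the flag off, this function compiles to nothing, so the hot path
   carries no atomic RMW at all. */
static inline void sketch_first_call_log(void) {
#if HZ3_MALLOC_FIRST_LOG
    if (atomic_fetch_add_explicit(&g_sketch_first, 1,
                                  memory_order_relaxed) == 0) {
        fputs("first malloc call\n", stderr);
    }
#endif
}

/* Hypothetical hot path; the real allocator logic is replaced by malloc. */
static void *sketch_malloc(size_t size) {
    sketch_first_call_log();
    return malloc(size);
}
```

Building with -DHZ3_MALLOC_FIRST_LOG=1 would bring the diagnostic back for debugging sessions without taxing default builds.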
