DEV Community

CharmPic

From "Losing at T=16" to "Matching mimalloc": A Day with hz3 Lane16

TL;DR

  • Fixed a performance gap where hz3 was losing to mimalloc at T=16 / R=90 by optimizing remote management.
  • Introduced Lane16 PageRemote, which initially caused a hang at T=2.
  • The culprit: A bug where tail->next was overwritten on CAS failure, creating a circular list.
  • After the fix: Performance improved by +9.6% at T=16 / R=90, reaching parity with or slightly outperforming mimalloc.
  • Discovered a massive 3.5x regression caused by an unnecessary atomic operation in the hot path.

1. The Initial Problem

At T=16 (16 threads), hz3 began to fall behind mimalloc.

While it maintained a lead at T=8, the performance dipped at T=16. I suspected L1 cache contention and an increased instruction count.

Analysis:

  • Instruction counts differed between the two allocators, and hz3 showed more cache misses.
  • The real bottleneck wasn't the "hot path" but rather the remote/owner stash management.

2. Introducing Lane16 (PageRemote)

To counter this, I implemented a strategy inspired by mimalloc’s strengths (page-local management and delayed collection) as a "separate lane" within hz3.

  • Preserved the existing hot path.
  • Limited the points where the new lane touches existing logic to just two.
  • Implemented as an opt-in feature via HZ3_LANE_T16_R90_PAGE_REMOTE.
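As a rough illustration of the page-local, delayed-collection idea (not hz3's actual code), the sketch below assumes C11 `<stdatomic.h>`. The `page_t` and `obj_t` types and all function names are hypothetical. Non-owner threads push freed objects onto a per-page atomic list, and the owning thread later takes the whole list with a single exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not hz3's real layout. */
typedef struct obj { struct obj *next; } obj_t;

typedef struct page {
    _Atomic(obj_t *) remote_head; /* pushed to by non-owner threads */
    obj_t *local_free;            /* touched only by the owning thread */
} page_t;

static void page_init(page_t *pg) {
    atomic_init(&pg->remote_head, NULL);
    pg->local_free = NULL;
}

/* Remote free: push one object onto the page's remote list. */
static void page_remote_free(page_t *pg, obj_t *o) {
    assert(o != NULL);
    obj_t *old = atomic_load_explicit(&pg->remote_head, memory_order_relaxed);
    do {
        o->next = old; /* a failed CAS refreshes 'old', so relink each try */
    } while (!atomic_compare_exchange_weak_explicit(
                 &pg->remote_head, &old, o,
                 memory_order_release, memory_order_relaxed));
}

/* Delayed collection: the owner detaches the whole remote list at once
 * and moves it to its private free list. Returns the number collected. */
static size_t page_collect(page_t *pg) {
    obj_t *o = atomic_exchange_explicit(&pg->remote_head, NULL,
                                        memory_order_acquire);
    size_t n = 0;
    while (o != NULL) {
        obj_t *next = o->next;
        o->next = pg->local_free;
        pg->local_free = o;
        o = next;
        n++;
    }
    return n;
}
```

Because the owner pays for one atomic exchange per batch rather than one per object, remote traffic stays off the owner's fast path until it chooses to collect.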

3. The T=2 Hang Incident

During testing, the system hung at T=2.

The Cause: Each time a CAS (compare-and-swap) failed, the code overwrote tail->next, which eventually linked the list into a cycle.

The Buggy Code

do {
    hz3_obj_set_next(tail, old_head);  // Overwritten on every retry
} while (!CAS(...));

The Fix

void* old_head = atomic_load(...);
for (;;) {
    hz3_obj_set_next(tail, old_head);
    if (CAS_strong(...)) break; // Success
}

This resolved the hang issue for all configurations where T > 1.
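For readers who want to try the pattern in isolation, here is a minimal, self-contained sketch of pushing a pre-linked chain onto a lock-free list with C11 atomics. It is an assumption-laden illustration, not the hz3 source: `node_t`, `remote_push_chain`, and `list_length_bounded` are hypothetical names, and it relies on `atomic_compare_exchange_strong_explicit` writing the currently published head back into `old_head` when the CAS fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; stands in for an intrusive free-list object. */
typedef struct node { struct node *next; } node_t;

/* Push the already-linked chain head..tail onto *list. */
static void remote_push_chain(_Atomic(node_t *) *list,
                              node_t *head, node_t *tail) {
    assert(head != NULL && tail != NULL);
    node_t *old_head = atomic_load_explicit(list, memory_order_relaxed);
    for (;;) {
        /* Only the tail's link is written; interior links stay intact. */
        tail->next = old_head;
        if (atomic_compare_exchange_strong_explicit(
                list, &old_head, head,
                memory_order_release, memory_order_relaxed)) {
            break; /* chain is now published */
        }
        /* On failure old_head holds the value another thread published;
         * the next iteration relinks the tail against that value. */
    }
}

/* Walk the list with a step limit; returns -1 if the limit is exceeded,
 * which is a cheap way to catch an accidental cycle in a test. */
static long list_length_bounded(const node_t *n, long limit) {
    long len = 0;
    while (n != NULL) {
        if (++len > limit) return -1;
        n = n->next;
    }
    return len;
}
```

A bounded walk like `list_length_bounded` is handy in a stress test: a corrupted, circular list shows up as a failed check instead of a hang.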


4. Performance Benchmarks (A/B Test)

Config: T=16 / R=90 / RUNS=5

  • Baseline: 69.5M ops/s
  • Lane16: 76.1M ops/s
  • Improvement: +9.6%

Further measurements with warmup and taskset pinning showed that hz3 now stands toe-to-toe with mimalloc and occasionally leads by 5–7%.


5. Identifying a Major Regression

During this process, I noticed a staggering performance gap.

  • Version at paper release (e165faccc): 234M ops/s
  • Current HEAD: 65M ops/s (3.5x regression)

The culprit was an atomic_fetch_add placed inside the hot path:

static _Atomic uint32_t g_hz3_malloc_first = 0;
if (atomic_fetch_add_explicit(&g_hz3_malloc_first, 1, ...) == 0) { ... }

This caused an atomic Read-Modify-Write (RMW) operation on every single allocation. Because the counter is a single global, every thread was also contending for the same cache line.

The Solution:

  • Introduced HZ3_MALLOC_FIRST_LOG (disabled by default).
  • Removed the atomic operation from the hot path.
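One way to keep such a first-call log without taxing every allocation is sketched below. This is an assumption about the general approach, not the actual hz3 patch: only the `HZ3_MALLOC_FIRST_LOG` name comes from the post, while `note_first_malloc`, `g_first_seen`, and `g_log_calls` are hypothetical. With the flag off, the helper compiles to nothing. With it on, a relaxed load guards the RMW, so the exchange runs only until the flag is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#ifndef HZ3_MALLOC_FIRST_LOG
#define HZ3_MALLOC_FIRST_LOG 0 /* disabled by default, as in the post */
#endif

/* Hypothetical counter: how many times the "first malloc" branch ran. */
static uint32_t g_log_calls = 0;

#if HZ3_MALLOC_FIRST_LOG
static _Atomic uint32_t g_first_seen = 0; /* hypothetical flag */
#endif

/* Called from the allocation path. */
static inline void note_first_malloc(void) {
#if HZ3_MALLOC_FIRST_LOG
    /* Cheap relaxed load first; the RMW only runs while the flag is 0. */
    if (atomic_load_explicit(&g_first_seen, memory_order_relaxed) == 0 &&
        atomic_exchange_explicit(&g_first_seen, 1,
                                 memory_order_relaxed) == 0) {
        g_log_calls++; /* exactly one caller wins the exchange */
    }
#endif
}
```

Building with `-DHZ3_MALLOC_FIRST_LOG=1` turns the log back on for debugging, while default builds carry no atomic operation in this helper at all.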
