DEV Community

CharmPic

Custom Memory Allocator HAKOZUNA HZ8: A balanced allocator that prioritizes low RSS over raw speed.


I have created HZ8 as a new line for my custom memory allocator, Hakozuna.

HZ8 is not an allocator designed to be the "fastest across all benchmarks."

The goal is to bring RSS back down to a low level once a workload has finished, while maintaining practical throughput.

In short, HZ8 is characterized as follows:

HZ8:
balanced low-RSS allocator
practical throughput
fail-closed ownership
cross-thread free correctness

I have consolidated the insights gained from experimenting with HZ3, HZ4, HZ5, and HZ6 into HZ8, and positioned it as the allocator line to choose first for general use.

HZ8 Design Principles

In HZ8, we put a particular emphasis on the following points:

  1. Keeping RSS low.
  2. Preventing breakdown under remote-heavy workloads.
  3. Handling cross-thread frees safely.
  4. Making ownership and route determination fail-closed.
  5. Balancing practical speed and memory usage rather than aiming for the absolute fastest.
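To make "fail-closed" concrete, here is a minimal sketch of a free-path route check. It is not taken from the HZ8 source: `route_t`, `run_header_t`, `RUN_MAGIC`, `NO_OWNER`, and `route_free` are all hypothetical names, and the header layout is an assumption made for illustration. The point is only that every failed check ends in a rejection, and no branch guesses an owner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUN_MAGIC 0x485A3852u  /* arbitrary tag chosen for this sketch */
#define NO_OWNER  0u           /* thread ids start at 1 in this sketch */

/* Hypothetical route results; not HZ8's real API. */
typedef enum { ROUTE_LOCAL, ROUTE_REMOTE, ROUTE_REJECT } route_t;

/* Hypothetical run header; the real layout is not described in the post. */
typedef struct {
    uint32_t  magic;     /* must equal RUN_MAGIC for a valid header */
    uint32_t  owner_id;  /* id of the owning thread, or NO_OWNER */
    uintptr_t base;      /* address of the first byte of the run */
    size_t    size;      /* length of the run in bytes */
} run_header_t;

/* Decide how a free should be routed. Any check that fails returns
 * ROUTE_REJECT; there is no fallback that assumes an owner. */
route_t route_free(const run_header_t *hdr, const void *ptr, uint32_t self_id)
{
    assert(self_id != NO_OWNER);  /* the caller must know its own id */

    if (hdr == NULL || ptr == NULL)  return ROUTE_REJECT;
    if (hdr->magic != RUN_MAGIC)     return ROUTE_REJECT;
    if (hdr->owner_id == NO_OWNER)   return ROUTE_REJECT;

    /* The pointer must lie inside the run it claims to belong to. */
    uintptr_t p = (uintptr_t)ptr;
    if (p < hdr->base || p - hdr->base >= hdr->size) return ROUTE_REJECT;

    return hdr->owner_id == self_id ? ROUTE_LOCAL : ROUTE_REMOTE;
}
```

A caller would treat `ROUTE_REJECT` as "do nothing and report", which trades a possible leak for not corrupting another thread's heap.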

The current default is HZ8-v2 / KeepRefill.

KeepRefill is a mechanism designed to avoid heavy empty/reactivate loops under remote-heavy workloads. When a medium run becomes empty, it retains the owner-local refill candidate rather than destroying it immediately.
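The idea can be sketched roughly as follows. This is an illustration under assumptions, not the HZ8 implementation: the types `run_t` and `owner_heap_t`, the functions `on_run_empty` and `take_refill`, and the policy of keeping exactly one empty run are all hypothetical, and the `released` flag merely stands in for whatever HZ8 actually does when it destroys a run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical run descriptor. */
typedef struct run {
    size_t live;      /* number of live objects in the run */
    bool   released;  /* stand-in for "memory was given back" */
} run_t;

/* Hypothetical per-owner state: at most one empty run is kept
 * (the single-slot policy is an assumption of this sketch). */
typedef struct {
    run_t *kept;
} owner_heap_t;

/* Called by the owner when run r has just become empty. */
void on_run_empty(owner_heap_t *h, run_t *r)
{
    assert(r->live == 0);
    if (h->kept == NULL) {
        h->kept = r;         /* keep it: the next refill reuses it */
    } else {
        r->released = true;  /* a candidate is already held, so release */
    }
}

/* Called by the owner when it needs a run to refill from.
 * Returns NULL when no candidate was kept. */
run_t *take_refill(owner_heap_t *h)
{
    run_t *r = h->kept;
    h->kept = NULL;
    return r;
}
```

With this shape, a run that keeps flipping between empty and non-empty is handed back by `take_refill` instead of being torn down and rebuilt each time, which is the loop KeepRefill is meant to avoid.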

Benchmark Results

The environment is Ubuntu 22.04.5 / Linux 6.8.0-90 / x86_64, RUNS=10, THREADS=16, ITERS=50000.
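The post does not describe how post RSS was sampled. One common way to do it on Linux, shown here only as a sketch, is to read the resident page count from `/proc/self/statm` after the workload's last free and multiply it by the page size. The helper name `current_rss_bytes` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the calling process's resident set size in bytes,
 * or -1 if /proc/self/statm cannot be read. Linux only. */
long current_rss_bytes(void)
{
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL) return -1;

    /* The first two fields are total program size and resident
     * set size, both counted in pages. */
    long total_pages = 0, resident_pages = 0;
    int n = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (n != 2) return -1;

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);  /* guards the multiplication below */
    return resident_pages * page;
}
```

Sampling once at the peak and once after all frees have completed would give figures comparable in kind to the peak RSS and post RSS columns in this post.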

The representative results are as follows:

| Row | HZ8 ops/s | HZ8 post RSS | mimalloc ops/s | mimalloc post RSS | tcmalloc ops/s | tcmalloc post RSS |
| --- | --- | --- | --- | --- | --- | --- |
| small_interleaved_remote90 | 12.023M | 2.91 MiB | 10.960M | 50.98 MiB | 23.900M | 32.94 MiB |
| main_interleaved_r90 | 6.048M | 4.57 MiB | 4.715M | 183.12 MiB | 12.178M | 90.31 MiB |
| medium_interleaved_r50 | 8.128M | 3.81 MiB | 4.151M | 162.54 MiB | 15.870M | 79.06 MiB |

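tcmalloc shows strong throughput in many rows; in the three rows above it runs at roughly twice HZ8's ops/s, while HZ8 is ahead of mimalloc in each of them.

On the other hand, HZ8's post-workload RSS is far lower: 2.91 to 4.57 MiB, against 32.94 to 90.31 MiB for tcmalloc and 50.98 to 183.12 MiB for mimalloc.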

Therefore, the core proposition of HZ8 is as follows:

HZ8 is not intended to fully replace tcmalloc.

However, it is highly useful as an allocator that returns RSS to a low level while maintaining practical speed.

Comparison: MT lane x remote%

Putting HZ3, HZ4, HZ5, HZ6, and HZ8 side by side makes it easier to see where HZ8 sits.

| Lane | hz3 | hz4 | mimalloc | tcmalloc | Best HZ5 | HZ6 | HZ8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| main_r0 | 292.15M | 85.63M | 146.73M | 318.82M | 157.44M | 16.88M | 107.633M |
| main_r50 | 31.46M | 62.32M | 14.26M | 64.87M | 79.43M | 15.08M | 29.633M |
| main_r90 | 22.31M | 67.14M | 7.72M | 45.42M | 62.31M | 10.99M | 20.610M |
| guard_r0 | 318.98M | 156.68M | 258.19M | 375.71M | 149.00M | 189.48M | 224.750M |
| cross128_r90 | 2.78M | 27.66M | 3.52M | 7.21M | 22.39M | 6.38M | 37.342k |

HZ8 is not universally fast.

In particular, cross128_r90 is the current bottleneck: at 37.342k ops/s, HZ8 trails every other allocator in that lane by a wide margin.

However, since HZ8 is a line focused heavily on keeping RSS low, it shouldn't be evaluated solely by throughput.
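The post does not define the r50 and r90 suffixes; reading them as the share of frees issued by a thread other than the owner is an assumption. Under that reading, the remote-heavy lanes stress the cross-thread free path. A common way to make that path safe, shown purely as an illustration and not confirmed as what HZ8 does, is a per-owner lock-free list: non-owner threads push freed blocks with a compare-and-swap, and the owner later takes the whole list in one atomic exchange. All names here (`block_t`, `heap_t`, `remote_free`, `drain_remote`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A freed block reuses its own storage as a list link. */
typedef struct block { struct block *next; } block_t;

/* Hypothetical per-thread heap. */
typedef struct {
    _Atomic(block_t *) remote_head;  /* pushed to by non-owner threads */
    block_t           *local_free;   /* touched only by the owner */
} heap_t;

/* Called by a non-owner thread: push b onto the owner's remote list. */
void remote_free(heap_t *owner, block_t *b)
{
    assert(b != NULL);
    block_t *old = atomic_load_explicit(&owner->remote_head,
                                        memory_order_relaxed);
    do {
        b->next = old;  /* old is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner->remote_head, &old, b,
                 memory_order_release, memory_order_relaxed));
}

/* Called only by the owner: detach the whole remote list in one
 * exchange and splice it onto the local free list. Returns the
 * number of blocks moved. */
size_t drain_remote(heap_t *h)
{
    block_t *b = atomic_exchange_explicit(&h->remote_head,
                                          (block_t *)NULL,
                                          memory_order_acquire);
    size_t moved = 0;
    while (b != NULL) {
        block_t *next = b->next;
        b->next = h->local_free;
        h->local_free = b;
        b = next;
        moved++;
    }
    return moved;
}
```

Because the owner never pops single nodes from the shared list, this shape sidesteps the ABA problem that a lock-free pop would have to handle.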

LargeDirect Experiment

To address the weakness in cross128_r90, we also tested an opt-in profile called LargeDirectOwned.

The result is evidence that the cross128_r90 bottleneck stems from the large/direct boundary.

cross128_r90:
  baseline: 62.940k ops/s
  LargeDirect candidate: 2.835M ops/s
  ratio: 45.048x

However, the RSS increases:

peak RSS:
  150.17 MiB -> 260.07 MiB

post RSS:
  107.04 MiB -> 190.61 MiB
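To put the speedup and the RSS growth on the same scale, the ratios can be worked out from the rounded figures in this post. The helper names `ratio` and `print_largedirect_tradeoff` are hypothetical and exist only for this calculation.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of a candidate measurement to its baseline. */
double ratio(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return candidate / baseline;
}

/* Prints the LargeDirect trade-off using the figures quoted in the post. */
void print_largedirect_tradeoff(void)
{
    printf("throughput: %.2fx\n", ratio(62.940e3, 2.835e6)); /* ops/s */
    printf("peak RSS:   %.2fx\n", ratio(150.17, 260.07));    /* MiB */
    printf("post RSS:   %.2fx\n", ratio(107.04, 190.61));    /* MiB */
}
```

From these inputs the throughput ratio comes out at about 45.04x (the reported 45.048x was presumably computed from unrounded measurements), while peak RSS grows by about 1.73x and post RSS by about 1.78x.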

For this reason, LargeDirect is not enabled by default.

HZ8's default remains the balanced KeepRefill profile.

Summary

HZ8 is not the fastest allocator.

However, it has become an allocator with the following distinct characteristics:

HZ8:
  Practical speed
  Low post-workload RSS
  Resilience against remote-heavy workloads
  Cross-thread free correctness
  Fail-closed ownership

If speed is the sole metric, tcmalloc remains incredibly strong.

On the other hand, for workloads where returning RSS to a low level is critical, HZ8 occupies a very compelling position.

Moving forward, we plan to maintain the balanced line of HZ8 while advancing further speed-oriented research under HZ9.

Links

GitHub: https://github.com/hakorune/hakozuna

HZ8 paper / Zenodo: https://zenodo.org/records/21084279
