Introduction: I am not a "Wizard," I am a "Pigeon"
First, a confession: I am not a C language wizard who writes assembly in my sleep.
I have been a programmer for many years with experience in C++ and C#, and I've written simple HTTP servers in C#, but my knowledge of memory allocators was zero.
It all started when I was chatting with ChatGPT. It tempted me, saying, "The MIT ACE paper could be used as a learning layer for a memory allocator." I started building it on a whim, and before I knew it, I was deep into serious development.
(It is a secret, but I am actually studying the internal mechanisms of hakozuna by asking the AI after the development is finished. 🤫)
What I came to understand during development was the difference between PTAG32 (a globally managed pointer-tag scheme) for front-end pointers and the segment-header method. I'm still studying how the backend's shared area gracefully handles multiple threads.
The Contenders
- hakozuna: The main project, built over 3 months of this "Pigeon Behavior" (shuttling messages between AIs). It is a crystallization of human persistence and AI coding power.
- hakozuna-mt: Since hakozuna was losing to mimalloc at T=16 (16 threads), I developed this prototype specifically to beat it. It was built in just 2 days.
Today, I pitted these two against the industry giants (mimalloc, tcmalloc).
Origin: 8.8x Slower and the "Box Theory"
In the beginning, my first score was 103M ops/s.
In contrast, mimalloc was 908M ops/s.
The difference was 8.8x.
At first, it was "Segfault Hell."
Also, I was developing based on my own "Box Theory," but the AI misunderstood the concept. It thought "Box = Module boundary = Function call required."
I had to write in agents.md that "Human conceptual boundaries are different from program execution boundaries," and forced it to use inline calls to clear up the misunderstanding. (Ironically, towards the end, there were cases where inlining too much killed optimizations and made it slower...)
After the 5th rewrite (Opus 4.5 generation), designing it from scratch with all the accumulated knowledge and getting a "Go" sign from ChatGPT, it finally reached about 70% of mimalloc's score.
The "Pigeon" Workflow: 10-Minute High-Speed Cycles
From there, for 3 months, I relentlessly repeated the following cycle. One cycle takes about 10 minutes.
- Ideation: Consult ChatGPT Pro on design for optimization at that specific moment (Let it think for 30 mins; sometimes it proposes multiple plans).
- Spec: Have ChatGPT write a detailed instruction document based on the idea.
- Plan: Pass the instructions to Claude Code and have it create an implementation plan.
- Review: Ask ChatGPT to review Claude's plan for problems (There are usually mistakes).
- Build: Claude Code writes the code based on the reviewed plan.
- Test: Run the benchmark. (Result: Usually "No Go" / Performance Regression)
Most trials ended in failure, but by stacking the few "Go" results, I repeated the optimizations.
The Showdown: Benchmark Results
Environment: Linux (Ubuntu), 16 Cores.
Contenders: hakozuna, hakozuna-mt (Multi-thread specialized), mimalloc, tcmalloc.
Round 1: High Concurrency (hakozuna-mt)
Test: T Sweep (T=16, R=90, Median of 3)
| Allocator | ops/s | vs hakozuna | Winner |
|---|---|---|---|
| hakozuna-mt | 106.3M | +39% | 🥇 1st |
| mimalloc | 85.0M | +11% | 2nd |
| tcmalloc | 79.5M | +4% | 3rd |
| hakozuna | 76.6M | Baseline | 4th |
Verdict:
The AI (hakozuna-mt) scaled magnificently.
In just 48 hours, it built a high-throughput engine that fully utilizes 16 threads. If you just want to "spin it at T=16," the AI wins hands down.
Round 2: Memory Efficiency (The Human's Domain)
Test: RSS (Resident Set Size) Usage - MT Remote (Lower is better)
| Allocator | Max RSS | Comparison | Winner |
|---|---|---|---|
| hakozuna | 1.36 GB | Baseline | 🥇 1st |
| mimalloc | 1.52 GB | +11.8% | 2nd |
| hakozuna-mt | 2.04 GB | +50.0% | 3rd |
| tcmalloc | 2.34 GB | +72.1% | 4th |
Verdict:
This is where "3 months of hard work" paid off.
hakozuna runs on about half the memory of tcmalloc and significantly less than hakozuna-mt.
Round 3: Local Performance (Real World)
Test: R Sweep (R=0%, Local allocation/free)
| Allocator | ops/s | vs hakozuna-mt | Winner |
|---|---|---|---|
| hakozuna | 359.6M | +44% | 🥇 1st |
| tcmalloc | 357.0M | +43% | 2nd |
| mimalloc | 300.5M | +20% | 3rd |
| hakozuna-mt | 249.6M | Baseline | 4th |
Verdict:
hakozuna is too strong here. I was surprised myself.
Final Results & Conclusion
I tallied the wins across all 9 benchmark categories (including Redis workloads and single-threaded apps).
| Allocator | Total Wins | Strength |
|---|---|---|
| hakozuna | 9 Wins | Memory Efficiency, Local Speed, Real App Performance |
| hakozuna-mt | 5 Wins | Extreme Concurrency (T=16), Scalability |
| mimalloc | 2 Wins | Balanced Remote Loads |
| tcmalloc | 2 Wins | Single-Threaded Apps |
Repository is here:
https://github.com/hakorune/hakozuna
hakozuna is still under development! There are many things I want to do, like the learning layer!
But for now, let me rest a bit... nya (meow).
Fun Fact: Origin of the Name
- Hako: Developed based on the "Box (Hako) Theory."
- zuna: Meaning "Yokozuna" (Grand Champion), hoping it would be strong.
- Naming: By Claude Code!