Introduction: I am not a "Wizard," I am a "Pigeon"
First, a confession: I am not a C language wizard who writes assembly in my sleep.
I have been a programmer for many years with experience in C++ and C#, and I've written simple HTTP servers in C#, but my knowledge of memory allocators was zero.
It all started when I was chatting with ChatGPT. It tempted me, saying, "The MIT ACE paper could be used as a learning layer for a memory allocator." I started building it on a whim, and before I knew it, I was deep into serious development.
(It is a secret, but I am actually studying the internal mechanisms of hakozuna by asking the AI after the development is finished. 🤫)
What I came to understand during development was the difference between PTAG32 (a globally managed pointer-tag scheme) for front-end pointers and the segment-header method. I'm still studying how the backend's shared area gracefully handles multiple threads.
The Contenders
- hakozuna: The main project, built over 3 months of this "Pigeon Behavior" (shuttling messages between AIs). It is a crystallization of human persistence and AI coding power.
- hakozuna-mt: Since hakozuna was losing to mimalloc at T=16 (16 threads), I developed this prototype specifically to beat it. It was built in just 2 days.
Today, I pitted these two against the industry giants (mimalloc, tcmalloc).
Origin: 8.8x Slower and the "Box Theory"
In the beginning, my first score was 103M ops/s.
In contrast, mimalloc was 908M ops/s.
The difference was 8.8x.
At first, it was "Segfault Hell."
Also, I was developing based on my own "Box Theory," but the AI misunderstood the concept. It thought "Box = Module boundary = Function call required."
I had to write in agents.md that "Human conceptual boundaries are different from program execution boundaries," and forced it to use inline calls to clear up the misunderstanding. (Ironically, towards the end, there were cases where inlining too much killed optimizations and made it slower...)
After the 5th rewrite (Opus 4.5 generation), designing it from scratch with all the accumulated knowledge and getting a "Go" sign from ChatGPT, it finally reached about 70% of mimalloc's score.
The "Pigeon" Workflow: 10-Minute High-Speed Cycles
From there, for 3 months, I relentlessly repeated the following cycle. One cycle takes about 10 minutes.
- Ideation: Consult ChatGPT Pro on design for optimization at that specific moment (Let it think for 30 mins; sometimes it proposes multiple plans).
- Spec: Have ChatGPT write a detailed instruction document based on the idea.
- Plan: Pass the instructions to Claude Code and have it create an implementation plan.
- Review: Ask ChatGPT to review Claude's plan for problems (There are usually mistakes).
- Build: Claude Code writes the code based on the reviewed plan.
- Test: Run the benchmark. (Result: Usually "No Go" / Performance Regression)
Most trials ended in failure, but by stacking the few "Go" results, I repeated the optimizations.
The Showdown: Benchmark Results
Environment: Linux (Ubuntu), 16 Cores.
Contenders: hakozuna, hakozuna-mt (Multi-thread specialized), mimalloc, tcmalloc.
Round 1: High Concurrency (hakozuna-mt)
Test: T Sweep (T=16, R=90, Median of 3)
| Allocator | ops/s | vs hakozuna | Winner |
|---|---|---|---|
| hakozuna-mt | 106.3M | +39% | 🥇 1st |
| mimalloc | 85.0M | +11% | 2nd |
| tcmalloc | 79.5M | +4% | 3rd |
| hakozuna | 76.6M | Baseline | 4th |
Verdict:
The AI (hakozuna-mt) scaled magnificently.
In just 48 hours, it built a high-throughput engine that fully utilizes 16 threads. If you just want to "spin it at T=16," the AI wins hands down.
Round 2: Memory Efficiency (The Human's Domain)
Test: RSS (Resident Set Size) Usage - MT Remote (Lower is better)
| Allocator | Max RSS | Comparison | Winner |
|---|---|---|---|
| hakozuna | 1.36 GB | Baseline | 🥇 1st |
| mimalloc | 1.52 GB | +11.8% | 2nd |
| hakozuna-mt | 2.04 GB | +50.0% | 3rd |
| tcmalloc | 2.34 GB | +72.1% | 4th |
Verdict:
This is where "3 months of hard work" paid off.
hakozuna runs on about half the memory of tcmalloc and significantly less than hakozuna-mt.
Round 3: Local Performance (Real World)
Test: R Sweep (R=0%, Local allocation/free)
| Allocator | ops/s | vs hakozuna-mt | Winner |
|---|---|---|---|
| hakozuna | 359.6M | +44% | 🥇 1st |
| tcmalloc | 357.0M | +43% | 2nd |
| mimalloc | 300.5M | +20% | 3rd |
| hakozuna-mt | 249.6M | Baseline | 4th |
Verdict:
hakozuna is too strong here. I was surprised myself.
Final Results & Conclusion
I tallied the wins across all 9 benchmark categories (including Redis workloads and single-threaded apps).
| Allocator | Total Wins | Strength |
|---|---|---|
| hakozuna | 9 Wins | Memory Efficiency, Local Speed, Real App Performance |
| hakozuna-mt | 5 Wins | Extreme Concurrency (T=16), Scalability |
| mimalloc | 2 Wins | Balanced Remote Loads |
| tcmalloc | 2 Wins | Single-Threaded Apps |
Repository is here:
https://github.com/hakorune/hakozuna
hakozuna is still under development! There are many things I want to do, like the learning layer!
But for now, let me rest a bit... nya (meow).
Fun Fact: Origin of the Name
- Hako: Developed based on the "Box (Hako) Theory."
- zuna: Meaning "Yokozuna" (Grand Champion), hoping it would be strong.
- Naming: By Claude Code!