Every allocator benchmark leads with the median. malloc does 15M ops/sec, the typical call is 15 ns, ship it.
The median never paged me at 3 a.m. The tail did.
Same allocator, same workload, but measuring the worst single call instead of the typical one:
- Median: ~16 ns
- Worst case: 6,950,000 ns — almost 7 milliseconds.
One malloc, occasionally, taking seven milliseconds. If you're rendering a frame in 16 ms or quoting a price in a market, that call just ate your entire budget — and the average looked fantastic the whole time.
General allocators are fast on average because they have a slow path. Thread caches, coalescing, tree walks, OS fallback. Most calls dodge it. The ones that don't are your tail.
So I built PMAD, and made the opposite bet: delete the slow path entirely.
The whole idea
One mmap at startup grabs the pool. You declare your block sizes up front. alloc is a lookup-table index + a free-list pop. free is a header read + a push. That's it — no fallback, no coalescing, no growth, no lock.
size_t classes[] = { 16, 64, 256 };
size_t percentages[] = { 20, 50, 30 }; // sums to 100
pmad_init(classes, 3, percentages, 1024 * 1024); // the only syscall PMAD ever makes
int *x = pmad_alloc(sizeof(int) * 4); // O(1); NULL if exhausted
pmad_free(x); // O(1)
pmad_destroy(); // single munmap, gone
There's no slow path in the code, so there's no slow path to hit. The worst case isn't measured down — it's bounded up, by construction. You can compute the bound before your program runs a single line.
The flat curve
Hot path at 64 B, head-to-head, one harness compiled once per allocator so exactly one variable changes. The column that matters is the last one — how far the tail fans from the median. Flatter = more deterministic.
| Allocator | P50 | P99.9 | P99.9 / P50 |
|---|---|---|---|
| PMAD | 2.59 | 6.50 | 2.5× |
| tcmalloc | 3.91 | 7.81 | 2.0× |
| mimalloc | 2.62 | 9.09 | 3.5× |
| jemalloc | 7.81 | 144.53 | 18.5× |
| system | 15.62 | 239.59 | 15.3× |
PMAD goes median → three-nines and barely moves. jemalloc and the system allocator fan out 15–18×. That spread is the product. (And the median is 2.59 ns at every size from 16 B to 4096 B — O(1) demonstrated, not asserted.)
Under sustained fragmenting churn — 64M ops, the test where general allocators rot over time — PMAD's worst case is 40 µs and stays there. The system allocator's is the 6.95 ms from the top. 174× tighter.
Where I lose
If you only read the wins you'll deploy this wrong, so:
- These numbers are macOS-only. The one place determinism truly matters is Linux with isolated cores — and I don't have those yet. That's the next run, and the honest gap.
- jemalloc and tcmalloc beat me on small-block churn — their sharded caches have better locality than my single global free list.
- Throughput cliffs past ~256 B once the working set spills out of cache.
-
No double-free detection — a second
freesilently corrupts the list. Sharp edge, on the roadmap. - It faults the whole pool upfront — RSS = configured size from boot. A feature for hard real-time, a cost if you over-provision.
It's a special-purpose allocator and it says so.
The interesting part
PMAD is single-threaded — one global free list, no locks. Sounds like a limitation. It isn't, in the system I'm building it for.
quicx is a task-queue engine going shared-nothing per-core: one worker per core, zero shared state. There, "single-threaded" becomes exactly right — one deterministic pool per core, zero contention by construction.
And once you have N independent deterministic pools, a genuinely predictive layer appears: a dispatcher that routes work by per-pool occupancy, steering away from cores near capacity — a reinforcement-learning problem sitting on top of N allocators with provable per-op latency. The allocator is deterministic; the dispatcher is what predicts. That co-design is the part I'm actually chasing.
Try it
MIT, pure C99, no dependencies, Linux + macOS.
git clone https://github.com/anastassow/PMAD.git
cd PMAD/benchmarks/v2
make && make test # 19/19 correctness checks first
./run_all.sh # full head-to-head
Every number is reproducible — raw samples committed, full methodology in the repo.
https://github.com/anastassow/PMAD
If you ship real-time systems on Linux and have opinions on what this benchmark should look like with cores isolated — that's the feedback I want. Tell me where I'm wrong.
Top comments (0)