DEV Community

Cover image for A single malloc took 7 milliseconds. So I deleted the slow path.
Dimitar Anastasov
Dimitar Anastasov

Posted on • Originally published at github.com

A single malloc took 7 milliseconds. So I deleted the slow path.

Every allocator benchmark leads with the median. malloc does 15M ops/sec, the typical call is 15 ns, ship it.

The median never paged me at 3 a.m. The tail did.

Same allocator, same workload, but measuring the worst single call instead of the typical one:

  • Median: ~16 ns
  • Worst case: 6,950,000 ns — almost 7 milliseconds.

One malloc, occasionally, taking seven milliseconds. If you're rendering a frame in 16 ms or quoting a price in a market, that call just ate your entire budget — and the average looked fantastic the whole time.

General allocators are fast on average because they have a slow path. Thread caches, coalescing, tree walks, OS fallback. Most calls dodge it. The ones that don't are your tail.

So I built PMAD, and made the opposite bet: delete the slow path entirely.

The whole idea

One mmap at startup grabs the pool. You declare your block sizes up front. alloc is a lookup-table index + a free-list pop. free is a header read + a push. That's it — no fallback, no coalescing, no growth, no lock.

size_t classes[]     = { 16, 64, 256 };
size_t percentages[] = { 20, 50, 30 };          // sums to 100
pmad_init(classes, 3, percentages, 1024 * 1024); // the only syscall PMAD ever makes

int *x = pmad_alloc(sizeof(int) * 4);            // O(1); NULL if exhausted
pmad_free(x);                                    // O(1)
pmad_destroy();                                  // single munmap, gone
Enter fullscreen mode Exit fullscreen mode

There's no slow path in the code, so there's no slow path to hit. The worst case isn't measured down — it's bounded up, by construction. You can compute the bound before your program runs a single line.

The flat curve

Hot path at 64 B, head-to-head, one harness compiled once per allocator so exactly one variable changes. The column that matters is the last one — how far the tail fans from the median. Flatter = more deterministic.

Allocator P50 P99.9 P99.9 / P50
PMAD 2.59 6.50 2.5×
tcmalloc 3.91 7.81 2.0×
mimalloc 2.62 9.09 3.5×
jemalloc 7.81 144.53 18.5×
system 15.62 239.59 15.3×

PMAD goes median → three-nines and barely moves. jemalloc and the system allocator fan out 15–18×. That spread is the product. (And the median is 2.59 ns at every size from 16 B to 4096 B — O(1) demonstrated, not asserted.)

Under sustained fragmenting churn — 64M ops, the test where general allocators rot over time — PMAD's worst case is 40 µs and stays there. The system allocator's is the 6.95 ms from the top. 174× tighter.

Where I lose

If you only read the wins you'll deploy this wrong, so:

  • These numbers are macOS-only. The one place determinism truly matters is Linux with isolated cores — and I don't have those yet. That's the next run, and the honest gap.
  • jemalloc and tcmalloc beat me on small-block churn — their sharded caches have better locality than my single global free list.
  • Throughput cliffs past ~256 B once the working set spills out of cache.
  • No double-free detection — a second free silently corrupts the list. Sharp edge, on the roadmap.
  • It faults the whole pool upfront — RSS = configured size from boot. A feature for hard real-time, a cost if you over-provision.

It's a special-purpose allocator and it says so.

The interesting part

PMAD is single-threaded — one global free list, no locks. Sounds like a limitation. It isn't, in the system I'm building it for.

quicx is a task-queue engine going shared-nothing per-core: one worker per core, zero shared state. There, "single-threaded" becomes exactly right — one deterministic pool per core, zero contention by construction.

And once you have N independent deterministic pools, a genuinely predictive layer appears: a dispatcher that routes work by per-pool occupancy, steering away from cores near capacity — a reinforcement-learning problem sitting on top of N allocators with provable per-op latency. The allocator is deterministic; the dispatcher is what predicts. That co-design is the part I'm actually chasing.

Try it

MIT, pure C99, no dependencies, Linux + macOS.

git clone https://github.com/anastassow/PMAD.git
cd PMAD/benchmarks/v2
make && make test     # 19/19 correctness checks first
./run_all.sh          # full head-to-head
Enter fullscreen mode Exit fullscreen mode

Every number is reproducible — raw samples committed, full methodology in the repo.

https://github.com/anastassow/PMAD

If you ship real-time systems on Linux and have opinions on what this benchmark should look like with cores isolated — that's the feedback I want. Tell me where I'm wrong.

Top comments (0)