<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dimitar Anastasov</title>
    <description>The latest articles on DEV Community by Dimitar Anastasov (@anastassow).</description>
    <link>https://dev.to/anastassow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946421%2Fda6418ea-6f20-4330-85ee-11c2a1c4fe25.jpeg</url>
      <title>DEV Community: Dimitar Anastasov</title>
      <link>https://dev.to/anastassow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anastassow"/>
    <language>en</language>
    <item>
      <title>A single malloc took 7 milliseconds. So I deleted the slow path.</title>
      <dc:creator>Dimitar Anastasov</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:22:53 +0000</pubDate>
      <link>https://dev.to/anastassow/a-single-malloc-took-7-milliseconds-so-i-deleted-the-slow-path-2h2f</link>
      <guid>https://dev.to/anastassow/a-single-malloc-took-7-milliseconds-so-i-deleted-the-slow-path-2h2f</guid>
      <description>&lt;p&gt;Every allocator benchmark leads with the median. &lt;code&gt;malloc&lt;/code&gt; does 15M ops/sec, the typical call is 15 ns, ship it.&lt;/p&gt;

&lt;p&gt;The median never paged me at 3 a.m. The &lt;strong&gt;tail&lt;/strong&gt; did.&lt;/p&gt;

&lt;p&gt;Same allocator, same workload, but measuring the &lt;em&gt;worst&lt;/em&gt; single call instead of the typical one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median: &lt;strong&gt;~16 ns&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Worst case: &lt;strong&gt;6,950,000 ns&lt;/strong&gt; — almost &lt;strong&gt;7 milliseconds&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One &lt;code&gt;malloc&lt;/code&gt;, occasionally, taking seven milliseconds. If you're rendering a frame in 16 ms or quoting a price in a market, that call just ate your entire budget — and the average looked &lt;em&gt;fantastic&lt;/em&gt; the whole time.&lt;/p&gt;

&lt;p&gt;General allocators are fast on average &lt;em&gt;because&lt;/em&gt; they have a slow path. Thread caches, coalescing, tree walks, OS fallback. Most calls dodge it. The ones that don't are your tail.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/anastassow/PMAD" rel="noopener noreferrer"&gt;&lt;strong&gt;PMAD&lt;/strong&gt;&lt;/a&gt;, and made the opposite bet: &lt;strong&gt;delete the slow path entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The whole idea
&lt;/h2&gt;

&lt;p&gt;One &lt;code&gt;mmap&lt;/code&gt; at startup grabs the pool. You declare your block sizes up front. &lt;code&gt;alloc&lt;/code&gt; is a lookup-table index + a free-list pop. &lt;code&gt;free&lt;/code&gt; is a header read + a push. That's it — no fallback, no coalescing, no growth, no lock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;percentages&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;          &lt;span class="c1"&gt;// sums to 100&lt;/span&gt;
&lt;span class="n"&gt;pmad_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// the only syscall PMAD ever makes&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pmad_alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;            &lt;span class="c1"&gt;// O(1); NULL if exhausted&lt;/span&gt;
&lt;span class="n"&gt;pmad_free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                                    &lt;span class="c1"&gt;// O(1)&lt;/span&gt;
&lt;span class="n"&gt;pmad_destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;                                  &lt;span class="c1"&gt;// single munmap, gone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's no slow path in the code, so there's no slow path to &lt;em&gt;hit&lt;/em&gt;. The worst case isn't measured down — it's bounded up, by construction. You can compute the bound before your program runs a single line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The flat curve
&lt;/h2&gt;

&lt;p&gt;Hot path at 64 B, head-to-head, one harness compiled once per allocator so exactly one variable changes. The column that matters is the last one — how far the tail fans from the median. Flatter = more deterministic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Allocator&lt;/th&gt;
&lt;th&gt;P50&lt;/th&gt;
&lt;th&gt;P99.9&lt;/th&gt;
&lt;th&gt;P99.9 / P50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PMAD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.59&lt;/td&gt;
&lt;td&gt;6.50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tcmalloc&lt;/td&gt;
&lt;td&gt;3.91&lt;/td&gt;
&lt;td&gt;7.81&lt;/td&gt;
&lt;td&gt;2.0×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mimalloc&lt;/td&gt;
&lt;td&gt;2.62&lt;/td&gt;
&lt;td&gt;9.09&lt;/td&gt;
&lt;td&gt;3.5×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jemalloc&lt;/td&gt;
&lt;td&gt;7.81&lt;/td&gt;
&lt;td&gt;144.53&lt;/td&gt;
&lt;td&gt;18.5×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;system&lt;/td&gt;
&lt;td&gt;15.62&lt;/td&gt;
&lt;td&gt;239.59&lt;/td&gt;
&lt;td&gt;15.3×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PMAD goes median → three-nines and barely moves. jemalloc and the system allocator fan out 15–18×. That spread &lt;em&gt;is&lt;/em&gt; the product. (And the median is 2.59 ns at &lt;em&gt;every&lt;/em&gt; size from 16 B to 4096 B — O(1) demonstrated, not asserted.)&lt;/p&gt;

&lt;p&gt;Under sustained fragmenting churn — 64M ops, the test where general allocators rot over time — PMAD's worst case is &lt;strong&gt;40 µs and stays there&lt;/strong&gt;. The system allocator's is the 6.95 ms from the top. &lt;strong&gt;174× tighter.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I lose
&lt;/h2&gt;

&lt;p&gt;If you only read the wins you'll deploy this wrong, so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;These numbers are macOS-only.&lt;/strong&gt; The one place determinism truly matters is Linux with isolated cores — and I don't have those yet. That's the next run, and the honest gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jemalloc and tcmalloc beat me on small-block churn&lt;/strong&gt; — their sharded caches have better locality than my single global free list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput cliffs past ~256 B&lt;/strong&gt; once the working set spills out of cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No double-free detection&lt;/strong&gt; — a second &lt;code&gt;free&lt;/code&gt; silently corrupts the list. Sharp edge, on the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It faults the whole pool upfront&lt;/strong&gt; — RSS = configured size from boot. A feature for hard real-time, a cost if you over-provision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a &lt;em&gt;special-purpose&lt;/em&gt; allocator and it says so.&lt;/p&gt;

&lt;h2&gt;
  
  
  The interesting part
&lt;/h2&gt;

&lt;p&gt;PMAD is single-threaded — one global free list, no locks. Sounds like a limitation. It isn't, in the system I'm building it for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/anastassow" rel="noopener noreferrer"&gt;&lt;strong&gt;quicx&lt;/strong&gt;&lt;/a&gt; is a task-queue engine going &lt;strong&gt;shared-nothing per-core&lt;/strong&gt;: one worker per core, zero shared state. There, "single-threaded" becomes exactly right — &lt;strong&gt;one deterministic pool per core, zero contention by construction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And once you have &lt;em&gt;N&lt;/em&gt; independent deterministic pools, a genuinely &lt;em&gt;predictive&lt;/em&gt; layer appears: a dispatcher that routes work by &lt;strong&gt;per-pool occupancy&lt;/strong&gt;, steering away from cores near capacity — a reinforcement-learning problem sitting on top of N allocators with provable per-op latency. The allocator is deterministic; the dispatcher is what predicts. That co-design is the part I'm actually chasing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;MIT, pure C99, no dependencies, Linux + macOS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/anastassow/PMAD.git
&lt;span class="nb"&gt;cd &lt;/span&gt;PMAD/benchmarks/v2
make &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nb"&gt;test&lt;/span&gt;     &lt;span class="c"&gt;# 19/19 correctness checks first&lt;/span&gt;
./run_all.sh          &lt;span class="c"&gt;# full head-to-head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every number is reproducible — raw samples committed, full methodology in the repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/anastassow/PMAD" rel="noopener noreferrer"&gt;https://github.com/anastassow/PMAD&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you ship real-time systems on Linux and have opinions on what this benchmark &lt;em&gt;should&lt;/em&gt; look like with cores isolated — that's the feedback I want. Tell me where I'm wrong.&lt;/p&gt;

</description>
      <category>c</category>
      <category>performance</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
