<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleksei Gutikov</title>
    <description>The latest articles on DEV Community by Aleksei Gutikov (@agutikov).</description>
    <link>https://dev.to/agutikov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F299532%2Fdbd09007-92e2-46a6-86da-c025f1b8ef41.jpeg</url>
      <title>DEV Community: Aleksei Gutikov</title>
      <link>https://dev.to/agutikov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agutikov"/>
    <language>en</language>
    <item>
      <title>VS Code sporadic freeze</title>
      <dc:creator>Aleksei Gutikov</dc:creator>
      <pubDate>Sat, 01 Mar 2025 23:17:52 +0000</pubDate>
      <link>https://dev.to/agutikov/vs-code-sporadic-freeze-38na</link>
      <guid>https://dev.to/agutikov/vs-code-sporadic-freeze-38na</guid>
      <description>&lt;p&gt;Just want to leave it somewhere, I've spent the whole night troubleshooting this.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;"Enable memory monitoring"&lt;/strong&gt; feature of &lt;strong&gt;Konsole&lt;/strong&gt; actually sets its own &lt;code&gt;cgroup&lt;/code&gt; &lt;code&gt;memory.high&lt;/code&gt; limit.&lt;br&gt;
This can cause significant performance degradation of heavy processes started from the terminal.&lt;br&gt;
In my case - it caused VS Code freezes.&lt;/p&gt;
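
&lt;p&gt;To check whether a shell is affected, you can look up its cgroup and read the limit. This is a minimal sketch, assuming cgroup v2 is mounted at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;; the helper name &lt;code&gt;cgroup_v2_path&lt;/code&gt; is made up for this example:&lt;/p&gt;

```shell
# Hypothetical helper: print the cgroup v2 path recorded in a
# /proc/PID/cgroup-style file (defaults to the current shell's own file).
cgroup_v2_path() {
    # cgroup v2 entries have the form "0::/path"; the third field is the path.
    grep '^0::' "${1:-/proc/self/cgroup}" | cut -d: -f3
}

# Assumption: the unified hierarchy is mounted at /sys/fs/cgroup.
limit_file="/sys/fs/cgroup$(cgroup_v2_path)/memory.high"
if [ -r "$limit_file" ]; then
    # Prints "max" when no limit is set, otherwise the limit in bytes.
    cat "$limit_file"
fi
```

&lt;p&gt;Run it once in a Konsole tab and once in another terminal to compare the values.&lt;/p&gt;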

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcvfzil58z7q3bgsw46b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcvfzil58z7q3bgsw46b.png" alt="Image description" width="800" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, this Konsole setting actually does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;QFile&lt;/span&gt; &lt;span class="nf"&gt;memHighFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_createdAppCGroupPath&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;QDir&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;QStringLiteral&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"memory.high"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="n"&gt;memHighFile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QStringLiteral&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%1M"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newMemHigh&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toLocal8Bit&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you start some fancy modern IDE from the terminal, for example the way I'm used to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone git@github.com:xyz/foo_bar.git
&lt;span class="nb"&gt;cd &lt;/span&gt;foo_bar
code ./
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will run in the same cgroup together with Konsole (and with all other tabs).&lt;/p&gt;
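
&lt;p&gt;Cgroup membership is inherited from the parent process, and it can be compared through &lt;code&gt;/proc&lt;/code&gt;. A small sketch; &lt;code&gt;same_cgroup&lt;/code&gt; is a hypothetical helper, and the parent of the shell is assumed to be the terminal emulator:&lt;/p&gt;

```shell
# Hypothetical helper: succeed when two PIDs are in the same cgroup.
same_cgroup() {
    [ "$(cat "/proc/$1/cgroup")" = "$(cat "/proc/$2/cgroup")" ]
}

# $$ is this shell, $PPID is the process that started it
# (assumed here to be the terminal emulator).
if same_cgroup "$$" "$PPID"; then
    echo "shell and parent share a cgroup"
else
    echo "shell and parent are in different cgroups"
fi
```

&lt;p&gt;The same helper can be pointed at the PID of any process started from that shell.&lt;/p&gt;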

&lt;p&gt;My VS Code with all plugins uses about 1-1.4G RAM while browsing small projects.&lt;br&gt;
So 1-2 seconds after start, once the plugins have occupied all available memory, the UI just freezes. Not completely, but it becomes very-very sloooooow.&lt;br&gt;
And processes started by plugins (/opt/visual-studio-code/code, cpptools, clang-tidy, gopls, ...) start generating a lot of small disk reads.&lt;/p&gt;

&lt;p&gt;It happens because all those executables and dynamic libraries that VS Code and its plugins pull into memory do not fit within the limit, so Linux starts swapping out pages of the very code being executed and has to read them back from disk again and again.&lt;/p&gt;
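
&lt;p&gt;One way to confirm that the limit is being hit is the &lt;code&gt;high&lt;/code&gt; counter in the cgroup's &lt;code&gt;memory.events&lt;/code&gt; file, which the kernel increments when the cgroup is throttled for exceeding &lt;code&gt;memory.high&lt;/code&gt;. This is a sketch, again assuming cgroup v2 at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;; &lt;code&gt;high_events&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```shell
# Hypothetical helper: print the "high" counter from a memory.events-style file.
high_events() {
    awk '$1 == "high" { print $2 }' "$1"
}

# Assumption: cgroup v2 (unified hierarchy) mounted at /sys/fs/cgroup.
cg="/sys/fs/cgroup$(grep '^0::' /proc/self/cgroup | cut -d: -f3)"
if [ -r "$cg/memory.events" ]; then
    # A number that keeps growing means the cgroup keeps exceeding memory.high.
    high_events "$cg/memory.events"
fi
```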

&lt;h1&gt;
  
  
  Now the story
&lt;/h1&gt;

&lt;p&gt;First I looked at &lt;code&gt;htop&lt;/code&gt; and &lt;code&gt;iotop&lt;/code&gt; and found a bunch of processes that looked like VS Code plugins thrashing my SSD.&lt;br&gt;
I was shocked! My brand new SSD is dying! Or I accidentally installed some malware and now it's stealing my (empty :)) bitcoin wallet! Or my OS has gone crazy! A bug in the Linux kernel! WTF is going on?!&lt;/p&gt;

&lt;p&gt;The most mind-blowing thing I found after a series of experiments was that it worked fine if I started it from the Applications menu, or with Alt+F2.&lt;/p&gt;

&lt;p&gt;Just imagine - the same text editor :) works fine when you start it by clicking the icon in the Applications menu, and completely refuses to work when you start it from the command line.&lt;/p&gt;

&lt;p&gt;First I suspected different versions. I'm a newbie on Manjaro, so maybe I had installed two different versions somehow. &lt;code&gt;ps&lt;/code&gt;, &lt;code&gt;which&lt;/code&gt;, &lt;code&gt;locate&lt;/code&gt; and &lt;code&gt;pacman&lt;/code&gt; showed it was the same application from a single package.&lt;/p&gt;

&lt;p&gt;So my next guess was env vars. I was looking for some differences in the environment variables.&lt;br&gt;
No results, naturally.&lt;/p&gt;

&lt;p&gt;Then tried to disable all plugins.&lt;br&gt;
Of course it started working smoothly :) Of course! So I started enabling and disabling different plugins trying to find the one broken that eats my SSD.&lt;br&gt;
At this point I could have gotten even more confused, because the set of enabled plugins obviously affects the behavior, but it is not the root cause.&lt;/p&gt;

&lt;p&gt;By this time I already understood that heavy IO load by itself should not affect the UI and should not cause freezes.&lt;br&gt;
I also noticed that the terminal application I started VS Code from became unresponsive as well. So I had to switch between Konsole and Yakuake, starting VS Code in one window and killing it in the other.&lt;/p&gt;

&lt;p&gt;And after combining those two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lots of small disk reads&lt;/li&gt;
&lt;li&gt;and freeze of the terminal application where I start the IDE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I finally recalled that the day before I had, just in case, enabled this "memory monitoring" in Konsole.&lt;/p&gt;

&lt;p&gt;Doublefacepalm.&lt;/p&gt;

&lt;h1&gt;
  
  
  I've learned something today. Again
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;If you do not understand what a setting really does - be ready for magic things to happen.&lt;/li&gt;
&lt;li&gt;In software, sporadic issues are not sporadic. Mostly.&lt;/li&gt;
&lt;li&gt;You can handle any problem if you apply enough patience and expertise.
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>linux</category>
      <category>konsole</category>
      <category>kde</category>
      <category>cgroup</category>
    </item>
    <item>
      <title>Who is faster: compare_exchange or fetch_add?</title>
      <dc:creator>Aleksei Gutikov</dc:creator>
      <pubDate>Wed, 15 Apr 2020 23:01:59 +0000</pubDate>
      <link>https://dev.to/agutikov/who-is-faster-compareexchange-or-fetchadd-1pjc</link>
      <guid>https://dev.to/agutikov/who-is-faster-compareexchange-or-fetchadd-1pjc</guid>
      <description>&lt;p&gt;Many thanks to the readers who pointed out real issues in the original code and article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/alexis_payen_d66d3be3214d"&gt;Alexis Payen&lt;/a&gt; — for the &lt;a href="https://dev.to/alexis_payen_d66d3be3214d/comment/2i4ke"&gt;dev.to comment&lt;/a&gt; explaining why &lt;code&gt;std::shared_ptr&lt;/code&gt; was outperforming my implementation (the move constructor was doing an unnecessary &lt;code&gt;acquire&lt;/code&gt; + &lt;code&gt;reset&lt;/code&gt;, and &lt;code&gt;refcount++&lt;/code&gt; was implicitly &lt;code&gt;seq_cst&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pallas" rel="noopener noreferrer"&gt;Derrick Lyndon Pallas&lt;/a&gt; — for &lt;a href="https://github.com/agutikov/faa_vs_cmpxchg/issues/1" rel="noopener noreferrer"&gt;issue #1&lt;/a&gt; pointing out that &lt;code&gt;compare_exchange&lt;/code&gt; already writes the observed value back into &lt;code&gt;expected&lt;/code&gt; on failure (the explicit &lt;code&gt;load()&lt;/code&gt; inside the retry loop is redundant), and that &lt;code&gt;seq_cst&lt;/code&gt; is stronger than refcount decrement requires.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both pointers led me to revisit the rest of the memory ordering as well.&lt;br&gt;
The code is now both more correct and noticeably faster; the charts below were re-run with the fixes applied.&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;Differences and similarity&lt;/li&gt;
&lt;li&gt;
Benchmarks

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shared_ptr&lt;/code&gt; implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spinlock&lt;/code&gt; implementation&lt;/li&gt;
&lt;li&gt;Benchmark for &lt;code&gt;shared_ptr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Benchmark for &lt;code&gt;spinlock&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Measurements&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Resulting charts for &lt;code&gt;refcount&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Resulting charts for &lt;code&gt;spinlock&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;C++ has had cross-platform atomic operations out of the box since C++11.&lt;/p&gt;

&lt;p&gt;There are several types of &lt;a href="https://en.cppreference.com/w/cpp/atomic" rel="noopener noreferrer"&gt;atomic operations&lt;/a&gt;.&lt;br&gt;
Here I want to compare 2 types of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;atomic_fetch_add&lt;/code&gt;, &lt;code&gt;atomic_fetch_sub&lt;/code&gt;, etc...&lt;/li&gt;
&lt;li&gt;&lt;code&gt;atomic_compare_exchange&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These kinds of atomic operations existed in computer programming long before C++11.&lt;br&gt;
For example, &lt;a href="https://en.wikipedia.org/wiki/Compare-and-swap" rel="noopener noreferrer"&gt;Compare-and-swap (CAS)&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Fetch-and-add" rel="noopener noreferrer"&gt;Fetch-and-add (FAA)&lt;/a&gt; were implemented as CPU instructions in the Intel 80486 - &lt;a href="https://en.wikipedia.org/wiki/X86_instruction_listings#Added_with_80486" rel="noopener noreferrer"&gt;CMPXCHG and XADD&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here I will not talk about the origin of atomic operations or the problem they are designed to solve - data races.&lt;/p&gt;

&lt;p&gt;I want to concentrate only on the following 2 points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Comparison of the semantics and typical use cases of atomic CAS and FAA.&lt;/li&gt;
&lt;li&gt;Comparison of the performance of CAS and FAA in typical and abnormal cases.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Differences and similarity &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The most understandable description of an atomic operation that I know is equivalent code, imagined to run as one indivisible step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;atomic_fetch_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="nf"&gt;atomic_compare_exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;desired&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;desired&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
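    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;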
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Difference in behaviour:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;compare_exchange&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;fetch_add&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;can fail&lt;/td&gt;
&lt;td&gt;can't fail - always succeeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;leaves the target memory unchanged if it fails&lt;/td&gt;
&lt;td&gt;always changes memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;succeeds only under a narrow condition - equality of the values pointed to by &lt;code&gt;target&lt;/code&gt; and &lt;code&gt;expected&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;rollback of changes requires another memory write operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;implies a loop of retries&lt;/td&gt;
&lt;td&gt;no loops&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;compare_exchange&lt;/code&gt; and &lt;code&gt;fetch_add&lt;/code&gt; are equivalent in the sense that it is possible to define (implement) one through the other. &lt;br&gt;
But the differences are huge, and sometimes (you'll see) using an inappropriate atomic operation leads to a significant performance impact, up to complete inoperability of the program.&lt;/p&gt;

&lt;p&gt;Basic use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;compare_exchange&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;fetch_add&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;program thread waits for some changes made by the other thread&lt;/td&gt;
&lt;td&gt;program thread can continue progress in any case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lock, mutual exclusion (mutex)&lt;/td&gt;
&lt;td&gt;atomic counter, refcounter (shared_ptr, ...)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Benchmarks &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Basically, I implemented &lt;code&gt;shared_ptr&lt;/code&gt; and &lt;code&gt;spinlock&lt;/code&gt;, each with both &lt;code&gt;compare_exchange&lt;/code&gt; and &lt;code&gt;fetch_add&lt;/code&gt;.&lt;br&gt;
Then I compared their performance under &lt;strong&gt;aggressive&lt;/strong&gt; &lt;strong&gt;concurrent&lt;/strong&gt; read/write access.&lt;br&gt;
Also compared to standard &lt;a href="https://en.cppreference.com/w/cpp/thread/mutex" rel="noopener noreferrer"&gt;&lt;code&gt;std::mutex&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://en.cppreference.com/w/cpp/memory/shared_ptr" rel="noopener noreferrer"&gt;&lt;code&gt;std::shared_ptr&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
Code: &lt;a href="https://github.com/agutikov/faa_vs_cmpxchg" rel="noopener noreferrer"&gt;https://github.com/agutikov/faa_vs_cmpxchg&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;shared_ptr&lt;/code&gt; implementation &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of posting the complete code of &lt;code&gt;shared_ptr&lt;/code&gt; here, I provide just the difference between the two implementations.&lt;br&gt;
The main part is the &lt;strong&gt;decrement&lt;/strong&gt; of the &lt;strong&gt;reference counter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Implemented with &lt;code&gt;fetch_sub&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;decref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;*&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_fetch_sub_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implemented with &lt;code&gt;compare_exchange&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;decref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;*&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_compare_exchange_weak_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no need to reload &lt;code&gt;v&lt;/code&gt; inside the loop — on failure, &lt;code&gt;compare_exchange&lt;/code&gt; already writes the actual value of &lt;code&gt;*r&lt;/code&gt; back into &lt;code&gt;v&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The default memory order is &lt;code&gt;seq_cst&lt;/code&gt;, which is stronger than required. Decrement uses &lt;code&gt;release&lt;/code&gt; so that writes made through the shared object by other threads happen-before the destructor of the managed object. The thread that observes the final decrement (return value &lt;code&gt;1&lt;/code&gt;) then issues an &lt;code&gt;acquire&lt;/code&gt; fence before calling &lt;code&gt;delete&lt;/code&gt; on the block:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_thread_fence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the canonical &lt;code&gt;release&lt;/code&gt; / &lt;code&gt;acquire&lt;/code&gt;-fence pattern used by libstdc++ and Boost. A standalone &lt;code&gt;seq_cst&lt;/code&gt; fence here would not have established a synchronizes-with relationship with &lt;code&gt;relaxed&lt;/code&gt; decrements on other threads, and is much more expensive than an &lt;code&gt;acquire&lt;/code&gt; fence besides.&lt;/p&gt;

&lt;p&gt;Benchmark contains both implementations of &lt;code&gt;decref&lt;/code&gt;: with &lt;code&gt;std::atomic_compare_exchange_weak&lt;/code&gt; and &lt;code&gt;std::atomic_compare_exchange_strong&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;spinlock&lt;/code&gt; implementation &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The main part of &lt;code&gt;spinlock&lt;/code&gt; is the &lt;code&gt;lock()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;Implemented with &lt;code&gt;fetch_add&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;spinlock&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;locked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_fetch_add_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fetch_sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fetch_sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implemented with &lt;code&gt;compare_exchange&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;spinlock&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;locked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_compare_exchange_weak_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(;;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// relaxed test-load: just a hint that gates the CAS&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_compare_exchange_weak_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="c1"&gt;// benchmarked both variants with "pause" and without&lt;/span&gt;
                &lt;span class="kr"&gt;__asm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pause"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: the previous version of the article used the default &lt;code&gt;seq_cst&lt;/code&gt; ordering on every spinlock operation. A lock only needs &lt;code&gt;acquire&lt;/code&gt; on the successful lock-taking RMW and &lt;code&gt;release&lt;/code&gt; on &lt;code&gt;unlock&lt;/code&gt;. The test-load inside the spin loop can be &lt;code&gt;relaxed&lt;/code&gt; — it's purely a hint, the CAS still does the real synchronization. With default &lt;code&gt;seq_cst&lt;/code&gt; every &lt;code&gt;unlock&lt;/code&gt; emits a full memory barrier (&lt;code&gt;mfence&lt;/code&gt; on x86) and every spin iteration causes extra coherency traffic.&lt;/p&gt;

&lt;p&gt;Implemented with &lt;code&gt;fetch_or&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;for_spinlock&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;locked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic_fetch_or_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kr"&gt;__asm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pause"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;atomic_fetch_or&lt;/code&gt; is much more suited for &lt;code&gt;spinlock&lt;/code&gt; than &lt;code&gt;fetch_add&lt;/code&gt;: it is &lt;em&gt;idempotent&lt;/em&gt; on a contended lock — repeatedly &lt;code&gt;OR&lt;/code&gt;-ing &lt;code&gt;1&lt;/code&gt; into a value that is already &lt;code&gt;1&lt;/code&gt; leaves it &lt;code&gt;1&lt;/code&gt;, so failing lock attempts don't have to be rolled back. This avoids the destructive ping-pong of &lt;code&gt;fetch_add&lt;/code&gt; + &lt;code&gt;fetch_sub&lt;/code&gt; that ruins the FAA-based spinlock under contention, while keeping the single-instruction simplicity that &lt;code&gt;compare_exchange&lt;/code&gt;'s retry loop doesn't have.&lt;/p&gt;
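&lt;p&gt;To make the idempotency argument concrete, here is a minimal single-threaded sketch. The helper names &lt;code&gt;failed_attempt_fetch_or&lt;/code&gt; and &lt;code&gt;failed_attempt_fetch_add&lt;/code&gt; are hypothetical and are not part of the benchmark; each one simulates a single failed attempt on a lock that is already held and reports what the lock word looked like.&lt;/p&gt;

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers for illustration only (not from the benchmark).
// Both assume the lock word is already 1, i.e. the lock is held by someone else.

// fetch_or: OR-ing 1 into 1 leaves 1, so a failed attempt needs no cleanup.
// Returns the lock word as it is left after the failed attempt.
inline int failed_attempt_fetch_or(std::atomic<int>& locked)
{
    int prev = locked.fetch_or(1, std::memory_order_acquire);
    (void)prev; // prev != 0 means the lock was already taken
    return locked.load(std::memory_order_relaxed);
}

// fetch_add: a failed attempt bumps the word past 1 and has to be rolled back
// with a second write to the same cache line.
// Returns the transient value observed between the two writes.
inline int failed_attempt_fetch_add(std::atomic<int>& locked)
{
    locked.fetch_add(1, std::memory_order_acquire);   // first write
    int transient = locked.load(std::memory_order_relaxed);
    locked.fetch_sub(1, std::memory_order_relaxed);   // second write: rollback
    return transient;
}
```

&lt;p&gt;With the lock word at 1, the first helper leaves it at 1, while the second one briefly moves it to 2 and needs an extra atomic write to restore it. Under contention those extra writes come from every waiting thread.&lt;/p&gt;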

&lt;p&gt;The benchmark contains all three implementations: &lt;code&gt;fetch_or&lt;/code&gt;, &lt;code&gt;fetch_add&lt;/code&gt;, and &lt;code&gt;compare_exchange&lt;/code&gt; (both &lt;code&gt;weak&lt;/code&gt; and &lt;code&gt;strong&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark for &lt;code&gt;shared_ptr&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a single instance of &lt;code&gt;shared_ptr&amp;lt;int&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pass it to a variable number of benchmark threads.&lt;/li&gt;
&lt;li&gt;Each thread copies and moves the &lt;code&gt;shared_ptr&lt;/code&gt; N times in a loop.&lt;/li&gt;
&lt;li&gt;N = Total_N / n_threads, so every benchmark run performs the same total number of loop iterations regardless of the thread count.&lt;/li&gt;
&lt;/ol&gt;
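&lt;p&gt;The steps above can be sketched as a small driver. This is an illustration under assumptions, not the benchmark's actual harness: &lt;code&gt;run_split&lt;/code&gt; and &lt;code&gt;count_iterations&lt;/code&gt; are hypothetical names, and the remainder of the integer division is simply dropped.&lt;/p&gt;

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical driver (assumption, not the article's harness):
// runs work(N) on n_threads threads, where N = total_n / n_threads,
// waits for all of them, and returns N.
inline int64_t run_split(int n_threads, int64_t total_n,
                         const std::function<void(int64_t)>& work)
{
    const int64_t per_thread = total_n / n_threads; // remainder is dropped
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i) {
        threads.emplace_back([&work, per_thread] { work(per_thread); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return per_thread;
}

// Example workload for illustration: counts iterations into a shared atomic.
inline void count_iterations(std::atomic<int64_t>& total, int64_t counter)
{
    while (counter > 0) {
        total.fetch_add(1, std::memory_order_relaxed);
        --counter;
    }
}
```

&lt;p&gt;Because the total stays fixed, timings taken at different thread counts can be compared directly once they are divided by the total number of iterations.&lt;/p&gt;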

&lt;p&gt;Workload function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;shared_ptr_t&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;shared_ptr_benchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_ptr_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int64_t&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;shared_ptr_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;shared_ptr_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// try to keep the compiler from optimizing this code away&lt;/span&gt;
            &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ERROR %p %p %p&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to get callable for benchmark threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;shared_ptr_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;thread_work&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;shared_ptr_benchmark&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;shared_ptr_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benchmark for &lt;code&gt;spinlock&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a single instance of &lt;code&gt;spinlock&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pass it to a variable number of benchmark threads.&lt;/li&gt;
&lt;li&gt;Each thread performs fast modifications of global variables while holding the lock, N times in a loop.&lt;/li&gt;
&lt;li&gt;N = Total_N / n_threads, so every benchmark run performs the same total number of loop iterations regardless of the thread count.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Workload function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;spinlock_benchmark_data&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;spinlock_t&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;spinlock_benchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spinlock_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;slock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="kt"&gt;int64_t&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;spinlock_benchmark_data&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;slock&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;slock&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;unlock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ERROR %zu != %zu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to get callable for benchmark threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;spinlock_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;slock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;spinlock_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;spinlock_benchmark_data&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;spinlock_benchmark_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;slock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kr"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;hasher&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hasher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;this_thread&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;get_id&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;spinlock_benchmark&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;spinlock_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;global_data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Measurements &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete wall-clock time of execution of all benchmark threads with &lt;a href="https://en.cppreference.com/w/cpp/chrono/steady_clock" rel="noopener noreferrer"&gt;&lt;code&gt;std::chrono::steady_clock&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CPU time spent, by &lt;a href="https://en.cppreference.com/w/cpp/chrono/c/clock" rel="noopener noreferrer"&gt;&lt;code&gt;std::clock&lt;/code&gt;&lt;/a&gt;.
Both results are then divided by the total number of iterations performed, which gives the approximate wall-clock and CPU time spent per iteration.&lt;/li&gt;
&lt;/ul&gt;
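&lt;p&gt;A minimal sketch of this measurement, assuming a callable that starts and joins all benchmark threads. The names &lt;code&gt;measure&lt;/code&gt; and &lt;code&gt;per_iteration_times&lt;/code&gt; are hypothetical and not taken from the benchmark source. On POSIX systems &lt;code&gt;std::clock&lt;/code&gt; reports CPU time consumed by the whole process, so with several busy threads the CPU figure can exceed the wall-clock one.&lt;/p&gt;

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>

// Hypothetical result type: average time per loop iteration, in nanoseconds.
struct per_iteration_times
{
    double wall_ns;
    double cpu_ns;
};

// Hypothetical helper: run_all_threads() is assumed to start every benchmark
// thread and return only after all of them have been joined.
template<typename F>
per_iteration_times measure(F&& run_all_threads, int64_t total_iterations)
{
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    run_all_threads();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    const double wall_ns =
        std::chrono::duration<double, std::nano>(wall_end - wall_start).count();
    const double cpu_ns =
        1e9 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;

    // Divide both totals by the number of iterations actually performed.
    return { wall_ns / total_iterations, cpu_ns / total_iterations };
}
```

&lt;p&gt;The resolution of &lt;code&gt;std::clock&lt;/code&gt; is coarse, so this only makes sense when the total run is long compared to one clock tick.&lt;/p&gt;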

&lt;h2&gt;
  
  
  Resulting charts for &lt;code&gt;refcount&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The benchmarks were run on a laptop with an i9-14900HX (24 cores / 32 threads: 8 P-cores with SMT and 16 E-cores).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;refcount&lt;/code&gt;: average CPU time per loop iteration, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4eyrra04rog3jdus6ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4eyrra04rog3jdus6ww.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All four implementations cluster within ~2× on a log scale. &lt;code&gt;std::shared_ptr&lt;/code&gt; and &lt;code&gt;fetch_sub&lt;/code&gt; lead; the two &lt;code&gt;cmpxchg&lt;/code&gt; variants are consistently slower because of CAS retries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;refcount&lt;/code&gt;: average CPU time per loop iteration, relative to baseline &lt;code&gt;std::shared_ptr&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkhcqqa64ndklye98tbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkhcqqa64ndklye98tbe.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Relative view: &lt;code&gt;fetch_sub&lt;/code&gt; stays within ±20% of &lt;code&gt;std::shared_ptr&lt;/code&gt; (occasionally even faster); &lt;code&gt;cmpxchg_strong&lt;/code&gt; / &lt;code&gt;cmpxchg_weak&lt;/code&gt; cost roughly 1.4–1.9× the baseline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;refcount&lt;/code&gt;: average wall-clock time per loop iteration, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fced6j2z86hv6zw6t8g5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fced6j2z86hv6zw6t8g5p.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wall-clock latency stays bounded under ~70 ns even at 32 threads — the refcount op is short enough that the contended cache line, not the operation count, sets the ceiling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;refcount&lt;/code&gt;: average wall-clock time per loop iteration, relative to baseline &lt;code&gt;std::shared_ptr&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzjj4wvz5isgoxj2j1u9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzjj4wvz5isgoxj2j1u9.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Same hierarchy as the CPU view, with similar relative spread.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A note on thread placement
&lt;/h3&gt;

&lt;p&gt;The relative-to-&lt;code&gt;std::shared_ptr&lt;/code&gt; chart shows clear dips at &lt;strong&gt;6, 13–14, 20, and 28 threads&lt;/strong&gt; — points where my &lt;code&gt;fetch_sub&lt;/code&gt; implementation appears to "speed up" against the baseline. Looking at the raw CPU numbers, what's actually happening is the opposite: &lt;code&gt;std::shared_ptr&lt;/code&gt; &lt;em&gt;worsens&lt;/em&gt; at those specific thread counts while my implementation is more stable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;N (threads)&lt;/th&gt;
&lt;th&gt;fetch_sub CPU, ns&lt;/th&gt;
&lt;th&gt;shared_ptr CPU, ns&lt;/th&gt;
&lt;th&gt;ratio (fetch_sub / shared_ptr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;128.97&lt;/td&gt;
&lt;td&gt;154.03&lt;/td&gt;
&lt;td&gt;0.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;167.74&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;211.70&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.79&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;193.13&lt;/td&gt;
&lt;td&gt;166.77&lt;/td&gt;
&lt;td&gt;1.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;345.39&lt;/td&gt;
&lt;td&gt;267.74&lt;/td&gt;
&lt;td&gt;1.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;357.01&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;396.60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.90&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;14&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;397.94&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;423.95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.94&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;482.43&lt;/td&gt;
&lt;td&gt;355.22&lt;/td&gt;
&lt;td&gt;1.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;530.56&lt;/td&gt;
&lt;td&gt;476.93&lt;/td&gt;
&lt;td&gt;1.11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;522.37&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;601.92&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.87&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;573.44&lt;/td&gt;
&lt;td&gt;575.32&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;739.54&lt;/td&gt;
&lt;td&gt;842.85&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;28&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;793.46&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;962.26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.82&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;736.20&lt;/td&gt;
&lt;td&gt;845.86&lt;/td&gt;
&lt;td&gt;0.87&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dips are spaced ~7–8 threads apart, which lines up with the i9-14900HX's hybrid topology: 8 P-cores (16 SMT slots) followed by 16 E-cores. At particular thread counts, the kernel scheduler ends up placing threads in P/E mixes or SMT-sibling configurations that interact poorly with &lt;code&gt;std::shared_ptr&lt;/code&gt;'s control-block / managed-object layout (allocated separately and prone to false-sharing patterns with the host object), while my hand-rolled &lt;code&gt;shared_ptr_block&lt;/code&gt; uses a different sizing/alignment that is less sensitive to those placements. There is also a single-run measurement component: without averaging across multiple runs and without pinning threads via &lt;code&gt;taskset&lt;/code&gt;, ~10–20% per-point variance is normal on a hybrid laptop CPU.&lt;/p&gt;

&lt;p&gt;To separate the structural component from run-to-run noise: re-run the benchmark 3–5 times and look at whether the dips stay at the same N (structural), or wander (noise). Pinning threads to a fixed core set with &lt;code&gt;taskset&lt;/code&gt; would flatten the topology-induced component if that's the dominant cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resulting charts for &lt;code&gt;spinlock&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average CPU time per loop iteration, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxctp9i32l9h777nbuf84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxctp9i32l9h777nbuf84.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;fetch_add&lt;/code&gt; blows up — per-iteration CPU explodes past 10 µs by 6–7 threads (the benchmark caps it at 7 threads to keep total runtime bounded). Everything else groups together at the bottom of the log scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average CPU time per loop iteration, relative to baseline &lt;code&gt;std::mutex&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx7xdjw91cn14c3xwy5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx7xdjw91cn14c3xwy5e.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;fetch_add&lt;/code&gt; reaches ~35× &lt;code&gt;std::mutex&lt;/code&gt; before the benchmark cuts it off. At this scale the remaining variants sit at about 1× the mutex baseline or below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average CPU time per loop iteration, without &lt;code&gt;fetch_add&lt;/code&gt;, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb61lvhhcd1vltn0erno9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb61lvhhcd1vltn0erno9.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With &lt;code&gt;fetch_add&lt;/code&gt; removed from the view, the second-tier story becomes visible: &lt;code&gt;fetch_or&lt;/code&gt; separates from the &lt;code&gt;cmpxchg&lt;/code&gt; / &lt;code&gt;std::mutex&lt;/code&gt; pack above ~10 threads and ends up roughly an order of magnitude slower by 32 threads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average CPU time per loop iteration, without &lt;code&gt;fetch_add&lt;/code&gt;, relative to baseline &lt;code&gt;std::mutex&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagjgiqoflloi8gg8hddg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagjgiqoflloi8gg8hddg.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Relative view: &lt;code&gt;cmpxchg&lt;/code&gt; variants stay ≤1.0× &lt;code&gt;std::mutex&lt;/code&gt; across nearly the whole range; &lt;code&gt;fetch_or&lt;/code&gt; climbs from ~1× near 10 threads to ~5–7× at 24–32 threads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average wall-clock time per loop iteration, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyev9se6d4hnzcd8fuj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyev9se6d4hnzcd8fuj0.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Same hierarchy in wall-clock: &lt;code&gt;fetch_add&lt;/code&gt; is off the top, &lt;code&gt;fetch_or&lt;/code&gt; is the worst remaining option past ~10 threads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average wall-clock time per loop iteration, relative to baseline &lt;code&gt;std::mutex&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftr0dzqmodno6civiidx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftr0dzqmodno6civiidx9.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wall-clock relative to &lt;code&gt;std::mutex&lt;/code&gt;: &lt;code&gt;fetch_add&lt;/code&gt; dwarfs everything else at the low thread counts where it was measured; the other variants stay around 1×.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average wall-clock time per loop iteration, without &lt;code&gt;fetch_add&lt;/code&gt;, nanoseconds, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xwdxogkp9eu7d225wp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xwdxogkp9eu7d225wp3.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Without &lt;code&gt;fetch_add&lt;/code&gt;: &lt;code&gt;cmpxchg&lt;/code&gt; variants run at or below &lt;code&gt;std::mutex&lt;/code&gt; for most thread counts; &lt;code&gt;fetch_or&lt;/code&gt; runs 3–5× slower than the cmpxchg pack at high contention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;spinlock&lt;/code&gt;: average wall-clock time per loop iteration, without &lt;code&gt;fetch_add&lt;/code&gt;, relative to baseline &lt;code&gt;std::mutex&lt;/code&gt;, less is better:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1szkkitcxqxwxkaogmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1szkkitcxqxwxkaogmw.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cmpxchg&lt;/code&gt; wall-clock sits around 0.5–1.0× &lt;code&gt;std::mutex&lt;/code&gt;; &lt;code&gt;fetch_or&lt;/code&gt; ends up 3–5× the mutex baseline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For reference counters, &lt;code&gt;fetch_sub&lt;/code&gt; is roughly 1.4–1.9× faster than &lt;code&gt;compare_exchange&lt;/code&gt; in CPU time, and stays within ~20% of &lt;code&gt;std::shared_ptr&lt;/code&gt; (occasionally even beating it). &lt;code&gt;std::shared_ptr&lt;/code&gt; is no longer "much faster than my implementation": the original gap came almost entirely from bugs in my wrapper, not from anything inherent to libstdc++.&lt;/li&gt;
&lt;li&gt;The gap between the first implementation and &lt;code&gt;std::shared_ptr&lt;/code&gt; was caused by mistakes in the wrapper (not in &lt;code&gt;decref&lt;/code&gt; itself):

&lt;ul&gt;
&lt;li&gt;The move constructor / move assignment did an &lt;code&gt;acquire()&lt;/code&gt; followed by &lt;code&gt;reset()&lt;/code&gt; on the source — a pair of atomic RMWs that cancel each other out. A move should just steal the pointer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;refcount++&lt;/code&gt; defaults to &lt;code&gt;std::memory_order_seq_cst&lt;/code&gt;, while &lt;code&gt;fetch_add(1, std::memory_order_relaxed)&lt;/code&gt; is sufficient when incrementing a refcount (taking a new reference).&lt;/li&gt;
&lt;li&gt;Decrement was &lt;code&gt;relaxed&lt;/code&gt; followed by a &lt;code&gt;seq_cst&lt;/code&gt; fence before &lt;code&gt;delete&lt;/code&gt;. The fence cannot synchronize with &lt;code&gt;relaxed&lt;/code&gt; RMWs on other threads — it has nothing to pair with. The correct pattern is &lt;code&gt;release&lt;/code&gt; on every decrement and an &lt;code&gt;acquire&lt;/code&gt; fence on the thread that observes the final one. This is both a correctness fix and a performance fix (an &lt;code&gt;acquire&lt;/code&gt; fence is far cheaper than &lt;code&gt;seq_cst&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The cmpxchg-based &lt;code&gt;decref&lt;/code&gt; did an explicit &lt;code&gt;load()&lt;/code&gt; at the top of every retry iteration, which is redundant — a failing &lt;code&gt;compare_exchange&lt;/code&gt; already writes the current value into the &lt;code&gt;expected&lt;/code&gt; argument.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The same &lt;code&gt;seq_cst&lt;/code&gt;-by-default trap hits the spinlock: every &lt;code&gt;unlock&lt;/code&gt; was emitting a full memory barrier and every spin iteration was doing a &lt;code&gt;seq_cst&lt;/code&gt; load. Switching to &lt;code&gt;acquire&lt;/code&gt; on the lock-taking RMW, &lt;code&gt;release&lt;/code&gt; on &lt;code&gt;unlock&lt;/code&gt;, and &lt;code&gt;relaxed&lt;/code&gt; on the test-load inside the spin loop is what's actually needed.&lt;/li&gt;

&lt;li&gt;A &lt;code&gt;fetch_add&lt;/code&gt;-based spinlock is unusable: per-iteration CPU climbs roughly two orders of magnitude with even a handful of contending threads, because every failing lock attempt does a destructive increment + rollback that ping-pongs the cache line. The benchmark caps it at 7 threads to keep total runtime bounded.&lt;/li&gt;

&lt;li&gt;A &lt;code&gt;compare_exchange&lt;/code&gt;-based spinlock (test-and-test-and-set) tracks at or slightly below &lt;code&gt;std::mutex&lt;/code&gt; across the whole thread range. This is the right default when you actually need a spinlock.&lt;/li&gt;

&lt;li&gt;A &lt;code&gt;fetch_or&lt;/code&gt;-based spinlock looks attractive in isolation — &lt;code&gt;OR&lt;/code&gt;-ing &lt;code&gt;1&lt;/code&gt; into a locked value is idempotent, so failed attempts cost a single RMW with no rollback — and it wins at low thread counts. But under high contention it scales much worse than &lt;code&gt;compare_exchange&lt;/code&gt;: roughly 3–4× slower in CPU at 10–16 threads, and 5–7× slower at 24–32 threads. The reason is that &lt;code&gt;fetch_or&lt;/code&gt; writes the cache line on every attempt (even when the stored value doesn't change), whereas the cmpxchg test-and-test-and-set only writes when the relaxed test-load suggests the lock is free. For a contended lock, prefer &lt;code&gt;compare_exchange&lt;/code&gt;; reserve &lt;code&gt;fetch_or&lt;/code&gt; for low-contention or short-critical-section scenarios.&lt;/li&gt;

&lt;/ul&gt;
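
&lt;p&gt;The wrapper fixes listed above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the benchmark's actual code: &lt;code&gt;Counted&lt;/code&gt;, &lt;code&gt;Ref&lt;/code&gt;, &lt;code&gt;incref&lt;/code&gt; and the &lt;code&gt;bool&lt;/code&gt; return value of &lt;code&gt;decref&lt;/code&gt; are hypothetical names and choices made for the example.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Hypothetical intrusive refcounted object; names are illustrative only.
struct Counted {
    std::atomic<int> refs{1};  // starts with one owner
    int payload = 0;
};

inline void incref(Counted* p) {
    assert(p != nullptr);
    // Taking a new reference publishes nothing, so relaxed is sufficient.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if this call destroyed the object.
inline bool decref(Counted* p) {
    assert(p != nullptr);
    // Release on every decrement: this thread's earlier writes to *p
    // become visible to whichever thread ends up deleting it.
    if (p->refs.fetch_sub(1, std::memory_order_release) == 1) {
        // Acquire fence only on the thread that observed the final decrement.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
        return true;
    }
    return false;
}

class Ref {
public:
    // Adopts one existing reference (does not increment).
    explicit Ref(Counted* p = nullptr) noexcept : p_(p) {}

    // Copy: one relaxed increment.
    Ref(const Ref& o) noexcept : p_(o.p_) { if (p_) incref(p_); }

    // Move: steal the pointer, no atomic RMW on the counter.
    Ref(Ref&& o) noexcept : p_(std::exchange(o.p_, nullptr)) {}

    // Copy-and-swap covers both copy and move assignment;
    // the previous target is released when `o` is destroyed.
    Ref& operator=(Ref o) noexcept { std::swap(p_, o.p_); return *this; }

    ~Ref() { if (p_) decref(p_); }

    Counted* get() const noexcept { return p_; }

private:
    Counted* p_;
};
```
]]>

&lt;p&gt;With this shape, copying a &lt;code&gt;Ref&lt;/code&gt; costs one relaxed &lt;code&gt;fetch_add&lt;/code&gt;, move-constructing it touches no atomics at all, and only the thread that drops the last reference pays for the &lt;code&gt;acquire&lt;/code&gt; fence.&lt;/p&gt;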

&lt;h3&gt;
  
  
  Summary table
&lt;/h3&gt;

&lt;p&gt;Numbers are from this benchmark on the i9-14900HX (24 cores / 32 threads). "Per-iteration CPU" means total CPU time across all threads divided by total iterations performed — a measure of how much compute the operation actually burns under load.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;fetch_add&lt;/code&gt; / &lt;code&gt;fetch_sub&lt;/code&gt; (FAA)&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;compare_exchange&lt;/code&gt; (CAS)&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;fetch_or&lt;/code&gt; (FOR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unconditional RMW. Always succeeds, always writes memory.&lt;/td&gt;
&lt;td&gt;Conditional swap. Succeeds only if &lt;code&gt;*target == *expected&lt;/code&gt;; failure writes the observed value into &lt;code&gt;*expected&lt;/code&gt; and the caller normally retries in a loop.&lt;/td&gt;
&lt;td&gt;Unconditional RMW. &lt;em&gt;Idempotent&lt;/em&gt; when the OR-ed bits are already set: re-OR-ing leaves the value unchanged but the cache line is still written.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single-thread per-iter cost (this CPU)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;spinlock: ~8 ns/iter (2 RMWs per critical section); refcount: ~9 ns/iter&lt;/td&gt;
&lt;td&gt;spinlock: ~5.8 ns/iter; refcount: ~9.8 ns/iter (no contention → CAS doesn't retry)&lt;/td&gt;
&lt;td&gt;spinlock: ~4.7 ns/iter — the fastest uncontended option (single RMW + plain store)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Refcount @ 32 threads (CPU)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~890 ns/iter — within ~5% of &lt;code&gt;std::shared_ptr&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~1400–1500 ns/iter — 1.5–1.7× slower than FAA (CAS retries cost real work under contention)&lt;/td&gt;
&lt;td&gt;not applicable (a refcount needs &lt;code&gt;+N&lt;/code&gt;, not bitwise OR)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spinlock @ 32 threads (CPU)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;catastrophic — limited to 7 threads in the benchmark; at 7 threads ~18 µs/iter, roughly 30–45× &lt;code&gt;std::mutex&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~2.4–2.5 µs/iter — at or slightly below &lt;code&gt;std::mutex&lt;/code&gt; across the entire range&lt;/td&gt;
&lt;td&gt;~17.5 µs/iter — 5–7× &lt;code&gt;std::mutex&lt;/code&gt; at 24–32 threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Major benefit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cheapest deterministic RMW with a useful return value; one instruction, no retry, no branching.&lt;/td&gt;
&lt;td&gt;Only commits a write to the cache line when there's a real chance of acquiring the new state. Scales best under contention.&lt;/td&gt;
&lt;td&gt;Single-instruction idempotent acquire; cheapest of the three in the uncontended case.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Major issue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;If used as a guard (e.g., a lock), every failed attempt destructively modifies memory and must be undone by a second RMW. The destructive ping-pong saturates the coherence fabric.&lt;/td&gt;
&lt;td&gt;Worst-case cost is unbounded: the retry loop can keep spinning under unlucky scheduling. Observed performance is contention-sensitive.&lt;/td&gt;
&lt;td&gt;Always writes the cache line, even when the value doesn't change. Under heavy contention this loses to a &lt;code&gt;cmpxchg&lt;/code&gt; test-and-test-and-set, which only writes when the relaxed test-load suggests success.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory order to use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Counter increment: &lt;code&gt;relaxed&lt;/code&gt;. Counter decrement that publishes data (refcount release): &lt;code&gt;release&lt;/code&gt; + &lt;code&gt;acquire&lt;/code&gt; fence on the final decrement.&lt;/td&gt;
&lt;td&gt;Lock acquire: &lt;code&gt;acquire&lt;/code&gt; on success, &lt;code&gt;relaxed&lt;/code&gt; on failure. Refcount decrement: &lt;code&gt;release&lt;/code&gt; on success, &lt;code&gt;relaxed&lt;/code&gt; on failure. Almost never &lt;code&gt;seq_cst&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;Lock acquire: &lt;code&gt;acquire&lt;/code&gt;. Unlock store: &lt;code&gt;release&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monotonic counters, refcounts, sequence numbers, throughput stats.&lt;/td&gt;
&lt;td&gt;Locks, lock-free queues/stacks, any algorithm whose progress is conditional on the observed state.&lt;/td&gt;
&lt;td&gt;Single-bit flags or low-contention "test-and-set"-style acquires; quick fast-paths that don't fight over the cache line.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Don't use for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anything that must be guarded by a condition — e.g., spinlocks. Use &lt;code&gt;compare_exchange&lt;/code&gt; or &lt;code&gt;fetch_or&lt;/code&gt; instead.&lt;/td&gt;
&lt;td&gt;Pure counters (you'd be paying for retries you don't need).&lt;/td&gt;
&lt;td&gt;Heavily contended locks — &lt;code&gt;compare_exchange&lt;/code&gt; test-and-test-and-set is dramatically better.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
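
&lt;p&gt;The two lock shapes compared in the table can be sketched like this. It is a minimal sketch, not the benchmarked implementation: the class names &lt;code&gt;CasSpinlock&lt;/code&gt; and &lt;code&gt;OrSpinlock&lt;/code&gt;, the &lt;code&gt;hammer&lt;/code&gt; helper, and the &lt;code&gt;std::this_thread::yield()&lt;/code&gt; calls in the wait loops are assumptions added for illustration.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-test-and-set lock built on compare_exchange (hypothetical name).
class CasSpinlock {
public:
    void lock() noexcept {
        for (;;) {
            // Read-only relaxed test first: waiters do not write the cache line.
            if (!locked_.load(std::memory_order_relaxed)) {
                bool expected = false;
                // acquire on success, relaxed on failure.
                if (locked_.compare_exchange_weak(expected, true,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed)) {
                    return;
                }
            }
            std::this_thread::yield();  // assumption: be polite while waiting
        }
    }
    void unlock() noexcept { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// fetch_or lock (hypothetical name): no rollback needed, but every attempt
// is an RMW that writes the cache line even when the bit is already set.
class OrSpinlock {
public:
    void lock() noexcept {
        while (locked_.fetch_or(1u, std::memory_order_acquire) & 1u) {
            std::this_thread::yield();  // assumption: be polite while waiting
        }
    }
    void unlock() noexcept { locked_.store(0u, std::memory_order_release); }

private:
    std::atomic<unsigned> locked_{0u};
};

// Illustrative helper: `threads` threads each increment a plain counter
// `iters` times under the given lock type and return the final value.
template <class Lock>
long hammer(int threads, int iters) {
    Lock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```
]]>

&lt;p&gt;Both classes use &lt;code&gt;acquire&lt;/code&gt; on the lock-taking RMW and &lt;code&gt;release&lt;/code&gt; on the unlock store; the difference is only in how much cache-line write traffic a waiting thread generates.&lt;/p&gt;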

</description>
      <category>cpp</category>
      <category>benchmark</category>
    </item>
  </channel>
</rss>
