DEV Community

Aleksei Gutikov
Aleksei Gutikov

Posted on • Edited on

Who is faster: compare_exchange or fetch_add?

C++ has out of the box cross-platfrorm atomic operations since C++11.

There are several types of atomic operations.
Here I want to compare 2 types of them:

  • atomic_fetch_add, atomic_fetch_sub, etc...
  • atomic_compare_exchange

This kind of atomic ops existed in computer programming long before C++11.
For example Compare-and-swap (CAS) and Fetch-and-add (FAA) where implemented as CPU instructions in Intel 80486 - CMPXCHG and XADD.

Here I will not talk about origin of atomic operations and problem they are designed to solve - data races.

Here I want to concentrate only on next 2 points:

  1. Comparison of semantic and typical use cases of atomic CAS and FAA.
  2. Comparison of performance of CAS and FAA in typical and abnormal cases.

Differences and similarity

Most understandable description of atomic operation that I know is equivalent code:

int atomic_fetch_add(int* target, int add)
{
    int old = *target;
    *target += add;
    return old;
}

bool atomic_compare_exchange(int* target, int* expected, int desired)
{
    if (*target == *expected) {
        *target = desired;
        return true;
    }
    return false;
}
Enter fullscreen mode Exit fullscreen mode

Difference in behaviour:

compare_exchange fetch_add
can fail can't fail - always succeeds
leaves memory unchanged if fails always changes memory
succeeds only within narrow contition - equality of values pointed to by target and expected rollback of changes requires another memory write operation
implies a loop of retries no loops

Both compare_exchange and fetch_add are equivalent, in the sense that it is possible to define (implement) one through another.
But differences are huge and sometimes (you'll see) usage of inappropriate atomic operation leads to significant performance impact up to complete inoperability of program.

Basic use cases:

compare_exchange fetch_add
program thread waits for some changes made by the other thread program thread can continue progress in any case
lock, mutual exclusion (mutex) atomic counter, refcounter (shared_ptr, ...)

Benchmarks

Basically, implemented shared_ptr and spinlock each with both compare_exchange and fetch_add.
And compared their performance under aggressive concurrent read/write access.
Also compared to standard std::mutex and std::shared_ptr.
Code: https://github.com/agutikov/faa_vs_cmpxchg

shared_ptr implementaion

Instead of posting complete code of shared_ptr here, I provide just difference between two implementations.
Main part is decrement of reference counter.

Implemented with fetch_sub:

int decref(std::atomic<int>* r)
{
    return std::atomic_fetch_sub(r, 1);
}
Enter fullscreen mode Exit fullscreen mode

Implemented with compare_exchange:

int decref(std::atomic<int>* r)
{
    int v;
    do {
        v = r->load();
    } while (!std::atomic_compare_exchange_weak(r, &v, v-1));
    return v;
}
Enter fullscreen mode Exit fullscreen mode

Benchmark contains both implementations of decref: with std::atomic_compare_exchange_weak and std::atomic_compare_exchange_strong.

spinlock implementation

Main part of spinlock is .

Implemented with fetch_add:

struct spinlock
{
    std::atomic<int> locked = 0;

    void lock()
    {
        while (std::atomic_fetch_add(&locked, 1) != 0) {
            locked--;
        }
    }

    void unlock()
    {
        locked--;
    }
};
Enter fullscreen mode Exit fullscreen mode

Implemented with compare_exchange:

struct spinlock
{
    std::atomic<bool> locked = false;

    void lock()
    {
        bool v = false;
        if (!std::atomic_compare_exchange_weak(lock, &v, true)) {
            for (;;) {
                if (locked.load() == 0) {
                    if (std::atomic_compare_exchange_weak(lock, &v, true)) {
                        break;
                    }
                }
                // benchmarked both variants with "pause" and without
                __asm("pause");
            }
        }
    }

    void unlock()
    {
        locked = false;
    }
};
Enter fullscreen mode Exit fullscreen mode

Benchmark contains both implementations of spinlock: with std::atomic_compare_exchange_weak and std::atomic_compare_exchange_strong.

Benchmark for shared_ptr

  1. Creates single instance of shred_ptr<int>.
  2. Pass it into variable number of benchmark threads.
  3. Each thread do copy and move of shared_ptr N times in loop.
  4. N = Total_N / n_threads. So each run of benchmark with different number of thread do the same total number of lopp iterations.

Workload function:

template< template<typename T> typename shared_ptr_t >
void shared_ptr_benchmark(shared_ptr_t<int> p, int64_t counter)
{
    while (counter > 0) {
        shared_ptr_t<int> p1(p);
        shared_ptr_t<int> p2(std::move(p1));
        if (!p1 && p2) {
            // try to protect code from been optimized out by compiler
            counter -= *p2;
        } else {
            fprintf(stderr, "ERROR %p %p %p\n", p, p1.block, p2.block);
            std::abort();
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

How to get callable for benchmark threads:

shared_ptr_t<int> p(new int(1));

auto thread_work = [p, counter] () {
        shared_ptr_benchmark<shared_ptr_t>(p, counter);
    };
}
Enter fullscreen mode Exit fullscreen mode

Benchmark for spinlock

  1. Create single instance of spinlock.
  2. Pass it into variable number of benchmark threads.
  3. Each thread do fast modifications of global variables with holding a lock, N times in loop.
  4. N = Total_N / n_threads. So each run of benchmark with different number of thread do the same total number of loop iterations.

Workload function:

struct spinlock_benchmark_data
{
    size_t value1 = 0;
    uint8_t padding[8192];
    size_t value2 = 0;
};

template<typename spinlock_t>
void spinlock_benchmark(spinlock_t* slock,
                        int64_t counter,
                        size_t id,
                        spinlock_benchmark_data* global_data)
{
    while (counter > 0) {
        slock->lock();

        size_t v1 = global_data->value1;
        global_data->value1 = id;

        size_t v2 = global_data->value2;
        global_data->value2 = id;

        slock->unlock();

        if (v1 != v2) {
            fprintf(stderr, "ERROR %lu != %lu\n", v1, v2);
            std::abort();
        }

        counter--;
    }
}
Enter fullscreen mode Exit fullscreen mode

How to get callable for benchmark threads:

spinlock_t* slock = new spinlock_t;

spinlock_benchmark_data* global_data = new spinlock_benchmark_data;

return [slock, counter, global_data] () {
    std::hash<std::thread::id> hasher;
    size_t id = hasher(std::this_thread::get_id());
    spinlock_benchmark<spinlock_t>(slock, counter, id, global_data);
};
Enter fullscreen mode Exit fullscreen mode

Measurements

Measured:

  • Complete wall-clock time of execution of all benchmark threads with std::chrono::steady_clock.
  • CPU time spent, by std::clock. Then results divided by total number of iterations performed to evaluate approximate read and CPU time spent for each iteration.

Resulting charts

Benchmarks ran on a laptop with i7-8550U.

shared_ptr: average CPU time per loop iteration, nanoseconds, less is better:
shared_ptr cpu time

shared_ptr: average CPU time per loop iteration, relative to baseline std::shared_ptr, less is better:
shared_ptr cpu time rel

shared_ptr: average wall-clock time per loop iteration, nanoseconds, less is better:
shared_ptr time

shared_ptr: average wall-clock time per loop iteration, relative to baseline std::shared_ptr, less is better:
shared_ptr time rel

spinlock: average CPU time per loop iteration, nanoseconds, less is better:
spinlock cpu time

spinlock: average CPU time per loop iteration, relative to baseline std::mutex, less is better:
spinlock cpu time rel

spinlock: average CPU time per loop iteration, without fetch_add, nanoseconds, less is better:
spinlock cpu time wo faa

spinlock: average CPU time per loop iteration, without fetch_add, relative to baseline std::mutex, less is better:
spinlock cpu time rel wo faa

spinlock: average wall-clock time per loop iteration, nanoseconds, less is better:
spinlock time

spinlock: average wall-clock time per loop iteration, relative to baseline std::mutex, less is better:
spinlock time rel

spinlock: average wall-clock time per loop iteration, without fetch_add, nanoseconds, less is better:
spinlock time wo faa

spinlock: average wall-clock time per loop iteration, without fetch_add, relative to baseline std::mutex, less is better:
spinlock time rel wo faa

Conclusion

  • fetch_sub is twice faster then compare_exchange for reference counters.
  • But standard std::shared_ptr is much faster than my implementation, but I still do not understand why.
  • Spinlock implemented with fetch_add cause so many memory modifications that it drastically (x100) increase time required for lock operation. It shows unacceptable performance while used in 4 threads and more.
  • Simple spinlock implemented with compare_exchange can be as fast as std::mutex and even little-bit faster.

Update

Right after publication I realized that atomic_fetch_or is much more suited for spinlock implementation. Here it is:

struct for_spinlock
{
    std::atomic<int> locked = 0;

    void lock()
    {
        while (std::atomic_fetch_or(&locked, 1) != 0) {
            __asm("pause");
        }
    }

    void unlock()
    {
        locked = 0;
    }
};
Enter fullscreen mode Exit fullscreen mode

And here is benchmark results:

spinlock: average CPU time per loop iteration, nanoseconds, less is better:
Alt Text

spinlock: average CPU time per loop iteration, relative to baseline std::mutex, less is better:
Alt Text

spinlock: average wall-clock time per loop iteration, nanoseconds, less is better:
Alt Text

spinlock: average wall-clock time per loop iteration, relative to baseline std::mutex, less is better:
Alt Text

So, better implementation of spinlock with atomic_fetch_or is only 2 times slower than compare_exchange one.

Top comments (2)

Collapse
 
alexis_payen_d66d3be3214d profile image
Alexis Payen

This was posted a while ago, so I don’t know if you figured it out by now, but the primary reason std::shared_ptr is faster than your implementation lies in the move constructor and move assignment.

github.com/agutikov/faa_vs_cmpxchg...

Your code increments the refcount in Acquire, then decrements it in Reset of the moved shared pointer. So basically you did a temporary increase of the refcount that was immediately reverted, wasting two slow atomic operations.
You only need to move the pointer, the refcount do not change.
Also, you’re writing refcount++ which causes a strong ordering default barrier. Fetch_add(1, relaxed) is more appropriate.

Collapse
 
ysl1311 profile image
ysl1311

Very well written, thanks for sharing