Multithreaded programming in C++ has traditionally been associated with mutexes, condition variables, and potential issues like deadlocks and race conditions. However, modern C++ standards (starting with C++11 and beyond) provide tools for writing high-performance multithreaded code without classical locking mechanisms. In this article, we'll explore advanced techniques: lock-free programming, atomic operations, and various memory ordering models.
Why Do We Need Lock-Free Code?
Lock-free data structures allow multiple threads to work with shared data without using mutexes. Key advantages include:
- Scalability: the absence of locks means no contention for lock acquisition
- No deadlocks: no locks — no mutual blocking
- Guaranteed progress: at least one thread always makes forward progress
However, lock-free code is more complex to design and debug. Apply it only after profiling and identifying performance bottlenecks.
Atomic Operations and std::atomic
The foundation of lock-free programming is atomic operations. C++11 introduced std::atomic<T>, which provides:
- Atomicity of read and write operations
- Control over reordering of operations by the compiler and processor, selected per operation via std::memory_order
Basic Example with Atomic Counter
#include <atomic>
#include <thread>
#include <iostream>
std::atomic<int> counter{0};
void incrementCounter() {
for (int i = 0; i < 100000; ++i) {
// Atomic increment without using mutexes
counter.fetch_add(1, std::memory_order_relaxed);
}
}
int main() {
std::thread t1(incrementCounter);
std::thread t2(incrementCounter);
t1.join();
t2.join();
std::cout << "Final value: " << counter.load() << std::endl;
// Output: 200000 (guaranteed to be correct)
return 0;
}
Why this works: the fetch_add operation executes atomically at the processor level, using special instructions (e.g., LOCK XADD on x86).
Memory Ordering: Fine-Tuning Synchronization
C++ provides six levels of memory ordering for atomic operations:
- memory_order_relaxed — minimal guarantees
- memory_order_consume — ordering of dependent loads (discouraged since C++17; compilers promote it to acquire)
- memory_order_acquire — barrier for read operations
- memory_order_release — barrier for write operations
- memory_order_acq_rel — combination of acquire and release
- memory_order_seq_cst — sequential consistency (default)
Relaxed Ordering: Maximum Performance
#include <atomic>
#include <thread>
std::atomic<int> x{0};
std::atomic<int> y{0};
void writeValues() {
x.store(1, std::memory_order_relaxed);
y.store(1, std::memory_order_relaxed);
}
void readValues() {
// ❌ ERROR: order of writes is not guaranteed!
// A thread may see y=1 but x=0
while (y.load(std::memory_order_relaxed) != 1) {
// Busy wait
}
// NOT guaranteed that x == 1 at this point!
int value_x = x.load(std::memory_order_relaxed);
}
When to use relaxed: only for independent operations, such as counters with no dependencies.
Release-Acquire: Synchronization Between Threads
#include <atomic>
#include <thread>
#include <cassert>
std::atomic<bool> ready{false};
int data = 0;
void producer() {
data = 42; // (1) Regular write
// Release guarantees all operations before it are visible to other threads
ready.store(true, std::memory_order_release); // (2)
}
void consumer() {
// Acquire guarantees all operations after release are visible
while (!ready.load(std::memory_order_acquire)) { // (3)
// Busy wait
}
// (4) Guaranteed: data == 42
assert(data == 42); // Never fires: the release/acquire pair guarantees visibility
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
How it works:
- memory_order_release in producer creates a barrier: no operation before the store can be reordered after it
- memory_order_acquire in consumer creates a barrier: no operation after the load can be reordered before it
- Together they establish a "happens-before" relationship between the threads
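The same happens-before relationship can be built with standalone fences (std::atomic_thread_fence) around relaxed atomic operations, which is occasionally useful when one fence should cover several atomics. A sketch of the producer/consumer above rewritten this way (names with the _f suffix are this snippet's own):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready_f{false};
int data_f = 0;

void producer_fence() {
    data_f = 42;                                          // (1) regular write
    std::atomic_thread_fence(std::memory_order_release);  // publishes (1)...
    ready_f.store(true, std::memory_order_relaxed);       // ...once this store is seen
}

void consumer_fence() {
    while (!ready_f.load(std::memory_order_relaxed)) { }  // busy wait
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    assert(data_f == 42);                                 // never fires
}

int run_fence_demo() {
    std::thread t1(producer_fence), t2(consumer_fence);
    t1.join(); t2.join();
    return data_f;
}
```

The release fence synchronizes with the acquire fence because the relaxed load observes the relaxed store that follows the release fence, giving the same guarantee as the release/acquire version above.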
Sequential Consistency: Strictest Guarantees
#include <atomic>
#include <thread>
std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};
void write_x() {
x.store(true, std::memory_order_seq_cst);
}
void write_y() {
y.store(true, std::memory_order_seq_cst);
}
void read_xy() {
while (!x.load(std::memory_order_seq_cst)) { }
if (y.load(std::memory_order_seq_cst)) {
++z;
}
}
void read_yx() {
while (!y.load(std::memory_order_seq_cst)) { }
if (x.load(std::memory_order_seq_cst)) {
++z;
}
}
int main() {
std::thread t1(write_x);
std::thread t2(write_y);
std::thread t3(read_xy);
std::thread t4(read_yx);
t1.join(); t2.join(); t3.join(); t4.join();
// z is guaranteed to be > 0
// With seq_cst there exists a global order of all operations
return 0;
}
When to use seq_cst: when maximum simplicity of reasoning about code is needed, or for debugging. It is typically the slowest mode, though the cost varies by architecture.
Lock-Free Stack: Practical Example
Let's consider a classic lock-free structure — a stack with push and pop operations.
#include <atomic>
#include <memory>
template<typename T>
class LockFreeStack {
private:
struct Node {
T data;
Node* next;
Node(const T& value) : data(value), next(nullptr) {}
};
std::atomic<Node*> head;
public:
LockFreeStack() : head(nullptr) {}
void push(const T& value) {
Node* newNode = new Node(value);
// Optimistic retry loop
newNode->next = head.load(std::memory_order_relaxed);
// CAS (Compare-And-Swap): atomically check and update
while (!head.compare_exchange_weak(
newNode->next, // Expected value
newNode, // New value
std::memory_order_release, // On success
std::memory_order_relaxed // On failure
)) {
// If CAS failed, newNode->next was automatically
// updated to current head value, retry
}
}
bool pop(T& result) {
Node* oldHead = head.load(std::memory_order_relaxed);
do {
if (oldHead == nullptr) {
return false; // Stack is empty
}
// Attempt to atomically move head to next element
} while (!head.compare_exchange_weak(
oldHead,
oldHead->next,
std::memory_order_acquire,
std::memory_order_relaxed
));
result = oldHead->data;
// ⚠️ PROBLEM: memory deallocation is unsafe!
// Another thread may still be reading oldHead
// delete oldHead; // Potential use-after-free error
return true;
}
~LockFreeStack() {
T dummy;
while (pop(dummy)) { }
}
};
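A usage sketch for the stack above: several threads push concurrently, then every element is popped back exactly once. The class is repeated here in compact form only so the snippet compiles on its own; popped nodes are intentionally leaked, for the reason discussed in the pop implementation.

```cpp
#include <atomic>
#include <thread>
#include <vector>

template<typename T>
class LockFreeStack {
    struct Node { T data; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(const T& v) {
        Node* n = new Node{v, head.load(std::memory_order_relaxed)};
        // On failure, CAS reloads n->next with the current head and we retry
        while (!head.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) { }
    }
    bool pop(T& out) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            if (!old) return false;           // stack is empty
        } while (!head.compare_exchange_weak(old, old->next,
                     std::memory_order_acquire, std::memory_order_relaxed));
        out = old->data;
        return true;                          // node leaked: safe reclamation is the hard part
    }
};

int stack_demo() {
    LockFreeStack<int> s;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&s, t] {
            for (int i = 0; i < 1000; ++i) s.push(t * 1000 + i);
        });
    for (auto& th : ts) th.join();
    int v, count = 0;
    while (s.pop(v)) ++count;
    return count;  // 4 threads x 1000 pushes = 4000 elements come back
}
```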
Typical Errors and Best Practices
❌ Error 1: ABA Problem
// Thread 1 reads head = A
Node* oldHead = head.load(); // oldHead = A
// Thread 2: pop(A), pop(B), push(A) - now head = A again
// But this is a DIFFERENT object A!
// Thread 1: CAS succeeds, although head changed!
head.compare_exchange_weak(oldHead, oldHead->next); // ⚠️ ABA problem
✅ Solution: use tagged pointers or hazard pointers for memory management.
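One way to sketch the tagged-pointer idea is to address nodes by small indices and pack the index together with a generation tag into a single 64-bit atomic word: a CAS then fails whenever the tag has moved on, even if the same index has been pushed back. All names and the packing layout here are illustrative:

```cpp
#include <atomic>
#include <cstdint>

struct TaggedHead {
    // Low 32 bits: node index; high 32 bits: generation tag
    static constexpr uint64_t pack(uint32_t idx, uint32_t tag) {
        return (uint64_t(tag) << 32) | idx;
    }
    static constexpr uint32_t index(uint64_t v) { return uint32_t(v); }
    static constexpr uint32_t tag(uint64_t v)   { return uint32_t(v >> 32); }
};

std::atomic<uint64_t> tagged_head{TaggedHead::pack(7, 0)};

// CAS that bumps the generation tag on every successful update,
// so an A -> B -> A sequence of indices can no longer fool it
bool cas_bumping_tag(uint32_t expected_idx, uint32_t new_idx) {
    uint64_t old = tagged_head.load(std::memory_order_relaxed);
    if (TaggedHead::index(old) != expected_idx) return false;
    uint64_t desired = TaggedHead::pack(new_idx, TaggedHead::tag(old) + 1);
    return tagged_head.compare_exchange_strong(old, desired,
               std::memory_order_acq_rel, std::memory_order_relaxed);
}
```

With raw pointers the same trick needs either a double-width CAS (e.g., CMPXCHG16B on x86-64) or spare pointer bits, which is why index-based packing is a common compromise.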
❌ Error 2: Incorrect memory order
// ❌ WRONG: using relaxed for CAS without synchronization
void push_wrong(const T& value) {
Node* newNode = new Node(value);
newNode->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
newNode->next,
newNode,
std::memory_order_relaxed, // ❌ Dangerous!
std::memory_order_relaxed
)) { }
}
// ✅ CORRECT: release on success for synchronization
void push_correct(const T& value) {
Node* newNode = new Node(value);
newNode->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
newNode->next,
newNode,
std::memory_order_release, // ✅ Synchronizes with other threads
std::memory_order_relaxed
)) { }
}
❌ Error 3: Race condition during memory deallocation
bool pop_unsafe(T& result) {
Node* oldHead = head.load();
while (oldHead && !head.compare_exchange_weak(oldHead, oldHead->next)) { }
if (oldHead) {
result = oldHead->data;
delete oldHead; // ❌ Another thread may still be reading oldHead!
return true;
}
return false;
}
✅ Solution: use safe memory management techniques:
- Reference counting with std::shared_ptr
- Hazard pointers
- Epoch-based reclamation
- Deferred deletion (RCU)
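The simplest of these techniques, deferred deletion, can be sketched as a global retire list: pop() hands the node to the list instead of deleting it, and the list is drained only at a point where no thread can still hold a reference (here, after workers have joined). Hazard pointers and epochs automate the detection of that safe point; the names below are illustrative:

```cpp
#include <atomic>

struct RetiredNode { int payload; RetiredNode* next; };

std::atomic<RetiredNode*> retire_list{nullptr};

// Instead of `delete node`, push it onto the retire list (lock-free)
void retire(RetiredNode* n) {
    n->next = retire_list.load(std::memory_order_relaxed);
    while (!retire_list.compare_exchange_weak(n->next, n,
               std::memory_order_release, std::memory_order_relaxed)) { }
}

// Call only at a quiescent point, when no other thread
// can still be reading any retired node
int drain_retired() {
    RetiredNode* n = retire_list.exchange(nullptr, std::memory_order_acquire);
    int freed = 0;
    while (n) {
        RetiredNode* next = n->next;
        delete n;
        n = next;
        ++freed;
    }
    return freed;
}
```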
Lock-Free Queue: Producer-Consumer Pattern
SPSC (Single-Producer Single-Consumer) queue is the optimal case for lock-free algorithms.
#include <atomic>
#include <array>
template<typename T, size_t Size>
class SPSCQueue {
private:
std::array<T, Size> buffer;
std::atomic<size_t> writePos{0};
std::atomic<size_t> readPos{0};
public:
bool push(const T& value) {
size_t currentWrite = writePos.load(std::memory_order_relaxed);
size_t nextWrite = (currentWrite + 1) % Size;
// Check if queue is full
if (nextWrite == readPos.load(std::memory_order_acquire)) {
return false; // Queue is full
}
buffer[currentWrite] = value;
// Release guarantees that write to buffer is visible to consumer
writePos.store(nextWrite, std::memory_order_release);
return true;
}
bool pop(T& result) {
size_t currentRead = readPos.load(std::memory_order_relaxed);
// Check if queue is empty
if (currentRead == writePos.load(std::memory_order_acquire)) {
return false; // Queue is empty
}
result = buffer[currentRead];
// Release synchronizes with producer
readPos.store((currentRead + 1) % Size, std::memory_order_release);
return true;
}
};
Why this is efficient:
- memory_order_relaxed for a thread's own position (minimal overhead)
- memory_order_acquire when reading the other thread's position
- memory_order_release when publishing changes
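A usage sketch for the queue above, with one producer and one consumer thread; the class is repeated in compact form only so the snippet compiles on its own. The consumer-side sum checks that every pushed value arrives exactly once:

```cpp
#include <atomic>
#include <array>
#include <thread>

template<typename T, size_t Size>
class SPSCQueue {
    std::array<T, Size> buffer;
    std::atomic<size_t> writePos{0}, readPos{0};
public:
    bool push(const T& v) {
        size_t w = writePos.load(std::memory_order_relaxed);
        size_t n = (w + 1) % Size;
        if (n == readPos.load(std::memory_order_acquire)) return false;  // full
        buffer[w] = v;
        writePos.store(n, std::memory_order_release);  // publish to consumer
        return true;
    }
    bool pop(T& out) {
        size_t r = readPos.load(std::memory_order_relaxed);
        if (r == writePos.load(std::memory_order_acquire)) return false;  // empty
        out = buffer[r];
        readPos.store((r + 1) % Size, std::memory_order_release);  // free the slot
        return true;
    }
};

long long spsc_demo() {
    SPSCQueue<int, 128> q;
    long long sum = 0;
    std::thread producer([&q] {
        for (int i = 1; i <= 10000; ++i)
            while (!q.push(i)) { }  // spin while the queue is full
    });
    std::thread consumer([&q, &sum] {
        int v;
        for (int received = 0; received < 10000; ) {
            if (q.pop(v)) { sum += v; ++received; }
        }
    });
    producer.join(); consumer.join();
    return sum;  // 1 + 2 + ... + 10000 = 50005000
}
```

Note that the capacity-128 buffer is much smaller than the 10000 items transferred: the producer simply spins whenever the ring is momentarily full.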
Performance: When is Lock-Free Justified?
#include <atomic>
#include <mutex>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <vector>
// Version with mutex
class CounterWithMutex {
int value = 0;
std::mutex mtx;
public:
void increment() {
std::lock_guard<std::mutex> lock(mtx);
++value;
}
int get() {
std::lock_guard<std::mutex> lock(mtx);
return value;
}
};
// Lock-free version
class CounterLockFree {
std::atomic<int> value{0};
public:
void increment() {
value.fetch_add(1, std::memory_order_relaxed);
}
int get() {
return value.load(std::memory_order_relaxed);
}
};
// Benchmark
template<typename Counter>
void benchmark(const std::string& name) {
Counter counter;
auto start = std::chrono::high_resolution_clock::now();
std::vector<std::thread> threads;
for (int i = 0; i < 4; ++i) {
threads.emplace_back([&counter]() {
for (int j = 0; j < 1000000; ++j) {
counter.increment();
}
});
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << name << ": " << duration.count() << " ms" << std::endl;
}
int main() {
benchmark<CounterWithMutex>("Mutex"); // ~150-300 ms
benchmark<CounterLockFree>("Lock-free"); // ~50-100 ms
return 0;
}
Results show: lock-free code can be 2-3 times faster under high contention.
Practical Recommendations
1. Start with Profiling
// ✅ Correct approach
// 1. Implement with mutexes
// 2. Profile (perf, Valgrind, Intel VTune)
// 3. Optimize hot spots with lock-free techniques
2. Use Existing Libraries
#include <boost/lockfree/queue.hpp>
// Instead of writing your own queue
boost::lockfree::queue<int> queue(128);
// Producer
queue.push(42);
// Consumer
int value;
if (queue.pop(value)) {
// Process value
}
3. Document Memory Ordering
// ✅ Good: explicit comments about synchronization
void publish_data() {
compute_data(); // (1)
ready.store(true, std::memory_order_release); // (2)
// Release: guarantees visibility of (1) after acquire in reader
}
void consume_data() {
while (!ready.load(std::memory_order_acquire)) { } // (3)
// Acquire: synchronizes with (2), sees results of (1)
process_data(); // (4)
}
4. Test with Thread Sanitizer
# Compilation with ThreadSanitizer
g++ -fsanitize=thread -g program.cpp -o program
# Run
./program
ThreadSanitizer detects:
- Data races
- Use of uninitialized atomic variables
- Incorrect use of memory order
Conclusion
Lock-free programming in C++ is a powerful tool for creating high-performance multithreaded applications. Key takeaways:
- Atomic operations (std::atomic) are the foundation of lock-free code
- Memory ordering allows precise control of synchronization and performance
- Compare-And-Swap (CAS) is the primary operation for lock-free data structures
- Common errors: ABA problem, incorrect memory order, unsafe memory management
- Best practice: profile before optimizing, use proven libraries, document carefully
Lock-free code is complex, but when applied correctly, it delivers significant performance gains in high-load systems.
Useful Resources:
- "C++ Concurrency in Action" (Anthony Williams)
- cppreference.com: std::atomic, std::memory_order
- Herb Sutter: "atomic<> Weapons" (CppCon talks)
- Boost.Lockfree documentation