Multithreaded programming in C++ has traditionally been associated with mutexes, condition variables, and potential issues like deadlocks and race conditions. However, modern C++ standards (starting with C++11 and beyond) provide tools for writing high-performance multithreaded code without classical locking mechanisms. In this article, we'll explore advanced techniques: lock-free programming, atomic operations, and various memory ordering models.
Why Do We Need Lock-Free Code?
Lock-free data structures allow multiple threads to work with shared data without using mutexes. Key advantages include:
- Scalability: the absence of locks means no contention for lock acquisition
- No deadlocks: no locks — no mutual blocking
- Guaranteed progress: at least one thread always makes forward progress
However, lock-free code is more complex to design and debug. Apply it only after profiling and identifying performance bottlenecks.
Atomic Operations and std::atomic
The foundation of lock-free programming is atomic operations. C++11 introduced std::atomic<T>, which provides:
- Atomicity of read and write operations
- Control over reordering of operations by the compiler and processor, selected per operation via std::memory_order
Basic Example with Atomic Counter
#include <atomic>
#include <thread>
#include <iostream>
std::atomic<int> counter{0};
void incrementCounter() {
for (int i = 0; i < 100000; ++i) {
// Atomic increment without using mutexes
counter.fetch_add(1, std::memory_order_relaxed);
}
}
int main() {
std::thread t1(incrementCounter);
std::thread t2(incrementCounter);
t1.join();
t2.join();
std::cout << "Final value: " << counter.load() << std::endl;
// Output: 200000 (guaranteed to be correct)
return 0;
}
Why this works: the fetch_add operation executes atomically at the processor level, using special instructions (e.g., LOCK XADD on x86).
Memory Ordering: Fine-Tuning Synchronization
C++ provides six levels of memory ordering for atomic operations:
- memory_order_relaxed — minimal guarantees
- memory_order_consume — ordering of dependent loads (discouraged since C++17; compilers promote it to acquire)
- memory_order_acquire — barrier for read operations
- memory_order_release — barrier for write operations
- memory_order_acq_rel — combination of acquire and release
- memory_order_seq_cst — sequential consistency (default)
Relaxed Ordering: Maximum Performance
#include <atomic>
#include <thread>
std::atomic<int> x{0};
std::atomic<int> y{0};
void writeValues() {
x.store(1, std::memory_order_relaxed);
y.store(1, std::memory_order_relaxed);
}
void readValues() {
// ❌ ERROR: order of writes is not guaranteed!
// A thread may see y=1 but x=0
while (y.load(std::memory_order_relaxed) != 1) {
// Busy wait
}
// NOT guaranteed that x == 1 at this point!
int value_x = x.load(std::memory_order_relaxed);
}
When to use relaxed: only for independent operations, such as counters with no dependencies.
Release-Acquire: Synchronization Between Threads
#include <atomic>
#include <thread>
#include <cassert>
std::atomic<bool> ready{false};
int data = 0;
void producer() {
data = 42; // (1) Regular write
// Release guarantees all operations before it are visible to other threads
ready.store(true, std::memory_order_release); // (2)
}
void consumer() {
// Acquire guarantees all operations after release are visible
while (!ready.load(std::memory_order_acquire)) { // (3)
// Busy wait
}
// (4) Guaranteed: data == 42
assert(data == 42); // Never fires: the release/acquire pair guarantees visibility
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
How it works:
- memory_order_release in producer creates a barrier: no operation before the store can be reordered after it
- memory_order_acquire in consumer creates a barrier: no operation after the load can be reordered before it
- Together they establish a "happens-before" relationship between the threads
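The same happens-before relationship can be built with standalone fences (std::atomic_thread_fence) around relaxed atomic operations, which is occasionally useful when one fence should cover several atomics. A sketch of the producer/consumer above rewritten this way (names with the _f suffix are this snippet's own):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready_f{false};
int data_f = 0;

void producer_fence() {
    data_f = 42;                                          // (1) regular write
    std::atomic_thread_fence(std::memory_order_release);  // publishes (1)...
    ready_f.store(true, std::memory_order_relaxed);       // ...once this store is seen
}

void consumer_fence() {
    while (!ready_f.load(std::memory_order_relaxed)) { }  // busy wait
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    assert(data_f == 42);                                 // never fires
}

int run_fence_demo() {
    std::thread t1(producer_fence), t2(consumer_fence);
    t1.join(); t2.join();
    return data_f;
}
```

The release fence synchronizes with the acquire fence because the relaxed load observes the relaxed store that follows the release fence, giving the same guarantee as the release/acquire version above.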
Sequential Consistency: Strictest Guarantees
#include <atomic>
#include <thread>
std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};
void write_x() {
x.store(true, std::memory_order_seq_cst);
}
void write_y() {
y.store(true, std::memory_order_seq_cst);
}
void read_xy() {
while (!x.load(std::memory_order_seq_cst)) { }
if (y.load(std::memory_order_seq_cst)) {
++z;
}
}
void read_yx() {
while (!y.load(std::memory_order_seq_cst)) { }
if (x.load(std::memory_order_seq_cst)) {
++z;
}
}
int main() {
std::thread t1(write_x);
std::thread t2(write_y);
std::thread t3(read_xy);
std::thread t4(read_yx);
t1.join(); t2.join(); t3.join(); t4.join();
// z is guaranteed to be > 0
// With seq_cst there exists a global order of all operations
return 0;
}
When to use seq_cst: when maximum simplicity of reasoning about code is needed, or for debugging. It is typically the slowest mode, though the cost varies by architecture.
Lock-Free Stack: Practical Example
Let's consider a classic lock-free structure — a stack with push and pop operations.
#include <atomic>
#include <memory>
template<typename T>
class LockFreeStack {
private:
struct Node {
T data;
Node* next;
Node(const T& value) : data(value), next(nullptr) {}
};
std::atomic<Node*> head;
public:
LockFreeStack() : head(nullptr) {}
void push(const T& value) {
Node* newNode = new Node(value);
// Optimistic retry loop
newNode->next = head.load(std::memory_order_relaxed);
// CAS (Compare-And-Swap): atomically check and update
while (!head.compare_exchange_weak(
newNode->next, // Expected value
newNode, // New value
std::memory_order_release, // On success
std::memory_order_relaxed // On failure
)) {
// If CAS failed, newNode->next was automatically
// updated to current head value, retry
}
}
bool pop(T& result) {
Node* oldHead = head.load(std::memory_order_relaxed);
do {
if (oldHead == nullptr) {
return false; // Stack is empty
}
// Attempt to atomically move head to next element
} while (!head.compare_exchange_weak(
oldHead,
oldHead->next,
std::memory_order_acquire,
std::memory_order_relaxed
));
result = oldHead->data;
// ⚠️ PROBLEM: memory deallocation is unsafe!
// Another thread may still be reading oldHead
// delete oldHead; // Potential use-after-free error
return true;
}
~LockFreeStack() {
T dummy;
while (pop(dummy)) { }
}
};
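A usage sketch for the stack above: several threads push concurrently, then every element is popped back exactly once. The class is repeated here in compact form only so the snippet compiles on its own; popped nodes are intentionally leaked, for the reason discussed in the pop implementation.

```cpp
#include <atomic>
#include <thread>
#include <vector>

template<typename T>
class LockFreeStack {
    struct Node { T data; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(const T& v) {
        Node* n = new Node{v, head.load(std::memory_order_relaxed)};
        // On failure, CAS reloads n->next with the current head and we retry
        while (!head.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) { }
    }
    bool pop(T& out) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            if (!old) return false;           // stack is empty
        } while (!head.compare_exchange_weak(old, old->next,
                     std::memory_order_acquire, std::memory_order_relaxed));
        out = old->data;
        return true;                          // node leaked: safe reclamation is the hard part
    }
};

int stack_demo() {
    LockFreeStack<int> s;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&s, t] {
            for (int i = 0; i < 1000; ++i) s.push(t * 1000 + i);
        });
    for (auto& th : ts) th.join();
    int v, count = 0;
    while (s.pop(v)) ++count;
    return count;  // 4 threads x 1000 pushes = 4000 elements come back
}
```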
Typical Errors and Best Practices
❌ Error 1: ABA Problem
// Thread 1 reads head = A
Node* oldHead = head.load(); // oldHead = A
// Thread 2: pop(A), pop(B), push(A) - now head = A again
// But this is a DIFFERENT object A!
// Thread 1: CAS succeeds, although head changed!
head.compare_exchange_weak(oldHead, oldHead->next); // ⚠️ ABA problem
✅ Solution: use tagged pointers or hazard pointers for memory management.
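One way to sketch the tagged-pointer idea is to address nodes by small indices and pack the index together with a generation tag into a single 64-bit atomic word: a CAS then fails whenever the tag has moved on, even if the same index has been pushed back. All names and the packing layout here are illustrative:

```cpp
#include <atomic>
#include <cstdint>

struct TaggedHead {
    // Low 32 bits: node index; high 32 bits: generation tag
    static constexpr uint64_t pack(uint32_t idx, uint32_t tag) {
        return (uint64_t(tag) << 32) | idx;
    }
    static constexpr uint32_t index(uint64_t v) { return uint32_t(v); }
    static constexpr uint32_t tag(uint64_t v)   { return uint32_t(v >> 32); }
};

std::atomic<uint64_t> tagged_head{TaggedHead::pack(7, 0)};

// CAS that bumps the generation tag on every successful update,
// so an A -> B -> A sequence of indices can no longer fool it
bool cas_bumping_tag(uint32_t expected_idx, uint32_t new_idx) {
    uint64_t old = tagged_head.load(std::memory_order_relaxed);
    if (TaggedHead::index(old) != expected_idx) return false;
    uint64_t desired = TaggedHead::pack(new_idx, TaggedHead::tag(old) + 1);
    return tagged_head.compare_exchange_strong(old, desired,
               std::memory_order_acq_rel, std::memory_order_relaxed);
}
```

With raw pointers the same trick needs either a double-width CAS (e.g., CMPXCHG16B on x86-64) or spare pointer bits, which is why index-based packing is a common compromise.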
❌ Error 2: Incorrect memory order
// ❌ WRONG: using relaxed for CAS without synchronization
void push_wrong(const T& value) {
Node* newNode = new Node(value);
newNode->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
newNode->next,
newNode,
std::memory_order_relaxed, // ❌ Dangerous!
std::memory_order_relaxed
)) { }
}
// ✅ CORRECT: release on success for synchronization
void push_correct(const T& value) {
Node* newNode = new Node(value);
newNode->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(
newNode->next,
newNode,
std::memory_order_release, // ✅ Synchronizes with other threads
std::memory_order_relaxed
)) { }
}
❌ Error 3: Race condition during memory deallocation
bool pop_unsafe(T& result) {
Node* oldHead = head.load();
while (oldHead && !head.compare_exchange_weak(oldHead, oldHead->next)) { }
if (oldHead) {
result = oldHead->data;
delete oldHead; // ❌ Another thread may still be reading oldHead!
return true;
}
return false;
}
✅ Solution: use safe memory management techniques:
- Reference counting with std::shared_ptr
- Hazard pointers
- Epoch-based reclamation
- Deferred deletion (RCU)
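The simplest of these techniques, deferred deletion, can be sketched as a global retire list: pop() hands the node to the list instead of deleting it, and the list is drained only at a point where no thread can still hold a reference (here, after workers have joined). Hazard pointers and epochs automate the detection of that safe point; the names below are illustrative:

```cpp
#include <atomic>

struct RetiredNode { int payload; RetiredNode* next; };

std::atomic<RetiredNode*> retire_list{nullptr};

// Instead of `delete node`, push it onto the retire list (lock-free)
void retire(RetiredNode* n) {
    n->next = retire_list.load(std::memory_order_relaxed);
    while (!retire_list.compare_exchange_weak(n->next, n,
               std::memory_order_release, std::memory_order_relaxed)) { }
}

// Call only at a quiescent point, when no other thread
// can still be reading any retired node
int drain_retired() {
    RetiredNode* n = retire_list.exchange(nullptr, std::memory_order_acquire);
    int freed = 0;
    while (n) {
        RetiredNode* next = n->next;
        delete n;
        n = next;
        ++freed;
    }
    return freed;
}
```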
Lock-Free Queue: Producer-Consumer Pattern
SPSC (Single-Producer Single-Consumer) queue is the optimal case for lock-free algorithms.
#include <atomic>
#include <array>
template<typename T, size_t Size>
class SPSCQueue {
private:
std::array<T, Size> buffer;
std::atomic<size_t> writePos{0};
std::atomic<size_t> readPos{0};
public:
bool push(const T& value) {
size_t currentWrite = writePos.load(std::memory_order_relaxed);
size_t nextWrite = (currentWrite + 1) % Size;
// Check if queue is full
if (nextWrite == readPos.load(std::memory_order_acquire)) {
return false; // Queue is full
}
buffer[currentWrite] = value;
// Release guarantees that write to buffer is visible to consumer
writePos.store(nextWrite, std::memory_order_release);
return true;
}
bool pop(T& result) {
size_t currentRead = readPos.load(std::memory_order_relaxed);
// Check if queue is empty
if (currentRead == writePos.load(std::memory_order_acquire)) {
return false; // Queue is empty
}
result = buffer[currentRead];
// Release synchronizes with producer
readPos.store((currentRead + 1) % Size, std::memory_order_release);
return true;
}
};
Why this is efficient:
- memory_order_relaxed for a thread's own position (minimal overhead)
- memory_order_acquire when reading the other thread's position
- memory_order_release when publishing changes
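A usage sketch for the queue above, with one producer and one consumer thread; the class is repeated in compact form only so the snippet compiles on its own. The consumer-side sum checks that every pushed value arrives exactly once:

```cpp
#include <atomic>
#include <array>
#include <thread>

template<typename T, size_t Size>
class SPSCQueue {
    std::array<T, Size> buffer;
    std::atomic<size_t> writePos{0}, readPos{0};
public:
    bool push(const T& v) {
        size_t w = writePos.load(std::memory_order_relaxed);
        size_t n = (w + 1) % Size;
        if (n == readPos.load(std::memory_order_acquire)) return false;  // full
        buffer[w] = v;
        writePos.store(n, std::memory_order_release);  // publish to consumer
        return true;
    }
    bool pop(T& out) {
        size_t r = readPos.load(std::memory_order_relaxed);
        if (r == writePos.load(std::memory_order_acquire)) return false;  // empty
        out = buffer[r];
        readPos.store((r + 1) % Size, std::memory_order_release);  // free the slot
        return true;
    }
};

long long spsc_demo() {
    SPSCQueue<int, 128> q;
    long long sum = 0;
    std::thread producer([&q] {
        for (int i = 1; i <= 10000; ++i)
            while (!q.push(i)) { }  // spin while the queue is full
    });
    std::thread consumer([&q, &sum] {
        int v;
        for (int received = 0; received < 10000; ) {
            if (q.pop(v)) { sum += v; ++received; }
        }
    });
    producer.join(); consumer.join();
    return sum;  // 1 + 2 + ... + 10000 = 50005000
}
```

Note that the capacity-128 buffer is much smaller than the 10000 items transferred: the producer simply spins whenever the ring is momentarily full.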
Performance: When is Lock-Free Justified?
#include <atomic>
#include <mutex>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <vector>
// Version with mutex
class CounterWithMutex {
int value = 0;
std::mutex mtx;
public:
void increment() {
std::lock_guard<std::mutex> lock(mtx);
++value;
}
int get() {
std::lock_guard<std::mutex> lock(mtx);
return value;
}
};
// Lock-free version
class CounterLockFree {
std::atomic<int> value{0};
public:
void increment() {
value.fetch_add(1, std::memory_order_relaxed);
}
int get() {
return value.load(std::memory_order_relaxed);
}
};
// Benchmark
template<typename Counter>
void benchmark(const std::string& name) {
Counter counter;
auto start = std::chrono::high_resolution_clock::now();
std::vector<std::thread> threads;
for (int i = 0; i < 4; ++i) {
threads.emplace_back([&counter]() {
for (int j = 0; j < 1000000; ++j) {
counter.increment();
}
});
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << name << ": " << duration.count() << " ms" << std::endl;
}
int main() {
benchmark<CounterWithMutex>("Mutex"); // ~150-300 ms
benchmark<CounterLockFree>("Lock-free"); // ~50-100 ms
return 0;
}
Results show: lock-free code can be 2-3 times faster under high contention.
Practical Recommendations
1. Start with Profiling
// ✅ Correct approach
// 1. Implement with mutexes
// 2. Profile (perf, Valgrind, Intel VTune)
// 3. Optimize hot spots with lock-free techniques
2. Use Existing Libraries
#include <boost/lockfree/queue.hpp>
// Instead of writing your own queue
boost::lockfree::queue<int> queue(128);
// Producer
queue.push(42);
// Consumer
int value;
if (queue.pop(value)) {
// Process value
}
3. Document Memory Ordering
// ✅ Good: explicit comments about synchronization
void publish_data() {
compute_data(); // (1)
ready.store(true, std::memory_order_release); // (2)
// Release: guarantees visibility of (1) after acquire in reader
}
void consume_data() {
while (!ready.load(std::memory_order_acquire)) { } // (3)
// Acquire: synchronizes with (2), sees results of (1)
process_data(); // (4)
}
4. Test with Thread Sanitizer
# Compilation with ThreadSanitizer
g++ -fsanitize=thread -g program.cpp -o program
# Run
./program
ThreadSanitizer detects:
- Data races
- Use of uninitialized atomic variables
- Incorrect use of memory order
Conclusion
Lock-free programming in C++ is a powerful tool for creating high-performance multithreaded applications. Key takeaways:
- Atomic operations (std::atomic) are the foundation of lock-free code
- Memory ordering allows precise control of synchronization and performance
- Compare-And-Swap (CAS) is the primary operation for lock-free data structures
- Common errors: ABA problem, incorrect memory order, unsafe memory management
- Best practice: profile before optimizing, use proven libraries, document carefully
Lock-free code is complex, but when applied correctly, it delivers significant performance gains in high-load systems.
Useful Resources:
- "C++ Concurrency in Action" (Anthony Williams)
- cppreference.com: std::atomic, std::memory_order
- Herb Sutter: "atomic<> Weapons" (CppCon talks)
- Boost.Lockfree documentation