Table of Contents
- Introduction
- Why Locks Slow Us Down - and How Atomics Help
- From Locks to Atomic Operations
- Performance Improvements
- Key Takeaways
- Summary
Introduction
In modern multi-core systems, tiny synchronization decisions can make or break your performance - sometimes a single lock stands between you and real scalability.
Have you ever noticed, while optimizing a multithreaded algorithm, that the locks you put in place actually slow everything down?
This is a familiar scenario in parallel systems: locks protect shared variables and prevent synchronization issues, but under high load, they can become the main bottleneck. Threads queue up, CPUs sit idle - and performance suffers.
This is exactly what happened to me in the PS-MWM project.
We built a real-time streaming algorithm for Weighted Matching with the following goals:
- Handle massive amounts of data in real time
- Maintain low memory usage
- Fully utilize multi-core CPUs
Everything worked perfectly - until we discovered that the synchronization mechanism using locks was actually creating the bottleneck:
- Threads were waiting on each other
- CPUs were underutilized
- The algorithm’s performance degraded
Why Locks Slow Us Down - and How Atomics Help
Once we identified the problem, we wanted to understand why.
After all, a lock seems simple: acquire, release, continue. Simple, right?
Well… not quite.
🔒 Lock Occupied - What Happens Now?
Imagine a thread reaching a lock that is already held.
It doesn’t just wait quietly - the OS moves it into a blocked state.
The kernel kicks in: a context switch occurs, the waiting thread is parked, and the CPU moves on to another thread - all of it heavy and expensive.
Think of it as stopping a car on a highway to switch drivers. How many times can this happen per second before the road jams?
🔁 The Queue Starts to Grow
But that’s not all: more threads try to acquire the same lock simultaneously. Some spin, some block, some retry.
This contention slows everything down. Every small delay accumulates, queues grow, and the bottleneck forms.
This is contention in a nutshell: everyone’s waiting, but only one thread can enter the critical section at a time.
🏋️ Accumulated Overhead - System Collapse
As threads wait one after another, the OS has to wake each thread, return them to a runnable state, and manage all the queues.
Under high load, all these actions add up dramatically, and performance suffers.
Example: Simple Mutex in C++
#include <iostream>
#include <mutex>
#include <thread>
std::mutex mtx; // Mutex protecting the critical section

// Function executed by each thread
void doWork(int id) {
    std::lock_guard<std::mutex> guard(mtx); // Acquire; released automatically on scope exit
    std::cout << "Thread #" << id << " entered the critical section\n";
    // Critical section work (quick, just for demonstration)
    std::cout << "Thread #" << id << " leaving the critical section\n";
}

int main() {
    // Launch two threads
    std::thread t1(doWork, 1);
    std::thread t2(doWork, 2);

    // Wait for both threads to finish
    t1.join();
    t2.join();
    return 0;
}
Conclusion: Locks do protect shared variables, but under heavy load, they can become the main performance limiter.
From Locks to Atomic Operations
At this point, we did what every systems developer does when they smell a bottleneck: we opened Linux perf, set up counters, and measured.
The threads weren’t busy processing data - they were busy waiting on locks.
The solution became clear:
If the update to a shared variable is simple and doesn’t require complex read-modify-write operations, there’s no reason to pay the overhead of a full lock.
Not every operation needs a heavy lock.
This inspired us to explore an alternative - atomic operations - a solution that allows threads to progress without waiting for each other.
For more details on how atomic variables work in C++, see Understanding std::atomic.
Their secret? Small, lightweight updates happen at the hardware level - without kernel entry, without unnecessary thread contention, and without extra context switches.
For more in-depth discussion on the trade-offs between mutexes and atomics, see CoffeeBeforeArch: Atomic vs Mutex or Stack Overflow discussion.
How Atomics Work
- Atomic instructions like XCHG, CMPXCHG, or LOCK ADD update a variable in a single, indivisible operation
- Nanosecond execution: threads aren’t blocked; each operation completes almost instantly
- On modern CPUs, atomic operations are usually performed at the cache-line level, ensuring that no other thread or core can modify the variable mid-operation. This also makes the operation very fast, since there is no need to lock the entire bus; full bus locking is mostly a relic of older processors or special cases
- Natural concurrency: multiple threads can perform different atomic operations in parallel while maintaining memory consistency
- Memory ordering can be controlled to keep threads seeing consistent information without slowing the system
Example: Atomic Variable in C++
This example demonstrates how a shared variable can be safely updated without using a mutex, thanks to std::atomic.
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<int> counter(0); // Atomic variable, no mutex needed

// Function executed by each thread
void increment(int id) {
    counter++; // Atomic read-modify-write increment
    std::cout << "Thread #" << id
              << " incremented counter to " << counter.load() << "\n";
}

int main() {
    // Launch two threads
    std::thread t1(increment, 1);
    std::thread t2(increment, 2);
    t1.join();
    t2.join();
    std::cout << "Final counter value: " << counter.load() << "\n";
    return 0;
}
Why this works:
- std::atomic ensures that counter++ is executed atomically - no thread can interfere mid-operation.
- No mutex is needed, so threads don’t block each other.
- This is perfect for simple shared variables like counters or flags - the same job as in our mutex example, but more efficient.
Optional: Examples of Different Atomic Operations
Short demo of atomic instructions in C++:
#include <atomic>
#include <iostream>
void atomicExamples() {
    std::atomic<int> a(0);
    a.fetch_add(1);  // Atomic add
    a.fetch_sub(1);  // Atomic subtract
    a.exchange(42);  // Atomic swap: a becomes 42, the old value is returned

    int expected = 42;
    // Compare-and-swap: set a to 100 only if it still equals `expected`
    a.compare_exchange_strong(expected, 100);
}
Notes:
- Each operation is atomic - cannot be interrupted by other threads.
- fetch_add, fetch_sub, exchange, and compare_exchange_strong are simple read-modify-write operations.
- Ideal for counters, flags, and small shared variables, allowing safe updates without locks.
For more details on atomic operations in C++, see this guide.
Performance Improvements
After moving to atomics:
- System throughput increased by ~30–40% under high contention scenarios, and as we added more threads, the improvement reached 50% or more.
- CPUs were fully utilized instead of sitting idle.
- Code became simpler - fewer lock scopes, less chance of deadlocks.
Here’s a quick comparison of execution time from our benchmark (lower is better) at different thread counts:
- 2 threads: Mutex ~7.28, Atomics ~5.12
- 4 threads: Mutex ~6.55, Atomics ~3.20
Of course, atomics aren’t a magic bullet. For complex structures, locks or other synchronization mechanisms are still needed. But for counters, flags, and small variable states, they’re revolutionary.
For more on when to use atomics vs locks, see this discussion.
Key Takeaways
- Measure first: don’t assume locks are the bottleneck without profiling.
- Start small: identify critical variable sections before changing everything.
- Understand memory ordering: atomics are powerful, but small mistakes can cause bugs.
- Combine wisely: locks and atomics can coexist. Use each where appropriate.
- Test under load: real multithreading issues appear mainly under heavy stress.
Summary
Switching from locks to atomics transformed our streaming algorithm:
higher throughput, lower latency, and full CPU utilization.
In high-performance systems, every nanosecond matters - and atomics let you reclaim them.
If your multithreaded code still uses locks for simple updates,
try replacing them with atomics and watch your performance scale.