Tamar E

Breaking the Lock: Boosting Multithreaded Performance with Atomics

Introduction

In modern multi-core systems, tiny synchronization decisions can make or break your performance - sometimes a single lock stands between you and real scalability.

Have you ever noticed, while optimizing a multithreaded algorithm, that the locks you put in place actually slow everything down?

This is a familiar scenario in parallel systems: locks protect shared variables and prevent synchronization issues, but under high load, they can become the main bottleneck. Threads queue up, CPUs sit idle - and performance suffers.

This is exactly what happened to me in the PS-MWM project.

We built a real-time streaming algorithm for Weighted Matching, with the following goals:

  • Handle massive amounts of data in real time
  • Maintain low memory usage
  • Fully utilize multi-core CPUs

Everything worked perfectly - until we discovered that the synchronization mechanism using locks was actually creating the bottleneck:

  • Threads were waiting on each other
  • CPUs were underutilized
  • The algorithm’s performance degraded

Why Locks Slow Us Down - and How Atomics Help

Once we identified the problem, we wanted to understand why.
After all, a lock seems straightforward: acquire, release, continue. Simple, right?
Well… not quite.

🔒 Lock Occupied - What Happens Now?

Imagine a thread reaching a lock that is already held.

It doesn’t just wait quietly - the OS puts it into a blocked state.
Then the machinery kicks in: a context switch occurs, the current thread is suspended, and the CPU moves on to another thread - all heavy and expensive.

Think of it as stopping a car on a highway to switch drivers. How many times can this happen per second before the road jams?
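
To see this in action, here's a tiny sketch (illustrative, not from the PS-MWM code): the main thread holds the lock for 100 ms, and the second thread is parked by the OS for roughly that long before it can continue.

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;

int main() {
    mtx.lock(); // main holds the lock

    std::thread t([] {
        auto start = std::chrono::steady_clock::now();
        mtx.lock(); // blocks here - the OS parks this thread until the lock is free
        auto waited = std::chrono::steady_clock::now() - start;
        mtx.unlock();
        std::cout << "Waited "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(waited).count()
                  << " ms for the lock\n";
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // keep the lock busy
    mtx.unlock(); // only now can the blocked thread be woken and rescheduled

    t.join();
    return 0;
}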

🔁 The Queue Starts to Grow

But that’s not all - more threads try to acquire the same lock simultaneously. Some spin, some block, some retry.

This contention slows everything down. Every small delay accumulates, queues grow, and the bottleneck forms.

Here’s a simple illustration of contention: everyone’s waiting, but only one thread can enter the critical section at a time.

[Figure: multiple threads waiting to enter a critical section]
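
A tiny experiment (illustrative, not from the project) makes this visible: each thread spins on try_lock and counts how often it finds the lock already taken. A real std::mutex wait would block instead of spinning, but the failure count makes the contention easy to see.

#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;

// Spin on try_lock and count how often the lock was already taken
void contend(long long& failed) {
    for (int i = 0; i < 100000; ++i) {
        while (!mtx.try_lock()) {
            ++failed; // another thread holds the lock right now
        }
        // trivial critical section
        mtx.unlock();
    }
}

int main() {
    constexpr int kThreads = 4;
    long long failed[kThreads] = {};
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; ++i)
        threads.emplace_back(contend, std::ref(failed[i]));
    for (auto& t : threads) t.join();

    long long total = 0;
    for (long long f : failed) total += f;
    std::cout << "Failed lock attempts under contention: " << total << "\n";
    return 0;
}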

🏋️‍♂️ Accumulated Overhead - System Collapse

As threads wait one after another, the OS has to wake each thread, return it to a runnable state, and manage all the wait queues.

Under high load, all these actions add up dramatically, and performance suffers.

Example: Simple Mutex in C++
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx; // Mutex to protect critical section

// Function executed by each thread
void doWork(int id) {
    std::lock_guard<std::mutex> guard(mtx); // Acquire the lock (released automatically at scope exit)
    std::cout << "Thread #" << id << " entered the critical section\n";

    // Critical section work (quick, just for demonstration)

    std::cout << "Thread #" << id << " leaving the critical section\n";
}

int main() {
    // Launch two threads
    std::thread t1(doWork, 1);
    std::thread t2(doWork, 2);

    // Wait for threads to finish
    t1.join();
    t2.join();

    return 0;
}

Conclusion: Locks do protect shared variables, but under heavy load, they can become the main performance limiter.


From Locks to Atomic Operations

At this point, we did what every systems developer does when they smell a bottleneck: we opened Linux perf, set up counters, and measured.

The threads weren’t busy processing data - they were busy waiting on locks.

The solution became clear:

If the update to a shared variable is simple and doesn’t require a multi-step critical section, there’s no reason to pay the overhead of a full lock.

Not every operation needs a heavy lock.

This inspired us to explore an alternative - atomic operations - a solution that allows threads to progress without waiting for each other.

For more details on how atomic variables work in C++, see Understanding std::atomic.

Their secret? Small, lightweight updates happen directly at the hardware level - without entering the kernel, without unnecessary thread contention, and without extra context switches.

For more in-depth discussion on the trade-offs between mutexes and atomics, see CoffeeBeforeArch: Atomic vs Mutex or Stack Overflow discussion.

How Atomics Work

  • Atomic instructions such as XCHG, CMPXCHG, or LOCK ADD (on x86) update a variable in a single, indivisible operation
  • Nanosecond execution: threads aren’t blocked; each operation typically completes in nanoseconds
  • On modern CPUs, atomic operations are usually handled at the cache-line level, ensuring that no other core can modify the variable mid-operation. This also keeps them fast, since there is no need to lock the entire bus - full bus locking is mostly limited to older processors and special cases
  • Natural concurrency: multiple threads can perform different atomic operations in parallel while memory consistency is maintained
  • Memory ordering can be controlled so that threads see consistent data without slowing the whole system (see the sketch below)
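
To illustrate that last point, here's a minimal sketch (not from our codebase) of the classic release/acquire pairing: the producer writes data and then publishes it with a release store; once the consumer's acquire load sees the flag, it is guaranteed to see the data as well.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data(0);
std::atomic<bool> ready(false);

void producer() {
    data.store(42, std::memory_order_relaxed);    // plain write, no ordering needed here
    ready.store(true, std::memory_order_release); // publishes everything written before it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes; acquire pairs with the release above
    }
    assert(data.load(std::memory_order_relaxed) == 42); // guaranteed to see the write
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
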
Example: Atomic Variable in C++

This example demonstrates how a shared variable can be safely updated without using a mutex, thanks to std::atomic.

#include <iostream>
#include <atomic>
#include <thread>

std::atomic<int> counter(0); // Atomic variable, no mutex needed

// Function executed by each thread
void increment(int id) {
    int newValue = ++counter; // Atomic increment; returns the new value
    std::cout << "Thread #" << id << " incremented counter to " << newValue << "\n";
}

int main() {
    // Launch two threads
    std::thread t1(increment, 1);
    std::thread t2(increment, 2);

    t1.join();
    t2.join();

    std::cout << "Final counter value: " << counter.load() << "\n";
    return 0;
}

Why this works:

  • std::atomic ensures that ++counter executes as a single, indivisible operation - no thread can interfere mid-update.
  • No mutex is needed, so threads don’t block each other.
  • This is perfect for simple shared state like counters and flags - the same job as our mutex example, but more efficient.

Optional: Examples of Different Atomic Operations

A short demo of atomic operations in C++:

#include <atomic>
#include <iostream>

void atomicExamples() {
    std::atomic<int> a(0);

    a.fetch_add(1);    // Atomic add
    a.fetch_sub(1);    // Atomic subtract
    a.exchange(42);    // Atomic swap
    int expected = 42;
    a.compare_exchange_strong(expected, 100);
    // Compare-and-swap: set to 100 only if the current value equals expected
}


Notes:

  • Each operation is atomic - cannot be interrupted by other threads.
  • fetch_add, fetch_sub, exchange, and compare_exchange_strong are simple read-modify-write operations.
  • Ideal for counters, flags, and small shared variables, allowing safe updates without locks.

For more details on atomic operations in C++, see this guide.


Performance Improvements

After moving to atomics:

  • System throughput increased by ~30–40% under high contention scenarios, and as we added more threads, the improvement reached 50% or more.
  • CPUs were fully utilized instead of sitting idle.
  • Code became simpler - fewer lock scopes, less chance of deadlocks.
Here’s a quick comparison of run time at different thread counts (lower is better):
  • 2 threads: Mutex ~7.28, Atomics ~5.12
  • 4 threads: Mutex ~6.55, Atomics ~3.20

[Figure: performance chart comparing mutex and atomics]
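
If you want to reproduce this kind of comparison yourself, a simplified harness along these lines works (a hypothetical sketch, not our actual benchmark; thread and iteration counts are illustrative):

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kIterations = 1'000'000;

// Time how long `numThreads` threads take to run `work` concurrently
template <typename Work>
double timeRun(int numThreads, Work work) {
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numThreads; ++i) threads.emplace_back(work);
    for (auto& t : threads) t.join();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    for (int numThreads : {2, 4}) {
        long long lockedCounter = 0;
        std::mutex mtx;
        double mutexTime = timeRun(numThreads, [&] {
            for (int i = 0; i < kIterations; ++i) {
                std::lock_guard<std::mutex> guard(mtx); // every increment pays the lock
                ++lockedCounter;
            }
        });

        std::atomic<long long> atomicCounter(0);
        double atomicTime = timeRun(numThreads, [&] {
            for (int i = 0; i < kIterations; ++i) {
                atomicCounter.fetch_add(1, std::memory_order_relaxed); // lock-free increment
            }
        });

        std::cout << numThreads << " threads: mutex " << mutexTime
                  << " s, atomic " << atomicTime << " s\n";
    }
    return 0;
}

Note that the atomic version uses memory_order_relaxed, which is all a pure counter needs.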

Of course, atomics aren’t a magic bullet. For complex structures, locks or other synchronization mechanisms are still needed. But for counters, flags, and small variable states, they’re revolutionary.

For more on when to use atomics vs locks, see this discussion.


Key Takeaways

  1. Measure first: don’t assume locks are the bottleneck without profiling.
  2. Start small: identify critical variable sections before changing everything.
  3. Understand memory ordering: atomics are powerful, but small mistakes can cause bugs.
  4. Combine wisely: locks and atomics can coexist. Use each where appropriate (see the sketch below).
  5. Test under load: real multithreading issues appear mainly under heavy stress.
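
For takeaway 4, here's a small hypothetical sketch of the two coexisting: an atomic counter on the hot path, and a mutex only around the structure that genuinely needs it (EventLog and its methods are invented for illustration):

#include <atomic>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stats collector: atomics for the hot path,
// a mutex only for the complex structure that needs it.
class EventLog {
public:
    void recordHit() {
        hits_.fetch_add(1, std::memory_order_relaxed); // hot path: lock-free
    }

    void recordError(const std::string& message) {
        std::lock_guard<std::mutex> guard(mtx_); // cold path: the vector needs a lock
        errors_.push_back(message);
    }

    long long hitCount() const { return hits_.load(); }

private:
    std::atomic<long long> hits_{0};
    std::mutex mtx_;
    std::vector<std::string> errors_;
};

int main() {
    EventLog log;
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) log.recordHit(); });
    std::thread t2([&] { log.recordError("rare failure"); });
    t1.join();
    t2.join();
    std::cout << "Hits: " << log.hitCount() << "\n";
    return 0;
}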

Summary

Switching from locks to atomics transformed our streaming algorithm:
higher throughput, lower latency, and full CPU utilization.

In high-performance systems, every nanosecond matters - and atomics let you reclaim them.

If your multithreaded code still uses locks for simple updates,
try replacing them with atomics and watch your performance scale.
