DEV Community

I Want To Learn Programming
I Want To Learn Programming

Posted on • Originally published at iwtlp.com

Race conditions, explained by causing one

A race condition is the worst kind of bug: it passes your tests, runs fine on your laptop, and then once a month, under load, it corrupts a number and nobody can reproduce it. The reason it is so slippery is that it depends on timing, on two threads interleaving in just the wrong way.

The fastest way to stop being afraid of it is to cause one deliberately, watch it happen, and then fix it. We will make four threads count to a known total and come up short.

The one idea: count++ is not one step

Here is the whole bug in one line. This looks atomic:

count++;
Enter fullscreen mode Exit fullscreen mode

It is not. The CPU does it in three steps: read count, add one, write it back. Now imagine two threads run it at the same time when count is 41:

  1. Thread A reads 41.
  2. Thread B reads 41.
  3. Thread A computes 42, writes 42.
  4. Thread B computes 42, writes 42.

Two increments happened. The count went up by one. One update was lost, silently overwritten. That is a race condition: the result depends on who read and wrote when.

Cause it

Four threads, each incrementing a shared counter 100,000 times. The correct total is obviously 400,000.

class Counter {
    int count = 0;
    void inc() { count++; }   // the un-atomic read-modify-write
}

public class Race {
    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < 4; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) c.inc();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();   // wait for all to finish
        System.out.println(c.count);          // expected 400000... usually less
    }
}
Enter fullscreen mode Exit fullscreen mode

Run it a few times and you will see numbers like 387412, 391006, 400000, different every run. The missing increments are the lost updates from step 4 above. Notice the tell: it is non-deterministic. The same code, the same input, different answers. That is the signature of a race, and the reason these bugs slip through tests that run once and happen to get lucky.

Fix #1: a lock (mutual exclusion)

The simplest fix is to make the read-modify-write indivisible: only one thread may be inside inc() at a time. In Java, synchronized does this.

class Counter {
    int count = 0;
    synchronized void inc() { count++; }
}
Enter fullscreen mode Exit fullscreen mode

Now inc() is a critical section: a thread must hold the object's lock to enter, and releases it on exit. While A is incrementing, B waits. No interleaving, no lost updates, always 400,000. The cost is contention: threads queue for the lock, so heavily contended locks become a bottleneck.

Fix #2: an atomic

A lock is heavier than we need for a single increment. Hardware can do "read-modify-write as one uninterruptible operation" directly, and Java exposes it via the atomic classes:

import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    AtomicInteger count = new AtomicInteger();
    void inc() { count.incrementAndGet(); }
}
Enter fullscreen mode Exit fullscreen mode

incrementAndGet() is atomic at the CPU level (a compare-and-swap loop under the hood), so no two threads can lose an update. For a simple counter this is faster than a lock because there is no blocking, threads retry rather than wait.

Fix #3: don't share

The deepest fix is to remove the sharing. If each thread counts its own total and you sum at the end, there is no shared mutable state and therefore no race:

// each thread returns its own count; main sums the results after join()
Enter fullscreen mode Exit fullscreen mode

This is the idea behind a lot of high-performance concurrency: give each worker private state, combine at the end. No shared writes, nothing to synchronize.

Why these are the only moves

Every concurrency fix is a version of one of these three:

  • Synchronize access to shared state (locks).
  • Use atomic operations that can't be interrupted mid-way.
  • Eliminate sharing so there is nothing to race on.

The bug, in every case, is the same shape: two threads doing read-modify-write on the same data with no coordination. Once you can see that shape, the fix is a matter of picking which of the three fits.

And the only reliable way to see it is to have caused one. A counter that loses count is the whole concept in miniature; the data races in real systems are the same bug wearing a costume. If you want to build that instinct properly, races, locks, the memory model, lock-free structures, by writing concurrent code until the interleavings are obvious, that is the concurrency in Java track, where the tests are written to make the race show up instead of hide.

Top comments (0)