We had a background job that processed thousands of records in parallel.
Each batch ran concurrently, and we kept track of total successful and failed records.
Everything worked perfectly.
For almost a year.
Then one day, the totals started becoming… wrong.
No exceptions.
No crashes.
Just incorrect numbers.
The Setup
- Records processed in chunks
- Multiple chunks running concurrently
- Shared counters tracking totals
- Periodic database updates with progress
All standard parallel batch processing.
And yet — totals drifted.
The Symptom
- Some runs showed fewer successful records than expected
- Re-running the same data produced different counts
- The issue appeared only in one environment
Classic signs of a concurrency issue.
But the tricky part?
We were already using thread-safe collections.
What Was Actually Happening
Imagine two workers updating the same counter:
Initial total = 10
Worker A reads total (10)
Worker B reads total (10)
Worker A increments → 11
Worker B increments → 11 (overwrites A)
Final total = 11 ❌ (should be 12)
No exception.
No crash.
Just a lost update.
This is a race condition.
The Buggy Code
A simplified version looked like this:
int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic
    }
});
++ is not atomic. It performs:
- Read
- Increment
- Write
Multiple threads interleaving these steps leads to lost updates.
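A minimal repro makes this concrete. The sketch below (not the production job, just an illustration) hammers a plain `counter++` from many threads; because of lost updates, the final count is typically well below the expected total, and it varies from run to run:

```csharp
using System;
using System.Threading.Tasks;

// Many threads racing on a non-atomic increment.
int counter = 0;

Parallel.For(0, 1_000_000, _ =>
{
    counter++; // read, increment, write: three steps that can interleave
});

// Usually prints a number below 1000000, and a different one each run.
Console.WriteLine($"Expected 1000000, got {counter}");
```

No exception is thrown at any point, which is exactly why this class of bug hides so well.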
Why volatile Alone Doesn't Fix It
A common attempt is to use volatile:
private static volatile int totalSuccess = 0;
This ensures visibility, but not atomicity.
Two threads can still:
- read the same value
- increment
- overwrite each other
So volatile alone does not solve the race.
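To see what atomicity actually requires, here is a sketch of a compare-and-swap loop, the kind of retry loop an atomic increment performs internally (`AtomicIncrement` is an illustrative helper, not a BCL API):

```csharp
using System;
using System.Threading;

int value = 41;
Console.WriteLine(AtomicIncrement(ref value)); // prints 42

// Retry until our write is based on a value nobody else has changed.
static int AtomicIncrement(ref int location)
{
    int current, next;
    do
    {
        current = Volatile.Read(ref location); // latest visible value
        next = current + 1;
        // CompareExchange writes `next` only if `location` still equals `current`;
        // it returns the value it observed, so a mismatch means "retry".
    } while (Interlocked.CompareExchange(ref location, next, current) != current);
    return next;
}
```

Visibility (volatile) gets you the freshest read; only the compare-and-swap makes the read-modify-write indivisible.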
Why It Took a Year to Appear
Concurrency bugs are timing dependent.
The race condition existed from the beginning, but it didn’t surface consistently.
In fact, it only appeared in one environment.
Subtle runtime differences — thread scheduling, CPU contention, and execution timing — made overlapping updates more likely there, eventually exposing the issue.
No code changes were required.
Just different timing.
The Fix: Atomic Counters
We replaced non-atomic updates with atomic operations:
int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});
This guarantees increments are atomic.
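You can verify the difference directly. Repeating the earlier stress test with Interlocked.Increment always lands on the exact total:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Same workload as the buggy version, but with an atomic increment.
int counter = 0;

Parallel.For(0, 1_000_000, _ => Interlocked.Increment(ref counter));

Console.WriteLine(counter); // always 1000000
```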
The Real-World Fix: Snapshot-Based Progress Reporting
We also had periodic progress updates.
Multiple workers updated counters while one periodically persisted totals.
The correct pattern was:
var finished = Interlocked.Increment(ref completedChunks);

if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);

    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;

    await UpdateJobProgress(job);
}
Why This Works
- Interlocked → atomic updates
- Volatile.Read → latest visible value
- Snapshot → consistent progress reporting
- Batched DB updates → reduced contention
This eliminates inconsistent totals.
Additional Improvement: Local Aggregation
To reduce contention further:
Parallel.ForEach(chunks, chunk =>
{
    int localSuccess = 0;
    int localFailure = 0;

    foreach (var record in chunk)
    {
        if (Process(record))
            localSuccess++;
        else
            localFailure++;
    }

    Interlocked.Add(ref totalSuccess, localSuccess);
    Interlocked.Add(ref totalFailed, localFailure);
});
This minimizes shared writes.
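Parallel.ForEach also has a built-in overload for exactly this pattern, with a localInit factory and a localFinally merge step. A runnable sketch (IsEven stands in for the real Process check, which isn't shown in this post):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Placeholder for the real per-record work.
static bool IsEven(int record) => record % 2 == 0;

var records = Enumerable.Range(1, 100_000).ToArray();
int totalSuccess = 0;

Parallel.ForEach(
    records,
    () => 0,                                 // localInit: per-thread counter
    (record, state, local) =>
        IsEven(record) ? local + 1 : local,  // body: no shared writes
    local => Interlocked.Add(ref totalSuccess, local)); // localFinally: one atomic merge per thread

Console.WriteLine(totalSuccess); // 50000
```

Each worker thread touches shared state exactly once, in localFinally, instead of once per record.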
Lessons Learned
- Thread-safe collections ≠ thread-safe logic
- ++ is not atomic
- volatile ensures visibility, not correctness
- Use Interlocked for counters
- Snapshot values using Volatile.Read
- Reduce shared mutable state
- Batch progress updates
- Concurrency bugs are timing dependent
Takeaway
If you're running parallel batch jobs and tracking totals:
- Use atomic counters
- Take snapshot reads for reporting
- Avoid frequent shared writes
Otherwise, everything may look fine…
Until it doesn't.