Harshit Singhal

Originally published at harshitsinghal13.github.io

This Concurrency Bug Stayed Hidden for a Year

We had a background job that processed thousands of records in parallel.
Each batch ran concurrently, and we kept track of total successful and failed records.

Everything worked perfectly.

For almost a year.

Then one day, the totals started becoming… wrong.

No exceptions.
No crashes.
Just incorrect numbers.


The Setup

  • Records processed in chunks
  • Multiple chunks running concurrently
  • Shared counters tracking totals
  • Periodic database updates with progress

All standard parallel batch processing.

And yet — totals drifted.


The Symptom

  • Some runs showed fewer successful records than expected
  • Re-running the same data produced different counts
  • The issue appeared only in one environment

Classic signs of a concurrency issue.

But the tricky part?

We were already using thread-safe collections.


What Was Actually Happening

Imagine two workers updating the same counter:

Initial total = 10

Worker A reads total (10)
Worker B reads total (10)

Worker A increments → 11
Worker B increments → 11  (overwrites A)

Final total = 11  ❌ (should be 12)

No exception.
No crash.
Just a lost update.

This is a race condition.


The Buggy Code

A simplified version looked like this:

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic
    }
});

++ is not atomic. It performs:

  1. Read
  2. Increment
  3. Write

Multiple threads interleaving these steps leads to lost updates.
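
The interleaving described above is easy to reproduce. The following is a minimal sketch, not the original job: the iteration count and variable names are invented for illustration, and it assumes .NET 6 or later with top-level statements.

```csharp
using System;
using System.Threading.Tasks;

const int Iterations = 1_000_000;
int racyTotal = 0;

// Every iteration performs an unsynchronized read-modify-write on the same variable.
Parallel.For(0, Iterations, _ =>
{
    racyTotal++; // read, add 1, write back: three steps that other threads can interleave
});

// On a multi-core machine the observed value is usually below Iterations,
// and it typically differs from run to run.
Console.WriteLine($"expected {Iterations}, observed {racyTotal}");
```

Lost updates can only make the counter too small, never too large, which matches the "fewer successful records than expected" symptom.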


Why volatile Alone Doesn't Fix It

A common attempt is to use volatile:

private static volatile int totalSuccess = 0;

This affects the visibility and ordering of individual reads and writes. It does not make the read-increment-write sequence atomic.

Two threads can still:

  • read the same value
  • increment
  • overwrite each other

So volatile alone does not solve the race.
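
To see why, it may help to spell out what an increment on a volatile field amounts to. C# does not allow volatile on local variables, so this sketch substitutes explicit Volatile.Read and Volatile.Write calls, which have the same acquire/release semantics. The iteration count and names are illustrative assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int Iterations = 1_000_000;
int total = 0;

Parallel.For(0, Iterations, _ =>
{
    // Roughly what `total++` does on a volatile field: a volatile read, then a volatile write.
    int seen = Volatile.Read(ref total);
    Volatile.Write(ref total, seen + 1); // another thread may have written in between
});

// Each access is "visible", yet updates are still lost under contention.
Console.WriteLine($"expected {Iterations}, observed {total}");
```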


Why It Took a Year to Appear

Concurrency bugs are timing dependent.

The race condition existed from the beginning, but it didn’t surface consistently.
In fact, it only appeared in one environment.

Subtle runtime differences — thread scheduling, CPU contention, and execution timing — made overlapping updates more likely there, eventually exposing the issue.

No code change was needed to trigger it.
Just different timing.


The Fix: Atomic Counters

We replaced non-atomic updates with atomic operations:

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});

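Interlocked.Increment performs the read, the add, and the write as one atomic operation, so concurrent increments can no longer overwrite each other.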
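
A self-contained way to check this is to run the fixed loop against data whose outcome is known in advance. In this sketch, Process is a hypothetical stand-in (even numbers succeed) and the record set is synthetic; only the Interlocked calls mirror the fix above.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real Process(record): even records succeed.
static bool Process(int record) => record % 2 == 0;

int[] records = Enumerable.Range(0, 100_000).ToArray();
int totalSuccess = 0;
int totalFailed = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
        Interlocked.Increment(ref totalSuccess); // atomic read-modify-write
    else
        Interlocked.Increment(ref totalFailed);
});

// Unlike the racy version, these totals are the same on every run.
Console.WriteLine($"success={totalSuccess}, failed={totalFailed}");
```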


The Real-World Fix: Snapshot-Based Progress Reporting

We also had periodic progress updates.
Multiple workers updated counters while one periodically persisted totals.

The correct pattern was:

var finished = Interlocked.Increment(ref completedChunks);

if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);

    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;

    await UpdateJobProgress(job);
}

Why This Works

  • Interlocked → atomic updates
  • Volatile.Read → latest visible value
  • Snapshot → consistent progress reporting
  • Batched DB updates → reduced contention

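This eliminates lost updates, so the final totals are exact. The two snapshot reads are not taken as one atomic pair, so an intermediate progress update can lag slightly behind the live counters.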
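
The snippet above is a fragment, so here is one way it could sit inside a complete loop. This is a sketch under several assumptions: it uses Parallel.ForEachAsync (.NET 6+) as the worker loop, replaces the database call with a hypothetical UpdateJobProgress that only records what would be persisted, and uses synthetic data with an arbitrary success rule.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int maxConcurrency = 4;
int totalSuccess = 0, totalFailed = 0, completedChunks = 0;

// Hypothetical stand-in for the database write: it only records what would be persisted.
var persisted = new ConcurrentQueue<(int Success, int Failed)>();
Task UpdateJobProgress(int success, int failed)
{
    persisted.Enqueue((success, failed));
    return Task.CompletedTask;
}

// 20 chunks of 50 records; a record "succeeds" when divisible by 3 (illustrative rule).
int[][] chunks = Enumerable.Range(0, 1000).Chunk(50).ToArray();
var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrency };

await Parallel.ForEachAsync(chunks, options, async (chunk, cancellationToken) =>
{
    int localSuccess = chunk.Count(r => r % 3 == 0);
    Interlocked.Add(ref totalSuccess, localSuccess);
    Interlocked.Add(ref totalFailed, chunk.Length - localSuccess);

    // Increment returns the new value, so each multiple of maxConcurrency
    // is observed by exactly one worker.
    int finished = Interlocked.Increment(ref completedChunks);
    if (finished % maxConcurrency == 0)
    {
        int successSnapshot = Volatile.Read(ref totalSuccess);
        int failureSnapshot = Volatile.Read(ref totalFailed);
        await UpdateJobProgress(successSnapshot, failureSnapshot);
    }
});

// Final write after all workers have finished, so the persisted totals are exact.
await UpdateJobProgress(totalSuccess, totalFailed);
Console.WriteLine($"success={totalSuccess}, failed={totalFailed}, progress writes={persisted.Count}");
```

The final write after the loop is an addition in this sketch. Without it, a chunk count that is not a multiple of maxConcurrency would leave the last persisted row short of the true totals.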


Additional Improvement: Local Aggregation

To reduce contention further:

Parallel.ForEach(chunks, chunk =>
{
    int localSuccess = 0;
    int localFailure = 0;

    foreach (var record in chunk)
    {
        if (Process(record))
            localSuccess++;
        else
            localFailure++;
    }

    Interlocked.Add(ref totalSuccess, localSuccess);
    Interlocked.Add(ref totalFailed, localFailure);
});

Each chunk now performs two shared writes in total, instead of one per record.
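
When the input is a flat list of records rather than pre-built chunks, a related option is the Parallel.ForEach overload that carries per-worker local state. This is a different technique from the code above, shown only as a sketch: Process and the data are hypothetical stand-ins, and the tuple-based local state is one possible choice.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for Process(record): even records succeed.
static bool Process(int record) => record % 2 == 0;

int[] records = Enumerable.Range(0, 100_000).ToArray();
int totalSuccess = 0;
int totalFailed = 0;

Parallel.ForEach<int, (int Success, int Failed)>(
    records,
    // localInit: each worker task starts with its own private pair of counters.
    () => (0, 0),
    // body: touches only the private pair, so no synchronization is needed here.
    (record, loopState, local) =>
        Process(record)
            ? (local.Success + 1, local.Failed)
            : (local.Success, local.Failed + 1),
    // localFinally: runs once per worker task and merges its subtotal into the shared totals.
    local =>
    {
        Interlocked.Add(ref totalSuccess, local.Success);
        Interlocked.Add(ref totalFailed, local.Failed);
    });

Console.WriteLine($"success={totalSuccess}, failed={totalFailed}");
```

Here the number of shared writes scales with the number of worker tasks rather than with the number of records.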


Lessons Learned

  • Thread-safe collections ≠ thread-safe logic
  • ++ is not atomic
  • volatile ensures visibility, not atomicity
  • Use Interlocked for counters
  • Snapshot values using Volatile.Read
  • Reduce shared mutable state
  • Batch progress updates
  • Concurrency bugs are timing dependent
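
The first lesson is the easiest to miss, so here is a hypothetical illustration. It is not taken from the original job, whose collection usage is not shown. It assumes a ConcurrentDictionary used as a counter store and contrasts a get-then-set through the indexer with AddOrUpdate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

const int Iterations = 200_000;
var counts = new ConcurrentDictionary<string, int>();
counts["racy"] = 0;
counts["safe"] = 0;

Parallel.For(0, Iterations, i =>
{
    // Each indexer call is thread-safe on its own, but the get and the set
    // are two separate calls, so another thread can update the value in between.
    counts["racy"] = counts["racy"] + 1;

    // AddOrUpdate applies the update atomically per key
    // (the delegate may be re-run under contention).
    counts.AddOrUpdate("safe", 1, (key, current) => current + 1);
});

Console.WriteLine($"expected {Iterations}, racy {counts["racy"]}, safe {counts["safe"]}");
```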

Takeaway

If you're running parallel batch jobs and tracking totals:

  • Use atomic counters
  • Take snapshot reads for reporting
  • Avoid frequent shared writes

Otherwise, everything may look fine…

Until it doesn't.
