We had a background job that processed thousands of records in parallel.
Each batch ran concurrently, and we kept track of total successful and failed records.
Everything worked perfectly.
For almost a year.
Then one day, the totals started becoming… wrong.
No exceptions.
No crashes.
Just incorrect numbers.
The Setup
- Records processed in chunks
- Multiple chunks running concurrently
- Shared counters tracking totals
- Periodic database updates with progress
All standard parallel batch processing.
And yet — totals drifted.
The Symptom
- Some runs showed fewer successful records than expected
- Re-running the same data produced different counts
- The issue appeared only in one environment
Classic signs of a concurrency issue.
But the tricky part?
We were already using thread-safe collections.
What Was Actually Happening
Imagine two workers updating the same counter:
Initial total = 10
Worker A reads total (10)
Worker B reads total (10)
Worker A increments → 11
Worker B increments → 11 (overwrites A)
Final total = 11 ❌ (should be 12)
No exception.
No crash.
Just a lost update.
This is a race condition.
The Buggy Code
A simplified version looked like this:
int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic
    }
});
++ is not atomic. It performs:
- Read
- Increment
- Write
Multiple threads interleaving these steps leads to lost updates.
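A minimal repro makes this concrete. The sketch below (not the production job, just an illustration) hammers a plain `counter++` from many threads; because of lost updates, the final count is typically well below the expected total, and it varies from run to run:

```csharp
using System;
using System.Threading.Tasks;

// Many threads racing on a non-atomic increment.
int counter = 0;

Parallel.For(0, 1_000_000, _ =>
{
    counter++; // read, increment, write: three steps that can interleave
});

// Usually prints a number below 1000000, and a different one each run.
Console.WriteLine($"Expected 1000000, got {counter}");
```

No exception is thrown at any point, which is exactly why this class of bug hides so well.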
Why volatile Alone Doesn't Fix It
A common attempt is to use volatile:
private static volatile int totalSuccess = 0;
This ensures visibility, but not atomicity.
Two threads can still:
- read the same value
- increment
- overwrite each other
So volatile alone does not solve the race.
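To see what atomicity actually requires, here is a sketch of a compare-and-swap loop, the kind of retry loop an atomic increment performs internally (`AtomicIncrement` is an illustrative helper, not a BCL API):

```csharp
using System;
using System.Threading;

int value = 41;
Console.WriteLine(AtomicIncrement(ref value)); // prints 42

// Retry until our write is based on a value nobody else has changed.
static int AtomicIncrement(ref int location)
{
    int current, next;
    do
    {
        current = Volatile.Read(ref location); // latest visible value
        next = current + 1;
        // CompareExchange writes `next` only if `location` still equals `current`;
        // it returns the value it observed, so a mismatch means "retry".
    } while (Interlocked.CompareExchange(ref location, next, current) != current);
    return next;
}
```

Visibility (volatile) gets you the freshest read; only the compare-and-swap makes the read-modify-write indivisible.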
Why It Took a Year to Appear
Concurrency bugs are timing dependent.
The race condition existed from the beginning, but it didn’t surface consistently.
In fact, it only appeared in one environment.
Subtle runtime differences — thread scheduling, CPU contention, and execution timing — made overlapping updates more likely there, eventually exposing the issue.
No code changes were required.
Just different timing.
The Fix: Atomic Counters
We replaced non-atomic updates with atomic operations:
int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});
This guarantees increments are atomic.
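You can verify the difference directly. Repeating the earlier stress test with Interlocked.Increment always lands on the exact total:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Same workload as the buggy version, but with an atomic increment.
int counter = 0;

Parallel.For(0, 1_000_000, _ => Interlocked.Increment(ref counter));

Console.WriteLine(counter); // always 1000000
```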
The Real-World Fix: Snapshot-Based Progress Reporting
We also had periodic progress updates.
Multiple workers updated counters while one periodically persisted totals.
The correct pattern was:
var finished = Interlocked.Increment(ref completedChunks);

if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);

    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;

    await UpdateJobProgress(job);
}
Why This Works
- Interlocked → atomic updates
- Volatile.Read → latest visible value
- Snapshot → consistent progress reporting
- Batched DB updates → reduced contention
This eliminates inconsistent totals.
Additional Improvement: Local Aggregation
To reduce contention further:
Parallel.ForEach(chunks, chunk =>
{
    int localSuccess = 0;
    int localFailure = 0;

    foreach (var record in chunk)
    {
        if (Process(record))
            localSuccess++;
        else
            localFailure++;
    }

    Interlocked.Add(ref totalSuccess, localSuccess);
    Interlocked.Add(ref totalFailed, localFailure);
});
This minimizes shared writes.
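Parallel.ForEach also has a built-in overload for exactly this pattern, with a localInit factory and a localFinally merge step. A runnable sketch (IsEven stands in for the real Process check, which isn't shown in this post):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Placeholder for the real per-record work.
static bool IsEven(int record) => record % 2 == 0;

var records = Enumerable.Range(1, 100_000).ToArray();
int totalSuccess = 0;

Parallel.ForEach(
    records,
    () => 0,                                 // localInit: per-thread counter
    (record, state, local) =>
        IsEven(record) ? local + 1 : local,  // body: no shared writes
    local => Interlocked.Add(ref totalSuccess, local)); // localFinally: one atomic merge per thread

Console.WriteLine(totalSuccess); // 50000
```

Each worker thread touches shared state exactly once, in localFinally, instead of once per record.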
Lessons Learned
- Thread-safe collections ≠ thread-safe logic
- ++ is not atomic
- volatile ensures visibility, not correctness
- Use Interlocked for counters
- Snapshot values using Volatile.Read
- Reduce shared mutable state
- Batch progress updates
- Concurrency bugs are timing dependent
Takeaway
If you're running parallel batch jobs and tracking totals:
- Use atomic counters
- Take snapshot reads for reporting
- Avoid frequent shared writes
Otherwise, everything may look fine…
Until it doesn't.