<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harshit Singhal</title>
    <description>The latest articles on DEV Community by Harshit Singhal (@harshitsinghal13).</description>
    <link>https://dev.to/harshitsinghal13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3804069%2F146b276a-8438-4d36-89f7-1f3c7a772398.png</url>
      <title>DEV Community: Harshit Singhal</title>
      <link>https://dev.to/harshitsinghal13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harshitsinghal13"/>
    <language>en</language>
    <item>
      <title>This Concurrency Bug Stayed Hidden for a Year</title>
      <dc:creator>Harshit Singhal</dc:creator>
      <pubDate>Tue, 14 Apr 2026 19:42:59 +0000</pubDate>
      <link>https://dev.to/harshitsinghal13/this-concurrency-bug-stayed-hidden-for-a-year-2f6j</link>
      <guid>https://dev.to/harshitsinghal13/this-concurrency-bug-stayed-hidden-for-a-year-2f6j</guid>
      <description>&lt;p&gt;We had a background job that processed thousands of records in parallel.&lt;br&gt;
Each batch ran concurrently, and we kept track of total successful and failed records.&lt;/p&gt;

&lt;p&gt;Everything worked perfectly.&lt;/p&gt;

&lt;p&gt;For almost a year.&lt;/p&gt;

&lt;p&gt;Then one day, the totals started coming out… wrong.&lt;/p&gt;

&lt;p&gt;No exceptions.&lt;br&gt;
No crashes.&lt;br&gt;
Just incorrect numbers.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Records processed in chunks&lt;/li&gt;
&lt;li&gt;Multiple chunks running concurrently&lt;/li&gt;
&lt;li&gt;Shared counters tracking totals&lt;/li&gt;
&lt;li&gt;Periodic database updates with progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All standard parallel batch processing.&lt;/p&gt;

&lt;p&gt;And yet — totals drifted.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Some runs showed fewer successful records than expected&lt;/li&gt;
&lt;li&gt;Re-running the same data produced different counts&lt;/li&gt;
&lt;li&gt;The issue appeared only in one environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classic signs of a concurrency issue.&lt;/p&gt;

&lt;p&gt;But the tricky part?&lt;/p&gt;

&lt;p&gt;We were already using thread-safe collections.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Was Actually Happening
&lt;/h2&gt;

&lt;p&gt;Imagine two workers updating the same counter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initial total = 10

Worker A reads total (10)
Worker B reads total (10)

Worker A increments → 11
Worker B increments → 11  (overwrites A)

Final total = 11  ❌ (should be 12)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No exception.&lt;br&gt;
No crash.&lt;br&gt;
Just a lost update.&lt;/p&gt;

&lt;p&gt;This is a race condition.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Buggy Code
&lt;/h2&gt;

&lt;p&gt;A simplified version looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ForEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="err"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="err"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;record&lt;/span&gt;&lt;span class="err"&gt;))&lt;/span&gt;
    &lt;span class="err"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;totalSuccess&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt; &lt;span class="c1"&gt;// not atomic&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;++&lt;/code&gt; is not atomic. It performs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read&lt;/li&gt;
&lt;li&gt;Increment&lt;/li&gt;
&lt;li&gt;Write&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Multiple threads interleaving these steps leads to lost updates.&lt;/p&gt;
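
&lt;p&gt;The lost update is easy to reproduce in isolation. This standalone sketch (not the original job code) races a plain counter against an atomic one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;int unsafeTotal = 0;
int safeTotal = 0;

// A million increments spread across all cores.
Parallel.For(0, 1_000_000, _ =&amp;gt;
{
    unsafeTotal++;                         // read + increment + write, interleavable
    Interlocked.Increment(ref safeTotal);  // single atomic operation
});

Console.WriteLine(unsafeTotal);  // typically falls short of 1,000,000
Console.WriteLine(safeTotal);    // always 1,000,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;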




&lt;h2&gt;
  
  
  Why &lt;code&gt;volatile&lt;/code&gt; Alone Doesn't Fix It
&lt;/h2&gt;

&lt;p&gt;A common attempt is to use &lt;code&gt;volatile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures visibility, but &lt;strong&gt;not atomicity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two threads can still:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read the same value&lt;/li&gt;
&lt;li&gt;increment&lt;/li&gt;
&lt;li&gt;overwrite each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So &lt;code&gt;volatile&lt;/code&gt; alone does not solve the race.&lt;/p&gt;
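
&lt;p&gt;And when several fields must change together, making each one atomic individually isn't enough either; the compound update needs a &lt;code&gt;lock&lt;/code&gt;. A sketch (the second field is hypothetical, for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;private static readonly object _sync = new object();

// Inside the worker:
lock (_sync)
{
    totalSuccess++;               // safe: only one thread holds the lock at a time
    lastProcessedId = record.Id;  // hypothetical related field, kept consistent with the count
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;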




&lt;h2&gt;
  
  
  Why It Took a Year to Appear
&lt;/h2&gt;

&lt;p&gt;Concurrency bugs are timing dependent.&lt;/p&gt;

&lt;p&gt;The race condition existed from the beginning, but it didn’t surface consistently.&lt;br&gt;
In fact, it only appeared in one environment.&lt;/p&gt;

&lt;p&gt;Subtle runtime differences — thread scheduling, CPU contention, and execution timing — made overlapping updates more likely there, eventually exposing the issue.&lt;/p&gt;

&lt;p&gt;No code changes were required.&lt;br&gt;
Just different timing.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Fix: Atomic Counters
&lt;/h2&gt;

&lt;p&gt;We replaced non-atomic updates with atomic operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ForEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="err"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="err"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;record&lt;/span&gt;&lt;span class="err"&gt;))&lt;/span&gt;
    &lt;span class="err"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees increments are atomic.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Fix: Snapshot-Based Progress Reporting
&lt;/h2&gt;

&lt;p&gt;We also had periodic progress updates.&lt;br&gt;
Multiple workers updated counters while one periodically persisted totals.&lt;/p&gt;

&lt;p&gt;The correct pattern was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;finished&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;completedChunks&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finished&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="n"&gt;maxConcurrency&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;successSnapshot&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Volatile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;failureSnapshot&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Volatile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;totalFailed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalSuccessfulRecords&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;successSnapshot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalFailedRecords&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;failureSnapshot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;UpdateJobProgress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Interlocked&lt;/code&gt; → atomic updates&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Volatile.Read&lt;/code&gt; → latest visible value&lt;/li&gt;
&lt;li&gt;Snapshot → consistent progress reporting&lt;/li&gt;
&lt;li&gt;Batched DB updates → reduced contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminates inconsistent totals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional Improvement: Local Aggregation
&lt;/h2&gt;

&lt;p&gt;To reduce contention further:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ForEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;localSuccess&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;localFailure&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;record&lt;/span&gt;&lt;span class="err"&gt;))&lt;/span&gt;
            &lt;span class="nc"&gt;localSuccess&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
            &lt;span class="n"&gt;localFailure&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;totalSuccess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;localSuccess&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Interlocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;totalFailed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;localFailure&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This minimizes shared writes.&lt;/p&gt;
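
&lt;p&gt;&lt;code&gt;Parallel.ForEach&lt;/code&gt; also ships an overload built for exactly this pattern: per-thread state via &lt;code&gt;localInit&lt;/code&gt; and &lt;code&gt;localFinally&lt;/code&gt;, merged once per thread. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;Parallel.ForEach(
    records,
    () =&amp;gt; (success: 0, failed: 0),          // localInit: fresh state per thread
    (record, _, local) =&amp;gt; Process(record)
        ? (local.success + 1, local.failed)
        : (local.success, local.failed + 1),
    local =&amp;gt;                                 // localFinally: one merge per thread
    {
        Interlocked.Add(ref totalSuccess, local.success);
        Interlocked.Add(ref totalFailed, local.failed);
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;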




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Thread-safe collections ≠ thread-safe logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;++&lt;/code&gt; is not atomic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;volatile&lt;/code&gt; ensures visibility, not correctness&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;Interlocked&lt;/code&gt; for counters&lt;/li&gt;
&lt;li&gt;Snapshot values using &lt;code&gt;Volatile.Read&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reduce shared mutable state&lt;/li&gt;
&lt;li&gt;Batch progress updates&lt;/li&gt;
&lt;li&gt;Concurrency bugs are timing dependent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;If you're running parallel batch jobs and tracking totals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use atomic counters&lt;/li&gt;
&lt;li&gt;Take snapshot reads for reporting&lt;/li&gt;
&lt;li&gt;Avoid frequent shared writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, everything may look fine…&lt;/p&gt;

&lt;p&gt;Until it doesn't.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>multithreading</category>
      <category>concurrency</category>
    </item>
    <item>
      <title>SSE vs WebSocket for One-Way Push: Runtime and Operational Tradeoffs</title>
      <dc:creator>Harshit Singhal</dc:creator>
      <pubDate>Thu, 05 Mar 2026 05:34:08 +0000</pubDate>
      <link>https://dev.to/harshitsinghal13/sse-vs-websocket-for-one-way-push-runtime-and-operational-tradeoffs-o28</link>
      <guid>https://dev.to/harshitsinghal13/sse-vs-websocket-for-one-way-push-runtime-and-operational-tradeoffs-o28</guid>
      <description>&lt;p&gt;When scaling server-to-client push systems, the choice between Server-Sent Events (SSE) and WebSocket determines your operational burden under production load. This analysis examines runtime behavior, memory patterns, and tail latency tradeoffs for unidirectional real-time delivery.&lt;/p&gt;

&lt;p&gt;For one-way server push, the protocol decision is mostly about failure shape, memory behavior, and latency tail control under runtime limits, not feature parity.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Executive Framing
&lt;/h2&gt;

&lt;p&gt;Teams often pick WebSocket by default, then spend quarters containing emergent behavior at high concurrency: memory ballooning, opaque backpressure, and unstable p99 latency during noisy-neighbor periods. For one-way server-to-client streams, SSE frequently delivers a lower operational burden with more predictable saturation behavior.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The question is not which protocol is more capable. The question is which runtime path remains debuggable when runtimes are CPU-throttled and connection counts are high.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. Protocol and Runtime Architecture
&lt;/h2&gt;

&lt;p&gt;In async server stacks, SSE is an HTTP response stream over a long-lived request lifecycle. That keeps load balancer behavior, middleware, and instrumentation aligned with existing HTTP control planes. WebSocket introduces an upgrade lifecycle with explicit connection state, heartbeat policy, and bidirectional framing mechanics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSE lifecycle:&lt;/strong&gt; accept request, attach stream producer, flush events, close on client disconnect or server policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket lifecycle:&lt;/strong&gt; HTTP upgrade, maintain protocol state machine, negotiate keepalive semantics, manage outbound and inbound frame buffers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event loop impact:&lt;/strong&gt; SSE workloads are mostly write-path scheduling; WebSocket adds more state transitions per connection and tighter coupling to heartbeat cadence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under pressure, the simpler lifecycle tends to produce clearer failure modes. Complexity is not free when thousands of sockets share one loop.&lt;/p&gt;
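
&lt;p&gt;On the wire, an SSE stream is just a long-lived HTTP response with a specific content type; each event is a &lt;code&gt;data:&lt;/code&gt; block terminated by a blank line (payloads illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"type":"heartbeat"}

data: {"type":"tick","seq":42}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;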

&lt;h2&gt;
  
  
  3. Resource Cost Analysis: CPU, Memory, Connection State
&lt;/h2&gt;

&lt;p&gt;CPU and memory costs diverge meaningfully once concurrency rises beyond a modest baseline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; WebSocket commonly pays additional protocol-management overhead, especially with active keepalive and bidirectional framing. SSE cost is usually concentrated in serialization and write flush cadence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; WebSocket risk is memory amplification through per-connection buffers and unsignaled backlog growth. SSE can still leak memory if queues are unbounded, but the model is easier to constrain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State surface:&lt;/strong&gt; More connection state means more edge cases during deploy rollouts, proxy restarts, and intermittent packet loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production environments with resource isolation, these differences become visible quickly because hard memory and CPU limits turn soft inefficiencies into hard failure behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Behavior Under Scale
&lt;/h2&gt;

&lt;p&gt;Throughput can remain stable while user experience degrades. The earliest signal is usually latency tail expansion, not median latency drift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p95 often starts rising before alerting thresholds fire on raw request volume.&lt;/li&gt;
&lt;li&gt;p99 is where buffer growth and scheduler contention become obvious.&lt;/li&gt;
&lt;li&gt;When CPU throttling increases, jitter in flush cadence widens, and tail latency becomes noisy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical target is graceful degradation: bounded queues, deterministic shedding, and fast recovery when pressure falls.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Operational Risks and When Not to Use This Approach
&lt;/h2&gt;

&lt;p&gt;Prefer SSE only when traffic is truly one-way. Do not force SSE into workflows that require low-latency client-to-server signaling, custom binary framing, or tightly coupled duplex control loops.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not use SSE for interactive bidirectional sessions where client events are first-class.&lt;/li&gt;
&lt;li&gt;Do not rely on either protocol without explicit bounded buffering per connection.&lt;/li&gt;
&lt;li&gt;Do not ignore intermediary timeout behavior; default idle timers can cause reconnection storms.&lt;/li&gt;
&lt;li&gt;Do not scale solely on CPU for push workloads; memory and queue pressure must participate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;SSE&lt;/th&gt;
&lt;th&gt;WebSocket&lt;/th&gt;
&lt;th&gt;Preferred&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Server-to-client only feed&lt;/td&gt;
&lt;td&gt;Simple lifecycle, HTTP-native ops&lt;/td&gt;
&lt;td&gt;Works, but higher state surface&lt;/td&gt;
&lt;td&gt;SSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bidirectional interactive control&lt;/td&gt;
&lt;td&gt;Awkward and fragmented&lt;/td&gt;
&lt;td&gt;Native duplex channel&lt;/td&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High concurrency under strict memory limits&lt;/td&gt;
&lt;td&gt;Typically easier to bound&lt;/td&gt;
&lt;td&gt;Higher risk of buffer-driven instability&lt;/td&gt;
&lt;td&gt;SSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol/tooling operability in HTTP infra&lt;/td&gt;
&lt;td&gt;Strong alignment&lt;/td&gt;
&lt;td&gt;Requires explicit tuning and observability work&lt;/td&gt;
&lt;td&gt;SSE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  7. Monitoring and SLO Implications
&lt;/h2&gt;

&lt;p&gt;Instrument around saturation mechanics, not only request counters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; track p50/p95/p99 for end-to-end event delivery, not only server response times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection health:&lt;/strong&gt; active connections, reconnect rate, disconnect reasons, and reconnection storm detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure:&lt;/strong&gt; per-connection queue depth, dropped-event counters, and write timeout rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime pressure:&lt;/strong&gt; memory usage, OOM events, CPU throttled time, and GC pause distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLO design should include tail-latency error budgets and controlled degradation policy. If p99 delivery drifts while queue depth climbs, treat it as user-visible failure even if throughput appears healthy.&lt;/p&gt;
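
&lt;p&gt;A minimal sketch of the counters behind those signals, assuming no metrics library (all names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from dataclasses import dataclass, field

@dataclass
class PushStats:
    # Per-process counters for a push server; export them however your stack prefers.
    active_connections: int = 0
    reconnects: int = 0
    dropped_events: int = 0
    delivery_latencies: list = field(default_factory=list)

    def record_delivery(self, enqueued_at: float) -&amp;gt; None:
        # End-to-end delivery latency: enqueue time to write time.
        self.delivery_latencies.append(time.monotonic() - enqueued_at)

    def p99(self) -&amp;gt; float:
        # Nearest-rank p99; good enough for a dashboard.
        xs = sorted(self.delivery_latencies)
        return xs[min(int(len(xs) * 0.99), len(xs) - 1)] if xs else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;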

&lt;h2&gt;
  
  
  Minimal Dependency-Free Pattern with Bounded Queue
&lt;/h2&gt;

&lt;p&gt;The key property is bounded buffering with a deterministic drop policy. This keeps memory predictable under transient spikes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_disconnected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heartbeat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_nowait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Drop oldest to preserve bounded memory.
&lt;/span&gt;            &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_nowait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client_disconnected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c1"&gt;# SSE frame format: one event per double newline.
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Engineering Conclusion
&lt;/h2&gt;

&lt;p&gt;For one-way push, SSE is often the stronger default because it reduces state complexity, constrains memory behavior more naturally, and keeps operational visibility aligned with HTTP tooling. Choose WebSocket when duplex interaction is a hard requirement, not as a reflex.&lt;/p&gt;

&lt;p&gt;The protocol decision should be validated against tail latency, runtime pressure, and recovery behavior under bursty load. Build for graceful failure first. Peak throughput follows from disciplined runtime design.&lt;/p&gt;

</description>
      <category>sse</category>
      <category>websocket</category>
    </item>
  </channel>
</rss>
