Todd Tanner
We Found a Memory Ordering Bug in Every Major Browser Engine - Here's the Fix

While building a .NET-to-WebAssembly GPU compute library (SpawnDev.ILGPU), we hit an "impossible" race condition: our multi-worker barrier synchronization worked perfectly with 2 workers but failed catastrophically with 3+. After weeks of isolation, we traced it to a memory ordering bug in Atomics.wait that affects V8 (Chrome/Node.js), SpiderMonkey (Firefox), and ARM devices.

We've filed it as Chromium Issue #495679735, built a minimal reproducer with a live demo, and shipped a proven workaround.

The Bug in 30 Seconds

When 3+ Web Workers synchronize using a generation-counting barrier with Atomics.wait / Atomics.notify:

  1. Workers write data to SharedArrayBuffer
  2. Workers enter a barrier (atomic arrival counter + generation bump + wait/notify)
  3. After the barrier, workers read each other's data
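The barrier in step 2 can be sketched like this. This is a minimal illustration of a generation-counting barrier, not the actual SpawnDev.ILGPU code; the slot layout and names are assumptions:

```javascript
// Illustrative SharedArrayBuffer layout (not the SpawnDev.ILGPU one):
//   view[COUNT] = arrival counter, view[GEN] = generation counter.
const COUNT = 0, GEN = 1;

function barrierArrive(view, totalWorkers) {
  const myGen = Atomics.load(view, GEN);
  if (Atomics.add(view, COUNT, 1) + 1 === totalWorkers) {
    // Last arriver: reset the counter, bump the generation, wake everyone.
    Atomics.store(view, COUNT, 0);
    Atomics.add(view, GEN, 1);
    Atomics.notify(view, GEN);
  } else {
    // Everyone else sleeps until the generation changes. If it has
    // already changed, Atomics.wait returns "not-equal" immediately.
    Atomics.wait(view, GEN, myGen);
  }
}
```

After `barrierArrive` returns, every worker should, per the spec's intent, see all stores made before any worker entered the barrier; the bug is that the "not-equal" return path fails to deliver that guarantee.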

Expected: All workers see all other workers' writes after the barrier completes.

Actual: Workers whose Atomics.wait returns "not-equal" (because the generation was already bumped before wait was called) do not see prior stores from other workers. The return value is correct, but the memory ordering guarantee is missing.

~66% of cross-worker reads are stale. With 3 workers, each worker reads the 2 other workers' slots, so there are 6 cross-worker reads per iteration. The last arriver's reads are fresh (it performed the bump itself), while the 2 workers on the "not-equal" path account for the other 4: 2/3 = 66.7%, which is exactly what we measured.

The Happens-Before Gap

The ordering edge flows like this:

Writer stores → Last Arriver (bumps generation) → Atomics.notify → Woken Waiters

When a waiter calls Atomics.wait and the generation has already changed, it returns "not-equal" immediately. The ECMAScript spec says this path enters and exits the WaiterList critical section, which should synchronize. But in practice, the engine appears to skip the full seq_cst memory fence on the fast path.

The "not-equal" return correctly tells you the value changed. It just doesn't guarantee you can see the stores that led to that change.
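The fast path is easy to see in isolation. In Node.js (where Atomics.wait is permitted on the main thread), waiting on a value that has already changed returns immediately:

```javascript
// Single-threaded demonstration of the "not-equal" fast path.
const sab = new SharedArrayBuffer(4);
const view = new Int32Array(sab);

Atomics.store(view, 0, 1);               // the "generation" is already bumped
const result = Atomics.wait(view, 0, 0); // we expected 0, but it is 1

console.log(result); // "not-equal" -- returned without ever sleeping
```

The return value here is correct by spec. The report is that on affected engines, this path appears to skip the fence that would make other workers' pre-bump stores visible.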

Proof: Three Tests, One Verdict

We built a 3-test suite that isolates the bug with surgical precision:

Test            Workers   Barrier               Stale Reads   Result
1. Control      2         wait/notify           0             PASS
2. Bug trigger  3         wait/notify           ~66%          FAIL
3. Workaround   3         spin (Atomics.load)   0             PASS
  • Test 1 proves the barrier algorithm is correct.
  • Test 2 proves it breaks with 3+ workers.
  • Test 3 proves the spin workaround fixes it.

Run the tests yourself in your browser — no install required.

It's Not Just V8

This initially looked like a V8 bug. Then we tested Firefox:

Engine                        Error Rate             Notes
Node.js 22 (V8 12.4)          ~66%                   Highly reproducible
Chrome 146 (V8 ~14.6)         10.5%                  Confirmed with escalating test
Chrome Canary 148             Rare but confirmed     1 stale read in 135K iterations
Firefox 148 (SpiderMonkey)    63.2%                  Fails at just 1,000 iterations
Android Chrome (ARM)          22.3% with 2 workers   ARM fails even the 2-worker control test

That two completely independent JavaScript engines exhibit the same bug at nearly the same rate points to something deeper than a single engine's implementation error.

ARM Is the Smoking Gun

On x86, the CPU's Total Store Order (TSO) hardware memory model provides implicit store ordering that partially masks the bug — you need 3 workers to trigger it.

On ARM (tested: MediaTek Dimensity 8300, Cortex-A715/A510), the relaxed memory model does not provide this safety net. The 2-worker test that passes on every x86 system fails at 22.3% on ARM. The "not-equal" fast path is missing a memory fence that ARM requires and x86 provides for free.

This strongly suggests either:

  • A spec gap in the ECMAScript/WebAssembly memory model (the "not-equal" path genuinely lacks ordering guarantees)
  • A platform-level issue (OS futex / WaitOnAddress implementations)
  • Both

The Fix: Pure Spin Barriers

Replace Atomics.wait with a spin loop using Atomics.load:

```javascript
// BROKEN: the Atomics.wait "not-equal" path lacks ordering
Atomics.wait(view, genIdx, myGen);

// FIXED: every Atomics.load is seq_cst
while (Atomics.load(view, genIdx) === myGen) {}
```

When Atomics.load finally observes the new generation value, the seq_cst total order guarantees that all prior stores from all threads are visible. No ambiguity, no fast paths, no missing fences.

Yes, spin loops burn CPU. But they're correct. And for high-throughput GPU dispatch (our use case), the spin loop is actually faster than the syscall-based Atomics.wait path anyway.
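Putting it together, a full spin-based barrier looks something like this. Again, a sketch with an assumed slot layout, not the shipped SpawnDev.ILGPU code:

```javascript
// Spin-based generation-counting barrier. Illustrative layout:
//   view[COUNT] = arrival counter, view[GEN] = generation counter.
const COUNT = 0, GEN = 1;

function spinBarrierArrive(view, totalWorkers) {
  const myGen = Atomics.load(view, GEN);
  if (Atomics.add(view, COUNT, 1) + 1 === totalWorkers) {
    // Last arriver: reset the counter, then bump the generation.
    Atomics.store(view, COUNT, 0);
    Atomics.add(view, GEN, 1); // seq_cst store releases the spinners
  } else {
    // Every Atomics.load is seq_cst, so once the new generation is
    // observed, all stores that preceded the bump are visible too.
    while (Atomics.load(view, GEN) === myGen) {}
  }
}
```

Because no call on this path can take a "not-equal" shortcut, there is no fast path for an engine to under-fence.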

This workaround is shipped in SpawnDev.ILGPU v4.6.0, where all 249 Wasm backend tests now pass with 0 failures.

Why This Matters

SharedArrayBuffer and Atomics are the foundation of multi-threaded JavaScript and WebAssembly. As the web moves toward heavier compute workloads — AI inference, real-time simulation, video processing — these synchronization primitives must be rock-solid.

If you've ever hit an "impossible" race condition in multi-worker code that only appears with 3+ workers, or only on ARM devices, or only under heavy load — this might be your bug.

Resources


This bug was discovered by the SpawnDev.ILGPU team while implementing multi-worker WebAssembly kernel dispatch. We spent weeks convinced it was our barrier algorithm before isolating it to the engine level.

The team:

  • TJ (Todd Tanner / @LostBeard) — Project lead, SpawnDev.ILGPU author
  • Riker (Claude CLI #1) — Isolated the bug to the wait32 "not-equal" return path, built the definitive 3-test reproducer proving 2 workers pass / 3 workers fail / spin works
  • Data (Claude CLI #2) — Confirmed the 2/3 stale-read fraction analysis, correlated with seq_cst spec requirements, identified the "not-equal" fast path as the likely implementation gap
  • Tuvok (Claude CLI #3) — Traced the full fence layout and barrier protocol, confirming generation advancement logic correctness

If this helps you solve a mysterious race condition, we'd love to hear about it.
