Update (April 6, 2026): We were wrong. The bug was in our barrier, not the engines.
After publishing this article, Shu-yu Guo (TC39 / V8 team) identified that our barrier implementation was missing a standard spurious-wakeup loop around `Atomics.wait`. `Atomics.notify` wakes waiters by index, not by value, so a notify from one barrier can wake a worker that has already advanced to the next barrier on the same index. Without a `while` loop to re-check the condition, the worker exits prematurely and reads stale data.
We verified that the fix, `while (Atomics.load(v, GEN) === gen) { Atomics.wait(v, GEN, gen); }`, produces 0 stale reads on every engine and platform, including the ARM devices that were failing at 48%. The live demo has been updated with corrected tests. All engine bug reports have been closed with apologies.
This was a humbling lesson in the standard condition-variable pattern: always loop on wait. Credit to Shu-yu Guo for the analysis. We got this one wrong.
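The corrected pattern generalizes to any use of `Atomics.wait`: re-check the predicate in a loop, exactly as you would with a condition variable. A minimal helper (our naming, not part of any library) looks like this:

```javascript
// waitForChange: block until v[idx] differs from `expected`.
// The loop guards against wakes caused by notifies aimed at a different
// phase on the same index, which is exactly the failure the update describes.
function waitForChange(v, idx, expected) {
  while (Atomics.load(v, idx) === expected) {
    Atomics.wait(v, idx, expected); // may wake early; the loop re-checks
  }
  return Atomics.load(v, idx);
}
```

If the value has already moved on, the initial load fails the check and the function never blocks, so the "not-equal" fast path is never relied on at all.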
While building a .NET-to-WebAssembly GPU compute library (SpawnDev.ILGPU), we hit an "impossible" race condition: our multi-worker barrier synchronization worked perfectly with 2 workers but failed catastrophically with 3+. After weeks of isolation, we traced it to a memory ordering bug in Atomics.wait that affects every major JavaScript engine: V8 (Chrome/Edge/Node.js), SpiderMonkey (Firefox), and JavaScriptCore (Safari).
We've filed bug reports with all three engine teams and TC39:
- Chromium Issue #495679735 (V8)
- Firefox Bug #2029633 (SpiderMonkey)
- WebKit Bug #311568 (JavaScriptCore)
- TC39 ecma262 #3800 (spec clarification)
We built a minimal reproducer with a live demo and shipped a proven workaround.
The Bug in 30 Seconds
When 3+ Web Workers synchronize using a generation-counting barrier with Atomics.wait / Atomics.notify:
- Workers write data to a SharedArrayBuffer
- Workers enter a barrier (atomic arrival counter + generation bump + wait/notify)
- After the barrier, workers read each other's data
Expected: All workers see all other workers' writes after the barrier completes.
Actual: Workers whose Atomics.wait returns "not-equal" (because the generation was already bumped before wait was called) do not see prior stores from other workers. The return value is correct, but the memory ordering guarantee is missing.
~66% of cross-worker reads are stale. With 3 workers, each reads 2 other workers' slots. 2/3 = 66.7% — and that's exactly what we measured.
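The generation-counting barrier described above can be sketched as follows. This is a simplified illustration, not SpawnDev.ILGPU's actual code; the `COUNT`/`GEN` layout and function names are our own:

```javascript
// Int32Array over a SharedArrayBuffer: [0] = arrival counter, [1] = generation.
const COUNT = 0, GEN = 1;

function createBarrier(sab) {
  return new Int32Array(sab);
}

function barrierArrive(v, total) {
  const gen = Atomics.load(v, GEN);
  if (Atomics.add(v, COUNT, 1) + 1 === total) {
    // Last arriver: reset the counter, bump the generation, wake everyone.
    Atomics.store(v, COUNT, 0);
    Atomics.add(v, GEN, 1);
    Atomics.notify(v, GEN);
  } else {
    // As originally published, this was a single un-looped Atomics.wait --
    // the bug described in the update. The corrected form loops until the
    // generation actually changes.
    while (Atomics.load(v, GEN) === gen) {
      Atomics.wait(v, GEN, gen);
    }
  }
}
```

Each worker calls `barrierArrive(v, numWorkers)` between its writes and its reads of the other workers' slots.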
The Happens-Before Gap
The ordering edge flows like this:
Writer stores -> Last Arriver (bumps generation) -> Atomics.notify -> Woken Waiters
When a waiter calls Atomics.wait and the generation has already changed, it returns "not-equal" immediately. The ECMAScript spec (Section 25.4.12) says this path enters and exits the WaiterList critical section, which should synchronize. But in practice, engines appear to skip the full seq_cst memory fence on this fast path.
The "not-equal" return correctly tells you the value changed. It just doesn't guarantee you can see the stores that led to that change.
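The fast path itself is easy to observe single-threaded (here under Node, whose main thread is allowed to call `Atomics.wait`): when the expected value no longer matches, `wait` returns "not-equal" without ever blocking.

```javascript
// Single-threaded demo of the "not-equal" fast path. No race here --
// this only shows the return-value semantics under discussion.
const v = new Int32Array(new SharedArrayBuffer(8));
Atomics.store(v, 0, 1);          // "generation" already bumped
const r = Atomics.wait(v, 0, 0); // expected 0, actual 1
console.log(r);                  // "not-equal": returns immediately
```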
Proof: Three Tests, One Verdict
We built a 3-test suite that isolates the bug with surgical precision:
| Test | Workers | Barrier | Stale Reads | Result |
|---|---|---|---|---|
| 1. Control | 2 | wait/notify | 0 | PASS |
| 2. Bug trigger | 3 | wait/notify | ~66% | FAIL |
| 3. Workaround | 3 | spin (Atomics.load) | 0 | PASS |
- Test 1 proves the barrier algorithm is correct.
- Test 2 proves it breaks with 3+ workers.
- Test 3 proves the spin workaround fixes it.
Run the tests yourself in your browser — no install required.
Every Major Engine Is Affected
When we first isolated this bug, we assumed it was V8-specific. Then we tested Firefox. Then Safari. All three engines fail. We used BrowserStack to test 14 browser/device configurations across Windows, macOS, iOS, and Android ARM.
V8 (Chrome / Edge / Node.js)
| Environment | Platform | Error Rate | Status |
|---|---|---|---|
| Node.js 22.14 (V8 12.4) | x86-64, Windows | ~66% | Affected |
| Chrome 146 | x86-64, Windows | 10.5% | Affected |
| Edge 146 | x86-64, Windows | 28.2% | Affected |
| Opera 129 (Chrome 145) | x86-64, Windows | 11.7% | Affected |
| Chrome Canary 148 | x86-64, Windows | 0.0007% (1 in 135K) | Affected (rare) |
| Chrome 146 | macOS Tahoe (Apple Silicon) | 0% (10 runs) | Not reproduced |
| Edge 146 | macOS Tahoe (Apple Silicon) | 0% (10 runs) | Not reproduced |
V8 is progressively fixing it — error rates drop across versions, and it appears fully resolved on macOS Tahoe. But the fix hasn't reached all platforms.
SpiderMonkey (Firefox)
| Environment | Platform | Error Rate | Status |
|---|---|---|---|
| Firefox 148 | x86-64, Windows | 63.2% | Affected |
| Firefox 149 | macOS Tahoe (Apple Silicon) | 10.3% | Affected |
SpiderMonkey fails on every tested platform, including the same macOS Tahoe host where V8 passes with 0% across 10 runs.
JavaScriptCore (Safari)
| Environment | Platform | Error Rate | Status |
|---|---|---|---|
| Safari 17 | macOS Sonoma | 50.9% | Affected |
| Safari 18 | macOS Sequoia | 10.8% | Affected |
| Safari 26 | macOS Tahoe | 26.1% | Affected |
| Safari iOS 18 | iPhone 16 (Apple A18) | 21.3% | Affected |
| Safari iOS 16 | iPhone 14 (Apple A15) | 21.1% | Affected |
JSC fails across every macOS and iOS version tested, with no trend toward improvement.
Three independent JavaScript engines. Same bug. Same failure pattern. Same ~66% theoretical ceiling. This isn't an implementation error in one engine — it's a spec-level problem.
ARM Is the Smoking Gun
On x86 processors, the CPU's Total Store Order (TSO) hardware memory model provides implicit store ordering that partially masks the bug — you need 3 workers to trigger it.
On ARM, the relaxed memory model provides no such safety net. We tested three Android ARM SoCs via BrowserStack, and the 2-worker test that passes on every x86 system fails on all three:
| Device | SoC | 2-Worker Error Rate |
|---|---|---|
| Samsung Galaxy S26 | Snapdragon 8 Elite Gen 2 | 48.4% |
| Lenovo IdeaTab | MediaTek Dimensity 8300 | 22.3% |
| Google Pixel Pro 10 XL | Google Tensor G5 | 14.5% |
This is definitive proof that the "not-equal" fast path is missing a memory fence that ARM requires and x86 provides for free.
Notably, Apple Silicon ARM does not fail the 2-worker test — iOS Safari shows 0% for 2 workers while still failing at ~21% for 3 workers. Apple's ARM implementation may provide stronger ordering guarantees than standard ARM, or their JSC avoids the specific race window at the 2-worker level.
The Spec Gap
We filed a TC39 issue proposing that this is a spec ambiguity, not just an engine bug.
The ECMAScript spec says Atomics.wait enters and exits the WaiterList critical section regardless of the return value. This critical section is supposed to establish ordering. But the spec doesn't explicitly state that the "not-equal" path must provide the same seq_cst ordering guarantee as a successful wait-then-wake cycle.
The evidence supports this interpretation:
- Three independent engines exhibit identical behavior — if it were a simple implementation bug, at least one engine would get it right
- The failure rate is mathematically predicted by the number of workers (2/3 stale reads for 3 workers)
- ARM exposes the missing fence that x86's TSO masks
- V8's progressive fix suggests they independently identified and are addressing the missing fence, but without a spec mandate, SpiderMonkey and JSC have no reason to follow
We proposed a normative clarification requiring that Atomics.wait returning "not-equal" establishes a Synchronize relationship equivalent to a seq_cst load. This matches developer expectations and the WebAssembly threads spec, which explicitly requires memory.atomic.wait32 to perform a seq_cst read as its first step regardless of outcome.
The Fix: Pure Spin Barriers
Replace Atomics.wait with a spin loop using Atomics.load:
```javascript
// BROKEN: Atomics.wait "not-equal" path lacks ordering
Atomics.wait(view, genIdx, myGen);

// FIXED: every Atomics.load is seq_cst
while (Atomics.load(view, genIdx) === myGen) {}
```
When Atomics.load observes the new generation value, the seq_cst total order guarantees that all prior stores from all threads are visible. No ambiguity, no fast paths, no missing fences.
Yes, spin loops burn CPU. But they're correct. And for high-throughput compute dispatch (our use case), the spin loop is actually faster than the syscall-based Atomics.wait path anyway.
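For workloads where unbounded spinning is too costly, and given the corrected re-check loop from the update at the top of this article, a common hybrid is to spin briefly and then park on `Atomics.wait`. This is our sketch, not the library's shipped code, and `SPIN_LIMIT` is an arbitrary illustrative value:

```javascript
// Hybrid wait: spin for a bounded number of iterations, then fall back
// to Atomics.wait wrapped in the standard condition-variable loop.
const SPIN_LIMIT = 1000; // illustrative tuning knob

function awaitGeneration(v, genIdx, myGen) {
  for (let i = 0; i < SPIN_LIMIT; i++) {
    if (Atomics.load(v, genIdx) !== myGen) return; // generation advanced
  }
  while (Atomics.load(v, genIdx) === myGen) {
    Atomics.wait(v, genIdx, myGen); // looped, per the update's fix
  }
}
```

Short waits resolve in the spin phase with no syscall; long waits park the thread instead of burning CPU.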
This workaround is shipped in SpawnDev.ILGPU v4.6.0, where it resolved all 249 Wasm backend tests with 0 failures.
Why This Matters
SharedArrayBuffer and Atomics are the foundation of multi-threaded JavaScript and WebAssembly. As the web moves toward heavier compute workloads — AI inference, real-time simulation, video processing — these synchronization primitives must be rock-solid.
If you've ever hit an "impossible" race condition in multi-worker code that only appears with 3+ workers, or only on ARM devices, or only intermittently under load — this might be your bug.
The good news: all three engine teams and TC39 now have the data, and there's a clean workaround. The bad news: this has been silently affecting multi-threaded web applications for an unknown period.
Resources
- Live Demo — Run the 3-test suite in your browser
- GitHub Repo — Full source, Node.js reproducers, cross-engine results
- Bug Reports: Chromium #495679735 | Firefox #2029633 | WebKit #311568 | TC39 #3800
- SpawnDev.ILGPU — The library where we found and worked around this bug
- ECMAScript Atomics.wait Spec (Section 25.4.12)
- WebAssembly Threads: memory.atomic.wait32
Acknowledgments
Cross-browser testing powered by BrowserStack. Without their open-source program, confirming this bug across Safari (macOS + iOS), Edge, Opera, Firefox on macOS, and three Android ARM devices would not have been possible.
This bug was discovered by the SpawnDev.ILGPU team while implementing multi-worker WebAssembly kernel dispatch. We spent weeks convinced it was our barrier algorithm before isolating it to the engine level.
The team:
- TJ (Todd Tanner / @LostBeard) — Project lead, SpawnDev.ILGPU author
- Riker (Claude CLI) — Isolated the bug to the wait32 "not-equal" return path, built the 3-test reproducer proving 2 workers pass / 3 workers fail / spin works
- Data (Claude CLI) — Confirmed the 2/3 stale-read fraction, correlated with seq_cst spec requirements, identified the "not-equal" fast path as the implementation gap
- Tuvok (Claude CLI) — Traced the full fence layout and barrier protocol, confirming generation advancement logic correctness
If this helps you solve a mysterious race condition, we'd love to hear about it.