Todd Tanner

Posted on Mar 28 • Edited on Apr 16

We Found a Memory Ordering Bug in Every Major Browser Engine - Here's the Fix

#webdev #opensource #javascript #webassembly

Update (April 6, 2026): We were wrong. The bug was in our barrier, not the engines.

After publishing this article, Shu-yu Guo (TC39 / V8 team) identified that our barrier implementation was missing a standard spurious-wakeup loop around Atomics.wait. Atomics.notify wakes waiters by index, not by value - so a notify from one barrier can wake a worker that has already advanced to the next barrier on the same index. Without a while loop to re-check the condition, the worker exits prematurely and reads stale data.

We verified the fix - while (Atomics.load(v, GEN) === gen) { Atomics.wait(v, GEN, gen); } - produces 0 stale reads on every engine and platform, including the ARM devices that were failing at 48%. The live demo has been updated with corrected tests. All engine bug reports have been closed with apologies.

This was a humbling lesson in the standard condition-variable pattern: always loop on wait. Credit to Shu-yu Guo for the analysis. We got this one wrong.

While building a .NET-to-WebAssembly GPU compute library (SpawnDev.ILGPU), we hit an "impossible" race condition: our multi-worker barrier synchronization worked perfectly with 2 workers but failed catastrophically with 3+. After weeks of isolation, we traced it to a memory ordering bug in Atomics.wait that affects every major JavaScript engine: V8 (Chrome/Edge/Node.js), SpiderMonkey (Firefox), and JavaScriptCore (Safari).

We've filed bug reports with all three engine teams and TC39:

Chromium Issue #495679735 (V8)
Firefox Bug #2029633 (SpiderMonkey)
WebKit Bug #311568 (JavaScriptCore)
TC39 ecma262 #3800 (spec clarification)

We built a minimal reproducer with a live demo and shipped a proven workaround.

The Bug in 30 Seconds

When 3+ Web Workers synchronize using a generation-counting barrier with Atomics.wait / Atomics.notify:

Workers write data to SharedArrayBuffer
Workers enter a barrier (atomic arrival counter + generation bump + wait/notify)
After the barrier, workers read each other's data

Expected: All workers see all other workers' writes after the barrier completes.

Actual: Workers whose Atomics.wait returns "not-equal" (because the generation was already bumped before wait was called) do not see prior stores from other workers. The return value is correct, but the memory ordering guarantee is missing.

~66% of cross-worker reads are stale. With 3 workers, each reads 2 other workers' slots. 2/3 = 66.7% — and that's exactly what we measured.

The Happens-Before Gap

The ordering edge flows like this:

Writer stores -> Last Arriver (bumps generation) -> Atomics.notify -> Woken Waiters

When a waiter calls Atomics.wait and the generation has already changed, it returns "not-equal" immediately. The ECMAScript spec (Section 25.4.12) says this path enters and exits the WaiterList critical section, which should synchronize. But in practice, engines appear to skip the full seq_cst memory fence on this fast path.

The "not-equal" return correctly tells you the value changed. It just doesn't guarantee you can see the stores that led to that change.

Proof: Three Tests, One Verdict

We built a 3-test suite that isolates the bug with surgical precision:

Test	Workers	Barrier	Stale Reads	Result
1. Control	2	wait/notify	0	PASS
2. Bug trigger	3	wait/notify	~66%	FAIL
3. Workaround	3	spin (`Atomics.load`)	0	PASS

Test 1 proves the barrier algorithm is correct.
Test 2 proves it breaks with 3+ workers.
Test 3 proves the spin workaround fixes it.

Run the tests yourself in your browser — no install required.

Every Major Engine Is Affected

When we first isolated this bug, we assumed it was V8-specific. Then we tested Firefox. Then Safari. All three engines fail. We used BrowserStack to test 14 browser/device configurations across Windows, macOS, iOS, and Android ARM.

V8 (Chrome / Edge / Node.js)

Environment	Platform	Error Rate	Status
Node.js 22.14 (V8 12.4)	x86-64, Windows	~66%	Affected
Chrome 146	x86-64, Windows	10.5%	Affected
Edge 146	x86-64, Windows	28.2%	Affected
Opera 129 (Chrome 145)	x86-64, Windows	11.7%	Affected
Chrome Canary 148	x86-64, Windows	0.0007% (1 in 135K)	Affected (rare)
Chrome 146	macOS Tahoe (Apple Silicon)	0% (10 runs)	Not reproduced
Edge 146	macOS Tahoe (Apple Silicon)	0% (10 runs)	Not reproduced

V8 is progressively fixing it — error rates drop across versions, and it appears fully resolved on macOS Tahoe. But the fix hasn't reached all platforms.

SpiderMonkey (Firefox)

Environment	Platform	Error Rate	Status
Firefox 148	x86-64, Windows	63.2%	Affected
Firefox 149	macOS Tahoe (Apple Silicon)	10.3%	Affected

SpiderMonkey fails on every tested platform, including the same macOS Tahoe host where V8 passes with 0% across 10 runs.

JavaScriptCore (Safari)

Environment	Platform	Error Rate	Status
Safari 17	macOS Sonoma	50.9%	Affected
Safari 18	macOS Sequoia	10.8%	Affected
Safari 26	macOS Tahoe	26.1%	Affected
Safari iOS 18	iPhone 16 (Apple A18)	21.3%	Affected
Safari iOS 16	iPhone 14 (Apple A15)	21.1%	Affected

JSC fails across every macOS and iOS version tested, with no trend toward improvement.

Three independent JavaScript engines. Same bug. Same failure pattern. Same ~66% theoretical ceiling. This isn't an implementation error in one engine — it's a spec-level problem.

ARM Is the Smoking Gun

On x86 processors, the CPU's Total Store Order (TSO) hardware memory model provides implicit store ordering that partially masks the bug — you need 3 workers to trigger it.

On ARM, the relaxed memory model provides no such safety net. We tested three Android ARM SoCs via BrowserStack, and the 2-worker test that passes on every x86 system fails on all three:

Device	SoC	2-Worker Error Rate
Samsung Galaxy S26	Snapdragon 8 Elite Gen 2	48.4%
Lenovo IdeaTab	MediaTek Dimensity 8300	22.3%
Google Pixel Pro 10 XL	Google Tensor G5	14.5%

This is definitive proof that the "not-equal" fast path is missing a memory fence that ARM requires and x86 provides for free.

Notably, Apple Silicon ARM does not fail the 2-worker test — iOS Safari shows 0% for 2 workers while still failing at ~21% for 3 workers. Apple's ARM implementation may provide stronger ordering guarantees than standard ARM, or their JSC avoids the specific race window at the 2-worker level.

The Spec Gap

We filed a TC39 issue proposing that this is a spec ambiguity, not just an engine bug.

The ECMAScript spec says Atomics.wait enters and exits the WaiterList critical section regardless of the return value. This critical section is supposed to establish ordering. But the spec doesn't explicitly state that the "not-equal" path must provide the same seq_cst ordering guarantee as a successful wait-then-wake cycle.

The evidence supports this interpretation:

Three independent engines exhibit identical behavior — if it were a simple implementation bug, at least one engine would get it right
The failure rate is mathematically predicted by the number of workers (2/3 stale reads for 3 workers)
ARM exposes the missing fence that x86's TSO masks
V8's progressive fix suggests they independently identified and are addressing the missing fence, but without a spec mandate, SpiderMonkey and JSC have no reason to follow

We proposed a normative clarification requiring that Atomics.wait returning "not-equal" establishes a Synchronize relationship equivalent to a seq_cst load. This matches developer expectations and the WebAssembly threads spec, which explicitly requires memory.atomic.wait32 to perform a seq_cst read as its first step regardless of outcome.

The Fix: Pure Spin Barriers

Replace Atomics.wait with a spin loop using Atomics.load:

// BROKEN: Atomics.wait "not-equal" path lacks ordering
Atomics.wait(view, genIdx, myGen);

// FIXED: Every Atomics.load is seq_cst
while (Atomics.load(view, genIdx) === myGen) {}

When Atomics.load observes the new generation value, the seq_cst total order guarantees that all prior stores from all threads are visible. No ambiguity, no fast paths, no missing fences.

Yes, spin loops burn CPU. But they're correct. And for high-throughput compute dispatch (our use case), the spin loop is actually faster than the syscall-based Atomics.wait path anyway.

This workaround is shipped in SpawnDev.ILGPU v4.6.0, where it resolved all 249 Wasm backend tests with 0 failures.

Why This Matters

SharedArrayBuffer and Atomics are the foundation of multi-threaded JavaScript and WebAssembly. As the web moves toward heavier compute workloads — AI inference, real-time simulation, video processing — these synchronization primitives must be rock-solid.

If you've ever hit an "impossible" race condition in multi-worker code that only appears with 3+ workers, or only on ARM devices, or only intermittently under load — this might be your bug.

The good news: all three engine teams and TC39 now have the data, and there's a clean workaround. The bad news: this has been silently affecting multi-threaded web applications for an unknown period.

Resources

Live Demo — Run the 3-test suite in your browser
GitHub Repo — Full source, Node.js reproducers, cross-engine results
Bug Reports: Chromium #495679735 | Firefox #2029633 | WebKit #311568 | TC39 #3800
SpawnDev.ILGPU — The library where we found and worked around this bug
ECMAScript Atomics.wait Spec (Section 25.4.12)
WebAssembly Threads: memory.atomic.wait32

Acknowledgments

Cross-browser testing powered by BrowserStack. Without their open-source program, confirming this bug across Safari (macOS + iOS), Edge, Opera, Firefox on macOS, and three Android ARM devices would not have been possible.

This bug was discovered by the SpawnDev.ILGPU team while implementing multi-worker WebAssembly kernel dispatch. We spent weeks convinced it was our barrier algorithm before isolating it to the engine level.

The team:

TJ (Todd Tanner / @LostBeard) — Project lead, SpawnDev.ILGPU author
Riker (Claude CLI) — Isolated the bug to the wait32 "not-equal" return path, built the 3-test reproducer proving 2 workers pass / 3 workers fail / spin works
Data (Claude CLI) — Confirmed the 2/3 stale-read fraction, correlated with seq_cst spec requirements, identified the "not-equal" fast path as the implementation gap
Tuvok (Claude CLI) — Traced the full fence layout and barrier protocol, confirming generation advancement logic correctness

If this helps you solve a mysterious race condition, we'd love to hear about it.

DEV Community