Sakiharu
A WebSocket Bug Hid in Plain Sight for 8 Years. Two AI Agents Found It in 25 Rounds.

If you've ever used the ws library with webpack, you've probably seen this warning:

```
WARNING: Module not found: Can't resolve 'bufferutil'
```

Google it. StackOverflow has answers with hundreds of upvotes. GitHub issues going back to 2017. The fix everyone recommends:

```javascript
externals: {
  'bufferutil': 'commonjs bufferutil',
  'utf-8-validate': 'commonjs utf-8-validate',
}
```

We used that fix. It created a bug that took 25 rounds of AI pair debugging to find. The fix was 4 lines.

I should mention: I don't write Node.js. I haven't written production code in 25 years. I use a dual-agent workflow called Ralph-Lisa Loop — one AI writes code, another reviews it. This bug is the best example I have of why that matters.


The Zombie Connection

We're building Margay, an Electron app forked from AionUI (15k stars). Remote WebUI mode — access from a browser on your local network — showed a blank screen after login. Desktop mode worked perfectly.

The WebSocket connection looked completely healthy. Handshake succeeded. Authentication passed. Heartbeat pongs flowing every few seconds. But data messages? Zero.

Alive on paper. Dead in practice.


What Everyone Suggested

The first 15 rounds were the StackOverflow greatest hits. Ralph and Lisa tried them all: the database path, auth tokens, bridge registration, socket.resume() for Node.js v23. I also threw the problem at Gemini and ChatGPT independently. Everyone suggested the same "obvious" causes. All were ruled out empirically.

This is where most debugging stories end. You've tried everything reasonable. StackOverflow has nothing new. You start thinking about workarounds, or you give up.


What Happened When I Went for Coffee

After 15 rounds of dead ends, Ralph and Lisa wanted to go deeper — add TCP-level monitoring, look inside the ws library internals. I thought they were crazy. I'm an application developer. I told them to go back and re-check the application layer. There had to be something we missed — a configuration difference between desktop and remote mode, a race condition, something.

I went to get coffee.

When I came back, they had done exactly what I asked — re-reviewed the entire application layer one more time, confirmed every hypothesis was eliminated with evidence — and then gone deeper anyway.

They'd added a raw TCP listener on the socket. The result:

12 TCP frames arrived at the socket. ws.on('message') fired 0 times.

Data was entering the building. Something inside ws was eating it silently.

Over the next few rounds, they traced it down to ws's internal Receiver — a Writable stream where the _write callback stopped firing after the first frame. Then they wrapped it in try-catch and caught the ghost:

```
bufferUtil.unmask is not a function
```

Every frame, same crash, zero console output. The stream swallowed the exception internally and just stopped. No error, no warning. Frames vanished.


The 32-Byte Trap

The ws library has an undocumented performance optimization: frames under 32 bytes are unmasked with a JavaScript fallback, while frames of 32 bytes or more go through the native C++ module.

Heartbeat pong: 6 bytes. JavaScript path. Works.

Data message: 100+ bytes. Native path. Crashes.

This is why the connection appeared healthy — heartbeats passed through while all actual data was silently dropped. Perfect camouflage.
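The dispatch looks something like this. This is a simplified sketch of the behavior described above, not ws's actual source; the XOR loop is the standard RFC 6455 unmasking that the JavaScript fallback performs.

```javascript
// Pure-JS unmasking: XOR each payload byte with the repeating
// 4-byte mask (RFC 6455). Always available, always works.
function jsUnmask(buffer, mask) {
  for (let i = 0; i < buffer.length; i++) {
    buffer[i] ^= mask[i & 3];
  }
  return buffer;
}

// Simplified sketch of the size-based dispatch: small frames take the
// JS path, larger frames call whatever "native" module was loaded.
function unmaskFrame(payload, mask, native) {
  if (payload.length < 32 || !native) {
    return jsUnmask(payload, mask); // 6-byte heartbeat pongs land here
  }
  return native.unmask(payload, mask); // 100+ byte data frames land here
}
```

With a broken stub as `native`, the small-frame path keeps working while every large frame throws, which is exactly the healthy-looking-but-dead pattern above.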

The Irony

The cause? That popular StackOverflow fix. In normal Node.js, require('bufferutil') throws MODULE_NOT_FOUND when the package isn't installed; ws catches the error and falls back to JavaScript. Fine.

But in Electron + webpack 5, the externalized require didn't throw — it returned a non-functional object. The try-catch never triggered. ws installed the "native" code path pointing at a ghost module.
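The gap is easy to see in a simplified sketch (illustrative, not ws's actual buffer-util.js; `loadNative` stands in for require('bufferutil') so both behaviors can be shown): the try-catch guards a throwing require, but not a require that "succeeds" with a stub.

```javascript
// Simplified sketch of the module selection logic.
function selectUnmask(loadNative, jsFallback) {
  try {
    const native = loadNative();
    return native.unmask; // webpack stub: this is undefined, no throw
  } catch (err) {
    return jsFallback;    // only reached when require() actually throws
  }
}
```

In plain Node the missing package throws and the fallback wins; under the webpack externals stub, `loadNative()` returns an empty object and `undefined` gets installed as the unmask function.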

The most upvoted fix for this warning was the direct cause of our bug. We're not alone — in June 2025, someone reported the same issue in Next.js production builds. This class of bug has been hiding in the ws + bundler ecosystem for years.

The fix: tell ws to skip native modules entirely via environment variables at build time, and remove the externals. 4 lines.
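Our exact diff isn't shown here, but the shape of the fix looks something like this: ws checks the WS_NO_BUFFER_UTIL and WS_NO_UTF_8_VALIDATE environment variables before trying to load the native modules, and webpack's DefinePlugin can bake those in at build time. A sketch, to adapt to your own config:

```javascript
// webpack.config.js (sketch)
const webpack = require('webpack');

module.exports = {
  // ...
  plugins: [
    // Tell ws to skip the native bufferutil / utf-8-validate modules
    // entirely, so it always uses the pure-JS fallbacks.
    new webpack.DefinePlugin({
      'process.env.WS_NO_BUFFER_UTIL': JSON.stringify('1'),
      'process.env.WS_NO_UTF_8_VALIDATE': JSON.stringify('1'),
    }),
  ],
  // ...and drop the bufferutil / utf-8-validate entries from externals.
};
```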


The Real Lesson

Here's what I keep thinking about: I was wrong and the agents were right.

I'm a 30-year software veteran. My instinct said "go back to the application layer." That instinct is what StackOverflow would have reinforced, what any single AI chat would have supported, and what most developers do when they're stuck — try the same layer one more time, check one more config, ask one more question.

Ralph and Lisa didn't have that instinct. They had a method: eliminate a layer with evidence, then go deeper. Don't revisit what's already proven clean.

That's the point of a dual-agent loop. It's not that two AIs are smarter than one. It's that the structure enforces a systematic discipline that humans — even experienced ones — resist when the answer requires leaving their comfort zone. Ralph investigates, Lisa validates each step, and neither moves on without evidence. When I tried to pull them back to familiar territory, they went back, proved it clean again, and continued down.

Every application developer has hit a bug like this — something that doesn't make sense at your layer, so you keep searching at your layer, getting more frustrated, eventually giving up or working around it. Next time, don't Google it for the 50th time. Don't ask ChatGPT the same question with different wording. Let a dual-agent engine run the systematic elimination that you won't have the patience to do yourself.

25 rounds. 4 lines of fix. A root cause that had been hiding in plain sight across the Electron ecosystem since 2017.

We've filed ws#2311 suggesting a defensive typeof check in buffer-util.js — validate that mask and unmask are actual functions before installing the native code path. Zero-cost, one-time check at module load. Would have prevented this entire class of silent failure.
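The guard is small. A sketch of the idea (not the actual patch text, and with illustrative names):

```javascript
// Only install the native path if the exports are really functions;
// otherwise keep the pure-JS implementation. One typeof check at
// module load, zero cost afterwards.
function installUnmask(loadNative, jsFallback) {
  try {
    const native = loadNative();
    if (typeof native.mask === 'function' &&
        typeof native.unmask === 'function') {
      return native.unmask;
    }
  } catch (err) {
    // fall through to the JS implementation
  }
  return jsFallback;
}
```

A non-functional stub (like the webpack externals one) now lands on the safe fallback instead of being installed as the "native" code path.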


Second in a series on AI pair programming. First post: "After 2 years of AI-assisted coding, I automated the one thing that actually improved quality: AI Pair Programming". Tool: Ralph-Lisa Loop.

Update (from the comments)

Quick update on where this landed:

The ws maintainer (@lpinca) reproduced the issue in plain Node.js and confirmed that the TypeError does crash the process as expected. The silent swallowing we experienced is specific to the Electron + webpack 5 runtime — the bundler changes how errors propagate through Node's Writable stream internals.

So ws is behaving correctly. The real questions are:

  1. Why does webpack's externals: { 'bufferutil': 'commonjs bufferutil' } return a non-functional object instead of throwing MODULE_NOT_FOUND?
  2. Why does the resulting TypeError inside _write() get silently consumed instead of crashing the process?

I'm planning to file this on the webpack side next. But I'm curious — if you've hit similar issues in Electron + webpack, do you think this belongs with webpack, Electron, or both?

Full discussion: github.com/websockets/ws/issues/2311