<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sakiharu</title>
    <description>The latest articles on DEV Community by Sakiharu (@yw1975).</description>
    <link>https://dev.to/yw1975</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3775304%2F09a1a48b-495d-4e01-b651-52df5020d812.jpeg</url>
      <title>DEV Community: Sakiharu</title>
      <link>https://dev.to/yw1975</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yw1975"/>
    <language>en</language>
    <item>
      <title>A WebSocket Bug Hid in Plain Sight for 8 Years. Two AI Agents Found It in 25 Rounds.</title>
      <dc:creator>Sakiharu</dc:creator>
      <pubDate>Fri, 20 Feb 2026 13:09:36 +0000</pubDate>
      <link>https://dev.to/yw1975/a-websocket-bug-hid-in-plain-sight-for-8-years-two-ai-agents-found-it-in-25-rounds-4hei</link>
      <guid>https://dev.to/yw1975/a-websocket-bug-hid-in-plain-sight-for-8-years-two-ai-agents-found-it-in-25-rounds-4hei</guid>
      <description>&lt;p&gt;If you've ever used the &lt;code&gt;ws&lt;/code&gt; library with webpack, you've probably seen this warning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: Module not found: Can't resolve 'bufferutil'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Google it. StackOverflow has answers with hundreds of upvotes. GitHub issues going back to 2017. The fix everyone recommends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;externals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bufferutil&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;commonjs bufferutil&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8-validate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;commonjs utf-8-validate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We used that fix. It created a bug that took 25 rounds of AI pair debugging to find. The eventual fix was 4 lines.&lt;/p&gt;

&lt;p&gt;I should mention: &lt;strong&gt;I don't write Node.js.&lt;/strong&gt; I haven't written production code in 25 years. I use a dual-agent workflow called &lt;a href="https://github.com/YW1975/Ralph-Lisa-Loop" rel="noopener noreferrer"&gt;Ralph-Lisa Loop&lt;/a&gt; — one AI writes code, another reviews it. This bug is the best example I have of why that matters.&lt;/p&gt;




&lt;h2&gt;The Zombie Connection&lt;/h2&gt;

&lt;p&gt;We're building Margay, an Electron app forked from AionUI (15k stars). Remote WebUI mode — access from a browser on your local network — showed a blank screen after login. Desktop mode worked perfectly.&lt;/p&gt;

&lt;p&gt;The WebSocket connection looked completely healthy. Handshake succeeded. Authentication passed. Heartbeat pongs flowing every few seconds. But data messages? Zero.&lt;/p&gt;

&lt;p&gt;Alive on paper. Dead in practice.&lt;/p&gt;




&lt;h2&gt;What Everyone Suggested&lt;/h2&gt;

&lt;p&gt;The first 15 rounds were the StackOverflow greatest hits. Ralph and Lisa tried them all — database path, auth tokens, bridge registration, &lt;code&gt;socket.resume()&lt;/code&gt; for Node.js v23. I also threw the problem at Gemini and ChatGPT independently. Everyone suggested the same "obvious" causes. All were ruled out empirically.&lt;/p&gt;

&lt;p&gt;This is where most debugging stories end. You've tried everything reasonable. StackOverflow has nothing new. You start thinking about workarounds, or you give up.&lt;/p&gt;




&lt;h2&gt;What Happened When I Went for Coffee&lt;/h2&gt;

&lt;p&gt;After 15 rounds of dead ends, Ralph and Lisa wanted to go deeper — add TCP-level monitoring, look inside the &lt;code&gt;ws&lt;/code&gt; library internals. I thought they were crazy. &lt;strong&gt;I'm an application developer. I told them to go back and re-check the application layer.&lt;/strong&gt; There had to be something we missed — a configuration difference between desktop and remote mode, a race condition, something.&lt;/p&gt;

&lt;p&gt;I went to get coffee.&lt;/p&gt;

&lt;p&gt;When I came back, they had done exactly what I asked — re-reviewed the entire application layer one more time, confirmed every hypothesis was eliminated with evidence — and then gone deeper anyway.&lt;/p&gt;

&lt;p&gt;They'd added a raw TCP listener on the socket. The result:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12 TCP frames arrived at the socket. &lt;code&gt;ws.on('message')&lt;/code&gt; fired 0 times.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data was entering the building. Something inside &lt;code&gt;ws&lt;/code&gt; was eating it silently.&lt;/p&gt;

&lt;p&gt;Over the next few rounds, they traced it to &lt;code&gt;ws&lt;/code&gt;'s internal Receiver — a Writable stream whose &lt;code&gt;_write&lt;/code&gt; callback stopped firing after the first frame. Then they wrapped that write path in a try-catch and caught the ghost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bufferUtil.unmask is not a function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every frame, same crash, &lt;strong&gt;zero console output&lt;/strong&gt;. The stream swallowed the exception internally and just stopped. No error, no warning. Frames vanished.&lt;/p&gt;




&lt;h2&gt;The 32-Byte Trap&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ws&lt;/code&gt; library has an undocumented performance optimization: frames under 32 bytes are unmasked by a JavaScript fallback, while larger frames go through a native C++ module.&lt;/p&gt;

&lt;p&gt;Heartbeat pong: 6 bytes. JavaScript path. &lt;strong&gt;Works.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Data message: 100+ bytes. Native path. &lt;strong&gt;Crashes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why the connection appeared healthy — heartbeats passed through while all actual data was silently dropped. Perfect camouflage.&lt;/p&gt;

&lt;h2&gt;The Irony&lt;/h2&gt;

&lt;p&gt;The cause? That popular StackOverflow fix. In normal Node.js, &lt;code&gt;require('bufferutil')&lt;/code&gt; throws MODULE_NOT_FOUND when the package isn't installed; &lt;code&gt;ws&lt;/code&gt; catches it and falls back to JavaScript. Fine.&lt;/p&gt;

&lt;p&gt;But in Electron + webpack 5, the externalized &lt;code&gt;require&lt;/code&gt; didn't throw — it returned a non-functional object. The try-catch never triggered. &lt;code&gt;ws&lt;/code&gt; installed the "native" code path pointing at a ghost module.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most upvoted fix for this warning was the direct cause of our bug.&lt;/strong&gt; We're not alone — in &lt;a href="https://github.com/websockets/ws/issues/2288" rel="noopener noreferrer"&gt;June 2025&lt;/a&gt;, someone reported the same issue in Next.js production builds. This class of bug has been hiding in the ws + bundler ecosystem for years.&lt;/p&gt;

&lt;p&gt;The fix: tell &lt;code&gt;ws&lt;/code&gt; to skip native modules entirely via environment variables at build time, and remove the externals. 4 lines.&lt;/p&gt;
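&lt;p&gt;The post doesn't show the exact diff. One plausible shape of it, using &lt;code&gt;ws&lt;/code&gt;'s documented &lt;code&gt;WS_NO_BUFFER_UTIL&lt;/code&gt; / &lt;code&gt;WS_NO_UTF_8_VALIDATE&lt;/code&gt; opt-outs baked in with webpack's &lt;code&gt;DefinePlugin&lt;/code&gt;:&lt;/p&gt;

```javascript
// webpack.config.js (sketch; the article's actual 4-line change isn't shown).
// ws checks these env vars before requiring the native addons, so defining
// them at build time forces the pure-JS paths and makes the externals moot.
const { DefinePlugin } = require('webpack');

module.exports = {
  // ...the existing config, minus the bufferutil / utf-8-validate externals
  plugins: [
    new DefinePlugin({
      'process.env.WS_NO_BUFFER_UTIL': JSON.stringify('1'),
      'process.env.WS_NO_UTF_8_VALIDATE': JSON.stringify('1'),
    }),
  ],
};
```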




&lt;h2&gt;The Real Lesson&lt;/h2&gt;

&lt;p&gt;Here's what I keep thinking about: &lt;strong&gt;I was wrong and the agents were right.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm a 30-year software veteran. My instinct said "go back to the application layer." That instinct is what StackOverflow would have reinforced, what any single AI chat would have supported, and what most developers do when they're stuck — try the same layer one more time, check one more config, ask one more question.&lt;/p&gt;

&lt;p&gt;Ralph and Lisa didn't have that instinct. They had a method: &lt;strong&gt;eliminate a layer with evidence, then go deeper. Don't revisit what's already proven clean.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the point of a dual-agent loop. It's not that two AIs are smarter than one. It's that the structure enforces a systematic discipline that humans — even experienced ones — resist when the answer requires leaving their comfort zone. Ralph investigates, Lisa validates each step, and neither moves on without evidence. When I tried to pull them back to familiar territory, they went back, proved it clean &lt;em&gt;again&lt;/em&gt;, and continued down.&lt;/p&gt;

&lt;p&gt;Every application developer has hit a bug like this — something that doesn't make sense at your layer, so you keep searching at your layer, getting more frustrated, eventually giving up or working around it. &lt;strong&gt;Next time, don't Google it for the 50th time. Don't ask ChatGPT the same question with different wording. Let a dual-agent engine run the systematic elimination that you won't have the patience to do yourself.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;25 rounds. A 4-line fix. A root cause that had been hiding in plain sight across the Electron ecosystem since 2017.&lt;/p&gt;

&lt;p&gt;We've filed &lt;a href="https://github.com/websockets/ws/issues/2311" rel="noopener noreferrer"&gt;ws#2311&lt;/a&gt; suggesting a defensive typeof check in &lt;code&gt;buffer-util.js&lt;/code&gt; — validate that &lt;code&gt;mask&lt;/code&gt; and &lt;code&gt;unmask&lt;/code&gt; are actual functions before installing the native code path. Zero-cost, one-time check at module load. Would have prevented this entire class of silent failure.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Second in a series on AI pair programming. First post: &lt;a href="https://dev.to/yw1975/after-2-years-of-ai-assisted-coding-i-automated-the-one-thing-that-actually-improved-quality-ai-pair-programming"&gt;After 2 years of AI-assisted coding, I automated the one thing that actually improved quality: AI Pair Programming&lt;/a&gt;. Tool: &lt;a href="https://github.com/YW1975/Ralph-Lisa-Loop" rel="noopener noreferrer"&gt;Ralph-Lisa Loop&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>After 2 years of AI-assisted coding, I automated the one thing that actually improved quality: AI Pair Programming</title>
      <dc:creator>Sakiharu</dc:creator>
      <pubDate>Mon, 16 Feb 2026 09:14:05 +0000</pubDate>
      <link>https://dev.to/yw1975/after-2-years-of-ai-assisted-coding-i-automated-the-one-thing-that-actually-improved-quality-ai-2njh</link>
      <guid>https://dev.to/yw1975/after-2-years-of-ai-assisted-coding-i-automated-the-one-thing-that-actually-improved-quality-ai-2njh</guid>
      <description>&lt;p&gt;After nearly 2 years of AI-assisted development — from ChatGPT 3.5 to Claude Code — I kept hitting the same problem: every model makes mistakes it can't catch. Inspired by pair programming and the Ralph Loop, I built a dual-agent workflow where one agent writes and another reviews. Last week, a PR written entirely by the two agents got merged into a 15k-star open source Electron project after 3 rounds of maintainer feedback. I don't write TypeScript.&lt;/p&gt;

&lt;h2&gt;The problems I kept finding&lt;/h2&gt;

&lt;p&gt;I've been doing AI-assisted programming for almost 2 years now. Started with ChatGPT 3.5 generating snippets, moved through Claude, Cursor, TRAE, and eventually fell in love with Claude Code.&lt;br&gt;
From the very beginning, I noticed every model and every agent has its own characteristic problems. Not random bugs — consistent patterns of failure:&lt;/p&gt;

&lt;p&gt;Claude Code skips error handling when context gets long. It's brilliant at architecture but gets sloppy on defensive code in later turns.&lt;br&gt;
Codex over-engineers abstractions but catches edge cases Claude misses.&lt;br&gt;
Gemini struggles with complex multi-file changes.&lt;br&gt;
Cursor has context dependency issues — works great in small scope, gets confused across files.&lt;/p&gt;

&lt;p&gt;The severity varies, but the pattern is the same: a single agent can't reliably catch its own mistakes. It writes code AND judges whether that code is good — like grading your own exam.&lt;br&gt;
Every developer knows this problem has a name. It's called "why we do code review."&lt;/p&gt;

&lt;h2&gt;Pair programming, but with AIs&lt;/h2&gt;

&lt;p&gt;Pair programming was formalized by Kent Beck as part of Extreme Programming (XP) in the late 1990s — one of the most influential practices to come out of the agile movement. The core idea is simple: two developers at one workstation, one drives, one navigates. The navigator catches mistakes in real time, questions design decisions, and keeps the big picture in focus. Research has consistently shown it produces fewer defects and better designs, despite appearing to "waste" half your developers.&lt;br&gt;
The same principle applies to AI agents. If one agent writes and another watches, you catch more bugs.&lt;br&gt;
So that's what I started doing — manually. Way back when I was using Claude (the chat version, before Claude Code), I would take Claude's output, paste it into ChatGPT, ask ChatGPT to review it, then bring the feedback back. Primitive, but it worked better than trusting either one alone.&lt;br&gt;
When Claude Code and Codex CLI came along, the workflow got more serious. Claude Code writes code, I copy the diffs to Codex, Codex reviews and flags issues, I bring the feedback back to Claude Code. Rinse and repeat.&lt;br&gt;
This manual cross-agent coordination worked. But it was slow, repetitive, and cognitively draining. The worst part: it was easy to skip when tired. You tell yourself "this change looks fine, I'll skip the review step" — and that's always the change that bites you.&lt;/p&gt;

&lt;h2&gt;Automating the loop&lt;/h2&gt;

&lt;p&gt;Then I discovered the Ralph Loop (by Geoffrey Huntley) — the concept of wrapping a coding agent in an external loop so it keeps iterating. Powerful idea, and it gave me the push to automate my dual-agent workflow.&lt;br&gt;
But the Ralph Loop team has been transparent about some limitations. It works great for greenfield projects with clear completion criteria. It's harder with legacy codebases, complex refactoring, or multi-step tasks where you need checkpoints along the way.&lt;br&gt;
That matched my experience. I wasn't building new projects from scratch — I was forking and deeply modifying an existing large Electron app. I needed something that could handle ambiguity, maintainer feedback, and incremental consensus.&lt;br&gt;
So I built a structured loop: one agent (Claude Code) writes, another (Codex) reviews, they take turns, and neither moves forward until both agree. I sit in the middle as tech lead — setting scope, making architecture calls, breaking ties.&lt;br&gt;
The efficiency jumped immediately. Not because the agents got smarter, but because the review discipline became automatic instead of depending on my willpower at 2am.&lt;/p&gt;

&lt;h2&gt;The real test: my first open source PR&lt;/h2&gt;

&lt;p&gt;I'd been using this workflow to fork AionUI (~15k ⭐ Electron + React app) into an internal AI assistant for my company. 30 commits, zero manual code. Full rebrand, core engine rewrite, database migration, CI/CD rebuild — the whole thing done through the dual-agent loop.&lt;br&gt;
During that work, the agents found a real upstream bug: orphan CLI processes that linger when you kill a conversation using ACP agents. I submitted a PR back to AionUI.&lt;br&gt;
The maintainer reviewed it and came back with 3 issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Double super.kill()&lt;/strong&gt; race condition — needed an idempotent guard&lt;br&gt;
&lt;strong&gt;Swallowed errors&lt;/strong&gt; — .catch(() =&amp;gt; {}) should log warnings&lt;br&gt;
&lt;strong&gt;treeKill discrepancy&lt;/strong&gt; — the PR description didn't match upstream's actual implementation&lt;/p&gt;

&lt;p&gt;I pointed the two agents at the maintainer's feedback and let them work. The author agent analyzed the issues, wrote the fixes, ran tests (133/133 passing). The reviewer agent reviewed the diffs, verified correctness, confirmed types were clean. A few rounds of back-and-forth. I watched but didn't write code.&lt;br&gt;
Merged. "LGTM — all three review feedback items properly addressed."&lt;br&gt;
This was my first ever PR submitted and merged into someone else's project. I'm a 30-year software veteran — but I spent the last 25 years on product and business, not writing code. I don't write TypeScript. AI tools pulled me back into development, and the dual-agent loop made it possible for me to contribute real fixes to a real project.&lt;/p&gt;

&lt;h2&gt;Independent convergence&lt;/h2&gt;

&lt;p&gt;After I posted about this, another developer (Hwee-Boon Yar, indie dev, also 30 years experience) shared a similar approach — a skill that shells out to a second agent for review, loops until the reviewer has nothing left to flag. Lighter than mine, works within a single session. Different trade-off, same core insight.&lt;br&gt;
Multiple people are independently arriving at this: one agent is not enough. You need a second pair of eyes.&lt;/p&gt;

&lt;h2&gt;Limitations&lt;/h2&gt;

&lt;p&gt;This is not a magic solution. Here's what doesn't work:&lt;br&gt;
&lt;strong&gt;Agent crashes have no auto-recovery.&lt;/strong&gt; When an agent dies mid-session, the loop stops. You restart manually. No self-healing yet.&lt;br&gt;
&lt;strong&gt;Wasted rounds.&lt;/strong&gt; Sometimes the agents ping-pong — a fix introduces a new issue, review catches it, the next fix introduces another issue. You have to step in and reset scope.&lt;br&gt;
&lt;strong&gt;Context window — but with a twist.&lt;/strong&gt; Quality degrades in long sessions, and when an agent compresses its context, information gets lost. But here's where the dual-agent setup actually helps: when one agent's context is compressed and loses details, the other agent still remembers. They don't share the same context window, so they don't lose the same information at the same time. This is an unexpected architectural advantage. I'm thinking about building shared memory management across agents in future versions — so they can explicitly share what each has forgotten.&lt;br&gt;
&lt;strong&gt;Two AIs can happily agree on a bad design.&lt;/strong&gt; Without domain judgment from a human, this is just two agents rubber-stamping each other. The human arbiter is not optional.&lt;br&gt;
This is not autonomous development. It is structured AI-assisted development. The distinction matters.&lt;/p&gt;

&lt;h2&gt;The deeper question&lt;/h2&gt;

&lt;p&gt;The AI coding conversation is too focused on generation and not enough on review. Everyone's benchmarking how fast and how much code models can produce. Nobody's asking: who checks it?&lt;br&gt;
If AI code needs structured critique — the same way human code has always needed code review — then the question is: how do you build review discipline into AI workflows?&lt;/p&gt;

&lt;h2&gt;Just shipped v0.3.0&lt;/h2&gt;

&lt;p&gt;I've incorporated what I learned from the AionUI PR process and released a new version. Key stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;npm i -g ralph-lisa-loop&lt;/strong&gt;&lt;br&gt;
Works with Claude Code (Ralph) + Codex CLI (Lisa)&lt;br&gt;
Turn control, tag system, consensus protocol, policy checks&lt;br&gt;
Auto mode via tmux (experimental)&lt;br&gt;
Agent-agnostic in principle — any two CLI agents can fill the roles&lt;/p&gt;

&lt;p&gt;Early stage, but I'm using it daily for real work, not demos.&lt;br&gt;
Repo: &lt;a href="https://github.com/YW1975/Ralph-Lisa-Loop" rel="noopener noreferrer"&gt;github.com/YW1975/Ralph-Lisa-Loop&lt;/a&gt;&lt;br&gt;
If you've been doing AI coding and hitting that frustrating "almost right, but not quite" problem — you're not alone. This might help, or at least give you ideas for your own approach.&lt;br&gt;
Happy to discuss. The failure modes are more interesting than the successes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>discuss</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
