Ivan Maksimov

Posted on Jun 22

How I tracked down a 36GB memory leak in a Claude Code memory server

#debugging #node #webassembly #sqlite

A debugging story about heap snapshots, native memory that --max-old-space-size can't touch, and a WebAssembly filesystem quietly hoarding files.

The setup

I run a small service that gives a team of Claude Code users one shared memory store. Mechanically it's a Node/Express proxy that wraps a stdio MCP server (ruflo) and exposes it over HTTP. You don't need the product to follow the bug — just one fact: a long-lived Node process serves memory operations, and underneath it uses sql.js (SQLite compiled to WebAssembly) to hold the store.

One instance in production kept growing. Not spiking — creeping. ~36 GB RSS over six weeks, then the cgroup OOM-killer would reap it and the clock reset. Classic leak shape.

Step 1: is it even my code?

The proxy and the wrapped MCP child are separate processes. ps settled it fast: the proxy sat flat at ~60 MB; the ruflo mcp start child was the one ballooning. So the leak was below my code, in the wrapped process. Good — narrower problem.

Step 2: the heap that wasn't

First instinct on a Node leak is the V8 heap. So I looked at process.memoryUsage() on the live child:

rss            1385 MB
heapTotal        24 MB
heapUsed         21 MB
external       1286 MB
arrayBuffers    995 MB

This is the whole story in five numbers. heapTotal — the V8 JS heap — is flat at 24 MB. The growth is entirely in external / arrayBuffers: native memory backing ArrayBuffers, outside the GC'd JS heap.

That immediately kills two "obvious" fixes:

--max-old-space-size does nothing: it caps V8's old-generation heap (the whole JS heap has only 21 MB in use here), not native buffers.
Forcing GC does nothing either, if something still references those buffers.
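A minimal sketch of that "split by origin" check, assuming you can run code inside the process under suspicion (process.memoryUsage() only reports on the process that calls it). classifyMemory, the 50% cut-off and the verdict labels are illustrative choices, not anything from ruflo or Node itself.

```javascript
// Sketch: decide whether a memoryUsage() sample is dominated by the V8 heap
// or by native memory backing ArrayBuffers. The 0.5 thresholds are arbitrary.
function classifyMemory(usage) {
  const toMb = (bytes) => Math.round(bytes / (1024 * 1024));
  const nativeShare = usage.external / usage.rss;
  const heapShare = usage.heapTotal / usage.rss;

  let verdict = 'unclear';
  if (nativeShare > 0.5) verdict = 'native-buffers';
  else if (heapShare > 0.5) verdict = 'v8-heap';

  return {
    rssMb: toMb(usage.rss),
    heapTotalMb: toMb(usage.heapTotal),
    externalMb: toMb(usage.external),
    arrayBuffersMb: toMb(usage.arrayBuffers),
    verdict,
  };
}

// Sample the current process once.
console.log(classifyMemory(process.memoryUsage()));
```

Logging this on an interval and watching which column moves is usually enough to tell a JS-heap leak from a native one before reaching for heavier tools.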

So: what holds ~1 GB of ArrayBuffers?

Step 3: the heap snapshot

I opened the inspector on the live process (kill -USR1 <pid>, then connected over the WebSocket; Node 22 has a global WebSocket, so a 30-line script does it) and called HeapProfiler.takeHeapSnapshot. The snapshot was only ~18 MB, which is itself a clue: if the leak were hundreds of thousands of small JS objects, the graph would be huge. A small graph holding a lot of bytes means a few big buffers.
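For reference, a script of that kind might look roughly like the sketch below. It is not the script used here; it assumes Node 22+ (global fetch and WebSocket), the default inspector port 9229, and that the first target listed is the process you want. takeHeapSnapshot and chunkOf are hypothetical names.

```javascript
const fs = require('node:fs');

// Returns the snapshot text carried by a CDP message, or '' for any other message.
function chunkOf(msg) {
  return msg.method === 'HeapProfiler.addHeapSnapshotChunk' ? msg.params.chunk : '';
}

// Connects to an inspector already enabled with `kill -USR1 <pid>` and streams
// the snapshot to outPath. Port 9229 is the default and an assumption here.
async function takeHeapSnapshot(outPath, port = 9229) {
  // /json/list is the inspector's HTTP discovery endpoint.
  const res = await fetch(`http://127.0.0.1:${port}/json/list`);
  const targets = await res.json();
  const ws = new WebSocket(targets[0].webSocketDebuggerUrl);
  const out = fs.createWriteStream(outPath);

  await new Promise((resolve, reject) => {
    ws.addEventListener('error', () => reject(new Error('inspector connection failed')));
    ws.addEventListener('open', () => {
      ws.send(JSON.stringify({
        id: 1,
        method: 'HeapProfiler.takeHeapSnapshot',
        params: { reportProgress: false },
      }));
    });
    ws.addEventListener('message', (event) => {
      const msg = JSON.parse(event.data);
      out.write(chunkOf(msg));
      // The reply to request id 1 arrives after the last chunk event.
      if (msg.id === 1) {
        ws.close();
        if (msg.error) reject(new Error(msg.error.message));
        else out.end(resolve);
      }
    });
  });
}

// Usage (not run here): takeHeapSnapshot('child.heapsnapshot').catch(console.error);
```

The snapshot arrives as a series of text chunks, so streaming them straight to disk avoids holding the whole file in the script's own memory.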

Once I parsed the snapshot (the format is just nodes / edges / strings arrays), the top retained objects were unambiguous:

203 × native:system / JSArrayBufferData @ 11.0 MB = 2233 MB

203 buffers, 11 MB each. And 11 MB was exactly the size of the on-disk memory.db. The retainer chain:

JSArrayBufferData (11 MB)
  <- ArrayBuffer
  <- Buffer
  <- (MEMFS file node).contents
  <- FS.nodes  (an Array)
  <- Context  (the sql.js Emscripten module — has WebAssembly.Memory, HEAPF32, createNode, /dev/tty…)
  <- SqlJsBackend.db

That Context with createNode, /dev/tty, and a WebAssembly.Memory is the tell: it's Emscripten's in-memory filesystem (MEMFS). The file names confirmed it — each buffer was a MEMFS file called dbfile_<random>, and there were ~200 of them, each a full copy of the database.
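A tally like the "203 × ... JSArrayBufferData" line can be produced with a few lines over the snapshot JSON. The sketch below is an illustration, not the parser used for this investigation: it sums self_size per node name (retained size and the retainer chain need a walk over the edges array, which is left out), and topNodesBySelfSize is a hypothetical name.

```javascript
// Sketch: group heap-snapshot nodes by name and sum their self sizes.
// The field layout is read from snapshot.meta.node_fields instead of being
// hard-coded, because the number of fields per node varies between V8 versions.
function topNodesBySelfSize(snapshot, limit = 10) {
  const fields = snapshot.snapshot.meta.node_fields;
  const stride = fields.length;
  const nameAt = fields.indexOf('name');
  const sizeAt = fields.indexOf('self_size');

  const totals = new Map();
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    // The name field is an index into the shared strings table.
    const name = snapshot.strings[snapshot.nodes[i + nameAt]];
    const entry = totals.get(name) ?? { name, count: 0, bytes: 0 };
    entry.count += 1;
    entry.bytes += snapshot.nodes[i + sizeAt];
    totals.set(name, entry);
  }
  return [...totals.values()].sort((a, b) => b.bytes - a.bytes).slice(0, limit);
}

// Usage (not run here):
// const fs = require('node:fs');
// const snap = JSON.parse(fs.readFileSync('child.heapsnapshot', 'utf8'));
// console.table(topNodesBySelfSize(snap));
```

For buffer-heavy leaks self size is often enough, because the bytes sit directly on the backing-store nodes rather than being spread across a large object graph.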

Step 4: the root cause

Here's the mechanism. sql.js's Database constructor writes its input bytes into a MEMFS file (dbfile_<random>) via FS.createDataFile. Database.prototype.close() is what removes it (FS.unlink). And the sql.js module is a process-wide singleton — one MEMFS shared by every Database you ever open.

The backend opened the database like this, per operation path, with no caching:

this.db = new SQL.Database(fs.readFileSync(path)); // loads the whole 11MB image
// ...used, then the wrapper goes out of scope

When that JS Database wrapper is dropped, V8 garbage-collects the wrapper object — but GC has no idea about the MEMFS file it created inside the WASM module. Only an explicit close() unlinks it. No close() → the 11 MB dbfile_<random> lives in MEMFS forever. One leaked DB image per open. Multiply by traffic and you get 36 GB.

This is the trap in one sentence: garbage-collecting a JS handle does not free native/WASM memory it allocated. The GC sees a tiny wrapper; the cost is in a buffer the GC doesn't manage.
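When a database really is needed only for one operation, the usual way to avoid this trap is to tie close() to scope with try/finally. A minimal sketch, assuming SQL is the module object returned by sql.js's initSqlJs(); withDatabase is a hypothetical helper, not part of sql.js or ruflo.

```javascript
// Sketch: open a sql.js database, run `work`, and always close it, so the
// backing MEMFS file is unlinked even when `work` throws.
// `work` must be synchronous: if it returned a promise, close() would run
// before the promise settled.
function withDatabase(SQL, bytes, work) {
  const db = new SQL.Database(bytes);
  try {
    return work(db);
  } finally {
    db.close(); // the explicit step that garbage collection never performs
  }
}

// Usage (not run here; needs the sql.js package):
// const fs = require('node:fs');
// const initSqlJs = require('sql.js');
// initSqlJs().then((SQL) => {
//   const result = withDatabase(SQL, fs.readFileSync('memory.db'),
//     (db) => db.exec('SELECT count(*) FROM sqlite_master'));
//   console.log(result);
// });
```

Passing the module in as a parameter keeps the helper independent of how and where sql.js was initialized.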

Step 5: two fixes

Containment (ship today). I added an RSS watchdog to the proxy: it reads the child's RSS from /proc/<pid>/status, and when it crosses a threshold it gracefully respawns the child once it's idle (reusing an existing single-flight reconnect path — kill the old child, spawn a fresh one). A respawn drops the entire bloated MEMFS at once. Symptomatic, but it bounds memory with zero dropped requests.
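The core of such a watchdog is small. The sketch below shows only the measurement and the decision, under the assumption of a Linux /proc filesystem; the function names, the idea of counting in-flight requests and the threshold are illustrative, and the actual respawn path is specific to the proxy.

```javascript
const fs = require('node:fs');

// Extract VmRSS (reported in kB) from the text of /proc/<pid>/status.
// Returns null when the line is missing, e.g. for a process that has exited.
function parseVmRssKb(statusText) {
  const match = /^VmRSS:\s+(\d+)\s+kB$/m.exec(statusText);
  return match ? Number(match[1]) : null;
}

// Linux only: read the resident set size of another process.
function readChildRssKb(pid) {
  return parseVmRssKb(fs.readFileSync(`/proc/${pid}/status`, 'utf8'));
}

// Respawn only when the child is over the limit and nothing is in flight,
// so no request is cut off mid-operation.
function shouldRespawn(rssKb, limitKb, inFlightRequests) {
  return rssKb !== null && rssKb > limitKb && inFlightRequests === 0;
}

// Usage (not run here): poll on a timer and hand off to your own respawn logic.
// setInterval(() => {
//   if (shouldRespawn(readChildRssKb(child.pid), 2 * 1024 * 1024, inFlight)) respawnChild();
// }, 30_000);
```

Keeping the decision in a pure function makes the threshold and idle logic easy to test without a real child process.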

Root cause (fix it properly). Cache the backend per database path so the DB opens once and is reused, instead of a fresh SQL.Database per call. No repeated loads → no new dbfile_*. I bake this as a build-time patch into the image and filed it upstream with the snapshot.
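The shape of that fix, in a deliberately generic sketch: a Map keyed by resolved path, with the opener passed in. This is not the actual ruflo patch; getBackend and the open callback are hypothetical names.

```javascript
const path = require('node:path');

// One backend per database file, for the lifetime of the process.
const backends = new Map();

// Returns the cached backend for dbPath, calling open() only on first use.
// Paths are resolved so 'memory.db' and './memory.db' share one entry.
function getBackend(dbPath, open) {
  const key = path.resolve(dbPath);
  let backend = backends.get(key);
  if (!backend) {
    backend = open(key);
    backends.set(key, backend);
  }
  return backend;
}

// Usage (not run here), with SQL being an initialized sql.js module:
// const fs = require('node:fs');
// const db = getBackend('memory.db', (file) => new SQL.Database(fs.readFileSync(file)));
```

One trade-off to keep in mind: a cached in-memory image will not notice changes another process writes to the file, so this fits best when a single process owns the database.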

The bonus disaster: a corrupted database

The earlier hard OOM-kills had interrupted a sql.js write mid-flight and left one memory.db corrupted — database disk image is malformed, busted overflow pages in the B-tree. Recovery turned into its own adventure:

.recover (SQLite's salvage mode) reconstructed the bulk of the rows by walking the B-tree fragments.
But the newest writes weren't in the main file — they lived in the WAL (-wal), which .recover doesn't replay, and some sat on the corrupted pages. I ended up parsing WAL frames by hand (apply page images by page number) and carving SQLite leaf-page records directly to recover the rest.
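To give a feel for the hand-parsing, here is a sketch of walking WAL frames according to SQLite's documented WAL layout: a 32-byte file header, then frames made of a 24-byte header plus one page image. It is an illustration rather than the recovery tool used here, and it makes one loud simplification: frame checksums are not verified, only the salts, so it can accept a frame that SQLite itself would reject. Function names are hypothetical.

```javascript
// Sketch: list the frames in a SQLite WAL buffer. Checksums are NOT verified.
function listWalFrames(wal) {
  const magic = wal.readUInt32BE(0);
  if (magic !== 0x377f0682 && magic !== 0x377f0683) {
    throw new Error('not a SQLite WAL file');
  }
  const pageSize = wal.readUInt32BE(8);
  const salt1 = wal.readUInt32BE(16);
  const salt2 = wal.readUInt32BE(20);

  const frames = [];
  const frameSize = 24 + pageSize;
  for (let off = 32; off + frameSize <= wal.length; off += frameSize) {
    // A salt mismatch marks a stale frame left from an earlier WAL generation.
    if (wal.readUInt32BE(off + 8) !== salt1 || wal.readUInt32BE(off + 12) !== salt2) break;
    frames.push({
      pageNumber: wal.readUInt32BE(off),
      commitDbSize: wal.readUInt32BE(off + 4), // nonzero only on commit frames
      dataOffset: off + 24,
    });
  }
  return { pageSize, frames };
}

// Map each page number to the offset of its newest image, ignoring frames
// after the last commit frame (those belong to an unfinished transaction).
function latestCommittedPages(frames) {
  let lastCommit = -1;
  frames.forEach((frame, i) => { if (frame.commitDbSize !== 0) lastCommit = i; });
  const pages = new Map();
  for (const frame of frames.slice(0, lastCommit + 1)) {
    pages.set(frame.pageNumber, frame.dataOffset);
  }
  return pages;
}
```

"Applying page images by page number" then means copying pageSize bytes from each offset in that map over byte position (pageNumber - 1) * pageSize in a copy of the main database file.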

Lesson burned in: a WAL-mode SQLite backup is three files: db + -wal + -shm. Copy only the .db and you can end up with exactly that "malformed" error, because the latest committed state is still in the WAL.
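As a sketch of what "three files" means in a backup script: copy the database together with whichever sidecar files exist. copySqliteFileSet is a hypothetical helper, and this is file-level copying only. Copying a database that is being written at that moment can still capture an inconsistent state, which is what SQLite's own backup mechanisms (the .backup shell command, VACUUM INTO) are designed to avoid.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Copy <db>, <db>-wal and <db>-shm into destDir, skipping sidecars that do
// not exist (a cleanly checkpointed and closed database may have neither).
// Returns the base names of the files that were copied.
function copySqliteFileSet(dbPath, destDir) {
  const copied = [];
  for (const suffix of ['', '-wal', '-shm']) {
    const source = dbPath + suffix;
    if (!fs.existsSync(source)) continue;
    fs.copyFileSync(source, path.join(destDir, path.basename(source)));
    copied.push(path.basename(source));
  }
  return copied;
}

// Usage (not run here): copySqliteFileSet('/data/memory.db', '/backups/2024-06-22');
```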

Takeaways

Split RSS by origin first. heapTotal flat + external/arrayBuffers rising = native leak. Don't reach for --max-old-space-size; it can't help.
GC ≠ free for native/WASM memory. Anything backed by a WASM heap, an Emscripten MEMFS, or a native addon needs an explicit close/free. Dropping the JS handle isn't enough.
Heap snapshots find native retainers too. The JSArrayBufferData nodes and their retainer chain pointed straight at the owning structure. A small snapshot holding big bytes = few large buffers.
WAL backups are three files. Or your backup is unrecoverable.
Open source means you can actually fix the dependency. The leak was three layers down in someone else's package. I snapshotted it, found the cause, patched it locally, and sent it upstream — instead of filing a ticket into a void.

Upstream writeup with the full retainer trace: ruvnet/ruflo#2432. The wrapper itself, if you're curious: jazz-max/ruflo-hub.
