When I set out to build Torus, a reverse proxy in Node.js, the node:cluster module looked like the perfect solution to my biggest architectural hurdle. Because Node.js is natively single-threaded, this core module is the standard way to achieve multi-core scaling - it lets a single application spawn multiple worker processes that all share the same server port.
It solved the problem immediately. But it quietly introduced a serious vulnerability I didn't catch until it was almost too late.
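For readers who haven't used the module, the basic pattern looks roughly like this (a minimal sketch — the real Torus setup adds TLS, logging, and config on top):

```typescript
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

// The primary forks one worker per core; every worker binds the same
// port, and the primary distributes incoming connections across them
// (round-robin on most platforms).
function startCluster(port: number): void {
  if (cluster.isPrimary) {
    for (let i = 0; i < os.cpus().length; i++) {
      cluster.fork();
    }
  } else {
    http
      .createServer((_req, res) => {
        res.end(`Handled by worker ${process.pid}\n`);
      })
      .listen(port);
  }
}

// e.g. startCluster(8080);
```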
The Trap: Blind Resurrection
After scaffolding the cluster, I wrote the standard if (cluster.isPrimary) block, looped through CPU cores, called cluster.fork(), and attached an exit listener to respawn workers on crash:
cluster.on('exit', (worker, code, signal) => {
logger.warn(`Worker ${worker.process.pid} died. Booting a replacement...`);
cluster.fork();
});
On the surface, this is exactly right. If a worker blows up with an out-of-memory crash or an unhandled exception tears down the process, the Master catches the exit event and spawns a fresh replacement. Capacity heals itself.
The problem is that this code is completely blind. It doesn't know why a worker died. It only knows that a process stopped, and its only response is to immediately restart it.
This breaks the moment you attempt a routine deployment.
The Kubernetes Tug-of-War
Imagine this code running in a Docker container orchestrated by Kubernetes. You push version 1.1 of your proxy. Kubernetes initiates a rolling update and sends a SIGTERM to your pod: "Finish your active requests and shut down cleanly."
Your workers receive the signal, drain their active TCP sockets, and exit with code 0.
But the millisecond the first worker exits, your Master's .on('exit') listener fires. It doesn't see a coordinated graceful shutdown - it sees a dead process, and it immediately forks a replacement.
While Kubernetes is trying to peacefully drain your pod, your Master is frantically spawning new workers to replace the ones shutting down. You're now locked in a fight with your own infrastructure.
Eventually Kubernetes runs out of patience, fires a SIGKILL, and brutally terminates the Master along with every zombie worker it just spawned. Any clients that were mid-stream have their TCP connections severed instantly.
The "zero-downtime deployment" was a lie.
The Fix: A State-Gated Lifecycle
To fix this, the Master process needs to distinguish between two very different events:
An unexpected crash (a V8 segfault, an OOM kill) → resurrect immediately
An intentional shutdown (a Kubernetes eviction, a SIGTERM) → stand down and let workers die cleanly
That requires three things: a global state lock, an IPC broadcast, and a lifecycle manager with a hard timeout.
1. The Global State Lock
In the Master process, a single boolean flag acts as the guillotine switch:
let isShuttingDown = false;
The .on('exit') listener is rewritten to check this flag before doing anything:
cluster.on("exit", (worker, code, signal) => {
if (isShuttingDown) {
logger.info(`Worker ${worker.process.pid} exited cleanly during shutdown.`);
if (Object.keys(cluster.workers || {}).length === 0) {
logger.info("All workers stopped. Master exiting with code 0.");
process.exit(0);
}
} else {
logger.warn(`Worker ${worker.process.pid} crashed. Forking a replacement...`);
cluster.fork();
}
});
If the cluster is shutting down, the Master lets workers die in peace and waits until the pool is empty before exiting cleanly. If the flag is false, a worker crashed unexpectedly and gets resurrected immediately.
2. The Teardown Broadcast via IPC
When the OS sends SIGTERM, the Master intercepts it, flips the lock, and broadcasts a SHUTDOWN command to every worker over Node's native IPC channel - rather than killing them instantly and dropping active payloads:
const initiateClusterTeardown = (signal: string) => {
if (isShuttingDown) return;
isShuttingDown = true;
logger.info({ signal }, "Master received termination signal. Broadcasting shutdown to workers...");
for (const id in cluster.workers) {
cluster.workers[id]?.send({ type: "SHUTDOWN" });
}
};
process.on("SIGTERM", () => initiateClusterTeardown("SIGTERM"));
process.on("SIGINT", () => initiateClusterTeardown("SIGINT"));
3. The LifecycleManager and the 10-Second Guillotine
Inside each worker, the SHUTDOWN message is handled by a LifecycleManager singleton - a single object that owns the worker's entire teardown sequence and prevents the race conditions that come from ad-hoc async cleanup code.
When it receives the command, it executes a strict sequence:
Stop the bleeding - immediately reject new incoming TCP connections
Destroy idle sockets - aggressively close idle Keep-Alive connections to free OS file descriptors
Drain active payloads - wait for in-progress streams to finish naturally
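Steps 1 and 2 can be sketched as a small connection tracker. This is an illustrative sketch, not the actual Torus internals — `ConnectionTracker` is a name I'm inventing here:

```typescript
// Track every socket and its in-flight request count so idle
// Keep-Alive connections can be destroyed during teardown.
interface TrackedSocket {
  destroy(): void;
}

class ConnectionTracker<S extends TrackedSocket> {
  private readonly active = new Map<S, number>(); // socket -> in-flight requests

  track(socket: S): void {
    if (!this.active.has(socket)) this.active.set(socket, 0);
  }

  requestStarted(socket: S): void {
    this.active.set(socket, (this.active.get(socket) ?? 0) + 1);
  }

  requestFinished(socket: S): void {
    this.active.set(socket, Math.max(0, (this.active.get(socket) ?? 1) - 1));
  }

  // Step 2: destroy every socket with no in-flight request,
  // freeing its file descriptor immediately.
  destroyIdle(): number {
    let destroyed = 0;
    for (const [socket, inFlight] of this.active) {
      if (inFlight === 0) {
        socket.destroy();
        this.active.delete(socket);
        destroyed++;
      }
    }
    return destroyed;
  }
}
```

In a real worker you would call `track()` from `server.on('connection')`, bump the counters from request lifecycle events, run `server.close()` for step 1, then `destroyIdle()` for step 2 while step 3's drain waits on the remaining busy sockets.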
Step 3 has a fatal edge case: a slow or malicious client trickling 1 byte per second will keep server.close() waiting indefinitely, hanging the deployment pipeline forever.
The LifecycleManager solves this with a hard 10-second timeout:
// Inside LifecycleManager.executeTeardown()
const drainPromise = Promise.all(
this.tasks.map(async (task) => {
try {
await task;
} catch (err) {
logger.error({ err }, "A teardown task failed during shutdown.");
}
}),
);
const timeoutPromise = new Promise((_, reject) => {
setTimeout(() => {
reject(new Error(`Shutdown timed out after ${this.SHUTDOWN_TIMEOUT_MS}ms`));
}, this.SHUTDOWN_TIMEOUT_MS).unref(); // unref() prevents the timer from keeping the event loop alive
});
try {
await Promise.race([drainPromise, timeoutPromise]);
logger.info("All systems cleanly drained. Exiting process gracefully.");
process.exit(0);
} catch (error: any) {
logger.error({ err: error.message }, "Graceful shutdown aborted. Forcing exit.");
process.exit(1);
}
The worker handles the IPC message like this:
process.on("message", (msg: any) => {
if (msg && msg.type === "SHUTDOWN") {
Lifecycle.executeTeardown("IPC_SHUTDOWN");
}
});
With this in place: if a worker segfaults, it heals in milliseconds. If Kubernetes sends SIGTERM, the proxy drains active connections cleanly, brutally severs any hanging connections after 10 seconds, and exits with code 0.
The theoretical architecture was solid. Proving it worked was a different problem entirely.
The Testing Nightmare: Why Jest Can't Test This
Standard Jest unit tests are fundamentally incapable of testing a multi-process cluster. Jest runs inside a single Node.js process. Calling cluster.fork() or process.exit() from inside a test will either crash the test runner or leave orphan processes silently eating RAM in the background.
To prove Torus could survive a crash and execute a graceful teardown, I abandoned unit testing entirely and built a black-box integration test. The test treats the compiled proxy as a hostile external artifact: it boots it, wiretaps its output, attacks it, and evaluates how it responds.
Step 1: Boot the Cluster in Isolation
Using child_process.spawn, the test starts the compiled binary exactly as production would:
const entryPoint = path.resolve(__dirname, "../../dist/index.js");
masterProcess = spawn("node", ["--env-file=.env", entryPoint], {
stdio: ["pipe", "pipe", "pipe", "ipc"],
});
The ipc channel in stdio is important - it's how the test sends commands to the Master later.
Step 2: Wiretap stdout and Assassinate a Worker
Because the proxy runs in an isolated background process, internal variables aren't accessible. Instead, the test intercepts raw stdout. When a worker logs that it's ready, the test extracts its PID and sends it a SIGKILL - bypassing any graceful shutdown handler entirely, simulating a fatal OOM crash:
masterProcess.stdout!.on("data", (data) => {
const output = data.toString();
if (!targetWorkerPid && output.includes("listening for Secure HTTPS traffic")) {
const match = output.match(/Worker\s+(\d+)\s+listening/);
if (match) {
targetWorkerPid = parseInt(match[1], 10);
process.kill(targetWorkerPid, "SIGKILL"); // Instant, uninterceptable death
}
}
// ...
});
SIGKILL cannot be caught by any signal handler. This is the correct way to simulate catastrophic process death.
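Node enforces this at the API level: per the docs, 'SIGKILL' cannot have a listener installed, and attempting to register one throws rather than silently doing nothing. A quick sketch:

```typescript
// Attempting to register a SIGKILL handler throws at registration
// time, so a worker has no way to intercept the test's kill.
let sigkillBlocked = false;
try {
  process.on("SIGKILL", () => {
    // Unreachable: Node never allows this handler to be installed.
  });
} catch {
  sigkillBlocked = true; // registration itself threw
}
console.log(`SIGKILL handler blocked: ${sigkillBlocked}`);
```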
Step 3: Verify Resurrection and Trigger Teardown
The test continues monitoring stdout. Once it sees the Master log a resurrection, it locks a boolean and sends a shutdown command via IPC:
if (targetWorkerPid && output.includes("crashed. Forking a replacement") && !workerResurrected) {
workerResurrected = true;
setTimeout(() => {
masterProcess.send({ type: "TEST_SHUTDOWN" });
}, 500); // Give the cluster 500ms to stabilize before triggering teardown
}
Step 4: The Final Verdict
When the last worker exits, the Master closes. The test uses the close event as its final assertion gate:
masterProcess.on("close", (code) => {
try {
expect(workerResurrected).toBe(true); // Cluster self-healed after crash
expect(shutdownInitiated).toBe(true); // IPC teardown broadcast worked
expect(code).toBe(0); // Master exited cleanly
resolve();
} catch (error) {
reject(error);
}
});
The test was flawless. It spawned the cluster, assassinated the worker, verified the resurrection, and cleanly exited.
But notice that TEST_SHUTDOWN command in Step 3? I didn't write it that way originally. Originally, I just told the test to send a standard SIGINT to the Master. But when I ran it, my Windows machine silently choked to death.
The Windows Problem: POSIX Signals Are a Lie
The black-box test spawned the cluster, killed a worker, and watched the Master resurrect it. The final step was to prove the LifecycleManager could execute a zero-downtime graceful shutdown.
To simulate a Kubernetes pod eviction or a developer hitting Ctrl+C, the integration test needed to send a termination signal to the Master process.
masterProcess.kill("SIGINT");
I ran the test. It failed instantly.
The Master didn't gracefully drain the TCP sockets. It didn't trigger the 10-second timeout. It just died on the spot. The logs went completely silent.
On Linux and macOS, sending a termination signal to the Master process works perfectly. SIGINT is a native POSIX signal that politely knocks on the process's door and allows the Node.js event loop to execute its .on('SIGINT') handler.
Windows doesn't have native POSIX signals.
When a Node.js process programmatically calls .kill() with SIGINT on Windows, the OS can't route it through the event loop. Instead, it bypasses the signal handler entirely and terminates the target process immediately. My Master process wasn't failing to drain its sockets - Windows was killing it before the LifecycleManager could even start.
The fix was to bypass the OS signal layer entirely. I added a dedicated IPC backdoor directly in the Master's message handler:
// Testing Backdoor for Windows OS limitations
process.on("message", (msg: any) => {
if (msg && msg.type === "TEST_SHUTDOWN") {
initiateClusterTeardown("TEST_SHUTDOWN");
}
});
And in the test, .kill("SIGINT") became:
setTimeout(() => {
masterProcess.send({ type: "TEST_SHUTDOWN" });
}, 500);
This sends a plain JSON payload over the IPC channel - no OS signal routing required. The Master receives it, flips the isShuttingDown flag, broadcasts teardown to the workers, and the LifecycleManager executes the full drain sequence.
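If you want a single trigger that works everywhere, one option is a small platform-aware wrapper. This is a hypothetical helper, not part of Torus — `requestShutdown` and `ShutdownTarget` are names I'm introducing for illustration:

```typescript
// Hypothetical helper: pick the shutdown mechanism by platform, since
// programmatic SIGINT/SIGTERM on Windows bypasses the target's signal
// handlers and kills it outright.
interface ShutdownTarget {
  kill(signal?: NodeJS.Signals): boolean;
  send?(message: unknown): boolean;
}

function requestShutdown(
  target: ShutdownTarget,
  platform: NodeJS.Platform = process.platform,
): "signal" | "ipc" {
  if (platform === "win32") {
    // No POSIX signal routing on Windows: use the IPC channel instead.
    target.send?.({ type: "TEST_SHUTDOWN" });
    return "ipc";
  }
  // POSIX platforms: a real signal reaches the .on("SIGTERM") handler.
  target.kill("SIGTERM");
  return "signal";
}
```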
The CI/CD Trap: Environment Drift
With the Windows fix in place, the local test suite passed cleanly. I pushed to GitHub and waited for the green checkmark.
The pipeline spun for 15 seconds and failed silently.
Diagnosing the Silence
When the black-box test's wiretap never hears the expected log line, it just sits there until Jest's timeout guillotine drops. To find out what was actually happening in the CI runner, I temporarily dumped raw stdout:
masterProcess.stdout!.on("data", (data) => {
console.log(data.toString()); // Raw output — for CI diagnostics only
// ...
});
On the next run, the pipeline printed the truth:
ENOENT: no such file or directory, open '/home/runner/work/torus-proxy/certs/key.pem'
Missing Certificates
TLS certificates belong in .gitignore. On my local machine they existed. On GitHub Actions' naked Ubuntu runner, they didn't. Workers hit the TLS configuration block, threw an ENOENT, and died before ever binding to a port.
Because the proxy runs as a real detached process, jest.mock('fs') isn't an option - the isolated binary needs real files on disk. The fix was to generate dummy self-signed certificates in the CI pipeline before the test runs:
- name: Generate Dummy TLS Certificates
run: |
mkdir -p certs
openssl req -nodes -new -x509 \
-keyout certs/key.pem \
-out certs/cert.pem \
-days 365 \
-subj "/CN=localhost"
Missing Environment Variables
The pipeline failed again. This time:
Error: JWT Secret is required to boot JWT Authenticator.
The .env file was also gitignored. Rather than writing a fake .env to the runner disk, I injected the dummy secret directly into the spawn environment:
const envPath = path.resolve(process.cwd(), ".env");
const nodeArgs = fs.existsSync(envPath)
? ["--env-file=.env", entryPoint]
: [entryPoint];
masterProcess = spawn("node", nodeArgs, {
stdio: ["pipe", "pipe", "pipe", "ipc"],
env: {
...process.env,
JWT_SECRET: "dummy_test_secret_for_ci_pipeline_only",
},
});
I pushed the final commit. The pipeline generated the certificates, injected the secret, booted the Master, assassinated a worker, watched it resurrect, triggered teardown, drained sockets, and exited with code 0.
Green checkmark.
Conclusion: Trust Nothing
The standard Node.js cluster tutorials are dangerously incomplete. Blindly resurrecting workers on every exit event creates a system that will fight your deployment pipelines and drop active TCP connections mid-stream.
Building something production-grade requires:
Deterministic state machines - not reflexive code that acts without knowing why things happened
Structured teardown sequences - not ad-hoc async cleanup that races itself
Skepticism about the OS - signals behave differently across platforms in ways the documentation doesn't always make clear
Black-box integration tests - not unit tests that mock away the exact failure modes you're trying to protect against
CI environment parity - assume the runner has nothing your .gitignore excluded
The cluster module is powerful. But its standard usage pattern gives you the illusion of resilience, not the real thing.
If you'd like to see the full implementation, the source code for Torus Proxy is on GitHub.