<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashish Barmaiya</title>
    <description>The latest articles on DEV Community by Ashish Barmaiya (@ashishbarmaiya).</description>
    <link>https://dev.to/ashishbarmaiya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3599039%2F8c260876-69bb-46eb-b816-1161303af61c.jpg</url>
      <title>DEV Community: Ashish Barmaiya</title>
      <link>https://dev.to/ashishbarmaiya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashishbarmaiya"/>
    <language>en</language>
    <item>
      <title>Surviving Node.js Clusters: Graceful Teardowns, Windows Quirks, and Black-Box Testing</title>
      <dc:creator>Ashish Barmaiya</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:40:34 +0000</pubDate>
      <link>https://dev.to/ashishbarmaiya/surviving-nodejs-clusters-graceful-teardowns-windows-quirks-and-black-box-testing-1aom</link>
      <guid>https://dev.to/ashishbarmaiya/surviving-nodejs-clusters-graceful-teardowns-windows-quirks-and-black-box-testing-1aom</guid>
      <description>&lt;p&gt;When I set out to build &lt;strong&gt;Torus&lt;/strong&gt;, a reverse proxy in Node.js, the &lt;code&gt;node:cluster&lt;/code&gt; module looked like the perfect solution to my biggest architectural hurdle. Because Node.js is natively single-threaded, this core module is the standard way to achieve multi-core scaling - it lets a single application spawn multiple worker processes that all share the same server port.&lt;/p&gt;

&lt;p&gt;It solved the problem immediately. But it quietly introduced a serious vulnerability I didn't catch until it was almost too late.&lt;/p&gt;




&lt;h2&gt;
  The Trap: Blind Resurrection
&lt;/h2&gt;

&lt;p&gt;After scaffolding the cluster, I wrote the standard &lt;code&gt;if (cluster.isPrimary)&lt;/code&gt; block, looped through the CPU cores calling &lt;code&gt;cluster.fork()&lt;/code&gt;, and attached an &lt;code&gt;exit&lt;/code&gt; listener to respawn workers on crash.&lt;/p&gt;
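
&lt;p&gt;That scaffolding is the textbook pattern. A minimal sketch of it, assuming Node 18+ for &lt;code&gt;os.availableParallelism()&lt;/code&gt; and a hypothetical &lt;code&gt;startProxyServer()&lt;/code&gt; as the worker entry point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import cluster from "node:cluster";
import os from "node:os";

if (cluster.isPrimary) {
  // One worker per logical CPU core; they all share the same server port.
  for (let i = 0; i &lt; os.availableParallelism(); i++) {
    cluster.fork();
  }
} else {
  startProxyServer(); // hypothetical worker entry point
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The respawn half was the &lt;code&gt;exit&lt;/code&gt; listener:&lt;br&gt;
&lt;/p&gt;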

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; died. Booting a replacement...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the surface, this is exactly right. If a worker blows up with an out-of-memory crash or an unhandled exception tears the process down, the Master catches the exit event and spawns a fresh replacement. Capacity heals itself.&lt;/p&gt;

&lt;p&gt;The problem is that this code is completely blind. It doesn't know &lt;em&gt;why&lt;/em&gt; a worker died. It only knows that a process stopped, and its only response is to immediately restart it.&lt;/p&gt;

&lt;p&gt;This breaks the moment you attempt a routine deployment.&lt;/p&gt;

&lt;h3&gt;
  The Kubernetes Tug-of-War
&lt;/h3&gt;

&lt;p&gt;Imagine this code running in a Docker container orchestrated by Kubernetes. You push version 1.1 of your proxy. Kubernetes initiates a rolling update and sends a &lt;code&gt;SIGTERM&lt;/code&gt; to your pod: "Finish your active requests and shut down cleanly."&lt;/p&gt;

&lt;p&gt;Your workers receive the signal, drain their active TCP sockets, and exit with code &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But the millisecond the first worker exits, your Master's &lt;code&gt;.on('exit')&lt;/code&gt; listener fires. It doesn't see a coordinated graceful shutdown - it sees a dead process, and it immediately forks a replacement.&lt;/p&gt;

&lt;p&gt;While Kubernetes is trying to peacefully drain your pod, your Master is frantically spawning new workers to replace the ones shutting down. You're now locked in a fight with your own infrastructure.&lt;/p&gt;

&lt;p&gt;Eventually Kubernetes runs out of patience, fires a &lt;code&gt;SIGKILL&lt;/code&gt;, and brutally terminates the Master along with every zombie worker it just spawned. Any clients that were mid-stream have their TCP connections severed instantly.&lt;/p&gt;

&lt;p&gt;The "zero-downtime deployment" was a lie.&lt;/p&gt;




&lt;h2&gt;
  The Fix: A State-Gated Lifecycle
&lt;/h2&gt;

&lt;p&gt;To fix this, the Master process needs to distinguish between two very different events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An &lt;strong&gt;unexpected crash&lt;/strong&gt; (a V8 segfault, an out-of-memory kill) → resurrect immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An &lt;strong&gt;intentional shutdown&lt;/strong&gt; (a Kubernetes eviction, a &lt;code&gt;SIGTERM&lt;/code&gt;) → stand down and let workers die cleanly&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That requires three things: a &lt;strong&gt;global state lock&lt;/strong&gt;, an &lt;strong&gt;IPC broadcast&lt;/strong&gt;, and a &lt;strong&gt;lifecycle manager with a hard timeout&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  1. The Global State Lock
&lt;/h3&gt;

&lt;p&gt;In the Master process, a single boolean flag acts as the guillotine switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;isShuttingDown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.on('exit')&lt;/code&gt; listener is rewritten to check this flag before doing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isShuttingDown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; exited cleanly during shutdown.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workers&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;All workers stopped. Master exiting with code 0.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; crashed. Forking a replacement...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the cluster is shutting down, the Master lets workers die in peace and waits until the pool is empty before exiting cleanly. If the flag is &lt;code&gt;false&lt;/code&gt;, a worker crashed unexpectedly and gets resurrected immediately.&lt;/p&gt;

&lt;h3&gt;
  2. The Teardown Broadcast via IPC
&lt;/h3&gt;

&lt;p&gt;When the OS sends &lt;code&gt;SIGTERM&lt;/code&gt;, the Master intercepts it, flips the lock, and broadcasts a &lt;code&gt;SHUTDOWN&lt;/code&gt; command to every worker over Node's native IPC channel - rather than killing them instantly and dropping active payloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;initiateClusterTeardown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isShuttingDown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;isShuttingDown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Master received termination signal. Broadcasting shutdown to workers...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;initiateClusterTeardown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;initiateClusterTeardown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  3. The LifecycleManager and the 10-Second Guillotine
&lt;/h3&gt;

&lt;p&gt;Inside each worker, the &lt;code&gt;SHUTDOWN&lt;/code&gt; message is handled by a &lt;code&gt;LifecycleManager&lt;/code&gt; singleton - a single object that owns the worker's entire teardown sequence and prevents the race conditions that come from ad-hoc async cleanup code.&lt;/p&gt;

&lt;p&gt;When it receives the command, it executes a strict sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stop the bleeding&lt;/strong&gt; - immediately reject new incoming TCP connections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Destroy idle sockets&lt;/strong&gt; - aggressively close idle Keep-Alive connections to free OS file descriptors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drain active payloads&lt;/strong&gt; - wait for in-progress streams to finish naturally&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
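
&lt;p&gt;Steps 1 and 2 are mechanical. Here's a minimal sketch of the idea - the socket bookkeeping is illustrative, not Torus's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import http from "node:http";
import type { Socket } from "node:net";

const server = http.createServer();
const sockets = new Set&lt;Socket&gt;();
const busy = new WeakSet&lt;Socket&gt;(); // sockets with an in-flight request

server.on("connection", (socket) =&gt; {
  sockets.add(socket);
  socket.on("close", () =&gt; sockets.delete(socket));
});

server.on("request", (req, res) =&gt; {
  busy.add(req.socket);
  res.on("finish", () =&gt; busy.delete(req.socket));
});

function beginTeardown(): void {
  server.close(); // Step 1: reject new incoming connections

  // Step 2: destroy idle Keep-Alive sockets to release their file descriptors.
  for (const socket of sockets) {
    if (!busy.has(socket)) socket.destroy();
  }
  // Step 3 - draining the sockets still serving requests - is where the
  // 10-second guillotine below comes in.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;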

&lt;p&gt;Step 3 has a fatal edge case: a slow or malicious client trickling 1 byte per second will keep &lt;code&gt;server.close()&lt;/code&gt; waiting indefinitely, hanging the deployment pipeline forever.&lt;/p&gt;

&lt;p&gt;The LifecycleManager solves this with a hard 10-second timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Inside LifecycleManager.executeTeardown()&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;drainPromise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;A teardown task failed during shutdown.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timeoutPromise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Shutdown timed out after &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SHUTDOWN_TIMEOUT_MS&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SHUTDOWN_TIMEOUT_MS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unref&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// unref() prevents the timer from keeping the event loop alive&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;race&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;drainPromise&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timeoutPromise&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;All systems cleanly drained. Exiting process gracefully.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Graceful shutdown aborted. Forcing exit.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker handles the IPC message like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Lifecycle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeTeardown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IPC_SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place: if a worker segfaults, it heals in milliseconds. If Kubernetes sends &lt;code&gt;SIGTERM&lt;/code&gt;, the proxy drains active connections cleanly, brutally severs any hanging connections after 10 seconds, and exits with code &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The theoretical architecture was solid. Proving it worked was a different problem entirely.&lt;/p&gt;




&lt;h2&gt;
  The Testing Nightmare: Why Jest Can't Test This
&lt;/h2&gt;

&lt;p&gt;Standard Jest unit tests are fundamentally incapable of testing a multi-process cluster. Jest runs inside a single Node.js process. Calling &lt;code&gt;cluster.fork()&lt;/code&gt; or &lt;code&gt;process.exit()&lt;/code&gt; from inside a test will either crash the test runner or leave orphan processes silently eating RAM in the background.&lt;/p&gt;

&lt;p&gt;To prove Torus could survive a crash and execute a graceful teardown, I abandoned unit testing entirely and built a &lt;strong&gt;black-box integration test&lt;/strong&gt;. The test treats the compiled proxy as a hostile external artifact: it boots it, wiretaps its output, attacks it, and evaluates how it responds.&lt;/p&gt;

&lt;h3&gt;
  Step 1: Boot the Cluster in Isolation
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;child_process.spawn&lt;/code&gt;, the test starts the compiled binary exactly as production would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;entryPoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;../../dist/index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;masterProcess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--env-file=.env&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;entryPoint&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stdio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ipc&lt;/code&gt; channel in &lt;code&gt;stdio&lt;/code&gt; is important - it's how the test sends commands to the Master later.&lt;/p&gt;

&lt;h3&gt;
  Step 2: Wiretap stdout and Assassinate a Worker
&lt;/h3&gt;

&lt;p&gt;Because the proxy runs in an isolated background process, internal variables aren't accessible. Instead, the test intercepts raw stdout. When a worker logs that it's ready, the test extracts its PID and sends it a &lt;code&gt;SIGKILL&lt;/code&gt; - bypassing any graceful shutdown handler entirely, simulating a fatal OOM crash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;targetWorkerPid&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;listening for Secure HTTPS traffic&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/Worker&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)\s&lt;/span&gt;&lt;span class="sr"&gt;+listening/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;targetWorkerPid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetWorkerPid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGKILL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Instant, uninterceptable death&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SIGKILL&lt;/code&gt; cannot be caught by any signal handler. This is the correct way to simulate catastrophic process death.&lt;/p&gt;
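
&lt;p&gt;Node enforces this at the API level, too. Attempting to register a listener for it doesn't silently no-op - it throws (a quick sketch; the exact error text may vary across Node versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;try {
  process.on("SIGKILL", () =&gt; {}); // listeners for SIGKILL are rejected outright
} catch (err) {
  console.error(err); // Error: uv_signal_start EINVAL
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;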

&lt;h3&gt;
  Step 3: Verify Resurrection and Trigger Teardown
&lt;/h3&gt;

&lt;p&gt;The test continues monitoring stdout. Once it sees the Master log a resurrection, it locks a boolean and sends a shutdown command via IPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetWorkerPid&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crashed. Forking a replacement&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;workerResurrected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;workerResurrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEST_SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Give the cluster 500ms to stabilize before triggering teardown&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Step 4: The Final Verdict
&lt;/h3&gt;

&lt;p&gt;When the last worker exits, the Master closes. The test uses the &lt;code&gt;close&lt;/code&gt; event as its final assertion gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;close&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;workerResurrected&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Cluster self-healed after crash&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shutdownInitiated&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// IPC teardown broadcast worked&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                 &lt;span class="c1"&gt;// Master exited cleanly&lt;/span&gt;
    &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test was flawless. It spawned the cluster, assassinated the worker, verified the resurrection, and cleanly exited.&lt;/p&gt;

&lt;p&gt;But notice that &lt;code&gt;TEST_SHUTDOWN&lt;/code&gt; command in Step 3? I didn't write it that way at first - originally, the test just sent a standard &lt;code&gt;SIGINT&lt;/code&gt; to the Master. But when I ran it on my Windows machine, the test silently choked to death.&lt;/p&gt;




&lt;h2&gt;
  The Windows Problem: POSIX Signals Are a Lie
&lt;/h2&gt;

&lt;p&gt;The black-box test spawned the cluster, killed a worker, and watched the Master resurrect it. The final step was to prove the &lt;code&gt;LifecycleManager&lt;/code&gt; could execute a zero-downtime graceful shutdown.&lt;/p&gt;

&lt;p&gt;To simulate a Kubernetes pod eviction or a developer hitting Ctrl+C, the integration test needed to send a termination signal to the Master process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran the test. It failed instantly.&lt;/p&gt;

&lt;p&gt;The Master didn't gracefully drain the TCP sockets. It didn't trigger the 10-second timeout. It just died on the spot. The logs went completely silent.&lt;/p&gt;

&lt;p&gt;On Linux and macOS, sending a termination signal to the Master process works perfectly. &lt;code&gt;SIGINT&lt;/code&gt; is a native POSIX signal that politely knocks on the process's door and allows the Node.js event loop to execute its &lt;code&gt;.on('SIGINT')&lt;/code&gt; handler.&lt;/p&gt;

&lt;p&gt;Windows doesn't have native POSIX signals.&lt;/p&gt;

&lt;p&gt;When a Node.js process programmatically calls &lt;code&gt;.kill()&lt;/code&gt; with &lt;code&gt;SIGINT&lt;/code&gt; on Windows, the OS can't route it through the event loop. Instead, it bypasses the signal handler entirely and terminates the target process immediately. My Master process wasn't failing to drain its sockets - Windows was killing it before the &lt;code&gt;LifecycleManager&lt;/code&gt; could even start.&lt;/p&gt;

&lt;p&gt;The fix was to bypass the OS signal layer entirely. I added a dedicated IPC backdoor directly in the Master's message handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Testing Backdoor for Windows OS limitations&lt;/span&gt;
&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEST_SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;initiateClusterTeardown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEST_SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the test, &lt;code&gt;.kill("SIGINT")&lt;/code&gt; became:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TEST_SHUTDOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends a plain JSON payload over the IPC channel - no OS signal routing required. The Master receives it, flips the &lt;code&gt;isShuttingDown&lt;/code&gt; flag, broadcasts teardown to the workers, and the LifecycleManager executes the full drain sequence.&lt;/p&gt;




&lt;h2&gt;
  The CI/CD Trap: Environment Drift
&lt;/h2&gt;

&lt;p&gt;With the Windows fix in place, the local test suite passed cleanly. I pushed to GitHub and waited for the green checkmark.&lt;/p&gt;

&lt;p&gt;The pipeline spun for 15 seconds and failed silently.&lt;/p&gt;

&lt;h3&gt;
  Diagnosing the Silence
&lt;/h3&gt;

&lt;p&gt;When the black-box test's wiretap never hears the expected log line, it just sits there until Jest's timeout guillotine drops. To find out what was actually happening in the CI runner, I temporarily dumped raw stdout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// Raw output — for CI diagnostics only&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the next run, the pipeline printed the truth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENOENT: no such file or directory, open '/home/runner/work/torus-proxy/certs/key.pem'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Missing Certificates
&lt;/h3&gt;

&lt;p&gt;TLS certificates belong in &lt;code&gt;.gitignore&lt;/code&gt;. On my local machine they existed. On GitHub Actions' naked Ubuntu runner, they didn't. Workers hit the TLS configuration block, threw an &lt;code&gt;ENOENT&lt;/code&gt;, and died before ever binding to a port.&lt;/p&gt;

&lt;p&gt;Because the proxy runs as a real detached process, &lt;code&gt;jest.mock('fs')&lt;/code&gt; isn't an option - the isolated binary needs real files on disk. The fix was to generate dummy self-signed certificates in the CI pipeline before the test runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Dummy TLS Certificates&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;mkdir -p certs&lt;/span&gt;
    &lt;span class="s"&gt;openssl req -nodes -new -x509 \&lt;/span&gt;
      &lt;span class="s"&gt;-keyout certs/key.pem \&lt;/span&gt;
      &lt;span class="s"&gt;-out certs/cert.pem \&lt;/span&gt;
      &lt;span class="s"&gt;-days 365 \&lt;/span&gt;
      &lt;span class="s"&gt;-subj "/CN=localhost"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Missing Environment Variables
&lt;/h3&gt;

&lt;p&gt;The pipeline failed again. This time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: JWT Secret is required to boot JWT Authenticator.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file was also gitignored. Rather than writing a fake &lt;code&gt;.env&lt;/code&gt; to the runner disk, I injected the dummy secret directly into the &lt;code&gt;spawn&lt;/code&gt; environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;envPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.env&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nodeArgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;existsSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;envPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--env-file=.env&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;entryPoint&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;entryPoint&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="nx"&gt;masterProcess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;nodeArgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stdio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pipe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;JWT_SECRET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dummy_test_secret_for_ci_pipeline_only&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I pushed the final commit. The pipeline generated the certificates, injected the secret, booted the Master, assassinated a worker, watched it resurrect, triggered teardown, drained sockets, and exited with code &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Green checkmark.&lt;/p&gt;




&lt;h2&gt;
  Conclusion: Trust Nothing
&lt;/h2&gt;

&lt;p&gt;The standard Node.js cluster tutorials are dangerously incomplete. Blindly resurrecting workers on every &lt;code&gt;exit&lt;/code&gt; event creates a system that will fight your deployment pipelines and drop active TCP connections mid-stream.&lt;/p&gt;

&lt;p&gt;Building something production-grade requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deterministic state machines&lt;/strong&gt; - not reflexive code that acts without knowing why things happened&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured teardown sequences&lt;/strong&gt; - not ad-hoc async cleanup that races itself&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skepticism about the OS&lt;/strong&gt; - signals behave differently across platforms in ways the documentation doesn't always make clear&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Black-box integration tests&lt;/strong&gt; - not unit tests that mock away the exact failure modes you're trying to protect against&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI environment parity&lt;/strong&gt; - assume the runner has nothing your &lt;code&gt;.gitignore&lt;/code&gt; excluded&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cluster module is powerful. But its standard usage pattern gives you the illusion of resilience, not the real thing.&lt;/p&gt;

&lt;p&gt;If you'd like to see the full implementation, the source code for Torus Proxy is on &lt;a href="https://github.com/Ashish-Barmaiya/torus-proxy" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>node</category>
      <category>devops</category>
      <category>backend</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why I Ripped stream.pipe() Out of My Node.js API Gateway</title>
      <dc:creator>Ashish Barmaiya</dc:creator>
      <pubDate>Fri, 27 Mar 2026 16:09:13 +0000</pubDate>
      <link>https://dev.to/ashishbarmaiya/why-i-ripped-streampipe-out-of-my-nodejs-api-gateway-2ekl</link>
      <guid>https://dev.to/ashishbarmaiya/why-i-ripped-streampipe-out-of-my-nodejs-api-gateway-2ekl</guid>
      <description>&lt;p&gt;When I started building Torus, a multi-core Layer 7 Edge API Gateway from scratch in Node.js, I handled incoming network requests the way I had always seen it done in standard web applications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;TypeScript&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;forwardToBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked perfectly for lightweight tests. But as I started pushing concurrent loads and larger payloads through the proxy, my server began to choke. CPU usage spiked to 100%, the event loop lagged, and memory consumption grew uncontrollably until the process crashed.&lt;/p&gt;

&lt;p&gt;I had fallen into a classic architectural trap: I was dragging raw TCP payload bytes directly into the V8 JavaScript engine's memory heap.&lt;/p&gt;

&lt;p&gt;Because the V8 heap has a strict memory limit, pulling massive payloads into user-space memory forces the Node.js Garbage Collector (GC) to work overtime. The GC halts the single-threaded event loop to clean up the allocated memory, effectively stalling every other active network connection in the proxy.&lt;/p&gt;

&lt;p&gt;I learned a fundamental rule of proxy engineering the hard way: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Proxies shouldn't read data; they should just move it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To build a production-grade gateway, I realized I had to bypass the V8 heap entirely. I needed to keep the data in raw C++ memory blocks and move it to the Operating System level. But as I refactored my routing logic to achieve this, I stumbled into a silent, catastrophic flaw in the standard Node.js stream API that brought my entire test suite to a halt.&lt;/p&gt;

&lt;h2&gt;
  The First Evolution: Bypassing V8 with &lt;code&gt;.pipe()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The architectural fix required a fundamental shift in how I viewed the data. I had to stop treating payloads as static variables and start treating them as flowing water.&lt;/p&gt;

&lt;p&gt;I didn't need to load an entire 50MB file into memory before forwarding it. I only needed to hold a few kilobytes in a temporary buffer, flush it to the destination, and reuse that memory space.&lt;/p&gt;

&lt;p&gt;In Node.js, this is exactly what the &lt;code&gt;node:stream&lt;/code&gt; module and &lt;code&gt;Buffer&lt;/code&gt; objects are designed for.&lt;/p&gt;

&lt;p&gt;A Buffer in Node.js allocates its memory outside the V8 JavaScript heap, in raw C++ memory blocks managed at the OS level. By keeping the network chunks as raw Buffers, the payload never enters the JavaScript heap, so the V8 Garbage Collector never has to scan or compact it - only the tiny Buffer handle lives on the heap - leaving the event loop free to handle other connections.&lt;/p&gt;
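
&lt;p&gt;You can verify this with &lt;code&gt;process.memoryUsage()&lt;/code&gt; - a quick sketch (the exact numbers will vary by platform and Node version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const before = process.memoryUsage();

// 50MB allocated in raw C++ memory, outside the V8 heap.
const payload = Buffer.alloc(50 * 1024 * 1024);

const after = process.memoryUsage();
console.log(after.heapUsed - before.heapUsed); // tiny: just the Buffer handle
console.log(after.external - before.external); // ~52428800: the payload itself
console.log(payload.length);                   // keep the Buffer referenced
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;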

&lt;p&gt;To wire this up, I utilized the native &lt;code&gt;.pipe()&lt;/code&gt; method to connect the readable client stream to the writable backend stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;TypeScript&lt;/span&gt;

&lt;span class="c1"&gt;// Connects the incoming ReadableStream directly to the outgoing WritableStream&lt;/span&gt;
&lt;span class="nx"&gt;clientReq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proxyReq&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single line of code is the proxy's plumbing. It takes the incoming TCP stream, moves the raw C++ buffers along chunk by chunk, automatically manages backpressure (ensuring a fast client doesn't overwhelm a slow backend connection), and pushes the bytes directly out to the routing pool.&lt;/p&gt;
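
&lt;p&gt;To appreciate what that one line buys you, here is roughly the manual backpressure dance it automates, sketched with the same &lt;code&gt;clientReq&lt;/code&gt;/&lt;code&gt;proxyReq&lt;/code&gt; pair:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;clientReq.on('data', (chunk: Buffer) =&amp;gt; {
  // write() returns false when the destination's internal buffer is full
  const canTakeMore = proxyReq.write(chunk);
  if (!canTakeMore) {
    clientReq.pause(); // stop reading instead of buffering in the heap
    proxyReq.once('drain', () =&amp;gt; clientReq.resume()); // resume once flushed
  }
});
clientReq.on('end', () =&amp;gt; proxyReq.end());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;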

&lt;p&gt;My CPU usage plummeted. The memory footprint stayed flat, regardless of how large the incoming payloads were. It felt like I had solved the scaling problem entirely.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;.pipe()&lt;/code&gt; was hiding a massive, silent vulnerability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plot Twist: The Silent Socket Leak
&lt;/h2&gt;

&lt;p&gt;I thought I had engineered the perfect solution. The proxy was fast, the CPU was idle, and the memory footprint stayed completely flat, regardless of payload size.&lt;/p&gt;

&lt;p&gt;Then I ran my integration test suite.&lt;/p&gt;

&lt;p&gt;All the assertions passed. I got the green checkmarks. But the terminal just froze. Jest refused to exit, eventually spitting out that infuriating warning: "Jest did not exit one second after the test run has completed."&lt;/p&gt;

&lt;p&gt;My initial reaction was to treat it like a standard web app bug. I meticulously checked my teardown logic, making sure &lt;code&gt;proxyServer.close()&lt;/code&gt; was being called and my Redis clients were fully disconnected. I ran the tests again. It still hung.&lt;/p&gt;

&lt;p&gt;I had to drop down to the OS level to understand what was actually happening. The Node.js event loop is designed to keep running for as long as even one active I/O handle (like a &lt;code&gt;net.Socket&lt;/code&gt;) remains in its queue. Something was keeping a socket alive.&lt;/p&gt;
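
&lt;p&gt;Jest's &lt;code&gt;--detectOpenHandles&lt;/code&gt; flag exists for exactly this situation. You can also interrogate Node directly; this sketch leans on an undocumented but long-stable internal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// _getActiveHandles() is undocumented, but it lists every handle currently
// pinning the event loop open, which is exactly what we need to see here.
const handles = (process as any)._getActiveHandles();
console.log(handles.map((h: any) =&amp;gt; h.constructor.name)); // e.g. [ 'Socket' ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;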

&lt;p&gt;The culprit was &lt;code&gt;.pipe()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When my Jest test fired a dummy request through the proxy and then disconnected, the client-side socket closed gracefully. But I learned a hard lesson about Node.js streams: &lt;code&gt;.pipe()&lt;/code&gt; blindly pushes data; it does not propagate errors or premature closes between the streams it connects.&lt;/p&gt;

&lt;p&gt;When the client dropped, &lt;code&gt;.pipe()&lt;/code&gt; did not send an error or close event to the destination stream. It left the backend connection completely open. The proxy was sitting there holding a dead connection to the backend, waiting for network bytes that would never arrive.&lt;/p&gt;
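
&lt;p&gt;What I should have written, for every single stream pair, was explicit teardown wiring in both directions. A sketch of the boilerplate &lt;code&gt;.pipe()&lt;/code&gt; leaves to you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;clientSocket.pipe(backendSocket);

// .pipe() wires none of this for you; miss one handler and you leak a socket
clientSocket.on('error', () =&amp;gt; backendSocket.destroy());
clientSocket.on('close', () =&amp;gt; backendSocket.destroy());
backendSocket.on('error', () =&amp;gt; clientSocket.destroy());
backendSocket.on('close', () =&amp;gt; clientSocket.destroy());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;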

&lt;p&gt;I had built a machine that generated half-open sockets. In production, this would silently exhaust the process's File Descriptors (FDs). Every dropped client connection would leak an FD until the process hit its limit and every new connection or file open started failing with an &lt;code&gt;EMFILE&lt;/code&gt; error. (On Linux, watching the count of entries under &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/fd&lt;/code&gt; climb is the telltale sign.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: &lt;code&gt;stream.pipeline()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The Node.js core maintainers knew &lt;code&gt;.pipe()&lt;/code&gt; was dangerously naive for production infrastructure. That is exactly why they introduced &lt;code&gt;stream.pipeline()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of blindly shoving data from one socket to another, &lt;code&gt;.pipeline()&lt;/code&gt; acts as a unified state machine that monitors the entire stream chain. It pushes the responsibility of socket teardown back to the Node.js core networking stack where it belongs.&lt;/p&gt;

&lt;p&gt;If any stream in the pipeline fails, throws an error, or abruptly closes (like a client dropping off with an &lt;code&gt;ECONNRESET&lt;/code&gt;), &lt;code&gt;.pipeline()&lt;/code&gt; automatically intercepts it. It destroys all other connected streams in that specific chain and bubbles up a single error for you to catch.&lt;/p&gt;

&lt;p&gt;I removed every instance of &lt;code&gt;.pipe()&lt;/code&gt; and the dozens of lines of manual &lt;code&gt;.on('error')&lt;/code&gt; spaghetti I had written. Because raw TCP proxying requires bidirectional data flow, I replaced it with two pipelines running in parallel, one per direction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;TypeScript&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clientSocket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;backendSocket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;backendSocket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;clientSocket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// If either side drops, the pipeline throws, and we clean up natively.&lt;/span&gt;
  &lt;span class="nx"&gt;clientSocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;backendSocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment I swapped to this architecture and ran my integration suite, the terminal didn't hang. Jest executed all 18 network tests and exited flawlessly in 2.7 seconds. The event loop was instantly cleared. The silent socket leak was completely eradicated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fercx9tbcdrgcidzs0u7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fercx9tbcdrgcidzs0u7m.png" alt="Terminal screenshot showing 18 Jest network tests passing flawlessly in 2.7 seconds."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Trust Nothing, Understand Everything
&lt;/h2&gt;

&lt;p&gt;Building a multi-core Edge Gateway from scratch taught me that I cannot blindly trust high-level abstractions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stream.pipe()&lt;/code&gt; looks elegant in a standard web tutorial, but in the trenches of raw TCP networking it is a massive liability. If you are building infrastructure that handles thousands of concurrent connections, you must understand the Operating System-level lifecycle of your File Descriptors and your sockets. If you don't, your system will slowly bleed to death under load, and your logs won't even tell you why.&lt;/p&gt;

&lt;p&gt;If you want to see the exact implementation of this bidirectional TCP routing, you can check out the raw source code for &lt;a href="https://github.com/Ashish-Barmaiya/torus-proxy" rel="noopener noreferrer"&gt;Torus Proxy on my GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>node</category>
      <category>backend</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
