<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Empellio.com</title>
    <description>The latest articles on DEV Community by Empellio.com (@empellio).</description>
    <link>https://dev.to/empellio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800458%2Fd0f73321-a9b0-4bdf-be6a-1cb38ddd5dea.png</url>
      <title>DEV Community: Empellio.com</title>
      <link>https://dev.to/empellio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/empellio"/>
    <language>en</language>
    <item>
      <title>Health Checks for Node.js Apps — What They Are, Why They Matter, and How to Build Them</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Mon, 23 Mar 2026 10:14:28 +0000</pubDate>
      <link>https://dev.to/empellio/health-checks-for-nodejs-apps-what-they-are-why-they-matter-and-how-to-build-them-42p0</link>
      <guid>https://dev.to/empellio/health-checks-for-nodejs-apps-what-they-are-why-they-matter-and-how-to-build-them-42p0</guid>
      <description>&lt;p&gt;A health check is how your infrastructure answers the question: &lt;em&gt;"Is this instance actually working right now?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Without one, a load balancer will happily route traffic to an instance whose database connection pool is exhausted, whose memory is full, or whose app started but can't reach any external services. The process is running — but it's not healthy.&lt;/p&gt;

&lt;p&gt;Health checks fix this by giving infrastructure a reliable signal to act on.&lt;/p&gt;

&lt;h2&gt;Who Uses Health Checks&lt;/h2&gt;

&lt;p&gt;Three pieces of infrastructure rely on your health endpoint:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process managers&lt;/strong&gt; (Oxmgr, PM2) — use health checks to determine when a newly spawned process is ready to receive traffic during rolling restarts. Without this, the manager might route traffic to a process that started but isn't ready. See &lt;a href="https://oxmgr.empellio.com/blog/zero-downtime-deployment" rel="noopener noreferrer"&gt;Zero-Downtime Deployment&lt;/a&gt; for the full rolling restart flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load balancers&lt;/strong&gt; (Nginx, HAProxy, AWS ALB) — poll health endpoints continuously. If an instance fails enough consecutive checks, the load balancer marks it down and stops routing traffic to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container orchestrators&lt;/strong&gt; (Kubernetes, ECS) — use liveness probes (is it running?) and readiness probes (is it ready for traffic?) to decide when to restart containers or route traffic.&lt;/p&gt;
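&lt;p&gt;To make the load balancer side concrete, here is a sketch of an HAProxy backend that polls the endpoint described in this article. The backend name, server names, and addresses are placeholders; the thresholds mirror common defaults:&lt;/p&gt;

```text
backend api_backend
    # Probe GET /health on each server and require an HTTP 200
    option httpchk GET /health
    http-check expect status 200
    # Poll every 10s; mark down after 3 failures (fall), up after 2 successes (rise)
    server app1 10.0.0.11:3000 check inter 10s fall 3 rise 2
    server app2 10.0.0.12:3000 check inter 10s fall 3 rise 2
```

&lt;p&gt;Note how &lt;code&gt;fall&lt;/code&gt; and &lt;code&gt;rise&lt;/code&gt; play the same role as the &lt;code&gt;unhealthy_threshold&lt;/code&gt; and &lt;code&gt;healthy_threshold&lt;/code&gt; settings shown later for Oxmgr.&lt;/p&gt;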

&lt;h2&gt;The Minimal Health Endpoint&lt;/h2&gt;

&lt;p&gt;At minimum, a health check returns 200 when the app is serving requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better than nothing — it confirms the process is alive and Express is responding — but it doesn't tell you if the app can actually do useful work.&lt;/p&gt;

&lt;h2&gt;A Production-Grade Health Check&lt;/h2&gt;

&lt;p&gt;A real health check verifies that dependencies are reachable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./db.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./cache.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// 1. Database check&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Redis check (if applicable)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Disk space check (for log-heavy apps)&lt;/span&gt;
  &lt;span class="c1"&gt;// Optional — add if relevant&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
      &lt;span class="na"&gt;responseTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;npm_package_version&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;nodeVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
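&lt;p&gt;The optional disk-space check from step 3 could be sketched like this, assuming Node 18.15+ (which added &lt;code&gt;statfs&lt;/code&gt; to &lt;code&gt;node:fs/promises&lt;/code&gt;); the path and the 10% free-space threshold are illustrative, not prescriptive:&lt;/p&gt;

```javascript
// Sketch of the optional disk-space check (step 3 above).
// Assumes Node 18.15+, which added statfs to node:fs/promises.
// The path and 10% free-space threshold are illustrative.
import { statfs } from 'node:fs/promises';

async function checkDiskSpace(path = '/', minFreeRatio = 0.1) {
  const stats = await statfs(path);
  // bavail = blocks available to unprivileged users; blocks = total blocks
  const freeRatio = stats.bavail / stats.blocks;
  if (freeRatio >= minFreeRatio) {
    return { status: 'ok', free: `${Math.round(freeRatio * 100)}%` };
  }
  return { status: 'error', message: `only ${Math.round(freeRatio * 100)}% disk free` };
}
```

&lt;p&gt;It would slot into the handler as &lt;code&gt;checks.disk = await checkDiskSpace()&lt;/code&gt;, downgrading &lt;code&gt;overallStatus&lt;/code&gt; on error the same way the database and Redis checks do.&lt;/p&gt;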



&lt;p&gt;Example response when healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"redis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"uptime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"responseTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4ms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.3.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nodeVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v20.11.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When degraded (database unreachable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"degraded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"connect ECONNREFUSED 127.0.0.1:5432"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"redis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HTTP status code 503 tells the load balancer to stop routing to this instance.&lt;/p&gt;

&lt;h2&gt;Liveness vs Readiness&lt;/h2&gt;

&lt;p&gt;Kubernetes (and good health check design in general) separates two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Liveness:&lt;/strong&gt; Is the process alive and not deadlocked?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If this fails, restart the container&lt;/li&gt;
&lt;li&gt;Should almost always return 200 (even a degraded app is alive)&lt;/li&gt;
&lt;li&gt;Should never check external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Readiness:&lt;/strong&gt; Is the process ready to handle traffic?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If this fails, stop routing traffic to this instance (but don't restart it)&lt;/li&gt;
&lt;li&gt;Should check all dependencies needed to serve requests&lt;/li&gt;
&lt;li&gt;This is what people typically mean by "health check"&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Liveness — just confirms the event loop is running&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health/live&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;alive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Readiness — confirms app can serve useful requests&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health/ready&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/ready&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Startup Probe&lt;/h2&gt;

&lt;p&gt;For slow-starting apps (heavy module loading, DB migrations on startup), add a startup probe. This gives the app time to initialize without triggering liveness failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;    &lt;span class="c1"&gt;# 30 × 10s = 5 minutes to start&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the startup probe passes, liveness and readiness probes take over.&lt;/p&gt;

&lt;h2&gt;Configuring Health Checks in Oxmgr&lt;/h2&gt;

&lt;p&gt;Oxmgr uses health checks to gate rolling restarts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;        &lt;span class="c"&gt;# poll every 10 seconds&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;          &lt;span class="c"&gt;# fail if no response within 5s&lt;/span&gt;
&lt;span class="py"&gt;healthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;     &lt;span class="c"&gt;# must pass 2 consecutive checks to be considered healthy&lt;/span&gt;
&lt;span class="py"&gt;unhealthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="c"&gt;# must fail 3 consecutive checks to be considered unhealthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During a rolling restart, Oxmgr:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Starts a new instance&lt;/li&gt;
&lt;li&gt;Polls the health endpoint every &lt;code&gt;interval_secs&lt;/code&gt; seconds&lt;/li&gt;
&lt;li&gt;Waits for &lt;code&gt;healthy_threshold&lt;/code&gt; consecutive successes&lt;/li&gt;
&lt;li&gt;Only then stops the old instance and moves to the next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the new instance never passes health checks, the rollout stops and you get an error report.&lt;/p&gt;

&lt;h2&gt;Health Check Anti-Patterns&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Checking things that don't affect request serving:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad — CPU temperature isn't your app's responsibility&lt;/span&gt;
&lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cpuTemp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getCpuTemp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Good — check things your app actually needs&lt;/span&gt;
&lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkDatabase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Slow health checks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your health endpoint should respond in under 100ms. If your database check takes 2 seconds, something's wrong — and more importantly, the load balancer's probe will time out and mark the instance unhealthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Add timeouts to dependency checks&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;race&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DB check timed out&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Caching health check results:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some teams cache health responses to avoid hammering the database on every poll. The problem: you might serve stale "ok" results while the database is down.&lt;/p&gt;

&lt;p&gt;If your database can't handle one &lt;code&gt;SELECT 1&lt;/code&gt; query every 10 seconds, it has a problem that a health check cache won't fix.&lt;/p&gt;
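
&lt;p&gt;If poll volume still worries you, coalesce concurrent checks instead of caching results over time: callers that arrive while a probe is in flight share its result, but every completed probe reflects the database's current state. A minimal sketch (the wrapped &lt;code&gt;check&lt;/code&gt; function is assumed, e.g. the &lt;code&gt;checkDatabase&lt;/code&gt; from earlier):&lt;/p&gt;

```javascript
// Share one in-flight probe among concurrent callers, but never reuse a
// completed result. Sketch; `check` is any async dependency check.
function coalesce(check) {
  let inflight = null;
  return () => {
    if (!inflight) {
      // Reset once the probe settles so the next caller gets a fresh check
      inflight = check().finally(() => { inflight = null; });
    }
    return inflight;
  };
}

// Hypothetical usage with the checkDatabase() from the earlier examples:
// const checkDatabaseOnce = coalesce(checkDatabase);
```

&lt;p&gt;Unlike a time-based cache, this can never report a stale "ok": the moment the database goes down, the very next probe fails.&lt;/p&gt;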

&lt;p&gt;&lt;strong&gt;Making health checks require authentication:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Health endpoints must be reachable without credentials (load balancers and process managers have none). Don't put them behind auth middleware; if exposure is a concern, restrict access at the network level instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fine for other routes&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;authenticate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Health endpoint should be registered before auth middleware&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;authenticate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use a separate server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Main app on 3000&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Health check on 3001 — no auth, internal network only&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;healthApp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;healthApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;healthApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Nginx Load Balancer Configuration
&lt;/h2&gt;

&lt;p&gt;Configure Nginx to use your health endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3002&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Nginx Plus / open-source with health_check module&lt;/span&gt;
    &lt;span class="c1"&gt;# health_check interval=10s fails=3 passes=2 uri=/health;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://api&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_next_upstream&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt; &lt;span class="s"&gt;timeout&lt;/span&gt; &lt;span class="s"&gt;http_502&lt;/span&gt; &lt;span class="s"&gt;http_503&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_next_upstream_tries&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active health checking with Nginx (requires Nginx Plus; the &lt;code&gt;health_check&lt;/code&gt; directive comes from &lt;code&gt;ngx_http_upstream_hc_module&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;zone&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="mi"&gt;64k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="s"&gt;api_health&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;status&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;header&lt;/span&gt; &lt;span class="s"&gt;Content-Type&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt; &lt;span class="sr"&gt;application/json;&lt;/span&gt;
    &lt;span class="s"&gt;body&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt; &lt;span class="sr"&gt;'"status":"ok"';&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://api&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;health_check&lt;/span&gt; &lt;span class="s"&gt;interval=10&lt;/span&gt; &lt;span class="s"&gt;fails=3&lt;/span&gt; &lt;span class="s"&gt;passes=2&lt;/span&gt; &lt;span class="s"&gt;uri=/health&lt;/span&gt; &lt;span class="s"&gt;match=api_health&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing Your Health Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic test&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://localhost:3000/health

&lt;span class="c"&gt;# Test with timeout (simulates load balancer poll)&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 http://localhost:3000/health

&lt;span class="c"&gt;# Continuous polling (simulates process manager)&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s1"&gt;'curl -s http://localhost:3000/health | jq .'&lt;/span&gt;

&lt;span class="c"&gt;# Test what happens when DB is down&lt;/span&gt;
&lt;span class="c"&gt;# Stop your database, then:&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://localhost:3000/health
&lt;span class="c"&gt;# Should return 503&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;A production health check should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Return 200 only when the app can actually serve requests&lt;/li&gt;
&lt;li&gt;Return 503 when dependencies are unavailable&lt;/li&gt;
&lt;li&gt;Respond in under 100ms&lt;/li&gt;
&lt;li&gt;Be accessible without authentication&lt;/li&gt;
&lt;li&gt;Check real dependencies (database, cache) — not just process liveness&lt;/li&gt;
&lt;li&gt;Expose useful metadata (version, uptime, PID)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wire it into your process manager, load balancer, and deployment pipeline — and your infrastructure will know exactly when your app is ready and when it isn't.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs#health-checks" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for process manager health check configuration.&lt;/p&gt;

</description>
      <category>node</category>
      <category>pm2</category>
      <category>devops</category>
    </item>
    <item>
      <title>Zero-Downtime Deployment — How to Deploy Without Dropping a Single Request</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:23:02 +0000</pubDate>
      <link>https://dev.to/empellio/zero-downtime-deployment-how-to-deploy-without-dropping-a-single-request-5245</link>
      <guid>https://dev.to/empellio/zero-downtime-deployment-how-to-deploy-without-dropping-a-single-request-5245</guid>
      <description>&lt;p&gt;Every time you deploy, you have a choice: take the app down briefly or keep it up. For most production systems, downtime — even 5 seconds — is unacceptable.&lt;/p&gt;

&lt;p&gt;This guide covers the techniques that make zero-downtime deployments possible, from the simplest single-server setup to multi-server strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Deployments Cause Downtime
&lt;/h2&gt;

&lt;p&gt;The naive deploy sequence looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bad — causes downtime&lt;/span&gt;
pm2 stop api
git pull
npm run build
pm2 start api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Between &lt;code&gt;stop&lt;/code&gt; and the moment the new process starts and passes health checks, your app is down. Connections get refused. Users see errors. Load balancers fail health checks and trigger alerts.&lt;/p&gt;

&lt;p&gt;Even if the downtime is 2 seconds, it shows up in your error rate and p99 latency graphs.&lt;/p&gt;
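
&lt;p&gt;You can see the gap for yourself by hammering an endpoint while you deploy and counting failures. A rough measurement sketch (Node 18+ global &lt;code&gt;fetch&lt;/code&gt;; the URL is whatever your app serves):&lt;/p&gt;

```javascript
// Poll a URL for a while and count successes vs. failures. During a naive
// deploy, the "failed" bucket captures the downtime window. Sketch only.
async function measureAvailability(url, { intervalMs = 100, durationMs = 30_000 } = {}) {
  const counts = { ok: 0, failed: 0 };
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    try {
      (await fetch(url)).ok ? counts.ok++ : counts.failed++;
    } catch {
      counts.failed++; // connection refused: the process was down
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return counts;
}

// Hypothetical run against a local app while deploying:
// measureAvailability('http://localhost:3000/health').then(console.log);
```

&lt;p&gt;Run it during a &lt;code&gt;pm2 stop&lt;/code&gt;/&lt;code&gt;start&lt;/code&gt; cycle and the failure count makes the "2 seconds of downtime" concrete.&lt;/p&gt;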

&lt;h2&gt;
  
  
  The Three Prerequisites
&lt;/h2&gt;

&lt;p&gt;Zero-downtime deployment requires three things from your application:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Graceful shutdown&lt;/strong&gt; — the app must finish in-flight requests before exiting. If it doesn't, requests that hit the dying instance get cut off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fast startup&lt;/strong&gt; — the new instance must be able to pass health checks quickly. An app that takes 30 seconds to initialize creates a 30-second window of reduced capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Stateless design&lt;/strong&gt; — if session state lives in-memory, it dies with the process. Use Redis or a database for state so any instance can handle any request.&lt;/p&gt;
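
&lt;p&gt;In practice that means swapping any in-process &lt;code&gt;Map&lt;/code&gt; of sessions for a shared store. A sketch of the shape, assuming a Redis-like client exposing &lt;code&gt;get&lt;/code&gt;/&lt;code&gt;set&lt;/code&gt; (e.g. ioredis; the &lt;code&gt;redis&lt;/code&gt; variable and key prefix are made up):&lt;/p&gt;

```javascript
// Bad: lives and dies with this one process
const localSessions = new Map();

// Better: any instance can read or write, so a restarted worker picks up
// existing sessions. `redis` is an assumed client exposing get/set.
const sessionStore = {
  async get(id) {
    const raw = await redis.get(`sess:${id}`);
    return raw ? JSON.parse(raw) : null;
  },
  async set(id, data, ttlSecs = 3600) {
    // ioredis-style expiry argument; adjust for your client
    await redis.set(`sess:${id}`, JSON.stringify(data), 'EX', ttlSecs);
  },
};
```

&lt;p&gt;Once state lives outside the process, killing an instance mid-rollout loses nothing, which is exactly what rolling restarts assume.&lt;/p&gt;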

&lt;h2&gt;
  
  
  Implementing Graceful Shutdown
&lt;/h2&gt;

&lt;p&gt;This is non-negotiable. Your app must handle SIGTERM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Track active connections&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;close&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Simulate some work&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;world&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; listening on :3000`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Graceful shutdown handler&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;shutdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; received, shutting down gracefully...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Stop accepting new connections&lt;/span&gt;
  &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Error closing server:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// All connections closed — safe to exit&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Server closed. Exiting.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Destroy idle connections immediately&lt;/span&gt;
  &lt;span class="c1"&gt;// (active connections will close after request completes)&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Force exit if graceful shutdown takes too long&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Graceful shutdown timed out, forcing exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test your graceful shutdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the app, note the PID&lt;/span&gt;
node server.js &amp;amp;
&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="c"&gt;# Send a slow request&lt;/span&gt;
curl http://localhost:3000/ &amp;amp;

&lt;span class="c"&gt;# Immediately send SIGTERM&lt;/span&gt;
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt;

&lt;span class="c"&gt;# The slow request should still complete&lt;/span&gt;
&lt;span class="c"&gt;# The server should exit cleanly after&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rolling Restart
&lt;/h2&gt;

&lt;p&gt;A rolling restart replaces instances one-by-one rather than all at once. While one instance shuts down gracefully, the others continue serving traffic. When the new instance passes health checks, the next old instance shuts down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: [Worker 0: old] [Worker 1: old] [Worker 2: old]
                                                       ↓ (start new Worker 0)
Step 2: [Worker 0: NEW] [Worker 1: old] [Worker 2: old]   ← health check passes
                                         ↓ (stop old Worker 1, start new Worker 1)
Step 3: [Worker 0: NEW] [Worker 1: NEW] [Worker 2: old]   ← health check passes
                                                       ↓ (stop old Worker 2)
Step 4: [Worker 0: NEW] [Worker 1: NEW] [Worker 2: NEW]   ← deploy complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Oxmgr:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After pulling new code and building&lt;/span&gt;
oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oxmgr handles the rollout automatically. If any new instance fails its health check, the rollout stops and remaining instances keep running the old code. You can roll back manually or fix the issue and re-deploy.&lt;/p&gt;

&lt;p&gt;The health check endpoint is the key to making this work — see &lt;a href="https://oxmgr.empellio.com/blog/nodejs-health-checks" rel="noopener noreferrer"&gt;Health Checks for Node.js Apps&lt;/a&gt; for how to build one that actually catches failures.&lt;/p&gt;

&lt;p&gt;Configure the health check that guards each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;       &lt;span class="c"&gt;# check frequently during rollout&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;healthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="c"&gt;# must pass 2 consecutive checks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With PM2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pm2 reload api    &lt;span class="c"&gt;# rolling restart&lt;/span&gt;
&lt;span class="c"&gt;# NOT: pm2 restart api  (that restarts all at once)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Blue-Green Deployment
&lt;/h2&gt;

&lt;p&gt;Blue-green keeps two complete environments — "blue" (current) and "green" (new). You deploy to green while blue serves traffic, then switch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Load Balancer
                         │
              ┌──────────┴──────────┐
              │                     │
           [BLUE]                [GREEN]
       (running v1.2)        (running v1.3)
        ← serving traffic     ← being deployed to
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When green is ready and passes health checks, flip the load balancer. Rollback = flip back to blue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx upstream swap:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/nginx/conf.d/upstream.conf&lt;/span&gt;
&lt;span class="c1"&gt;# Before deploy: points to blue&lt;/span&gt;
&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;# blue&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# After deploy: points to green&lt;/span&gt;
&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;# green&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to green (port 3001)&lt;/span&gt;
&lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3001 oxmgr start &lt;span class="nt"&gt;--config&lt;/span&gt; oxfile.green.toml

&lt;span class="c"&gt;# Wait for green health check&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; http://localhost:3001/health&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Swap nginx upstream&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/3000/3001/'&lt;/span&gt; /etc/nginx/conf.d/upstream.conf
nginx &lt;span class="nt"&gt;-s&lt;/span&gt; reload

&lt;span class="c"&gt;# Old blue is now free — keep it warm for rollback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
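&lt;p&gt;Rolling back is the same swap in reverse. A sketch, assuming the &lt;code&gt;upstream.conf&lt;/code&gt; layout above and that blue was kept warm on port 3000; the &lt;code&gt;rollback&lt;/code&gt; helper name is illustrative:&lt;/p&gt;

```shell
# Illustrative rollback helper. Assumes the upstream.conf shown above
# and that the blue instance on port 3000 is still running.
rollback() {
  local conf="${1:-/etc/nginx/conf.d/upstream.conf}"
  # Edit a copy, then install it atomically so nginx never
  # reads a half-written config file.
  sed 's/3001/3000/' "$conf" > "$conf.tmp"
  mv "$conf.tmp" "$conf"
  nginx -s reload
}
```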



&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✓ Instant rollback (flip the load balancer back)&lt;/li&gt;
&lt;li&gt;✓ No in-flight requests lost during the switch&lt;/li&gt;
&lt;li&gt;✗ Requires 2× the resources during deployment&lt;/li&gt;
&lt;li&gt;✗ More complex setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Deploy Script
&lt;/h2&gt;

&lt;p&gt;A production-grade deploy script that combines a rolling restart with health verification; if the reload or the health check fails, it exits non-zero and the old version keeps serving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;APP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;
&lt;span class="nv"&gt;HEALTH_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="nv"&gt;MAX_WAIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Deploy started at &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;

&lt;span class="c"&gt;# 1. Pull latest code&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pulling code..."&lt;/span&gt;
git fetch origin main
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; origin/main

&lt;span class="c"&gt;# 2. Install dependencies if lockfile changed&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;git diff HEAD@&lt;span class="o"&gt;{&lt;/span&gt;1&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"package-lock.json"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Installing dependencies..."&lt;/span&gt;
  npm ci &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 3. Build&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Building..."&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# 4. Rolling restart&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Reloading processes..."&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; oxmgr reload &lt;span class="nv"&gt;$APP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Reload failed. Checking if old version still running..."&lt;/span&gt;
  oxmgr status
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 5. Verify health&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Waiting for health check..."&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 &lt;span class="nv"&gt;$MAX_WAIT&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HEALTH_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Health check passed after &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;
    &lt;span class="nb"&gt;break
  &lt;/span&gt;&lt;span class="k"&gt;fi
  if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; &lt;span class="nv"&gt;$MAX_WAIT&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Health check failed after &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_WAIT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;1
&lt;span class="k"&gt;done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Deploy complete ==="&lt;/span&gt;
oxmgr status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable and run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x deploy.sh
./deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Health Check Endpoint
&lt;/h2&gt;

&lt;p&gt;Your &lt;code&gt;/health&lt;/code&gt; endpoint is the single most important endpoint in your app for deployments. Make it useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Check database&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Check Redis (if used)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Decide if Redis failure = unhealthy for your app&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return version and uptime for visibility&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;npm_package_version&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uptime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The process manager polls this endpoint. If it returns non-200, the new instance is considered unhealthy and the rolling restart stops.&lt;/p&gt;
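&lt;p&gt;The polling logic amounts to a small loop. Here is an illustrative sketch of what a manager does internally (the function name and parameters are made up, not oxmgr's actual code); note how it requires consecutive passes, mirroring &lt;code&gt;healthy_threshold = 2&lt;/code&gt; from the config earlier:&lt;/p&gt;

```shell
# Illustrative poller: require N consecutive passing checks before
# declaring an instance healthy; any failure resets the streak.
wait_healthy() {
  local url="$1" threshold="$2" max_attempts="$3"
  local passes=0 attempts=0
  while [ "$passes" -lt "$threshold" ]; do
    attempts=$((attempts + 1))
    if [ "$attempts" -gt "$max_attempts" ]; then
      return 1                    # never became healthy
    fi
    if curl -sf -o /dev/null "$url"; then
      passes=$((passes + 1))      # one more consecutive pass
    else
      passes=0                    # reset on any failure
    fi
    sleep 1
  done
  return 0
}
```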

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Killing the process before requests complete:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong — kills immediately&lt;/span&gt;
pm2 restart api

&lt;span class="c"&gt;# Right — waits for in-flight requests&lt;/span&gt;
pm2 reload api
oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Not handling SIGTERM in the app:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a SIGTERM handler, Node.js exits immediately when the process manager sends the graceful shutdown signal. In-flight requests get cut off.&lt;/p&gt;
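&lt;p&gt;A minimal handler looks like this, sketched against a plain &lt;code&gt;node:http&lt;/code&gt; server; the 10-second force-exit window is an arbitrary example value:&lt;/p&gt;

```javascript
import { createServer } from 'node:http';

const server = createServer(function (req, res) {
  res.end('ok');
});
server.listen(3000);

process.on('SIGTERM', function () {
  console.log('SIGTERM received, draining in-flight requests...');
  // close() stops accepting new connections; existing requests finish.
  server.close(function () {
    process.exit(0);
  });
  // Safety net: force-exit if draining hangs (timeout value is arbitrary).
  setTimeout(function () { process.exit(1); }, 10000).unref();
});
```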

&lt;p&gt;&lt;strong&gt;Health endpoint that always returns 200:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Wrong — returns 200 even when broken&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="c1"&gt;// Right — actually checks dependencies&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkDatabase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploying with &lt;code&gt;npm install&lt;/code&gt; instead of &lt;code&gt;npm ci&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install&lt;/code&gt; can silently update packages. &lt;code&gt;npm ci&lt;/code&gt; installs exactly what's in &lt;code&gt;package-lock.json&lt;/code&gt;. Always use &lt;code&gt;npm ci&lt;/code&gt; in production deploys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Zero-downtime deployment requires three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown&lt;/strong&gt; in your app code (handle SIGTERM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling restart&lt;/strong&gt; from your process manager (&lt;code&gt;oxmgr reload&lt;/code&gt;, not &lt;code&gt;restart&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; that accurately reflect whether the app is ready&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this in place, deploys become invisible to users — no connection errors, no 502s, no alert storms at 2am.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for health check configuration and the &lt;a href="https://oxmgr.empellio.com/blog/how-to-deploy-nodejs-production" rel="noopener noreferrer"&gt;deployment guide&lt;/a&gt; for the full setup.&lt;/p&gt;

</description>
      <category>node</category>
      <category>oxmgr</category>
      <category>pm2</category>
      <category>devops</category>
    </item>
    <item>
      <title>Node.js Clustering — How to Use All CPU Cores in Production</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Sun, 15 Mar 2026 13:09:19 +0000</pubDate>
      <link>https://dev.to/empellio/nodejs-clustering-how-to-use-all-cpu-cores-in-production-1cj</link>
      <guid>https://dev.to/empellio/nodejs-clustering-how-to-use-all-cpu-cores-in-production-1cj</guid>
      <description>&lt;p&gt;Node.js runs on a single thread. That's not a bug — it's a deliberate design decision that makes async I/O simple. But it means that by default, a Node.js process uses exactly one CPU core, no matter how many your server has.&lt;/p&gt;

&lt;p&gt;On a 4-core VPS, three quarters of your compute is sitting idle.&lt;/p&gt;

&lt;p&gt;Clustering fixes this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clustering Means
&lt;/h2&gt;

&lt;p&gt;Clustering in Node.js means running multiple instances of your application — one per CPU core — all sharing the same port. With the built-in &lt;code&gt;cluster&lt;/code&gt; module, the primary process accepts incoming connections and distributes them to workers round-robin by default (on every platform except Windows).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      ┌──────────────────────────┐
                      │        Port 3000         │
                      └────────────┬─────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
         ┌────▼─────┐        ┌─────▼─────┐        ┌─────▼─────┐
         │ Worker 0 │        │ Worker 1  │        │ Worker 2  │
         │ (PID 101)│        │ (PID 102) │        │ (PID 103) │
         └──────────┘        └───────────┘        └───────────┘
              │                    │                    │
         CPU Core 0           CPU Core 1            CPU Core 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each worker is an independent Node.js process with its own event loop, memory heap, and V8 instance. They share no state in-memory — only the listening port.&lt;/p&gt;
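&lt;p&gt;That isolation is easy to trip over: any in-memory state (counters, caches, sessions) exists once per worker. The sketch below runs as a single process, but behind a cluster the count a client sees depends on which worker answers; state that must be shared belongs in Redis or the database:&lt;/p&gt;

```javascript
import { createServer } from 'node:http';

// Each worker process gets its own copy of this variable. Behind a
// cluster, the count a client sees depends on which worker answered.
let hits = 0;

const server = createServer(function (req, res) {
  hits = hits + 1;
  res.setHeader('content-type', 'application/json');
  res.end(JSON.stringify({ pid: process.pid, hits: hits }));
});

server.listen(3000);
```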

&lt;h2&gt;
  
  
  Option 1: The Cluster Module (Built-in)
&lt;/h2&gt;

&lt;p&gt;Node.js ships with a &lt;code&gt;cluster&lt;/code&gt; module that handles forking and connection distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;availableParallelism&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:os&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isPrimary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;availableParallelism&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Primary &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; running, forking &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; workers`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; died (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;), restarting...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// auto-restart crashed workers&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Each worker runs this&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Handled by worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; started`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Express:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;availableParallelism&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:os&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isPrimary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;availableParallelism&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; died, restarting`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello from worker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros of the cluster module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external dependencies&lt;/li&gt;
&lt;li&gt;Full control over fork behavior&lt;/li&gt;
&lt;li&gt;Can pass messages between primary and workers via IPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boilerplate in every project&lt;/li&gt;
&lt;li&gt;You manage restarts yourself&lt;/li&gt;
&lt;li&gt;No visibility into worker health from outside the process&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 2: Let the Process Manager Handle It
&lt;/h2&gt;

&lt;p&gt;The cleaner approach: keep your app code simple and let the process manager handle clustering.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Oxmgr&lt;/strong&gt;, set &lt;code&gt;instances&lt;/code&gt; in &lt;code&gt;oxfile.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="c"&gt;# explicit count&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Or use CPU count automatically:&lt;/span&gt;
&lt;span class="c"&gt;# instances = "max"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application code has no clustering logic — it's just a regular Express/Fastify/Hapi app listening on a port. Oxmgr forks it &lt;code&gt;instances&lt;/code&gt; times and distributes connections.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;PM2&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pm2 start app.js &lt;span class="nt"&gt;-i&lt;/span&gt; max     &lt;span class="c"&gt;# max = number of CPU cores&lt;/span&gt;
pm2 start app.js &lt;span class="nt"&gt;-i&lt;/span&gt; 4       &lt;span class="c"&gt;# explicit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in &lt;code&gt;ecosystem.config.js&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;apps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dist/server.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;max&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;exec_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this approach is better:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean separation of concerns — app doesn't know about clustering&lt;/li&gt;
&lt;li&gt;Process manager handles crashes across all instances&lt;/li&gt;
&lt;li&gt;Zero-downtime rolling restarts work across all instances&lt;/li&gt;
&lt;li&gt;Health checks monitor each instance independently&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 3: Worker Threads
&lt;/h2&gt;

&lt;p&gt;Don't confuse clustering with Worker Threads. They're different tools for different problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;Worker Threads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Scale I/O-bound workloads across cores&lt;/td&gt;
&lt;td&gt;Run CPU-intensive tasks off the main thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Separate heap per process&lt;/td&gt;
&lt;td&gt;Shared memory available (&lt;code&gt;SharedArrayBuffer&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication&lt;/td&gt;
&lt;td&gt;IPC (slower)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;postMessage&lt;/code&gt; or shared buffers (faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure isolation&lt;/td&gt;
&lt;td&gt;One crash = one process&lt;/td&gt;
&lt;td&gt;One crash = entire process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Web servers, API handlers&lt;/td&gt;
&lt;td&gt;Image processing, crypto, data parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use clusters for serving HTTP. Use Worker Threads for CPU-heavy operations within a single process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isMainThread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parentPort&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:worker_threads&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isMainThread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Main thread: handle HTTP, delegate heavy work&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./heavy-computation.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bigArray&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Computation result:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Worker thread: do the heavy lifting&lt;/span&gt;
  &lt;span class="nx"&gt;parentPort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;doExpensiveComputation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;parentPort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Many Instances Should You Run?
&lt;/h2&gt;

&lt;p&gt;The common advice is "one per CPU core." This is right for CPU-bound workloads. For I/O-bound Node.js apps (which is most of them), the relationship is more nuanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU-bound apps&lt;/strong&gt; (heavy computation, data processing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One instance per physical core&lt;/li&gt;
&lt;li&gt;More instances than cores = context switching overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;I/O-bound apps&lt;/strong&gt; (database queries, HTTP calls, file reads):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one per core&lt;/li&gt;
&lt;li&gt;Benchmark under load — you may get better results from fewer instances, each with more memory headroom&lt;/li&gt;
&lt;li&gt;The bottleneck is usually the database, not the CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical starting point:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="c"&gt;# 2-core VPS&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c"&gt;# 4-core server&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="c"&gt;# Leave one core for the OS and other processes&lt;/span&gt;
&lt;span class="c"&gt;# instances = 3  (on a 4-core machine under heavy load)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor CPU and memory under load. If workers are CPU-saturated (consistently &amp;gt;80%), adding instances helps. If they're mostly idle waiting for DB queries, more instances just means more memory usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky Sessions and Shared State
&lt;/h2&gt;

&lt;p&gt;Clustering breaks any in-memory session state. If user A hits Worker 0 on request 1 and Worker 1 on request 2, Worker 1 doesn't know about Worker 0's session data.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Use a shared session store (recommended):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express-session&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;RedisStore&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connect-redis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redisClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REDIS_URL&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;session&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RedisStore&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;redisClient&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SESSION_SECRET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;resave&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;saveUninitialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Use sticky sessions (not recommended for new projects):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sticky sessions route each client to the same worker consistently via a cookie or IP hash. This defeats the purpose of load balancing and creates uneven distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design stateless services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best approach for clustered apps: keep no state in memory. Use Redis, a database, or JWT tokens instead. Your app becomes trivially scalable — from 1 to 100 instances with no code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graceful Reloads with Clustering
&lt;/h2&gt;

&lt;p&gt;When you update your app, you want to restart workers without dropping connections. This is the rolling restart pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker 0: SIGTERM → drain connections → exit
                         ↓
                    New Worker 0 starts → health check passes
Worker 1: SIGTERM → drain connections → exit
                         ↓
                    New Worker 1 starts → health check passes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Oxmgr:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This performs a rolling restart across all instances. If a new instance fails its health check, the reload stops and old instances keep running — automatic rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Performance Test
&lt;/h2&gt;

&lt;p&gt;Before and after enabling clustering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install autocannon&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; autocannon

&lt;span class="c"&gt;# Benchmark single process (instances = 1)&lt;/span&gt;
autocannon &lt;span class="nt"&gt;-c&lt;/span&gt; 100 &lt;span class="nt"&gt;-d&lt;/span&gt; 10 http://localhost:3000

&lt;span class="c"&gt;# Enable clustering (instances = 4), restart, benchmark again&lt;/span&gt;
autocannon &lt;span class="nt"&gt;-c&lt;/span&gt; 100 &lt;span class="nt"&gt;-d&lt;/span&gt; 10 http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a CPU-bound workload, you should see throughput multiply roughly linearly with core count. On an I/O-bound workload with a fast database, the gains are less dramatic but still meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Node.js is single-threaded by default — clustering uses all CPU cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple apps:&lt;/strong&gt; use &lt;code&gt;instances&lt;/code&gt; in your process manager config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex apps:&lt;/strong&gt; consider the built-in cluster module for fine-grained control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU-heavy tasks:&lt;/strong&gt; use Worker Threads within a process, not more cluster instances&lt;/li&gt;
&lt;li&gt;Always use shared state storage (Redis) — in-memory state doesn't survive across workers&lt;/li&gt;
&lt;li&gt;Process managers handle the hard parts: restarts, health checks, rolling reloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for cluster configuration, or the &lt;a href="https://oxmgr.empellio.com/blog/how-to-deploy-nodejs-production" rel="noopener noreferrer"&gt;deployment guide&lt;/a&gt; for the full production setup walkthrough.&lt;/p&gt;

</description>
      <category>node</category>
      <category>performance</category>
      <category>pm2</category>
    </item>
    <item>
      <title>What Is Crash Recovery? How Process Managers Keep Your App Online After Failures</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Thu, 12 Mar 2026 10:35:22 +0000</pubDate>
      <link>https://dev.to/empellio/what-is-crash-recovery-how-process-managers-keep-your-app-online-after-failures-j4o</link>
      <guid>https://dev.to/empellio/what-is-crash-recovery-how-process-managers-keep-your-app-online-after-failures-j4o</guid>
      <description>&lt;h1&gt;
  
  
  What Is Crash Recovery?
&lt;/h1&gt;

&lt;p&gt;Your production app crashes. A bug slips through, memory spikes, a network dependency times out and throws an unhandled exception — it doesn't matter why. What matters is what happens next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash recovery&lt;/strong&gt; is the automatic process of detecting that an application has died and restarting it as fast as possible, before your users have time to notice.&lt;/p&gt;

&lt;p&gt;Without crash recovery, a process that crashes stays dead until a human intervenes. With it, the same crash can be invisible — the process restarts in milliseconds and keeps serving traffic.&lt;/p&gt;

&lt;p&gt;Crash recovery is one of the core reasons you need a &lt;a href="https://oxmgr.empellio.com/blog/what-is-a-process-manager" rel="noopener noreferrer"&gt;process manager&lt;/a&gt; in production — without one, there's nothing watching your app to trigger a restart.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Crash Recovery Works
&lt;/h2&gt;

&lt;p&gt;Every operating system gives processes a way to signal their exit. When a process terminates — whether it crashes, runs out of memory, or is killed — it emits an exit event with a status code.&lt;/p&gt;

&lt;p&gt;A process manager listens for these events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App process exits (status: 1 — error)
        ↓
Process manager receives exit event
        ↓
Check: is this process configured to restart?
        ↓
Yes → spawn new process
        ↓
Wait for process to be ready (health check or port listen)
        ↓
Resume serving traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical variable is how long this takes. The gap between the exit event and the new process serving traffic is your &lt;strong&gt;downtime window&lt;/strong&gt; — every request arriving inside it fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Determines Recovery Speed
&lt;/h2&gt;

&lt;p&gt;Three factors control how fast a process manager can recover from a crash:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Manager's Own Runtime
&lt;/h3&gt;

&lt;p&gt;A process manager written in a scripting language (JavaScript, Python, Ruby) has to do real work to respond to an exit event — the VM needs to be scheduled, the garbage collector might pause, the event loop might be busy.&lt;/p&gt;

&lt;p&gt;A compiled binary (Rust, Go, C) responds in microseconds. There's no VM, no GC, no interpreter. The exit handler fires and the spawn call happens immediately.&lt;/p&gt;

&lt;p&gt;This is the biggest factor. PM2 (Node.js daemon) recovers in ~400ms. Oxmgr (Rust binary) recovers in ~11ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Process Spawn Time
&lt;/h3&gt;

&lt;p&gt;Spawning a new process takes time regardless of the manager. For a Node.js app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS process creation: ~1–5ms&lt;/li&gt;
&lt;li&gt;Node.js startup: ~50–200ms (depending on module load time)&lt;/li&gt;
&lt;li&gt;Application initialization: varies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process manager can't control how fast your app starts. But it can start the spawn immediately after detecting the crash, rather than waiting for polling intervals.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Health Check Configuration
&lt;/h3&gt;

&lt;p&gt;After spawning, the manager needs to know when the process is ready. Two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Port listening&lt;/strong&gt; — wait until the process binds to its port. Simple, but doesn't guarantee the app is actually serving valid responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP health check&lt;/strong&gt; — poll an endpoint until it returns 200. Slower to confirm readiness, but more accurate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For crash recovery, the key is not waiting &lt;em&gt;longer&lt;/em&gt; than necessary. If your health check polls every 30 seconds but the replacement process is ready in 50ms, you can wait up to 30 seconds just to confirm a recovery that has already happened.&lt;/p&gt;
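&lt;p&gt;A readiness poll can be sketched with the global &lt;code&gt;fetch&lt;/code&gt; (Node 18+). The short interval is what keeps confirmation fast:&lt;/p&gt;

```javascript
// Poll a health endpoint until it returns 2xx, with a short interval
// so readiness is confirmed quickly after the process comes up.
async function waitUntilHealthy(url, { intervalMs = 250, timeoutMs = 30_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      const res = await fetch(url);
      if (res.ok) return true;               // 2xx: process is ready
    } catch {
      // connection refused: process not listening yet, keep polling
    }
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

&lt;p&gt;A failed &lt;code&gt;fetch&lt;/code&gt; just means the process isn't listening yet, so the loop treats it like a non-2xx response and retries.&lt;/p&gt;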

&lt;h2&gt;
  
  
  What Happens If an App Keeps Crashing?
&lt;/h2&gt;

&lt;p&gt;Automatic restart can create a "crash loop" — the app restarts, crashes immediately, restarts again, endlessly. This is worse than staying down in some ways: it makes logs unreadable and consumes CPU spinning up processes.&lt;/p&gt;

&lt;p&gt;Most process managers handle this with restart limits and backoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;          &lt;span class="c"&gt;# stop trying after 10 crashes&lt;/span&gt;
&lt;span class="py"&gt;restart_delay_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;     &lt;span class="c"&gt;# wait 500ms before each restart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exponential backoff is more sophisticated — the delay doubles each time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crash 1: restart after 100ms&lt;/li&gt;
&lt;li&gt;Crash 2: restart after 200ms&lt;/li&gt;
&lt;li&gt;Crash 3: restart after 400ms&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives transient issues (network blips, temporary resource exhaustion) time to resolve while preventing runaway loops.&lt;/p&gt;
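&lt;p&gt;That schedule is one line of arithmetic. A sketch with the cap that real managers add so delays don't grow without bound (base and cap values are illustrative):&lt;/p&gt;

```javascript
// Exponential backoff: delay doubles per consecutive crash, capped.
// Base and cap values here mirror the example schedule above.
function backoffDelay(crashCount, baseMs = 100, capMs = 30_000) {
  return Math.min(baseMs * 2 ** (crashCount - 1), capMs);
}

// A run that stays up for a while should reset the counter, e.g.:
// if (uptimeMs > 10_000) crashCount = 0;
```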

&lt;h2&gt;
  
  
  Crash Recovery vs. High Availability
&lt;/h2&gt;

&lt;p&gt;These are related but different concepts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash recovery&lt;/strong&gt; handles the period &lt;em&gt;after&lt;/em&gt; a single process crashes — the goal is to minimize downtime for that process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High availability&lt;/strong&gt; uses redundancy to eliminate downtime entirely — run 2+ instances so when one crashes, others continue serving traffic while the crashed one recovers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="c"&gt;# crash recovery on one instance doesn't affect the other 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 3 instances and 11ms crash recovery, the only exposure is a request that hits the crashed instance inside that 11ms window. In practice, load balancers stop routing to a crashed process on a similar timescale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring Crash Recovery in Your Setup
&lt;/h2&gt;

&lt;p&gt;You can test your crash recovery speed manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find your process PID&lt;/span&gt;
oxmgr status

&lt;span class="c"&gt;# Kill it hard (no graceful shutdown)&lt;/span&gt;
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-9&lt;/span&gt; &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# Measure how long until it responds again&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl &lt;span class="nt"&gt;--retry&lt;/span&gt; 100 &lt;span class="nt"&gt;--retry-delay&lt;/span&gt; 0 &lt;span class="nt"&gt;--retry-connrefused&lt;/span&gt; http://localhost:3000/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PM2 users, the same test will show you real-world recovery times rather than theoretical numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crash Recovery in Oxmgr
&lt;/h2&gt;

&lt;p&gt;Oxmgr is built around the assumption that crash recovery should be invisible to users. Key settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;restart_delay_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;         &lt;span class="c"&gt;# restart immediately&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;            &lt;span class="c"&gt;# allow 20 restarts before giving up&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;                &lt;span class="c"&gt;# run 2 instances for redundancy&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this config, a crash on one instance triggers an immediate restart. The other instance handles traffic during the ~50ms window (11ms manager + ~40ms Node.js startup for a simple app).&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs#health-checks" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for health check configuration and resource limit triggers.&lt;/p&gt;

</description>
      <category>pm2</category>
      <category>node</category>
      <category>linux</category>
    </item>
    <item>
      <title>Process Manager Comparison 2026 — PM2, Systemd, Supervisor, Oxmgr, and More</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Tue, 10 Mar 2026 23:17:32 +0000</pubDate>
      <link>https://dev.to/empellio/process-manager-comparison-2026-pm2-systemd-supervisor-oxmgr-and-more-4e0p</link>
      <guid>https://dev.to/empellio/process-manager-comparison-2026-pm2-systemd-supervisor-oxmgr-and-more-4e0p</guid>
<description>&lt;p&gt;An up-to-date comparison of every major process manager for Linux and Node.js production environments. Covers PM2, systemd, Supervisor, Forever, Circus, and Oxmgr with real benchmark data, plus guidance on when Docker makes a separate process manager unnecessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Process Manager Comparison 2026
&lt;/h2&gt;

&lt;p&gt;There are more options than ever for keeping production processes alive. This page is a living comparison — benchmarked on the same hardware, with honest trade-offs.&lt;/p&gt;

&lt;p&gt;Last updated: March 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Contenders
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;First Released&lt;/th&gt;
&lt;th&gt;Active&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;AGPL-3.0&lt;/td&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;LGPL-2.1&lt;/td&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;BSD&lt;/td&gt;
&lt;td&gt;2004&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circus&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oxmgr&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  PM2
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The industry standard for Node.js.&lt;/strong&gt; PM2 is what most developers reach for first, and for good reason — it's well-documented, has a large community, and works reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent documentation and community&lt;/li&gt;
&lt;li&gt;Built-in cluster mode (multi-core utilization)&lt;/li&gt;
&lt;li&gt;Log management with rotation&lt;/li&gt;
&lt;li&gt;Startup script generation (&lt;code&gt;pm2 startup&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ecosystem.config.js&lt;/code&gt; for config-as-code&lt;/li&gt;
&lt;li&gt;PM2 Plus / Keymetrics integration for monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~83 MB RAM for the daemon (Node.js runtime overhead)&lt;/li&gt;
&lt;li&gt;~1,240 ms cold startup&lt;/li&gt;
&lt;li&gt;~410 ms crash recovery&lt;/li&gt;
&lt;li&gt;Node.js version can conflict with managed apps&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pm2 update&lt;/code&gt; ceremony after Node.js upgrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Node.js who want the richest ecosystem and don't have tight resource constraints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; pm2
pm2 start app.js &lt;span class="nt"&gt;--name&lt;/span&gt; api &lt;span class="nt"&gt;-i&lt;/span&gt; max
pm2 startup
pm2 save
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Systemd
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OS-native approach.&lt;/strong&gt; Systemd is the init system on virtually every modern Linux distribution. If you're running on Linux and comfortable with unit files, systemd is unbeatable for production stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero additional overhead (part of the OS)&lt;/li&gt;
&lt;li&gt;Deep Linux integration — cgroups, namespaces, watchdog, socket activation&lt;/li&gt;
&lt;li&gt;Excellent journald log integration (&lt;code&gt;journalctl -u myapp -f&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Dependency ordering (&lt;code&gt;After=postgresql.service&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Security sandboxing (&lt;code&gt;PrivateTmp&lt;/code&gt;, &lt;code&gt;NoNewPrivileges&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Battle-tested over 15 years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux only — no macOS, no Windows&lt;/li&gt;
&lt;li&gt;Verbose unit file syntax&lt;/li&gt;
&lt;li&gt;No built-in cluster mode&lt;/li&gt;
&lt;li&gt;Config doesn't travel with the repo naturally&lt;/li&gt;
&lt;li&gt;Steep learning curve for advanced features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Linux-only production environments where you want OS-level control and zero runtime overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/node /var/www/app/server.js&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nodeapp&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;NODE_ENV=production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
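&lt;p&gt;The sandboxing options named above can be layered onto that same unit. The directives below are real systemd options, but which ones an app tolerates depends on what it touches — treat this as a starting sketch rather than a baseline:&lt;/p&gt;

```ini
[Service]
# Private /tmp, invisible to other processes
PrivateTmp=true
# The process and its children cannot gain new privileges
NoNewPrivileges=true
# Mount most of the filesystem read-only for this service
ProtectSystem=strict
# Hide /home, /root, and /run/user from the service
ProtectHome=true
```

&lt;p&gt;Run &lt;code&gt;systemd-analyze security myapp&lt;/code&gt; to audit how exposed a unit still is.&lt;/p&gt;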





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;myapp
systemctl start myapp
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; myapp &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Supervisor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Python veteran.&lt;/strong&gt; Supervisor has been managing processes since 2004. It's stable, mature, and still used in millions of deployments — but it's effectively in maintenance mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Battle-tested stability&lt;/li&gt;
&lt;li&gt;XML-RPC API for programmatic control&lt;/li&gt;
&lt;li&gt;Web-based status UI&lt;/li&gt;
&lt;li&gt;Well-understood behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python runtime overhead (~30 MB)&lt;/li&gt;
&lt;li&gt;No cluster/multi-instance mode&lt;/li&gt;
&lt;li&gt;Configuration is INI-based, not modern&lt;/li&gt;
&lt;li&gt;No active feature development&lt;/li&gt;
&lt;li&gt;Slower crash recovery than modern alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Legacy deployments that already use it. Not recommended for new projects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[program:myapp]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;node /var/www/app/server.js&lt;/span&gt;
&lt;span class="py"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/www/app&lt;/span&gt;
&lt;span class="py"&gt;user&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nodeapp&lt;/span&gt;
&lt;span class="py"&gt;autostart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;autorestart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;stderr_logfile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/myapp.err.log&lt;/span&gt;
&lt;span class="py"&gt;stdout_logfile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/myapp.out.log&lt;/span&gt;
&lt;span class="py"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;NODE_ENV="production"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Forever
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The minimal option.&lt;/strong&gt; Forever keeps a script running indefinitely. That's it. There's almost no configuration, no dashboard, no cluster mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead simple CLI&lt;/li&gt;
&lt;li&gt;Low learning curve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~32 MB overhead&lt;/li&gt;
&lt;li&gt;No cluster mode&lt;/li&gt;
&lt;li&gt;No advanced health checks&lt;/li&gt;
&lt;li&gt;Last major feature update was years ago&lt;/li&gt;
&lt;li&gt;Poor log management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Nothing in 2026. If you're using it, migrate to something that's actively maintained.&lt;/p&gt;




&lt;h3&gt;
  
  
  Oxmgr
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Rust-native modern alternative.&lt;/strong&gt; Built from scratch to fix PM2's overhead problems without losing PM2's developer experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~4 MB daemon memory (20× less than PM2)&lt;/li&gt;
&lt;li&gt;~38 ms cold startup (32× faster than PM2)&lt;/li&gt;
&lt;li&gt;~11 ms crash recovery (37× faster than PM2)&lt;/li&gt;
&lt;li&gt;Single binary, no runtime dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;oxfile.toml&lt;/code&gt; — version-controllable, clean syntax&lt;/li&gt;
&lt;li&gt;Cross-platform (Linux, macOS, Windows)&lt;/li&gt;
&lt;li&gt;PM2 &lt;code&gt;ecosystem.config.js&lt;/code&gt; migration built-in&lt;/li&gt;
&lt;li&gt;Health check–driven rolling restarts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newer project — smaller community than PM2&lt;/li&gt;
&lt;li&gt;No web dashboard yet (TUI in active development)&lt;/li&gt;
&lt;li&gt;No Keymetrics/PM2 Plus equivalent&lt;/li&gt;
&lt;li&gt;Smaller plugin ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; New projects, resource-constrained environments, developers who want PM2-like ergonomics at a fraction of the overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;NODE_ENV&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"production"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Benchmark Comparison
&lt;/h3&gt;

&lt;p&gt;Tested on AWS EC2 t3.small (2 vCPU, 2 GB RAM), Ubuntu 22.04, managing 10 Node.js HTTP servers. 20 runs each, medians reported.&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory Usage
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Daemon RSS&lt;/th&gt;
&lt;th&gt;Per-process overhead&lt;/th&gt;
&lt;th&gt;Total (10 procs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;83 MB&lt;/td&gt;
&lt;td&gt;~8 MB&lt;/td&gt;
&lt;td&gt;~163 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;0 MB&lt;/td&gt;
&lt;td&gt;~0 MB&lt;/td&gt;
&lt;td&gt;~0 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;31 MB&lt;/td&gt;
&lt;td&gt;~1 MB&lt;/td&gt;
&lt;td&gt;~41 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;32 MB&lt;/td&gt;
&lt;td&gt;~3 MB&lt;/td&gt;
&lt;td&gt;~62 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.2 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~0.3 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~7 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Systemd has no separate daemon — it's part of PID 1, which is always running.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Startup Time (Cold)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;P95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;1,247 ms&lt;/td&gt;
&lt;td&gt;1,912 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;78 ms&lt;/td&gt;
&lt;td&gt;121 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;640 ms&lt;/td&gt;
&lt;td&gt;890 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;890 ms&lt;/td&gt;
&lt;td&gt;1,240 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;38 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Crash Recovery Speed
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;P95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;412 ms&lt;/td&gt;
&lt;td&gt;591 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;182 ms&lt;/td&gt;
&lt;td&gt;234 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;530 ms&lt;/td&gt;
&lt;td&gt;710 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;510 ms&lt;/td&gt;
&lt;td&gt;680 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why does this matter? See &lt;a href="https://oxmgr.empellio.com/blog/what-is-crash-recovery" rel="noopener noreferrer"&gt;What Is Crash Recovery?&lt;/a&gt; for a breakdown of what determines recovery speed and how it affects user-visible downtime.&lt;/p&gt;

&lt;h4&gt;
  
  
  Feature Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PM2&lt;/th&gt;
&lt;th&gt;Systemd&lt;/th&gt;
&lt;th&gt;Supervisor&lt;/th&gt;
&lt;th&gt;Forever&lt;/th&gt;
&lt;th&gt;Oxmgr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Crash recovery&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster mode&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config as code&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Watchdog&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;HTTP + TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log rotation&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Via journald&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boot persistence&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-platform&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero-downtime reload&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web dashboard&lt;/td&gt;
&lt;td&gt;PM2 Plus&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓ (basic)&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;TUI (WIP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active development&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Decision Guide
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You're starting a new Node.js project on a VPS:&lt;/strong&gt;&lt;br&gt;
→ Use &lt;strong&gt;Oxmgr&lt;/strong&gt;. Lowest overhead, clean config, PM2-like workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're already on PM2 and it's working:&lt;/strong&gt;&lt;br&gt;
→ Stay on &lt;strong&gt;PM2&lt;/strong&gt;. Migration has a cost, and PM2 is reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need OS-level service integration on Linux:&lt;/strong&gt;&lt;br&gt;
→ Use &lt;strong&gt;Systemd&lt;/strong&gt; directly, or use Systemd to launch the Oxmgr daemon at boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're managing a Python or Ruby app:&lt;/strong&gt;&lt;br&gt;
→ &lt;strong&gt;Oxmgr&lt;/strong&gt; manages any process. Alternatively, Systemd or Supervisor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're in a container (Docker/Kubernetes):&lt;/strong&gt;&lt;br&gt;
→ Let the container runtime handle restarts. No additional process manager needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're on a 512 MB VPS:&lt;/strong&gt;&lt;br&gt;
→ &lt;strong&gt;Oxmgr&lt;/strong&gt; or Systemd. PM2's 83 MB daemon is a significant tax.&lt;/p&gt;




&lt;h3&gt;
  
  
  Try Oxmgr
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or see the &lt;a href="https://oxmgr.empellio.com/benchmark" rel="noopener noreferrer"&gt;full benchmark page&lt;/a&gt; for interactive data, and the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for configuration reference.&lt;/p&gt;

</description>
      <category>benchmark</category>
      <category>pm2</category>
      <category>node</category>
      <category>linux</category>
    </item>
    <item>
      <title>Why We Wrote a Process Manager in Rust (and what surprised us)</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:47:13 +0000</pubDate>
      <link>https://dev.to/empellio/why-we-wrote-a-process-manager-in-rust-and-what-surprised-us-10kd</link>
      <guid>https://dev.to/empellio/why-we-wrote-a-process-manager-in-rust-and-what-surprised-us-10kd</guid>
      <description>&lt;p&gt;We build developer tools at Empellio. PM2 was always just &lt;em&gt;there&lt;/em&gt; — install it, forget it, it works.&lt;/p&gt;

&lt;p&gt;Until we started asking: &lt;strong&gt;how much of the overhead is actually necessary?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question turned into &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;Oxmgr&lt;/a&gt; — a Rust-based PM2 alternative. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust and not Go?
&lt;/h2&gt;

&lt;p&gt;Go would've been faster to write. But we wanted two things Go doesn't give for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory safety &lt;strong&gt;without&lt;/strong&gt; a garbage collector&lt;/li&gt;
&lt;li&gt;Predictable latency without GC pauses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a process manager running 24/7 and keeping your services alive — that tradeoff mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surprise #1: Polling is the wrong mental model
&lt;/h2&gt;

&lt;p&gt;Our first daemon used a tick-based loop. Every X milliseconds — check processes, update metrics, handle restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; if a process crashes 1ms after a tick, you wait almost the entire interval before reacting.&lt;/p&gt;

&lt;p&gt;The fix? Let the OS tell you when something happens instead of constantly asking. When a child process exits, handle the event immediately.&lt;/p&gt;

&lt;p&gt;No polling. No waiting. Just react.&lt;/p&gt;
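&lt;p&gt;The same idea fits in a few lines of Python — a language-neutral sketch of the event-driven model, not Oxmgr's actual Rust code. The supervisor blocks in the kernel's wait call instead of sleeping in a tick loop:&lt;/p&gt;

```python
import subprocess

def wait_for_exit(cmd):
    """Spawn a child and block until the OS reports its exit.

    Popen.wait() parks the caller in the kernel's waitpid(2), so the
    supervisor wakes the instant the process dies. Reaction latency is
    bounded by scheduling, not by a tick interval.
    """
    child = subprocess.Popen(["sh", "-c", cmd])
    return child.wait()  # blocks on the OS, not on a sleep loop

print(wait_for_exit("exit 7"))  # prints 7
```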

&lt;h2&gt;
  
  
  Surprise #2: The benchmark numbers
&lt;/h2&gt;

&lt;p&gt;We expected Rust to be faster. We did not expect &lt;strong&gt;this&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crash detection:
oxmgr →  4ms
pm2   → 170ms

Difference: 42x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our first reaction was that we'd made a mistake. We ran it again. Same numbers.&lt;/p&gt;

&lt;p&gt;The reason isn't Rust magic — it's PM2's Node.js IPC layer. When a process crashes, the signal travels through the event loop before PM2 reacts. Oxmgr listens to OS exit events directly.&lt;/p&gt;

&lt;p&gt;Memory was even more surprising:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daemon RSS at 100 processes:
oxmgr →   7MB
pm2   → 148MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PM2 is a Node.js app managing Node.js apps. Oxmgr is a small Rust binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surprise #3: The compiler finds your bugs first
&lt;/h2&gt;

&lt;p&gt;Process state sounds simple: &lt;code&gt;running&lt;/code&gt;, &lt;code&gt;stopped&lt;/code&gt;, &lt;code&gt;crashed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In practice it's not. What happens when you restart a process mid-crash? What if a reload is in progress when a health check fails?&lt;/p&gt;

&lt;p&gt;In most languages you discover these edge cases in production. In Rust, the type system forces you to model every state transition explicitly. If you haven't handled a case — it won't compile.&lt;/p&gt;

&lt;p&gt;We found more bugs during development than we would have in any other language. Not because we were careful. Because the compiler was careful for us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it is now
&lt;/h2&gt;

&lt;p&gt;Oxmgr is at v0.1.4. Linux, macOS, Windows. Manages Node.js, Python, Go, Rust — anything you can run from a command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo + benchmarks: &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;github.com/Vladimir-Urik/OxMgr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People are already running it in production — would love to hear how it goes for you.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>devops</category>
      <category>programming</category>
      <category>node</category>
    </item>
    <item>
      <title>Oxmgr: A Lightweight PM2 Alternative Written in Rust</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:10:23 +0000</pubDate>
      <link>https://dev.to/empellio/oxmgr-a-lightweight-pm2-alternative-written-in-rust-3ho8</link>
      <guid>https://dev.to/empellio/oxmgr-a-lightweight-pm2-alternative-written-in-rust-3ho8</guid>
      <description>&lt;p&gt;A deterministic, cross-platform process manager for Node.js, Python, Go, Rust, and any executable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With PM2 in Mixed-Language Environments
&lt;/h2&gt;

&lt;p&gt;PM2 earned its place. For Node.js teams who needed a quick way to keep services alive, restart on crashes, and tail logs from a single CLI, it solved a real problem. It still does.&lt;/p&gt;

&lt;p&gt;But production infrastructure rarely stays pure.&lt;/p&gt;

&lt;p&gt;A team starts with a Node.js API, then adds a Python worker for ML inference, a Go binary for a performance-sensitive service, and a few shell-based cron jobs. Suddenly PM2 is managing things it was never designed for, and the configuration becomes a mixture of workarounds, undocumented flags, and institutional knowledge that lives only in someone's terminal history.&lt;/p&gt;

&lt;p&gt;That is the gap Oxmgr is designed to fill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Oxmgr Is
&lt;/h2&gt;

&lt;p&gt;Oxmgr is a process manager written in Rust for running and supervising long-lived services on Linux, macOS, and Windows. It is language-agnostic by design, which means it works equally well for Node.js applications, Python services, Go binaries, Rust executables, and arbitrary shell commands.&lt;/p&gt;

&lt;p&gt;It is available on &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and installable via npm, Homebrew, Chocolatey, or APT.&lt;/p&gt;

&lt;p&gt;The core design principle is that process management should be declarative and reproducible rather than dependent on operator memory and runtime commands. That shift in model is what makes Oxmgr genuinely different from a cosmetic PM2 reimplementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Declarative Configuration and Idempotent Apply
&lt;/h2&gt;

&lt;p&gt;The centerpiece of Oxmgr's operational model is &lt;code&gt;oxfile.toml&lt;/code&gt;, its native configuration format. Instead of managing services through a sequence of CLI commands, you describe the desired state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nn"&gt;[[apps]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"api"&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node server.js"&lt;/span&gt;
&lt;span class="py"&gt;restart_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"on_failure"&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;health_cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"curl -fsS http://127.0.0.1:3000/health"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then you apply it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr validate ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr apply ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;oxmgr validate&lt;/code&gt; catches invalid or inconsistent configuration before it reaches production. &lt;code&gt;oxmgr apply&lt;/code&gt; is idempotent — reapplying an unchanged file leaves every service untouched, and when something did change, only the affected services are reconciled.&lt;/p&gt;

&lt;p&gt;This matters operationally. Config lives in Git, is reviewable in pull requests, and is safe to run repeatedly in CI/CD pipelines without risk of bouncing services unnecessarily.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production-Grade Supervision Features
&lt;/h2&gt;

&lt;p&gt;A serious PM2 alternative needs to do more than restart failed processes. Oxmgr includes the supervision primitives that production environments actually require.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health checks&lt;/strong&gt; run a configurable command on a schedule. After a defined number of consecutive failures, Oxmgr automatically restarts the service. This is more reliable than restart-on-exit alone because it catches processes that are alive but non-functional — the kind of failure that exit-based supervision cannot see.&lt;/p&gt;
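&lt;p&gt;The consecutive-failure logic is simple to picture. Here is a generic Python sketch of the pattern (names and defaults are illustrative, not Oxmgr's internals):&lt;/p&gt;

```python
import subprocess
import time

def supervise_health(check_cmd, max_failures=3, interval=1.0, rounds=10):
    """Run a health command on a schedule; signal a restart after
    `max_failures` consecutive failures. Generic sketch of the pattern,
    not Oxmgr's implementation.
    """
    failures = 0
    for _ in range(rounds):
        ok = subprocess.run(["sh", "-c", check_cmd]).returncode == 0
        failures = 0 if ok else failures + 1   # a success resets the streak
        if failures == max_failures:
            return "restart"                   # hand the process back to the supervisor
        time.sleep(interval)
    return "healthy"

print(supervise_health("false", max_failures=2, interval=0))  # prints restart
```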

&lt;p&gt;&lt;strong&gt;Crash-loop protection&lt;/strong&gt; prevents infinite restart storms. If a service crashes repeatedly within a short window, Oxmgr applies a configurable cutoff and stops automatic restarts, surfacing the problem rather than masking it behind an endless loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource monitoring&lt;/strong&gt; tracks CPU and RAM per process. On Linux, Oxmgr optionally enforces hard limits through cgroup v2, which means resource constraints are applied at the OS level rather than relying on application-level cooperation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful shutdown&lt;/strong&gt; respects configurable stop signals and timeout escalation, giving services the opportunity to drain connections cleanly before being forcibly terminated.&lt;/p&gt;
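&lt;p&gt;Signal escalation follows the same pattern everywhere. A generic Python sketch (not Oxmgr's code): send SIGTERM, wait out a drain window, then force-kill:&lt;/p&gt;

```python
import subprocess

def graceful_stop(child, timeout=5.0):
    """Terminate a child with escalation: SIGTERM first, SIGKILL if needed.

    Generic sketch of the pattern, not Oxmgr's implementation. Returns the
    child's final return code (negative signal number on POSIX).
    """
    child.terminate()                    # SIGTERM: ask the service to drain
    try:
        child.wait(timeout=timeout)      # give it the configured window
    except subprocess.TimeoutExpired:
        child.kill()                     # SIGKILL: forcible termination
        child.wait()
    return child.returncode

proc = subprocess.Popen(["sleep", "60"])
print(graceful_stop(proc, timeout=1.0))  # prints -15 (killed by SIGTERM)
```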

&lt;p&gt;&lt;strong&gt;Per-process log management&lt;/strong&gt; captures stdout and stderr to separate files with rotation, keeping logs organized across fleets of services.&lt;/p&gt;

&lt;p&gt;All of this is visible through a built-in terminal UI (&lt;code&gt;oxmgr ui&lt;/code&gt;) that provides a fleet summary, real-time metrics, and keyboard-driven controls for inspection and lifecycle operations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Migrating From PM2
&lt;/h2&gt;

&lt;p&gt;A PM2 alternative is only useful if adoption does not require rewriting everything at once. Oxmgr addresses this directly.&lt;/p&gt;

&lt;p&gt;It supports PM2's &lt;code&gt;ecosystem.config.json&lt;/code&gt; format as a migration bridge. Existing teams can import their current configuration, evaluate Oxmgr against familiar workloads, and convert to &lt;code&gt;oxfile.toml&lt;/code&gt; when they are ready:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Import existing PM2 config&lt;/span&gt;
oxmgr import ./ecosystem.config.json

&lt;span class="c"&gt;# Convert to native format for long-term use&lt;/span&gt;
oxmgr convert ecosystem.config.json &lt;span class="nt"&gt;--out&lt;/span&gt; oxfile.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This incremental path matters because most teams do not switch operational tooling in a single migration event. They evaluate in parallel, adopt gradually, and formalize the switch once confidence is established.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Day-to-Day Workflow
&lt;/h2&gt;

&lt;p&gt;For teams that care about operational simplicity, the daily workflow in Oxmgr looks like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr validate ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr apply ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr status api
oxmgr logs api &lt;span class="nt"&gt;-f&lt;/span&gt;
oxmgr pull api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The last command — &lt;code&gt;oxmgr pull&lt;/code&gt; — is worth highlighting. For git-backed services, it fetches the latest changes and reloads or restarts the service only if the commit actually changed. This avoids unnecessary disruption on unchanged deployments and fits naturally into webhook-driven update workflows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Rust was a deliberate choice, not a trend. Oxmgr runs as a long-lived system daemon, and that context has specific requirements: a small runtime dependency surface, predictable memory behavior, a single native binary, and consistent cross-platform support without requiring a runtime installed on the target host.&lt;/p&gt;

&lt;p&gt;Rust satisfies all of these. The choice is not about raw performance benchmarks — it is about building a durable, predictable daemon that behaves consistently across Linux, macOS, and Windows without surprising operators.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where Oxmgr Fits
&lt;/h2&gt;

&lt;p&gt;Oxmgr is designed as a host-level process manager. It fits well on application servers, virtual machines, and dedicated service hosts running mixed-language workloads — anywhere a team wants clear, auditable operational behavior in place of ad hoc scripts or language-specific tooling.&lt;/p&gt;

&lt;p&gt;It is not designed to replace Kubernetes or container orchestration. It operates at a different layer and solves a different problem: keeping services running on a single host with predictable, inspectable behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Infrastructure tooling should be transparent about its boundaries. Currently, Oxmgr's cluster mode targets Node.js command shapes rather than acting as a universal clustering abstraction for every runtime. That is a deliberate tradeoff. Explicit scope is preferable to ambiguous behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Oxmgr is open source under the MIT license.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# npm&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr

&lt;span class="c"&gt;# Homebrew&lt;/span&gt;
brew tap empellio/homebrew-tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;oxmgr

&lt;span class="c"&gt;# APT&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb [trusted=yes] https://vladimir-urik.github.io/OxMgr/apt stable main"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/oxmgr.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Full documentation, the Oxfile specification, and deployment guides are available in the &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Vladimir-Urik" rel="noopener noreferrer"&gt;
        Vladimir-Urik
      &lt;/a&gt; / &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;
        OxMgr
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Oxmgr is a modern, lightweight process manager written in Rust: a fast, deterministic alternative to PM2 for managing any executable across platforms.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Oxmgr&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/Vladimir-Urik/OxMgr/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/Vladimir-Urik/OxMgr/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Vladimir-Urik/OxMgr/releases" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/9f2ddf36fa4d56492e9fe64852175d6b6c49c62d6e2cbd346a16ff399314ad49/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f566c6164696d69722d5572696b2f4f784d67723f696e636c7564655f70726572656c6561736573" alt="GitHub Release"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Vladimir-Urik/OxMgr/./LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/92f236d8c5cb5a67341de8b173cf80a45f09bd0a6ef05e870cd51e69cbf75c4d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d3265613434662e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oxmgr is a lightweight, cross-platform Rust process manager and PM2 alternative.&lt;/p&gt;
&lt;p&gt;Use it to run, supervise, reload, and monitor long-running services on Linux, macOS, and Windows. Oxmgr is language-agnostic, so it works with Node.js, Python, Go, Rust binaries, and shell commands.&lt;/p&gt;
&lt;p&gt;Latest published benchmark snapshots: &lt;a href="https://github.com/Vladimir-Urik/OxMgr/./BENCHMARK.md" rel="noopener noreferrer"&gt;BENCHMARK.md&lt;/a&gt; and &lt;a href="https://github.com/Vladimir-Urik/OxMgr/./benchmark.json" rel="noopener noreferrer"&gt;benchmark.json&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why Oxmgr&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Language-agnostic: manage any executable, not just Node.js apps&lt;/li&gt;
&lt;li&gt;Cross-platform: Linux, macOS, and Windows&lt;/li&gt;
&lt;li&gt;Low overhead: Rust daemon with persistent local state&lt;/li&gt;
&lt;li&gt;Practical operations: restart policies, health checks, logs, and CPU/RAM metrics&lt;/li&gt;
&lt;li&gt;Config-first workflows with idempotent &lt;code&gt;oxmgr apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PM2 ecosystem compatibility via &lt;code&gt;ecosystem.config.{js,cjs,mjs,json}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Interactive terminal UI with live search, filters, and sort controls&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Core Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Start, stop, restart, reload, and delete managed processes&lt;/li&gt;
&lt;li&gt;Named services and namespaces&lt;/li&gt;
&lt;li&gt;Restart policies: &lt;code&gt;always&lt;/code&gt;, &lt;code&gt;on-failure&lt;/code&gt;, and &lt;code&gt;never&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Health checks with automatic restart on repeated failures&lt;/li&gt;
&lt;li&gt;Config-driven file watch with ignore patterns and restart debounce&lt;/li&gt;
&lt;li&gt;Log tailing, log rotation, and per-process stdout/stderr logs&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>rust</category>
      <category>node</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
