<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Empellio.com</title>
    <description>The latest articles on DEV Community by Empellio.com (@empellio).</description>
    <link>https://dev.to/empellio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800458%2Fd0f73321-a9b0-4bdf-be6a-1cb38ddd5dea.png</url>
      <title>DEV Community: Empellio.com</title>
      <link>https://dev.to/empellio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/empellio"/>
    <language>en</language>
    <item>
      <title>Health Checks for Node.js Apps — What They Are, Why They Matter, and How to Build Them</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Mon, 23 Mar 2026 10:14:28 +0000</pubDate>
      <link>https://dev.to/empellio/health-checks-for-nodejs-apps-what-they-are-why-they-matter-and-how-to-build-them-42p0</link>
      <guid>https://dev.to/empellio/health-checks-for-nodejs-apps-what-they-are-why-they-matter-and-how-to-build-them-42p0</guid>
      <description>&lt;p&gt;A health check is how your infrastructure answers the question: &lt;em&gt;"Is this instance actually working right now?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Without one, a load balancer will happily route traffic to an instance whose database connection pool is exhausted, whose memory is full, or whose app started but can't reach any external services. The process is running — but it's not healthy.&lt;/p&gt;

&lt;p&gt;Health checks fix this by giving infrastructure a reliable signal to act on.&lt;/p&gt;

&lt;h2&gt;Who Uses Health Checks&lt;/h2&gt;

&lt;p&gt;Three pieces of infrastructure rely on your health endpoint:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process managers&lt;/strong&gt; (Oxmgr, PM2) — use health checks to determine when a newly spawned process is ready to receive traffic during rolling restarts. Without this, the manager might route traffic to a process that started but isn't ready. See &lt;a href="https://oxmgr.empellio.com/blog/zero-downtime-deployment" rel="noopener noreferrer"&gt;Zero-Downtime Deployment&lt;/a&gt; for the full rolling restart flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load balancers&lt;/strong&gt; (Nginx, HAProxy, AWS ALB) — poll health endpoints continuously. If an instance fails enough consecutive checks, the load balancer marks it down and stops routing traffic to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container orchestrators&lt;/strong&gt; (Kubernetes, ECS) — use liveness probes (is it running?) and readiness probes (is it ready for traffic?) to decide when to restart containers or route traffic.&lt;/p&gt;
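&lt;p&gt;To make the load balancer side concrete, here is a sketch of an HAProxy backend that polls the endpoint described in this article. The backend name, server names, and addresses are placeholders; the thresholds mirror common defaults:&lt;/p&gt;

```text
backend api_backend
    # Probe GET /health on each server and require an HTTP 200
    option httpchk GET /health
    http-check expect status 200
    # Poll every 10s; mark down after 3 failures (fall), up after 2 successes (rise)
    server app1 10.0.0.11:3000 check inter 10s fall 3 rise 2
    server app2 10.0.0.12:3000 check inter 10s fall 3 rise 2
```

&lt;p&gt;Note how &lt;code&gt;fall&lt;/code&gt; and &lt;code&gt;rise&lt;/code&gt; play the same role as the &lt;code&gt;unhealthy_threshold&lt;/code&gt; and &lt;code&gt;healthy_threshold&lt;/code&gt; settings shown later for Oxmgr.&lt;/p&gt;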

&lt;h2&gt;The Minimal Health Endpoint&lt;/h2&gt;

&lt;p&gt;At minimum, a health check returns 200 when the app is serving requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better than nothing — it confirms the process is alive and Express is responding — but it doesn't tell you if the app can actually do useful work.&lt;/p&gt;

&lt;h2&gt;A Production-Grade Health Check&lt;/h2&gt;

&lt;p&gt;A real health check verifies that dependencies are reachable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./db.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./cache.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// 1. Database check&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Redis check (if applicable)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Disk space check (for log-heavy apps)&lt;/span&gt;
  &lt;span class="c1"&gt;// Optional — add if relevant&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overallStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
      &lt;span class="na"&gt;responseTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;npm_package_version&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;nodeVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
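&lt;p&gt;The optional disk-space check from step 3 could be sketched like this, assuming Node 18.15+ (which added &lt;code&gt;statfs&lt;/code&gt; to &lt;code&gt;node:fs/promises&lt;/code&gt;); the path and the 10% free-space threshold are illustrative, not prescriptive:&lt;/p&gt;

```javascript
// Sketch of the optional disk-space check (step 3 above).
// Assumes Node 18.15+, which added statfs to node:fs/promises.
// The path and 10% free-space threshold are illustrative.
import { statfs } from 'node:fs/promises';

async function checkDiskSpace(path = '/', minFreeRatio = 0.1) {
  const stats = await statfs(path);
  // bavail = blocks available to unprivileged users; blocks = total blocks
  const freeRatio = stats.bavail / stats.blocks;
  if (freeRatio >= minFreeRatio) {
    return { status: 'ok', free: `${Math.round(freeRatio * 100)}%` };
  }
  return { status: 'error', message: `only ${Math.round(freeRatio * 100)}% disk free` };
}
```

&lt;p&gt;It would slot into the handler as &lt;code&gt;checks.disk = await checkDiskSpace()&lt;/code&gt;, downgrading &lt;code&gt;overallStatus&lt;/code&gt; on error the same way the database and Redis checks do.&lt;/p&gt;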



&lt;p&gt;Example response when healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"redis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"uptime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"responseTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4ms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.3.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nodeVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v20.11.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When degraded (database unreachable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"degraded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"connect ECONNREFUSED 127.0.0.1:5432"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"redis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HTTP status code 503 tells the load balancer to stop routing to this instance.&lt;/p&gt;

&lt;h2&gt;Liveness vs Readiness&lt;/h2&gt;

&lt;p&gt;Kubernetes (and good health check design in general) separates two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Liveness:&lt;/strong&gt; Is the process alive and not deadlocked?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If this fails, restart the container&lt;/li&gt;
&lt;li&gt;Should almost always return 200 (even a degraded app is alive)&lt;/li&gt;
&lt;li&gt;Should never check external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Readiness:&lt;/strong&gt; Is the process ready to handle traffic?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If this fails, stop routing traffic to this instance (but don't restart it)&lt;/li&gt;
&lt;li&gt;Should check all dependencies needed to serve requests&lt;/li&gt;
&lt;li&gt;This is what people typically mean by "health check"&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Liveness — just confirms the event loop is running&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health/live&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;alive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Readiness — confirms app can serve useful requests&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health/ready&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/ready&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Startup Probe&lt;/h2&gt;

&lt;p&gt;For slow-starting apps (heavy module loading, DB migrations on startup), add a startup probe. This gives the app time to initialize without triggering liveness failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;    &lt;span class="c1"&gt;# 30 × 10s = 5 minutes to start&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the startup probe passes, liveness and readiness probes take over.&lt;/p&gt;

&lt;h2&gt;Configuring Health Checks in Oxmgr&lt;/h2&gt;

&lt;p&gt;Oxmgr uses health checks to gate rolling restarts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;        &lt;span class="c"&gt;# poll every 10 seconds&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;          &lt;span class="c"&gt;# fail if no response within 5s&lt;/span&gt;
&lt;span class="py"&gt;healthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;     &lt;span class="c"&gt;# must pass 2 consecutive checks to be considered healthy&lt;/span&gt;
&lt;span class="py"&gt;unhealthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="c"&gt;# must fail 3 consecutive checks to be considered unhealthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During a rolling restart, Oxmgr:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Starts a new instance&lt;/li&gt;
&lt;li&gt;Polls the health endpoint every &lt;code&gt;interval_secs&lt;/code&gt; seconds&lt;/li&gt;
&lt;li&gt;Waits for &lt;code&gt;healthy_threshold&lt;/code&gt; consecutive successes&lt;/li&gt;
&lt;li&gt;Only then stops the old instance and moves to the next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the new instance never passes health checks, the rollout stops and you get an error report.&lt;/p&gt;

&lt;h2&gt;Health Check Anti-Patterns&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Checking things that don't affect request serving:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad — CPU temperature isn't your app's responsibility&lt;/span&gt;
&lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cpuTemp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getCpuTemp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Good — check things your app actually needs&lt;/span&gt;
&lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkDatabase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Slow health checks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your health endpoint should respond in under 100ms. If your database check takes 2 seconds, something's wrong — and more importantly, the load balancer's probe will time out and mark the instance unhealthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Add timeouts to dependency checks&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;race&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DB check timed out&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Caching health check results:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some teams cache health responses to avoid hammering the database on every poll. The problem: you might serve stale "ok" results while the database is down.&lt;/p&gt;

&lt;p&gt;If your database can't handle one &lt;code&gt;SELECT 1&lt;/code&gt; query every 10 seconds, it has a problem that a health check cache won't fix.&lt;/p&gt;
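
&lt;p&gt;If poll volume still worries you, coalesce concurrent checks instead of caching results over time: callers that arrive while a probe is in flight share its result, but every completed probe reflects the database's current state. A minimal sketch (the wrapped &lt;code&gt;check&lt;/code&gt; function is assumed, e.g. the &lt;code&gt;checkDatabase&lt;/code&gt; from earlier):&lt;/p&gt;

```javascript
// Share one in-flight probe among concurrent callers, but never reuse a
// completed result. Sketch; `check` is any async dependency check.
function coalesce(check) {
  let inflight = null;
  return () => {
    if (!inflight) {
      // Reset once the probe settles so the next caller gets a fresh check
      inflight = check().finally(() => { inflight = null; });
    }
    return inflight;
  };
}

// Hypothetical usage with the checkDatabase() from the earlier examples:
// const checkDatabaseOnce = coalesce(checkDatabase);
```

&lt;p&gt;Unlike a time-based cache, this can never report a stale "ok": the moment the database goes down, the very next probe fails.&lt;/p&gt;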

&lt;p&gt;&lt;strong&gt;Making health checks require authentication:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Health endpoints must be reachable without credentials (load balancers and process managers have none). Don't put them behind auth middleware; if exposure is a concern, restrict access at the network level instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fine for other routes&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;authenticate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Health endpoint should be registered before auth middleware&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;authenticate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use a separate server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Main app on 3000&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Health check on 3001 — no auth, internal network only&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;healthApp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;healthApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;healthApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Nginx Load Balancer Configuration
&lt;/h2&gt;

&lt;p&gt;Configure Nginx to use your health endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3002&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Nginx Plus / open-source with health_check module&lt;/span&gt;
    &lt;span class="c1"&gt;# health_check interval=10s fails=3 passes=2 uri=/health;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://api&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_next_upstream&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt; &lt;span class="s"&gt;timeout&lt;/span&gt; &lt;span class="s"&gt;http_502&lt;/span&gt; &lt;span class="s"&gt;http_503&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_next_upstream_tries&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active health checking with Nginx (requires Nginx Plus; the &lt;code&gt;health_check&lt;/code&gt; directive comes from &lt;code&gt;ngx_http_upstream_hc_module&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;zone&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="mi"&gt;64k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="s"&gt;api_health&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;status&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;header&lt;/span&gt; &lt;span class="s"&gt;Content-Type&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt; &lt;span class="sr"&gt;application/json;&lt;/span&gt;
    &lt;span class="s"&gt;body&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt; &lt;span class="sr"&gt;'"status":"ok"';&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://api&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;health_check&lt;/span&gt; &lt;span class="s"&gt;interval=10&lt;/span&gt; &lt;span class="s"&gt;fails=3&lt;/span&gt; &lt;span class="s"&gt;passes=2&lt;/span&gt; &lt;span class="s"&gt;uri=/health&lt;/span&gt; &lt;span class="s"&gt;match=api_health&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing Your Health Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic test&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://localhost:3000/health

&lt;span class="c"&gt;# Test with timeout (simulates load balancer poll)&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 http://localhost:3000/health

&lt;span class="c"&gt;# Continuous polling (simulates process manager)&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s1"&gt;'curl -s http://localhost:3000/health | jq .'&lt;/span&gt;

&lt;span class="c"&gt;# Test what happens when DB is down&lt;/span&gt;
&lt;span class="c"&gt;# Stop your database, then:&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://localhost:3000/health
&lt;span class="c"&gt;# Should return 503&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;A production health check should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Return 200 only when the app can actually serve requests&lt;/li&gt;
&lt;li&gt;Return 503 when dependencies are unavailable&lt;/li&gt;
&lt;li&gt;Respond in under 100ms&lt;/li&gt;
&lt;li&gt;Be accessible without authentication&lt;/li&gt;
&lt;li&gt;Check real dependencies (database, cache) — not just process liveness&lt;/li&gt;
&lt;li&gt;Expose useful metadata (version, uptime, PID)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wire it into your process manager, load balancer, and deployment pipeline — and your infrastructure will know exactly when your app is ready and when it isn't.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs#health-checks" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for process manager health check configuration.&lt;/p&gt;

</description>
      <category>node</category>
      <category>pm2</category>
      <category>devops</category>
    </item>
    <item>
      <title>Zero-Downtime Deployment — How to Deploy Without Dropping a Single Request</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:23:02 +0000</pubDate>
      <link>https://dev.to/empellio/zero-downtime-deployment-how-to-deploy-without-dropping-a-single-request-5245</link>
      <guid>https://dev.to/empellio/zero-downtime-deployment-how-to-deploy-without-dropping-a-single-request-5245</guid>
      <description>&lt;p&gt;Every time you deploy, you have a choice: take the app down briefly or keep it up. For most production systems, downtime — even 5 seconds — is unacceptable.&lt;/p&gt;

&lt;p&gt;This guide covers the techniques that make zero-downtime deployments possible, from the simplest single-server setup to multi-server strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Deployments Cause Downtime
&lt;/h2&gt;

&lt;p&gt;The naive deploy sequence looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bad — causes downtime&lt;/span&gt;
pm2 stop api
git pull
npm run build
pm2 start api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Between &lt;code&gt;stop&lt;/code&gt; and the moment the new process starts and passes health checks, your app is down. Connections get refused. Users see errors. Load balancers fail health checks and trigger alerts.&lt;/p&gt;

&lt;p&gt;Even if the downtime is 2 seconds, it shows up in your error rate and p99 latency graphs.&lt;/p&gt;
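
&lt;p&gt;You can see the gap for yourself by hammering an endpoint while you deploy and counting failures. A rough measurement sketch (Node 18+ global &lt;code&gt;fetch&lt;/code&gt;; the URL is whatever your app serves):&lt;/p&gt;

```javascript
// Poll a URL for a while and count successes vs. failures. During a naive
// deploy, the "failed" bucket captures the downtime window. Sketch only.
async function measureAvailability(url, { intervalMs = 100, durationMs = 30_000 } = {}) {
  const counts = { ok: 0, failed: 0 };
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    try {
      (await fetch(url)).ok ? counts.ok++ : counts.failed++;
    } catch {
      counts.failed++; // connection refused: the process was down
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return counts;
}

// Hypothetical run against a local app while deploying:
// measureAvailability('http://localhost:3000/health').then(console.log);
```

&lt;p&gt;Run it during a &lt;code&gt;pm2 stop&lt;/code&gt;/&lt;code&gt;start&lt;/code&gt; cycle and the failure count makes the "2 seconds of downtime" concrete.&lt;/p&gt;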

&lt;h2&gt;
  
  
  The Three Prerequisites
&lt;/h2&gt;

&lt;p&gt;Zero-downtime deployment requires three things from your application:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Graceful shutdown&lt;/strong&gt; — the app must finish in-flight requests before exiting. If it doesn't, requests that hit the dying instance get cut off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fast startup&lt;/strong&gt; — the new instance must be able to pass health checks quickly. An app that takes 30 seconds to initialize creates a 30-second window of reduced capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Stateless design&lt;/strong&gt; — if session state lives in-memory, it dies with the process. Use Redis or a database for state so any instance can handle any request.&lt;/p&gt;
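
&lt;p&gt;In practice that means swapping any in-process &lt;code&gt;Map&lt;/code&gt; of sessions for a shared store. A sketch of the shape, assuming a Redis-like client exposing &lt;code&gt;get&lt;/code&gt;/&lt;code&gt;set&lt;/code&gt; (e.g. ioredis; the &lt;code&gt;redis&lt;/code&gt; variable and key prefix are made up):&lt;/p&gt;

```javascript
// Bad: lives and dies with this one process
const localSessions = new Map();

// Better: any instance can read or write, so a restarted worker picks up
// existing sessions. `redis` is an assumed client exposing get/set.
const sessionStore = {
  async get(id) {
    const raw = await redis.get(`sess:${id}`);
    return raw ? JSON.parse(raw) : null;
  },
  async set(id, data, ttlSecs = 3600) {
    // ioredis-style expiry argument; adjust for your client
    await redis.set(`sess:${id}`, JSON.stringify(data), 'EX', ttlSecs);
  },
};
```

&lt;p&gt;Once state lives outside the process, killing an instance mid-rollout loses nothing, which is exactly what rolling restarts assume.&lt;/p&gt;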

&lt;h2&gt;
  
  
  Implementing Graceful Shutdown
&lt;/h2&gt;

&lt;p&gt;This is non-negotiable. Your app must handle SIGTERM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Track active connections&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;close&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Simulate some work&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;world&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; listening on :3000`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Graceful shutdown handler&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;shutdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; received, shutting down gracefully...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Stop accepting new connections&lt;/span&gt;
  &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Error closing server:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// All connections closed — safe to exit&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Server closed. Exiting.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Destroy idle connections immediately&lt;/span&gt;
  &lt;span class="c1"&gt;// (active connections will close after request completes)&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Force exit if graceful shutdown takes too long&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Graceful shutdown timed out, forcing exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGTERM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SIGINT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test your graceful shutdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the app, note the PID&lt;/span&gt;
node server.js &amp;amp;
&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="c"&gt;# Send a slow request&lt;/span&gt;
curl http://localhost:3000/ &amp;amp;

&lt;span class="c"&gt;# Immediately send SIGTERM&lt;/span&gt;
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$PID&lt;/span&gt;

&lt;span class="c"&gt;# The slow request should still complete&lt;/span&gt;
&lt;span class="c"&gt;# The server should exit cleanly after&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rolling Restart
&lt;/h2&gt;

&lt;p&gt;A rolling restart replaces instances one-by-one rather than all at once. While one instance shuts down gracefully, the others continue serving traffic. When the new instance passes health checks, the next old instance shuts down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: [Worker 0: old] [Worker 1: old] [Worker 2: old]
                                                       ↓ (start new Worker 0)
Step 2: [Worker 0: NEW] [Worker 1: old] [Worker 2: old]   ← health check passes
                                         ↓ (stop old Worker 1, start new Worker 1)
Step 3: [Worker 0: NEW] [Worker 1: NEW] [Worker 2: old]   ← health check passes
                                                       ↓ (stop old Worker 2)
Step 4: [Worker 0: NEW] [Worker 1: NEW] [Worker 2: NEW]   ← deploy complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Oxmgr:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After pulling new code and building&lt;/span&gt;
oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oxmgr handles the rollout automatically. If any new instance fails its health check, the rollout stops and remaining instances keep running the old code. You can roll back manually or fix the issue and re-deploy.&lt;/p&gt;

&lt;p&gt;The health check endpoint is the key to making this work — see &lt;a href="https://oxmgr.empellio.com/blog/nodejs-health-checks" rel="noopener noreferrer"&gt;Health Checks for Node.js Apps&lt;/a&gt; for how to build one that actually catches failures.&lt;/p&gt;

&lt;p&gt;Configure the health check that guards each step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;       &lt;span class="c"&gt;# check frequently during rollout&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;healthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="c"&gt;# must pass 2 consecutive checks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With PM2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pm2 reload api    &lt;span class="c"&gt;# rolling restart&lt;/span&gt;
&lt;span class="c"&gt;# NOT: pm2 restart api  (that restarts all at once)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Blue-Green Deployment
&lt;/h2&gt;

&lt;p&gt;Blue-green keeps two complete environments — "blue" (current) and "green" (new). You deploy to green while blue serves traffic, then switch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Load Balancer
                         │
              ┌──────────┴──────────┐
              │                     │
           [BLUE]                [GREEN]
       (running v1.2)        (running v1.3)
        ← serving traffic     ← being deployed to
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When green is ready and passes health checks, flip the load balancer. Rollback = flip back to blue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx upstream swap:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/nginx/conf.d/upstream.conf&lt;/span&gt;
&lt;span class="c1"&gt;# Before deploy: points to blue&lt;/span&gt;
&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;# blue&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# After deploy: points to green&lt;/span&gt;
&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;# green&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to green (port 3001)&lt;/span&gt;
&lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3001 oxmgr start &lt;span class="nt"&gt;--config&lt;/span&gt; oxfile.green.toml

&lt;span class="c"&gt;# Wait for green health check&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; http://localhost:3001/health&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Swap nginx upstream&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/3000/3001/'&lt;/span&gt; /etc/nginx/conf.d/upstream.conf
nginx &lt;span class="nt"&gt;-s&lt;/span&gt; reload

&lt;span class="c"&gt;# Old blue is now free — keep it warm for rollback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
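&lt;p&gt;Rolling back is the same swap in reverse. A sketch, assuming the &lt;code&gt;upstream.conf&lt;/code&gt; layout above and that blue was kept warm on port 3000; the &lt;code&gt;rollback&lt;/code&gt; helper name is illustrative:&lt;/p&gt;

```shell
# Illustrative rollback helper. Assumes the upstream.conf shown above
# and that the blue instance on port 3000 is still running.
rollback() {
  local conf="${1:-/etc/nginx/conf.d/upstream.conf}"
  # Edit a copy, then install it atomically so nginx never
  # reads a half-written config file.
  sed 's/3001/3000/' "$conf" > "$conf.tmp"
  mv "$conf.tmp" "$conf"
  nginx -s reload
}
```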



&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✓ Instant rollback (flip the load balancer back)&lt;/li&gt;
&lt;li&gt;✓ No in-flight requests lost during the switch&lt;/li&gt;
&lt;li&gt;✗ Requires 2× the resources during deployment&lt;/li&gt;
&lt;li&gt;✗ More complex setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Deploy Script
&lt;/h2&gt;

&lt;p&gt;A production-grade deploy script that combines a rolling restart with health verification; if the reload or the health check fails, it exits non-zero and the old version keeps serving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;APP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;
&lt;span class="nv"&gt;HEALTH_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="nv"&gt;MAX_WAIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Deploy started at &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;

&lt;span class="c"&gt;# 1. Pull latest code&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pulling code..."&lt;/span&gt;
git fetch origin main
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; origin/main

&lt;span class="c"&gt;# 2. Install dependencies if lockfile changed&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;git diff HEAD@&lt;span class="o"&gt;{&lt;/span&gt;1&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"package-lock.json"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Installing dependencies..."&lt;/span&gt;
  npm ci &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 3. Build&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Building..."&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# 4. Rolling restart&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Reloading processes..."&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; oxmgr reload &lt;span class="nv"&gt;$APP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Reload failed. Checking if old version still running..."&lt;/span&gt;
  oxmgr status
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 5. Verify health&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Waiting for health check..."&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 &lt;span class="nv"&gt;$MAX_WAIT&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HEALTH_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Health check passed after &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;
    &lt;span class="nb"&gt;break
  &lt;/span&gt;&lt;span class="k"&gt;fi
  if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; &lt;span class="nv"&gt;$MAX_WAIT&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Health check failed after &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_WAIT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;1
&lt;span class="k"&gt;done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Deploy complete ==="&lt;/span&gt;
oxmgr status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable and run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x deploy.sh
./deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Health Check Endpoint
&lt;/h2&gt;

&lt;p&gt;Your &lt;code&gt;/health&lt;/code&gt; endpoint is the single most important endpoint in your app for deployments. Make it useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Check database&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT 1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Check Redis (if used)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Decide if Redis failure = unhealthy for your app&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return version and uptime for visibility&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;npm_package_version&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uptime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;healthy&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;degraded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;checks&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The process manager polls this endpoint. If it returns non-200, the new instance is considered unhealthy and the rolling restart stops.&lt;/p&gt;
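&lt;p&gt;The polling logic amounts to a small loop. Here is an illustrative sketch of what a manager does internally (the function name and parameters are made up, not oxmgr's actual code); note how it requires consecutive passes, mirroring &lt;code&gt;healthy_threshold = 2&lt;/code&gt; from the config earlier:&lt;/p&gt;

```shell
# Illustrative poller: require N consecutive passing checks before
# declaring an instance healthy; any failure resets the streak.
wait_healthy() {
  local url="$1" threshold="$2" max_attempts="$3"
  local passes=0 attempts=0
  while [ "$passes" -lt "$threshold" ]; do
    attempts=$((attempts + 1))
    if [ "$attempts" -gt "$max_attempts" ]; then
      return 1                    # never became healthy
    fi
    if curl -sf -o /dev/null "$url"; then
      passes=$((passes + 1))      # one more consecutive pass
    else
      passes=0                    # reset on any failure
    fi
    sleep 1
  done
  return 0
}
```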

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Killing the process before requests complete:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong — kills immediately&lt;/span&gt;
pm2 restart api

&lt;span class="c"&gt;# Right — waits for in-flight requests&lt;/span&gt;
pm2 reload api
oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Not handling SIGTERM in the app:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a SIGTERM handler, Node.js exits immediately when the process manager sends the graceful shutdown signal. In-flight requests get cut off.&lt;/p&gt;
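&lt;p&gt;A minimal handler looks like this, sketched against a plain &lt;code&gt;node:http&lt;/code&gt; server; the 10-second force-exit window is an arbitrary example value:&lt;/p&gt;

```javascript
import { createServer } from 'node:http';

const server = createServer(function (req, res) {
  res.end('ok');
});
server.listen(3000);

process.on('SIGTERM', function () {
  console.log('SIGTERM received, draining in-flight requests...');
  // close() stops accepting new connections; existing requests finish.
  server.close(function () {
    process.exit(0);
  });
  // Safety net: force-exit if draining hangs (timeout value is arbitrary).
  setTimeout(function () { process.exit(1); }, 10000).unref();
});
```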

&lt;p&gt;&lt;strong&gt;Health endpoint that always returns 200:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Wrong — returns 200 even when broken&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="c1"&gt;// Right — actually checks dependencies&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkDatabase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dbOk&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploying with &lt;code&gt;npm install&lt;/code&gt; instead of &lt;code&gt;npm ci&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install&lt;/code&gt; can silently update packages. &lt;code&gt;npm ci&lt;/code&gt; installs exactly what's in &lt;code&gt;package-lock.json&lt;/code&gt;. Always use &lt;code&gt;npm ci&lt;/code&gt; in production deploys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Zero-downtime deployment requires three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown&lt;/strong&gt; in your app code (handle SIGTERM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling restart&lt;/strong&gt; from your process manager (&lt;code&gt;oxmgr reload&lt;/code&gt;, not &lt;code&gt;restart&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; that accurately reflect whether the app is ready&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this in place, deploys become invisible to users — no connection errors, no 502s, no alert storms at 2am.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for health check configuration and the &lt;a href="https://oxmgr.empellio.com/blog/how-to-deploy-nodejs-production" rel="noopener noreferrer"&gt;deployment guide&lt;/a&gt; for the full setup.&lt;/p&gt;

</description>
      <category>node</category>
      <category>oxmgr</category>
      <category>pm2</category>
      <category>devops</category>
    </item>
    <item>
      <title>Node.js Clustering — How to Use All CPU Cores in Production</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Sun, 15 Mar 2026 13:09:19 +0000</pubDate>
      <link>https://dev.to/empellio/nodejs-clustering-how-to-use-all-cpu-cores-in-production-1cj</link>
      <guid>https://dev.to/empellio/nodejs-clustering-how-to-use-all-cpu-cores-in-production-1cj</guid>
      <description>&lt;p&gt;Node.js runs on a single thread. That's not a bug — it's a deliberate design decision that makes async I/O simple. But it means that by default, a Node.js process uses exactly one CPU core, no matter how many your server has.&lt;/p&gt;

&lt;p&gt;On a 4-core VPS, three quarters of your compute is sitting idle.&lt;/p&gt;

&lt;p&gt;Clustering fixes this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Clustering Means
&lt;/h2&gt;

&lt;p&gt;Clustering in Node.js means running multiple instances of your application — one per CPU core — all sharing the same port. With the built-in &lt;code&gt;cluster&lt;/code&gt; module, the primary process accepts incoming connections and distributes them to workers round-robin by default (on every platform except Windows).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      ┌──────────────────────────┐
                      │        Port 3000         │
                      └────────────┬─────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
         ┌────▼─────┐        ┌─────▼─────┐        ┌─────▼─────┐
         │ Worker 0 │        │ Worker 1  │        │ Worker 2  │
         │ (PID 101)│        │ (PID 102) │        │ (PID 103) │
         └──────────┘        └───────────┘        └───────────┘
              │                    │                    │
         CPU Core 0           CPU Core 1            CPU Core 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each worker is an independent Node.js process with its own event loop, memory heap, and V8 instance. They share no state in-memory — only the listening port.&lt;/p&gt;
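&lt;p&gt;That isolation is easy to trip over: any in-memory state (counters, caches, sessions) exists once per worker. The sketch below runs as a single process, but behind a cluster the count a client sees depends on which worker answers; state that must be shared belongs in Redis or the database:&lt;/p&gt;

```javascript
import { createServer } from 'node:http';

// Each worker process gets its own copy of this variable. Behind a
// cluster, the count a client sees depends on which worker answered.
let hits = 0;

const server = createServer(function (req, res) {
  hits = hits + 1;
  res.setHeader('content-type', 'application/json');
  res.end(JSON.stringify({ pid: process.pid, hits: hits }));
});

server.listen(3000);
```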

&lt;h2&gt;
  
  
  Option 1: The Cluster Module (Built-in)
&lt;/h2&gt;

&lt;p&gt;Node.js ships with a &lt;code&gt;cluster&lt;/code&gt; module that handles forking and connection distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;availableParallelism&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:os&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isPrimary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;availableParallelism&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Primary &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; running, forking &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; workers`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; died (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;), restarting...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// auto-restart crashed workers&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Each worker runs this&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Handled by worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; started`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Express:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;availableParallelism&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:os&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isPrimary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;availableParallelism&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Worker &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; died, restarting`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello from worker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros of the cluster module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external dependencies&lt;/li&gt;
&lt;li&gt;Full control over fork behavior&lt;/li&gt;
&lt;li&gt;Can pass messages between primary and workers via IPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boilerplate in every project&lt;/li&gt;
&lt;li&gt;You manage restarts yourself&lt;/li&gt;
&lt;li&gt;No visibility into worker health from outside the process&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 2: Let the Process Manager Handle It
&lt;/h2&gt;

&lt;p&gt;The cleaner approach: keep your app code simple and let the process manager handle clustering.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Oxmgr&lt;/strong&gt;, set &lt;code&gt;instances&lt;/code&gt; in &lt;code&gt;oxfile.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="c"&gt;# explicit count&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Or use CPU count automatically:&lt;/span&gt;
&lt;span class="c"&gt;# instances = "max"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application code has no clustering logic — it's just a regular Express/Fastify/Hapi app listening on a port. Oxmgr forks it &lt;code&gt;instances&lt;/code&gt; times and distributes connections.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;PM2&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pm2 start app.js &lt;span class="nt"&gt;-i&lt;/span&gt; max     &lt;span class="c"&gt;# max = number of CPU cores&lt;/span&gt;
pm2 start app.js &lt;span class="nt"&gt;-i&lt;/span&gt; 4       &lt;span class="c"&gt;# explicit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in &lt;code&gt;ecosystem.config.js&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;apps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dist/server.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;max&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;exec_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this approach is better:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean separation of concerns — app doesn't know about clustering&lt;/li&gt;
&lt;li&gt;Process manager handles crashes across all instances&lt;/li&gt;
&lt;li&gt;Zero-downtime rolling restarts work across all instances&lt;/li&gt;
&lt;li&gt;Health checks monitor each instance independently&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 3: Worker Threads
&lt;/h2&gt;

&lt;p&gt;Don't confuse clustering with Worker Threads. They're different tools for different problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;Worker Threads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Scale I/O-bound workloads across cores&lt;/td&gt;
&lt;td&gt;Run CPU-intensive tasks off the main thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Separate heap per process&lt;/td&gt;
&lt;td&gt;Shared memory available (&lt;code&gt;SharedArrayBuffer&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication&lt;/td&gt;
&lt;td&gt;IPC (slower)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;postMessage&lt;/code&gt; or shared buffers (faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure isolation&lt;/td&gt;
&lt;td&gt;One crash = one process&lt;/td&gt;
&lt;td&gt;One crash = entire process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Web servers, API handlers&lt;/td&gt;
&lt;td&gt;Image processing, crypto, data parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use clusters for serving HTTP. Use Worker Threads for CPU-heavy operations within a single process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isMainThread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parentPort&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:worker_threads&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isMainThread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Main thread: handle HTTP, delegate heavy work&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./heavy-computation.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bigArray&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Computation result:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Worker thread: do the heavy lifting&lt;/span&gt;
  &lt;span class="nx"&gt;parentPort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;doExpensiveComputation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;parentPort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Many Instances Should You Run?
&lt;/h2&gt;

&lt;p&gt;The common advice is "one per CPU core." This is right for CPU-bound workloads. For I/O-bound Node.js apps (which is most of them), the relationship is more nuanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU-bound apps&lt;/strong&gt; (heavy computation, data processing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One instance per physical core&lt;/li&gt;
&lt;li&gt;More instances than cores = context switching overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;I/O-bound apps&lt;/strong&gt; (database queries, HTTP calls, file reads):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one per core&lt;/li&gt;
&lt;li&gt;Benchmark under load — you may get better results from fewer instances, each with more memory headroom&lt;/li&gt;
&lt;li&gt;The bottleneck is usually the database, not the CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical starting point:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="c"&gt;# 2-core VPS&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c"&gt;# 4-core server&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="c"&gt;# Leave one core for the OS and other processes&lt;/span&gt;
&lt;span class="c"&gt;# instances = 3  (on a 4-core machine under heavy load)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor CPU and memory under load. If workers are CPU-saturated (consistently &amp;gt;80%), adding instances helps. If they're mostly idle waiting for DB queries, more instances just means more memory usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky Sessions and Shared State
&lt;/h2&gt;

&lt;p&gt;Clustering breaks any in-memory session state. If user A hits Worker 0 on request 1 and Worker 1 on request 2, Worker 1 doesn't know about Worker 0's session data.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Use a shared session store (recommended):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express-session&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;RedisStore&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connect-redis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redisClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REDIS_URL&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;session&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RedisStore&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;redisClient&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SESSION_SECRET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;resave&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;saveUninitialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Use sticky sessions (not recommended for new projects):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sticky sessions route each client to the same worker consistently via a cookie or IP hash. This defeats the purpose of load balancing and creates uneven distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design stateless services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best approach for clustered apps: keep no state in memory. Use Redis, a database, or JWT tokens instead. Your app becomes trivially scalable — from 1 to 100 instances with no code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graceful Reloads with Clustering
&lt;/h2&gt;

&lt;p&gt;When you update your app, you want to restart workers without dropping connections. This is the rolling restart pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker 0: SIGTERM → drain connections → exit
                         ↓
                    New Worker 0 starts → health check passes
Worker 1: SIGTERM → drain connections → exit
                         ↓
                    New Worker 1 starts → health check passes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Oxmgr:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr reload api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This performs a rolling restart across all instances. If a new instance fails its health check, the reload stops and old instances keep running — automatic rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Performance Test
&lt;/h2&gt;

&lt;p&gt;Before and after enabling clustering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install autocannon&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; autocannon

&lt;span class="c"&gt;# Benchmark single process (instances = 1)&lt;/span&gt;
autocannon &lt;span class="nt"&gt;-c&lt;/span&gt; 100 &lt;span class="nt"&gt;-d&lt;/span&gt; 10 http://localhost:3000

&lt;span class="c"&gt;# Enable clustering (instances = 4), restart, benchmark again&lt;/span&gt;
autocannon &lt;span class="nt"&gt;-c&lt;/span&gt; 100 &lt;span class="nt"&gt;-d&lt;/span&gt; 10 http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a CPU-bound workload, you should see throughput multiply roughly linearly with core count. On an I/O-bound workload with a fast database, the gains are less dramatic but still meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Node.js is single-threaded by default — clustering uses all CPU cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple apps:&lt;/strong&gt; use &lt;code&gt;instances&lt;/code&gt; in your process manager config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex apps:&lt;/strong&gt; consider the built-in cluster module for fine-grained control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU-heavy tasks:&lt;/strong&gt; use Worker Threads within a process, not more cluster instances&lt;/li&gt;
&lt;li&gt;Always use shared state storage (Redis) — in-memory state doesn't survive across workers&lt;/li&gt;
&lt;li&gt;Process managers handle the hard parts: restarts, health checks, rolling reloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;Oxmgr docs&lt;/a&gt; for cluster configuration, or the &lt;a href="https://oxmgr.empellio.com/blog/how-to-deploy-nodejs-production" rel="noopener noreferrer"&gt;deployment guide&lt;/a&gt; for the full production setup walkthrough.&lt;/p&gt;

</description>
      <category>node</category>
      <category>performance</category>
      <category>pm2</category>
    </item>
    <item>
      <title>What Is Crash Recovery? How Process Managers Keep Your App Online After Failures</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Thu, 12 Mar 2026 10:35:22 +0000</pubDate>
      <link>https://dev.to/empellio/what-is-crash-recovery-how-process-managers-keep-your-app-online-after-failures-j4o</link>
      <guid>https://dev.to/empellio/what-is-crash-recovery-how-process-managers-keep-your-app-online-after-failures-j4o</guid>
      <description>&lt;h1&gt;
  
  
  What Is Crash Recovery?
&lt;/h1&gt;

&lt;p&gt;Your production app crashes. A bug slips through, memory spikes, a network dependency times out and throws an unhandled exception — it doesn't matter why. What matters is what happens next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash recovery&lt;/strong&gt; is the automatic process of detecting that an application has died and restarting it as fast as possible, before your users have time to notice.&lt;/p&gt;

&lt;p&gt;Without crash recovery, a process that crashes stays dead until a human intervenes. With it, the same crash can be invisible — the process restarts in milliseconds and keeps serving traffic.&lt;/p&gt;

&lt;p&gt;Crash recovery is one of the core reasons you need a &lt;a href="https://oxmgr.empellio.com/blog/what-is-a-process-manager" rel="noopener noreferrer"&gt;process manager&lt;/a&gt; in production — without one, there's nothing watching your app to trigger a restart.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Crash Recovery Works
&lt;/h2&gt;

&lt;p&gt;Every operating system gives processes a way to signal their exit. When a process terminates — whether it crashes, runs out of memory, or is killed — it emits an exit event with a status code.&lt;/p&gt;

&lt;p&gt;A process manager listens for these events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App process exits (status: 1 — error)
        ↓
Process manager receives exit event
        ↓
Check: is this process configured to restart?
        ↓
Yes → spawn new process
        ↓
Wait for process to be ready (health check or port listen)
        ↓
Resume serving traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical variable is how long this takes. The gap between the exit event and the new process serving traffic is your &lt;strong&gt;downtime window&lt;/strong&gt; — every request arriving inside it fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Determines Recovery Speed
&lt;/h2&gt;

&lt;p&gt;Three factors control how fast a process manager can recover from a crash:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Manager's Own Runtime
&lt;/h3&gt;

&lt;p&gt;A process manager written in a scripting language (JavaScript, Python, Ruby) has to do real work to respond to an exit event — the VM needs to be scheduled, the garbage collector might pause, the event loop might be busy.&lt;/p&gt;

&lt;p&gt;A compiled binary (Rust, Go, C) responds in microseconds. There's no VM, no GC, no interpreter. The exit handler fires and the spawn call happens immediately.&lt;/p&gt;

&lt;p&gt;This is the biggest factor. PM2 (Node.js daemon) recovers in ~400ms. Oxmgr (Rust binary) recovers in ~11ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Process Spawn Time
&lt;/h3&gt;

&lt;p&gt;Spawning a new process takes time regardless of the manager. For a Node.js app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS process creation: ~1–5ms&lt;/li&gt;
&lt;li&gt;Node.js startup: ~50–200ms (depending on module load time)&lt;/li&gt;
&lt;li&gt;Application initialization: varies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process manager can't control how fast your app starts. But it can start the spawn immediately after detecting the crash, rather than waiting for polling intervals.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Health Check Configuration
&lt;/h3&gt;

&lt;p&gt;After spawning, the manager needs to know when the process is ready. Two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Port listening&lt;/strong&gt; — wait until the process binds to its port. Simple, but doesn't guarantee the app is actually serving valid responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP health check&lt;/strong&gt; — poll an endpoint until it returns 200. Slower to confirm readiness, but more accurate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For crash recovery, the key is not waiting &lt;em&gt;longer&lt;/em&gt; than necessary. If your health check polls every 30 seconds but the replacement process is ready in 50ms, you can wait up to 30 seconds just to confirm a recovery that has already happened.&lt;/p&gt;
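&lt;p&gt;A readiness poll can be sketched with the global &lt;code&gt;fetch&lt;/code&gt; (Node 18+). The short interval is what keeps confirmation fast:&lt;/p&gt;

```javascript
// Poll a health endpoint until it returns 2xx, with a short interval
// so readiness is confirmed quickly after the process comes up.
async function waitUntilHealthy(url, { intervalMs = 250, timeoutMs = 30_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      const res = await fetch(url);
      if (res.ok) return true;               // 2xx: process is ready
    } catch {
      // connection refused: process not listening yet, keep polling
    }
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

&lt;p&gt;A failed &lt;code&gt;fetch&lt;/code&gt; just means the process isn't listening yet, so the loop treats it like a non-2xx response and retries.&lt;/p&gt;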

&lt;h2&gt;
  
  
  What Happens If an App Keeps Crashing?
&lt;/h2&gt;

&lt;p&gt;Automatic restart can create a "crash loop" — the app restarts, crashes immediately, restarts again, endlessly. This is worse than staying down in some ways: it makes logs unreadable and consumes CPU spinning up processes.&lt;/p&gt;

&lt;p&gt;Most process managers handle this with restart limits and backoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;          &lt;span class="c"&gt;# stop trying after 10 crashes&lt;/span&gt;
&lt;span class="py"&gt;restart_delay_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;     &lt;span class="c"&gt;# wait 500ms before each restart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exponential backoff is more sophisticated — the delay doubles each time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crash 1: restart after 100ms&lt;/li&gt;
&lt;li&gt;Crash 2: restart after 200ms&lt;/li&gt;
&lt;li&gt;Crash 3: restart after 400ms&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives transient issues (network blips, temporary resource exhaustion) time to resolve while preventing runaway loops.&lt;/p&gt;
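&lt;p&gt;That schedule is one line of arithmetic. A sketch with the cap that real managers add so delays don't grow without bound (base and cap values are illustrative):&lt;/p&gt;

```javascript
// Exponential backoff: delay doubles per consecutive crash, capped.
// Base and cap values here mirror the example schedule above.
function backoffDelay(crashCount, baseMs = 100, capMs = 30_000) {
  return Math.min(baseMs * 2 ** (crashCount - 1), capMs);
}

// A run that stays up for a while should reset the counter, e.g.:
// if (uptimeMs > 10_000) crashCount = 0;
```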

&lt;h2&gt;
  
  
  Crash Recovery vs. High Availability
&lt;/h2&gt;

&lt;p&gt;These are related but different concepts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash recovery&lt;/strong&gt; handles the period &lt;em&gt;after&lt;/em&gt; a single process crashes — the goal is to minimize downtime for that process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High availability&lt;/strong&gt; uses redundancy to eliminate downtime entirely — run 2+ instances so when one crashes, others continue serving traffic while the crashed one recovers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="c"&gt;# crash recovery on one instance doesn't affect the other 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 3 instances and 11ms crash recovery, the only exposure is a request that hits the crashed instance inside that 11ms window. In practice, load balancers stop routing to a crashed process on a similar timescale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring Crash Recovery in Your Setup
&lt;/h2&gt;

&lt;p&gt;You can test your crash recovery speed manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find your process PID&lt;/span&gt;
oxmgr status

&lt;span class="c"&gt;# Kill it hard (no graceful shutdown)&lt;/span&gt;
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-9&lt;/span&gt; &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# Measure how long until it responds again&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl &lt;span class="nt"&gt;--retry&lt;/span&gt; 100 &lt;span class="nt"&gt;--retry-delay&lt;/span&gt; 0 &lt;span class="nt"&gt;--retry-connrefused&lt;/span&gt; http://localhost:3000/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PM2 users, the same test will show you real-world recovery times rather than theoretical numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crash Recovery in Oxmgr
&lt;/h2&gt;

&lt;p&gt;Oxmgr is built around the assumption that crash recovery should be invisible to users. Key settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;restart_delay_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;         &lt;span class="c"&gt;# restart immediately&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;            &lt;span class="c"&gt;# allow 20 restarts before giving up&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;                &lt;span class="c"&gt;# run 2 instances for redundancy&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;timeout_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this config, a crash on one instance triggers an immediate restart. The other instance handles traffic during the ~50ms window (11ms manager + ~40ms Node.js startup for a simple app).&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://oxmgr.empellio.com/docs#health-checks" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for health check configuration and resource limit triggers.&lt;/p&gt;

</description>
      <category>pm2</category>
      <category>node</category>
      <category>linux</category>
    </item>
    <item>
      <title>Process Manager Comparison 2026 — PM2, Systemd, Supervisor, Oxmgr, and More</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Tue, 10 Mar 2026 23:17:32 +0000</pubDate>
      <link>https://dev.to/empellio/process-manager-comparison-2026-pm2-systemd-supervisor-oxmgr-and-more-4e0p</link>
      <guid>https://dev.to/empellio/process-manager-comparison-2026-pm2-systemd-supervisor-oxmgr-and-more-4e0p</guid>
<description>&lt;p&gt;An up-to-date comparison of every major process manager for Linux and Node.js production environments. Covers PM2, systemd, Supervisor, Forever, Circus, and Oxmgr with real benchmark data, plus guidance on when Docker makes a separate process manager unnecessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Process Manager Comparison 2026
&lt;/h2&gt;

&lt;p&gt;There are more options than ever for keeping production processes alive. This page is a living comparison — benchmarked on the same hardware, with honest trade-offs.&lt;/p&gt;

&lt;p&gt;Last updated: March 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Contenders
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;First Released&lt;/th&gt;
&lt;th&gt;Active&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;AGPL-3.0&lt;/td&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;LGPL-2.1&lt;/td&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;BSD&lt;/td&gt;
&lt;td&gt;2004&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circus&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oxmgr&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  PM2
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The industry standard for Node.js.&lt;/strong&gt; PM2 is what most developers reach for first, and for good reason — it's well-documented, has a large community, and works reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent documentation and community&lt;/li&gt;
&lt;li&gt;Built-in cluster mode (multi-core utilization)&lt;/li&gt;
&lt;li&gt;Log management with rotation&lt;/li&gt;
&lt;li&gt;Startup script generation (&lt;code&gt;pm2 startup&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ecosystem.config.js&lt;/code&gt; for config-as-code&lt;/li&gt;
&lt;li&gt;PM2 Plus / Keymetrics integration for monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~83 MB RAM for the daemon (Node.js runtime overhead)&lt;/li&gt;
&lt;li&gt;~1,240 ms cold startup&lt;/li&gt;
&lt;li&gt;~410 ms crash recovery&lt;/li&gt;
&lt;li&gt;Node.js version can conflict with managed apps&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pm2 update&lt;/code&gt; ceremony after Node.js upgrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Node.js who want the richest ecosystem and don't have tight resource constraints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; pm2
pm2 start app.js &lt;span class="nt"&gt;--name&lt;/span&gt; api &lt;span class="nt"&gt;-i&lt;/span&gt; max
pm2 startup
pm2 save
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Systemd
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OS-native approach.&lt;/strong&gt; Systemd is the init system on virtually every modern Linux distribution. If you're running on Linux and comfortable with unit files, systemd is unbeatable for production stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero additional overhead (part of the OS)&lt;/li&gt;
&lt;li&gt;Deep Linux integration — cgroups, namespaces, watchdog, socket activation&lt;/li&gt;
&lt;li&gt;Excellent journald log integration (&lt;code&gt;journalctl -u myapp -f&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Dependency ordering (&lt;code&gt;After=postgresql.service&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Security sandboxing (&lt;code&gt;PrivateTmp&lt;/code&gt;, &lt;code&gt;NoNewPrivileges&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Battle-tested over 15 years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux only — no macOS, no Windows&lt;/li&gt;
&lt;li&gt;Verbose unit file syntax&lt;/li&gt;
&lt;li&gt;No built-in cluster mode&lt;/li&gt;
&lt;li&gt;Config doesn't travel with the repo naturally&lt;/li&gt;
&lt;li&gt;Steep learning curve for advanced features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Linux-only production environments where you want OS-level control and zero runtime overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/node /var/www/app/server.js&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nodeapp&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;NODE_ENV=production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
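&lt;p&gt;The sandboxing options named above can be layered onto that same unit. The directives below are real systemd options, but which ones an app tolerates depends on what it touches — treat this as a starting sketch rather than a baseline:&lt;/p&gt;

```ini
[Service]
# Private /tmp, invisible to other processes
PrivateTmp=true
# The process and its children cannot gain new privileges
NoNewPrivileges=true
# Mount most of the filesystem read-only for this service
ProtectSystem=strict
# Hide /home, /root, and /run/user from the service
ProtectHome=true
```

&lt;p&gt;Run &lt;code&gt;systemd-analyze security myapp&lt;/code&gt; to audit how exposed a unit still is.&lt;/p&gt;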





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;myapp
systemctl start myapp
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; myapp &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Supervisor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Python veteran.&lt;/strong&gt; Supervisor has been managing processes since 2004. It's stable, mature, and still used in millions of deployments — but it's effectively in maintenance mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Battle-tested stability&lt;/li&gt;
&lt;li&gt;XML-RPC API for programmatic control&lt;/li&gt;
&lt;li&gt;Web-based status UI&lt;/li&gt;
&lt;li&gt;Well-understood behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python runtime overhead (~30 MB)&lt;/li&gt;
&lt;li&gt;No cluster/multi-instance mode&lt;/li&gt;
&lt;li&gt;Configuration is INI-based, not modern&lt;/li&gt;
&lt;li&gt;No active feature development&lt;/li&gt;
&lt;li&gt;Slower crash recovery than modern alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Legacy deployments that already use it. Not recommended for new projects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[program:myapp]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;node /var/www/app/server.js&lt;/span&gt;
&lt;span class="py"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/www/app&lt;/span&gt;
&lt;span class="py"&gt;user&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nodeapp&lt;/span&gt;
&lt;span class="py"&gt;autostart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;autorestart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;stderr_logfile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/myapp.err.log&lt;/span&gt;
&lt;span class="py"&gt;stdout_logfile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/myapp.out.log&lt;/span&gt;
&lt;span class="py"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;NODE_ENV="production"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Forever
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The minimal option.&lt;/strong&gt; Forever keeps a script running indefinitely. That's it. There's almost no configuration, no dashboard, no cluster mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead simple CLI&lt;/li&gt;
&lt;li&gt;Low learning curve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~32 MB overhead&lt;/li&gt;
&lt;li&gt;No cluster mode&lt;/li&gt;
&lt;li&gt;No advanced health checks&lt;/li&gt;
&lt;li&gt;Last major feature update was years ago&lt;/li&gt;
&lt;li&gt;Poor log management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Nothing in 2026. If you're using it, migrate to something that's actively maintained.&lt;/p&gt;




&lt;h3&gt;
  
  
  Oxmgr
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Rust-native modern alternative.&lt;/strong&gt; Built from scratch to fix PM2's overhead problems without losing PM2's developer experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~4 MB daemon memory (20× less than PM2)&lt;/li&gt;
&lt;li&gt;~38 ms cold startup (32× faster than PM2)&lt;/li&gt;
&lt;li&gt;~11 ms crash recovery (37× faster than PM2)&lt;/li&gt;
&lt;li&gt;Single binary, no runtime dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;oxfile.toml&lt;/code&gt; — version-controllable, clean syntax&lt;/li&gt;
&lt;li&gt;Cross-platform (Linux, macOS, Windows)&lt;/li&gt;
&lt;li&gt;PM2 &lt;code&gt;ecosystem.config.js&lt;/code&gt; migration built-in&lt;/li&gt;
&lt;li&gt;Health check–driven rolling restarts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newer project — smaller community than PM2&lt;/li&gt;
&lt;li&gt;No web dashboard yet (TUI in active development)&lt;/li&gt;
&lt;li&gt;No Keymetrics/PM2 Plus equivalent&lt;/li&gt;
&lt;li&gt;Smaller plugin ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; New projects, resource-constrained environments, developers who want PM2-like ergonomics at a fraction of the overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[processes.api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node dist/server.js"&lt;/span&gt;
&lt;span class="py"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;restart_on_exit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;NODE_ENV&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"production"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nn"&gt;[processes.api.health_check]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:3000/health"&lt;/span&gt;
&lt;span class="py"&gt;interval_secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Benchmark Comparison
&lt;/h3&gt;

&lt;p&gt;Tested on AWS EC2 t3.small (2 vCPU, 2 GB RAM), Ubuntu 22.04, managing 10 Node.js HTTP servers. 20 runs each, medians reported.&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory Usage
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Daemon RSS&lt;/th&gt;
&lt;th&gt;Per-process overhead&lt;/th&gt;
&lt;th&gt;Total (10 procs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;83 MB&lt;/td&gt;
&lt;td&gt;~8 MB&lt;/td&gt;
&lt;td&gt;~163 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;0 MB&lt;/td&gt;
&lt;td&gt;~0 MB&lt;/td&gt;
&lt;td&gt;~0 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;31 MB&lt;/td&gt;
&lt;td&gt;~1 MB&lt;/td&gt;
&lt;td&gt;~41 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;32 MB&lt;/td&gt;
&lt;td&gt;~3 MB&lt;/td&gt;
&lt;td&gt;~62 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.2 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~0.3 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~7 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Systemd has no separate daemon — it's part of PID 1, which is always running.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Startup Time (Cold)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;P95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;1,247 ms&lt;/td&gt;
&lt;td&gt;1,912 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;78 ms&lt;/td&gt;
&lt;td&gt;121 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;640 ms&lt;/td&gt;
&lt;td&gt;890 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;890 ms&lt;/td&gt;
&lt;td&gt;1,240 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;38 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Crash Recovery Speed
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Manager&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;P95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PM2&lt;/td&gt;
&lt;td&gt;412 ms&lt;/td&gt;
&lt;td&gt;591 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systemd&lt;/td&gt;
&lt;td&gt;182 ms&lt;/td&gt;
&lt;td&gt;234 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor&lt;/td&gt;
&lt;td&gt;530 ms&lt;/td&gt;
&lt;td&gt;710 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forever&lt;/td&gt;
&lt;td&gt;510 ms&lt;/td&gt;
&lt;td&gt;680 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oxmgr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why does this matter? See &lt;a href="https://oxmgr.empellio.com/blog/what-is-crash-recovery" rel="noopener noreferrer"&gt;What Is Crash Recovery?&lt;/a&gt; for a breakdown of what determines recovery speed and how it affects user-visible downtime.&lt;/p&gt;

&lt;h4&gt;
  
  
  Feature Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PM2&lt;/th&gt;
&lt;th&gt;Systemd&lt;/th&gt;
&lt;th&gt;Supervisor&lt;/th&gt;
&lt;th&gt;Forever&lt;/th&gt;
&lt;th&gt;Oxmgr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Crash recovery&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster mode&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config as code&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Watchdog&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;HTTP + TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log rotation&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Via journald&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boot persistence&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-platform&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero-downtime reload&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web dashboard&lt;/td&gt;
&lt;td&gt;PM2 Plus&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓ (basic)&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;TUI (WIP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active development&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Decision Guide
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You're starting a new Node.js project on a VPS:&lt;/strong&gt;&lt;br&gt;
→ Use &lt;strong&gt;Oxmgr&lt;/strong&gt;. Lowest overhead, clean config, PM2-like workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're already on PM2 and it's working:&lt;/strong&gt;&lt;br&gt;
→ Stay on &lt;strong&gt;PM2&lt;/strong&gt;. Migration has a cost, and PM2 is reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need OS-level service integration on Linux:&lt;/strong&gt;&lt;br&gt;
→ Use &lt;strong&gt;Systemd&lt;/strong&gt; directly, or use Systemd to launch the Oxmgr daemon at boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're managing a Python or Ruby app:&lt;/strong&gt;&lt;br&gt;
→ &lt;strong&gt;Oxmgr&lt;/strong&gt; manages any process. Alternatively, Systemd or Supervisor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're in a container (Docker/Kubernetes):&lt;/strong&gt;&lt;br&gt;
→ Let the container runtime handle restarts. No additional process manager needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're on a 512 MB VPS:&lt;/strong&gt;&lt;br&gt;
→ &lt;strong&gt;Oxmgr&lt;/strong&gt; or Systemd. PM2's 83 MB daemon is a significant tax.&lt;/p&gt;




&lt;h3&gt;
  
  
  Try Oxmgr
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or see the &lt;a href="https://oxmgr.empellio.com/benchmark" rel="noopener noreferrer"&gt;full benchmark page&lt;/a&gt; for interactive data, and the &lt;a href="https://oxmgr.empellio.com/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for configuration reference.&lt;/p&gt;

</description>
      <category>benchmark</category>
      <category>pm2</category>
      <category>node</category>
      <category>linux</category>
    </item>
    <item>
      <title>Why We Wrote a Process Manager in Rust (and what surprised us)</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:47:13 +0000</pubDate>
      <link>https://dev.to/empellio/why-we-wrote-a-process-manager-in-rust-and-what-surprised-us-10kd</link>
      <guid>https://dev.to/empellio/why-we-wrote-a-process-manager-in-rust-and-what-surprised-us-10kd</guid>
      <description>&lt;p&gt;We build developer tools at Empellio. PM2 was always just &lt;em&gt;there&lt;/em&gt; — install it, forget it, it works.&lt;/p&gt;

&lt;p&gt;Until we started asking: &lt;strong&gt;how much of the overhead is actually necessary?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question turned into &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;Oxmgr&lt;/a&gt; — a Rust-based PM2 alternative. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rust and not Go?
&lt;/h2&gt;

&lt;p&gt;Go would've been faster to write. But we wanted two things Go doesn't give for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory safety &lt;strong&gt;without&lt;/strong&gt; a garbage collector&lt;/li&gt;
&lt;li&gt;Predictable latency without GC pauses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a process manager running 24/7 and keeping your services alive — that tradeoff mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surprise #1: Polling is the wrong mental model
&lt;/h2&gt;

&lt;p&gt;Our first daemon used a tick-based loop. Every X milliseconds — check processes, update metrics, handle restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; if a process crashes 1ms after a tick, you wait almost the entire interval before reacting.&lt;/p&gt;

&lt;p&gt;The fix? Let the OS tell you when something happens instead of constantly asking. When a child process exits, handle the event immediately.&lt;/p&gt;

&lt;p&gt;No polling. No waiting. Just react.&lt;/p&gt;
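&lt;p&gt;The same idea fits in a few lines of Python — a language-neutral sketch of the event-driven model, not Oxmgr's actual Rust code. The supervisor blocks in the kernel's wait call instead of sleeping in a tick loop:&lt;/p&gt;

```python
import subprocess

def wait_for_exit(cmd):
    """Spawn a child and block until the OS reports its exit.

    Popen.wait() parks the caller in the kernel's waitpid(2), so the
    supervisor wakes the instant the process dies. Reaction latency is
    bounded by scheduling, not by a tick interval.
    """
    child = subprocess.Popen(["sh", "-c", cmd])
    return child.wait()  # blocks on the OS, not on a sleep loop

print(wait_for_exit("exit 7"))  # prints 7
```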

&lt;h2&gt;
  
  
  Surprise #2: The benchmark numbers
&lt;/h2&gt;

&lt;p&gt;We expected Rust to be faster. We did not expect &lt;strong&gt;this&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crash detection:
oxmgr →  4ms
pm2   → 170ms

Difference: 42x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our first reaction was that we'd made a mistake. We ran it again. Same numbers.&lt;/p&gt;

&lt;p&gt;The reason isn't Rust magic — it's PM2's Node.js IPC layer. When a process crashes, the signal travels through the event loop before PM2 reacts. Oxmgr listens to OS exit events directly.&lt;/p&gt;

&lt;p&gt;Memory was even more surprising:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daemon RSS at 100 processes:
oxmgr →   7MB
pm2   → 148MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PM2 is a Node.js app managing Node.js apps. Oxmgr is a small Rust binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surprise #3: The compiler finds your bugs first
&lt;/h2&gt;

&lt;p&gt;Process state sounds simple: &lt;code&gt;running&lt;/code&gt;, &lt;code&gt;stopped&lt;/code&gt;, &lt;code&gt;crashed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In practice it's not. What happens when you restart a process mid-crash? What if a reload is in progress when a health check fails?&lt;/p&gt;

&lt;p&gt;In most languages you discover these edge cases in production. In Rust, the type system forces you to model every state transition explicitly. If you haven't handled a case — it won't compile.&lt;/p&gt;

&lt;p&gt;We found more bugs during development than we would have in any other language. Not because we were careful. Because the compiler was careful for us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it is now
&lt;/h2&gt;

&lt;p&gt;Oxmgr is at v0.1.4. Linux, macOS, Windows. Manages Node.js, Python, Go, Rust — anything you can run from a command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo + benchmarks: &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;github.com/Vladimir-Urik/OxMgr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People are already running it in production — would love to hear how it goes for you.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>devops</category>
      <category>programming</category>
      <category>node</category>
    </item>
    <item>
      <title>Oxmgr: A Lightweight PM2 Alternative Written in Rust</title>
      <dc:creator>Empellio.com</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:10:23 +0000</pubDate>
      <link>https://dev.to/empellio/oxmgr-a-lightweight-pm2-alternative-written-in-rust-3ho8</link>
      <guid>https://dev.to/empellio/oxmgr-a-lightweight-pm2-alternative-written-in-rust-3ho8</guid>
      <description>&lt;p&gt;A deterministic, cross-platform process manager for Node.js, Python, Go, Rust, and any executable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With PM2 in Mixed-Language Environments
&lt;/h2&gt;

&lt;p&gt;PM2 earned its place. For Node.js teams who needed a quick way to keep services alive, restart on crashes, and tail logs from a single CLI, it solved a real problem. It still does.&lt;/p&gt;

&lt;p&gt;But production infrastructure rarely stays pure.&lt;/p&gt;

&lt;p&gt;A team starts with a Node.js API, then adds a Python worker for ML inference, a Go binary for a performance-sensitive service, and a few shell-based cron jobs. Suddenly PM2 is managing things it was never designed for, and the configuration becomes a mixture of workarounds, undocumented flags, and institutional knowledge that lives only in someone's terminal history.&lt;/p&gt;

&lt;p&gt;That is the gap Oxmgr is designed to fill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Oxmgr Is
&lt;/h2&gt;

&lt;p&gt;Oxmgr is a process manager written in Rust for running and supervising long-lived services on Linux, macOS, and Windows. It is language-agnostic by design, which means it works equally well for Node.js applications, Python services, Go binaries, Rust executables, and arbitrary shell commands.&lt;/p&gt;

&lt;p&gt;It is available on &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and installable via npm, Homebrew, Chocolatey, or APT.&lt;/p&gt;

&lt;p&gt;The core design principle is that process management should be declarative and reproducible rather than dependent on operator memory and runtime commands. That shift in model is what makes Oxmgr genuinely different from a cosmetic PM2 reimplementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Declarative Configuration and Idempotent Apply
&lt;/h2&gt;

&lt;p&gt;The centerpiece of Oxmgr's operational model is &lt;code&gt;oxfile.toml&lt;/code&gt;, its native configuration format. Instead of managing services through a sequence of CLI commands, you describe the desired state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nn"&gt;[[apps]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"api"&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"node server.js"&lt;/span&gt;
&lt;span class="py"&gt;restart_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"on_failure"&lt;/span&gt;
&lt;span class="py"&gt;max_restarts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;health_cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"curl -fsS http://127.0.0.1:3000/health"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then you apply it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr validate ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr apply ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;oxmgr validate&lt;/code&gt; catches invalid or inconsistent configuration before it reaches production. &lt;code&gt;oxmgr apply&lt;/code&gt; is idempotent — reapplying an unchanged file leaves every service untouched, and when something did change, only the affected services are reconciled.&lt;/p&gt;

&lt;p&gt;This matters operationally. Config lives in Git, is reviewable in pull requests, and is safe to run repeatedly in CI/CD pipelines without risk of bouncing services unnecessarily.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production-Grade Supervision Features
&lt;/h2&gt;

&lt;p&gt;A serious PM2 alternative needs to do more than restart failed processes. Oxmgr includes the supervision primitives that production environments actually require.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health checks&lt;/strong&gt; run a configurable command on a schedule. After a defined number of consecutive failures, Oxmgr automatically restarts the service. This is more reliable than restart-on-exit alone because it catches processes that are alive but non-functional — the kind of failure that exit-based supervision cannot see.&lt;/p&gt;
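&lt;p&gt;The consecutive-failure logic is simple to picture. Here is a generic Python sketch of the pattern (names and defaults are illustrative, not Oxmgr's internals):&lt;/p&gt;

```python
import subprocess
import time

def supervise_health(check_cmd, max_failures=3, interval=1.0, rounds=10):
    """Run a health command on a schedule; signal a restart after
    `max_failures` consecutive failures. Generic sketch of the pattern,
    not Oxmgr's implementation.
    """
    failures = 0
    for _ in range(rounds):
        ok = subprocess.run(["sh", "-c", check_cmd]).returncode == 0
        failures = 0 if ok else failures + 1   # a success resets the streak
        if failures == max_failures:
            return "restart"                   # hand the process back to the supervisor
        time.sleep(interval)
    return "healthy"

print(supervise_health("false", max_failures=2, interval=0))  # prints restart
```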

&lt;p&gt;&lt;strong&gt;Crash-loop protection&lt;/strong&gt; prevents infinite restart storms. If a service crashes repeatedly within a short window, Oxmgr applies a configurable cutoff and stops automatic restarts, surfacing the problem rather than masking it behind an endless loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource monitoring&lt;/strong&gt; tracks CPU and RAM per process. On Linux, Oxmgr optionally enforces hard limits through cgroup v2, which means resource constraints are applied at the OS level rather than relying on application-level cooperation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful shutdown&lt;/strong&gt; respects configurable stop signals and timeout escalation, giving services the opportunity to drain connections cleanly before being forcibly terminated.&lt;/p&gt;
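&lt;p&gt;Signal escalation follows the same pattern everywhere. A generic Python sketch (not Oxmgr's code): send SIGTERM, wait out a drain window, then force-kill:&lt;/p&gt;

```python
import subprocess

def graceful_stop(child, timeout=5.0):
    """Terminate a child with escalation: SIGTERM first, SIGKILL if needed.

    Generic sketch of the pattern, not Oxmgr's implementation. Returns the
    child's final return code (negative signal number on POSIX).
    """
    child.terminate()                    # SIGTERM: ask the service to drain
    try:
        child.wait(timeout=timeout)      # give it the configured window
    except subprocess.TimeoutExpired:
        child.kill()                     # SIGKILL: forcible termination
        child.wait()
    return child.returncode

proc = subprocess.Popen(["sleep", "60"])
print(graceful_stop(proc, timeout=1.0))  # prints -15 (killed by SIGTERM)
```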

&lt;p&gt;&lt;strong&gt;Per-process log management&lt;/strong&gt; captures stdout and stderr to separate files with rotation, keeping logs organized across fleets of services.&lt;/p&gt;

&lt;p&gt;All of this is visible through a built-in terminal UI (&lt;code&gt;oxmgr ui&lt;/code&gt;) that provides a fleet summary, real-time metrics, and keyboard-driven controls for inspection and lifecycle operations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Migrating From PM2
&lt;/h2&gt;

&lt;p&gt;A PM2 alternative is only useful if adoption does not require rewriting everything at once. Oxmgr addresses this directly.&lt;/p&gt;

&lt;p&gt;It supports PM2's &lt;code&gt;ecosystem.config.json&lt;/code&gt; format as a migration bridge. Existing teams can import their current configuration, evaluate Oxmgr against familiar workloads, and convert to &lt;code&gt;oxfile.toml&lt;/code&gt; when they are ready:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Import existing PM2 config&lt;/span&gt;
oxmgr import ./ecosystem.config.json

&lt;span class="c"&gt;# Convert to native format for long-term use&lt;/span&gt;
oxmgr convert ecosystem.config.json &lt;span class="nt"&gt;--out&lt;/span&gt; oxfile.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This incremental path matters because most teams do not switch operational tooling in a single migration event. They evaluate in parallel, adopt gradually, and formalize the switch once confidence is established.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Day-to-Day Workflow
&lt;/h2&gt;

&lt;p&gt;For teams that care about operational simplicity, the daily workflow in Oxmgr looks like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oxmgr validate ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr apply ./oxfile.toml &lt;span class="nt"&gt;--env&lt;/span&gt; prod
oxmgr status api
oxmgr logs api &lt;span class="nt"&gt;-f&lt;/span&gt;
oxmgr pull api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The last command — &lt;code&gt;oxmgr pull&lt;/code&gt; — is worth highlighting. For git-backed services, it fetches the latest changes and reloads or restarts the service only if the commit actually changed. This avoids unnecessary disruption on unchanged deployments and fits naturally into webhook-driven update workflows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Rust was a deliberate choice, not a trend. Oxmgr runs as a long-lived system daemon, and that context has specific requirements: a small runtime dependency surface, predictable memory behavior, a single native binary, and consistent cross-platform support without requiring a runtime installed on the target host.&lt;/p&gt;

&lt;p&gt;Rust satisfies all of these. The choice is not about raw performance benchmarks — it is about building a durable, predictable daemon that behaves consistently across Linux, macOS, and Windows without surprising operators.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where Oxmgr Fits
&lt;/h2&gt;

&lt;p&gt;Oxmgr is designed as a host-level process manager. It fits well on application servers, virtual machines, and dedicated service hosts running mixed-language workloads — anywhere a team wants clear, auditable operational behavior in place of ad hoc scripts or language-specific tooling.&lt;/p&gt;

&lt;p&gt;It is not designed to replace Kubernetes or container orchestration. It operates at a different layer and solves a different problem: keeping services running on a single host with predictable, inspectable behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Infrastructure tooling should be transparent about its boundaries. Currently, Oxmgr's cluster mode targets Node.js command shapes rather than acting as a universal clustering abstraction for every runtime. That is a deliberate tradeoff. Explicit scope is preferable to ambiguous behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Oxmgr is open source under the MIT license.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# npm&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; oxmgr

&lt;span class="c"&gt;# Homebrew&lt;/span&gt;
brew tap empellio/homebrew-tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;oxmgr

&lt;span class="c"&gt;# APT&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb [trusted=yes] https://vladimir-urik.github.io/OxMgr/apt stable main"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/oxmgr.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;oxmgr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Full documentation, the Oxfile specification, and deployment guides are available in the &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Vladimir-Urik" rel="noopener noreferrer"&gt;
        Vladimir-Urik
      &lt;/a&gt; / &lt;a href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;
        OxMgr
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Oxmgr is a modern, lightweight process manager written in Rust: a fast, deterministic alternative to PM2 for managing any executable across platforms.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Oxmgr&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/Vladimir-Urik/OxMgr/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/Vladimir-Urik/OxMgr/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Vladimir-Urik/OxMgr/releases" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/9f2ddf36fa4d56492e9fe64852175d6b6c49c62d6e2cbd346a16ff399314ad49/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f566c6164696d69722d5572696b2f4f784d67723f696e636c7564655f70726572656c6561736573" alt="GitHub Release"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Vladimir-Urik/OxMgr/./LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/92f236d8c5cb5a67341de8b173cf80a45f09bd0a6ef05e870cd51e69cbf75c4d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d3265613434662e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Oxmgr is a lightweight, cross-platform Rust process manager and PM2 alternative.&lt;/p&gt;
&lt;p&gt;Use it to run, supervise, reload, and monitor long-running services on Linux, macOS, and Windows. Oxmgr is language-agnostic, so it works with Node.js, Python, Go, Rust binaries, and shell commands.&lt;/p&gt;
&lt;p&gt;Latest published benchmark snapshots: &lt;a href="https://github.com/Vladimir-Urik/OxMgr/./BENCHMARK.md" rel="noopener noreferrer"&gt;BENCHMARK.md&lt;/a&gt; and &lt;a href="https://github.com/Vladimir-Urik/OxMgr/./benchmark.json" rel="noopener noreferrer"&gt;benchmark.json&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why Oxmgr&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Language-agnostic: manage any executable, not just Node.js apps&lt;/li&gt;
&lt;li&gt;Cross-platform: Linux, macOS, and Windows&lt;/li&gt;
&lt;li&gt;Low overhead: Rust daemon with persistent local state&lt;/li&gt;
&lt;li&gt;Practical operations: restart policies, health checks, logs, and CPU/RAM metrics&lt;/li&gt;
&lt;li&gt;Config-first workflows with idempotent &lt;code&gt;oxmgr apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PM2 ecosystem compatibility via &lt;code&gt;ecosystem.config.{js,cjs,mjs,json}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Interactive terminal UI with live search, filters, and sort controls&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Core Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Start, stop, restart, reload, and delete managed processes&lt;/li&gt;
&lt;li&gt;Named services and namespaces&lt;/li&gt;
&lt;li&gt;Restart policies: &lt;code&gt;always&lt;/code&gt;, &lt;code&gt;on-failure&lt;/code&gt;, and &lt;code&gt;never&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Health checks with automatic restart on repeated failures&lt;/li&gt;
&lt;li&gt;Config-driven file watch with ignore patterns and restart debounce&lt;/li&gt;
&lt;li&gt;Log tailing, log rotation, and per-process stdout/stderr logs&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Vladimir-Urik/OxMgr" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>rust</category>
      <category>node</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
