Building Crash-Tolerant Node.js Apps with Clusters.

Ever wondered why a browser tab can crash without taking down the whole browser?
Or why a telecom system is “never really down”, unless it’s down down?

Yet your Node.js app crashes all the time.

It’s all code, right?
So why does their stuff survive explosions and yours doesn’t?

But to understand that, you first need a quick look at how a program is loaded into memory.


The Kernel

The kernel is the core component of an operating system (for example, the Linux kernel).

Its job is to:

  • manage every running program
  • assign each program its own memory space
  • isolate programs from each other

If you’re running two applications:

app1 | app2

The kernel keeps them separated so they can’t corrupt each other’s memory.

If app2 crashes, the kernel makes sure it implodes in isolation and doesn’t affect app1.

That part most people know.

Here’s the part most people miss:

The kernel doesn’t just kill the crashing app; it reports the crash to whoever launched it.

Conceptually, it looks like this:

int main() {
  return 0; // a normal exit returns 0; on a crash, the kernel reports a non-zero status (or signal) to the launcher instead
}

Now this is where things get interesting.


Booting an App Inside Another App

What happens when you launch an app from another app?

app1 | app2 | app1_child

A more intuitive picture:

app1_child
app1        | app2

The child process still gets its own memory space, fully isolated.

But when it crashes…
who does the kernel report that crash to?

Exactly.

The parent. The runner.

And that is the secret behind crash-tolerant software.

The parent stays alive.
The child dies.
The parent restarts it.
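
In Node.js you can watch this happen with the built-in child_process module. Here’s a minimal sketch of a supervising parent (worker.js is just a placeholder for whatever script you want to keep alive):

// parent.js - stays alive and supervises the child
const { fork } = require('child_process');

function startWorker() {
  const child = fork('./worker.js'); // boots a separate Node.js process

  // this is where the kernel's crash report lands: an exit code or the signal that killed the child
  child.on('exit', (code, signal) => {
    console.log(`child ${child.pid} died (code=${code}, signal=${signal}), restarting...`);
    startWorker(); // the parent survives and simply boots a fresh child
  });
}

startWorker();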


This Pattern Is Everywhere

You’ve already been using this idea without realizing it:

  • Browser tabs
  • Kubernetes pods
  • Telecom infrastructure
  • Databases
  • Supervisors in Erlang

Sometimes it’s hardware-backed (real OS processes).
Sometimes it’s software-simulated (Erlang-style lightweight processes).

Same idea either way:

Let things crash, just don’t let them take everything with them.

That’s why phone lines don’t really “die.”
That’s why browsers feel unkillable.


Node.js Can Do This Too

You can do the exact same thing in Node.js using clusters.

Clusters are not threads.
They are real OS processes.

When you fork a worker, you are literally booting another Node.js process alongside the current one.
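
You can see the “real OS process” part for yourself: each fork gets its own process ID. A tiny sketch, using nothing beyond the core cluster API:

const cluster = require('cluster');

if (cluster.isPrimary) {
  console.log(`primary pid: ${process.pid}`);
  const worker = cluster.fork(); // spawns a separate Node.js process
  console.log(`forked worker pid: ${worker.process.pid}`); // a different pid: a different OS process
} else {
  console.log(`worker pid: ${process.pid}`);
}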

I use this all the time.

For example:
My profiler receives real-time events in cluster workers, while the GUI runs in the main process.

If a worker explodes?
The UI stays alive.

Trace CLI (reference: How I Built a Graphics Renderer for Node.js)


Clusters in Node.js

Here’s a simple example: a server running in a cluster that randomly crashes and automatically restarts.

const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  console.log(`primary ${process.pid} is running`);

  const numCPUs = os.cpus().length;

  // fork a couple of workers
  for (let i = 0; i < Math.min(numCPUs, 2); i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died (code=${code}, signal=${signal})`);

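    // small delay before reforking, so a worker that dies instantly doesn't restart in a tight loop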
    setTimeout(() => {
      console.log('Restarting worker...');
      cluster.fork();
    }, 1000);
  });

  cluster.on('online', (worker) => {
    console.log(`Worker ${worker.process.pid} is online`);
  });

} else {
  console.log(`Worker ${process.pid} started`);

  const http = require('http');

  const server = http.createServer((req, res) => {
    res.writeHead(200);
    res.end(`Hello from worker ${process.pid}`);
  });

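  // every worker listens on port 3000; the cluster module lets them share it
  // (by default, the primary accepts connections and distributes them to workers)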
  server.listen(3000, () => {
    console.log(`Worker ${process.pid} listening on port 3000`);
  });

  // simulate random crashes
  const crashTimeout = Math.floor(Math.random() * 30000) + 10000;
  setTimeout(() => {
    console.log(`Worker ${process.pid} will crash in 5 seconds...`);

    setTimeout(() => {
      throw new Error(`Simulated crash in worker ${process.pid}`);
    }, 5000);
  }, crashTimeout);

  process.on('SIGTERM', () => {
    console.log(`Worker ${process.pid} shutting down gracefully`);
    server.close(() => process.exit(0));
  });
}

Everything inside the else block runs in a dedicated worker process.

In this example, we spin up two workers:

for (let i = 0; i < Math.min(numCPUs, 2); i++) {
  cluster.fork();
}

The if block is the main app.
If that crashes, everything dies.

But if a worker crashes?
The parent notices and boots a new one.

That’s the whole trick.
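
One caveat for real systems: if a worker dies instantly on boot (a bad config, a missing dependency), a blind refork will spin forever. A small guard in the primary helps; here’s a sketch of a variant of the exit handler above, with a made-up budget of 10 deaths per minute:

let recentDeaths = 0;

cluster.on('exit', (worker, code, signal) => {
  recentDeaths++;
  setTimeout(() => recentDeaths--, 60_000); // forget deaths older than a minute

  if (recentDeaths > 10) {
    console.error('Workers are crash-looping, giving up on restarts');
    return;
  }

  setTimeout(() => cluster.fork(), 1000);
});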


This is obviously a high-level overview, but clusters are incredibly powerful when you want:

  • fault isolation
  • crash recovery
  • long-running systems that don’t fall over

If you want the gritty details, the Node.js docs are worth a read.


More from me:

How I Built a Graphics Renderer for Node.js
Visualizing Evolutionary Algorithms in Node.js
Building A Distributed Video Transcoding System with Node.js.

tessera.js repo

Thanks for reading!
