Clustering: The Good Parts

#node #javascript #systemdesign

Last week I experimented in depth with clustering in Node.js applications, and in this article I'm going to break down the core concepts (pun intended) and possible use cases.

NB: Unlike in previous articles, I will refrain from using analogies to represent technologies, as I have learned that analogies (or the way they are communicated) may land differently from person to person. Instead, I shall focus on asking and answering questions about the tech to keep the focus on its use cases and abilities.

Why

As the popular saying goes, "A journey of a thousand miles begins with a single step." In the same way, our journey into learning something new should always start with a WHY, much like your four-year-old nephew asking why the sky is blue on a random Tuesday afternoon.

Q: Why would we need to use clustering?
A: To make our application more performant when put through considerable stress (also read as scale).
To further drive the point home, let's quickly take a look at a simple server in Node.js:

const http = require('node:http');
const OPERATIONS_COUNT = 10;

const server = http.createServer((req, res) => {
  if (req.url === '/') {
    // Simulate CPU-bound work with a synchronous loop
    for (let i = 0; i < OPERATIONS_COUNT; i++) {}
    res.end('What is clustering anyway :(');
  } else {
    res.end('clustering is the process of creating clusters');
  }
});

server.listen(3000, () => console.log('Listening on port 3000'));

We visit http://localhost:3000 and we instantly get back a text saying What is clustering anyway :( and if we visit any other route we get clustering is the process of creating clusters - simple enough.

Now let's step things up a little: increase OPERATIONS_COUNT to 10_000_000_000, restart the server and hit the / endpoint again. This time there is a considerable lag before we get our response. More importantly, notice that while / is still processing our initial request, every subsequent request made to the server is held up as well, because the loop blocks the single thread that runs our JavaScript. Clustering helps avoid some of these problems.
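One way to see the blocking for yourself, without an HTTP server, is to schedule a zero-delay timer and then run the same kind of busy loop. This is only a minimal sketch: blockFor and measureTimerDelay are names made up for this illustration, and the iteration count is arbitrary.

```javascript
// Hypothetical helper: a synchronous busy loop, like the one in the "/" handler.
function blockFor(iterations) {
  for (let i = 0; i < iterations; i++) {}
}

// Hypothetical helper: resolves with how late a 0 ms timer actually fired.
function measureTimerDelay(iterations) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    blockFor(iterations); // the timer callback cannot run until this returns
  });
}

measureTimerDelay(1_000_000_000).then((delayMs) => {
  console.log(`Asked for 0 ms, timer fired after ${delayMs} ms`);
});
```

The timer is stuck behind the loop in the same way a second HTTP request is stuck behind the first one.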

Q: What is clustering?
A: First let's take a look at what a cluster is, according to the Oxford dictionary:

a group of similar things or people positioned or occurring closely together.

So clustering is just the process of making clusters in an application.

Q: How would one use clustering?
A: I'm glad you asked, that's the focus for today.

At an abstract level, a cluster is a group of two or more nodes/processes that run in parallel to achieve a common goal. This enables tasks that would otherwise be cumbersome for one process to be spread out across a set of clone processes, each with its own memory, thread pool and engine instance, hence increasing the overall performance of the application.

These processes are usually connected so that they can communicate with each other; in Node.js, the main process and its children exchange messages asynchronously over an IPC channel. A cluster consists of ONE master process (the main node, called the primary in current Node.js documentation), which is responsible for delegating tasks, and the child processes (cloned nodes), which work on the assigned tasks.

Types of cluster configuration

Active-Active Configuration
Active-Passive Configuration

Just like load balancers, clusters can be configured as active-active or active-passive. In an active-active configuration, all child processes actively handle tasks alongside the master process. An active-passive configuration keeps the whole workload on the master process and only hands tasks over to the child processes should the master process fail or experience downtime.
For the scope of this article I'll focus on the active-active configuration.

Setting up processes

Next we'll set up processes to address the lag in our code snippet above.
Node.js has a built-in module called cluster which enables clustering in our application. Our code initially runs in a single process (the master node); we will then call on the extra cores in our computer to run worker nodes. Computers differ in the number of cores they have, though. To check how many cores your computer has, you can run the snippet below:

const os = require('node:os');
console.log(os.cpus().length); // number of logical cores
console.log(os.cpus());        // details for each core

/** You would get back an array that is structured similarly
* [
* {
*   model: 'Intel(R) Core(TM) i5-9878U CPU @ 1.40GHz',
*   speed: 1400,  
*   times: {
*     user: 92892892,
*     nice: 0,
*     sys: 111111,
*     idle: 10101010,
*     irq: 0
*    }
* },
* { 
*  ...
* }
* ]
*/

Each object in the array has these fields (user, nice, sys, idle and irq are nested under times):

model: the CPU model
speed: the CPU speed in MHz
user: the number of milliseconds the CPU has spent in user mode
nice: the number of milliseconds the CPU has spent in nice mode
sys: the number of milliseconds the CPU has spent in sys mode
idle: the number of milliseconds the CPU has spent in idle mode
irq: the number of milliseconds the CPU has spent in irq mode

To get more information out of your CPU, you can use the node-os-utils package.
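If you only need a quick summary, the built-in fields are enough. The sketch below turns the times object into a rough "busy" percentage per core; busyPercent is a helper name made up for this example, and the figure covers the whole time since boot rather than the current load.

```javascript
const os = require('node:os');

// Hypothetical helper: share of time (0-100) a core spent doing anything but idling.
function busyPercent(times) {
  const total = times.user + times.nice + times.sys + times.idle + times.irq;
  if (total === 0) return 0;
  return ((total - times.idle) / total) * 100;
}

const cpus = os.cpus();
console.log(`Logical cores: ${cpus.length}`);
cpus.forEach((cpu, index) => {
  const busy = busyPercent(cpu.times).toFixed(1);
  console.log(`core ${index}: ${cpu.model} @ ${cpu.speed} MHz, ${busy}% busy since boot`);
});
```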

To our snippet we will add the cluster module and spin off as many worker processes as our computer has cores:

const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

const cores = os.cpus().length;
const OPERATIONS_COUNT = 10_000_000_000;

const server = http.createServer((req, res) => {
  if (req.url === '/') {
    for (let i = 0; i < OPERATIONS_COUNT; i++) {}
    res.end('What is clustering anyway :(');
  } else {
    res.end('clustering is the process of creating clusters');
  }
});

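// cluster.isPrimary replaces the deprecated cluster.isMaster (Node.js 16+)
if (cluster.isPrimary) {
  // The primary process only forks one worker per core
  for (let i = 0; i < cores; i++) {
    cluster.fork();
  }
} else {
  // Each worker runs this branch; the workers share port 3000
  server.listen(3000, () => console.log(`worker ${process.pid} listening on port 3000`));
}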

When we visit / this time around and then visit /about, we can see that /about is not held up by the processing of /. This illustrates that another worker process stepped in and handled the next request.

To do the math on the performance gain of deploying child processes in our application, let's imagine a single node can handle 1000 requests a minute. With a cluster of 8 nodes on 8 cores, that figure goes up to as much as 8000 requests a minute in the ideal case. THAT'S UP TO 8X MORE OUTPUT! Not sure how you feel about that but for me, it blows my mind.
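The arithmetic above is a best case. As a back-of-the-envelope sketch, assuming purely CPU-bound work and no overhead from the operating system or from distributing requests, throughput grows with the number of workers only until you run out of cores. idealThroughput is a made-up helper and the model is deliberately simplistic.

```javascript
const os = require('node:os');

// Hypothetical model: CPU-bound throughput scales with workers until cores run out.
function idealThroughput(perWorkerRpm, workers, cores) {
  return perWorkerRpm * Math.min(workers, cores);
}

console.log(idealThroughput(1000, 8, 8));  // 8000
console.log(idealThroughput(1000, 16, 8)); // 8000, extra workers have no spare core
console.log(idealThroughput(1000, 8, os.cpus().length)); // estimate for this machine
```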

Clustering is frequently used in software development, and even in technologies like Kubernetes and Redis.