Imagine you have a Node.js server with an endpoint that performs heavy CPU operations.
By default, your server runs your JavaScript on a single thread, which means it will freeze under heavy CPU load. If your server has other asynchronous endpoints, for example ones that execute database operations, those endpoints become unresponsive while the heavy endpoint is processing.
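To make the problem concrete, here is a tiny sketch (not the article's actual server; the `/heavy` route is just a placeholder) of a single-threaded server where one synchronous handler starves every other request:

```js
// sketch.js — minimal illustration of the problem: both routes share the single
// main thread, so a synchronous handler freezes everything else.
const http = require('node:http');

http
  .createServer((req, res) => {
    if (req.url === '/heavy') {
      // Synchronous busy-wait: while this loop runs, the event loop cannot
      // answer any other request, including the instant one below.
      const start = Date.now();
      while (Date.now() - start < 5000) {} // ~5 seconds of pure CPU work
      return res.end('done');
    }
    res.end('OK'); // normally instant, but stalled while /heavy is being computed
  })
  .listen(3000);
```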
Our first idea is to create more threads and send the heavy tasks to be processed in parallel on another CPU core. Once a task finishes, the worker sends the output back to the main thread, which returns the response to the client.
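A rough sketch of that "one worker per task" idea with the worker_threads module, written as a single self-contained file (heavyTask is a stand-in computation, not the article's code):

```js
// offload.js — spawn a worker, hand it the input, get the result back by message.
const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');

// Stand-in for any CPU-bound function.
function heavyTask(iterations) {
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += Math.sqrt(i);
  return acc;
}

if (isMainThread) {
  // Main thread: spawn a worker and wait for the message back.
  const runInWorker = (iterations) =>
    new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: iterations });
      worker.once('message', resolve); // the worker posts its result here
      worker.once('error', reject);
    });

  runInWorker(1e9).then((result) => console.log('result:', result));
} else {
  // Worker thread: do the heavy work off the main thread, then post the result back.
  parentPort.postMessage(heavyTask(workerData));
}
```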
The problem now is that we should not have more threads than available CPU cores (technically we can, but it does not make much sense), so we start thinking about worker pools, where we instantiate a fixed number of workers and reuse them for our tasks.
Now we have a stable structure: we offload CPU-intensive tasks to other threads, keeping the main thread free and available for new requests.
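With Piscina (the pool library used in this benchmark, as noted in the tooling section below), the pattern looks roughly like this; the worker filename and handler name are illustrative, not taken from the repository:

```js
// pool.js — a minimal worker-pool sketch using Piscina.
const path = require('node:path');
const Piscina = require('piscina');

const pool = new Piscina({
  filename: path.resolve(__dirname, 'worker.js'),
  maxThreads: 2, // fixed number of workers, reused for every task
});

// worker.js (separate file) just exports the CPU-heavy function:
// module.exports = (n) => { /* ...heavy computation... */ };

async function handleBlockingRequest(n) {
  // Dispatched to one of the pool's threads; the main event loop stays free.
  return pool.run(n);
}
```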
I've set up a test case where we run the server in a Docker container with 2 CPUs and 2 GB of memory. The server has a root endpoint / that returns an OK response and a /blocking/:n endpoint that runs a Fibonacci algorithm.
I've left n as a parameter so we can customize how much work we want the server to do (n is the Fibonacci input).
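The exact implementation is in the linked repo, but the naive recursive version is presumably something like the classic one-liner below; its cost grows roughly exponentially with n, so n is a convenient dial for how long the CPU stays busy:

```js
// Naive recursive Fibonacci: fib(35) already burns a noticeable amount of CPU time,
// and each increment of n roughly multiplies the work.
const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
```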
All the source code for the server and the benchmark can be found here.
I've set up 20k requests on / and 30 requests on /blocking/35. Each run takes approximately 10 seconds to execute, and then we can analyze the output. Note: the tests hit both endpoints simultaneously.
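The load was generated with autocannon (see the tooling notes below). A hedged sketch of firing both loads at once with its programmatic API could look like this; the repo's actual benchmark script may differ:

```js
// benchmark.js — illustrative only; the real benchmark script lives in the linked repo.
const autocannon = require('autocannon');

async function run() {
  // Fire both loads at the same time so the fast route competes with the blocking one.
  const [root, blocking] = await Promise.all([
    autocannon({ url: 'http://localhost:3000/', amount: 20000 }),
    autocannon({ url: 'http://localhost:3000/blocking/35', amount: 30 }),
  ]);
  console.log(root.latency, blocking.latency);
}

run();
```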
One test was against a server with 1 instance with 2 threads.
The other test was against a server with 2 instances with 1 thread each.
The server clustering was done with Node's cluster module, the worker pool uses Piscina (which internally uses the worker_threads module), and the tests were executed with autocannon.
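For reference, the clustered variant (2 processes, 1 thread each) boils down to something like the sketch below; startServer is a placeholder for whatever boots the HTTP server in the repo:

```js
// cluster.js — rough sketch of the "2 processes" setup with node:cluster (Node 16+).
const cluster = require('node:cluster');

const INSTANCES = 2; // one process per CPU in the 2-CPU container

if (cluster.isPrimary) {
  for (let i = 0; i < INSTANCES; i++) cluster.fork();
  cluster.on('exit', () => cluster.fork()); // replace a worker process that dies
} else {
  // Each forked process runs its own event loop; incoming connections are
  // distributed between them by the cluster module.
  require('./server').startServer(); // placeholder — adapt to the real entry point
}
```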
Results
2 processes, 1 thread each
Results for http://localhost:3000/
┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 0 ms │ 2 ms │ 43 ms │ 50 ms │ 4.17 ms │ 9.12 ms │ 94 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Req/Sec │ 1,162 │ 1,162 │ 1,926 │ 3,761 │ 2,000.2 │ 744.83 │ 1,162 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 277 kB │ 277 kB │ 458 kB │ 895 kB │ 476 kB │ 177 kB │ 277 kB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 10
20k requests in 10.07s, 4.76 MB read
Results for http://localhost:3000/blocking/35
┌─────────┬────────┬─────────┬─────────┬─────────┬────────────┬────────────┬─────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼─────────┼─────────┼────────────┼────────────┼─────────┤
│ Latency │ 276 ms │ 1968 ms │ 8645 ms │ 8645 ms │ 2327.81 ms │ 1998.18 ms │ 8645 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴────────────┴────────────┴─────────┘
┌───────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ Req/Sec │ 1 │ 1 │ 3 │ 3 │ 2.73 │ 0.62 │ 1 │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ Bytes/Sec │ 253 B │ 253 B │ 759 B │ 759 B │ 690 B │ 156 B │ 253 B │
└───────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 11
30 requests in 11.07s, 7.59 kB read
1 process with 2 threads
Results for http://localhost:3000/
┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 1 ms │ 2 ms │ 35 ms │ 43 ms │ 4.56 ms │ 7.63 ms │ 61 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬──────────┬────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼──────────┼────────┼────────┤
│ Req/Sec │ 640 │ 640 │ 1,454 │ 3,379 │ 1,818.37 │ 790.84 │ 640 │
├───────────┼────────┼────────┼────────┼────────┼──────────┼────────┼────────┤
│ Bytes/Sec │ 152 kB │ 152 kB │ 346 kB │ 804 kB │ 433 kB │ 188 kB │ 152 kB │
└───────────┴────────┴────────┴────────┴────────┴──────────┴────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 11
20k requests in 11.05s, 4.76 MB read
Results for http://localhost:3000/blocking/35
┌─────────┬────────┬─────────┬─────────┬─────────┬────────────┬───────────┬─────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼─────────┼─────────┼────────────┼───────────┼─────────┤
│ Latency │ 276 ms │ 2794 ms │ 3861 ms │ 3861 ms │ 2358.17 ms │ 957.99 ms │ 3861 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴────────────┴───────────┴─────────┘
┌───────────┬───────┬───────┬─────────┬─────────┬───────┬───────┬───────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼───────┼───────┼─────────┼─────────┼───────┼───────┼───────┤
│ Req/Sec │ 2 │ 2 │ 4 │ 4 │ 3.34 │ 0.82 │ 2 │
├───────────┼───────┼───────┼─────────┼─────────┼───────┼───────┼───────┤
│ Bytes/Sec │ 506 B │ 506 B │ 1.01 kB │ 1.01 kB │ 843 B │ 207 B │ 506 B │
└───────────┴───────┴───────┴─────────┴─────────┴───────┴───────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 9
30 requests in 9.02s, 7.59 kB read
Conclusion
We can see that having 1 process with 2 threads gave us better results for both endpoints.
Looking at the 99th percentile latency, the single process with two threads was 7 ms faster on the / route and 4784 ms faster on the /blocking endpoint.
This shows us that spinning up multiple independent processes might seem like a quick scaling fix, but in practice we waste resources on process overhead instead of computing actual work. More importantly, a single process with a worker pool keeps the main event loop unblocked: it handles all incoming traffic and efficiently distributes the heavy CPU load, resulting in significantly lower wait times for 99% of our requests.
Of course, we could expand the test scenario and environment to get more realistic numbers, but that is work for another article.
Thanks for your attention!