Scaling your Node.js application is like transforming a small food truck into a full-scale restaurant during rush hour. Imagine your humble food truck managing a few orders with ease, only to get overwhelmed as the crowd grows.
In the same way, your basic Node.js setup might perform well under light load but could buckle when millions of users try to access your service at once. Whether you're beefing up a single server (vertical scaling) or deploying multiple servers (horizontal scaling), effective scaling strategies ensure that your application remains fast, responsive, and reliable—even during traffic surges.
I was inspired to dive deeper into these concepts after watching Scaling Hotstar for 25 Million Concurrent Viewers by Gaurav Sen. Having built some Node.js projects, I became curious about how to prepare my applications for high-traffic scenarios. Through this exploration, I gained a clear understanding of both vertical and horizontal scaling, along with additional insights from other experts. In this post, I'll share how you can implement these strategies to make sure your Node.js app is truly ready for millions of users.
Let's kick things off by building a simple _Express application_ that mimics the core functionality of a live sports streaming service.
In this example, our app will expose four endpoints:
1. GET /live-video: To serve a live video stream.
2. GET /live-score: To display the current match score.
3. GET /highlights: To provide recommended match highlights.
4. GET /live-commentary: To stream live commentary.
Even though these endpoints are simple placeholders, they represent the structure of a real-world application designed to handle high traffic. Here's the starter code:
// Import the Express module
import express from 'express';

// Create an Express app instance
const app = express();

// Define the port where the server will listen
const port = 3000;

// Endpoint for live video streaming
app.get('/live-video', (req, res) => {
  // In a real-world scenario, this would stream video data
  res.send("Live video stream is coming soon!");
});

// Endpoint for live score updates
app.get('/live-score', (req, res) => {
  // This would normally return live score updates
  res.send("Current match score: 2-1");
});

// Endpoint for recommended highlights
app.get('/highlights', (req, res) => {
  // This would return key moments from the match
  res.send("Here are your match highlights!");
});

// Endpoint for live commentary
app.get('/live-commentary', (req, res) => {
  // In production, this would stream live commentary
  res.send("Live commentary will be available shortly.");
});

// Start the server and listen on the specified port
app.listen(port, () => {
  console.log(`Server is running on http://localhost:${port}`);
});
Now that we've set up our simple Express app as the foundation, let's explore how we can scale this setup to handle a surge in users and traffic. One of the most straightforward methods to boost performance is through vertical scaling.
What is Vertical Scaling?
Vertical scaling (or "scaling up") means upgrading a single server's hardware—adding more CPU, RAM, or storage—to handle increased load. This approach is simple because no code modifications are needed; you just run your app on a beefier machine.
Let's consider a practical example of vertical scaling:
Imagine you deploy your Express app on a small server—say, an AWS t2.micro instance with a single CPU core and 1GB of RAM. Initially, the server handles a moderate amount of traffic without issues. However, as your user base grows, heavy tasks (like processing live video streams) might begin to overwhelm that lone CPU core.
Now, to manage this increased load without modifying any code, you decide to upgrade your server to a more powerful instance, such as an AWS m5.large, which has multiple CPU cores and more RAM. This is vertical scaling in action: you're simply replacing your old server with a beefier one, expecting that the additional hardware resources will handle the increased load more efficiently.
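If you're curious what the runtime actually sees after such an upgrade, Node.js's built-in os module can report it. Here's a minimal sketch (the values printed will of course depend on the instance you're running on):

import os from 'os';

// Report the hardware resources visible to this Node.js process
console.log(`CPU cores: ${os.cpus().length}`);
console.log(`Total memory: ${(os.totalmem() / 1024 ** 3).toFixed(1)} GB`);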
However, here's an important caveat:
Even with additional cores, a single Node.js process will still run on just one core. Consider this simple example:
let c = 0;
while (true) {
  c++;
}
This infinite loop illustrates that the process is stuck on a single core. So, while vertical scaling gives you more overall power (like additional memory and CPU speed), the inherent single-threaded nature of Node.js might still limit performance for CPU-bound tasks—unless you integrate additional strategies like worker threads or clusters.
Run this script and watch a process monitor like top: the Node.js process pins a single core at 100% CPU. Launch three copies of the script and you'll see three separate processes, each maxing out its own core at 100%, which confirms that any single Node.js process only ever uses one core, no matter how many the machine has.
This example highlights both the benefits and limitations of vertical scaling: it's straightforward and requires no code changes, but it might not fully overcome the constraints imposed by Node.js's single-threaded event loop.
Building on the idea of vertical scaling, it's important to understand the underlying execution model of Node.js compared to other programming languages. This brings us to a comparison of single-threaded and multi-threaded environments.
Single-Threaded vs. Multi-Threaded Languages
Single-Threaded:
JavaScript (and thus Node.js) typically runs on one thread. This means that all code execution happens on a single thread, which can become a bottleneck for CPU-intensive tasks. Even if you upgrade your server with more cores, a single Node.js process will continue to use only one core unless you explicitly use techniques like worker threads or clusters.
Multi-Threaded:
Languages like Rust and Java are designed to leverage multiple threads concurrently, making them more naturally suited for CPU-bound operations. For instance, consider the following Rust example that demonstrates multi-threading:
use std::thread;

fn main() {
    // Spawning 3 separate threads
    for _ in 0..3 {
        thread::spawn(|| {
            let mut counter: f64 = 0.00;
            loop {
                counter += 0.001; // Each thread runs its own loop independently
            }
        });
    }

    // Keep the main thread running indefinitely
    loop {}
}
In this Rust code, three threads are spawned, each incrementing its own counter in an infinite loop. This example shows how multi-threaded languages can distribute tasks across several CPU cores, potentially leading to better performance for certain types of computations.
The Problem with a Single-Threaded Process
Node.js is single-threaded, meaning that each Node.js process runs on one CPU core and executes code sequentially. Here's what happens:
Simple Tasks (e.g., Database Retrieval):
- If 1000 users request live scores simultaneously (a simple database query), Node.js can handle them relatively well through its non-blocking I/O. The requests get queued in the event loop and are processed one after the other, but because these operations are not CPU-heavy, the system keeps up.
Complex Tasks (e.g., Video Transcoding):
- However, if the same 1000 users request the live video feed and the process involves complex operations like video transcoding, the Node.js process will use the entire core to process one task at a time.
This means:
- The CPU becomes fully occupied by a single intensive task.
- Other incoming requests have to wait in a queue.
As a result, users might experience buffering or delays during the live stream.
Imagine this: While a CPU-intensive task is running, the entire core is blocked, and no other request can be processed until the task finishes. This is a major bottleneck for high-traffic applications like a live streaming service.
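To make this concrete, here's a minimal sketch in which the heavy work is simulated with a synchronous loop standing in for video transcoding. While /live-video is busy, even a request to the lightweight /live-score endpoint has to wait:

import express from 'express';

const app = express();

// Lightweight endpoint: responds immediately
app.get('/live-score', (req, res) => {
  res.send("Current match score: 2-1");
});

// CPU-heavy endpoint: the synchronous loop stands in for real transcoding work
app.get('/live-video', (req, res) => {
  let total = 0;
  for (let i = 0; i < 2e9; i++) {
    total += i; // blocks the event loop until the loop finishes
  }
  res.send(`Video processed: ${total}`);
});

app.listen(3000);

// While /live-video is crunching numbers, requests to /live-score hang,
// because both endpoints share the same single-threaded event loop.

Open /live-video in one tab and then request /live-score in another: the score response only arrives once the loop has finished.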
Due to Node.js's single-threaded nature, the main loop can only process one task at a time. To overcome this, we need parallel computation, allowing incoming requests to be processed concurrently.
This is where worker threads step in: they offload heavy, CPU-bound operations from the main thread.
While the main event loop handles lightweight, asynchronous tasks, the time-consuming processes run in parallel on separate threads—each with its own event loop, JS engine instance, and Node.js instance. This setup ensures that even if one worker is busy with a heavy computation, the main thread remains free to handle new incoming requests.
Let's see how this works in practice with our Express app. Consider the /live-video endpoint, which might need to perform a heavy task such as video transcoding. Instead of executing this computation on the main thread (and risking blockage), we delegate the task to a worker thread:
// main.js
import { Worker } from 'worker_threads';
import express from 'express';

const app = express();

app.get('/live-video', (req, res) => {
  // Offload heavy computation (e.g., video transcoding) to a worker thread
  const worker = new Worker('./worker-task.js');

  // Listen for the result from the worker thread
  worker.on('message', result => res.send(`Video stream processed: ${result}`));

  // Handle any errors from the worker thread
  worker.on('error', err => res.status(500).send(err.message));
});

app.listen(3000, () => console.log('Server listening on port 3000'));
And here's the corresponding worker thread code that simulates a heavy computation task:
// worker-task.js
import { parentPort } from 'worker_threads';

// Simulate a heavy computation (e.g., video transcoding)
let result = 0;
for (let i = 0; i < 1e9; i++) {
  result += i;
}

// Send the result back to the main thread
parentPort.postMessage(result);
In this setup, when a request hits /live-video, the main thread immediately delegates the heavy task to a worker thread. Meanwhile, the main event loop continues processing other requests without delay. This demonstrates the power of parallel processing in Node.js.
How Worker Threads Enhance Parallel Processing
When a Node.js application runs, it typically executes within a single process: one main thread, one event loop, and one V8 engine instance. This means that without worker threads, CPU-intensive operations would block that single thread and delay other operations. Worker threads help by:
- Offloading Heavy Computations: The main thread delegates time-consuming tasks to worker threads, which execute them in parallel on separate threads (distinct from libuv's thread pool, which handles asynchronous I/O).
- Maintaining Responsiveness: Since each worker thread runs its own event loop and JS engine instance, the main thread remains available to process new incoming requests.
- Isolating Execution Contexts: Each worker thread is completely isolated. Thanks to the V8 engine, each thread gets its own runtime environment, ensuring that the execution of heavy tasks does not interfere with the main thread or other workers.
For example, consider a scenario where your app allows users to upload a profile picture and then generates multiple resized versions for different use cases. Resizing an image is CPU-intensive and would block the main thread if done synchronously. By using worker threads, you can offload the resizing process, ensuring that your main application continues to respond quickly:
Worker Code (image-resize-worker.js):
// image-resize-worker.js
import { parentPort, workerData } from 'worker_threads';
import sharp from 'sharp';

async function resize() {
  const { imagePath, size, outputPath } = workerData;

  await sharp(imagePath)
    .resize(size.w, size.h, { fit: "cover" })
    .toFile(`${outputPath}/resize-${Date.now()}.jpg`);

  // Notify the main thread that the task is done
  parentPort.postMessage({ done: true });
}

resize();
Main Code (main.js):
// main.js
import { Worker } from 'worker_threads';

const imageResizer = (imagePath, size, outputPath) => {
  return new Promise((resolve, reject) => {
    // __dirname is not available in ES modules, so resolve the worker
    // script relative to this file using import.meta.url instead
    const worker = new Worker(new URL("./image-resize-worker.js", import.meta.url), {
      workerData: { imagePath, size, outputPath }
    });

    worker.on("message", resolve);
    worker.on("error", reject);
    worker.on("exit", code => {
      if (code !== 0)
        reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
};

// Example usage:
imageResizer("path/to/image.jpg", { w: 100, h: 100 }, "path/to/output")
  .then(result => console.log("Image resized successfully!", result))
  .catch(err => console.error("Image resizing failed:", err));
Worker threads are best suited for tasks like video compression, image processing, sorting large datasets, or complex calculations—anything that can benefit from parallel execution. They are not as effective for I/O-intensive tasks, as Node.js already handles asynchronous I/O very efficiently.
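For contrast, an I/O-bound task, such as reading a highlights file from disk, needs no worker thread at all: the read is handed off to the operating system and the event loop stays free while it completes. A minimal sketch (the file path is just a placeholder):

import { readFile } from 'fs/promises';

// Asynchronous I/O: no worker thread required
const data = await readFile('./highlights.json', 'utf8');
console.log(`Loaded ${data.length} characters of highlights data`);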
By incorporating worker threads into your Node.js applications, you unlock the potential to process CPU-intensive tasks in parallel. This not only improves overall performance but also ensures that your application remains scalable and responsive under heavy loads.
While worker threads allow you to offload heavy computations to separate threads within a single process, they still operate under the confines of one Node.js process. This means that even if your machine has multiple cores, your main process might only utilize one of them, leaving the others idle. To fully harness the power of a multi-core system, Node.js offers the Cluster module.
What are Clusters?
The Cluster module enables you to spawn multiple Node.js processes, each running on a different core. This means that instead of a single process handling all requests, you can distribute the load across several processes, thereby making efficient use of your hardware.
How Clusters Work:
- Multiple Processes: Each worker process handles its own set of requests independently.
- Efficient Load Distribution: Incoming requests are automatically distributed across all worker processes, ensuring that no single process becomes a bottleneck.
- Fault Tolerance: If one process crashes, it can be restarted automatically, helping to maintain overall system stability.
Consider the following example:
import express from "express";
import cluster from "cluster";
import os from "os";

const totalCPUs = os.cpus().length;
const port = 3000;

if (cluster.isPrimary) {
  console.log(`Primary process ${process.pid} is running on ${totalCPUs} cores`);

  // Fork a worker process for each CPU core
  for (let i = 0; i < totalCPUs; i++) {
    cluster.fork();
  }

  // Listen for dying workers and replace them
  cluster.on("exit", (worker) => {
    console.log(`Worker ${worker.process.pid} died. Restarting...`);
    cluster.fork();
  });
} else {
  const app = express();

  app.get("/", (req, res) => res.send("Hello World!"));

  app.listen(port, () => console.log(`Worker ${process.pid} listening on port ${port}`));
}
Reference: Node.js Cluster Documentation
In simpler words, clusters allow you to run multiple Node.js processes simultaneously, each utilizing a separate core. This not only maximizes your system's processing power but also adds an extra layer of fault tolerance, ensuring that if one process fails, others can continue serving requests.
While worker threads and clusters help you maximize the potential of a single machine, even the most powerful AWS EC2 instance has its limits. To handle millions of users and avoid a single point of failure, you need to scale beyond one machine—this is where horizontal scaling comes into play.
Scaling Beyond a Single Machine: Horizontal Scaling
Even with vertical scaling techniques like worker threads and clusters, a single EC2 instance can only handle a fixed number of concurrent users. Horizontal scaling addresses these limits by distributing your workload across multiple machines.
Why Horizontal Scaling?
Single Point of Failure:
Relying on a single server means that if it goes down, no one can access your service. Horizontal scaling mitigates this risk by having multiple instances.
Capacity Limit:
A single machine has a finite capacity for concurrent users. By adding more instances, you can significantly increase the overall capacity of your application.
How It Works
- Multiple EC2 Instances:
You deploy several instances, each running your optimized Node.js application (like our Express app). This distributes the workload and increases the total capacity.
- Load Balancer:
A load balancer (such as AWS ELB) sits in front of your instances, distributing incoming requests evenly among them. This ensures no single server is overwhelmed, enhancing both performance and reliability.
- Auto Scaling: With auto scaling, your system can automatically scale out—adding more instances when traffic spikes—and scale in—reducing instances when traffic is low. This dynamic adjustment helps maintain performance while optimizing costs.
By incorporating horizontal scaling, you not only overcome the limitations of a single machine but also build a more resilient and robust infrastructure capable of serving millions of users seamlessly.
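In production you'd normally rely on a managed load balancer such as AWS ELB (or a reverse proxy like NGINX), but the core idea is simple enough to sketch in Node.js. The sketch below assumes two copies of our app are already running on ports 3001 and 3002 (hypothetical ports) and forwards incoming requests to them round-robin:

import http from 'http';

// Hypothetical backend instances (in reality, separate machines)
const backends = [
  { host: 'localhost', port: 3001 },
  { host: 'localhost', port: 3002 },
];
let next = 0;

http.createServer((clientReq, clientRes) => {
  // Round-robin: pick the next backend for each incoming request
  const target = backends[next];
  next = (next + 1) % backends.length;

  const proxyReq = http.request(
    {
      host: target.host,
      port: target.port,
      path: clientReq.url,
      method: clientReq.method,
      headers: clientReq.headers
    },
    proxyRes => {
      // Relay the backend's response back to the original client
      clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
      proxyRes.pipe(clientRes);
    }
  );

  proxyReq.on('error', () => {
    clientRes.writeHead(502);
    clientRes.end('Bad gateway');
  });

  clientReq.pipe(proxyReq);
}).listen(8080, () => console.log('Load balancer listening on port 8080'));

A managed load balancer adds health checks, TLS termination, and connection draining on top of this basic round-robin idea.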
Capacity Estimation and Real-World Examples
Even with all these scaling strategies in place, it's important to estimate how much load a single instance can handle before you scale further. Here's how you can approach capacity estimation:
Expected Requests per Second:
Determine your baseline load by understanding how many requests each instance can handle under normal conditions.
Performance Metrics:
Continuously monitor CPU usage, memory consumption, and response times. These metrics help you understand the current load and identify bottlenecks.
Traffic Spikes:
Plan for unexpected surges in traffic. Understand the peak loads your application might experience, and ensure you have strategies in place to manage these spikes.
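As a quick back-of-the-envelope example (all of these numbers are hypothetical): suppose load testing shows one instance comfortably serves about 2,000 requests per second, and you expect a peak of 100,000 requests per second. You can estimate the fleet size like this:

// Hypothetical numbers, purely for illustration
const peakRequestsPerSecond = 100000;  // expected peak traffic
const requestsPerInstance = 2000;      // measured via load testing
const headroom = 1.3;                  // 30% buffer for unexpected spikes

const instancesNeeded = Math.ceil((peakRequestsPerSecond / requestsPerInstance) * headroom);
console.log(`Instances needed at peak: ${instancesNeeded}`); // 65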
Real-World Examples
PayTM:
This payment platform handles huge traffic surges during sales and promotional events by effectively balancing load across multiple servers.
A Chess App:
Designed to manage real-time game processing, the app efficiently balances the demands of multiple concurrent players through careful scaling and load distribution.
Reference: Load Testing and Capacity Planning
Handling Unpredictable Traffic in Live Streaming
For live streaming platforms like Hotstar, traffic can be highly dynamic. Here's how you can handle such unpredictability:
Before the Match:
Maintain a baseline number of instances to handle regular traffic.
During the Match:
Traffic may spike suddenly—for example, during a thrilling final over in a cricket match. Auto scaling policies kick in to add more instances and manage the surge.
After the Match:
Once the peak is over, traffic drops quickly, and the system scales down to reduce costs.
Proactive Scaling Strategies
Historical Data Analysis:
Analyze past events to predict baseline and peak traffic. This helps in preemptively scaling your resources.
Real-Time Monitoring and Auto Scaling:
Use AWS Auto Scaling or similar tools to automatically adjust the number of instances based on current load. This ensures that your service remains responsive during traffic surges.
Preemptive Scaling:
In critical moments, engineers might manually scale up the system using pre-configured AMIs to mitigate the lag caused by booting new instances.
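AWS Auto Scaling evaluates metrics like average CPU utilization and adjusts the instance count for you, but the decision it automates looks roughly like the simplified sketch below (the thresholds and scaling steps here are hypothetical, not AWS defaults):

// Simplified sketch of a scaling decision based on average CPU utilization
function desiredInstanceCount(current, avgCpuPercent) {
  if (avgCpuPercent > 70) {
    return current + Math.ceil(current * 0.5); // scale out by 50% during a surge
  }
  if (avgCpuPercent < 30 && current > 2) {
    return current - 1; // scale in gradually once traffic drops
  }
  return current; // within the target band: do nothing
}

console.log(desiredInstanceCount(10, 85)); // 15 -> add instances during the final over
console.log(desiredInstanceCount(15, 20)); // 14 -> trim capacity after the match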
By carefully estimating capacity and planning for unpredictable traffic, you can build a robust, scalable Node.js application that performs reliably under any load. This comprehensive approach—combining vertical scaling, parallel processing with worker threads, clusters for multi-core utilization, and horizontal scaling across multiple machines—ensures your app is ready for millions of users, even during the most demanding live events.
As we wrap up this exploration into scaling Node.js applications—from vertical scaling with worker threads and clusters to horizontal scaling across multiple machines—there's a palpable sense of excitement about the endless possibilities ahead. With these strategies in your toolkit, you can build resilient, high-performance systems capable of handling millions of users, even during the most demanding live events.
Have you ever encountered a scenario where scaling your application made all the difference? I'd love to hear your stories and insights!
"Great things never come from comfort zones."
Keep pushing boundaries, and happy coding!