<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: George Ferreira</title>
    <description>The latest articles on DEV Community by George Ferreira (@george_ferreira).</description>
    <link>https://dev.to/george_ferreira</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1744133%2F8a7a9c6d-9c78-448c-8944-5626f0ec256a.png</url>
      <title>DEV Community: George Ferreira</title>
      <link>https://dev.to/george_ferreira</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/george_ferreira"/>
    <language>en</language>
    <item>
      <title>Using Streams in Node.js: Efficiency in Data Processing and Practical Applications</title>
      <dc:creator>George Ferreira</dc:creator>
      <pubDate>Mon, 08 Jul 2024 22:34:48 +0000</pubDate>
      <link>https://dev.to/george_ferreira/using-streams-in-nodejs-efficiency-in-data-processing-and-practical-applications-2jig</link>
      <guid>https://dev.to/george_ferreira/using-streams-in-nodejs-efficiency-in-data-processing-and-practical-applications-2jig</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;We've all heard about the power of streams in Node.js and how they excel at processing large amounts of data with high performance and minimal memory usage, almost magically. If you haven't, here's a brief description of what streams are.&lt;/p&gt;

&lt;p&gt;Node.js ships with a built-in module called &lt;code&gt;node:stream&lt;/code&gt;. This module defines, among other things, three classes: &lt;code&gt;Readable&lt;/code&gt;, &lt;code&gt;Writable&lt;/code&gt;, and &lt;code&gt;Transform&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Readable&lt;/strong&gt;: Reads data from a source and exposes it through events and an internal buffer. It can "dispatch" the data it reads to a Writable or Transform instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writable&lt;/strong&gt;: Receives data, typically from a Readable (or Transform) instance, and writes it to a destination. This destination could be a file, another stream, or a TCP connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Can do everything Readable and Writable can do, and additionally modify the data while it is in transit (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can chain streams together to process large amounts of data because each one operates on a small portion at a time, thus using minimal resources.&lt;/p&gt;
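
&lt;p&gt;To make this concrete, here is a minimal sketch (with illustrative names, separate from the example later in this article) that pipes a Readable through a Transform that uppercases the data into a Writable that prints it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Readable, Writable, Transform } from "node:stream";

// Readable: emits a few chunks of text
const source = Readable.from(["streams", "in", "node.js"]);

// Transform: modifies the data while it is in transit
const toUpperCase = new Transform({
    transform(chunk, encoding, callback) {
        callback(null, chunk.toString().toUpperCase() + "\n");
    }
});

// Writable: writes each chunk to its destination (here, stdout)
const sink = new Writable({
    write(chunk, encoding, callback) {
        process.stdout.write(chunk);
        callback();
    }
});

source.pipe(toUpperCase).pipe(sink);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;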

&lt;h2&gt;Streams in Practice&lt;/h2&gt;

&lt;p&gt;Now that we have covered the theory, it's time to look at some real use cases where streams can make a difference. The best scenarios are those where the data can be split into discrete portions: a line from a file, a tuple from a database, an object from an S3 bucket, a pixel from an image, and so on.&lt;/p&gt;

&lt;h3&gt;Generating Large Data Sets&lt;/h3&gt;

&lt;p&gt;There are situations where we need to generate large amounts of data, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Populating a database with fictional information for testing or presentation purposes.&lt;/li&gt;
&lt;li&gt;Generating input data to perform stress tests on a system.&lt;/li&gt;
&lt;li&gt;Validating the performance of indexes in relational databases.&lt;/li&gt;
&lt;li&gt;Finally using those two 2TB HDDs we bought to set up RAID but never used (just kidding, but seriously).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, we will generate a file with 1 billion clients to perform tests on a fictional company's database: "Meu Prego Pago" (My Paid Nail). Each client from "Meu Prego Pago" will have the following attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ID&lt;/li&gt;
&lt;li&gt;Name&lt;/li&gt;
&lt;li&gt;Registration date&lt;/li&gt;
&lt;li&gt;Login&lt;/li&gt;
&lt;li&gt;Password&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main challenge of generating a file with a large volume of data is to do so without consuming all available RAM. We cannot keep this entire file in memory.&lt;/p&gt;

&lt;p&gt;First, we'll create a Readable stream to generate the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { faker } from '@faker-js/faker';
import { Stream } from "node:stream"

// Creates a Readable stream that produces `amountOfClients` fake clients,
// one JSON document per line.
export function generateClients(amountOfClients) {

    let numOfGeneratedClients = 0;

    const generatorStream = new Stream.Readable({
        // read() is called whenever the stream is ready to produce more data
        read: function () {
            const person = {
                id: faker.string.uuid(),
                nome: faker.person.fullName(),
                dataCadastro: faker.date.past({ years: 3 }),
                login: faker.internet.userName(),
                senha: faker.internet.password()
            };

            if (numOfGeneratedClients &amp;gt;= amountOfClients) {
                // no more data: signal the end of the stream
                this.push(null);
            } else {
                // emit one client as a UTF-8 encoded JSON line
                this.push(Buffer.from(JSON.stringify(person) + '\n', 'utf-8'));
                numOfGeneratedClients++;
            }
        }
    })

    return generatorStream;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;generateClients&lt;/code&gt; function defines a stream and returns it. The most important part of this function is that it implements the &lt;code&gt;read&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;read&lt;/code&gt; method controls how the stream produces data: each time it is called, it pushes the next chunk with &lt;code&gt;this.push&lt;/code&gt;. When there is no more data to emit, it invokes &lt;code&gt;this.push(null)&lt;/code&gt; to signal the end of the stream.&lt;/p&gt;

&lt;p&gt;We also use the library &lt;code&gt;'@faker-js/faker'&lt;/code&gt; here to generate fictional client data.&lt;/p&gt;

&lt;p&gt;Node.js has numerous implementations of the stream classes. One of them is &lt;code&gt;fs.createWriteStream&lt;/code&gt;, which creates a Writable stream that writes to a file (as you may have guessed by the name).&lt;/p&gt;

&lt;p&gt;We will use this stream to save all clients generated by generateClients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import fs from "node:fs"

import {generateClients} from "./generate-clients.js"

const ONE_BILLION = Math.pow(10, 9);

// output file
const outputFile = "./data/clients.csv"

// get the clients stream
const clients = generateClients(ONE_BILLION);

// erase the file (if it exists)
fs.writeFileSync(outputFile, '', { flag: 'w' })

// add new clients to the file
const writer = fs.createWriteStream(outputFile, { flags: 'a' });

clients.pipe(writer);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The "pipe" Method&lt;/h2&gt;

&lt;p&gt;We can see that to connect the Readable stream and the Writable stream, we use the &lt;code&gt;pipe&lt;/code&gt; method. This method synchronizes the transfer of data between the read and write streams: it ensures that a slow writer isn't overwhelmed by a very fast reader, avoiding excessive memory being allocated to buffer the data in transit. There are more implementation details than this, but that's a topic for another time.&lt;/p&gt;
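
&lt;p&gt;Roughly, that synchronization (known as backpressure) works like the simplified sketch below; the names are illustrative and this is not the actual implementation of &lt;code&gt;pipe&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function naivePipe(readable, writable) {
    readable.on('data', function (chunk) {
        // write() returns false when the writer's internal buffer is full
        const canWriteMore = writable.write(chunk);
        if (!canWriteMore) {
            // stop reading until the writer has drained its buffer
            readable.pause();
            writable.once('drain', function () {
                readable.resume();
            });
        }
    });
    readable.on('end', function () {
        writable.end();
    });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In modern Node.js, &lt;code&gt;stream.pipeline&lt;/code&gt; (also available in &lt;code&gt;node:stream/promises&lt;/code&gt;) is often preferred over &lt;code&gt;pipe&lt;/code&gt; because it additionally forwards errors and cleans up both streams when one of them fails.&lt;/p&gt;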

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;Here we can see how this process consumes memory while generating the file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrfq3zxc2mbhtqa1hhzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrfq3zxc2mbhtqa1hhzh.png" alt="Image showing the memory usage of the node application generating the 1 billion lines file. It is a print screen from the HTOP Unix application" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown, the process consistently consumes approximately 106MB of RAM. We can tune this memory consumption by passing extra options to the streams when we create them, or by implementing our own streams.&lt;/p&gt;
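
&lt;p&gt;For example, both our custom Readable and &lt;code&gt;fs.createWriteStream&lt;/code&gt; accept a &lt;code&gt;highWaterMark&lt;/code&gt; option (in bytes) that limits how much data may sit in their internal buffers before backpressure kicks in. The values below are only illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import fs from "node:fs"
import { Stream } from "node:stream"

// Smaller buffers mean lower memory usage, at the cost of more frequent pauses.
const generatorStream = new Stream.Readable({
    highWaterMark: 16 * 1024, // keep at most ~16KB buffered before read() stops being called
    read: function () {
        // ... same generation logic as before ...
    }
});

const writer = fs.createWriteStream("./data/clients.csv", {
    flags: 'a',
    highWaterMark: 64 * 1024 // hold at most ~64KB of pending writes
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;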

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We can use Node.js to handle large amounts of data. Even when creating files with gigabytes of data and a billion lines, we use only a small amount of memory.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>performance</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
