
Pujan Srivastava

Stop Crashing Node.js: How to Process 10GB Files with 15MB of RAM

We've all been there. You write a simple script to process a JSON or CSV file. It works perfectly on your machine with a 100KB test file. Then you deploy it to production, a 2GB file hits the server, and BAM: `FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory`.

Node.js is incredibly fast, but its default "load-everything-into-memory" approach is a ticking time bomb for ETL (Extract, Transform, Load) tasks.

Today, I’m introducing Data-Genie 🧞‍♂️ - a streaming-first ETL engine for TypeScript designed to make massive data processing boringly stable.

The Problem: The "Array.map()" Trap

Most developers process data like this:

import fs from 'node:fs';

const data = JSON.parse(fs.readFileSync('huge-file.json', 'utf8')); // ❌ entire file buffered in memory
const processed = data.map(record => transform(record));            // ❌ memory roughly doubles here
fs.writeFileSync('output.json', JSON.stringify(processed));         // ❌ output serialized in memory too

This approach is fine for small files, but memory usage grows linearly with file size. If your file is 1GB, you need at least 2GB of RAM just to hold the input and the transformed output at the same time.

The Solution: Constant Memory (O(1))

Data-Genie treats data as a continuous stream. Instead of loading an array, it uses Async Iterators to pull one record at a time, transform it, and push it to the destination.

The result? You can process a 100GB file using the same amount of RAM as a 100KB file.
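To make the idea concrete, here is a minimal sketch of the async-iterator pattern a streaming engine like this builds on (the names `transformStream` and `demo` are illustrative, not Data-Genie's API): records flow through one at a time, so memory stays constant no matter how large the source is.

```typescript
// Illustrative sketch: pull one record at a time, transform it, yield it on.
// Only a single record is ever held in memory.
async function* transformStream<T, U>(
  source: AsyncIterable<T> | Iterable<T>,
  transform: (record: T) => U
): AsyncGenerator<U> {
  for await (const record of source) {
    yield transform(record); // one record in, one record out
  }
}

// Usage with a tiny in-memory source; a real job would read from a file stream.
async function demo(): Promise<string[]> {
  const source = [{ name: 'ada' }, { name: 'grace' }];
  const out: string[] = [];
  for await (const rec of transformStream(source, r => r.name.toUpperCase())) {
    out.push(rec);
  }
  return out;
}
```

Because the generator never materializes the whole dataset, swapping the in-memory array for a 100GB file stream changes nothing about the memory profile.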

| Data Size | Naive Approach (Array-based) | Data-Genie (Streaming) |
|-----------|------------------------------|------------------------|
| 100 KB    | ~10 MB RAM                   | ~10 MB RAM             |
| 100 MB    | ~150 MB RAM                  | ~12 MB RAM             |
| 10 GB     | CRASH (OOM)                  | ~15 MB RAM             |

What makes Data-Genie different?

Multi-Format, One Syntax

Whether your data is in CSV, JSON, Excel, Parquet, or a SQL database, the code looks exactly the same.

import { CSVReader, SQLWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('input.csv');
const writer = new SQLWriter(db, 'users');

await Job.run(reader, writer);

Built-in Resilience (Dead Letter Queues)

In the real world, data is "dirty." Too often, a single malformed row crashes your entire two-hour job. Data-Genie includes built-in Dead Letter Queues (DLQ).

If a record fails validation, it's automatically diverted to a "poison" file for you to inspect later, while the main job keeps running.
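The routing logic can be sketched in a few lines. This is a hedged illustration of the DLQ idea, not Data-Genie's internals; `withDLQ`, `Result`, and the array standing in for the poison file are all hypothetical:

```typescript
// Records that fail validation are diverted to the DLQ; healthy records
// continue downstream and the job never aborts.
type Result<T> = { ok: true; value: T } | { ok: false; reason: string };

async function* withDLQ<T>(
  source: Iterable<T>,
  validate: (record: T) => Result<T>,
  dlq: T[] // stand-in for a "poison" file writer
): AsyncGenerator<T> {
  for (const record of source) {
    const res = validate(record);
    if (res.ok) {
      yield res.value; // pass downstream
    } else {
      dlq.push(record); // park for later inspection, keep running
    }
  }
}

// Usage: one bad email is parked, the good row flows through.
async function demoDLQ() {
  const dlq: Array<{ email: string }> = [];
  const rows = [{ email: 'a@b.com' }, { email: 'not-an-email' }];
  const validate = (r: { email: string }): Result<{ email: string }> =>
    r.email.includes('@') ? { ok: true, value: r } : { ok: false, reason: 'bad email' };
  const good: string[] = [];
  for await (const r of withDLQ(rows, validate, dlq)) good.push(r.email);
  return { good, dlq };
}
```

The key design point is that failure handling lives inside the stream itself, so a bad record costs one diversion rather than a restart.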

Type-Safe Transformations with Zod

We’ve integrated Zod so you can validate and cast your data types as they stream through the pipe.

import { z } from 'zod';

const schema = z.object({
  id: z.coerce.number(),
  email: z.string().email()
});

const validator = new SchemaValidatingReader(reader, schema)
  .setDLQ(new JsonWriter('failed_rows.json'));

Real-time Observability

The latest update turns the Job class into an EventEmitter. This means you can build real-time progress bars or dashboards for your users without polling.

const job = new Job(reader, writer);

job.on('progress', (metrics) => {
   console.log(`Processed ${metrics.recordCount} records...`);
});

await job.run();

Quick Start: CSV to JSON in 30 Seconds

Getting started is as simple as installing the package:

npm install @pujansrt/data-genie

And running a job:

import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
const writer = new JsonWriter('output.json');

(async () => {
    const metrics = await Job.run(reader, writer);
    console.log(`Processed ${metrics.recordCount} records in ${metrics.durationMs}ms`);
})();

Wrapping Up

Data processing shouldn't be a gamble with your server's memory. By switching to a streaming-first architecture, you build systems that are faster, more resilient, and significantly cheaper to run in the cloud.

Check out the project on GitHub: https://github.com/pujansrt/data-genie

Full Documentation: https://pujansrt.github.io/data-genie/

I'd love to hear your feedback or see your Pull Requests!
